We introduce BPG, a framework for generating publication-quality, highly customizable plots in the R statistical environment.
This open-source package includes multiple methods of displaying high-dimensional datasets and facilitates generation of multi-panel figures, making it suitable for complex datasets. A web-based interactive tool allows online figure customization, from which R code can be downloaded for integration with computational pipelines.
BPG provides a new approach for linking interactive and scripted data visualization and is available at http://labs.oicr.on.ca/boutros-lab/software/bpg or via CRAN at https://cran.r-project.org/web/packages/BoutrosLab.plotting.general.
Federated learning is the current state of the art in supporting secure multi-party machine learning (ML): data is maintained on the owner's device and the updates to the model are aggregated through a secure protocol. However, this process assumes a trusted centralized infrastructure for coordination, and clients must trust that the central service does not use the byproducts of client data. In addition, a group of malicious clients could harm the performance of the model by carrying out a poisoning attack. In response, we propose Biscotti: a fully decentralized peer-to-peer (P2P) approach to multi-party ML, which uses blockchain and cryptographic primitives to coordinate a privacy-preserving ML process between peering clients. Our evaluation demonstrates that Biscotti is scalable, fault tolerant, and defends against known attacks. For example, Biscotti is able to both protect the privacy of an individual client's update and maintain the performance of the global model at scale when 30 percent of the clients in the system are adversaries.
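The aggregation step that federated learning relies on can be illustrated with a minimal sketch. It assumes a one-parameter linear model y ≈ w·x and plain averaging of client updates. The function names (`local_update`, `aggregate`, `train`) are hypothetical, and the sketch deliberately omits the secure aggregation protocol, the blockchain, and the cryptographic primitives that Biscotti adds; it is not Biscotti's implementation.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one client


def local_update(w: float, data: Dataset, lr: float = 0.1) -> float:
    """Compute one gradient step on a client's own data; only the update leaves the device."""
    # Gradient of the mean squared error for the model y ≈ w * x.
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    return -lr * grad


def aggregate(updates: List[float]) -> float:
    """Average the client updates (plain averaging, an assumption of this sketch)."""
    return sum(updates) / len(updates)


def train(client_datasets: List[Dataset], rounds: int = 50, lr: float = 0.1) -> float:
    """Run several rounds in which every client contributes an update to a shared model."""
    w = 0.0
    for _ in range(rounds):
        updates = [local_update(w, data, lr) for data in client_datasets]
        w += aggregate(updates)
    return w


# Three clients whose private data all follow y = 3x.
clients = [[(1.0, 3.0), (2.0, 6.0)], [(1.0, 3.0)], [(0.5, 1.5), (1.5, 4.5)]]
w_final = train(clients)
```

In this sketch a single malicious client could shift the average by submitting an arbitrary update, which is the kind of poisoning risk the abstract describes.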
A bedr way of genomic interval processing
Haider, Syed; Waggott, Daryl; Lalonde, Emilie ...
Source Code for Biology and Medicine, 12/2016, Volume 11, Issue 1
Journal Article, Peer reviewed, Open access
Next-generation sequencing is making it critical to robustly and rapidly handle genomic ranges within standard pipelines. Standard use-cases include annotating sequence ranges with gene or other genomic annotation, merging multiple experiments together and subsequently quantifying and visualizing the overlap. The most widely used tools for these tasks work at the command line (e.g. BEDTools), and the small number of available R packages are either slow or have semantics and features that differ from those of the command-line interfaces.
To provide a robust R-based interface to standard command-line tools for genomic coordinate manipulation, we created bedr. This open-source R package can use either BEDTools or BEDOPS as a back-end and performs data-manipulation extremely quickly, creating R data structures that can be readily interfaced with existing computational pipelines. It includes data-visualization capabilities and a number of data-access functions that interface with standard databases like UCSC and COSMIC.
The bedr package provides an open-source solution for genomic interval data manipulation and restructuring in the R programming language, which is commonly used in bioinformatics, and should therefore be useful to bioinformaticians and genomic researchers.
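The two use-cases named above, merging interval sets and quantifying their overlap, can be sketched in a few lines. This is a Python illustration of the underlying operations only, not bedr's R API or the BEDTools/BEDOPS implementation; the function names are hypothetical, and intervals are assumed to be half-open `(chromosome, start, end)` tuples as in BED files.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open as in BED


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or book-ended intervals on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def overlap_bp(a: List[Interval], b: List[Interval]) -> int:
    """Count the bases shared by two interval sets (quadratic scan, for clarity only)."""
    total = 0
    for chrom_a, start_a, end_a in merge_intervals(a):
        for chrom_b, start_b, end_b in merge_intervals(b):
            if chrom_a == chrom_b:
                total += max(0, min(end_a, end_b) - max(start_a, start_b))
    return total


experiment_1 = [("chr1", 10, 20), ("chr1", 15, 30), ("chr2", 5, 8)]
experiment_2 = [("chr1", 25, 40), ("chr2", 0, 6)]
merged_1 = merge_intervals(experiment_1)
shared = overlap_bp(experiment_1, experiment_2)
```

Production tools use sorted sweeps or interval trees rather than the nested loop shown here, which is one reason dedicated back-ends are much faster on genome-scale data.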
Abstract
Introduction:
Large-scale interrogation of the genome has emerged as an attractive method for identifying useful characteristics of cancer biology; in particular, the study of copy number aberrations (CNA) has recently received tremendous attention. A number of different technologies have been developed to assess the copy-number landscape, allowing us to better understand the role of CNA in cancer cells. The OncoScan CNA platform (Affymetrix Inc.) has been particularly appealing for oncology due to its ability to work well with formalin-fixed, paraffin-embedded (FFPE) materials, which are the primary form of storage for clinical samples. In addition, its high resolution, rapid analysis time and ability to interrogate different genomic characteristics (CNA, loss of heterozygosity or mutation) make the OncoScan platform highly popular: it has been widely cited in the literature for use in biomarker discovery, clonal evolution and sub-clonal detection, as well as population-based analyses. While CNAs identified by the OncoScan platform have shown good concordance with fluorescence in-situ hybridization (FISH) results, to date, no studies have been conducted to thoroughly assess the reproducibility of the assay. In this study, we have assessed the reproducibility of the OncoScan platform using identical samples run in replicate across multiple chip batches. Moreover, we have assessed the effect on reproducibility of DNA treatment, including elution in water or TE buffer, as well as of the use of varying amounts of DNA.
Methods:
Affymetrix OncoScan FFPE Express 3.0 SNP Arrays were run using the optimal input DNA amount recommended by the manufacturer as well as lower input amounts for comparison. CNAs were called using BioDiscovery Nexus Copy Number™ software (http://www.biodiscovery.com/software/nexus-copy-number/) using the SNP-FASST2 algorithm with modified parameters (significance threshold of 1 × 10⁻⁹ and minimum number of probes per segment of 10).
Results:
Initial reproducibility analysis involving 12 samples repeated either 2, 4 or 6 times, both within a single batch and across different batches, has revealed that CNA calls were concordant between replicates for the majority of the genome (ranging from 81% to 100%), suggesting high precision of the assay. In addition, we are in the process of assessing and comparing mutation calls across replicates to gain a more in-depth understanding of the platform.
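One simple way to express concordance between two replicates is the fraction of genomic bins that receive the same copy-number call. The sketch below assumes calls have already been discretized per bin as -1 (loss), 0 (neutral) or 1 (gain); the abstract does not state how concordance was actually computed, so this metric, the encoding, and the function name are assumptions for illustration only.

```python
from typing import Sequence


def concordance(rep_a: Sequence[int], rep_b: Sequence[int]) -> float:
    """Fraction of genomic bins where two replicates carry the same copy-number call."""
    if len(rep_a) != len(rep_b):
        raise ValueError("replicates must cover the same bins")
    if not rep_a:
        raise ValueError("at least one bin is required")
    agree = sum(1 for call_a, call_b in zip(rep_a, rep_b) if call_a == call_b)
    return agree / len(rep_a)


# Hypothetical calls for ten bins: -1 = loss, 0 = neutral, 1 = gain.
replicate_1 = [0, 0, 1, 1, -1, 0, 0, 0, 0, 0]
replicate_2 = [0, 0, 1, 0, -1, 0, 0, 0, 0, 0]
score = concordance(replicate_1, replicate_2)
```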
Conclusion:
This is the first study examining the reproducibility of OncoScan FFPE assays; initial results have suggested that the assay is precise and has the potential for robust biomarker discovery. Additional characterizations would be interesting for evaluating its use as a clinical tool in the long term.
Citation Format: Cindy Q Yao, Cheryl Crozier, Mary Anne Quintayo, Jane Bayani, Melanie Spears, Julie Livingstone, Esther Jung, Clement Fung, Victoria Sabine, Paul C Boutros, John MS Bartlett. Assessing reproducibility of copy number arrays to assist breast cancer biomarker discovery [abstract]. In: Proceedings of the Thirty-Seventh Annual CTRC-AACR San Antonio Breast Cancer Symposium: 2014 Dec 9-13; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2015;75(9 Suppl):Abstract nr P2-03-17.
Rapid machine learning (ML) adoption across a range of industries has prompted numerous concerns. These range from privacy (how is my data being used?) to fairness (is this model's result representative?) and provenance (who is using my data and how can I restrict this usage?).
Now that ML is widely used, we believe it is time to rethink security, privacy, and incentives in the ML pipeline by re-considering control. We consider distributed multi-party ML proposals and identify their shortcomings. We then propose brokered learning, which distinguishes the curator (who determines the training set-up) from the broker coordinator (who runs the training process). We consider the implications of this setup and present evaluation results from implementing and deploying TorMentor, an example of a brokered learning system that implements the first distributed ML training system with anonymity guarantees.
Client-side cross-site scripting (DOM XSS) vulnerabilities in web applications are common, hard to identify, and difficult to prevent. Taint tracking is the most promising approach for detecting DOM XSS with high precision and recall, but is too computationally expensive for many practical uses.
We investigate whether machine learning (ML) classifiers can replace or augment taint tracking when detecting DOM XSS vulnerabilities. Through a large-scale web crawl, we collect over 18 billion JavaScript functions and use taint tracking to label over 180,000 functions as potentially vulnerable. With this data, we train a deep neural network (DNN) to analyze a JavaScript function and predict whether it is vulnerable to DOM XSS. We experiment with a range of hyperparameters and present a low-latency, high-recall classifier that could serve as a pre-filter to taint tracking, reducing the cost of stand-alone taint tracking by 3.43× while detecting 94.5% of unique vulnerabilities. We argue that this combination of a DNN and taint tracking is efficient enough for a range of use cases for which taint tracking by itself is not, including in-browser run-time DOM XSS detection and analyzing large codebases.
Abstract
Prostate cancer (CaP) remains the most common male malignancy worldwide, leading to over 300,000 deaths per year. In Western countries, most prostate tumours are diagnosed while they are confined to the prostate and have relatively indolent histology, as assessed by the Gleason Score (GS). CaP is a C-class tumour, characterized by a large number of driver copy-number aberrations and genomic rearrangements. Therefore, while previous sequencing studies have focused largely on the coding regions of late-stage disease, herein we comprehensively characterized the copy-number profiles of 250 localized prostate cancers and analyzed the whole genomes of 124 matched tumour/normal pairs derived from patients with GS6 and GS7 prostate cancer. Using this – the largest whole-genome sequencing dataset of prostate cancer to date – we confirm the C-class character of the disease and identify strong genomic subtypes that stretch across multiple types of somatic alteration, including SNVs, CNAs and genomic rearrangements. We provide the first assessments of localized hyper-mutation phenomena (chromothripsis and kataegis) in prostate cancer, and identify specific genes driving higher levels of these hyper-mutations. We identify unexpected biases in the location and role of both non-coding SNVs and genomic rearrangements, including clear association with epigenetic processes, and with genome-wide profiling of methylation in 92 samples. Finally, we demonstrate a stark paucity of clinically-actionable mutations in localized GS6 and GS7 disease, even lacking those common in high-risk localized disease, indicating that novel therapeutic development against the recurrent targets identified here will be key to allowing less-aggressive, targeted treatment of early-stage disease.
Citation Format: Michael E. Fraser, Veronica Y. Sabelnykova, Takafumi N. Yamaguchi, Alice Meng, Lawrence E. Heisler, Junyan Zhang, Julie Livingstone, Vincent Huang, Andre P. Masella, Fouad Yousif, Michael Xie, Nicholas J. Harding, Xihui Lin, Haiying Kong, Stephenie D. Prokopec, Alejandro Berlin, Dominique Trudel, Xuemei Luo, Timothy E. Beck, Richard de Borja, Alister D'Costa, Robert E. Denroche, Natalie S. Fox, Emilie Lalonde, Ada Wong, Taryne Chong, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Michele Orain, Valerie Picard, Helene Hovington, Kenneth C. Chu, Christine P'ng, Bryan Lo, Francis Nguyen, Kathleen E. Houlahan, Christopher Cooper, Shaylan K. Govind, Clement Fung, Louis Lacombe, Colin C. Collins, Yves Fradet, Bernard Tetu, Theodorus van der Kwast, John McPherson, Thomas J. Hudson, Rob G. Bristow, Paul Boutros. The mutational landscape of localized Gleason 6 and 7 prostate cancer [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 2966. doi:10.1158/1538-7445.AM2015-2966
Anomaly detection requires detecting abnormal samples in large unlabeled datasets. While progress in deep learning and the advent of foundation models has produced powerful zero-shot anomaly detection methods, their deployment in practice is often hindered by the lack of labeled data: without it, their detection performance cannot be evaluated reliably. In this work, we propose SWSA (Selection With Synthetic Anomalies): a general-purpose framework to select image-based anomaly detectors with a generated synthetic validation set. Our proposed anomaly generation method assumes access to only a small support set of normal images and requires no training or fine-tuning. Once generated, our synthetic validation set is used to create detection tasks that compose a validation framework for model selection. In an empirical study, we find that SWSA often selects models that match selections made with a ground-truth validation set, resulting in higher AUROCs than baseline methods. We also find that SWSA selects prompts for CLIP-based anomaly detection that outperform baseline prompt selection strategies on all datasets, including the challenging MVTec-AD and VisA datasets.
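The selection step described above can be sketched as follows: score each candidate detector on a validation set of normal and (synthetic) anomalous samples, compute its AUROC, and keep the detector with the highest value. This is a minimal illustration, assuming each detector has already produced anomaly scores; the function names, detector names, and scores are hypothetical, and the sketch does not cover how SWSA generates its synthetic anomalies.

```python
from typing import Dict, List, Tuple


def auroc(normal_scores: List[float], anomaly_scores: List[float]) -> float:
    """AUROC as the probability that an anomaly scores higher than a normal sample."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(anomaly_scores) * len(normal_scores))


def select_detector(candidates: Dict[str, Tuple[List[float], List[float]]]) -> str:
    """Return the name of the candidate with the highest validation AUROC."""
    return max(candidates, key=lambda name: auroc(*candidates[name]))


# Hypothetical anomaly scores: (scores on normal images, scores on synthetic anomalies).
candidates = {
    "detector_a": ([0.1, 0.2, 0.3], [0.4, 0.5, 0.25]),
    "detector_b": ([0.1, 0.6, 0.3], [0.2, 0.5, 0.7]),
}
best = select_detector(candidates)
```

The pairwise comparison is quadratic in the number of samples; for large validation sets a rank-based computation gives the same value more efficiently.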
Distributed machine learning (ML) systems today use an unsophisticated threat model: data sources must trust a central ML process. We propose a brokered learning abstraction that allows data sources to contribute towards a globally-shared model with provable privacy guarantees in an untrusted setting. We realize this abstraction by building on federated learning, the state of the art in multi-party ML, to construct TorMentor: an anonymous hidden service that supports private multi-party ML. We define a new threat model by characterizing, developing and evaluating new attacks in the brokered learning setting, along with new defenses for these attacks. We show that TorMentor effectively protects data providers against known ML attacks while providing them with a tunable trade-off between model accuracy and privacy. We evaluate TorMentor with local and geo-distributed deployments on Azure/Tor. In an experiment with 200 clients and 14 MB of data per client, our prototype trained a logistic regression model using stochastic gradient descent in 65s. Code is available at: https://github.com/DistributedML/TorML