Current methods for comparing single-cell RNA sequencing datasets collected in multiple conditions focus on discrete regions of the transcriptional state space, such as clusters of cells. Here we ...quantify the effects of perturbations at the single-cell level using a continuous measure of the effect of a perturbation across the transcriptomic space. We describe this space as a manifold and develop a relative likelihood estimate of observing each cell in each of the experimental conditions using graph signal processing. This likelihood estimate can be used to identify cell populations specifically affected by a perturbation. We also develop vertex frequency clustering to extract populations of affected cells at the level of granularity that matches the perturbation response. The accuracy of our algorithm at identifying clusters of cells that are enriched or depleted in each condition is, on average, 57% higher than the next-best-performing algorithm tested. Gene signatures derived from these clusters are more accurate than those of six alternative algorithms in ground truth comparisons.
The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method ...that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data.
Oxford Nanopore Technologies' (ONT) MinION and PromethION long-read sequencing technologies are emerging as genuine alternatives to established Next-Generation Sequencing technologies. A combination ...of the highly redundant file format and a rapid increase in data generation have created a significant problem both for immediate data storage on MinION-capable laptops, and for long-term storage on lab data servers.
We developed Picopore, a software suite offering three methods of compression. Picopore's lossless and deep lossless methods provide a 25% and 44% average reduction in size, respectively, without removing any data from the files. Picopore's raw method provides an 88% average reduction in size, while retaining biologically relevant data for the end-user. All methods have the capacity to run in real-time in parallel to a sequencing run, reducing demand for both immediate and long-term storage space.
Oxford Nanopore Technologies' (ONT's) MinION and PromethION long-read sequencing technologies are emerging as genuine alternatives to established Next-Generation Sequencing technologies. A ...combination of the highly redundant file format and a rapid increase in data generation have created a significant problem both for immediate data storage on MinION-capable laptops, and for long-term storage on lab data servers. We developed Picopore, a software suite offering three methods of compression. Picopore's lossless and deep lossless methods provide a 25% and 44% average reduction in size, respectively, without removing any data from the files. Picopore's raw method provides an 88% average reduction in size, while retaining biologically relevant data for the end-user. All methods have the capacity to run in real-time in parallel to a sequencing run, reducing demand for both immediate and long-term storage space.
Oxford Nanopore Technologies' (ONT's) MinION and PromethION long-read sequencing technologies are emerging as genuine alternatives to established Next-Generation Sequencing technologies. A ...combination of the highly redundant file format and a rapid increase in data generation have created a significant problem both for immediate data storage on MinION-capable laptops, and for long-term storage on lab data servers. We developed Picopore, a software suite offering three methods of compression. Picopore's lossless and deep lossless methods provide a 25% and 44% average reduction in size, respectively, without removing any data from the files. Picopore's raw method provides an 88% average reduction in size, while retaining biologically relevant data for the end-user. All methods have the capacity to run in real-time in parallel to a sequencing run, reducing demand for both immediate and long-term storage space.
Due to commonalities in pathophysiology, age-related macular degeneration (AMD) represents a uniquely accessible model to investigate therapies for neurodegenerative diseases, leading us to examine ...whether pathways of disease progression are shared across neurodegenerative conditions. Here we use single-nucleus RNA sequencing to profile lesions from 11 postmortem human retinas with age-related macular degeneration and 6 control retinas with no history of retinal disease. We create a machine-learning pipeline based on recent advances in data geometry and topology and identify activated glial populations enriched in the early phase of disease. Examining single-cell data from Alzheimer's disease and progressive multiple sclerosis with our pipeline, we find a similar glial activation profile enriched in the early phase of these neurodegenerative diseases. In late-stage age-related macular degeneration, we identify a microglia-to-astrocyte signaling axis mediated by interleukin-1β which drives angiogenesis characteristic of disease pathogenesis. We validated this mechanism using in vitro and in vivo assays in mouse, identifying a possible new therapeutic target for AMD and possibly other neurodegenerative conditions. Thus, due to shared glial states, the retina provides a potential system for investigating therapeutic approaches in neurodegenerative diseases.
Abstract
Systematic variation in the methylation of cytosines at CpG sites plays a critical role in early development of humans and other mammals. Of particular interest are regions of differential ...methylation between parental alleles, as these often dictate monoallelic gene expression, resulting in parent of origin specific control of the embryonic transcriptome and subsequent development, in a phenomenon known as genomic imprinting. Using long-read nanopore sequencing we show that, with an average genomic coverage of ∼10, it is possible to determine both the level of methylation of CpG sites and the haplotype from which each read arises. The long-read property is exploited to characterize, using novel methods, both methylation and haplotype for reads that have reduced basecalling precision compared to Sanger sequencing. We validate the analysis both through comparison of nanopore-derived methylation patterns with those from Reduced Representation Bisulfite Sequencing data and through comparison with previously reported data. Our analysis successfully identifies known imprinting control regions (ICRs) as well as some novel differentially methylated regions which, due to their proximity to hitherto unknown monoallelically expressed genes, may represent new ICRs.
As the biomedical community produces datasets that are increasingly complex and high dimensional, there is a need for more sophisticated computational tools to extract biological insights. We present ...Multiscale PHATE, a method that sweeps through all levels of data granularity to learn abstracted biological features directly predictive of disease outcome. Built on a coarse-graining process called diffusion condensation, Multiscale PHATE learns a data topology that can be analyzed at coarse resolutions for high-level summarizations of data and at fine resolutions for detailed representations of subsets. We apply Multiscale PHATE to a coronavirus disease 2019 (COVID-19) dataset with 54 million cells from 168 hospitalized patients and find that patients who die show CD16
CD66b
neutrophil and IFN-γ
granzyme B
Th17 cell responses. We also show that population groupings from Multiscale PHATE directly fed into a classifier predict disease outcome more accurately than naive featurizations of the data. Multiscale PHATE is broadly generalizable to different data types, including flow cytometry, single-cell RNA sequencing (scRNA-seq), single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq), and clinical variables.
In recent years, modern technologies have enabled the collection of exponentially larger quantities of data in the biomedical domain and elsewhere. In particular, the advent of single-cell genomics ...has allowed for the collection of datasets containing hundreds of thousands of cells measured in tens of thousands of dimensions. This rapid expansion of common datasets beyond the possibility of manual annotation brings forth the need for large-scale exploratory data analysis. In this thesis, we will explore the problem of dimensionality reduction for visualization of high-dimensional datasets. Visualization of high-dimensional data is an essential task in exploratory data analysis, as the low-dimensional visualization of the data is used to understand, interrogate and present the results of many other analyses applied to the data. However, the repertoire of existing algorithms used for this task suffer from various algorithmic flaws leading to sub-optimal visualizations, including the trade-off between representing both local and global structure; the inherent sacrifices that must be made to reduce a dataset of intrinsic dimension greater than three to a form which can be interpreted by the human eye; and the computational complexity of the computations as the datasets increase in scale. Here, we use the framework provided by diffusion maps to present a new dimensionality reduction algorithm called PHATE, which seeks to address all three of these issues. In order to make the PHATE algorithm scalable, we present an approximation of the diffusion map through discrete partitions of the data called Compression-based Fast Diffusion Maps. Further, we use the insights gained from visualizing single-cell genomics data to present a manifold alignment algorithm called Harmonic Alignment, which allows for the correction of systemic differences between experiments, or the fusion of datasets collected from the same biological system using different assays. And finally, we present an extension of PHATE to longitudinal data, and demonstrate its utility for the purpose of machine learning interpretability by visualizing the hidden units of a neural network in training. While many open problems remain, the presentation of the methods herein chart a path towards a more systematic understanding of how we visualize high-dimensional data for exploratory data analysis.