The genome of the Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2), the pathogen that causes coronavirus disease 2019 (COVID-19), has been sequenced at an unprecedented scale, leading to a tremendous amount of viral genome sequencing data. To assist in tracing infection pathways and designing preventive strategies, a deep understanding of the viral genetic diversity landscape is needed. We present here a set of genomic surveillance tools from population genetics which can be used to better understand the evolution of this virus in humans. To illustrate the utility of this toolbox, we detail an in-depth analysis of the genetic diversity of SARS-CoV-2 in the first year of the COVID-19 pandemic. We analyzed 329,854 high-quality consensus sequences published in the GISAID database during the pre-vaccination phase. We demonstrate that, compared to standard phylogenetic approaches, haplotype networks can be computed efficiently on much larger datasets. This approach enables real-time lineage identification, a clear description of the relationship between variants of concern, and efficient detection of recurrent mutations. Furthermore, the time series change of Tajima's D by haplotype provides a powerful metric of lineage expansion. Finally, principal component analysis (PCA) highlights key steps in variant emergence and facilitates the visualization of genomic variation in the context of SARS-CoV-2 diversity. The computational framework presented here is simple to implement and insightful for real-time genomic surveillance of SARS-CoV-2 and could be applied to any pathogen that threatens the health of populations of humans and other organisms.
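As a rough illustration of the Tajima's D statistic mentioned above, the sketch below computes it from a small set of aligned sequences using the standard textbook formula. The function name `tajimas_d` and the choice to treat every character (including gaps or ambiguous bases) as an allele are assumptions made for this example, not details taken from the study.

```python
import math
from collections import Counter


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences (strings)."""
    n = len(seqs)
    if n < 4:
        # The variance term of the statistic is zero for n <= 3.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    seg_sites = 0      # S: number of segregating (polymorphic) sites
    pair_diffs = 0.0   # total pairwise differences summed over sites
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Differing pairs at this site = all pairs minus identical pairs.
            pair_diffs += (n * n - sum(c * c for c in counts.values())) / 2

    if seg_sites == 0:
        return float("nan")  # undefined without any polymorphism

    pi = pair_diffs / (n * (n - 1) / 2)  # mean pairwise difference
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

An excess of rare variants, which is conventionally read as a signature of recent expansion, pushes the statistic negative. Tracking the value per haplotype over successive time windows would be one way to obtain the time series described above.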
Abstract
In single-cell RNA sequencing analysis, several computational methods have been developed to map the cellular state space, but little has been done to map the gene space. A mapping that preserves gene-gene relationships within the dataset is particularly useful for characterizing cellular heterogeneity within cell types, where boundaries between cell subpopulations are often unclear or even arbitrary.
Here, we present gene signal pattern analysis, a new paradigm for analyzing single cells. We build a cell-cell graph and design a dictionary of diffusion wavelets, capturing a multiscale view of the cell space. We then transform genes by the dictionary and learn a reduced gene representation. Given the gap in prior research for this problem, we design nine alternative strategies and three benchmarks for evaluating preservation of gene-gene relationships, all of which are outperformed by diffusion wavelet-transformed signals. We also define, calculate, and evaluate localization, a key property of a gene signal on the cellular graph.
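The pipeline described above (cell-cell graph, diffusion wavelet dictionary, transformed gene signals, reduced representation) can be sketched roughly as follows. This is a minimal illustration under several assumptions: a dense Gaussian-kernel graph, dyadic wavelet scales, and plain PCA standing in for the learned reduction step. The function names and parameters are hypothetical and are not taken from the authors' software.

```python
import numpy as np


def lazy_diffusion_operator(cells, sigma=1.0):
    """Row-stochastic lazy random walk on a Gaussian-kernel cell-cell graph."""
    sq = np.sum(cells**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * cells @ cells.T, 0.0)
    kernel = np.exp(-dist2 / (2.0 * sigma**2))
    walk = kernel / kernel.sum(axis=1, keepdims=True)
    return 0.5 * (np.eye(len(cells)) + walk)


def wavelet_dictionary(T, n_scales=4):
    """Band-pass filters I-T, T-T^2, T^2-T^4, ... plus one final low-pass filter."""
    powers = [np.eye(T.shape[0]), T]
    for _ in range(n_scales - 1):
        powers.append(powers[-1] @ powers[-1])  # repeated squaring: T^2, T^4, ...
    bands = [powers[j] - powers[j + 1] for j in range(n_scales)]
    return bands + [powers[-1]]


def embed_genes(expression, n_scales=4, n_components=2, sigma=1.0):
    """expression: (n_cells, n_genes). Returns (n_genes, n_components)."""
    T = lazy_diffusion_operator(expression, sigma)
    # Each gene is a signal over cells, i.e. one column of `expression`.
    features = np.hstack(
        [(psi @ expression).T for psi in wavelet_dictionary(T, n_scales)]
    )
    # PCA via SVD (assumption: a stand-in for the learned reduction).
    features -= features.mean(axis=0)
    U, S, _ = np.linalg.svd(features, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

Because the filters telescope, they sum to the identity, so the dictionary keeps all of the information in a gene signal while separating it by scale on the cell graph.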
We demonstrate the utility of gene signal pattern analysis on T cells from a mouse model of peripheral tolerance in skin. The gene space mapping reveals a continuum of gene signals characterized by T cell subtypes and transcriptional programs related to effector function and proliferation. Furthermore, we build a multiscale manifold of 48 melanoma patient samples, demonstrating the ability of our method to characterize differences between responders and non-responders to checkpoint immunotherapy. Together, we show that gene signal pattern analysis, through methodology from graph signal processing, spectral graph theory, and machine learning, represents an avenue for future research in scRNA-seq analysis.
Gruber Foundation
It is currently challenging to analyze single-cell data consisting of many cells and samples, and to address variations arising from batch effects and different sample preparations. For this purpose, we present SAUCIE, a deep neural network that combines parallelization and scalability offered by neural networks, with the deep representation of data that can be learned by them to perform many single-cell data analysis tasks. Our regularizations (penalties) render features learned in hidden layers of the neural network interpretable. On large, multi-patient datasets, SAUCIE's various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization and unsupervised clustering, as well as other information that can be used to explore the data. We analyze a 180-sample dataset consisting of 11 million T cells from dengue patients in India, measured with mass cytometry. SAUCIE can batch correct and identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue.
We propose a new graph neural network (GNN) module, based on relaxations of recently proposed geometric scattering transforms, which consist of a cascade of graph wavelet filters. Our learnable geometric scattering (LEGS) module enables adaptive tuning of the wavelets to encourage band-pass features to emerge in learned representations. The incorporation of our LEGS-module in GNNs enables the learning of longer-range graph relations compared to many popular GNNs, which often rely on encoding graph structure via smoothness or similarity between neighbors. Further, its wavelet priors result in simplified architectures with significantly fewer learned parameters compared to competing GNNs. We demonstrate the predictive performance of LEGS-based networks on graph classification benchmarks, as well as the descriptive quality of their learned features in biochemical graph data exploration tasks. Our results show that LEGS-based networks match or outperform popular GNNs, as well as the original geometric scattering construction, on many datasets, in particular in biochemical domains, while retaining certain mathematical properties of handcrafted (non-learned) geometric scattering.
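For orientation, the handcrafted (non-learned) cascade that the LEGS module relaxes can be sketched as below: lazy random walk wavelets at fixed dyadic scales, an absolute-value nonlinearity between filters, and a sum over nodes to get graph-level features. The function names, the number of scales, and the use of first moments only are assumptions for this example. The learnable scale selection that defines LEGS is not implemented here.

```python
import numpy as np
from itertools import combinations


def lazy_walk(adj):
    """P = (I + A D^-1) / 2 for an undirected graph without isolated nodes."""
    adj = np.asarray(adj, dtype=float)
    return 0.5 * (np.eye(adj.shape[0]) + adj / adj.sum(axis=0))


def dyadic_wavelets(P, n_scales=3):
    """Psi_0 = I - P and Psi_j = P^(2^(j-1)) - P^(2^j) for j = 1..n_scales-1."""
    powers = [np.eye(P.shape[0]), P]
    for _ in range(n_scales - 1):
        powers.append(powers[-1] @ powers[-1])
    return [powers[j] - powers[j + 1] for j in range(n_scales)]


def scattering_features(adj, x, n_scales=3):
    """Zeroth-, first- and second-order scattering sums of node signal x."""
    wavelets = dyadic_wavelets(lazy_walk(adj), n_scales)
    first = [np.abs(psi @ x) for psi in wavelets]
    # Second order: apply a coarser wavelet to the modulus of a finer one.
    second = [
        np.abs(wavelets[j2] @ first[j1])
        for j1, j2 in combinations(range(n_scales), 2)
    ]
    return np.array(
        [np.abs(x).sum()] + [u.sum() for u in first] + [u.sum() for u in second]
    )
```

Summing over nodes makes the features independent of node ordering, which is what lets them feed a graph classifier directly.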
In modern relational machine learning it is common to encounter large graphs that arise via interactions or similarities between observations in many domains. Further, in many cases the target entities for analysis are actually signals on such graphs. We propose to compare and organize such datasets of graph signals by using an earth mover's distance (EMD) with a geodesic cost over the underlying graph. Typically, EMD is computed by optimizing over the cost of transporting one probability distribution to another over an underlying metric space. However, this is inefficient when computing the EMD between many signals. Here, we propose an unbalanced graph EMD that efficiently embeds the unbalanced EMD on an underlying graph into an L1 space, whose metric we call unbalanced diffusion earth mover's distance (UDEMD). Next, we show how this gives distances between graph signals that are robust to noise. Finally, we apply this to organizing patients based on clinical notes, embedding cells modeled as signals on a gene graph, and organizing genes modeled as signals over a large cell graph. In each case, we show that UDEMD-based embeddings find accurate distances that are highly efficient compared to other methods.
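The efficiency argument above rests on embedding each signal once and then comparing embeddings with a plain L1 distance, instead of solving a transport problem per pair. The sketch below illustrates that idea with multiscale diffusion differences. It is a simplified stand-in, not the UDEMD construction itself: the function names, the dyadic scales, and the scale weights controlled by `alpha` are assumptions for this example.

```python
import numpy as np


def diffusion_l1_embedding(P, signals, n_scales=4, alpha=0.5):
    """Embed each column of `signals` (n_nodes, n_signals) for L1 comparison.

    P is a diffusion operator on the graph (for example a lazy random walk).
    """
    power = P.copy()                  # P^(2^k), starting at k = 0
    current = power @ signals
    blocks = []
    for k in range(n_scales):
        power = power @ power         # P^(2^(k+1))
        coarser = power @ signals
        # Assumed weighting: coarser scales count more than finer ones.
        weight = 2.0 ** (-(n_scales - k - 1) * alpha)
        blocks.append(weight * (current - coarser))
        current = coarser
    blocks.append(current)            # most diffused version of each signal
    return np.vstack(blocks)          # (n_nodes * (n_scales + 1), n_signals)


def pairwise_l1(embedding):
    """All-pairs L1 distances between the columns of `embedding`."""
    diff = embedding[:, :, None] - embedding[:, None, :]
    return np.abs(diff).sum(axis=0)
```

The cost of the embedding grows with the number of signals, while any number of pairwise comparisons afterwards only needs vector subtraction.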
Mass-tag cell barcoding (MCB) labels individual cell samples with unique combinatorial barcodes, after which they are pooled for processing and measurement as a single multiplexed sample. The MCB method eliminates variability between samples in antibody staining and instrument sensitivity, reduces antibody consumption and shortens instrument measurement time. Here we present an optimized MCB protocol. The use of palladium-based labeling reagents expands the number of measurement channels available for mass cytometry and reduces interference with lanthanide-based antibody measurement. An error-detecting combinatorial barcoding scheme allows cell doublets to be identified and removed from the analysis. A debarcoding algorithm that is single cell-based rather than population-based improves the accuracy and efficiency of sample deconvolution. This debarcoding algorithm has been packaged into software that allows rapid and unbiased sample deconvolution. The MCB procedure takes 3-4 h, not including sample acquisition time of ∼1 h per million cells.
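One way to picture single-cell debarcoding with an error-detecting code is sketched below. It assumes a code in which every sample is positive in exactly 3 of 6 barcode channels, intensities already rescaled to roughly [0, 1] per channel, and a fixed separation cutoff. The scheme size, the cutoff value, and the function names are illustrative assumptions, not the published protocol or its software.

```python
import numpy as np
from itertools import combinations


def make_barcodes(n_channels=6, n_positive=3):
    """Map each set of positive channels to a sample index."""
    return {
        combo: idx
        for idx, combo in enumerate(combinations(range(n_channels), n_positive))
    }


def debarcode(intensities, n_positive=3, min_separation=0.3):
    """Assign each cell (row) to a sample, or -1 if the call is ambiguous."""
    X = np.asarray(intensities, dtype=float)
    order = np.argsort(-X, axis=1)                     # channels, brightest first
    ranked = np.take_along_axis(X, order, axis=1)
    # Gap between the dimmest "positive" and the brightest "negative" channel.
    separation = ranked[:, n_positive - 1] - ranked[:, n_positive]
    keys = make_barcodes(X.shape[1], n_positive)
    labels = np.full(len(X), -1)
    for i in range(len(X)):
        if separation[i] >= min_separation:
            positive = tuple(sorted(int(c) for c in order[i, :n_positive]))
            labels[i] = keys[positive]
    return labels, separation
```

A doublet of two differently barcoded cells tends to be bright in more than three channels, so its separation collapses and it is left unassigned. This is the sense in which a fixed-weight code is error-detecting.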
Cellular circuits sense the environment, process signals, and compute decisions using networks of interacting proteins. To model such a system, the abundance of each activated protein species can be described as a stochastic function of the abundance of other proteins. High-dimensional single-cell technologies, such as mass cytometry, offer an opportunity to characterize signaling circuit-wide. However, the challenge of developing and applying computational approaches to interpret such complex data remains. Here, we developed computational methods, based on established statistical concepts, to characterize signaling network relationships by quantifying the strengths of network edges and deriving signaling response functions. In comparing signaling between naïve and antigen-exposed CD4+ T lymphocytes, we find that although these two cell subtypes had similarly wired networks, naïve cells transmitted more information along a key signaling cascade than did antigen-exposed cells. We validated our characterization on mice lacking the extracellular-regulated mitogen-activated protein kinase (MAPK) ERK2, which showed stronger influence of pERK on pS6 (phosphorylated-ribosomal protein S6), in naïve cells as compared with antigen-exposed cells, as predicted. We demonstrate that by using cell-to-cell variation inherent in single-cell data, we can derive response functions underlying molecular circuits and drive the understanding of how cells process signals.
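As a simple illustration of scoring how much information one protein's abundance carries about another's, the sketch below uses a plain histogram plug-in estimate of mutual information across single cells. This is a generic estimator chosen for brevity, not the specific statistic developed in the work above. The function name and the bin count are assumptions.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Plug-in mutual information (in bits) between two per-cell measurements."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Comparing the score for the same edge between two cell populations is one way to express a statement such as "more information is transmitted along this cascade in one subtype". The plug-in estimate is biased upward for small samples, so populations of similar size give a fairer comparison.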
A computational method quantifies information flow in T cells.
Deciphering information flow in T cells
We can now measure the activation state of multiple components of biochemical signaling pathways in single cells. This ability reveals how information flows through such cellular regulatory pathways and how it is altered in disease. Krishnaswamy et al. applied statistical techniques to overcome the complexity and variation (or noise) in such single-cell measurements. They used these techniques to quantify information transfer between proteins that participate in antigen recognition in cells of the immune system. The methods should prove useful in analysis of other signaling circuits to enhance basic understanding and reveal potential therapeutic targets to fight disease.
Science, this issue 10.1126/science.1250689
Abstract
Here we focus on understanding mechanisms that drive dynamic changes in gene expression and epigenetic marks that enable triple negative breast cancer cells to change states, and to thereby invade tissues and seed secondary tumors. The epithelial-to-mesenchymal transition (EMT) facilitates invasion and migration away from the primary tumor site. However, it is increasingly apparent that the reverse process, the mesenchymal-to-epithelial transition (MET), enhances metastatic colonization and growth via reacquisition of the epithelial phenotype. With no therapies currently available to stop metastatic tumor growth, we aim to uncover the mechanisms driving the MET towards identifying novel anti-metastatic therapies. We use the 3D in vitro mammosphere model system where single tumor-initiating cells residing in a partial-EMT state develop into a 3D organoid over 30 days. We sampled cells at 5 time points and performed scRNA-seq and scATAC-seq to analyze cell states. We develop a novel computational model of cellular development based on the theory of dynamic optimal transport (OT) and continuous normalizing flows. Our model TrajectoryNet is a neural ODE (ordinary differential equation) network that models the gradient of cell state with respect to time continuously over the input space and over time from cross-sectional single-cell data. TrajectoryNet interpolates between collected timepoints and learns a continuous realistic progression that describes cellular evolution in terms of gene expression and chromatin accessibility. Key to TrajectoryNet is a unique regularization to penalize the magnitude of the gradient over the flow. We prove this results in dynamic OT, thereby discouraging the neural network from taking circuitous or unrealistic paths. In contrast to TrajectoryNet, pseudotime and RNA velocity methods are best suited to analysis within a single timepoint and do not handle large gaps between timepoints.
We compare TrajectoryNet to RNA velocity and static OT and show that TrajectoryNet achieves better trajectories in terms of predicting withheld timepoints. Using TrajectoryNet, we identify a continuous ordering of events during MET that shows when and how the epithelial cell states begin to emerge. Such a continuous ordering can give rise to causal associations that can be inhibited to alter MET mechanisms. We also differentiate between trajectories that show self-renewal and maintenance of the tumor-initiating cells, and trajectories that revert to an epithelial state. Further, we find that only ~10% of the initial seeded cells develop into mammospheres and identify which initial cells have the potential to seed secondary tumors. Hence, we can refine features (gene and epigenetic states) that define aggressive tumor-initiating cells in triple negative breast cancer, as well as their dynamics through the MET in order to find therapeutic targets.
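The link between penalizing the flow's velocity and dynamic optimal transport can be made concrete with the Benamou-Brenier formulation, in which the squared 2-Wasserstein distance is the minimum kinetic energy of a flow carrying one distribution to the other. The second expression below is only an assumed shape for a regularized continuous normalizing flow objective. The symbols f_θ (the learned time derivative of cell state), λ (the penalty weight), and the time limits t_0, t_1 are introduced here for illustration and do not appear in the abstract.

```latex
% Dynamic (Benamou-Brenier) form of the squared 2-Wasserstein distance
W_2^2(\mu, \nu) = \min_{\rho, v} \int_0^1 \!\! \int \lVert v(x,t) \rVert^2 \, \rho(x,t) \, dx \, dt
\quad \text{s.t.} \quad
\partial_t \rho + \nabla \cdot (\rho v) = 0, \quad
\rho(\cdot, 0) = \mu, \quad \rho(\cdot, 1) = \nu .

% Assumed shape of a velocity-regularized flow objective
\mathcal{L}(\theta) = -\, \mathbb{E}_{x \sim \mathrm{data}} \log p_\theta(x)
+ \lambda \int_{t_0}^{t_1} \mathbb{E}_{x(t)} \lVert f_\theta(x(t), t) \rVert^2 \, dt
```

Under this reading, the penalty term plays the role of the kinetic energy, so among flows that fit the observed timepoints equally well the shortest, least circuitous paths are preferred.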
Citation Format: Alexander Tong, Beatriz P. San Juan, Brandon Zhu, Christine L. Chaffer, Smita Krishnaswamy. Understanding the mesenchymal-to-epithelial transition and its drivers in triple-negative breast cancer with continuous normalizing flows [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 2839.