RNA sequencing experiments generate large amounts of information about expression levels of genes. Although they are mainly used for quantifying expression levels, they contain much more biologically important information such as copy number variants (CNVs). Here, we present CaSpER, a signal processing approach for identification, visualization, and integrative analysis of focal and large-scale CNV events in multiscale resolution using either bulk or single-cell RNA sequencing data. CaSpER integrates the multiscale smoothing of expression signal and allelic shift signals for CNV calling. The allelic shift signal measures the loss-of-heterozygosity (LOH), which is valuable for CNV identification. CaSpER employs an efficient methodology for the generation of a genome-wide B-allele frequency (BAF) signal profile from the reads and utilizes it for correction of CNV calls. CaSpER increases the utility of RNA-sequencing datasets and complements other tools for complete characterization and visualization of the genomic and transcriptomic landscape of single-cell and bulk RNA sequencing data.
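As a rough illustration of the BAF and allelic shift signals described above, the following minimal sketch computes a per-site B-allele frequency from read counts and smooths its deviation from 0.5 at several scales. The function names, the `min_depth` cutoff, the window sizes, and the choice of a sliding median are assumptions made for illustration; the abstract does not specify them, and this is not the CaSpER implementation.

```python
from statistics import median


def baf_signal(ref_counts, alt_counts, min_depth=10):
    """B-allele frequency per site; None where coverage is below min_depth (assumed cutoff)."""
    signal = []
    for ref, alt in zip(ref_counts, alt_counts):
        depth = ref + alt
        signal.append(alt / depth if depth >= min_depth else None)
    return signal


def allelic_shift(baf):
    """Deviation of each BAF value from the balanced heterozygous value 0.5."""
    return [None if b is None else abs(b - 0.5) for b in baf]


def median_smooth(signal, window):
    """Sliding median over an odd-sized window, skipping missing values."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        values = [v for v in signal[max(0, i - half): i + half + 1] if v is not None]
        smoothed.append(median(values) if values else None)
    return smoothed


def multiscale_smooth(signal, windows=(3, 5, 9)):
    """Smooth the same signal at several window sizes (hypothetical scales)."""
    return {w: median_smooth(signal, w) for w in windows}
```

A sustained shift away from zero across neighboring sites at the coarser scales would be consistent with LOH over a large region, whereas a shift visible only at the finest scale would point to a focal event or noise.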
Studies on genomic privacy have traditionally focused on identifying individuals using DNA variants. In contrast, molecular phenotype data, such as gene expression levels, are generally assumed to be free of such identifying information. Although there is no explicit genotypic information in phenotype data, adversaries can statistically link phenotypes to genotypes using publicly available genotype-phenotype correlations such as expression quantitative trait loci (eQTLs). This linking can be accurate when high-dimensional data (i.e., many expression levels) are used, and the resulting links can then reveal sensitive information (for example, the fact that an individual has cancer). Here we develop frameworks for quantifying the leakage of characterizing information from phenotype data sets. These frameworks can be used to estimate the leakage from large data sets before release. We also present a general three-step procedure for practically instantiating linking attacks and a specific attack using outlier gene expression levels that is simple yet accurate. Finally, we describe the effectiveness of this outlier attack under different scenarios.
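The outlier-based linking idea can be sketched as follows, under stated assumptions: each eQTL is reduced to one gene with a known correlation sign, an expression level is called an outlier when its rank-based extremity exceeds a hypothetical `threshold`, and linking picks the genotype record with the smallest total mismatch. All names and parameter values here are illustrative and are not taken from the abstract.

```python
def extremity(value, cohort_values):
    """Rank-based extremity in [-0.5, 0.5): fraction of the cohort below value, minus 0.5."""
    below = sum(1 for v in cohort_values if v < value)
    return below / len(cohort_values) - 0.5


def predict_genotypes(expr, cohort_expr, eqtl_sign, threshold=0.3):
    """Predict homozygous genotypes (0 or 2) only for genes with outlier expression.

    expr:        {gene: expression level of the target individual}
    cohort_expr: {gene: list of expression levels across the cohort}
    eqtl_sign:   {gene: +1 or -1, sign of the genotype-expression correlation}
    """
    predictions = {}
    for gene, sign in eqtl_sign.items():
        signed = sign * extremity(expr[gene], cohort_expr[gene])
        if signed >= threshold:
            predictions[gene] = 2
        elif signed <= -threshold:
            predictions[gene] = 0
    return predictions


def link(predictions, genotype_db):
    """Return the individual whose genotypes are closest to the predictions."""
    def distance(genotypes):
        return sum(abs(genotypes[g] - p) for g, p in predictions.items())
    return min(genotype_db, key=lambda individual: distance(genotype_db[individual]))
```

Restricting predictions to outlier genes trades coverage for accuracy: fewer genotypes are predicted, but each prediction is more likely to be correct, which is what makes the final matching step discriminative.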
Functional genomics experiments, such as RNA-seq, provide non-individual specific information about gene expression under different conditions such as disease and normal. There is a great desire to share these data. However, privacy concerns often preclude sharing of the raw reads. To enable safe sharing, aggregated summaries such as read-depth signal profiles and levels of gene expression are used. Projects such as GTEx and ENCODE share these because they ostensibly do not leak much identifying information. Here, we attempt to quantify the validity of this statement, measuring the leakage of genomic deletions from signal profiles. We present information theoretic measures for the degree to which one can genotype these deletions. We then develop practical genotyping approaches and demonstrate how to use these to identify an individual within a large cohort in the context of linking attacks. Finally, we present an anonymization method removing much of the leakage from signal profiles.
The prediction of secondary structure, i.e. the set of canonical base pairs between nucleotides, is a first step in developing an understanding of the function of an RNA sequence. The most accurate computational methods predict conserved structures for a set of homologous RNA sequences. These methods usually suffer from high computational complexity. In this paper, TurboFold, a novel and efficient method for secondary structure prediction for multiple RNA sequences, is presented.
TurboFold takes, as input, a set of homologous RNA sequences and outputs estimates of the base pairing probabilities for each sequence. The base pairing probabilities for a sequence are estimated by combining intrinsic information, derived from the sequence itself via the nearest neighbor thermodynamic model, with extrinsic information, derived from the other sequences in the input set. For a given sequence, the extrinsic information is computed by using pairwise-sequence-alignment-based probabilities for co-incidence with each of the other sequences, along with estimated base pairing probabilities, from the previous iteration, for the other sequences. The extrinsic information is introduced as free energy modifications for base pairing in a partition function computation based on the nearest neighbor thermodynamic model. This process yields updated estimates of base pairing probability. The updated base pairing probabilities in turn are used to recompute extrinsic information, resulting in the overall iterative estimation procedure that defines TurboFold.
TurboFold is benchmarked on a number of ncRNA datasets and compared against alternative secondary structure prediction methods. The iterative procedure in TurboFold is shown to improve estimates of base pairing probability with each iteration, though only small gains are obtained beyond three iterations. Secondary structures composed of base pairs with estimated probabilities higher than a significance threshold are shown to be more accurate for TurboFold than for alternative methods that estimate base pairing probabilities. TurboFold-MEA, which uses base pairing probabilities from TurboFold in a maximum expected accuracy algorithm for secondary structure prediction, has accuracy comparable to the best performing secondary structure prediction methods.
The computational and memory requirements for TurboFold are modest and, in terms of sequence length and number of sequences, scale much more favorably than joint alignment and folding algorithms.
TurboFold is an iterative probabilistic method for predicting secondary structures for multiple RNA sequences that efficiently and accurately combines the information from the comparative analysis between sequences with the thermodynamic folding model. Unlike most other multi-sequence structure prediction methods, TurboFold does not enforce strict commonality of structures and is therefore useful for predicting structures for homologous sequences that have diverged significantly. TurboFold can be downloaded as part of the RNAstructure package at http://rna.urmc.rochester.edu.
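The extrinsic information step described above can be sketched in simplified form. The sketch below maps the base pairing probabilities of each other sequence onto the target sequence through co-incidence probabilities and sums the contributions. The per-sequence `weights` argument and the plain weighted sum are assumptions for illustration; the conversion of this quantity into free energy modifications inside the partition function is not shown, and this is not the RNAstructure code.

```python
def extrinsic_information(pair_probs, coincidence, weights):
    """Extrinsic pairing evidence E[i][j] for one target sequence.

    pair_probs:  list over the other sequences n of matrices P_n[k][l],
                 the current base pairing probability estimates for sequence n
    coincidence: list over n of matrices C_n[i][k], the probability that
                 position i of the target is co-incident with position k of n
    weights:     list over n of non-negative weights (hypothetical parameter)
    """
    length = len(coincidence[0])
    extrinsic = [[0.0] * length for _ in range(length)]
    for P, C, w in zip(pair_probs, coincidence, weights):
        for i in range(length):
            for j in range(i + 1, length):
                total = 0.0
                for k, c_ik in enumerate(C[i]):
                    if c_ik == 0.0:
                        continue  # position k cannot contribute to pairs involving i
                    for l, c_jl in enumerate(C[j]):
                        total += c_ik * c_jl * P[k][l]
                extrinsic[i][j] += w * total
    return extrinsic
```

In an iterative scheme of the kind the abstract describes, this function would be re-evaluated in every round with the latest pairing probabilities of the other sequences, so that each sequence's estimate gradually absorbs evidence from its homologs.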
The Warburg effect is a tumor-related phenomenon that could potentially be targeted therapeutically. Here, we showed that glioblastoma (GBM) cultures and patients' tumors harbored super-enhancers in several genes related to the Warburg effect. By conducting a transcriptome analysis followed by ChIP-Seq coupled with a comprehensive metabolite analysis in GBM models, we found that FDA-approved global (panobinostat, vorinostat) and selective (romidepsin) histone deacetylase (HDAC) inhibitors elicited metabolic reprogramming in concert with disruption of several Warburg effect-related super-enhancers. Extracellular flux and carbon-tracing analyses revealed that HDAC inhibitors blunted glycolysis in a c-Myc-dependent manner and lowered ATP levels. This resulted in the engagement of oxidative phosphorylation (OXPHOS) driven by elevated fatty acid oxidation (FAO), rendering GBM cells dependent on these pathways. Mechanistically, interference with HDAC1/-2 elicited a suppression of c-Myc protein levels and a concomitant increase in two transcriptional drivers of oxidative metabolism, PGC1α and PPARD, suggesting an inverse relationship. Rescue and ChIP experiments indicated that c-Myc bound to the promoter regions of PGC1α and PPARD to counteract their upregulation driven by HDAC1/-2 inhibition. Finally, we demonstrated that combination treatment with HDAC and FAO inhibitors extended animal survival in patient-derived xenograft model systems in vivo more potently than single treatments in the absence of toxicity.
Sequencing of thousands of samples provides genetic variants with allele frequencies spanning a very large spectrum and gives invaluable insight into genetic determinants of diseases. Protecting the genetic privacy of participants is challenging as only a few rare variants can easily re-identify an individual among millions. In certain cases, there are policy barriers against sharing genetic data from indigenous populations and stigmatizing conditions. We present SVAT, a method for secure outsourcing of variant annotation and aggregation, which are two basic steps in variant interpretation and detection of causal variants. SVAT uses homomorphic encryption to encrypt the data at the client side. The data always stays encrypted while it is stored, in transit, and, most importantly, while it is analyzed. SVAT makes use of a vectorized data representation to convert annotation and aggregation into efficient vectorized operations in a single framework. Also, SVAT utilizes a secure re-encryption approach so that multiple disparate genotype datasets can be combined for federated aggregation and secure computation of allele frequencies on the aggregated dataset. Overall, SVAT provides a secure, flexible, and practical framework for privacy-aware outsourcing of annotation, filtering, and aggregation of genetic variants. SVAT is publicly available for download from https://github.com/harmancilab/SVAT.
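To make the vectorized representation concrete, here is a minimal plaintext sketch in which annotation becomes an element-wise product and aggregation becomes an element-wise sum over a fixed variant index. Encryption is deliberately omitted: the point is only that both steps reduce to additions and multiplications, the operations that homomorphic encryption schemes typically support. The function names and the dictionary-based input format are assumptions, not the SVAT interface.

```python
def to_vector(variants, index):
    """Encode {variant_id: alternate allele count} onto a fixed variant index."""
    vector = [0] * len(index)
    for variant_id, count in variants.items():
        vector[index[variant_id]] = count
    return vector


def annotate(vector, annotation_mask):
    """Element-wise product: keeps only variants flagged (1) in the mask."""
    return [g * a for g, a in zip(vector, annotation_mask)]


def aggregate(vectors):
    """Element-wise sum of genotype vectors: alternate allele count per variant."""
    return [sum(column) for column in zip(*vectors)]


def allele_frequencies(vectors):
    """Alternate allele frequency per variant, assuming diploid individuals."""
    chromosomes = 2 * len(vectors)
    return [count / chromosomes for count in aggregate(vectors)]
```

Because every individual is encoded on the same index, vectors from different sites line up position by position, which is what allows counts from disparate datasets to be summed without exchanging per-variant identifiers.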
Introduction
Meningiomas are the most common primary intracranial tumor. Recently, various genetic classification systems for meningioma have been described. We sought to identify clinical drivers of different molecular changes in meningioma. In particular, the clinical and genomic consequences of smoking in patients with meningiomas remain unexplored.
Methods
Eighty-eight tumor samples were analyzed in this study. Whole exome sequencing (WES) was used to assess somatic mutation burden. RNA sequencing data were used to identify differentially expressed genes (DEG) and gene sets (GSEA).
Results
Fifty-seven patients had no history of smoking, twenty-two were past smokers, and nine were current smokers. The clinical data showed no major differences in natural history across smoking status. WES revealed an absence of AKT1 mutations in current or past smokers compared to non-smokers (p = 0.046). Current smokers had an increased mutation rate in NOTCH2 compared to past and never smokers (p < 0.05). Mutational signatures from current and past smokers showed disrupted DNA mismatch repair (cosine similarity = 0.759 and 0.783). DEG analysis revealed that the xenobiotic metabolic genes UGT2A1 and UGT2A2 were both significantly downregulated in current smokers compared to past (Log2FC = −3.97, padj = 0.0347 and Log2FC = −4.18, padj = 0.0304) and never smokers (Log2FC = −3.86, padj = 0.0235 and Log2FC = −4.20, padj = 0.0149). GSEA of current smokers showed downregulation of xenobiotic metabolism and enrichment for G2M checkpoint, E2F targets, and mitotic spindle compared to past and never smokers (FDR < 25% each).
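The cosine similarity values reported above compare an observed mutational profile with a reference signature, both treated as vectors of mutation-type proportions. A minimal sketch of that calculation follows; the function name and the list-based input format are assumptions for illustration, and the vectors used in the study are not reproduced here.

```python
import math


def cosine_similarity(profile, signature):
    """Cosine of the angle between two equal-length non-negative vectors."""
    if len(profile) != len(signature):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(profile, signature))
    norm_profile = math.sqrt(sum(x * x for x in profile))
    norm_signature = math.sqrt(sum(y * y for y in signature))
    if norm_profile == 0 or norm_signature == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_profile * norm_signature)
```

For non-negative vectors the value ranges from 0 (no shared mutation types) to 1 (identical proportions), so it is insensitive to the total number of mutations in a sample.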
Conclusion
In this study, we conducted a comparative analysis of meningioma patients based on their smoking history, examining both their clinical trajectories and molecular changes. Meningiomas from current smokers were more likely to harbor NOTCH2 mutations, and AKT1 mutations were absent in current or past smokers. Moreover, both current and past smokers exhibited a mutational signature associated with DNA mismatch repair. Meningiomas from current smokers demonstrate downregulation of the xenobiotic metabolic enzymes UGT2A1 and UGT2A2, which are downregulated in other smoking-related cancers. Furthermore, current smokers exhibited downregulation of xenobiotic metabolic gene sets, as well as enrichment in gene sets related to mitotic spindle, E2F targets, and G2M checkpoint, which are hallmark pathways involved in cell division and DNA replication control. In aggregate, our results demonstrate novel alterations in meningioma molecular biology in response to systemic carcinogens.
RNA-sequencing has become a standard tool for analyzing gene activity in bulk samples and at the single-cell level. By increasing sample sizes and cell counts, this technique can uncover substantial information about cellular transcriptional states. Beyond quantification of gene expression, RNA-seq can be used for detecting variants, including single nucleotide polymorphisms, small insertions/deletions, and larger variants, such as copy number variants. Notably, joint analysis of variants with cellular transcriptional states may provide insights into the impact of mutations, especially for complex and heterogeneous samples. However, this analysis is often challenging due to a prohibitively high number of variants and cells, which are difficult to summarize and visualize. Further, there is a dearth of methods that assess and summarize the association between detected variants and cellular transcriptional states.
Here, we introduce XCVATR (eXpressed Clusters of Variant Alleles in Transcriptome pRofiles), a method that identifies variants and detects local enrichment of expressed variants within embeddings of samples and cells in single-cell and bulk RNA-seq datasets. XCVATR visualizes local "clumps" of small and large-scale variants and searches for patterns of association between each variant and cellular states, as described by the coordinates of the cell embedding, which can be computed independently using any embedding method, such as principal component analysis or t-distributed stochastic neighbor embedding. Through simulations and analysis of real datasets, we demonstrate that XCVATR can detect enrichment of expressed variants and provide insight into the transcriptional states of cells and samples. We next sequenced two new single-cell RNA-seq tumor samples and applied XCVATR. XCVATR revealed subtle differences in CNV impact on tumors.
XCVATR is publicly available to download from https://github.com/harmancilab/XCVATR .
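One way to picture local enrichment of an expressed variant within an embedding is a kernel-weighted comparison between the neighborhood of a point and the whole dataset. The sketch below is an illustration under assumptions: the Gaussian kernel, the `bandwidth` parameter, and the ratio-based score are choices made here and are not taken from the abstract or from the XCVATR code.

```python
import math


def local_variant_enrichment(coords, has_variant, center, bandwidth=1.0):
    """Kernel-weighted fraction of variant-expressing cells near center,
    divided by the overall fraction across all cells.

    coords:      list of embedding coordinates, one tuple per cell
    has_variant: list of booleans, True if the cell expresses the variant allele
    center:      point in the embedding around which enrichment is evaluated
    bandwidth:   Gaussian kernel width (hypothetical parameter)
    """
    weights = [
        math.exp(-math.dist(c, center) ** 2 / (2 * bandwidth ** 2))
        for c in coords
    ]
    local_fraction = sum(w for w, v in zip(weights, has_variant) if v) / sum(weights)
    overall_fraction = sum(has_variant) / len(has_variant)
    return local_fraction / overall_fraction if overall_fraction > 0 else 0.0
```

A score well above 1 at some location suggests a "clump" of cells carrying the variant there, while a score near 1 everywhere suggests the variant is spread evenly across transcriptional states.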
There is a lack of approaches for identifying pathogenic genomic structural variants (SVs) although they play a crucial role in many diseases. We present a mechanism-agnostic machine learning-based workflow, called SVFX, to assign pathogenicity scores to somatic and germline SVs. In particular, we generate somatic and germline training models, which include genomic, epigenomic, and conservation-based features, for SV call sets in diseased and healthy individuals. We then apply SVFX to SVs in cancer and other diseases; SVFX achieves high accuracy in identifying pathogenic SVs. Predicted pathogenic SVs in cancer cohorts are enriched among known cancer genes and many cancer-related pathways.
Growing regulatory requirements set barriers around genetic data sharing and collaborations. Moreover, existing privacy-aware paradigms are challenging to deploy in collaborative settings. We present COLLAGENE, a tool base for building secure collaborative genomic data analysis methods. COLLAGENE protects data using shared-key homomorphic encryption and combines encryption with multiparty strategies for efficient privacy-aware collaborative method development. COLLAGENE provides ready-to-run tools for encryption/decryption, matrix processing, and network transfers, which can be immediately integrated into existing pipelines. We demonstrate the usage of COLLAGENE by building a practical federated GWAS protocol for binary phenotypes and a secure meta-analysis protocol. COLLAGENE is available at https://zenodo.org/record/8125935 .