Motivation: The elucidation of biological pathways enriched with differentially expressed genes has become an integral part of the analysis and interpretation of microarray data. Several statistical methods are commonly used in this context, but the question of the optimal approach has still not been resolved. Results: We present a logistic regression-based method (LRpath) for identifying predefined sets of biologically related genes enriched with (or depleted of) differentially expressed transcripts in microarray experiments. We functionally relate the odds of gene set membership with the significance of differential expression, and calculate adjusted P-values as a measure of statistical significance. The new approach is compared with Fisher's exact test and other relevant methods in a simulation study and in the analysis of two breast cancer datasets. Overall results were concordant between the simulation study and the experimental data analysis, and provide useful information to investigators seeking to choose the appropriate method. LRpath displayed robust behavior and improved statistical power compared with tested alternatives. It is applicable in experiments involving two or more sample types, and accepts significance statistics of the investigator's choice as input. Availability: An R function implementing LRpath can be downloaded from http://eh3.uc.edu/lrpath. Contact: mario.medvedovic@uc.edu Supplementary information: Supplementary data are available at Bioinformatics online and at http://eh3.uc.edu/lrpath.
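The core idea in the abstract above, relating the odds of gene set membership to the significance of differential expression, can be illustrated with a small sketch. This is not the authors' R function: the function names, the toy P-values, and the choice of -log10(P) as the predictor are assumptions made for illustration, and the multiple-testing adjustment across gene sets is left out.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(x, y, n_iter=100, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson; return (b0, b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p             # score for the intercept
            g1 += (yi - p) * xi      # score for the slope
            h00 += w                 # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step = (X'WX)^-1 * score
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard error of the slope from the inverse information matrix
    # (evaluated at the last iterate, which is essentially the optimum).
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

def enrichment_sketch(pvals, in_set):
    """Hypothetical helper: test whether set membership tracks significance."""
    x = [-math.log10(p) for p in pvals]      # assumed significance statistic
    _, b1, se = fit_logistic(x, in_set)
    z = b1 / se                               # Wald statistic for the slope
    p_wald = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    return {"slope": b1, "odds_ratio": math.exp(b1), "p_value": p_wald}

# Toy data (invented): per-gene P-values and membership in one gene set.
pvals = [0.001, 0.002, 0.01, 0.03, 0.2, 0.5, 0.04, 0.3,
         0.6, 0.8, 0.9, 0.15, 0.7, 0.05, 0.45, 0.95]
in_set = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
result = enrichment_sketch(pvals, in_set)
```

A positive slope (odds ratio above 1) would suggest enrichment of the gene set with differentially expressed genes, and a negative slope would suggest depletion. Unlike Fisher's exact test, this formulation needs no significance cutoff to split genes into "changed" and "unchanged".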
Improved understanding of the multilayer regulation of the human genome has led to a greater appreciation of environmental, nutritional, and epigenetic risk factors for human disease. Chromatin remodeling, histone tail modifications, and DNA methylation are dynamic epigenetic changes responsive to external stimuli. Careful interpretation can provide insights for actionable public health through collaboration between population and basic scientists and through integration of multiple data sources. We review key findings in environmental epigenetics both in human population studies and in animal models, and discuss the implications of these results for risk assessment and public health protection. To ultimately succeed in identifying epigenetic mechanisms leading to complex phenotypes and disease, researchers must integrate the various animal models, human clinical approaches, and human population approaches while paying attention to life-stage sensitivity, to generate effective prescriptions for human health evaluation and disease prevention.
Abstract
Motivation
Single-cell sequencing enables exploration of the pathways and processes of cells and cell populations. However, there is a paucity of pathway enrichment methods designed to tolerate the high noise and low gene coverage of this technology. When gene expression data are noisy and signals are sparse, testing pathway enrichment based on gene expression may not yield statistically significant results, which is particularly problematic when detecting the pathways enriched in less abundant cells that are vulnerable to disturbances.
Results
In this project, we developed Weighted Concept Signature Enrichment Analysis, a method specialized for pathway enrichment analysis of single-cell transcriptomics (scRNA-seq) data. Weighted Concept Signature Enrichment Analysis takes a broader approach to assessing the functional relations of pathway gene sets to differentially expressed genes, and leverages the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes, which we term the universal concept signature, to tolerate the high noise and low coverage of this technology. We then incorporated Weighted Concept Signature Enrichment Analysis into an R package called “IndepthPathway” so that biologists can broadly apply this method to pathway analysis of bulk and single-cell sequencing data. By simulating technical variability and dropouts in gene expression characteristic of scRNA-seq, and by benchmarking on a real dataset of matched single-cell and bulk RNA-seq data, we demonstrate that IndepthPathway presents outstanding stability and depth in pathway enrichment results under stochasticity of the data, and thus will substantially improve the scientific rigor of pathway analysis for single-cell sequencing data.
Availability and implementation
The IndepthPathway R package is available through: https://github.com/wangxlab/IndepthPathway.
The incidence of human papillomavirus (HPV)-related oropharynx cancer has steadily increased over the past two decades and now represents a majority of oropharyngeal cancer cases. Integration of the HPV genome into the host genome is a common event during carcinogenesis that has clinically relevant effects if the viral early genes are transcribed. Understanding the impact of HPV integration on clinical outcomes of head and neck squamous cell carcinoma (HNSCC) is critical for implementing deescalated treatment approaches for HPV+ HNSCC patients. RNA sequencing (RNA-seq) data from HNSCC tumors (n = 84) were used to identify and characterize expressed integration events, which were overrepresented near known head and neck, lung, and urogenital cancer genes. Five genes were recurrent. A significant number of genes detected to have integration events were found to interact with Tp63, ETS, and/or FOX1A. Patients with no detected integration had better survival than integration-positive and HPV− patients. Furthermore, integration-negative tumors were characterized by strongly heightened signatures for immune cells, including CD4+, CD3+, regulatory, CD8+ T cells, NK cells, and B cells, compared with integration-positive tumors. Finally, genes with elevated expression in integration-negative specimens were strongly enriched with immune-related gene ontology terms, while upregulated genes in integration-positive tumors were enriched for keratinization, RNA metabolism, and translation.
These findings demonstrate the clinical relevancy of expressed HPV integration, which is characterized by a change in immune response and/or aberrant expression of the integration-harboring cancer-related genes, and suggest strong natural selection for tumor cells with expressed integration events in key carcinogenic genes.
Evidence supports a role for epigenetic mechanisms in the pathogenesis of late-onset Alzheimer's disease (LOAD), but little has been done on a genome-wide scale to identify potential sites involved in disease. This study investigates human postmortem frontal cortex genome-wide DNA methylation profiles between 12 LOAD and 12 cognitively normal age- and gender-matched subjects. Quantitative DNA methylation is determined at 27,578 CpG sites spanning 14,475 genes via the Illumina Infinium HumanMethylation27 BeadArray. Data are analyzed using parallel linear models adjusting for age and gender with empirical Bayes standard error methods. Gene-specific technical and functional validation is performed on an additional 13 matched pair samples, encompassing a wider age range. Analysis reveals 948 CpG sites representing 918 unique genes as potentially associated with LOAD disease status pending confirmation in additional study populations. Across these 948 sites the subtle mean methylation difference between cases and controls is 2.9%. The CpG site with a minimum false discovery rate located in the promoter of the gene Transmembrane Protein 59 (TMEM59) is 7.3% hypomethylated in cases. Methylation at this site is functionally associated with tissue RNA and protein levels of the TMEM59 gene product. The TMEM59 gene identified from our discovery approach was recently implicated in amyloid-β protein precursor post-translational processing, supporting a role for epigenetic change in LOAD pathology. This study demonstrates widespread, modest discordant DNA methylation in LOAD-diseased tissue independent from DNA methylation changes with age. Identification of epigenetic biomarkers of LOAD risk may allow for the development of novel diagnostic and therapeutic targets.
Revealing the gene targets of distal regulatory elements is challenging yet critical for interpreting regulome data. Experiment-derived enhancer-gene links are restricted to a small set of enhancers and/or cell types, while the accuracy of genome-wide approaches remains elusive due to the lack of a systematic evaluation. We combined multiple spatial and in silico approaches for defining enhancer locations and linking them to their target genes aggregated across >500 cell types, generating 1860 human genome-wide distal enhancer-to-target gene definitions (EnTDefs). To evaluate performance, we used gene set enrichment (GSE) testing on 87 independent ENCODE ChIP-seq datasets of 34 transcription factors (TFs) and assessed concordance of results with known TF Gene Ontology annotations, and other benchmarks.
The top-ranked 741 (40%) EnTDefs significantly outperform the common, naïve approach of linking distal regions to the nearest genes, and the top 10 EnTDefs perform well when applied to ChIP-seq data of other cell types. The GSE-based ranking of EnTDefs is highly concordant with ranking based on overlap with curated benchmarks of enhancer-gene interactions. Both our top general EnTDef and cell-type-specific EnTDefs significantly outperform seven independent computational and experiment-based enhancer-gene pair datasets. We show that using our top EnTDefs for GSE with either genome-wide DNA methylation or ATAC-seq data better recapitulates the biological processes changed in gene expression data generated in parallel for the same experiment than using our lower-ranked EnTDefs.
Our findings illustrate the power of our approach to provide genome-wide interpretation regardless of cell type.
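For orientation, the naïve baseline that the EnTDefs are compared against, linking each distal region to its nearest gene, can be sketched as follows. This is a minimal illustration only: the function name, gene labels, and coordinates are hypothetical, a single chromosome is assumed, and each gene is represented by one transcription start site (TSS) position.

```python
import bisect

def nearest_gene(region_mid, tss_sorted):
    """Return the gene whose TSS is closest to a region midpoint.

    tss_sorted: list of (position, gene) tuples for one chromosome,
    sorted by position. Ties go to the upstream (left) gene.
    """
    positions = [pos for pos, _ in tss_sorted]
    i = bisect.bisect_left(positions, region_mid)
    # Only the TSSs immediately left and right of the midpoint can be nearest.
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda t: abs(t[0] - region_mid))[1]

# Hypothetical TSS annotation and region midpoints.
tss = [(1000, "GENE_A"), (5000, "GENE_B"), (20000, "GENE_C")]
links = {mid: nearest_gene(mid, tss) for mid in (1200, 4000, 13000)}
```

The resulting region-to-gene links are what a gene set enrichment test would then consume. An enhancer-to-target definition replaces this distance rule with links drawn from other evidence, which is where the comparison described above comes in.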
Piwi-interacting RNAs (piRNAs) are small non-coding RNAs that associate with PIWI proteins for transposon silencing via DNA methylation and are highly expressed and extensively studied in the germline. Mature germline piRNAs typically consist of 24-32 nucleotides, with a strong preference for a 5ʹ uridine signature, an adenosine signature at position 10, and a 2ʹ-O-methylation signature at the 3ʹ end. piRNA presence in somatic tissues, however, is not well characterized and requires further systematic evaluation. In the current study, we identified piRNAs and associated machinery from mouse somatic tissues representing the three germ layers. piRNA specificity was improved by combining small RNA size selection, sodium periodate treatment enrichment for piRNA over other small RNA, and small RNA next-generation sequencing. We identify PIWIL1, PIWIL2, and PIWIL4 expression in brain, liver, kidney, and heart. Of note, somatic piRNAs are shorter in length and tissue-specific, with increased occurrence of unique piRNAs in hippocampus and liver, compared to the germline. Hippocampus contains 5,494 piRNA-like peaks, the highest expression among all tested somatic tissues, followed by cortex (1,963), kidney (580), and liver (406). The study identifies 26 piRNA sequence species and 40 piRNA locations exclusive to all examined somatic tissues. Although piRNA expression has long been considered exclusive to the germline, our results support that piRNAs are expressed in several somatic tissues that may influence piRNA functions in the soma. Once confirmed, the PIWI/piRNA system may serve as a potential tool for future research in epigenome editing to improve human health by manipulating DNA methylation.
The molecular mechanisms underlying the sex differences in human muscle morphology and function remain to be elucidated. The sex differences in the skeletal muscle transcriptome in both the resting state and following anabolic stimuli, such as resistance exercise (RE), might provide insight into the contributors to sexual dimorphism of muscle phenotypes. We used microarrays to profile the transcriptome of the biceps brachii of young men and women who underwent an acute unilateral RE session following 12 weeks of progressive training. Bilateral muscle biopsies were obtained either at an early (4 h post-exercise) or late recovery (24 h post-exercise) time point. Muscle transcription profiles were compared in the resting state between men (n = 6) and women (n = 8), and in response to acute RE in trained exercised vs. untrained non-exercised control muscle for each sex and time point separately (4 h post-exercise, n = 3 males, n = 4 females; 24 h post-exercise, n = 3 males, n = 4 females). A logistic regression-based method (LRpath), following Bayesian moderated t-statistic (IMBT), was used to test gene functional groups and biological pathways enriched with differentially expressed genes.
This investigation identified extensive sex differences present in the muscle transcriptome at baseline and following acute RE. In the resting state, female muscle had a greater transcript abundance of genes involved in fatty acid oxidation and gene transcription/translation processes. After strenuous RE at the same relative intensity, the time course of the transcriptional modulation was sex-dependent. Males experienced prolonged changes while females exhibited a rapid restoration. Most of the biological processes involved in the RE-induced transcriptional regulation were observed in both males and females, but sex specificity was suggested for several signaling pathways including activation of notch signaling and TGF-beta signaling in females. Sex differences in skeletal muscle transcriptional regulation might implicate a mechanism behind disproportional muscle growth in males as compared with female counterparts after RE training at the same relative intensity.
Sex differences exist in skeletal muscle gene transcription both at rest and following acute RE, suggesting that sex is a significant modifier of the transcriptional regulation in skeletal muscle. The findings from the present study provide insight into the molecular mechanisms for sex differences in muscle phenotypes and for muscle transcriptional regulation associated with training adaptations to resistance exercise.
Human exposure to toxic chemicals presents a huge health burden. Key to understanding chemical toxicity is knowledge of the molecular target(s) of the chemicals. Because a comprehensive safety assessment for all chemicals is infeasible due to limited resources, a robust computational method for discovering targets of environmental exposures is a promising direction for public health research. In this study, we implemented a novel matrix completion algorithm named coupled matrix–matrix completion (CMMC) for predicting direct and indirect exposome-target interactions, which exploits the vast amount of accumulated data regarding chemical exposures and their molecular targets. Our approach achieved an AUC of 0.89 on a benchmark data set generated using data from the Comparative Toxicogenomics Database. Our case studies with bisphenol A and its analogues, PFAS, dioxins, PCBs, and VOCs show that CMMC can be used to accurately predict molecular targets of novel chemicals without any prior bioactivity knowledge. Our results demonstrate the feasibility and promise of computationally predicting environmental chemical-target interactions to efficiently prioritize chemicals in hazard identification and risk assessment.
During development, the mammary gland undergoes extensive remodeling driven by stem cells. Breast cancers are also hierarchically organized and driven by cancer stem cells characterized by CD44+CD24low/− or aldehyde dehydrogenase (ALDH) expression. These markers identify mesenchymal and epithelial populations both capable of tumor initiation. Less is known about these populations in non-cancerous mammary glands. From RNA sequencing, ALDH+ and ALDH−CD44+CD24− human mammary cells have epithelial-like and mesenchymal-like characteristics, respectively, with some co-expressing ALDH+ and CD44+CD24− by flow cytometry. At the single-cell level, these cells have the greatest mammosphere-forming capacity and express high levels of stemness and epithelial-to-mesenchymal transition-associated genes including ID1, SOX2, TWIST1, and ZEB2. We further identify single ALDH+ cells with a hybrid epithelial/mesenchymal phenotype that express genes associated with aggressive triple-negative breast cancers. These results highlight single-cell analyses to characterize tissue heterogeneity, even in marker-enriched populations, and identify genes and pathways that define this heterogeneity.
• Isolation and RNA-seq of ALDH+ and CD44+CD24− breast cells
• Unlike in cancer, there is substantial overlap in ALDH+ and CD44+CD24− populations
• Single-cell analysis of ALDH+ cells identifies unexpected subpopulation structure
• Hybrid epithelial/mesenchymal ALDH+ cells have a cancer-like expression signature
In this article, Colacino and colleagues use flow-cytometry-sorted populations and single-cell analyses to investigate human mammary stem cells. They discover unexpected phenotypic and functional heterogeneity at the single-cell level, including a subpopulation of ALDH+ stem cells with a hybrid epithelial/mesenchymal phenotype and triple-negative breast cancer-like gene expression pattern.