The Cancer Genome Atlas (TCGA) has generated comprehensive molecular profiles. We aim to identify a set of genes whose expression patterns can distinguish diverse tumor types. Those features may serve as biomarkers for tumor diagnosis and drug development.
Using RNA-seq expression data, we undertook a pan-cancer classification of 9,096 TCGA tumor samples representing 31 tumor types. We randomly assigned 75% of samples into training and 25% into testing, proportionally allocating samples from each tumor type.
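The proportional 75%/25% allocation described above can be sketched as a stratified split. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the seed, and the toy label counts are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.75, seed=0):
    """Return (train_idx, test_idx), splitting each tumor type separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Allocate the same fraction of every class to training.
        cut = round(train_frac * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical counts, not the TCGA sample sizes).
labels = ["BRCA"] * 40 + ["COAD"] * 20 + ["READ"] * 8
train, test = stratified_split(labels)
print(len(train), len(test))
```

Splitting within each class keeps rare tumor types represented in both sets, which a single global shuffle does not guarantee.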
We could correctly classify more than 90% of the test set samples. Accuracies were high for all but three of the 31 tumor types; the most notable exception was READ (rectum adenocarcinoma), which was largely indistinguishable from COAD (colon adenocarcinoma). We also carried out pan-cancer classification, separately for males and females, on 23 sex non-specific tumor types (those unrelated to reproductive organs). Results from these gender-specific analyses largely recapitulated results when gender was ignored. Remarkably, more than 80% of the 100 most discriminative genes selected from each gender separately overlapped. Genes that were differentially expressed between genders included BNC1, FAT2, FOXA1, and HOXA11. FOXA1 has been shown to play a role in sexual dimorphism in liver cancer. The differentially discriminative genes we identified might be important for the gender differences in tumor incidence and survival.
We were able to identify many sets of 20 genes that could correctly classify more than 90% of the samples from 31 different tumor types using TCGA RNA-seq data. This accuracy is remarkable given the number of the tumor types and the total number of samples involved. We achieved similar results when we analyzed 23 non-sex-specific tumor types separately for males and females. We regard the frequency with which a gene appeared in those sets as measuring its importance for tumor classification. One third of the 50 most frequently appearing genes were pseudogenes; the degree of enrichment may be indicative of their importance in tumor classification. Lastly, we identified a few genes that might play a role in sexual dimorphism in certain cancers.
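The frequency-based importance measure described above amounts to counting how often each gene appears across the selected gene sets. A minimal sketch follows; the gene names and the helper `gene_frequencies` are placeholders for illustration, not the study's data or code.

```python
from collections import Counter


def gene_frequencies(gene_sets):
    """Count, for each gene, the number of gene sets that contain it."""
    counts = Counter()
    for genes in gene_sets:
        counts.update(set(genes))  # count a gene at most once per set
    return counts


# Hypothetical gene sets (the study used many sets of 20 genes each).
gene_sets = [
    ["G1", "G2", "G3"],
    ["G2", "G3", "G4"],
    ["G2", "G5", "G6"],
]
freq = gene_frequencies(gene_sets)
# Rank by descending frequency, breaking ties alphabetically.
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[:3])
```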
Due to the poor prognosis of advanced metastatic melanoma, it is crucial to find early biomarkers that help identify which melanomas will metastasize. By comparing the gene expression data from primary and cutaneous melanoma samples from The Cancer Genome Atlas (TCGA), we identified GPC6 among a set of genes whose expression levels can distinguish between primary melanoma and regional cutaneous/subcutaneous metastases. Glypicans are thought to play a role in tumor growth by regulating the signaling pathways of Wnt, Hedgehogs, fibroblast growth factors (FGFs), and bone morphogenetic proteins (BMPs). We showed that GPC6 expression was up-regulated in a melanoma cell line compared to normal melanocytes and in metastatic melanoma compared to primary melanoma. Furthermore, GPC6 expression was positively correlated with genes largely involved in cell adhesion and migration in both melanoma samples and in RNA-seq samples from other TCGA tumors. Our results suggest that GPC6 may play a role in tumor metastatic progression. In TCGA melanoma samples, we also showed that GPC6 expression was negatively correlated with miR-509-3p, which has previously been shown to function as a tumor suppressor in various cancer cell lines. We overexpressed miR-509-3p in A375 melanoma cells and showed that GPC6 expression was significantly suppressed. This result suggested that GPC6 was a putative target of miR-509-3p in melanoma. Together, our findings identified GPC6 as an early biomarker for melanoma metastatic progression, one that can be regulated by miR-509-3p.
Quantifying cell-type proportions and their corresponding gene expression profiles in tissue samples would enhance understanding of the contributions of individual cell types to the physiological states of the tissue. Current approaches that address tissue heterogeneity have drawbacks. Experimental techniques, such as fluorescence-activated cell sorting and single-cell RNA sequencing, are expensive. Computational approaches that use expression data from heterogeneous samples are promising, but most of the current methods estimate either cell-type proportions or cell-type-specific expression profiles by requiring the other as input. Although such partial deconvolution methods have been successfully applied to tumor samples, the additional input required may be unavailable. We introduce a novel complete deconvolution method, CDSeq, that uses only RNA-Seq data from bulk tissue samples to simultaneously estimate both cell-type proportions and cell-type-specific expression profiles. Using several synthetic and real experimental datasets with known cell-type composition and cell-type-specific expression profiles, we compared CDSeq's complete deconvolution performance with seven other established deconvolution methods. Complete deconvolution using CDSeq represents a substantial technical advance over partial deconvolution approaches and will be useful for studying cell mixtures in tissue samples. CDSeq is available in a GitHub repository (MATLAB and Octave code): https://github.com/kkang7/CDSeq.
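To make the distinction between partial and complete deconvolution concrete, the sketch below shows only the forward direction under a simple linear mixing assumption: bulk expression is the cell-type-specific profiles weighted by the cell-type proportions. This is not the CDSeq algorithm, whose model is not described in this abstract; the function `mix` and all numbers are hypothetical.

```python
def mix(profiles, proportions):
    """Linear mixing: bulk[g][s] = sum_c profiles[g][c] * proportions[c][s].

    profiles:    genes x cell types (cell-type-specific expression)
    proportions: cell types x samples (each sample's column sums to 1)
    """
    n_genes = len(profiles)
    n_types = len(proportions)
    n_samples = len(proportions[0])
    return [
        [
            sum(profiles[g][c] * proportions[c][s] for c in range(n_types))
            for s in range(n_samples)
        ]
        for g in range(n_genes)
    ]


# Hypothetical toy data: 3 genes, 2 cell types, 2 bulk samples.
profiles = [[10.0, 0.0], [2.0, 8.0], [0.0, 5.0]]
proportions = [[0.75, 0.25], [0.25, 0.75]]
bulk = mix(profiles, proportions)
print(bulk)
```

A partial deconvolution method is given `bulk` plus one of the two factors and estimates the other. A complete deconvolution method is given `bulk` alone and must estimate both `profiles` and `proportions`.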
Human cancer cell line profiling and drug sensitivity studies provide valuable information about the therapeutic potential of drugs and their possible mechanisms of action. The goal of those studies is to translate the findings from in vitro studies of cancer cell lines into in vivo therapeutic relevance and, eventually, patients' care. Tremendous progress has been made.
In this work, we built predictive models for 453 drugs using data on gene expression and drug sensitivity (IC50) from cancer cell lines. We identified many known drug-gene interactions and uncovered several potentially novel drug-gene associations. Importantly, we further applied these predictive models to ~17,000 bulk RNA-seq samples from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) database to predict drug sensitivity for both normal and tumor tissues. We created a web site for users to visualize and download our predicted data (https://manticore.niehs.nih.gov/cancerRxTissue). Using trametinib as an example, we showed that our approach can faithfully recapitulate the known tumor specificity of the drug.
We demonstrated that our approach can predict drugs that 1) are tumor-type specific; 2) elicit higher sensitivity from tumor compared to corresponding normal tissue; 3) elicit differential sensitivity across breast cancer subtypes. If validated, our prediction could have relevance for preclinical drug testing and in phase I clinical design.
Biological tissues consist of heterogeneous populations of cells. Because gene expression patterns from bulk tissue samples reflect the contributions from all cells in the tissue, understanding the contribution of individual cell types to the overall gene expression in the tissue is fundamentally important. We recently developed a computational method, CDSeq, that can simultaneously estimate both sample-specific cell-type proportions and cell-type-specific gene expression profiles using only bulk RNA-Seq counts from multiple samples. Here we present an R implementation of CDSeq (CDSeqR) with significant performance improvement over the original implementation in MATLAB and an added new function to aid cell type annotation. The R package would be of interest to the broader R community. We developed a novel strategy to substantially improve computational efficiency in both speed and memory usage. In addition, we designed and implemented a new function for annotating the CDSeq-estimated cell types using single-cell RNA sequencing (scRNA-seq) data. This function allows users to readily interpret and visualize the CDSeq-estimated cell types. It also allows users to annotate CDSeq-estimated cell types using marker genes. We carried out additional validations of the CDSeqR software using synthetic data, real cell mixtures, and real bulk RNA-seq data from the Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. The existing bulk RNA-seq repositories, such as TCGA and GTEx, provide enormous resources for better understanding changes in transcriptomics and human diseases. They are also potentially useful for studying cell-cell interactions in the tissue microenvironment. Bulk-level analyses neglect tissue heterogeneity, however, and hinder investigation of cell-type-specific expression.
The CDSeqR package may aid in silico dissection of bulk expression data, enabling researchers to recover cell-type-specific information.
Uterine fibroids are common. Symptoms are debilitating for many, leading to high medical and societal costs. Indirect data suggest that compared with white women, African Americans develop fibroids at least 10 years earlier on average, and their higher health burden has been well documented.
The objective of the study was to directly measure fibroid incidence and growth in a large, community-based cohort of young African-American women.
This observational, community-based, prospective study enrolled 1693 African-American women, aged 23–35 years with no prior diagnosis of fibroids. Standardized transvaginal ultrasound examinations at enrollment and after approximately 18 months were conducted to identify and measure fibroids ≥0.5 cm in diameter. Fibroid growth (change in natural log volume per 18 months) was analyzed with mixed-model regression (n = 344 fibroids from 251 women whose baseline ultrasound revealed already existing fibroids).
Among the 1123 fibroid-free women with follow-up data (88% were followed up), incidence was 9.4% (95% confidence interval, 7.7–11.2) and increased with age (Ptrend < .0001), from 6% (confidence interval, 3–9) for 23–25 year olds to 13% (confidence interval, 9–17) for 32–35 year olds. The chance of any new fibroid development was more than twice as high for women with existing fibroids as for women who were fibroid free at baseline (age-adjusted relative risk = 2.3; confidence interval, 1.7–3.0). The uterine position of most incident fibroids (60%) was intramural corpus. Average fibroid growth was 89% per 18 months (confidence interval, 74–104%) but varied by baseline fibroid size (P < .0001). Fibroids ≥2 cm in diameter had average growth rates well under 100%. In contrast, small fibroids (<1 cm diameter) had an average growth rate of nearly 200% (188%; confidence interval, 145–238%). However, these small fibroids also had a high estimated rate of disappearance (23%).
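As a rough check on the overall incidence interval, a normal-approximation (Wald) confidence interval can be computed from the reported proportion and sample size. The abstract does not state which interval method the authors used, so the choice of the Wald formula is an assumption, and the result is expected to be close to, not identical with, the reported 7.7–11.2.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Reported values: 9.4% incidence among 1123 fibroid-free women.
low, high = wald_interval(0.094, 1123)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```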
This is the first study to directly measure age-specific fibroid incidence with a standardized ultrasound protocol and to measure fibroid growth in a large community-based sample. Findings indicate that very small fibroids are very dynamic in their growth, with rapid growth, but a high chance of loss. Larger fibroids grow more slowly. For example, a 2-cm fibroid is likely to take 4–5 years to double its diameter. Detailed data on fibroid incidence confirm an early onset in African-American women.
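The doubling-time statement can be reproduced approximately with a short calculation. The sketch assumes that volume scales with the cube of diameter (so doubling the diameter means an 8-fold volume increase) and that the percent volume growth per 18 months stays constant. The abstract does not say which growth rate underlies the 2-cm example; the 89% used below is the overall average and serves only as an illustration.

```python
import math


def years_to_double_diameter(volume_growth_per_interval, interval_years=1.5):
    """Years for the diameter to double at a constant volume growth rate.

    volume_growth_per_interval: fractional volume growth per interval
    (0.89 means +89% per 18 months). Assumes volume ~ diameter**3.
    """
    volume_factor = 2.0 ** 3  # doubling the diameter multiplies volume by 8
    intervals = math.log(volume_factor) / math.log(1.0 + volume_growth_per_interval)
    return intervals * interval_years


# Illustrative input: the overall average growth of 89% per 18 months.
print(round(years_to_double_diameter(0.89), 1))
```

Under these assumptions the result falls inside the 4–5 year range quoted above; slower growth rates give longer doubling times.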
Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor.
We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data.
Across the 33 tumor types, the median correlation between observed and predicted tumor purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity.
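The two evaluation metrics mentioned above, correlation between observed and predicted purity and root mean square error, can be computed as in the following sketch. The symbol ρ usually denotes Spearman's rank correlation, but the abstract does not name the statistic, so the use of Spearman here is an assumption. The purity values are made up for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(observed, predicted):
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


# Hypothetical purity values, not data from the study.
observed = [0.90, 0.55, 0.70, 0.30, 0.80]
predicted = [0.85, 0.60, 0.65, 0.40, 0.90]
print(spearman(observed, predicted), rmse(observed, predicted))
```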
Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.
Objective: Few studies of ADHD prevalence have used population-based samples, multiple informants, and Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) criteria. Moreover, children who are asymptomatic while receiving ADHD medication often have been misclassified. Therefore, we conducted a population-based study to estimate the prevalence of ADHD in elementary school children using DSM-IV criteria. Method: We screened 7,587 children for ADHD. Teachers of 81% of the children completed a DSM-IV checklist. We then interviewed parents using a structured interview (DISC). Of these, 72% participated. Parent and teacher ratings were combined to determine ADHD status. We also estimated the proportion of cases attributable to other conditions. Results: Overall, 15.5% of our sample met DSM-IV-TR (4th ed., text rev.) criteria for ADHD (95% CI 14.6%, 16.4%); 42% of cases reported no previous diagnosis. With additional information, other conditions explained 9% of cases. Conclusion: The prevalence of ADHD in this population-based sample was considerably higher than 3% to 7%. To compare study results, the DSM criteria need standardization.
Recent studies suggested that human/mammalian genomes are divided into large, discrete domains that are units of chromosome organization. CTCF, a CCCTC binding factor, has a diverse role in genome regulation including transcriptional regulation, chromosome-boundary insulation, DNA replication, and chromatin packaging. It remains unclear whether a subset of CTCF binding sites plays a functional role in establishing/maintaining chromatin topological domains.
We systematically analysed the genomic, transcriptomic and epigenetic profiles of the CTCF binding sites in 56 human cell lines from ENCODE. We identified ~24,000 CTCF sites (referred to as constitutive sites) that were bound in more than 90% of the cell lines. Our analysis revealed: 1) constitutive CTCF loci were located in constitutive open chromatin and often co-localized with constitutive cohesin loci; 2) most constitutive CTCF loci were distant from transcription start sites and lacked CpG islands but were enriched with the full-spectrum CTCF motifs: a recently reported 33/34-mer and two other potentially novel (22/26-mer); 3) more importantly, most constitutive CTCF loci were present in CTCF-mediated chromatin interactions detected by ChIA-PET and these pair-wise interactions occurred predominantly within, but not between, topological domains identified by Hi-C.
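The definition of a constitutive site (bound in more than 90% of the 56 cell lines) translates directly into a filter over per-site occupancy. The sketch below is an illustration with made-up site and cell-line names; the helper `constitutive_sites` is hypothetical, and how binding sites were matched across cell lines in the study is not specified here.

```python
def constitutive_sites(binding, n_cell_lines, min_fraction=0.9):
    """Return sites bound in more than min_fraction of the cell lines.

    binding: dict mapping a site identifier to the set of cell lines
    in which that site is bound.
    """
    return sorted(
        site
        for site, lines in binding.items()
        if len(lines) / n_cell_lines > min_fraction
    )


# Hypothetical occupancy data for 56 cell lines.
cell_lines = [f"line{i}" for i in range(56)]
binding = {
    "site_A": set(cell_lines),       # bound in 56/56
    "site_B": set(cell_lines[:51]),  # 51/56, about 91%
    "site_C": set(cell_lines[:50]),  # 50/56, about 89%
    "site_D": set(cell_lines[:5]),   # cell-type-specific
}
print(constitutive_sites(binding, n_cell_lines=56))
```

With 56 cell lines, the strict "more than 90%" cutoff corresponds to binding in at least 51 lines.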
Our results suggest that the constitutive CTCF sites may play a role in organizing/maintaining the recently identified topological domains that are common across most human cells.