Estimating cell type composition of blood and tissue samples is a biological challenge relevant in both laboratory studies and clinical care. In recent years, a number of computational tools have been developed to estimate cell type abundance using gene expression data. Although these tools use a variety of approaches, they all leverage expression profiles from purified cell types to evaluate the cell type composition within samples. In this study, we compare 12 cell type quantification tools and evaluate their performance while using each of 10 separate reference profiles. Specifically, we have run each tool on over 4000 samples with known cell type proportions, spanning both immune and stromal cell types. A total of 12 of these represent in vitro synthetic mixtures and 300 represent in silico synthetic mixtures prepared using single-cell data. A final 3728 clinical samples have been collected from the Framingham cohort, for which cell populations have been quantified using electrical impedance cell counting. When tools are applied to the Framingham dataset, the tool Estimating the Proportions of Immune and Cancer cells (EPIC) produces the highest correlation, whereas Gene Expression Deconvolution Interactive Tool (GEDIT) produces the lowest error. The best tool for other datasets varies, but CIBERSORT and GEDIT most consistently produce accurate results. We find that the optimal reference depends on the tool used, and report suggested references to be used with each tool. Most tools return results within minutes, but on large datasets runtimes for CIBERSORT can exceed hours or even days. We conclude that deconvolution methods are capable of returning high-quality results, but that proper reference selection is critical.
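The abstract above ranks tools by two criteria that need not agree: correlation with the known proportions and error against them. The following minimal sketch illustrates why one tool can win on correlation while another wins on error. The fractions, the names `tool_a` and `tool_b`, and the choice of mean absolute error as the error metric are all assumptions made for illustration; none of them is taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_abs_error(x, y):
    """Mean absolute difference between predicted and known fractions."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical known fractions of one cell type across four samples.
truth = [0.10, 0.20, 0.30, 0.40]
# Hypothetical tool A: tracks the trend exactly but is shifted upward.
tool_a = [0.30, 0.40, 0.50, 0.60]
# Hypothetical tool B: slightly noisy but close in absolute terms.
tool_b = [0.12, 0.17, 0.33, 0.38]

for name, pred in (("tool_a", tool_a), ("tool_b", tool_b)):
    print(name, round(pearson(truth, pred), 3), round(mean_abs_error(truth, pred), 3))
```

In this toy case the shifted predictions correlate better with the truth, whereas the noisy ones have the smaller error, so the ranking depends on which metric is reported.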
5-Hydroxymethylcytosine (5-hmC) is an epigenetic marker of open chromatin and active gene expression. We profiled 5-hmC with Nano-hmC-Seal technology using 10 ng of plasma-derived cell-free DNA (cfDNA) in blood samples from patients with neuroblastoma to determine its utility as a biomarker.
For the Discovery cohort, 100 5-hmC profiles were generated from 34 well children and 32 patients (27 high-risk, 2 intermediate-risk, and 3 low-risk) at various time points during the course of their disease. An independent Validation cohort encompassed 5-hmC cfDNA profiles (n = 29) generated from 21 patients (20 high-risk and 1 intermediate-risk). Metastatic burden was classified as high, moderate, low, or none per Curie metaiodobenzylguanidine scores and percentage of tumor cells in bone marrow. Genes with differential 5-hmC levels between samples according to metastatic burden were identified using DESeq2.
Hierarchical clustering using 5-hmC levels of 347 genes identified from the Discovery cohort defined four clusters of samples that were confirmed in the Validation cohort and corresponded to high, high-moderate, moderate, and low/no metastatic burden. Samples from patients with increased metastatic burden had increased 5-hmC deposition on genes in neuronal stem cell maintenance and epigenetic regulatory pathways. Further, 5-hmC cfDNA profiles generated with 1,242 neuronal pathway genes were associated with subsequent relapse in the cluster of patients with predominantly low or no metastatic burden (sensitivity 65%, specificity 75.6%).
cfDNA 5-hmC profiles in children with neuroblastoma correlate with metastatic burden and warrant development as a biomarker of treatment response and outcome.
In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck in eQTL studies. In this paper, we propose an efficient approach for correcting for multiple testing and assessing eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset.
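The general idea of replacing permutations with a multivariate normal (MVN) null can be sketched as follows. This is a plain Monte Carlo illustration and not the authors' algorithm: it omits their small-sample correction, and the function names, the toy linkage-disequilibrium (LD) matrix, and the number of draws are all assumptions. Under the null, the per-variant z statistics are drawn from an MVN whose correlation matrix is the LD matrix, and the gene-level p value is the fraction of draws whose largest |z| reaches the observed one. The cost depends on the number of variants and draws, not on the number of individuals.

```python
import math
import random
from statistics import NormalDist

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def gene_level_p(min_p, ld, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(min p value <= min_p) under an MVN null."""
    z_obs = NormalDist().inv_cdf(1.0 - min_p / 2.0)  # two-sided p -> |z|
    rng = random.Random(seed)
    low = cholesky(ld)
    n = len(ld)
    hits = 0
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlated null statistics: z = L e has covariance L L^T = ld.
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        if max(abs(v) for v in z) >= z_obs:
            hits += 1
    return (hits + 1) / (n_draws + 1)

# Hypothetical LD (correlation) matrix for three cis variants.
ld = [[1.0, 0.9, 0.9],
      [0.9, 1.0, 0.9],
      [0.9, 0.9, 1.0]]
print(gene_level_p(0.01, ld))
```

With strongly correlated variants the corrected p value is smaller than it would be for independent tests, which is the reason the LD structure has to enter the correction.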
Large-scale high throughput studies using microarray technology have established that copy number variation (CNV) throughout the genome is more frequent than previously thought. Such variation is known to play an important role in the presence and development of phenotypes such as HIV-1 infection and Alzheimer's disease. However, methods for analyzing the complex data produced and identifying regions of CNV are still being refined.
We describe the presence of a genome-wide technical artifact, spatial autocorrelation or 'wave', which occurs in a large dataset used to determine the location of CNV across the genome. By removing this artifact we are able to obtain both a more biologically meaningful clustering of the data and an increase in the number of CNVs identified by current calling methods without a major increase in the number of false positives detected. Moreover, removing this artifact is critical for the development of a novel model-based CNV calling algorithm - CNVmix - that uses cross-sample information to identify regions of the genome where CNVs occur. For regions of CNV that are identified by both CNVmix and current methods, we demonstrate that CNVmix is better able to categorize samples into groups that represent copy number gains or losses.
Removing artifactual 'waves' (which appear to be a general feature of array comparative genomic hybridization (aCGH) datasets) and using cross-sample information when identifying CNVs enables more biological information to be extracted from aCGH experiments designed to investigate copy number variation in normal individuals.
Disease-associated loci identified through genome-wide association studies (GWAS) frequently localize to non-coding sequence. We and others have demonstrated strong enrichment of such single nucleotide polymorphisms (SNPs) for expression quantitative trait loci (eQTLs), supporting an important role for regulatory genetic variation in complex disease pathogenesis. Herein we describe our initial efforts to develop a predictive model of disease-associated variants leveraging eQTL information. We first catalogued cis-acting eQTLs (SNPs within 100 kb of target gene transcripts) by meta-analyzing four studies of three blood-derived tissues (n = 586). At a false discovery rate < 5%, we mapped eQTLs for 6,535 genes; these were enriched for disease-associated genes (P < 10^-4), particularly those related to immune diseases and metabolic traits. Based on eQTL information and other variant annotations (distance from target gene transcript, minor allele frequency, and chromatin state), we created multivariate logistic regression models to predict SNP membership in reported GWAS. The complete model revealed independent contributions of specific annotations as strong predictors, including evidence for an eQTL (odds ratio (OR) = 1.2-2.0, P < 10^-11) and the chromatin states of active promoters, different classes of strong or weak enhancers, or transcriptionally active regions (OR = 1.5-2.3, P < 10^-11). This complete prediction model including eQTL association information ultimately allowed for better discrimination of SNPs with higher probabilities of GWAS membership (6.3-10.0%, compared to 3.5% for a random SNP) than the other two models excluding eQTL information. This eQTL-based prediction model of disease relevance can help systematically prioritize non-coding GWAS SNPs for further functional characterization.
Nucleotide variation in eight effectively unlinked genes was surveyed in species-wide samples of the closely related outbreeding species Arabidopsis halleri and A. lyrata ssp. petraea and in three of these genes in A. lyrata ssp. lyrata and A. thaliana. Significant genetic differentiation was observed more frequently in A. l. petraea than in A. halleri. Average estimates of nucleotide variation were highest in A. l. petraea and lowest in A. l. lyrata, reflecting differences among species in effective population size. The low level of variation in A. l. lyrata is concordant with a bottleneck effect associated with its origin. The A. halleri/A. l. petraea speciation process was studied, considering the orthologous sequences of an outgroup species (A. thaliana). The high number of ancestral mutations relative to exclusive polymorphisms detected in A. halleri and A. l. petraea, the significant results of the multilocus Fay and Wu H tests, and haplotype sharing between the species indicate introgression subsequent to speciation. Average among-population variation in A. halleri and A. l. petraea was approximately 1.5- and 3-fold higher than that in the inbreeder A. thaliana. The detected reduction of variation in A. thaliana is less than that expected from differences in mating system alone, and therefore from selective processes related to differences in the effective recombination rate, but could be explained by differences in population structure.
To investigate genetic predispositions for MYCN-amplified neuroblastoma, we performed a meta-analysis of three genome-wide association studies totaling 615 MYCN-amplified high-risk neuroblastoma cases and 1869 MYCN-nonamplified non-high-risk neuroblastoma cases as controls using a fixed-effects model with inverse variance weighting. All statistical tests were two-sided. We identified a novel locus at 3p21.31 indexed by the single nucleotide polymorphism (SNP) rs80059929 (odds ratio [OR] = 2.95, 95% confidence interval [CI] = 2.17 to 4.02, Pmeta = 6.47 × 10^-12) associated with MYCN-amplified neuroblastoma, which was replicated in 127 MYCN-amplified cases and 254 non-high-risk controls (OR = 2.30, 95% CI = 1.12 to 4.69, Preplication = .02). To confirm this signal is exclusive to MYCN-amplified tumors, we performed a second meta-analysis comparing 728 MYCN-nonamplified high-risk patients to identical controls. rs80059929 was not statistically significant in MYCN-nonamplified high-risk patients (OR = 1.24, 95% CI = 0.90 to 1.71, Pmeta = .19). SNP rs80059929 is within intron 16 in the KIF15 gene. Additionally, the previously reported LMO1 neuroblastoma risk locus was statistically significant only in patients with MYCN-nonamplified high-risk tumors (OR = 0.63, 95% CI = 0.53 to 0.75, Pmeta = 1.51 × 10^-8; Pmeta = .95). Our results indicate that common genetic variation predisposes to different neuroblastoma genotypes, including the likelihood of somatic MYCN-amplification.
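A fixed-effects meta-analysis with inverse variance weighting, as named above, pools per-study log odds ratios using weights equal to the reciprocal of each study's squared standard error. The sketch below shows the standard calculation with a two-sided normal p value. The per-study estimates are hypothetical placeholders, not values from the three studies, and the function name `fixed_effect_meta` is illustrative.

```python
import math

def fixed_effect_meta(log_ors, ses):
    """Pool per-study log odds ratios with inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, log_ors)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
    return {"or": math.exp(beta), "ci": ci, "z": z, "p": p}

# Hypothetical per-study odds ratios and standard errors of the log OR.
log_ors = [math.log(2.8), math.log(3.2), math.log(2.9)]
ses = [0.30, 0.35, 0.25]
result = fixed_effect_meta(log_ors, ses)
print(result)
```

When only an odds ratio and its 95% confidence interval are reported, the standard error of the log OR can be approximated as (ln(upper) - ln(lower)) / (2 × 1.96) before pooling.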
The importance of genetic ancestry characterization is increasing in genomic implementation efforts, and clinical pharmacogenomic guidelines are being published that include population-specific recommendations. Our aim was to test the ability of focused clinical pharmacogenomic SNP panels to estimate individual genetic ancestry (IGA) and implement population-specific pharmacogenomic clinical decision-support (CDS) tools. Principal components and STRUCTURE were utilized to assess differences in genetic composition and estimate IGA among 1572 individuals from 1000 Genomes, two independent cohorts of Caucasians and African Americans (AAs), plus a real-world validation population of patients undergoing pharmacogenomic genotyping. We found that clinical pharmacogenomic SNP panels accurately estimate IGA compared to genome-wide genotyping and identify AAs with ≥70% African ancestry (sensitivity >82%, specificity >80%, PPV >95%, NPV >47%). We also validated a new AA-specific warfarin dosing algorithm for patients with ≥70% African ancestry and implemented it at our institution as a novel CDS tool. Consideration of IGA to develop an institutional CDS tool was accomplished to enable population-specific pharmacogenomic guidance at the point-of-care. These capabilities were immediately applied for guidance of warfarin dosing in AAs versus Caucasians, but also provide a real-world model that can be extended to other populations and drugs as actionable genomic evidence accumulates.
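The four accuracy measures quoted above follow from a 2×2 table that compares a panel-based call against a reference call at the same ancestry threshold. The sketch below shows that bookkeeping. The ancestry fractions are invented for illustration, the 0.70 threshold mirrors the ≥70% cutoff in the text, and treating the genome-wide estimate as the reference standard is an assumption about how the comparison was set up.

```python
def diagnostic_metrics(predicted, reference):
    """Sensitivity, specificity, PPV and NPV from paired boolean calls."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(not p and r for p, r in zip(predicted, reference))
    tn = sum(not p and not r for p, r in zip(predicted, reference))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

THRESHOLD = 0.70  # African ancestry fraction cutoff used in the text

# Hypothetical ancestry fractions for eight individuals.
panel = [0.92, 0.81, 0.65, 0.74, 0.30, 0.72, 0.55, 0.88]        # SNP panel
genome_wide = [0.90, 0.78, 0.72, 0.69, 0.25, 0.80, 0.50, 0.91]  # reference

metrics = diagnostic_metrics(
    [x >= THRESHOLD for x in panel],
    [x >= THRESHOLD for x in genome_wide],
)
print(metrics)
```

PPV and NPV depend on how common high African ancestry is in the tested group, so a high PPV alongside a modest NPV, as reported above, is compatible with good sensitivity and specificity.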
Germline copy number variants (CNVs) and single-nucleotide polymorphisms (SNPs) form the basis of inter-individual genetic variation. Although the phenotypic effects of SNPs have been extensively investigated, the effects of CNVs are relatively less understood. To better characterize mechanisms by which CNVs affect cellular phenotype, we tested their association with variable CpG methylation in a genome-wide manner. Using paired CNV and methylation data from the 1000 Genomes and HapMap projects, we identified genome-wide associations by methylation quantitative trait locus (mQTL) analysis. We found individual CNVs being associated with methylation of multiple CpGs and vice versa. CNV-associated methylation changes were correlated with gene expression. CNV-mQTLs were enriched for regulatory regions, transcription factor-binding sites (TFBSs), and were involved in long-range physical interactions with associated CpGs. Some CNV-mQTLs were associated with methylation of imprinted genes. Several CNV-mQTLs and/or associated genes were among those previously reported by genome-wide association studies (GWASs). We demonstrate that germline CNVs in the genome are associated with CpG methylation. Our findings suggest that structural variation together with methylation may affect cellular phenotype.