We have developed a set of web-based SNP selection tools (freely available at http://www.niehs.nih.gov/snpinfo) where investigators can specify genes or linkage regions and select SNPs based on GWAS results, linkage disequilibrium (LD), and predicted functional characteristics of both coding and non-coding SNPs. The algorithm uses GWAS SNP P-value data and finds all SNPs in high LD with GWAS SNPs, so that selection is from a much larger set of SNPs than the GWAS itself. The program can also identify and choose tag SNPs for SNPs not in high LD with any GWAS SNP. We incorporate functional predictions of protein structure, gene regulation, splicing and miRNA binding, and consider whether the alternative alleles of a SNP are likely to have differential effects on function. Users can assign weights for different functional categories of SNPs to further tailor SNP selection. The program accounts for LD structure of different populations so that a GWAS from one ethnic group can be used to choose SNPs for one or more other ethnic groups. Finally, we provide an example using prostate cancer and demonstrate that this algorithm can select a small panel of SNPs that includes many of the recently validated prostate cancer SNPs.
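The LD-based expansion step described above can be illustrated with a minimal Python sketch. It is not the SNPinfo implementation: the function name `expand_by_ld`, the input layout, the SNP identifiers and the r² cutoff of 0.8 are all assumptions made for illustration. Each non-GWAS SNP in high LD with a GWAS SNP inherits the smallest P-value among its GWAS proxies, which enlarges the candidate set beyond the genotyped SNPs.

```python
def expand_by_ld(gwas_pvalues, ld_r2, r2_threshold=0.8):
    """Map each candidate SNP to (proxy GWAS SNP, GWAS P-value).

    gwas_pvalues: dict of GWAS SNP id -> P-value
    ld_r2:        dict of (snp_a, snp_b) -> pairwise r^2, treated as symmetric
    r2_threshold: cutoff for "high LD" (0.8 is an assumed, illustrative value)
    """
    # Every GWAS SNP is its own proxy.
    candidates = {snp: (snp, p) for snp, p in gwas_pvalues.items()}
    for (a, b), r2 in ld_r2.items():
        if r2 < r2_threshold:
            continue
        # Check the pair in both directions because r^2 is symmetric.
        for gwas_snp, other in ((a, b), (b, a)):
            if gwas_snp in gwas_pvalues and other not in gwas_pvalues:
                p = gwas_pvalues[gwas_snp]
                # Keep the most significant GWAS proxy for each non-GWAS SNP.
                if other not in candidates or p < candidates[other][1]:
                    candidates[other] = (gwas_snp, p)
    return candidates
```

SNPs that do not appear in the returned mapping are not in high LD with any GWAS SNP; in the workflow described above these would be covered by separately chosen tag SNPs.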
The Illumina HumanMethylation450 BeadChip has been extensively utilized in epigenome-wide association studies. This array and its successor, the MethylationEPIC array, use two types of probes, Infinium I (type I) and Infinium II (type II), to increase genome coverage, but differences in probe chemistries result in different distributions of methylation values for type I and type II probes. Ignoring the difference in distributions between the two probe types may bias downstream analysis.
Here, we developed a novel method, called Regression on Correlated Probes (RCP), which uses the existing correlation between pairs of nearby type I and II probes to adjust the beta values of all type II probes. We evaluate the effect of this adjustment on reducing probe design type bias, reducing technical variation in duplicate samples, improving accuracy of measurements against known standards, and retention of biological signal. We find that RCP is statistically significantly better than unadjusted data or adjustment with alternative methods including SWAN and BMIQ.
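The core idea of the adjustment can be sketched in a few lines of Python. This is a simplified illustration, not the ENmix implementation: it assumes a single sample, a logit transform of beta values, and an ordinary least-squares line fitted on nearby type I/type II probe pairs, and the function names are hypothetical. The fitted line is then applied to every type II probe so that its values are placed on the type I scale.

```python
import math

def logit(beta, eps=1e-6):
    """Map a beta value to the logit scale, clipping away from 0 and 1."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log(b / (1.0 - b))

def inv_logit(x):
    """Map a logit-scale value back to a beta value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def adjust_type2(paired_type1, paired_type2, all_type2):
    """Adjust type II beta values using nearby type I/II probe pairs.

    paired_type1, paired_type2: beta values of matched nearby probe pairs
    all_type2: beta values of every type II probe to be adjusted
    """
    # Regress type I on type II (logit scale) over the paired probes.
    x = [logit(b) for b in paired_type2]
    y = [logit(b) for b in paired_type1]
    intercept, slope = fit_line(x, y)
    # Apply the fitted relationship to all type II probes.
    return [inv_logit(intercept + slope * logit(b)) for b in all_type2]
```

Fitting on the logit scale is a design choice of this sketch: it keeps adjusted values inside (0, 1) after back-transformation.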
We incorporated the method into the R package ENmix, which is freely available from the Bioconductor website (https://www.bioconductor.org/packages/release/bioc/html/ENmix.html).
niulg@ucmail.uc.edu
Supplementary data are available at Bioinformatics online.
The Illumina HumanMethylation450 BeadChip is increasingly utilized in epigenome-wide association studies; however, this array-based measurement of DNA methylation is subject to measurement variation. Appropriate data preprocessing to remove background noise is important for detecting the small changes that may be associated with disease. We developed a novel background correction method, ENmix, that uses a mixture of exponential and truncated normal distributions to flexibly model signal intensity and uses a truncated normal distribution to model background noise. Depending on data availability, we employ three approaches to estimate background normal distribution parameters using (i) internal chip negative controls, (ii) out-of-band Infinium I probe intensities or (iii) combined methylated and unmethylated intensities. We evaluate ENmix against other available methods for both reproducibility among duplicate samples and accuracy of methylation measurement among laboratory control samples. ENmix outperformed other background correction methods on both of these measures and substantially reduced the probe-design type bias between Infinium I and II probes. In a reanalysis of existing EWAS data, we show that ENmix can identify additional CpGs and results in smaller P-value estimates for previously validated CpGs. We incorporated the method into the R package ENmix, which is freely available from the Bioconductor website.
Abstract
Background
Peripheral blood DNA methylation may be associated with breast cancer, but studies of candidate genes and global and genome-wide DNA methylation have been inconsistent.
Methods
We performed an epigenome-wide study using Infinium HumanMethylation450 BeadChips with prospectively collected blood DNA samples from the Sister Study (1552 cases, 1224 subcohort). Differentially methylated cytosine-phosphate-guanine sites (dmCpGs) were identified using case-cohort proportional hazard models and replicated using deposited data from European Prospective Investigation into Cancer and Nutrition in Italy (EPIC-Italy) (n = 329). The correlation between methylation and time to diagnosis was examined using robust linear regression. Causal or consequential relationships of methylation to breast cancer were examined by Mendelian randomization using OncoArray 500 K single-nucleotide polymorphism data. All statistical tests were two-sided.
Results
We identified 9601 CpG markers associated with invasive breast cancer (false discovery rate q < 0.01), with 510 meeting a strict Bonferroni correction threshold (10⁻⁷). A total of 2095 of these CpGs replicated in the independent EPIC-Italy dataset, including 144 meeting the Bonferroni threshold. Sister Study women who developed ductal carcinoma in situ had methylation similar to noncases. Most (1501, 71.6%) dmCpGs showed lower methylation in invasive cases. In case-only analysis, methylation was statistically significantly associated (false discovery rate q < 0.05) with time to diagnosis for 892 (42.6%) of the dmCpGs. Analyses based on genetic association suggest that methylation differences are likely a consequence rather than a cause of breast cancer. Pathway analysis shows enrichment of breast cancer-related gene pathways, and dmCpGs are overrepresented in known breast cancer susceptibility genes.
Conclusions
Our findings suggest that the DNA methylation profile of blood starts to change in response to invasive breast cancer years before the tumor is clinically detected.
Illumina BeadChips are widely utilized in epigenome-wide association studies (EWAS). Several studies have reported that many probes on these arrays have poor reliability. Here, we compare different pre-processing methods to improve intra-class correlation coefficients (ICC). We describe the characteristics of ICC across the genome, within and between studies, and across different array platforms. Using technical duplicates from 128 subjects, we find that with raw data only 22.5% of the CpGs on the 450 K array have 'acceptable' ICCs (>0.5). Data preprocessing steps, such as background correction and dye bias correction, can reduce technical noise and improve the percentage to 38.5%. Similar to previous studies, we found that ICC is associated with CpG methylation level such that 83% of CpGs with intermediate methylation (0.1 < beta-value < 0.9) have acceptable ICCs, whereas only 21% of CpGs with low or high methylation (beta-value <0.1 or >0.9) have acceptable ICCs. ICC is also correlated with CpG methylation variance; after mutual adjustment for beta-value and variance, only variance remains correlated. Many CpGs with poor ICCs (<0.5) are located in biologically important regulatory regions, including gene promoters and CpG islands. Poor ICC at these sites appears to be a consequence of low biologic variation among individuals rather than increased technical measurement variation. ICC quality classifications are highly concordant across different array platforms and across different studies. We find that ICC can be reliably estimated with 30 pairs of duplicate samples. CpGs with acceptable ICC have higher study power and are more commonly reported in published epigenome-wide studies.
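To make the ICC computation concrete, the following Python sketch estimates an ICC for one CpG from technical duplicate pairs. The abstract does not state which ICC form was used; the one-way random-effects estimator below, ICC = (MSB − MSW) / (MSB + (k − 1)·MSW) with k = 2 measurements per subject, is an assumption, and the function name is hypothetical.

```python
def icc_from_duplicates(pairs):
    """One-way random-effects ICC for one CpG measured twice per subject.

    pairs: list of (beta_1, beta_2) tuples, one tuple per subject.
    Assumes at least two subjects and some variation in the data.
    """
    n = len(pairs)
    k = 2  # two technical replicates per subject
    subject_means = [(a + b) / 2.0 for a, b in pairs]
    grand_mean = sum(subject_means) / n
    # Between-subject mean square: biological variation plus noise.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: technical variation between duplicates.
    # For k = 2 the within-subject sum of squares is (a - b)^2 / 2.
    msw = sum((a - b) ** 2 / 2.0 for a, b in pairs) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Under this estimator, a CpG whose values barely differ among subjects yields a low ICC even when the duplicate measurements are close in absolute terms, which is consistent with the observation above that poor ICC can reflect low biologic variation rather than excess technical noise.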
Diet and host phylogeny drive the taxonomic and functional contents of the gut microbiome in mammals, yet it is unknown whether these patterns hold across all vertebrate lineages. Here, we assessed gut microbiomes from ∼900 vertebrate species, including 315 mammals and 491 birds, evaluating the contributions of diet, phylogeny, and physiology to structuring gut microbiomes. In most nonflying mammals, strong correlations exist between microbial community similarity, host diet, and host phylogenetic distance up to the host order level. In birds, by contrast, gut microbiomes are only very weakly correlated to diet or host phylogeny. Furthermore, while most microbes resident in mammalian guts are present in only a restricted taxonomic range of hosts, most microbes recovered from birds show little evidence of host specificity. Notably, among the mammals, bats host especially bird-like gut microbiomes, with little evidence for correlation to host diet or phylogeny. This suggests that host-gut microbiome phylosymbiosis depends on factors convergently absent in birds and bats, potentially associated with physiological adaptations to flight. Our findings expose major variations in the behavior of these important symbioses in endothermic vertebrates and may signal fundamental evolutionary shifts in the cost/benefit framework of the gut microbiome.
In this comprehensive survey of microbiomes of >900 species, including 315 mammals and 491 birds, we find a striking convergence of the microbiomes of birds and animals that fly. In nonflying mammals, diet and short-term evolutionary relatedness drive the microbiome, and many microbial species are specific to a particular kind of mammal, but flying mammals and birds break this pattern with many microbes shared across different species, with little correlation either with diet or with relatedness of the hosts. This finding suggests that adaptation to flight breaks long-held relationships between hosts and their microbes.
Summary
ipDMR is an R software tool for identification of differentially methylated regions (DMRs) using auto-correlated P-values for individual CpGs from epigenome-wide association analysis using array or bisulfite sequencing data. It summarizes P-values for adjacent CpGs, identifies association peaks and then extends peaks to find boundaries of DMRs. ipDMR uses BED format files as input and is easy to use. Simulations guided by real data found that ipDMR outperformed currently available methods and provided slightly higher true positive rates and much lower false discovery rates.
Availability and implementation
ipDMR is available at https://bioconductor.org/packages/release/bioc/html/ENmix.html.
Supplementary information
Supplementary data are available at Bioinformatics online.
Epigenetic marks are extensively altered in cancer but may also change in normal tissues with age, which is the primary risk factor for most cancers. We conducted an epigenome-wide study to identify age-related methylation sites and examine their relationship to cancer and other underlying epigenetic marks. We analyzed 1006 blood DNA samples of women aged 35-76 years from the Sister Study and found that 7694 (28%) of the 27 578 CpGs assayed were associated with age (false discovery rate, q < 0.05). Using independent data sets, we confirmed 749 'high-confidence' age-related CpG (arCpG) sites in normal blood. Based on The Cancer Genome Atlas data, we show that these age-related changes are largely concordant in a broad variety of normal tissues and that a significantly higher (71-91%, P < 10⁻⁷⁴) than expected proportion of increasingly methylated arCpGs (IM-arCpGs) were overmethylated in a wide variety of tumor types. IM-arCpG sites occurred almost exclusively at CpG islands and were disproportionately marked with the repressive H3K27me3 histone modification (P < 1 × 10⁻⁵⁰). Genes containing these IM-arCpG sites were highly enriched for developmental and signaling pathways (P < 10⁻¹⁰). Our findings suggest that as cells acquire methylation at age-related sites, they have a lower threshold for malignant transformation that may explain in part the increase in cancer incidence with age.
Illumina DNA methylation arrays are high-throughput platforms for cost-effective genome-wide profiling of individual CpGs. Experimental and technical factors introduce appreciable measurement variation, some of which can be mitigated by careful "preprocessing" of raw data.
Here we describe the ENmix preprocessing pipeline and compare it to a set of seven published alternative pipelines (ChAMP, Illumina, SWAN, Funnorm, Noob, wateRmelon, and RnBeads). We use two large sets of duplicate sample measurements with 450 K and EPIC arrays, along with mixtures of isogenic methylated and unmethylated cell line DNA to compare raw data and that preprocessed via different pipelines.
Our evaluations show that the ENmix pipeline performs the best with significantly higher correlation and lower absolute difference between duplicate pairs, higher intraclass correlation coefficients (ICC) and smaller deviations from expected methylation level in mixture experiments. In addition to the pipeline function, ENmix software provides an integrated set of functions for reading in raw data files from mouse and human arrays, quality control, data preprocessing, visualization, detection of differentially methylated regions (DMRs), estimation of cell type proportions, and calculation of methylation age clocks. ENmix is computationally efficient, flexible and allows parallel computing. To facilitate further evaluations, we make all datasets and evaluation code publicly available.
Careful selection of robust data preprocessing methods is critical for DNA methylation array studies. ENmix outperformed other pipelines in our evaluations to minimize experimental variation and to improve data quality and study power.
DNA methylation-based predictors of various biological metrics have been widely published and are becoming valuable tools in epidemiologic studies of epigenetics and personalized medicine. However, generating these predictors from original source software and web servers is complex and time consuming. Furthermore, different predictors were often derived based on data from different types of arrays, where array differences and batch effects can make predictors difficult to compare across studies.
We integrate these published methods into a single R function to produce 158 previously published predictors for chronological age, biological age, exposures, lifestyle traits and serum protein levels using both classical and principal component-based methods. To mitigate batch and array differences, we also provide a modified RCP method (ref-RCP) that normalizes input DNA methylation data to reference data prior to estimation. Evaluations in real datasets show that this approach improves estimate precision and comparability across studies.
The function is included in the software package ENmix, which is freely available from the Bioconductor website (https://www.bioconductor.org/packages/release/bioc/html/ENmix.html).