Epigenome-wide association studies of human disease and other quantitative traits are becoming increasingly common. A series of papers reporting age-related changes in DNA methylation profiles in peripheral blood have already been published. However, blood is a heterogeneous collection of different cell types, each with a very different DNA methylation profile.
Using a statistical method that permits estimating the relative proportion of cell types from DNA methylation profiles, we examine data from five previously published studies, and find strong evidence of cell composition change across age in blood. We also demonstrate that, in these studies, cellular composition explains much of the observed variability in DNA methylation. Furthermore, we find high levels of confounding between age-related variability and cellular composition at the CpG level.
Our findings underscore the importance of considering cell composition variability in epigenetic studies based on whole blood and other heterogeneous tissue sources. We also provide software for estimating and exploring this composition confounding for the Illumina 450k microarray.
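The idea of estimating cell-type proportions from a DNA methylation profile can be illustrated with a small sketch. It assumes a reference panel of per-cell-type methylation values at a few CpGs and finds non-negative weights that best reproduce a mixed sample by projected gradient descent. This is not the statistical method used in the study; the function name, the reference values and the proportions are all hypothetical.

```python
# Illustrative only: the reference profiles and proportions are made up.

def estimate_proportions(reference, mixed, n_iter=5000):
    """Find non-negative weights w minimising ||reference @ w - mixed||^2.

    reference: list of rows, one per CpG, each holding one methylation
               value per cell type (hypothetical reference panel).
    mixed:     methylation values measured in the mixed sample.
    """
    n_cpg, n_types = len(reference), len(reference[0])
    # 1 / trace(X^T X) never exceeds 1 / (largest eigenvalue of X^T X),
    # so the gradient step below cannot diverge.
    step = 1.0 / sum(v * v for row in reference for v in row)
    w = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual X w - y for the current weights.
        resid = [sum(row[k] * w[k] for k in range(n_types)) - mixed[i]
                 for i, row in enumerate(reference)]
        # Gradient of 0.5 * ||resid||^2 is X^T resid; clip at zero after the step.
        for k in range(n_types):
            grad_k = sum(reference[i][k] * resid[i] for i in range(n_cpg))
            w[k] = max(0.0, w[k] - step * grad_k)
    return w


# Six hypothetical CpGs, three hypothetical cell types.
reference = [
    [0.90, 0.10, 0.20],
    [0.10, 0.80, 0.30],
    [0.20, 0.20, 0.90],
    [0.80, 0.70, 0.10],
    [0.50, 0.10, 0.60],
    [0.05, 0.90, 0.85],
]
true_w = [0.6, 0.3, 0.1]
mixed = [sum(r * t for r, t in zip(row, true_w)) for row in reference]
print([round(v, 3) for v in estimate_proportions(reference, mixed)])
```

With noise-free input the weights should come back close to the proportions used to build the mixture; real data would need many more CpGs and a validated reference panel.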
Motivation: The recently released Infinium HumanMethylation450 array (the ‘450k’ array) provides a high-throughput assay to quantify DNA methylation (DNAm) at ∼450 000 loci across a range of genomic features. Although less comprehensive than high-throughput sequencing-based techniques, this product is more cost-effective and promises to be the most widely used DNAm high-throughput measurement technology over the next several years.
Results: Here we describe a suite of computational tools that incorporate state-of-the-art statistical techniques for the analysis of DNAm data. The software is structured to easily adapt to future versions of the technology. We include methods for preprocessing, quality assessment and detection of differentially methylated regions from the kilobase to the megabase scale. We show how our software provides a powerful and flexible development platform for future methods. We also illustrate how our methods empower the technology to make discoveries previously thought to be possible only with sequencing-based methods.
Availability and implementation:
http://bioconductor.org/packages/release/bioc/html/minfi.html.
Contact:
khansen@jhsph.edu; rafa@jimmy.harvard.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
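As background to the array-based DNAm quantification described above, the following minimal sketch shows two summaries commonly computed from methylated and unmethylated probe intensities: the beta value and the M-value. It is not taken from the minfi package; the function names, the offset of 100 and the pseudo-count of 1 are assumptions chosen for illustration.

```python
import math


def beta_value(meth, unmeth, offset=100.0):
    """Fraction methylated; the offset (assumed here) stabilises low intensities."""
    return meth / (meth + unmeth + offset)


def m_value(meth, unmeth, pseudo=1.0):
    """Log2 ratio of methylated to unmethylated intensity with an assumed pseudo-count."""
    return math.log2((meth + pseudo) / (unmeth + pseudo))


# Hypothetical intensities for one probe.
print(beta_value(900.0, 0.0))
print(m_value(7.0, 0.0))
```

Beta values are bounded between 0 and 1 and are easy to interpret, whereas the log-ratio scale is unbounded and often preferred for statistical testing.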
Abstract
We apply two Bayesian hierarchical inference schemes to infer shear power spectra, shear maps and cosmological parameters from the Canada–France–Hawaii Telescope (CFHTLenS) weak lensing survey, the first application of this method to data. In the first approach, we sample the joint posterior distribution of the shear maps and power spectra by Gibbs sampling, with minimal model assumptions. In the second approach, we sample the joint posterior of the shear maps and cosmological parameters, providing a new, accurate and principled approach to cosmological parameter inference from cosmic shear data. As a first demonstration on data, we perform a two-bin tomographic analysis to constrain cosmological parameters and investigate the possibility of photometric redshift bias in the CFHTLenS data. Under the baseline ΛCDM (Λ cold dark matter) model, we constrain $S_8 = \sigma_8(\Omega_\mathrm{m}/0.3)^{0.5} = 0.67^{+0.03}_{-0.03}$ (68 per cent), consistent with previous CFHTLenS analyses but in tension with Planck. Adding neutrino mass as a free parameter, we are able to constrain $\sum m_\nu < 4.6$ eV (95 per cent) using CFHTLenS data alone. Including a linear redshift-dependent photo-z bias $\Delta z = p_2(z - p_1)$, we find $p_1 = -0.25^{+0.53}_{-0.60}$ and $p_2 = -0.15^{+0.17}_{-0.15}$, and tension with Planck is only alleviated under very conservative prior assumptions. Neither the non-minimal neutrino mass nor photo-z bias models are significantly preferred by the CFHTLenS (two-bin tomography) data.
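The two definitions quoted in this abstract, S_8 = σ_8(Ω_m/0.3)^0.5 and the linear photo-z bias Δz = p_2(z − p_1), can be evaluated directly. The sketch below is only a worked illustration of those formulas; the function names and the input values for σ_8, Ω_m and z are hypothetical and are not results from the analysis.

```python
def s8(sigma8, omega_m):
    """S_8 = sigma_8 * (Omega_m / 0.3) ** 0.5, as defined in the abstract."""
    return sigma8 * (omega_m / 0.3) ** 0.5


def photoz_shift(z, p1, p2):
    """Linear redshift-dependent photo-z bias, Delta z = p2 * (z - p1)."""
    return p2 * (z - p1)


# Hypothetical inputs: sigma_8 = 0.8 and Omega_m = 0.3 give S_8 = sigma_8.
print(s8(0.8, 0.3))
# Shift at z = 1 using the central values of p1 and p2 quoted above.
print(photoz_shift(1.0, -0.25, -0.15))
```

Because the bias is linear in z, p_1 is the redshift at which the shift vanishes and p_2 sets how quickly it grows away from that point.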
Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The best-known sources of latent variation in genomic experiments are batch effects, which arise when samples are processed on different days, in different groups or by different people. However, there are also a large number of other variables that may have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function.
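To make the notion of adjusting for a known batch effect concrete, here is a deliberately simplified sketch that removes batch-specific mean shifts from a single feature. It is not the ComBat algorithm, which uses an empirical Bayes model of both location and scale; the function name and the data are hypothetical.

```python
from collections import defaultdict


def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples, then restore the overall mean."""
    overall = sum(values) / len(values)
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in grouped.items()}
    return [v - batch_mean[b] + overall for v, b in zip(values, batches)]


# One hypothetical feature measured in two batches with a large offset.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["a", "a", "a", "b", "b", "b"]
print(center_by_batch(values, batches))
```

A caveat of this naive approach is that it also removes any biological signal that happens to be confounded with batch, which is one reason model-based adjustments are generally preferred.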
DNA methylation (DNAm) is important in brain development and is potentially important in schizophrenia. We characterized DNAm in prefrontal cortex from 335 non-psychiatric controls across the lifespan and 191 patients with schizophrenia and identified widespread changes in the transition from prenatal to postnatal life. These DNAm changes manifest in the transcriptome, correlate strongly with a shifting cellular landscape and overlap regions of genetic risk for schizophrenia. A quarter of published genome-wide association studies (GWAS)-suggestive loci (4,208 of 15,930, P < 10^-100) manifest as significant methylation quantitative trait loci (meQTLs), including 59.6% of GWAS-positive schizophrenia loci. We identified 2,104 CpGs that differ between schizophrenia patients and controls that were enriched for genes related to development and neurodifferentiation. The schizophrenia-associated CpGs strongly correlate with changes related to the prenatal-postnatal transition and show slight enrichment for GWAS risk loci while not corresponding to CpGs differentiating adolescence from later adult life. These data implicate an epigenetic component to the developmental origins of this disorder.
Abstract
Motivation
Most genetic variants implicated in complex diseases by genome-wide association studies (GWAS) are non-coding, making it challenging to understand the causative genes involved in disease. Integrating external information such as quantitative trait locus (QTL) mapping of molecular traits (e.g. expression, methylation) is a powerful approach to identify the subset of GWAS signals explained by regulatory effects. In particular, expression QTLs (eQTLs) help pinpoint the responsible gene among the GWAS regions that harbor many genes, while methylation QTLs (mQTLs) help identify the epigenetic mechanisms that impact gene expression which in turn affect disease risk. In this work, we propose multiple-trait-coloc (moloc), a Bayesian statistical framework that integrates GWAS summary data with multiple molecular QTL data to identify regulatory effects at GWAS risk loci.
Results
We applied moloc to schizophrenia (SCZ) and eQTL/mQTL data derived from human brain tissue and identified 52 candidate genes that influence SCZ through methylation. Our method can be applied to any GWAS and relevant functional data to help prioritize disease associated genes.
Availability and implementation: moloc is available for download as an R package (https://github.com/clagiamba/moloc). We also developed a web site to visualize the biological findings (icahn.mssm.edu/moloc). The browser allows searches by gene, methylation probe and scenario of interest.
Supplementary information
Supplementary data are available at Bioinformatics online.
Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data.
Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user.
Polyester is freely available from Bioconductor (http://bioconductor.org/).
jtleek@gmail.com
Supplementary data are available at Bioinformatics online.
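The idea of simulating differential expression across biological replicates can be sketched at the level of read counts. The example below draws per-transcript counts for two groups from a negative binomial distribution (a gamma-Poisson mixture), with a user-set fold change applied to the second group. It is not Polyester's implementation and it produces counts, not reads; the function names, the dispersion parameter and all numeric values are assumptions for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def negative_binomial(rng, mean, size):
    """Gamma-Poisson mixture with variance mean + mean**2 / size."""
    rate = rng.gammavariate(size, mean / size)  # shape = size, scale = mean / size
    return poisson(rng, rate)


def simulate_counts(base_means, fold_changes, n_per_group, size=5.0, seed=0):
    """Return [group0, group1]; each group is a list of per-sample count lists.

    Group 1 means are the base means multiplied by the given fold changes.
    """
    rng = random.Random(seed)
    groups = []
    for group in (0, 1):
        samples = []
        for _ in range(n_per_group):
            samples.append([
                negative_binomial(rng, m * (fc if group == 1 else 1.0), size)
                for m, fc in zip(base_means, fold_changes)
            ])
        groups.append(samples)
    return groups


# Two hypothetical transcripts: the first doubled in group 1, the second unchanged.
control, treated = simulate_counts([50.0, 50.0], [2.0, 1.0], n_per_group=3)
print(control)
print(treated)
```

Because the true fold changes are set by the user, counts simulated this way can be passed to a differential expression method to check how often the known signal is recovered.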
Hierarchical cosmic shear power spectrum inference
Alsing, Justin; Heavens, Alan; Jaffe, Andrew H ...
Monthly Notices of the Royal Astronomical Society, 02/2016, Volume 455, Issue 4
Journal Article
Peer reviewed
Open access
We develop a Bayesian hierarchical modelling approach for cosmic shear power spectrum inference, jointly sampling from the posterior distribution of the cosmic shear field and its (tomographic) power spectra. Inference of the shear power spectrum is a powerful intermediate product for a cosmic shear analysis, since it requires very few model assumptions and can be used to perform inference on a wide range of cosmological models a posteriori without loss of information. We show that the joint posterior for the shear map and power spectrum can be sampled effectively by Gibbs sampling, iteratively drawing samples from the map and power spectrum, each conditional on the other. This approach neatly circumvents difficulties associated with complicated survey geometry and masks that plague frequentist power spectrum estimators, since the power spectrum inference provides prior information about the field in masked regions at every sampling step. We demonstrate this approach for inference of tomographic shear E-mode, B-mode and EB-cross power spectra from a simulated galaxy shear catalogue with a number of important features: galaxies distributed on the sky and in redshift with photometric redshift uncertainties, realistic random ellipticity noise for every galaxy and a complicated survey mask. The obtained posterior distributions for the tomographic power spectrum coefficients recover the underlying simulated power spectra for both E- and B-modes.
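The alternating scheme described above, drawing the field conditional on the power spectrum and then the power spectrum conditional on the field, can be illustrated with a toy Gibbs sampler. The sketch assumes scalar data d_i = s_i + n_i in which every mode shares a single unknown variance C (one band power), the noise variance is known, and C has a Jeffreys prior proportional to 1/C. It omits masks, tomography and the E/B decomposition of a real shear analysis; all names and values are hypothetical.

```python
import random


def gibbs_power(data, noise_var, n_steps=300, burn_in=100, seed=0):
    """Alternate between s | C, d and C | s; return post-burn-in samples of C."""
    rng = random.Random(seed)
    n = len(data)
    c = 1.0  # arbitrary starting value for the band power
    samples = []
    for step in range(n_steps):
        # Field given power: Gaussian with Wiener-filter mean and variance.
        weight = c / (c + noise_var)
        post_sd = (weight * noise_var) ** 0.5
        field = [rng.gauss(weight * d, post_sd) for d in data]
        # Power given field: inverse-gamma, drawn as sum(s^2) / chi-squared_n.
        chi2 = rng.gammavariate(n / 2.0, 2.0)
        c = sum(s * s for s in field) / chi2
        if step >= burn_in:
            samples.append(c)
    return samples


# Simulated toy data: signal variance 4, noise variance 1 (both hypothetical).
sim = random.Random(42)
data = [sim.gauss(0.0, 2.0) + sim.gauss(0.0, 1.0) for _ in range(2000)]
draws = gibbs_power(data, noise_var=1.0)
print(sum(draws) / len(draws))
```

The spread of the returned samples gives a posterior uncertainty on C, and in this toy setting their mean should lie near the signal variance used to simulate the data.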