ABSTRACT
Bringing a high-dimensional data set into science-ready shape is a formidable challenge that often necessitates data compression. Compression has accordingly become a key consideration for contemporary cosmology, affecting public data releases and reanalyses searching for new physics. However, data compression optimized for a particular model can suppress signs of new physics, or even remove them altogether. We therefore provide a solution for exploring new physics during data compression. In particular, we store additional agnostic compressed data points, selected to enable precise constraints of non-standard physics at a later date. Our procedure is based on the maximal compression of the MOPED algorithm, which optimally filters the data with respect to a baseline model. We select additional filters, based on a generalized principal component analysis, which are carefully constructed to scout for new physics at high precision and speed. We refer to the augmented set of filters as MOPED-PC. They enable an analytic computation of Bayesian evidence that may indicate the presence of new physics, and fast analytic estimates of best-fitting parameters when adopting a specific non-standard theory, without further expensive MCMC analysis. As there may be large numbers of non-standard theories, the speed of the method becomes essential. Should no new physics be found, then our approach preserves the precision of the standard parameters. As a result, we achieve very rapid and maximally precise constraints of standard and non-standard physics, with a technique that scales well to high-dimensional data sets.
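As a rough illustration of the baseline compression step mentioned above, the following sketch builds standard MOPED filter vectors for a Gaussian likelihood whose covariance does not depend on the parameters. It covers only plain MOPED, not the additional principal-component filters of MOPED-PC. The function name `moped_vectors`, the array shapes, and the random test inputs are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def moped_vectors(dmu, cov):
    """Build one MOPED filter per parameter (illustrative sketch).

    dmu : (n_params, n_data) derivatives of the model mean w.r.t. each parameter
    cov : (n_data, n_data) data covariance, assumed parameter-independent
    """
    # Rows of cinv_dmu are C^{-1} d(mu)/d(theta_m).
    cinv_dmu = np.linalg.solve(cov, dmu.T).T
    vectors = []
    for m in range(dmu.shape[0]):
        b = cinv_dmu[m].copy()
        norm2 = dmu[m] @ cinv_dmu[m]
        # Gram-Schmidt step: remove what earlier filters already capture.
        for bq in vectors:
            proj = dmu[m] @ bq
            b -= proj * bq
            norm2 -= proj ** 2
        vectors.append(b / np.sqrt(norm2))
    return np.array(vectors)

# Hypothetical inputs: 50 data points, 3 baseline parameters.
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 50))
cov = a @ a.T + 50.0 * np.eye(50)
dmu = rng.normal(size=(3, 50))

B = moped_vectors(dmu, cov)   # (3, 50): each row is one filter
x = rng.normal(size=50)       # a mock data vector
y = B @ x                     # 3 compressed numbers instead of 50

# Fisher matrix before and after compression.
fisher_full = dmu @ np.linalg.solve(cov, dmu.T)
M = B @ dmu.T
fisher_compressed = M.T @ M
```

Under these assumptions the compressed data have unit covariance and the Fisher matrix of the baseline parameters is unchanged, which is the sense in which the compression preserves the precision of the standard parameters.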
We used the 10x Genomics Visium platform to define the spatial topography of gene expression in the six-layered human dorsolateral prefrontal cortex. We identified extensive layer-enriched expression signatures and refined associations to previous laminar markers. We overlaid our laminar expression signatures on large-scale single nucleus RNA-sequencing data, enhancing spatial annotation of expression-driven clusters. By integrating neuropsychiatric disorder gene sets, we showed differential layer-enriched expression of genes associated with schizophrenia and autism spectrum disorder, highlighting the clinical relevance of spatially defined expression. We then developed a data-driven framework to define unsupervised clusters in spatial transcriptomics data, which can be applied to other tissues or brain regions in which morphological architecture is not as well defined as cortical laminae. Last, we created a web application for the scientific community to explore these raw and summarized data to augment ongoing neuroscience and spatial transcriptomics research (http://research.libd.org/spatialLIBD).
The hippocampal formation, although prominently implicated in schizophrenia pathogenesis, has been overlooked in large-scale genomics efforts in the schizophrenic brain. We performed RNA-seq in hippocampi and dorsolateral prefrontal cortices (DLPFCs) from 551 individuals (286 with schizophrenia). We identified substantial regional differences in gene expression and found widespread developmental differences that were independent of cellular composition. We identified 48 and 245 differentially expressed genes (DEGs) associated with schizophrenia within the hippocampus and DLPFC, respectively, with little overlap between the brain regions. Of 163 schizophrenia GWAS risk loci, 124 (76.6%) contained eQTLs in at least one region. Transcriptome-wide association studies in each region identified many novel schizophrenia risk features that were brain region-specific. Last, we identified potential molecular correlates of in vivo evidence of altered prefrontal-hippocampal functional coherence in schizophrenia. These results underscore the complexity and regional heterogeneity of the transcriptional correlates of schizophrenia and offer new insights into potentially causative biology.
• Dorsolateral prefrontal cortex and hippocampus gene expression across development
• Novel region-specific schizophrenia genetic risk features
• Decreased regional functional coherence in schizophrenia
• Public brain gene expression and eQTL resource at http://eqtl.brainseq.org/phase2
Collado-Torres et al. describe the BrainSeq Phase II gene expression resource encompassing two brain regions from 551 genotyped individuals spanning the entire human lifespan (286 with schizophrenia). This resource can answer region-specific questions about development and schizophrenia and its genetic risk.
Bisulfite sequencing is a powerful tool for profiling genomic methylation, an epigenetic modification critical in the understanding of cancer, psychiatric disorders, and many other conditions. Raw data generated by whole genome bisulfite sequencing (WGBS) requires several computational steps before it is ready for statistical analysis, and particular care is required to process data in a timely and memory-efficient manner. Alignment to a reference genome is one of the most computationally demanding steps in a WGBS workflow, taking several hours or even days with commonly used WGBS-specific alignment software. This naturally motivates the creation of computational workflows that can utilize GPU-based alignment software to greatly speed up the bottleneck step. In addition, WGBS produces raw data that is large and often unwieldy; a lack of memory-efficient representation of data by existing pipelines renders WGBS impractical or impossible for many researchers. We present BiocMAP, a Bioconductor-friendly methylation analysis pipeline consisting of two modules, to address the above concerns. The first module performs computationally intensive read alignment using Arioc, a GPU-accelerated short-read aligner. Since GPUs are not always available on the same computing environments where traditional CPU-based analyses are convenient, the second module may be run in a GPU-free environment. This module extracts and merges DNA methylation proportions, that is, the fractions of methylated cytosines across all cells in a sample at a given genomic site. Bioconductor-based output objects in R utilize an on-disk data representation to drastically reduce required main memory and make WGBS projects computationally feasible for more researchers. BiocMAP is implemented using Nextflow and available at http://research.libd.org/BiocMAP/.
To enable reproducible analysis across a variety of typical computing environments, BiocMAP can be containerized with Docker or Singularity, and executed locally or with the SLURM or SGE scheduling engines. By providing Bioconductor objects, BiocMAP's output can be integrated with powerful open-source software for analyzing methylation data.
During the past 5 years, high-throughput technologies have been successfully used by epidemiological studies, but almost all have focused on sequence variation through genome-wide association studies (GWAS). Today, the study of other genomic events is becoming more common in large-scale epidemiological studies. Many of these, unlike the single-nucleotide polymorphisms studied in GWAS, are continuous measures. In this context, the exercise of searching for regions of interest for disease is akin to the problems described in the statistical 'bump hunting' literature.
New statistical challenges arise when the measurements are continuous rather than categorical, when they are measured with uncertainty, and when both biological signal and measurement error are characterized by spatial correlation along the genome. Perhaps the most challenging complication is that continuous genomic data from large studies are measured over long periods, making them susceptible to 'batch effects'. An example that combines all three characteristics is genome-wide DNA methylation measurements. Here, we present a data analysis pipeline that effectively models measurement error, removes batch effects, detects regions of interest and attaches statistical uncertainty to identified regions.
We illustrate the usefulness of our approach by detecting genomic regions of DNA methylation associated with a continuous trait in a well-characterized population of newborns. Additionally, we show that addressing unexplained heterogeneity like batch effects reduces the number of false-positive regions.
Our framework offers a comprehensive yet flexible approach for identifying genomic regions of biological interest in large epidemiological studies using quantitative high-throughput methods.
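The region-detection idea in the bump-hunting abstract above can be sketched in a toy form: smooth per-locus effect estimates along the genome, threshold them, and group neighbouring loci that exceed the threshold into candidate regions. This is a simplified sketch and not the authors' pipeline; it leaves out measurement-error modelling, batch-effect removal, and the assignment of statistical uncertainty. The function name `find_bumps`, the moving-average smoother, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def find_bumps(positions, effects, cutoff, max_gap=500, window=3):
    """Return (start, end) genomic intervals where smoothed effects exceed cutoff.

    positions : sorted 1-D array of genomic coordinates
    effects   : per-locus effect estimates (e.g. regression coefficients)
    """
    # Moving-average smoother (an assumption; other smoothers could be used).
    kernel = np.ones(window) / window
    smoothed = np.convolve(effects, kernel, mode="same")
    # -1, 0 or +1 per locus: direction of the effect if it passes the cutoff.
    state = np.sign(smoothed) * (np.abs(smoothed) > cutoff)

    regions = []
    start = None
    for i in range(len(positions)):
        continues = (
            start is not None
            and state[i] == state[start]
            and positions[i] - positions[i - 1] <= max_gap
        )
        if start is not None and not continues:
            # Close the open region at the previous locus.
            regions.append((int(positions[start]), int(positions[i - 1])))
            start = None
        if start is None and state[i] != 0:
            start = i
    if start is not None:
        regions.append((int(positions[start]), int(positions[-1])))
    return regions

# Hypothetical example: 20 loci spaced 100 bp apart with one elevated stretch.
positions = np.arange(0, 2000, 100)
effects = np.zeros(20)
effects[8:12] = 1.0
bumps = find_bumps(positions, effects, cutoff=0.5)
```

In a real analysis the candidate regions would still need an uncertainty estimate, for instance from permutations or a bootstrap, which this sketch does not attempt.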
We employed Illumina 450K Infinium microarrays to profile DNA methylation (DNAm) in neuronal nuclei separated by fluorescence-activated sorting from the postmortem orbitofrontal cortex (OFC) of heroin users who died from heroin overdose (n = 37), suicide completers (n = 22) with no evidence of heroin use and from control subjects who did not abuse illicit drugs and died of non-suicide causes (n = 28). We identified 1298 differentially methylated CpG sites (DMSs) between heroin users and controls, and 454 DMSs between suicide completers and controls (p < 0.001). DMSs and corresponding genes (DMGs) in heroin users showed significant differences in the preferential context of hyper and hypo DM. HyperDMSs were enriched in gene bodies and exons but depleted in promoters, whereas hypoDMSs were enriched in promoters and enhancers. In addition, hyperDMGs showed preference for genes expressed specifically by glutamatergic as opposed to GABAergic neurons and enrichment for axonogenesis- and synaptic-related gene ontology categories, whereas hypoDMGs were enriched for transcription factor activity- and gene expression regulation-related terms. Finally, we found that the DNAm-based "epigenetic age" of neurons from heroin users was younger than that in controls. Suicide-related results were more difficult to interpret. Collectively, these findings suggest that the observed DNAm differences could represent functionally significant marks of heroin-associated plasticity in the OFC.
Autism spectrum disorder (ASD) is genetically heterogeneous with convergent symptomatology, suggesting common dysregulated pathways. In this study, we analyzed brain transcriptional changes in five mouse models of Pitt-Hopkins syndrome (PTHS), a syndromic form of ASD caused by mutations in the TCF4 gene, but not the TCF7L2 gene. Analyses of differentially expressed genes (DEGs) highlighted oligodendrocyte (OL) dysregulation, which we confirmed in two additional mouse models of syndromic ASD (Pten and Mecp2). The PTHS mouse models showed cell-autonomous reductions in OL numbers and myelination, functionally confirming OL transcriptional signatures. We also integrated PTHS mouse model DEGs with human idiopathic ASD postmortem brain RNA-sequencing data and found significant enrichment of overlapping DEGs and common myelination-associated pathways. Notably, DEGs from syndromic ASD mouse models and reduced deconvoluted OL numbers distinguished human idiopathic ASD cases from controls across three postmortem brain data sets. These results implicate disruptions in OL biology as a cellular mechanism in ASD pathology.