Chromatin immunoprecipitation experiments followed by sequencing (ChIP-seq) detect protein-DNA binding events and chemical modifications of histone proteins. Challenges in the standard ChIP-seq ...protocol have motivated recent enhancements in this approach, such as reducing the number of cells that are required and increasing the resolution. Complementary experimental approaches - for example, DNaseI hypersensitive site mapping and analysis of chromatin interactions that are mediated by particular proteins - provide additional information about DNA-binding proteins and their function. These data are now being used to identify variability in the functions of DNA-binding proteins across genomes and individuals. In this Review, I describe the latest advances in methods to detect and functionally characterize DNA-bound proteins.
The intestinal epithelium serves critical physiologic functions that are shared among all vertebrates. However, it is unknown how the transcriptional regulatory mechanisms underlying these functions ...have changed over the course of vertebrate evolution. We generated genome-wide mRNA and accessible chromatin data from adult intestinal epithelial cells (IECs) in zebrafish, stickleback, mouse, and human species to determine if conserved IEC functions are achieved through common transcriptional regulation. We found evidence for substantial common regulation and conservation of gene expression regionally along the length of the intestine from fish to mammals and identified a core set of genes comprising a vertebrate IEC signature. We also identified transcriptional start sites and other putative regulatory regions that are differentially accessible in IECs in all 4 species. Although these sites rarely showed sequence conservation from fish to mammals, surprisingly, they drove highly conserved IEC expression in a zebrafish reporter assay. Common putative transcription factor binding sites (TFBS) found at these sites in multiple species indicate that sequence conservation alone is insufficient to identify much of the functionally conserved IEC regulatory information. Among the rare, highly sequence-conserved, IEC-specific regulatory regions, we discovered an ancient enhancer upstream from her6/HES1 that is active in a distinct population of Notch-positive cells in the intestinal epithelium. Together, these results show how combining accessible chromatin and mRNA datasets with TFBS prediction and in vivo reporter assays can reveal tissue-specific regulatory information conserved across 420 million years of vertebrate evolution. We define an IEC transcriptional regulatory network that is shared between fish and mammals and establish an experimental platform for studying how evolutionarily distilled regulatory information commonly controls IEC development and physiology.
The IBDs, Crohn's disease and ulcerative colitis, are chronic inflammatory conditions of the gastrointestinal tract resulting from an aberrant immune response to enteric microbiota in genetically ...susceptible individuals. Disease presentation and progression within and across IBDs, especially Crohn's disease, are highly heterogeneous in location, severity of inflammation and other phenotypes. Current clinical classifications fail to accurately predict disease course and response to therapies. Genome-wide association studies have identified >240 loci that confer risk of IBD, but the clinical utility of these findings remains unclear, and mechanisms by which the genetic variants contribute to disease are largely unknown. In the past 5 years, the profiling of genome-wide gene expression, epigenomic features and gut microbiota composition in intestinal tissue and faecal samples has uncovered distinct molecular signatures that define IBD subtypes, including within Crohn's disease and ulcerative colitis. In this Review, we summarize studies in both adult and paediatric patients that have identified different IBD subtypes, which in some cases have been associated with distinct clinical phenotypes. We posit that genome-scale molecular phenotyping in large cohorts holds great promise not only to further our understanding of the diverse molecular causes of IBD but also for improving clinical trial design to develop more personalized disease management and treatment.
Network inference is used to model transcriptional, signaling, and metabolic interactions among genes, proteins, and metabolites that identify biological pathways influencing disease pathogenesis. ...Advances in machine learning (ML)-based inference models exhibit the predictive capabilities of capturing latent patterns in genomic data. Such models are emerging as an alternative to the statistical models identifying causative factors driving complex diseases. We present CoVar, an ML-based framework that builds upon the properties of existing inference models, to find the central genes driving perturbed gene expression across biological states. Unlike differentially expressed genes (DEGs) that capture changes in individual gene expression across conditions, CoVar focuses on identifying variational genes that undergo changes in their expression network interaction profiles, providing insights into changes in the regulatory dynamics, such as in disease pathogenesis. Subsequently, it finds core genes from among the nearest neighbors of these variational genes, which are central to the variational activity and influence the coordinated regulatory processes underlying the observed changes in gene expression. Through the analysis of simulated as well as yeast expression data perturbed by the deletion of the mitochondrial genome, we show that CoVar captures the intrinsic variationality and modularity in the expression data, identifying key driver genes not found through existing differential analysis methodologies.
Mapping DNase I hypersensitive (HS) sites is an accurate method of identifying the location of genetic regulatory elements, including promoters, enhancers, silencers, insulators, and locus control ...regions. We employed high-throughput sequencing and whole-genome tiled array strategies to identify DNase I HS sites within human primary CD4
+ T cells. Combining these two technologies, we have created a comprehensive and accurate genome-wide open chromatin map. Surprisingly, only 16%–21% of the identified 94,925 DNase I HS sites are found in promoters or first exons of known genes, but nearly half of the most open sites are in these regions. In conjunction with expression, motif, and chromatin immunoprecipitation data, we find evidence of cell-type-specific characteristics, including the ability to identify transcription start sites and locations of different chromatin marks utilized in these cells. In addition, and unexpectedly, our analyses have uncovered detailed features of nucleosome structure.
Identifying the locations of transcription factor binding sites is critical for understanding how gene transcription is regulated across different cell types and conditions. Chromatin accessibility ...experiments such as DNaseI sequencing (DNase-seq) and Assay for Transposase Accessible Chromatin sequencing (ATAC-seq) produce genome-wide data that include distinct 'footprint' patterns at binding sites. Nearly all existing computational methods to detect footprints from these data assume that footprint signals are highly homogeneous across footprint sites. Additionally, a comprehensive and systematic comparison of footprinting methods for specifically identifying which motif sites for a specific factor are bound has not been performed.
Using DNase-seq data from the ENCODE project, we show that a large degree of previously uncharacterized site-to-site variability exists in footprint signal across motif sites for a transcription factor. To model this heterogeneity in the data, we introduce a novel, supervised learning footprinter called Detecting Footprints Containing Motifs (DeFCoM). We compare DeFCoM to nine existing methods using evaluation sets from four human cell-lines and eighteen transcription factors and show that DeFCoM outperforms current methods in determining bound and unbound motif sites. We also analyze the impact of several biological and technical factors on the quality of footprint predictions to highlight important considerations when conducting footprint analyses and assessing the performance of footprint prediction methods. Finally, we show that DeFCoM can detect footprints using ATAC-seq data with similar accuracy as when using DNase-seq data.
Python code available at https://bitbucket.org/bryancquach/defcom.
bquach@email.unc.edu or tsfurey@email.unc.edu.
Supplementary data are available at Bioinformatics online.
Genetics Driving Epigenetics Furey, Terrence S.; Sethupathy, Praveen
Science (American Association for the Advancement of Science),
11/2013, Letnik:
342, Številka:
6159
Journal Article
Recenzirano
Humans vary according to a plethora of traits, such as height, hair color, behavior, and susceptibility to disease. Both genetics (nature) and environment (nurture) contribute to this variation. ...Recent large-scale genetic studies have identified thousands of specific DNA variations in the human population that are associated with different traits. However, these studies do not answer a key question: By what means do most DNA variants alter cellular behavior and contribute to differences in specific traits, such as height? A trio of papers in this issue by Kasowski et al. on page 750 (1), Kilpinen et al. on page 744 (2), and McVicker et al. on page 747 (3) provide a framework for exploring the mechanistic link between genetic and trait variation in the human population. Specifically, they find that DNA variants influence a layer of gene regulation called epigenetics through the sequence-specific activity of transcription factors.
Abstract
Motivation
Analysis of open chromatin regions across multiple samples from two or more distinct conditions can determine altered gene regulatory patterns associated with biological ...phenotypes and complex traits. The ATAC-seq assay allows for tractable genome-wide open chromatin profiling of large numbers of samples. Stable, broadly applicable genomic annotations of open chromatin regions are not available. Thus, most studies first identify open regions using peak calling methods for each sample independently. These are then heuristically combined to obtain a consensus peak set. Reconciling sample-specific peak results post hoc from larger cohorts is particularly challenging, and informative spatial features specific to open chromatin signals are not leveraged effectively.
Results
We propose a novel method, ROCCO, that determines consensus open chromatin regions across multiple samples simultaneously. ROCCO employs robust summary statistics and solves a constrained optimization problem formulated to account for both enrichment and spatial dependence of open chromatin signal data. We show this formulation admits attractive theoretical and conceptual properties as well as superior empirical performance compared to current methodology.
Availability and implementation
Source code, documentation, and usage demos for ROCCO are available on GitHub at: https://github.com/nolan-h-hamilton/ROCCO. ROCCO can also be installed as a stand-alone binary utility using pip/PyPI.
Regulation of gene transcription in diverse cell types is determined largely by varied sets of cis-elements where transcription factors bind. Here we demonstrate that data from a single ...high-throughput DNase I hypersensitivity assay can delineate hundreds of thousands of base-pair resolution in vivo footprints in human cells that precisely mark individual transcription factor-DNA interactions. These annotations provide a unique resource for the investigation of cis-regulatory elements. We find that footprints for specific transcription factors correlate with ChIP-seq enrichment and can accurately identify functional versus nonfunctional transcription factor motifs. We also find that footprints reveal a unique evolutionary conservation pattern that differentiates functional footprinted bases from surrounding DNA. Finally, detailed analysis of CTCF footprints suggests multiple modes of binding and a novel DNA binding motif upstream of the primary binding site.
Tag sequencing using high-throughput sequencing technologies are now regularly employed to identify specific sequence features, such as transcription factor binding sites (ChIP-seq) or regions of ...open chromatin (DNase-seq). To intuitively summarize and display individual sequence data as an accurate and interpretable signal, we developed F-Seq, a software package that generates a continuous tag sequence density estimation allowing identification of biologically meaningful sites whose output can be displayed directly in the UCSC Genome Browser. Availability: The software is written in the Java language and is available on all major computing platforms for download at http://www.genome.duke.edu/labs/furey/software/fseq. Contact: terry.furey@duke.edu