Genomic datasets are often interpreted in the context of large-scale reference databases. One approach is to identify significantly overlapping gene sets, which works well for gene-centric data. However, many types of high-throughput data are based on genomic regions. Locus Overlap Analysis (LOLA) provides easy and automatable enrichment analysis for genomic region sets, thus facilitating the interpretation of functional genomics and epigenomics data.
The R package is available in Bioconductor and on the following website: http://lola.computational-epigenetics.org.
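As a rough illustration of region-set enrichment, the following Python sketch counts how many query regions overlap a reference set and compares that with the rest of a background "universe". The test statistic is not specified here; the sketch assumes a one-sided Fisher's exact test, and every function name is hypothetical. It is not the LOLA API.

```python
from math import comb

def overlaps(a, b):
    """True if half-open regions (chrom, start, end) a and b share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def count_hits(regions, reference):
    """Number of regions that overlap at least one reference region."""
    return sum(any(overlaps(r, ref) for ref in reference) for r in regions)

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)

def enrichment(query, universe, reference):
    """Assumed setup: query is a subset of universe; the rest is the background."""
    query_set = set(query)
    background = [r for r in universe if r not in query_set]
    a = count_hits(query, reference)        # query regions with overlap
    b = len(query) - a                      # query regions without overlap
    c = count_hits(background, reference)   # background regions with overlap
    d = len(background) - c                 # background regions without overlap
    return (a, b, c, d), fisher_greater(a, b, c, d)

# Toy data: ten 50 bp regions on chr1, the first four form the query set.
universe = [("chr1", i * 100, i * 100 + 50) for i in range(10)]
query = universe[:4]
reference = [("chr1", 10, 20), ("chr1", 110, 120), ("chr1", 210, 220), ("chr1", 910, 920)]
table, p_value = enrichment(query, universe, reference)
print(table, round(p_value, 4))
```

The quadratic overlap scan keeps the example short; a real analysis over large reference databases would use an indexed interval search instead.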
Methods for single-cell genome and transcriptome sequencing have contributed to our understanding of cellular heterogeneity, whereas methods for single-cell epigenomics are much less established. Here, we describe a whole-genome bisulfite sequencing (WGBS) assay that enables DNA methylation mapping in very small cell populations (μWGBS) and single cells (scWGBS). Our assay is optimized for profiling many samples at low coverage, and we describe a bioinformatic method that analyzes collections of single-cell methylomes to infer cell-state dynamics. Using these technological advances, we studied epigenomic cell-state dynamics in three in vitro models of cellular differentiation and pluripotency, where we observed characteristic patterns of epigenome remodeling and cell-to-cell heterogeneity. The described method enables single-cell analysis of DNA methylation in a broad range of biological systems, including embryonic development, stem cell differentiation, and cancer. It can also be used to establish composite methylomes that account for cell-to-cell heterogeneity in complex tissue samples.
• High-throughput bisulfite sequencing assay for low-input and single-cell samples
• Single-cell methylomes for in vitro models (K562 AzaC, HL60 VitD, and mES 2i/ATRA/EB)
• Bioinformatic method for inferring cell-state dynamics from sparse methylome data
• Identification of genomic region types with consistent changes among single cells
Farlik et al. describe a method for DNA methylation sequencing in very small cell populations (μWGBS) and single cells (scWGBS). Furthermore, they present a bioinformatic method for analyzing low-coverage methylome data and apply this technique to inferring epigenomic cell-state dynamics in pluripotent and differentiating cells.
Most genome-wide assays provide averages across large numbers of cells, but recent technological advances promise to overcome this limitation. Pioneering single-cell assays are now available for genome, epigenome, transcriptome, proteome, and metabolome profiling. Here, we describe how these different dimensions can be combined into multi-omics assays that provide comprehensive profiles of the same cell.
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is widely used to map histone marks and transcription factor binding throughout the genome. Here we present ChIPmentation, a method that combines chromatin immunoprecipitation with sequencing library preparation by Tn5 transposase ('tagmentation'). ChIPmentation introduces sequencing-compatible adaptors in a single-step reaction directly on bead-bound chromatin, which reduces time, cost and input requirements, thus providing a convenient and broadly useful alternative to existing ChIP-seq protocols.
Transcription factor fusion proteins can transform cells by inducing global changes of the transcriptome, often creating a state of oncogene addiction. Here, we investigate the role of epigenetic mechanisms in this process, focusing on Ewing sarcoma cells that are dependent on the EWS-FLI1 fusion protein. We established reference epigenome maps comprising DNA methylation, seven histone marks, open chromatin states, and RNA levels, and we analyzed the epigenome dynamics upon downregulation of the driving oncogene. Reduced EWS-FLI1 expression led to widespread epigenetic changes in promoters, enhancers, and super-enhancers, and we identified histone H3K27 acetylation as the most strongly affected mark. Clustering of epigenetic promoter signatures defined classes of EWS-FLI1-regulated genes that responded differently to low-dose treatment with histone deacetylase inhibitors. Furthermore, we observed strong and opposing enrichment patterns for E2F and AP-1 among EWS-FLI1-correlated and anticorrelated genes. Our data describe extensive genome-wide rewiring of epigenetic cell states driven by an oncogenic fusion protein.
• Reference epigenome maps identify widespread epigenetic change in Ewing sarcoma cells
• EWS-FLI1-regulated genes fall into clusters with characteristic chromatin signatures
• Transcriptome response to HDAC inhibitors depends on promoter-specific histone marks
• EWS-FLI1 induces global changes in H3K27ac and genome-wide enhancer reprogramming
EWS-FLI1 is an oncogenic fusion protein and the main driver of Ewing sarcoma. Tomazou et al. establish comprehensive epigenome maps for an EWS-FLI1-dependent cell line. Based on these data, they identify clusters of epigenetically regulated genes and a unique enhancer signature that is associated with EWS-FLI1 oncogene addiction.
Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics: for example, the Jaccard score is most sensitive to added or dropped regions, while the coverage score is most sensitive to shifted regions.
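The perturb-then-score idea can be sketched in a few lines of Python. This is a minimal illustration, not the Bedshift tool or its command-line interface: the function names, the rate parameters, `max_shift`, and `chrom_size` are all hypothetical, and the Jaccard score is assumed here to be computed at the base-pair level (covered bases in common divided by covered bases in either set).

```python
import random

def perturb(regions, shift_rate=0.0, drop_rate=0.0, add_rate=0.0,
            max_shift=100, chrom_size=1_000_000, seed=None):
    """Randomly drop, shift, and add (chrom, start, end) regions."""
    rng = random.Random(seed)
    out = []
    for chrom, start, end in regions:
        if rng.random() < drop_rate:
            continue  # region dropped
        if rng.random() < shift_rate:
            # shift by a random offset, keeping the start non-negative
            delta = max(rng.randint(-max_shift, max_shift), -start)
            start, end = start + delta, end + delta
        out.append((chrom, start, end))
    # add new regions whose widths are copied from randomly chosen originals
    for _ in range(round(add_rate * len(regions))):
        chrom, start, end = rng.choice(regions)
        width = end - start
        new_start = rng.randint(0, max(0, chrom_size - width))
        out.append((chrom, new_start, new_start + width))
    return sorted(out)

def covered(regions):
    """Merge overlapping regions into a sorted list of disjoint regions."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def total_bp(regions):
    """Number of bases covered by at least one region."""
    return sum(end - start for _, start, end in covered(regions))

def jaccard(a, b):
    """Base-pair Jaccard score; returns 0.0 when both sets are empty."""
    union = total_bp(list(a) + list(b))
    intersection = total_bp(a) + total_bp(b) - union  # inclusion-exclusion
    return intersection / union if union else 0.0

reference = [("chr1", 1000, 1100), ("chr1", 5000, 5200), ("chr1", 9000, 9300)]
shifted = perturb(reference, shift_rate=1.0, max_shift=50, seed=1)
print(jaccard(reference, shifted))
```

Scoring the same reference against files perturbed at increasing rates gives a response curve per metric, which is the kind of comparison the abstract describes.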
Abstract
Motivation
Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.
Results
We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂ N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.
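The three construction steps (sort by start, decompose into sublists, augment with the running maximum end) and the query can be sketched in Python as follows. This is a simplified illustration, not the published implementation: the decomposition rule and its `lookahead` and `min_cover` parameters are assumptions, and the class and function names are hypothetical. Intervals are half-open `(start, end)` pairs.

```python
from bisect import bisect_left
from itertools import accumulate

def decompose(intervals, lookahead=20, min_cover=10):
    """Split start-sorted intervals into sublists, moving long 'containing'
    intervals (those covering many of their successors) to later sublists."""
    components = []
    current = sorted(intervals)
    while current:
        keep, moved = [], []
        for i, (start, end) in enumerate(current):
            window = current[i + 1 : i + 1 + lookahead]
            n_covered = sum(1 for _, e in window if e <= end)
            (moved if n_covered >= min_cover else keep).append((start, end))
        components.append(keep)  # the last interval is always kept, so this ends
        current = moved
    return components

class AIListSketch:
    def __init__(self, intervals, lookahead=20, min_cover=10):
        self.components = []
        for comp in decompose(intervals, lookahead, min_cover):
            starts = [s for s, _ in comp]
            # running maximum of interval ends within the sublist
            max_ends = list(accumulate((e for _, e in comp), max))
            self.components.append((comp, starts, max_ends))

    def query(self, q_start, q_end):
        """Return all stored intervals overlapping [q_start, q_end)."""
        hits = []
        for comp, starts, max_ends in self.components:
            # binary search: index of the last interval with start < q_end
            i = bisect_left(starts, q_end) - 1
            # scan backward until no earlier interval can reach the query
            while i >= 0 and max_ends[i] > q_start:
                if comp[i][1] > q_start:
                    hits.append(comp[i])
                i -= 1
        return hits

index = AIListSketch([(1, 5), (3, 20), (6, 8), (10, 12), (15, 18)])
print(sorted(index.query(7, 11)))
```

The backward scan stops as soon as the running maximum end falls at or below the query start, because no earlier interval in that sublist can overlap. Moving long containing intervals into separate sublists keeps that stopping rule effective, which is what limits the extra comparisons m.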
Availability and implementation
An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org.
Supplementary information
Supplementary data are available at Bioinformatics online.
Complex patterns of cell-type-specific gene expression are thought to be achieved by combinatorial binding of transcription factors (TFs) to sequence elements in regulatory regions. Predicting cell-type-specific expression in mammals has been hindered by the oftentimes unknown location of distal regulatory regions. To alleviate this bottleneck, we used DNase-seq data from 19 diverse human cell types to identify proximal and distal regulatory elements at genome-wide scale. Matched expression data allowed us to separate genes into classes of cell-type-specific up-regulated, down-regulated, and constitutively expressed genes. CG dinucleotide content and DNA accessibility in the promoters of these three classes of genes displayed substantial differences, highlighting the importance of including these aspects in modeling gene expression. We associated DNase I hypersensitive sites (DHSs) with genes, and trained classifiers for different expression patterns. TF sequence motif matches in DHSs provided a strong performance improvement in predicting gene expression over the typical baseline approach of using proximal promoter sequences. In particular, we achieved competitive performance when discriminating up-regulated genes from different cell types or genes up- and down-regulated under the same conditions. We identified previously known and new candidate cell-type-specific regulators. The models generated testable predictions of activating or repressive functions of regulators. DNase I footprints for these regulators were indicative of their direct binding to DNA. In summary, we successfully used information of open chromatin obtained by a single assay, DNase-seq, to address the problem of predicting cell-type-specific gene expression in mammalian organisms directly from regulatory sequence.
Abstract
Motivation
Gene set enrichment (GSE) analysis allows for an interpretation of gene expression through pre-defined gene set databases and is a critical step in understanding different phenotypes. With the rapid development of single-cell RNA sequencing (scRNA-seq) technology, GSE analysis can be performed on fine-grained gene expression data to gain a nuanced understanding of phenotypes of interest. However, with the cellular heterogeneity in single-cell gene profiles, current statistical GSE analysis methods sometimes fail to identify enriched gene sets. Meanwhile, deep learning has gained traction in applications like clustering and trajectory inference in single-cell studies due to its prowess in capturing complex data patterns. However, its use in GSE analysis remains limited, due to interpretability challenges.
Results
In this paper, we present DeepGSEA, an explainable deep gene set enrichment analysis approach which leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE. DeepGSEA learns the ability to capture GSE information through our designed classification tasks, and significance tests can be performed on each gene set, enabling the identification of enriched sets. The underlying distribution of a gene set learned by DeepGSEA can be explicitly visualized using the encoded cell and cellular prototype embeddings. We demonstrate the performance of DeepGSEA over commonly used GSE analysis methods by examining their sensitivity and specificity with four simulation studies. In addition, we test our model on three real scRNA-seq datasets and illustrate the interpretability of DeepGSEA by showing how its results can be explained.
Availability and implementation
https://github.com/Teddy-XiongGZ/DeepGSEA
Abstract
Summary
Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.
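The general idea behind linear binning can be sketched as follows: the genome is cut into fixed-width bins, each interval is stored in every bin it touches, and a query inspects only the bins it spans. This Python sketch is an illustration under assumptions, not the IGD implementation or its file format: the bin width, the in-memory dictionary, and all names are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # assumed bin width in base pairs, chosen for illustration

class LinearBinIndex:
    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        # (chrom, bin number) -> list of (start, end, dataset_id)
        self.bins = defaultdict(list)

    def add(self, chrom, start, end, dataset_id):
        """Store a half-open interval in every bin it touches."""
        first = start // self.bin_size
        last = (end - 1) // self.bin_size
        for b in range(first, last + 1):
            self.bins[(chrom, b)].append((start, end, dataset_id))

    def count_overlaps(self, chrom, q_start, q_end):
        """Return {dataset_id: number of stored intervals overlapping the query}."""
        counts = defaultdict(int)
        first = q_start // self.bin_size
        last = (q_end - 1) // self.bin_size
        for b in range(first, last + 1):
            for start, end, dataset_id in self.bins.get((chrom, b), ()):
                if start < q_end and end > q_start:
                    # an interval spanning several bins is counted only in
                    # the first bin it shares with the query
                    if max(start, q_start) // self.bin_size == b:
                        counts[dataset_id] += 1
        return dict(counts)

index = LinearBinIndex(bin_size=100)
index.add("chr1", 50, 350, "A")
index.add("chr1", 500, 600, "A")
index.add("chr1", 120, 130, "B")
print(index.count_overlaps("chr1", 100, 320))
```

Because bin membership is computed by integer division, locating the candidate intervals for a query needs no tree traversal. The cost is that intervals crossing bin boundaries are stored more than once, which is why the query deduplicates by first shared bin.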
Availability and implementation
https://github.com/databio/IGD.
Supplementary information
Supplementary data are available at Bioinformatics online.