Simultaneous profiling of transcriptome and chromatin accessibility within single cells is a powerful approach to dissect gene regulatory programs in complex tissues. However, current tools are ...limited by modest throughput. We now describe an ultra high-throughput method, Paired-seq, for parallel analysis of transcriptome and accessible chromatin in millions of single cells. We demonstrate the utility of Paired-seq for analyzing the dynamic and cell-type-specific gene regulatory programs in complex tissues by applying it to mouse adult cerebral cortex and fetal forebrain. The joint profiles of a large number of single cells allowed us to deconvolute the transcriptome and open chromatin landscapes in the major cell types within these brain tissues, infer putative target genes of candidate enhancers, and reconstruct the trajectory of cellular lineages within the developing forebrain.
Single-cell Hi-C (scHi-C) analysis has been increasingly used to map chromatin architecture in diverse tissue contexts, but computational tools to define chromatin loops at high resolution from ...scHi-C data are still lacking. Here, we describe Single-Nucleus Analysis Pipeline for Hi-C (SnapHiC), a method that can identify chromatin loops at high resolution and accuracy from scHi-C data. Using scHi-C data from 742 mouse embryonic stem cells, we benchmark SnapHiC against a number of computational tools developed for mapping chromatin loops and interactions from bulk Hi-C. We further demonstrate its use by analyzing single-nucleus methyl-3C-seq data from 2,869 human prefrontal cortical cells, which uncovers cell type-specific chromatin loops and predicts putative target genes for noncoding sequence variants associated with neuropsychiatric disorders. Our results indicate that SnapHiC could facilitate the analysis of cell type-specific chromatin architecture and gene regulatory programs in complex tissues.
Lineage-specific epigenomic changes during human corticogenesis have been difficult to study owing to challenges with sample availability and tissue heterogeneity. For example, previous studies using ...single-cell RNA sequencing identified at least 9 major cell types and up to 26 distinct subtypes in the dorsal cortex alone
. Here we characterize cell-type-specific cis-regulatory chromatin interactions, open chromatin peaks, and transcriptomes for radial glia, intermediate progenitor cells, excitatory neurons, and interneurons isolated from mid-gestational samples of the human cortex. We show that chromatin interactions underlie several aspects of gene regulation, with transposable elements and disease-associated variants enriched at distal interacting regions in a cell-type-specific manner. In addition, promoters with increased levels of chromatin interactivity-termed super-interactive promoters-are enriched for lineage-specific genes, suggesting that interactions at these loci contribute to the fine-tuning of transcription. Finally, we develop CRISPRview, a technique that integrates immunostaining, CRISPR interference, RNAscope, and image analysis to validate cell-type-specific cis-regulatory elements in heterogeneous populations of primary cells. Our findings provide insights into cell-type-specific gene expression patterns in the developing human cortex and advance our understanding of gene regulation and lineage specification during this crucial developmental window.
Hi-C and chromatin immunoprecipitation (ChIP) have been combined to identify long-range chromatin interactions genome-wide at reduced cost and enhanced resolution, but extracting information from the ...resulting datasets has been challenging. Here we describe a computational method, MAPS, Model-based Analysis of PLAC-seq and HiChIP, to process the data from such experiments and identify long-range chromatin interactions. MAPS adopts a zero-truncated Poisson regression framework to explicitly remove systematic biases in the PLAC-seq and HiChIP datasets, and then uses the normalized chromatin contact frequencies to identify significant chromatin interactions anchored at genomic regions bound by the protein of interest. MAPS shows superior performance over existing software tools in the analysis of chromatin interactions from multiple PLAC-seq and HiChIP datasets centered on different transcriptional factors and histone marks. MAPS is freely available at https://github.com/ijuric/MAPS.
Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple ...high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment.
In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm.
The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within ...proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges.
In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable.
We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences.
High-throughput chromatin conformation capture technologies, such as Hi-C and Micro-C, have enabled genome-wide view of chromatin spatial organization. Most recently, Hi-C-derived enrichment-based ...technologies, including HiChIP and PLAC-seq, offer attractive alternatives due to their high signal-to-noise ratio and low cost. While a series of computational tools have been developed for Hi-C data, methods tailored for HiChIP and PLAC-seq data are still under development. Here we present HPTAD, a computational method to identify topologically associating domains (TADs) from HiChIP and PLAC-seq data. We performed comprehensive benchmark analysis to demonstrate its superior performance over existing TAD callers designed for Hi-C data. HPTAD is freely available at https://github.com/yunliUNC/HPTAD.
•SnapHiC2 uses sliding windows and reduces both computing time and memory by 70%.•SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy.•SnapHiC2 can help to ...predict putative target genes of GWAS variants in a cell-type-specific manner.
Single cell Hi-C (scHi-C) technologies enable the study of chromatin spatial organization directly from complex tissues at single cell resolution. However, the identification of chromatin loops from single cells is challenging, largely due to the extremely sparse data. Our recently developed SnapHiC pipeline provides the first tool to map chromatin loops from scHi-C data, but it is computationally intensive. Here we introduce SnapHiC2, which adapts a sliding window approximation when imputing missing contacts in each single cell and reduces both memory usage and computational time by 70%. SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy and help to suggest target genes for GWAS variants in a cell-type-specific manner. SnapHiC2 is freely available at: https://github.com/HuMingLab/SnapHiC/releases/tag/v0.2.2.
HiChIP and PLAC-Seq are emerging technologies for studying genome-wide long-range chromatin interactions mediated by the protein of interest, enabling more sensitive and cost-efficient interrogation ...of protein-centric chromatin conformation. However, due to the unbalanced read distribution introduced by protein immunoprecipitation, existing reproducibility measures developed for Hi-C data are not appropriate for the analysis of HiChIP and PLAC-Seq data. Here, we present HPRep, a stratified and weighted correlation metric derived from normalized contact counts, to quantify reproducibility in HiChIP and PLAC-Seq data. We applied HPRep to multiple real datasets and demonstrate that HPRep outperforms existing reproducibility measures developed for Hi-C data. Specifically, we applied HPRep to H3K4me3 PLAC-Seq data from mouse embryonic stem cells and mouse brain tissues as well as H3K27ac HiChIP data from human lymphoblastoid cell line GM12878 and leukemia cell line K562, showing that HPRep can more clearly separate among pseudo-replicates, real replicates, and non-replicates. Furthermore, in an H3K4me3 PLAC-Seq dataset consisting of 11 samples from four human brain cell types, HPRep demonstrated the expected clustering of data that could not be achieved by existing methods developed for Hi-C data, highlighting the need for a reproducibility metric tailored to HiChIP and PLAC-Seq data.
•Frequently Interacting Regions (FIREs) can be used to identify tissue and cell-type- specific cis-regulatory regions.•Our FIREcaller R package identifies FIREs and superFIREs (i.e., clustered ...FIREs).•FIREcaller also detects differential FIREs.
Hi-C experiments have been widely adopted to study chromatin spatial organization, which plays an essential role in genome function. We have recently identified frequently interacting regions (FIREs) and found that they are closely associated with cell-type-specific gene regulation. However, computational tools for detecting FIREs from Hi-C data are still lacking. In this work, we present FIREcaller, a stand-alone, user-friendly R package for detecting FIREs from Hi-C data. FIREcaller takes raw Hi-C contact matrices as input, performs within-sample and cross-sample normalization, and outputs continuous FIRE scores, dichotomous FIREs, and super-FIREs. Applying FIREcaller to Hi-C data from various human tissues, we demonstrate that FIREs and super-FIREs identified, in a tissue-specific manner, are closely related to gene regulation, are enriched for enhancer-promoter (E-P) interactions, tend to overlap with regions exhibiting epigenomic signatures of cis-regulatory roles, and aid the interpretation or GWAS variants. The FIREcaller package is implemented in R and freely available at https://yunliweb.its.unc.edu/FIREcaller.