Abstract
Motivation
Eukaryotic chromosomes adapt a complex and highly dynamic three-dimensional (3D) structure, which profoundly affects different cellular functions and outcomes including changes in ...epigenetic landscape and in gene expression. Making the scenario even more complex, cancer cells harbor chromosomal abnormalities e.g. copy number variations (CNVs) and translocations altering their genomes both at the sequence level and at the level of 3D organization. High-throughput chromosome conformation capture techniques (e.g. Hi-C), which are originally developed for decoding the 3D structure of the chromatin, provide a great opportunity to simultaneously identify the locations of genomic rearrangements and to investigate the 3D genome organization in cancer cells. Even though Hi-C data has been used for validating known rearrangements, computational methods that can distinguish rearrangement signals from the inherent biases of Hi-C data and from the actual 3D conformation of chromatin, and can precisely detect rearrangement locations de novo have been missing.
Results
In this work, we characterize how intra and inter-chromosomal Hi-C contacts are distributed for normal and rearranged chromosomes to devise a new set of algorithms (i) to identify genomic segments that correspond to CNV regions such as amplifications and deletions (HiCnv), (ii) to call inter-chromosomal translocations and their boundaries (HiCtrans) from Hi-C experiments and (iii) to simulate Hi-C data from genomes with desired rearrangements and abnormalities (AveSim) in order to select optimal parameters for and to benchmark the accuracy of our methods. Our results on 10 different cancer cell lines with Hi-C data show that we identify a total number of 105 amplifications and 45 deletions together with 90 translocations, whereas we identify virtually no such events for two karyotypically normal cell lines. Our CNV predictions correlate very well with whole genome sequencing data among chromosomes with CNV events for a breast cancer cell line (r = 0.89) and capture most of the CNVs we simulate using Avesim. For HiCtrans predictions, we report evidence from the literature for 30 out of 90 translocations for eight of our cancer cell lines. Furthermore, we show that our tools identify and correctly classify relatively understudied rearrangements such as double minutes and homogeneously staining regions. Considering the inherent limitations of existing techniques for karyotyping (i.e. missing balanced rearrangements and those near repetitive regions), the accurate identification of CNVs and translocations in a cost-effective and high-throughput setting is still a challenge. Our results show that the set of tools we develop effectively utilize moderately sequenced Hi-C libraries (100-300 million reads) to identify known and de novo chromosomal rearrangements/abnormalities in well-established cancer cell lines. With the decrease in required number of cells and the increase in attainable resolution, we believe that our framework will pave the way towards comprehensive mapping of genomic rearrangements in primary cells from cancer patients using Hi-C.
Availability and implementation
CNV calling: https://github.com/ay-lab/HiCnv, Translocation calling: https://github.com/ay-lab/HiCtrans and Hi-C simulation: https://github.com/ay-lab/AveSim.
Supplementary information
Supplementary data are available at Bioinformatics online.
Fit-Hi-C is a programming application to compute statistical confidence estimates for Hi-C contact maps to identify significant chromatin contacts. By fitting a monotonically non-increasing spline, ...Fit-Hi-C captures the relationship between genomic distance and contact probability without any parametric assumption. The spline fit together with the correction of contact probabilities with respect to bin- or locus-specific biases accounts for previously characterized covariates impacting Hi-C contact counts. Fit-Hi-C is best applied for the study of mid-range (e.g., 20 kb-2 Mb for human genome) intra-chromosomal contacts; however, with the latest reimplementation, named FitHiC2, it is possible to perform genome-wide analysis for high-resolution Hi-C data, including all intra-chromosomal distances and inter-chromosomal contacts. FitHiC2 also offers a merging filter module, which eliminates indirect/bystander interactions, leading to significant reduction in the number of reported contacts without sacrificing recovery of key loops such as those between convergent CTCF binding sites. Here, we describe how to apply the FitHiC2 protocol to three use cases: (i) 5-kb resolution Hi-C data of chromosome 5 from GM12878 (a human lymphoblastoid cell line), (ii) 40-kb resolution whole-genome Hi-C data from IMR90 (human lung fibroblast), and (iii) budding yeast whole-genome Hi-C data at a single restriction cut site (EcoRI) resolution. The procedure takes ~12 h with preprocessing when all use cases are run sequentially (~4 h when run parallel). With the recent improvements in its implementation, FitHiC2 (8 processors and 16 GB memory) is also scalable to genome-wide analysis of the highest resolution (1 kb) Hi-C data available to date (~48 h with 32 GB peak memory). FitHiC2 is available through Bioconda, GitHub and the Python Package Index.
HiChIP/PLAC-seq is increasingly becoming popular for profiling 3D chromatin contacts among regulatory elements and for annotating functions of genetic variants. Here we describe FitHiChIP, a ...computational method for loop calling from HiChIP/PLAC-seq data, which jointly models the non-uniform coverage and genomic distance scaling of contact counts to compute statistical significance estimates. We also develop a technique to filter putative bystander loops that can be explained by stronger adjacent loops. Compared to existing methods, FitHiChIP performs better in recovering contacts reported by Hi-C, promoter capture Hi-C and ChIA-PET experiments and in capturing previously validated promoter-enhancer interactions. FitHiChIP loop calls are reproducible among replicates and are consistent across different experimental settings. Our work also provides a framework for differential HiChIP analysis with an option to utilize ChIP-seq data for further characterizing differential loops. Even though designed for HiChIP, FitHiChIP is also applicable to other conformation capture assays.
Our current understanding of how DNA is packed in the nucleus is most accurate at the fine scale of individual nucleosomes and at the large scale of chromosome territories. However, accurate modeling ...of DNA architecture at the intermediate scale of ∼50 kb-10 Mb is crucial for identifying functional interactions among regulatory elements and their target promoters. We describe a method, Fit-Hi-C, that assigns statistical confidence estimates to mid-range intra-chromosomal contacts by jointly modeling the random polymer looping effect and previously observed technical biases in Hi-C data sets. We demonstrate that our proposed approach computes accurate empirical null models of contact probability without any distribution assumption, corrects for binning artifacts, and provides improved statistical power relative to a previously described method. High-confidence contacts identified by Fit-Hi-C preferentially link expressed gene promoters to active enhancers identified by chromatin signatures in human embryonic stem cells (ESCs), capture 77% of RNA polymerase II-mediated enhancer-promoter interactions identified using ChIA-PET in mouse ESCs, and confirm previously validated, cell line-specific interactions in mouse cortex cells. We observe that insulators and heterochromatin regions are hubs for high-confidence contacts, while promoters and strong enhancers are involved in fewer contacts. We also observe that binding peaks of master pluripotency factors such as NANOG and POU5F1 are highly enriched in high-confidence contacts for human ESCs. Furthermore, we show that pairs of loci linked by high-confidence contacts exhibit similar replication timing in human and mouse ESCs and preferentially lie within the boundaries of topological domains for human and mouse cell lines.
We present MUSTACHE, a new method for multi-scale detection of chromatin loops from Hi-C and Micro-C contact maps. MUSTACHE employs scale-space theory, a technical advance in computer vision, to ...detect blob-shaped objects in contact maps. MUSTACHE is scalable to kilobase-resolution maps and reports loops that are highly consistent between replicates and between Hi-C and Micro-C datasets. Compared to other loop callers, such as HiCCUPS and SIP, MUSTACHE recovers a higher number of published ChIA-PET and HiChIP loops as well as loops linking promoters to regulatory elements. Overall, MUSTACHE enables an efficient and comprehensive analysis of chromatin loops. Available at: https://github.com/ay-lab/mustache .
The rapidly increasing quantity of genome-wide chromosome conformation capture data presents great opportunities and challenges in the computational modeling and interpretation of the ...three-dimensional genome. In particular, with recent trends towards higher-resolution high-throughput chromosome conformation capture (Hi-C) data, the diversity and complexity of biological hypotheses that can be tested necessitates rigorous computational and statistical methods as well as scalable pipelines to interpret these datasets. Here we review computational tools to interpret Hi-C data, including pipelines for mapping, filtering, and normalization, and methods for confidence estimation, domain calling, visualization, and three-dimensional modeling.
Since the advent of the chromosome conformation capture technology, our understanding of the human genome 3D organization has grown rapidly and we now know that human interphase chromosomes are ...folded into multiple layers of hierarchical structures and each layer can play a critical role in transcriptional regulation. Alterations in any one of these finely-tuned layers can lead to unwanted cascade of molecular events and ultimately drive the manifestation of diseases and phenotypes. Here we discuss, starting from chromosome level organization going down to single nucleotide changes, recent studies linking diseases or phenotypes to changes in the 3D genome architecture.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPCLJ, UPUK, ZAGLJ, ZRSKP
The compartmental organization of mammalian genomes and its changes play important roles in distinct biological processes. Here, we introduce dcHiC, which utilizes a multivariate distance measure to ...identify significant changes in compartmentalization among multiple contact maps. Evaluating dcHiC on four collections of bulk and single-cell contact maps from in vitro mouse neural differentiation (n = 3), mouse hematopoiesis (n = 10), human LCLs (n = 20) and post-natal mouse brain development (n = 3 stages), we show its effectiveness and sensitivity in detecting biologically relevant changes, including those orthogonally validated. dcHiC reported regions with dynamically regulated genes associated with cell identity, along with correlated changes in chromatin states, subcompartments, replication timing and lamin association. With its efficient implementation, dcHiC enables high-resolution compartment analysis as well as standalone browser visualization, differential interaction identification and time-series clustering. dcHiC is an essential addition to the Hi-C analysis toolbox for the ever-growing number of bulk and single-cell contact maps. Available at: https://github.com/ay-lab/dcHiC .
Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to ...infer, from the resulting DNA-DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. Many existing inference methods rely on multidimensional scaling (MDS), in which the pairwise distances of the inferred model are optimized to resemble pairwise distances derived directly from the contact counts. These approaches, however, often optimize a heuristic objective function and require strong assumptions about the biophysics of DNA to transform interaction frequencies to spatial distance, and thereby may lead to incorrect structure reconstruction.
We propose a novel approach to infer a consensus 3D structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data.
We compare two variants of our Poisson method, with or without optimization of the transfer function, to four different MDS-based algorithms-two metric MDS methods using different stress functions, a non-metric version of MDS and ChromSDE, a recently described, advanced MDS method-on a wide range of simulated datasets. We demonstrate that the Poisson models reconstruct better structures than all MDS-based methods, particularly at low coverage and high resolution, and we highlight the importance of optimizing the transfer function. On publicly available Hi-C data from mouse embryonic stem cells, we show that the Poisson methods lead to more reproducible structures than MDS-based methods when we use data generated using different restriction enzymes, and when we reconstruct structures at different resolutions.
A Python implementation of the proposed method is available at http://cbio.ensmp.fr/pastis.
Common genetic polymorphisms associated with COVID-19 illness can be utilized for discovering molecular pathways and cell types driving disease pathogenesis. Given the importance of immune cells in ...the pathogenesis of COVID-19 illness, here we assessed the effects of COVID-19-risk variants on gene expression in a wide range of immune cell types. Transcriptome-wide association study and colocalization analysis revealed putative causal genes and the specific immune cell types where gene expression is most influenced by COVID-19-risk variants. Notable examples include OAS1 in non-classical monocytes, DTX1 in B cells, IL10RB in NK cells, CXCR6 in follicular helper T cells, CCR9 in regulatory T cells and ARL17A in T
2 cells. By analysis of transposase accessible chromatin and H3K27ac-based chromatin-interaction maps of immune cell types, we prioritized potentially functional COVID-19-risk variants. Our study highlights the potential of COVID-19 genetic risk variants to impact the function of diverse immune cell types and influence severe disease manifestations.