The repair outcomes at site-specific DNA double-strand breaks (DSBs) generated by the RNA-guided DNA endonuclease Cas9 determine how gene function is altered. Despite the widespread adoption of CRISPR-Cas9 technology to induce DSBs for genome engineering, the resulting repair products have not been examined in depth. Here, the DNA repair profiles of 223 sites in the human genome demonstrate that the pattern of DNA repair following Cas9 cutting at each site is nonrandom and consistent across experimental replicates, cell lines, and reagent delivery methods. Furthermore, the repair outcomes are determined by the protospacer sequence rather than genomic context, indicating that DNA repair profiling in cell lines can be used to anticipate repair outcomes in primary cells. Chemical inhibition of DNA-PK enabled dissection of the DNA repair profiles into contributions from c-NHEJ and MMEJ. Finally, this work elucidates a strategy for using “error-prone” DNA-repair machinery to generate precise edits.
• DNA repair profiles of 223 sites in the human genome reveal nonrandom outcomes
• The protospacer sequence, not genomic context, determines Cas9 DSB repair outcomes
• Each DNA repair profile is composed of specific contributions from c-NHEJ and MMEJ
• DNA repair profiling can be used to anticipate repair outcomes in primary cells
van Overbeek, Capurso et al. demonstrate that repair outcomes are nonrandom at S. pyogenes Cas9-mediated DSBs and are determined by the protospacer sequence rather than genomic context. DNA repair profiling reveals specific contributions of c-NHEJ and MMEJ at each site and an approach to generate precise edits without exogenous template.
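The claim that repair profiles are consistent across replicates can be illustrated with a minimal sketch. It assumes a repair profile is summarized as the frequency of each indel size among edited reads, and it scores similarity as the shared probability mass of two profiles. Both the summary and the score are illustrative choices, not the metric used in the study, and the read counts below are invented toy values.

```python
from collections import Counter

def repair_profile(indel_sizes):
    """Turn a list of per-read indel sizes into a frequency distribution.

    Negative sizes are deletions and positive sizes are insertions
    (an assumed encoding for this toy example).
    """
    if not indel_sizes:
        raise ValueError("at least one edited read is required")
    counts = Counter(indel_sizes)
    total = sum(counts.values())
    return {size: n / total for size, n in counts.items()}

def profile_overlap(p, q):
    """Shared probability mass of two profiles: 1.0 if identical, 0.0 if disjoint."""
    keys = set(p) | set(q)
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)

# Toy reads (hypothetical): two replicates of one site and one unrelated site.
replicate_1 = repair_profile([1] * 50 + [-1] * 30 + [-7] * 20)
replicate_2 = repair_profile([1] * 48 + [-1] * 32 + [-7] * 20)
other_site = repair_profile([-3] * 70 + [1] * 30)

same_site_overlap = profile_overlap(replicate_1, replicate_2)
cross_site_overlap = profile_overlap(replicate_1, other_site)
```

In this toy setup the two replicates share most of their mass while the unrelated site does not, which is the kind of contrast a nonrandom, site-specific profile would produce.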
RNA-guided CRISPR-Cas9 endonucleases are widely used for genome engineering, but our understanding of Cas9 specificity remains incomplete. Here, we developed a biochemical method (SITE-Seq), using Cas9 programmed with single-guide RNAs (sgRNAs), to identify the sequence of cut sites within genomic DNA. Cells edited with the same Cas9-sgRNA complexes are then assayed for mutations at each cut site using amplicon sequencing. We used SITE-Seq to examine Cas9 specificity with sgRNAs targeting the human genome. The number of sites identified depended on sgRNA sequence and nuclease concentration. Sites identified at lower concentrations showed a higher propensity for off-target mutations in cells. The list of off-target sites showing activity in cells was influenced by sgRNP delivery, cell type and duration of exposure to the nuclease. Collectively, our results underscore the utility of combining comprehensive biochemical identification of off-target sites with independent cell-based measurements of activity at those sites when assessing nuclease activity and specificity.
Repair of a chromosomal double-strand break (DSB) by gene conversion depends on the ability of the broken ends to encounter a donor sequence. To understand how chromosomal location of a target sequence affects DSB repair, we took advantage of genome-wide Hi-C analysis of yeast chromosomes to create a series of strains in which an induced site-specific DSB in budding yeast is repaired by a 2-kb donor sequence inserted at different locations. The efficiency of repair, measured by cell viability or competition between each donor and a reference site, showed a strong correlation (r = 0.85 and 0.79) with the contact frequencies of each donor with the DSB repair site. Repair efficiency depends on the distance between donor and recipient rather than any intrinsic limitation of a particular donor site. These results further demonstrate that the search for homology is the rate-limiting step in DSB repair and suggest that cells often fail to repair a DSB because they cannot locate a donor before other, apparently lethal, processes arise. The repair efficiency of a donor locus can be improved by four factors: slower 5′ to 3′ resection of the DSB ends, increased abundance of replication protein factor A (RPA), longer shared homology, or presence of a recombination enhancer element adjacent to a donor.
Applying supervised learning/classification techniques to epigenomic data may reveal properties that differentiate histone modifications. Previous analyses sought to classify nucleosomes containing histone H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s) or H2A.Z using human CD4+ T-cell chromatin immunoprecipitation sequencing (ChIP-Seq) data. However, these efforts only achieved modest accuracy with limited biological interpretation. Here, we investigate the impact of using appropriate data pre-processing, namely deduplication, normalization, and position- (peak-) finding to identify stable nucleosome positions, in conjunction with advanced classification algorithms, notably discriminatory motif feature selection and random forests. Performance assessments are based on accuracy and interpretative yield.
We achieved dramatically improved accuracy using histone modification features (99.0%; previous attempts, 68.3%) and DNA sequence features (94.1%; previous attempts, <60%). Furthermore, the algorithms elicited interpretable features that withstand permutation testing, including: the histone modifications H4K20me3 and H3K9me3, which are components of heterochromatin; and the motif TCCATT, which is part of the consensus sequence of satellite II and III DNA. Downstream analysis demonstrates that satellite II and III DNA in the human genome is occupied by stable nucleosomes containing H2A/H4R3me2s, H4K20me3, and/or H3K9me3, but not 18 other histone methylations. These results are consistent with the recent biochemical finding that H4R3me2s provides a binding site for the DNA methyltransferase (Dnmt3a) that methylates satellite II and III DNA.
Classification algorithms applied to appropriately pre-processed ChIP-Seq data can accurately discriminate between histone modifications. Algorithms that facilitate interpretation, such as discriminatory motif feature selection, have the added potential to impart information about underlying biological mechanism.
Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters.
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
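The three-partition scheme described above (discovery, training, validation) can be sketched in a few lines. This is not the published pipeline: the motif finder here is a crude stand-in that ranks k-mers by the difference in their presence rate between classes, the classifier is a nearest-centroid rule, and the data are synthetic sequences with a planted motif. All function names and parameters are hypothetical.

```python
import random
from collections import Counter

def make_data(n, rng, motif="TCCATT", length=30):
    """Synthetic data: odd-indexed sequences (label 1) carry a planted motif."""
    seqs, labels = [], []
    for i in range(n):
        s = [rng.choice("ACGT") for _ in range(length)]
        y = i % 2
        if y:
            p = rng.randrange(length - len(motif) + 1)
            s[p:p + len(motif)] = motif
        seqs.append("".join(s))
        labels.append(y)
    return seqs, labels

def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def discover_motifs(seqs, labels, k=6, n_motifs=3):
    """Stand-in motif finder: rank k-mers by class difference in presence rate."""
    pos, neg = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for s, y in zip(seqs, labels):
        (pos if y else neg).update(kmers(s, k))
    def score(m):
        return abs(pos[m] / n_pos - neg[m] / n_neg)
    return sorted(set(pos) | set(neg), key=lambda m: (-score(m), m))[:n_motifs]

def featurize(seq, motifs):
    """Binary presence/absence feature for each discovered motif."""
    return [1 if m in seq else 0 for m in motifs]

def train_centroids(X, y):
    """Nearest-centroid classifier: one mean feature vector per class."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, yy in zip(X, y) if yy == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda label: dist(cents[label]))

def run_pipeline(seed=0, n=300):
    rng = random.Random(seed)
    seqs, labels = make_data(n, rng)
    idx = list(range(n))
    rng.shuffle(idx)
    third = n // 3
    parts = [idx[:third], idx[third:2 * third], idx[2 * third:]]
    disc, train, valid = ([(seqs[i], labels[i]) for i in part] for part in parts)
    # 1. Discovery partition: find a small set of discriminatory motifs.
    motifs = discover_motifs(*zip(*disc))
    # 2. Training partition: fit the classifier on motif features only.
    cents = train_centroids([featurize(s, motifs) for s, _ in train],
                            [y for _, y in train])
    # 3. Validation partition: assess performance on untouched data.
    correct = sum(predict(cents, featurize(s, motifs)) == y for s, y in valid)
    return motifs, correct / len(valid)

motifs, accuracy = run_pipeline()
```

Because features are chosen on the discovery partition only, the validation estimate is not contaminated by feature selection. Either stand-in component could be swapped for a real motif finder or classifier, in keeping with the modularity the abstract describes.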
The genomes of multicellular organisms are extensively folded into 3D chromosome territories within the nucleus. Advanced 3D genome-mapping methods that combine proximity ligation and high-throughput sequencing (such as chromosome conformation capture, Hi-C), and chromatin immunoprecipitation techniques (such as chromatin interaction analysis by paired-end tag sequencing, ChIA-PET), have revealed topologically associating domains with frequent chromatin contacts, and have identified chromatin loops mediated by specific protein factors for insulation and regulation of transcription. However, these methods rely on pairwise proximity ligation and reflect population-level views, and thus cannot reveal the detailed nature of chromatin interactions. Although single-cell Hi-C potentially overcomes this issue, this method may be limited by the sparsity of data that is inherent to current single-cell assays. Recent advances in microfluidics have opened opportunities for droplet-based genomic analysis, but this approach has not yet been adapted for chromatin interaction analysis. Here we describe a strategy for multiplex chromatin-interaction analysis via droplet-based and barcode-linked sequencing, which we name ChIA-Drop. We demonstrate the robustness of ChIA-Drop in capturing complex chromatin interactions with single-molecule precision, which has not been possible using methods based on population-level pairwise contacts. By applying ChIA-Drop to Drosophila cells, we show that chromatin topological structures predominantly consist of multiplex chromatin interactions with high heterogeneity; ChIA-Drop also reveals promoter-centred multivalent interactions, which provide topological insights into transcription.
The spatial organization of the genome influences cellular function, notably gene regulation. Recent studies have assessed the three-dimensional (3D) co-localization of functional annotations (e.g. centromeres, long terminal repeats) using 3D genome reconstructions from Hi-C (genome-wide chromosome conformation capture) data; however, corresponding assessments for continuous functional genomic data (e.g. chromatin immunoprecipitation-sequencing (ChIP-seq) peak height) are lacking. Here, we demonstrate that applying bump hunting via the patient rule induction method (PRIM) to ChIP-seq data superposed on a Saccharomyces cerevisiae 3D genome reconstruction can discover 'functional 3D hotspots', regions in 3-space for which the mean ChIP-seq peak height is significantly elevated. For the transcription factor Swi6, the top hotspot by P-value contains MSB2 and ERG11, known Swi6 target genes on different chromosomes. We verify this finding in a number of ways. First, this top hotspot is relatively stable under PRIM across parameter settings. Second, this hotspot is among the top hotspots by mean outcome identified by an alternative algorithm, k-Nearest Neighbor (k-NN) regression. Third, the distance between MSB2 and ERG11 is smaller than expected (by resampling) in two other 3D reconstructions generated via different normalization and reconstruction algorithms. This analytic approach can discover functional 3D hotspots and potentially reveal novel regulatory interactions.
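The third verification step, checking whether two loci are closer than expected by resampling, can be sketched as follows. The abstract does not specify the null sampling scheme, so this sketch assumes the simplest one: drawing random pairs of loci from the same reconstruction and counting how often they are at least as close as the observed pair. The coordinates and locus names are synthetic placeholders, not data from the study.

```python
import math
import random

def resampling_pvalue(coords, a, b, n_resamples=2000, seed=0):
    """Empirical P-value that loci a and b are closer than random locus pairs.

    coords maps a locus name to its (x, y, z) position in a 3D reconstruction.
    The null (uniformly random pairs of loci) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed = math.dist(coords[a], coords[b])
    names = sorted(coords)
    hits = 0
    for _ in range(n_resamples):
        x, y = rng.sample(names, 2)
        if math.dist(coords[x], coords[y]) <= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return observed, (hits + 1) / (n_resamples + 1)

# Synthetic reconstruction: 200 background loci scattered in a unit cube,
# plus two hypothetical loci placed close together.
rng = random.Random(1)
coords = {f"bg_{i}": (rng.random(), rng.random(), rng.random()) for i in range(200)}
coords["locus_A"] = (0.500, 0.500, 0.500)
coords["locus_B"] = (0.505, 0.500, 0.500)

observed, p_value = resampling_pvalue(coords, "locus_A", "locus_B")
```

A more careful null would likely restrict resampled pairs to loci on different chromosomes or match other properties of the observed pair; the uniform scheme here is only the simplest starting point.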
Cancer gene discovery has relied extensively on analyzing tumors for gains and losses to reveal the location of oncogenes and tumor suppressor genes, respectively. Deletions of 1p36 are extremely common genetic lesions in human cancer, occurring in malignancies of epithelial, neural, and hematopoietic origin. Although this suggests that 1p36 harbors a gene that drives tumorigenesis when inactivated, the identity of this tumor suppressor has remained elusive. Here we use chromosome engineering to generate mouse models with gain and loss of a region corresponding to human 1p36. This approach functionally identifies chromodomain helicase DNA binding domain 5 (Chd5) as a tumor suppressor that controls proliferation, apoptosis, and senescence via the p19Arf/p53 pathway. We demonstrate that Chd5 functions as a tumor suppressor in vivo and implicate deletion of CHD5 in human cancer. Identification of this tumor suppressor provides new avenues for exploring innovative clinical interventions for cancer.
Asthma is a common disease with a complex risk architecture including both genetic and environmental factors. We performed a meta-analysis of North American genome-wide association studies of asthma in 5,416 individuals with asthma (cases) including individuals of European American, African American or African Caribbean, and Latino ancestry, with replication in an additional 12,649 individuals from the same ethnic groups. We identified five susceptibility loci. Four were at previously reported loci on 17q21, near IL1RL1, TSLP and IL33, but we report for the first time, to our knowledge, that these loci are associated with asthma risk in three ethnic groups. In addition, we identified a new asthma susceptibility locus at PYHIN1, with the association being specific to individuals of African descent (P = 3.9 × 10^−9). These results suggest that some asthma susceptibility loci are robust to differences in ancestry when sufficiently large sample sizes are investigated, and that ancestry-specific associations also contribute to the complex genetic architecture of asthma.