Nucleosome organization has been suggested to affect local mutation rates in the genome. However, the lack of de novo mutation and high-resolution nucleosome data has limited the investigation of this hypothesis. Additionally, analyses using indirect mutation rate measurements have yielded contradictory and potentially confounding results. Here, we combine data on >300,000 human de novo mutations with high-resolution nucleosome maps and find substantially elevated mutation rates around translationally stable ('strong') nucleosomes. We show that the mutational mechanisms affected by strong nucleosomes are low-fidelity replication, insufficient mismatch repair and increased double-strand breaks. Strong nucleosomes preferentially locate within young SINE/LINE transposons, suggesting that when subject to increased mutation rates, transposons are then more rapidly inactivated. Depletion of strong nucleosomes in older transposons suggests frequent positioning changes during evolution. The findings have important implications for human genetics and genome evolution.
There are ∼650,000 Alu elements in transcribed regions of the human genome. These elements contain cryptic splice sites, so they are in constant danger of aberrant incorporation into mature transcripts. Despite posing a major threat to transcriptome integrity, little is known about the molecular mechanisms preventing their inclusion. Here, we present a mechanism for protecting the human transcriptome from the aberrant exonization of transposable elements. Quantitative iCLIP data show that the RNA-binding protein hnRNP C competes with the splicing factor U2AF65 at many genuine and cryptic splice sites. Loss of hnRNP C leads to formation of previously suppressed Alu exons, which severely disrupt transcript function. Minigene experiments explain disease-associated mutations in Alu elements that hamper hnRNP C binding. Thus, by preventing U2AF65 binding to Alu elements, hnRNP C plays a critical role as a genome-wide sentinel protecting the transcriptome. The findings have important implications for human evolution and disease.
► Quantitative iCLIP reveals genome-wide competition of hnRNP C and U2AF65
► hnRNP C is a global repressor of aberrant exonization of thousands of Alu elements
► Disease-associated mutations in Alu elements hinder hnRNP C-dependent repression
► Selection reinforces strong hnRNP C binding to contain Alu exonization
The RNA-binding protein hnRNP C prevents the formation of aberrant Alu exons by blocking the binding of the splicing factor U2AF65 to potential Alu splice sites. Breakdown of this system leads to expression of thousands of harmful exons and to human disease.
Prognostic modelling is important in clinical practice and epidemiology for patient management and research. Electronic health records (EHR) provide large quantities of data for such models, but conventional epidemiological approaches require significant researcher time to implement. Expert selection of variables, fine-tuning of variable transformations and interactions, and imputing missing values are time-consuming and could bias subsequent analysis, particularly given that missingness in EHR is high and may carry meaning. Using a cohort of 80,000 patients from the CALIBER programme, we compared traditional modelling and machine-learning approaches in EHR. First, we used Cox models and random survival forests with and without imputation on 27 expert-selected, preprocessed variables to predict all-cause mortality. We then used Cox models, random forests and elastic net regression on an extended dataset with 586 variables to build prognostic models and identify novel prognostic factors without prior expert input. We observed that data-driven models used on an extended dataset can outperform conventional models for prognosis, without data preprocessing or imputing missing values. An elastic net Cox regression based on 586 unimputed variables with continuous values discretised achieved a C-index of 0.801 (bootstrapped 95% CI 0.799 to 0.802), compared to 0.793 (0.791 to 0.794) for a traditional Cox model comprising 27 expert-selected variables with imputation for missing values. We also found that data-driven models allow identification of novel prognostic variables; that the absence of values for particular variables carries meaning, and can have significant implications for prognosis; and that variables often have a nonlinear association with mortality, which discretised Cox models and random forests can elucidate.
This demonstrates that machine-learning approaches applied to raw EHR data can be used to build models for use in research and clinical practice, and identify novel predictive variables and their effects to inform future research.
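The C-index reported above measures how well a model ranks patients by risk. As a minimal sketch, assuming the common Harrell formulation (the abstract does not say which variant or software was used), it can be computed as the fraction of comparable patient pairs in which the patient with the shorter observed survival time was given the higher risk score. The function name `concordance_index` and the toy data are hypothetical and serve only as illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index (illustrative sketch, not the study's implementation).

    times       -- observed follow-up time per patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified version
        # Let a be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk assigned to the earlier death
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one censored.
toy_c = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.7, 0.8, 0.1],
)
print(toy_c)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the difference between 0.793 and 0.801 is interpreted together with its bootstrapped confidence intervals.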
Although the proteins that read the gene regulatory code, transcription factors (TFs), have been largely identified, it is not well known which sequences TFs can recognize. We have analyzed the sequence-specific binding of human TFs using high-throughput SELEX and ChIP sequencing. A total of 830 binding profiles were obtained, describing 239 distinctly different binding specificities. The models represent the majority of human TFs, approximately doubling the coverage compared to existing systematic studies. Our results reveal additional specificity determinants for a large number of factors for which a partial specificity was known, including a commonly observed A- or T-rich stretch that flanks the core motifs. Global analysis of the data revealed that homodimer orientation and spacing preferences, and base-stacking interactions, have a larger role in TF-DNA binding than previously appreciated. We further describe a binding model incorporating these features that is required to understand binding of TFs to DNA.
► High-resolution binding profiles representing most human transcription factors
► High-throughput SELEX can identify long and dimeric sites
► Full-length protein and DNA-binding domain specificities are similar
► Adjacent bases affect TF-DNA binding more than previously thought
High-throughput SELEX is used to determine high-resolution binding profiles representing most human transcription factors. Base-stacking interactions, and dimer orientation and spacing preferences, have a larger role in TF-DNA binding than previously appreciated.
The CRISPR-Cas9 system has successfully been adapted to edit the genome of various organisms. However, our ability to predict the editing outcome at specific sites is limited. Here, we examined indel profiles at over 1,000 genomic sites in human cells and uncovered general principles guiding CRISPR-mediated DNA editing. We find that precision of DNA editing (i.e., recurrence of a specific indel) varies considerably among sites, with some targets showing one highly preferred indel and others displaying numerous infrequent indels. Editing precision correlates with editing efficiency and a preference for single-nucleotide homologous insertions. Precise targets and editing outcome can be predicted based on simple rules that mainly depend on the fourth nucleotide upstream of the protospacer adjacent motif (PAM). Indel profiles are robust, but they can be influenced by chromatin features. Our findings have important implications for clinical applications of CRISPR technology and reveal general patterns of broken end joining that can provide insights into DNA repair mechanisms.
• The outcome of CRISPR-mediated editing can be predicted
• Not all target sites are edited in a predictable manner
• The precision of DNA editing is mainly determined by the fourth nucleotide upstream of the PAM site
• Chromatin states affect editing of imprecise, but not precise, target sites
Chakrabarti, Henser-Brownhill, Monserrat et al. show that the genome-editing outcome can be predicted based on simple rules that mainly depend on the target site sequence. Since editing precision varies considerably across sites, careful selection of a predictable target is critical to induce a desired modification in a cell-type-independent manner.
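The position that these rules depend on can be located programmatically. The sketch below assumes a target written 5' to 3' as a protospacer followed directly by a 3-nt NGG PAM (the usual SpCas9 convention, which the abstract does not state explicitly) and returns the fourth nucleotide upstream of the PAM. The function name `fourth_nt_upstream_of_pam` and the example sequence are hypothetical, and no prediction rule is encoded because the abstract does not specify which nucleotide favours which outcome.

```python
def fourth_nt_upstream_of_pam(target: str) -> str:
    """Return the nucleotide at position -4 relative to the PAM.

    Assumes `target` is protospacer + PAM on the same strand, 5' to 3',
    with a 3-nt NGG PAM at the 3' end (an assumption, not from the source).
    """
    target = target.upper()
    pam_len = 3
    if len(target) < pam_len + 4:
        raise ValueError("target too short to contain position -4 and a PAM")
    if not target.endswith("GG"):
        raise ValueError("expected an NGG PAM at the 3' end")
    # Position -1 is the base immediately 5' of the PAM, so position -4
    # sits at index -(pam_len + 4) counted from the end of the string.
    return target[-(pam_len + 4)]


# Hypothetical 20-nt protospacer with T at position -4, followed by AGG.
example_target = "A" * 16 + "T" + "CCC" + "AGG"
print(fourth_nt_upstream_of_pam(example_target))
```

Grouping candidate sites by this nucleotide would be one way to screen for predictable targets before consulting the published rules.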
DNA is subject to constant chemical modification and damage, which eventually results in variable mutation rates throughout the genome. Although detailed molecular mechanisms of DNA damage and repair are well understood, damage impact and execution of repair across a genome remain poorly defined.
To bridge the gap between our understanding of DNA repair and mutation distributions, we developed a novel method, AP-seq, capable of mapping apurinic sites and 8-oxo-7,8-dihydroguanine bases at approximately 250-bp resolution on a genome-wide scale. We directly demonstrate that the accumulation rate of apurinic sites varies widely across the genome, with hot spots acquiring many times more damage than cold spots. Unlike single nucleotide variants (SNVs) in cancers, damage burden correlates with marks for open chromatin, notably H3K9ac and H3K4me2. Apurinic sites and oxidative damage are also highly enriched in transposable elements and other repetitive sequences. In contrast, we observe a reduction at chromatin loop anchors with increased damage load towards inactive compartments. Less damage is found at promoters, exons, and termination sites, but not introns, in a seemingly transcription-independent but GC content-dependent manner. Leveraging cancer genomic data, we also find locally reduced SNV rates in promoters, coding sequence, and other functional elements.
Our study reveals that oxidative DNA damage accumulation and repair differ strongly across the genome, but culminate in a previously unappreciated mechanism that safeguards the regulatory and coding regions of genes from mutations.
RNA-binding proteins are key players in the regulation of gene expression. In this Progress article, we discuss state-of-the-art technologies that can be used to study individual RNA-binding proteins or large complexes such as the ribosome. We also describe how these approaches can be used to study interactions with different types of RNAs, including nascent transcripts, mRNAs, microRNAs and ribosomal RNAs, in order to investigate transcription, RNA processing and translation. Finally, we highlight current challenges in data analysis and the future steps that are needed to obtain a quantitative and high-resolution picture of protein-RNA interactions on a genome-wide scale.
The structure of messenger RNA is important for post-transcriptional regulation, mainly because it affects binding of trans-acting factors. However, little is known about the in vivo structure of full-length mRNAs. Here we present hiCLIP, a biochemical technique for transcriptome-wide identification of RNA secondary structures interacting with RNA-binding proteins (RBPs). Using this technique to investigate RNA structures bound by Staufen 1 (STAU1) in human cells, we uncover a dominance of intra-molecular RNA duplexes, a depletion of duplexes from coding regions of highly translated mRNAs, an unexpected prevalence of long-range duplexes in 3' untranslated regions (UTRs), and a decreased incidence of single nucleotide polymorphisms in duplex-forming regions. We also discover a duplex spanning 858 nucleotides in the 3' UTR of the X-box binding protein 1 (XBP1) mRNA that regulates its cytoplasmic splicing and stability. Our study reveals the fundamental role of mRNA secondary structures in gene expression and introduces hiCLIP as a widely applicable method for discovering new, especially long-range, RNA duplexes.
Mutations causing amyotrophic lateral sclerosis (ALS) strongly implicate ubiquitously expressed regulators of RNA processing. To understand the molecular impact of ALS-causing mutations on neuronal development and disease, we analysed transcriptomes during in vitro differentiation of motor neurons (MNs) from human control and patient-specific VCP mutant induced-pluripotent stem cells (iPSCs). We identify increased intron retention (IR) as a dominant feature of the splicing programme during early neural differentiation. Importantly, IR occurs prematurely in VCP mutant cultures compared with control counterparts. These aberrant IR events are also seen in independent RNAseq data sets from SOD1- and FUS-mutant MNs. The most significant IR is seen in the SFPQ transcript. The SFPQ protein binds extensively to its retained intron, exhibits lower nuclear abundance in VCP mutant cultures and is lost from nuclei of MNs in mouse models and human sporadic ALS. Collectively, we demonstrate SFPQ IR and nuclear loss as molecular hallmarks of familial and sporadic ALS.
The regulation of alternative splicing involves interactions between RNA-binding proteins and pre-mRNA positions close to the splice sites. T-cell intracellular antigen 1 (TIA1) and TIA1-like 1 (TIAL1) locally enhance exon inclusion by recruiting U1 snRNP to 5' splice sites. However, effects of TIA proteins on splicing of distal exons have not yet been explored. We used UV-crosslinking and immunoprecipitation (iCLIP) to find that TIA1 and TIAL1 bind at the same positions on human RNAs. Binding downstream of 5' splice sites was used to predict the effects of TIA proteins in enhancing inclusion of proximal exons and silencing inclusion of distal exons. The predictions were validated in an unbiased manner using splice-junction microarrays, RT-PCR, and minigene constructs, which showed that TIA proteins maintain splicing fidelity and regulate alternative splicing by binding exclusively downstream of 5' splice sites. Surprisingly, TIA binding at 5' splice sites silenced distal cassette and variable-length exons without binding in proximity to the regulated alternative 3' splice sites. Using transcriptome-wide high-resolution mapping of TIA-RNA interactions we evaluated the distal splicing effects of TIA proteins. These data are consistent with a model where TIA proteins shorten the time available for definition of an alternative exon by enhancing recognition of the preceding 5' splice site. Thus, our findings indicate that changes in splicing kinetics could mediate the distal regulation of alternative splicing.