G-quadruplex (G4) DNA structures are extremely stable four-stranded secondary structures held together by noncanonical G-G base pairs. Genome-wide chromatin immunoprecipitation was used to determine the in vivo binding sites of the multifunctional
Saccharomyces cerevisiae Pif1 DNA helicase, a potent unwinder of G4 structures in vitro. G4 motifs were a significant subset of the high-confidence Pif1-binding sites. Replication slowed in the vicinity of these motifs, and they were prone to breakage in Pif1-deficient cells, whereas non-G4 Pif1-binding sites did not show this behavior. Introducing many copies of G4 motifs caused slow growth in replication-stressed Pif1-deficient cells, which was relieved by spontaneous mutations that eliminated their ability to form G4 structures, bind Pif1, slow DNA replication, and stimulate DNA breakage. These data suggest that G4 structures form in vivo and that they are resolved by Pif1 to prevent replication fork stalling and DNA breakage.
► S. cerevisiae Pif1 DNA helicase binds G-quadruplex (G4) motifs in vivo
► In the absence of Pif1, replication forks slow and chromosomes break near G4 motifs
► Extra G4 motifs cause slow growth in replication-stressed Pif1-deficient cells
► G4 structures form in vivo and decrease genome integrity unless unwound by Pif1
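As a rough illustration of what scanning a genome for "G4 motifs" can look like, the sketch below uses a commonly quoted sequence pattern: four runs of at least three guanines separated by loops of 1 to 7 bases. This pattern is an assumption for illustration only; the abstract does not state which motif definition the study used, and `find_g4_motifs` is a hypothetical helper name.

```python
import re

# Assumed motif definition (not taken from the abstract):
# four G-tracts of >= 3 Gs separated by loops of 1-7 nucleotides.
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_g4_motifs(seq):
    """Return sorted (start, end, strand) hits using 0-based, half-open coordinates.

    Matches found by re.finditer are non-overlapping, so nested or overlapping
    candidate motifs are reported only once per strand.
    """
    seq = seq.upper()
    n = len(seq)
    hits = [(m.start(), m.end(), "+") for m in G4_PATTERN.finditer(seq)]
    # A G-rich motif on the opposite strand appears as a C-rich run here,
    # so scan the reverse complement and map coordinates back.
    revcomp = seq.translate(_COMPLEMENT)[::-1]
    hits += [(n - m.end(), n - m.start(), "-") for m in G4_PATTERN.finditer(revcomp)]
    return sorted(hits)
```

Calling `find_g4_motifs` on a chromosome sequence would give candidate intervals that could then be intersected with binding-site coordinates; the intersection step is not shown.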
Motivation: Not all residues in a protein are equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. Results: We introduce an information-theoretic approach for estimating sequence conservation based on Jensen–Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein–protein interfaces. Availability: Data sets and code for all conservation measures evaluated are available at http://compbio.cs.princeton.edu/conservation/ Contact: mona@cs.princeton.edu Supplementary information: Supplementary data are available at Bioinformatics online.
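The two ideas in this abstract, a Jensen–Shannon divergence score per alignment column and a heuristic that mixes in the scores of sequential neighbors, can be sketched as follows. This is a minimal sketch under stated assumptions, not the released code: the uniform background distribution, the pseudocount, the window size of 3, and the mixing weight `lam = 0.5` are all illustrative choices, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies for one alignment column (gaps are ignored)."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def conservation_scores(alignment, background=None, window=3, lam=0.5):
    """Return (raw, smoothed) per-column scores for equal-length sequences."""
    if background is None:
        # Assumption: uniform background; a substitution-matrix-derived
        # background could be passed in instead.
        background = [1 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    n_cols = len(alignment[0])
    raw = [
        js_divergence(column_distribution([s[i] for s in alignment]), background)
        for i in range(n_cols)
    ]
    smoothed = []
    for i, score in enumerate(raw):
        lo, hi = max(0, i - window), min(n_cols, i + window + 1)
        neighbors = [raw[j] for j in range(lo, hi) if j != i]
        if neighbors:
            # Mix the site's own score with the mean of its sequence neighbors.
            score = lam * score + (1 - lam) * sum(neighbors) / len(neighbors)
        smoothed.append(score)
    return raw, smoothed
```

A column that matches the background scores near zero, while a fully conserved column scores high; the smoothing step pulls isolated high or low scores toward their neighborhood.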
Predicting mutation-induced changes in protein thermodynamic stability (ΔΔG) is of great interest in protein engineering, variant interpretation, and protein biophysics. We introduce ThermoNet, a deep, 3D-convolutional neural network (3D-CNN) designed for structure-based prediction of ΔΔGs upon point mutation. To leverage the image-processing power inherent in CNNs, we treat protein structures as if they were multi-channel 3D images. In particular, the inputs to ThermoNet are uniformly constructed as multi-channel voxel grids based on biophysical properties derived from raw atom coordinates. We train and evaluate ThermoNet with a curated data set that accounts for protein homology and is balanced with direct and reverse mutations; this provides a framework for addressing biases that have likely influenced many previous ΔΔG prediction methods. ThermoNet demonstrates performance comparable to the best available methods on the widely used Ssym test set. In addition, ThermoNet accurately predicts the effects of both stabilizing and destabilizing mutations, while most other methods exhibit a strong bias towards predicting destabilization. We further show that homology between Ssym and widely used training sets like S2648 and VariBench has likely led to overestimated performance in previous studies. Finally, we demonstrate the practical utility of ThermoNet in predicting the ΔΔGs for two clinically relevant proteins, p53 and myoglobin, and for pathogenic and benign missense variants from ClinVar. Overall, our results suggest that 3D-CNNs can model the complex, non-linear interactions perturbed by mutations, directly from biophysical properties of atoms.
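To make the "multi-channel voxel grid" idea concrete, the sketch below bins atoms into a cubic grid centered on a point of interest, with one channel per atom property. It is an illustration only: the grid size, the 1 Å spacing, the channel names, and the simple count-based occupancy are assumptions, not details given in the abstract, and `voxelize` is a hypothetical function.

```python
def voxelize(atoms, center, channels, size=16, spacing=1.0):
    """Bin atoms into grid[channel][i][j][k] occupancy counts.

    atoms    -- iterable of (x, y, z, channel_name) tuples
    center   -- (x, y, z) of the box center, e.g. the mutated residue
    channels -- ordered list of channel names (hypothetical property labels)
    """
    half = size * spacing / 2.0
    index = {name: c for c, name in enumerate(channels)}
    grid = [
        [[[0.0] * size for _ in range(size)] for _ in range(size)]
        for _ in channels
    ]
    for x, y, z, channel in atoms:
        if channel not in index:
            continue
        # Shift coordinates so the box's lower corner is the origin, then bin.
        ijk = [
            int((coord - c0 + half) // spacing)
            for coord, c0 in zip((x, y, z), center)
        ]
        if all(0 <= v < size for v in ijk):
            i, j, k = ijk
            grid[index[channel]][i][j][k] += 1.0
    return grid
```

A grid built this way has the same shape for every mutation site, which is what allows a 3D convolutional network to consume it like a multi-channel image.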
Heart development is exquisitely sensitive to the precise temporal regulation of thousands of genes that govern developmental decisions during differentiation. However, we currently lack a detailed understanding of how chromatin and gene expression patterns are coordinated during developmental transitions in the cardiac lineage. Here, we interrogated the transcriptome and several histone modifications across the genome during defined stages of cardiac differentiation. We find distinct chromatin patterns that are coordinated with stage-specific expression of functionally related genes, including many human disease-associated genes. Moreover, we discover a novel preactivation chromatin pattern at the promoters of genes associated with heart development and cardiac function. We further identify stage-specific distal enhancer elements and find enriched DNA binding motifs within these regions that predict sets of transcription factors that orchestrate cardiac differentiation. Together, these findings form a basis for understanding developmentally regulated chromatin transitions during lineage commitment and the molecular etiology of congenital heart disease.
Advances over the past 25 years have revealed much about how the structural properties of membranes and associated proteins are linked to the thermodynamics and kinetics of membrane protein (MP) folding. At the same time biochemical progress has outlined how cellular proteostasis networks mediate MP folding and manage misfolding in the cell. When combined with results from genomic sequencing, these studies have established paradigms for how MP folding and misfolding are linked to the molecular etiologies of a variety of diseases. This emerging framework has paved the way for the development of a new class of small molecule “pharmacological chaperones” that bind to and stabilize misfolded MP variants, some of which are now in clinical use. In this review, we comprehensively outline current perspectives on the folding and misfolding of integral MPs as well as the mechanisms of cellular MP quality control. Based on these perspectives, we highlight new opportunities for innovations that bridge our molecular understanding of the energetics of MP folding with the nuanced complexity of biological systems. Given the many linkages between MP misfolding and human disease, we also examine some of the exciting opportunities to leverage these advances to address emerging challenges in the development of therapeutics and precision medicine.
Genomic regions with gene regulatory enhancer activity turn over rapidly across mammals. In contrast, gene expression patterns and transcription factor binding preferences are largely conserved between mammalian species. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. To investigate this hypothesis, we evaluated the extent to which sequence patterns that are predictive of enhancers in one species are predictive of enhancers in other mammalian species by training and testing two types of machine learning models. We trained support vector machine (SVM) and convolutional neural network (CNN) classifiers to distinguish enhancers defined by histone marks from the genomic background based on DNA sequence patterns in human, macaque, mouse, dog, cow, and opossum. The classifiers accurately identified many adult liver, developing limb, and developing brain enhancers, and the CNNs outperformed the SVMs. Furthermore, classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. We observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. This indicates that many short sequence patterns predictive of enhancers are largely conserved. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Our results suggest that short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution.
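One common way to expose "short sequence patterns" to a classifier such as an SVM is a k-mer spectrum: the normalized counts of every length-k substring, with each k-mer merged with its reverse complement. The sketch below shows only that feature-extraction step. It assumes a k-mer representation with k = 5 as the default, which the abstract does not specify, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Collapse a k-mer and its reverse complement to one representative."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def kmer_spectrum(seq, k=5):
    """Count canonical k-mers, skipping windows with ambiguous bases (e.g. N)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def feature_vector(seq, k=5):
    """Fixed-order, frequency-normalized feature vector for a classifier."""
    vocab = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = kmer_spectrum(seq, k)
    total = sum(counts.values()) or 1
    return [counts[v] / total for v in vocab]
```

Because the vocabulary depends only on k, vectors built from sequences of different species share the same feature space, so a model fit on one species can be applied directly to another.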
Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster.
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depends on the nature of the prediction task.
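The synthetic-data step described above, planting TFBS instances into background sequence, can be sketched as follows. This is a simplified illustration under assumptions: motifs are given as fixed consensus strings rather than sampled from position weight matrices, the background is uniform random DNA, and `plant_motifs` is a hypothetical helper, not the authors' simulator.

```python
import random


def plant_motifs(length, motifs, rng, max_tries=1000):
    """Build a random DNA sequence with non-overlapping motif instances.

    Returns (sequence, placements), where placements is a list of
    (start, motif) pairs for the motifs that were successfully placed.
    """
    seq = [rng.choice("ACGT") for _ in range(length)]
    occupied = [False] * length
    placements = []
    for motif in motifs:
        if len(motif) > length:
            continue  # motif cannot fit in this sequence
        for _ in range(max_tries):
            start = rng.randrange(0, length - len(motif) + 1)
            span = range(start, start + len(motif))
            if not any(occupied[p] for p in span):
                for p, base in zip(span, motif):
                    seq[p] = base
                    occupied[p] = True
                placements.append((start, motif))
                break
    return "".join(seq), placements
```

Passing several copies of one motif gives a homotypic cluster, while passing different motifs gives a heterotypic one; the returned placements serve as ground truth when checking what a trained model has learned.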
Quantification of the tolerance of protein sites to genetic variation has become a cornerstone of variant interpretation. We hypothesize that the constraint on missense variation at individual amino acid sites is largely shaped by direct interactions with 3D neighboring sites. To quantify this constraint, we introduce a framework called COntact Set MISsense tolerance (or COSMIS) and comprehensively map the landscape of 3D mutational constraint on 6.1 million amino acid sites covering 16,533 human proteins. We show that 3D mutational constraint is pervasive and that the level of constraint is strongly associated with disease relevance both at the site and the protein level. We demonstrate that COSMIS performs significantly better at variant interpretation tasks than other population-based constraint metrics while also providing structural insight into the functional roles of constrained sites. We anticipate that COSMIS will facilitate the interpretation of protein-coding variation in evolution and prioritization of sites for mechanistic investigation.
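The general idea of scoring a site by the variation observed across its 3D neighbors can be sketched as below. This is not the COSMIS algorithm: the 8 Å distance cutoff, the use of one coordinate per residue, and the Poisson-style standardized score are all assumptions chosen for illustration, and the expected counts are simply taken as an input.

```python
import math


def contact_sets(coords, cutoff=8.0):
    """For each residue, list the residues within `cutoff` (itself included)."""
    n = len(coords)
    return [
        [j for j in range(n) if math.dist(coords[i], coords[j]) <= cutoff]
        for i in range(n)
    ]


def contact_set_constraint(coords, observed, expected, cutoff=8.0):
    """Standardized observed-minus-expected missense count per contact set.

    observed -- per-residue counts of missense variants seen in a population
    expected -- per-residue counts expected under some neutral model
    Negative scores indicate fewer variants than expected (more constraint).
    """
    scores = []
    for members in contact_sets(coords, cutoff):
        obs = sum(observed[j] for j in members)
        exp = sum(expected[j] for j in members)
        # Assumption: Poisson-like scaling, used here only as a placeholder.
        scores.append((obs - exp) / math.sqrt(exp) if exp > 0 else 0.0)
    return scores
```

Pooling counts over a contact set rather than a single site is what lets sparse population data say something about individual positions.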
Non-coding gene regulatory enhancers are essential to transcription in mammalian cells. As a result, a large variety of experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. Given the differences in the biological signals assayed, some variation in the enhancers identified by different methods is expected; however, the concordance of enhancers identified by different methods has not been comprehensively evaluated. This is critically needed, since in practice, most studies consider enhancers identified by only a single method. Here, we compare enhancer sets from eleven representative strategies in four biological contexts.
All sets we evaluated overlap significantly more than expected by chance; however, there is significant dissimilarity in their genomic, evolutionary, and functional characteristics, both at the element and base-pair level, within each context. The disagreement is sufficient to influence interpretation of candidate SNPs from GWAS studies, and to lead to disparate conclusions about enhancer and disease mechanisms. Most regions identified as enhancers are supported by only one method, and we find limited evidence that regions identified by multiple methods are better candidates than those identified by a single method. As a result, we cannot recommend the use of any single enhancer identification strategy in all settings.
Our results highlight the inherent complexity of enhancer biology and identify an important challenge to mapping the genetic architecture of complex disease. Greater appreciation of how the diverse enhancer identification strategies in use today relate to the dynamic activity of gene regulatory regions is needed to enable robust and reproducible results.
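Base-pair-level agreement between two enhancer sets can be summarized with a Jaccard index over the genomic positions each set covers. The sketch below is one simple way to compute it, assuming half-open intervals on a single chromosome; it is not necessarily the similarity measure used in the study, and the function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bp(intervals):
    """Total number of base pairs covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def intersection_bp(a, b):
    """Base pairs covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def bp_jaccard(a, b):
    """Shared base pairs divided by base pairs covered by either set."""
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0
```

For a genome-wide comparison the same calculation would be run per chromosome and the intersection and union totals summed before dividing.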
Lysine acetylation regulates transcription by targeting histones and nonhistone proteins. Here we report that the central regulator of transcription, RNA polymerase II, is subject to acetylation in mammalian cells. Acetylation occurs at eight lysines within the C-terminal domain (CTD) of the largest polymerase subunit and is mediated by p300/KAT3B. CTD acetylation is specifically enriched downstream of the transcription start sites of polymerase-occupied genes genome-wide, indicating a role in early stages of transcription initiation or elongation. Mutation of lysines or p300 inhibitor treatment causes the loss of epidermal growth-factor-induced expression of c-Fos and Egr2, immediate-early genes with promoter-proximally paused polymerases, but does not affect expression or polymerase occupancy at housekeeping genes. Our studies identify acetylation as a new modification of the mammalian RNA polymerase II required for the induction of growth factor response genes.