Transcription factors (TFs) bind DNA in a sequence-specific manner and thereby serve as the protein anchors and determinants of 3D genome organization. Conversely, chromatin conformation shapes TF activity, for example, by looping TF-bound enhancers to distally located target genes. Despite considerable effort, our understanding of the mechanistic relation between TFs and 3D genome organization remains limited, in large part due to this interdependency. In this review, we summarize the evidence for the diverse mechanisms by which TFs and their activity shape the 3D genome and vice versa. We further highlight outstanding questions and potential approaches for untangling the complex relation between TF activity and the 3D genome.
Transcription factors shape 3D genome organization, which in turn affects transcription factor activity. Kim and Shendure review our current understanding of the mechanisms underlying this interdependency, discuss implications for transcriptional regulation, and propose approaches to address the key questions that remain.
The human gene catalogue is essentially complete, but we lack an equivalently vetted inventory of bona fide human enhancers. Hundreds of thousands of candidate enhancers have been nominated via biochemical annotations; however, only a handful of these have been validated and confidently linked to their target genes. Here we review emerging technologies for discovering, characterizing and validating human enhancers at scale. We furthermore propose a new framework for operationally defining enhancers that accommodates the heterogeneous and complementary results that are emerging from reporter assays, biochemical measurements and CRISPR screens.
Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here, we sought to apply deep convolutional neural networks toward that goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, termed Xpresso, more than doubles the accuracy of alternative sequence-based models and isolates rules as predictive as models relying on chromatin immunoprecipitation sequencing (ChIP-seq) data. Xpresso recapitulates genome-wide patterns of transcriptional activity, and its residuals can be used to quantify the influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that promoter-proximal CpG dinucleotides strongly predict transcriptional activity. Looking forward, we propose cell-type-specific gene-expression predictions based solely on primary sequences as a grand challenge for the field.
• Deep neural networks strongly predict gene expression levels solely from DNA sequence
• Models help infer transcriptional and post-transcriptional gene regulatory mechanisms
• Predictive power is competitive with models using thousands of biochemical datasets
• CpG dinucleotide content at core promoters is strongly predictive of mRNA abundance
Agarwal and Shendure show that deep neural networks can strongly predict mRNA abundance solely from promoter sequence. The residuals of these predictions facilitate inferences about the regulatory influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that CpG dinucleotide content at core promoters is associated with transcriptional activity.
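The CpG observation above can be made concrete with a small calculation. The following is a minimal sketch, not part of Xpresso: the function name `cpg_fraction` and the choice to normalize by the number of dinucleotide positions are assumptions made here for illustration.

```python
def cpg_fraction(seq: str) -> float:
    """Return the fraction of dinucleotide positions in seq that are CpG.

    Hypothetical helper for illustration only; the normalization
    (count / number of dinucleotide positions) is an assumption.
    """
    seq = seq.upper()
    n_positions = len(seq) - 1  # number of overlapping dinucleotides
    if n_positions < 1:
        return 0.0
    # Count positions where a C is immediately followed by a G.
    cpg_count = sum(1 for i in range(n_positions) if seq[i:i + 2] == "CG")
    return cpg_count / n_positions


if __name__ == "__main__":
    # Toy sequence, not a real promoter.
    print(cpg_fraction("ACGCGT"))
```

A feature like this is far simpler than a learned convolutional model; it only illustrates the kind of sequence property that model interpretation highlighted.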
New technologies have recently enabled saturation mutagenesis and functional analysis of nearly all possible variants of regulatory elements or proteins of interest in single experiments. Here we discuss the past, present, and future of such multiplexed (functional) assays for variant effects (MAVEs). MAVEs provide detailed insight into sequence-function relationships, and they may prove critical for the prospective clinical interpretation of genetic variants.
Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertions and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.
Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies.
It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants.
We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance.
While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.
Genome-wide knockout studies, noncoding deletion scans, and other large-scale studies require a simple and lightweight framework that can quickly discover and score thousands of candidate CRISPR guides targeting an arbitrary DNA sequence. While several CRISPR web applications exist, there is a need for a high-throughput tool to rapidly discover and process hundreds of thousands of CRISPR targets.
Here, we introduce FlashFry, a fast and flexible command-line tool for characterizing large numbers of CRISPR target sequences. With FlashFry, users can specify an unconstrained number of mismatches to putative off-targets, richly annotate discovered sites, and tag potential guides with commonly used on-target and off-target scoring metrics. FlashFry runs at speeds comparable to commonly used genome-wide sequence aligners, and output is provided as an easy-to-manipulate text file.
FlashFry is a fast and convenient command-line tool to discover and score CRISPR targets within large DNA sequences.
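To illustrate what discovering CRISPR targets in a DNA sequence involves, the sketch below enumerates candidate sites on both strands and counts mismatches between a guide and a putative off-target. It is not FlashFry's implementation or interface. It assumes an SpCas9-style site (a 20-nt protospacer followed by an NGG PAM), which the text above does not specify, and all function names are hypothetical.

```python
import re

# Complement table for uppercase DNA; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidate_targets(seq: str, protospacer_len: int = 20):
    """List (start, strand, protospacer, pam) for each protospacer + NGG site.

    Assumption: SpCas9-style sites. `start` is the 0-based forward-strand
    coordinate of the leftmost base covered by the site.
    """
    seq = seq.upper()
    site_len = protospacer_len + 3
    # A zero-width lookahead lets finditer report overlapping sites.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    hits = []
    for m in pattern.finditer(seq):
        hits.append((m.start(), "+", m.group(1), m.group(2)))
    rc = reverse_complement(seq)
    for m in pattern.finditer(rc):
        # Map the position on the reverse complement back to the forward strand.
        start = len(seq) - (m.start() + site_len)
        hits.append((start, "-", m.group(1), m.group(2)))
    return hits


def mismatches(guide: str, site: str) -> int:
    """Count mismatched positions (Hamming distance) between equal-length strings."""
    if len(guide) != len(site):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(guide, site))


if __name__ == "__main__":
    toy = "A" * 20 + "TGG"  # toy sequence with exactly one forward-strand site
    print(find_candidate_targets(toy))
```

A linear scan like this is adequate for short inputs. Scoring off-targets across a whole genome requires indexed search, which is the problem a dedicated tool addresses.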
Gene expression programs change over time, differentiation and development, and in response to stimuli. However, nearly all techniques for profiling gene expression in single cells do not directly capture transcriptional dynamics. Here, we present a method for combined single-cell combinatorial indexing and messenger RNA labeling (sci-fate), which uses combinatorial cell indexing and 4-thiouridine labeling of newly synthesized mRNA to concurrently profile the whole and newly synthesized transcriptome in each of many single cells. We used sci-fate to study the cortisol response in >6,000 single cultured cells. From these data, we quantified the dynamics of the cell cycle and glucocorticoid receptor activation, and explored their intersection. Finally, we developed software to infer and analyze cell-state transitions. We anticipate that sci-fate will be broadly applicable to quantitatively characterize transcriptional dynamics in diverse systems.
Although we can increasingly measure transcription, chromatin, methylation, and other aspects of molecular biology at single-cell resolution, most assays survey only one aspect of cellular biology. Here we describe sci-CAR, a combinatorial indexing-based coassay that jointly profiles chromatin accessibility and mRNA (CAR) in each of thousands of single cells. As a proof of concept, we apply sci-CAR to 4,825 cells, including a time series of dexamethasone treatment, as well as to 11,296 cells from the adult mouse kidney. With the resulting data, we compare the pseudotemporal dynamics of chromatin accessibility and gene expression, reconstruct the chromatin accessibility profiles of cell types defined by RNA profiles, and link cis-regulatory sites to their target genes on the basis of the covariance of chromatin accessibility and transcription across large numbers of single cells.
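The idea of linking cis-regulatory sites to genes through covariance can be sketched with a plain correlation across cells. This is a simplified stand-in, and the published analysis may differ: it uses the Pearson correlation with an arbitrary cutoff, and the names `pearson`, `link_sites_to_gene` and `min_r`, as well as the toy data, are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no covariance signal
    return cov / sqrt(var_x * var_y)


def link_sites_to_gene(site_accessibility, gene_expression, min_r=0.5):
    """Return (site_id, r) pairs with r >= min_r, strongest first.

    site_accessibility: dict mapping a site id to per-cell accessibility values.
    gene_expression: per-cell expression values for one gene, same cell order.
    min_r is an arbitrary illustrative cutoff, not a published threshold.
    """
    links = []
    for site_id, profile in site_accessibility.items():
        r = pearson(profile, gene_expression)
        if r >= min_r:
            links.append((site_id, r))
    return sorted(links, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy values for four cells; not real data.
    expression = [1, 2, 3, 4]
    sites = {"site_a": [2, 4, 6, 8], "site_b": [4, 3, 2, 1], "site_c": [1, 1, 1, 1]}
    print(link_sites_to_gene(sites, expression))
```

Single-cell measurements are sparse, so in practice values are usually aggregated over groups of similar cells before any such comparison; the sketch omits that step.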
Microsatellite instability (MSI), the spontaneous loss or gain of nucleotides from repetitive DNA tracts, is a diagnostic phenotype for gastrointestinal, endometrial, and colorectal tumors, yet the landscape of instability events across a wider variety of cancer types remains poorly understood. To explore MSI across malignancies, we examined 5,930 cancer exomes from 18 cancer types at more than 200,000 microsatellite loci and constructed a genomic classifier for MSI. We identified MSI-positive tumors in 14 of the 18 cancer types. We also identified loci that were more likely to be unstable in particular cancer types, resulting in specific instability signatures that involved cancer-associated genes, suggesting that instability patterns reflect selective pressures and can potentially identify novel cancer drivers. We also observed a correlation between survival outcomes and the overall burden of unstable microsatellites, suggesting that MSI may be a continuous, rather than discrete, phenotype that is informative across cancer types. These analyses offer insight into conserved and cancer-specific properties of MSI and reveal opportunities for improved methods of clinical MSI diagnosis and cancer gene discovery.