Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence-function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.
Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well suited to this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate six high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine-learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single-task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family-level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction-based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.
Abstract
Motivation
Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
Results
The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.
Availability and implementation
The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.
Supplementary information
Supplementary data are available at Bioinformatics online.
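The Motivation above notes that a protein sequence must be converted to a vector before a model can use it. As a point of comparison for the low-dimensional embeddings, the following is a minimal sketch of one-hot encoding, a common high-dimensional baseline. It is not taken from the authors' repository; the names AMINO_ACIDS and one_hot_encode are hypothetical, and only the 20 standard amino acids are assumed.

```python
from typing import List

# Assumption: only the 20 standard amino acids appear in the input.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> List[float]:
    """Flatten a sequence into a vector of length 20 * len(sequence).

    Raises KeyError if the sequence contains a non-standard residue.
    """
    width = len(AMINO_ACIDS)
    vector = [0.0] * (width * len(sequence))
    for position, residue in enumerate(sequence):
        # Exactly one entry per position is set to 1.
        vector[position * width + AA_INDEX[residue]] = 1.0
    return vector


encoded = one_hot_encode("MKV")
print(len(encoded))  # 60 entries for a 3-residue sequence
```

The vector grows by 20 entries per residue, so a protein of a few hundred residues already yields thousands of dimensions. A learned embedding of fixed, small size avoids that growth.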
Protein sequence design with deep generative models Wu, Zachary; Johnston, Kadina E.; Arnold, Frances H. ...
Current Opinion in Chemical Biology, December 2021, Volume 65
Journal Article
Peer reviewed
Open access
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Conjecture II.3.6 of Spohn in [47] and Lecture 7 of Jensen–Yau in [35] ask for a general derivation of universal fluctuations of hydrodynamic limits in large-scale stochastic interacting particle systems. However, the past few decades have witnessed only minimal progress according to [26]. In this paper, we develop a general method for deriving the so-called Boltzmann–Gibbs principle for a general family of nonintegrable and nonstationary interacting particle systems, thereby responding to Spohn and Jensen–Yau. Most importantly, our method depends mostly on local and dynamical, and thus more general/universal, features of the model. This contrasts with previous work [6, 8, 24, 34], all of which rely on global and nonuniversal assumptions on invariant measures or initial measures of the model. As a concrete application of the method, we derive the KPZ equation as a large-scale limit of the height functions for a family of nonstationary and nonintegrable exclusion processes with an environment-dependent asymmetry. This establishes a first result to Big Picture Question 1.6 in [54] for nonstationary and nonintegrable ‘speed-change’ models that have also been of interest beyond KPZ [18, 22, 23, 38].
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
• Machine learning can guide efficient exploration of a combinatorially large protein search space.
• Model-based optimization finds the best sequences based on a trained sequence-to-function surrogate model.
• Sequential optimization strategies help guide multiple rounds of computational prediction and experimental measurement.
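One selection step of the sequential optimization described above can be sketched as follows. This is a minimal illustration, not the review's own method: it assumes the surrogate model has already produced a predicted mean and standard deviation for each candidate, and it uses an upper-confidence-bound score (mean plus beta times standard deviation), which is one common acquisition rule among several. The function name select_batch_ucb, the parameter beta, and the example sequences and numbers are all hypothetical.

```python
from typing import List, Sequence


def select_batch_ucb(
    candidates: Sequence[str],
    means: Sequence[float],
    stds: Sequence[float],
    batch_size: int,
    beta: float = 1.0,
) -> List[str]:
    """Pick the batch_size candidates with the highest mean + beta * std."""
    scores = [m + beta * s for m, s in zip(means, stds)]
    # Sort candidate indices from highest to lowest score.
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:batch_size]]


# Made-up surrogate predictions for four hypothetical variants.
candidates = ["AAV", "AGV", "CAV", "CGV"]
means = [1.0, 1.2, 0.8, 1.1]
stds = [0.1, 0.1, 0.6, 0.05]

print(select_batch_ucb(candidates, means, stds, batch_size=2, beta=1.0))
print(select_batch_ucb(candidates, means, stds, batch_size=2, beta=0.0))
```

With beta set to 0 the rule picks the sequences with the best predicted values. A larger beta favors uncertain candidates, whose measurement may improve the model in the next round.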
Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where unanticipated imprecision wastes valuable time and resources. The need for UQ is especially acute for neural models, which are becoming increasingly standard yet are challenging to interpret. While several approaches to UQ have been proposed in the literature, there is no clear consensus on the comparative performance of these models. In this paper, we study this question in the context of regression tasks. We systematically evaluate several methods on five regression data sets using multiple complementary performance metrics. Our experiments show that none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple data sets. While we believe that these results show that existing UQ methods are not sufficient for all common use cases and further research is needed, we conclude with a practical recommendation as to which existing techniques seem to perform well relative to others.
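One way to check whether a UQ method ranks errors reliably is to compute the Spearman rank correlation between the predicted uncertainties and the absolute prediction errors. The sketch below is an illustration under that assumption; the abstract does not list which metrics the paper uses. The function names and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def uncertainty_error_spearman(
    predictions: Sequence[float],
    targets: Sequence[float],
    uncertainties: Sequence[float],
) -> float:
    """Spearman correlation between uncertainty and absolute error."""
    abs_errors = [abs(p - t) for p, t in zip(predictions, targets)]
    return pearson(average_ranks(uncertainties), average_ranks(abs_errors))


# Made-up values: the largest uncertainty coincides with the largest error.
rho = uncertainty_error_spearman(
    predictions=[1.0, 2.0, 3.0, 4.0],
    targets=[1.1, 2.5, 2.0, 4.0],
    uncertainties=[0.2, 0.4, 0.9, 0.1],
)
print(rho)
```

A value near 1 means that larger predicted uncertainties go with larger errors. A value near 0 means the uncertainties say little about which predictions to trust.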
Objective
To report on an early adopter series of collagenase Clostridium histolyticum (CCh) for Peyronie's disease (PD). Postapproval studies of CCh have been anticipated after recent Food and Drug Administration authorization of its use for men with PD, as definitive and durable nonsurgical interventions have been long desired.
Materials and Methods
From May 2014 to October 2015, a database consisting of PD patients with >30° of penile curvature received CCh from a single provider at a single institution. Objective penile curvature measurements and deformity directions were assessed pre- and posttreatment. Using the validated Peyronie's Disease Questionnaire (PDQ), changes in subjective symptoms of intercourse ability, penile pain, and bother were also noted.
Results
We followed 49 unique PD patients treated with CCh. Mean follow-up was 183 days with a median of 6 injections over 3 cycles performed per patient. The mean pretreatment penile curvature was 49.3 degrees. Curvature was reduced by 15.4 degrees (32.4%, P < .01) after therapy. There were 10 out of 22 patients who regained ability to perform vaginal intercourse. Subjectively, there was an improvement in the ability to perform intercourse (29.1% improvement, P < .01) and bother symptoms (mean decrease 43.2%, P < .01), but no significant changes in penile pain (P = .89). Five notable bleeding events (10.2%) were noted, including 1 penile fracture requiring operative exploration.
Conclusion
CCh use for PD yielded improvements in penile curvature, subjective intercourse, and bother symptoms. Further postanalysis studies of greater follow-up are needed to assess long-term durability, efficacy, and safety.
The global propagation of SARS-CoV-2 has led to an unprecedented public health emergency. Although the lungs are the primary organ targeted by COVID-19, systemic endothelial inflammation and dysfunction are observed, particularly in patients with severe COVID-19, manifested by elevated endothelial injury markers, endotheliitis, and coagulopathy. Here, we review the clinical characteristics of COVID-19-associated endothelial dysfunction and the likely pathological mechanisms underlying the disease, including direct cell entry or indirect immune overreactions after SARS-CoV-2 infection. In addition, we discuss potential biomarkers that might indicate the disease severity, particularly related to the abnormal development of thrombosis, a fatal vascular complication of severe COVID-19. Furthermore, we summarize clinical trials targeting the direct and indirect pathological pathways after SARS-CoV-2 infection to prevent or inhibit the virus-induced endothelial disorders.
• Endothelial dysfunction is a clinical characteristic of COVID-19, particularly in severe cases, regardless of age.
• The likely pathological mechanisms underlying the disease include direct cell entry or indirect immune overreactions after SARS-CoV-2 infection.
• Biomarkers are identified to indicate the disease severity, particularly related to the abnormal development of thrombosis, a fatal vascular complication of severe COVID-19.
• Clinical trials targeting the direct or indirect pathological pathways after SARS-CoV-2 infection are underway to prevent or inhibit the virus-induced endothelial disorders.
There is growing interest in studying and engineering integral membrane proteins (MPs) that play key roles in sensing and regulating cellular response to diverse external signals. An MP must be expressed, correctly inserted and folded in a lipid bilayer, and trafficked to the proper cellular location in order to function. The sequence and structural determinants of these processes are complex and highly constrained. Here we describe a predictive, machine-learning approach that captures this complexity to facilitate successful MP engineering and design. Machine learning on carefully chosen training sequences made by structure-guided SCHEMA recombination has enabled us to accurately predict the rare sequences in a diverse library of channelrhodopsins (ChRs) that express and localize to the plasma membrane of mammalian cells. These light-gated channel proteins of microbial origin are of interest for neuroscience applications, where expression and localization to the plasma membrane are a prerequisite for function. We trained Gaussian process (GP) classification and regression models with expression and localization data from 218 ChR chimeras chosen from a 118,098-variant library designed by SCHEMA recombination of three parent ChRs. We use these GP models to identify ChRs that express and localize well and show that our models can elucidate sequence and structure elements important for these processes. We also used the predictive models to convert a naturally occurring ChR incapable of mammalian localization into one that localizes well.