Computational biology and bioinformatics provide vast data gold mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy Q10=81%) and membrane versus water-soluble proteins (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state of the art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
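The downstream pipeline described above — per-residue embeddings fed to a lightweight supervised head — can be sketched as follows. This is an illustrative sketch only: the embeddings and classifier weights below are random stand-ins, not the trained ProtT5 model or the paper's actual prediction head.

```python
import numpy as np

def predict_q3(embeddings, W, b):
    """Per-residue 3-state secondary structure from per-residue embeddings.

    embeddings: (L, d) array, one vector per residue.
    W: (d, 3) weights, b: (3,) bias of a linear classifier head.
    Returns (per-residue class indices over {H, E, C}, class probabilities).
    """
    logits = embeddings @ W + b                      # (L, 3)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # softmax per residue
    return probs.argmax(axis=1), probs

# Random stand-ins: real inputs would be ProtT5 embeddings (1024-d) and a
# head trained on labeled secondary-structure data.
rng = np.random.default_rng(0)
L, d = 50, 1024
emb = rng.normal(size=(L, d))
W, b = 0.01 * rng.normal(size=(d, 3)), np.zeros(3)
states, probs = predict_q3(emb, W, b)
```

The point of the "embeddings as exclusive input" design is visible here: once the (L, d) matrix exists, no MSA or database search enters the prediction path.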
We examine the problem of clustering biomolecular simulations using deep learning techniques. Since biomolecular simulation datasets are inherently high dimensional, it is often necessary to build low-dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes.
We use a convolutional variational autoencoder (CVAE) to learn low-dimensional, biophysically relevant latent features from long-timescale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely the Fs-peptide (14 μs aggregate sampling), the villin head piece (single trajectory of 125 μs) and the β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the learned CVAE latent features correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the CVAE model can be used to learn latent features of protein folding that can be applied to other independent trajectories, making it particularly attractive for identifying intrinsic features corresponding to conformational substates that share similar structural features.
Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
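The CVAE in this line of work typically consumes per-frame residue contact maps rather than raw coordinates. A minimal sketch of that featurization step, assuming Cα coordinates in ångströms and a conventional 8 Å cutoff (the exact cutoff and atom selection are assumptions, not taken from the abstract):

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary residue-residue contact map from one frame of Calpha coordinates.

    coords: (N, 3) array of Calpha positions (angstrom).
    Returns an (N, N) float32 map; stacks of such per-frame maps are a
    typical image-like input for a convolutional variational autoencoder.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise distances
    return (dist < cutoff).astype(np.float32)

# Toy frame: three residues on a line, 5 angstroms apart.
frame = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
cmap = contact_map(frame)
```

The "89% of all contacts reconstructed" figure quoted above refers to how well the decoder reproduces maps of exactly this kind.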
The process of drug discovery involves a search over the space of all possible chemical compounds. Generative Adversarial Networks (GANs) provide a valuable tool towards exploring chemical space and optimizing known compounds for a desired functionality. Standard approaches to training GANs, however, can result in mode collapse, in which the generator primarily produces samples closely related to a small subset of the training data. In contrast, the search for novel compounds necessitates exploration beyond the original data. Here, we present an approach to training GANs that promotes incremental exploration and limits the impacts of mode collapse using concepts from Genetic Algorithms. In our approach, valid samples from the generator are used to replace samples from the training data. We consider both random and guided selection along with recombination during replacement. By tracking the number of novel compounds produced during training, we show that updates to the training data drastically outperform the traditional approach, increasing potential applications for GANs in drug discovery.
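The replacement step described above can be sketched as a small update function. This is a toy illustration under stated assumptions: "molecules" are placeholders, "guided" selection is interpreted as replace-the-worst, and the recombination variant mentioned in the abstract is omitted for brevity.

```python
import random

def refresh_training_data(train, generated, is_valid, fitness=None):
    """GA-style training-set update: valid generator samples replace entries
    in the training data, pushing the GAN incrementally past its start set.

    With `fitness` given, selection is guided (replace the lowest-fitness
    entry); otherwise a random entry is replaced. All names are illustrative.
    """
    train = list(train)
    for g in (s for s in generated if is_valid(s)):
        if fitness is not None:
            idx = min(range(len(train)), key=lambda i: fitness(train[i]))
        else:
            idx = random.randrange(len(train))
        train[idx] = g
    return train

# Toy "molecules": integers; valid = even; higher value = fitter.
pool = [2, 4, 6, 8]
new = refresh_training_data(pool, generated=[10, 7, 12],
                            is_valid=lambda m: m % 2 == 0,
                            fitness=lambda m: m)
```

Because only valid samples enter the pool and the pool size is fixed, the training distribution drifts toward novel chemistry without growing unboundedly.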
Glycosylation of secondary metabolites involves plant UDP-dependent glycosyltransferases (UGTs). UGTs have shown promise as catalysts in the synthesis of glycosides for medical treatment. However, limited understanding at the molecular level, due to insufficient biochemical and structural information, has hindered potential applications of most of these UGTs. In the absence of experimental crystal structures, we employed advanced molecular modeling and simulations in conjunction with biochemical characterization to design a workflow to study five Group H Arabidopsis thaliana UGTs (76E1, 76E2, 76E4, 76E5, 76D1). Based on our rational structural manipulation and analysis, we identified key amino acids (P129 in 76D1; D374 in 76E2; K275 in 76E4) which, when mutated, improved donor substrate recognition relative to the wild-type UGTs. Molecular dynamics simulations and deep learning analysis identified the structural differences that drive substrate preferences. The design of these UGTs with broader substrate specificity may play an important role in biotechnological and industrial applications. These findings can also serve as a basis for studying other plant UGTs and thereby advance UGT enzyme engineering.
The inverse design of novel molecules with a desirable optoelectronic property requires consideration of the vast chemical spaces associated with varying chemical composition and molecular size. First-principles-based property predictions have become increasingly helpful for assisting the selection of promising candidate chemical species for subsequent experimental validation. However, a brute-force computational screening of the entire chemical space is decidedly impossible. To alleviate the computational burden and accelerate rational molecular design, we here present an iterative deep learning workflow that combines (i) the density-functional tight-binding method for dynamic generation of property training data, (ii) a graph convolutional neural network surrogate model for rapid and reliable predictions of chemical and physical properties, and (iii) a masked language model. As proof of principle, we employ our workflow in the iterative generation of novel molecules with a target energy gap between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO).
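The three-component loop (i)-(iii) can be sketched as follows. Everything here is a stand-in: `generate` abstracts the masked language model, `surrogate` the graph network, and `label` the DFTB calculation; the toy "chemistry" (floats whose "gap" is their absolute value) exists only so the control flow can run.

```python
import random

def design_loop(pool, target_gap, generate, surrogate, label,
                n_rounds=5, n_cand=32, tol=0.2):
    """Sketch of the iterative workflow: the generator proposes candidates,
    the fast surrogate screens them against the target HOMO-LUMO gap, and
    the expensive `label` step produces training data that would be used to
    retrain the surrogate in the next round. All callables are toy stand-ins.
    """
    training_data = []
    for _ in range(n_rounds):
        candidates = [generate(random.choice(pool)) for _ in range(n_cand)]
        promising = [c for c in candidates
                     if abs(surrogate(c) - target_gap) < tol]
        training_data += [(c, label(c)) for c in promising]  # DFTB-like labels
        pool = pool + promising                              # grow the pool
    return pool, training_data

random.seed(0)
pool, data = design_loop(
    pool=[1.0, 2.0, 3.0], target_gap=2.0,
    generate=lambda m: m + random.uniform(-0.5, 0.5),
    surrogate=abs, label=abs)
```

The key design choice the abstract describes is visible in the loop: the expensive first-principles step is only ever applied to candidates the cheap surrogate already considers promising.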
Molecular simulations have allowed us to probe the atomic details of aqueous solutions of tetramethylammonium (TMA) and tetrabutylammonium (TBA) bromide, across a wide range of concentrations (0.5 to 3-4 molal). We highlight the space-filling (TMA(+)) versus penetrable (TBA(+)) nature of these polyatomic cations and its consequence for ion hydration, ion dynamics and ion-ion interactions. A well-established hydration is seen for both TMA(+) and TBA(+) throughout the concentration range studied. A clear penetration of water molecules, as well as counterions, between the hydrocarbon arms of TBA(+), which remain in an extended configuration, is seen. Global rotation of individual TBA(+) points towards isolated rather than aggregated ions (from dilute up to 1 m concentration). Only for highly concentrated solutions, in which inter-penetration of adjacent TBA(+)s cannot be avoided, does the rotational time increase dramatically. From both structural and dynamic data we conclude that there is an absence of hydrophobicity-driven cation-cation aggregation in both the TMABr and TBABr solutions studied. The link between these real systems and the theoretical predictions for spherical hydrophobic solutes of varying size does not seem straightforward.
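The rotational-time analysis invoked above is conventionally extracted from the decay of an orientational autocorrelation function. A minimal numpy sketch, assuming a time series of unit orientation vectors (the toy random-walk trajectory below stands in for a real cation-axis time series from the simulations):

```python
import numpy as np

def orientational_acf(u, max_lag):
    """First-rank orientational ACF C1(t) = <u(0) . u(t)> for unit vectors.

    u: (T, 3) time series of unit orientation vectors (e.g. a molecular axis
    of a TBA+ cation). Fitting the decay of C1(t) yields the rotational
    correlation time discussed in the abstract.
    """
    T = len(u)
    return np.array([(u[: T - lag] * u[lag:]).sum(axis=-1).mean()
                     for lag in range(max_lag)])

# Toy trajectory: a unit vector doing a small random walk on the sphere.
rng = np.random.default_rng(2)
u = np.empty((2000, 3))
u[0] = [0.0, 0.0, 1.0]
for t in range(1, len(u)):
    step = u[t - 1] + 0.05 * rng.normal(size=3)
    u[t] = step / np.linalg.norm(step)
acf = orientational_acf(u, max_lag=200)
```

A dramatic slow-down of rotation, as reported for concentrated TBABr, would show up here as a much slower decay of C1(t).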
Language models for the prediction of SARS-CoV-2 inhibitors. Blanchard, Andrew E.; Gounley, John; Bhowmik, Debsindhu; et al. The International Journal of High Performance Computing Applications, 11/2022, Volume 36, Issue 5-6. Journal article, peer reviewed, open access.
The COVID-19 pandemic highlights the need for computational tools to automate and accelerate drug design for novel protein targets. We leverage deep learning language models to generate and score drug candidates based on predicted protein binding affinity. We pre-trained a deep learning language model (BERT) on ∼9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision. Our work reduces pre-training time from days to hours, compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude. For scoring, we fine-tuned the language model using an assembled set of thousands of protein targets with binding affinity data and searched for inhibitors of specific protein targets, SARS-CoV-2 Mpro and PLpro. We utilized a genetic algorithm approach for finding optimal candidates using the generation and scoring capabilities of the language model. Our generalizable models accelerate the identification of inhibitors for emerging therapeutic targets.
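The BERT-style pre-training objective underlying this pipeline masks random tokens and trains the model to recover them. A sketch of that data-preparation step for SMILES strings, with the caveat that the character-level tokenization below is a naive stand-in (real chemical tokenizers group multi-character atoms and ring labels):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masking over a token list: each token is replaced by the
    mask with probability mask_prob, and the model is trained to recover
    the labels at exactly the masked positions.
    """
    masked, labels = [], []
    for t in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(t)        # target to predict
        else:
            masked.append(t)
            labels.append(None)     # position not scored
    return masked, labels

random.seed(7)
tokens = list("CC(=O)Oc1ccccc1C(=O)O")   # aspirin SMILES, character tokens
masked, labels = mask_tokens(tokens)
```

Pre-training on billions of such masked sequences is what gives the model the chemistry prior later exploited for generation and affinity scoring.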
The vast size of chemical space necessitates computational approaches to automate and accelerate the design of molecular sequences to guide experimental efforts for drug discovery. Genetic algorithms provide a useful framework to incrementally generate molecules by applying mutations to known chemical structures. Recently, masked language models have been applied to automate the mutation process by leveraging large compound libraries to learn commonly occurring chemical sequences (i.e., using tokenization) and predict rearrangements (i.e., using mask prediction). Here, we consider how language models can be adapted to improve molecule generation for different optimization tasks. We use two different generation strategies for comparison, fixed and adaptive. The fixed strategy uses a pre-trained model to generate mutations; the adaptive strategy trains the language model on each new generation of molecules selected for target properties during optimization. Our results show that the adaptive strategy allows the language model to more closely fit the distribution of molecules in the population. Therefore, for enhanced fitness optimization, we suggest the use of the fixed strategy during an initial phase followed by the use of the adaptive strategy. We demonstrate the impact of adaptive training by searching for molecules that optimize both heuristic metrics, drug-likeness and synthesizability, as well as predicted protein binding affinity from a surrogate model. Our results show that the adaptive strategy provides a significant improvement in fitness optimization compared to the fixed pre-trained model, empowering the application of language models to molecular design tasks.
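The fixed/adaptive distinction reduces to whether the mutation operator is re-fit each generation. A toy sketch of that control flow (floats stand in for molecules, `mutate` for the language model's mask-predict step, and `retrain` for re-fitting it; none of these are the paper's actual components):

```python
import random

def optimize(population, mutate, fitness, n_gen=10, top_k=8, retrain=None):
    """GA loop with a language-model-like mutation operator.

    Fixed strategy:    retrain=None, the pre-trained mutator stays frozen.
    Adaptive strategy: pass retrain, called on each generation's survivors
    so the mutator re-fits the current population's distribution.
    """
    for _ in range(n_gen):
        offspring = [mutate(m) for m in population]
        # Keep the top_k fittest of parents plus offspring (elitist selection).
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:top_k]
        if retrain is not None:
            retrain(population)   # adaptive phase only
    return population

random.seed(3)
# Toy task: "molecules" are floats, fitness peaks at 5.0.
fit = lambda m: -abs(m - 5.0)
start = [random.uniform(0.0, 1.0) for _ in range(8)]
pop = optimize(list(start), mutate=lambda m: m + random.gauss(0.0, 0.3),
               fitness=fit)
best = max(pop, key=fit)
```

The suggested schedule from the abstract maps onto this sketch as: run `optimize` first with `retrain=None`, then continue from the resulting population with a `retrain` callback supplied.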
Class A β-lactamases are known for being able to rapidly gain broad-spectrum catalytic efficiency against most β-lactamase inhibitor combinations as a result of elusively minor point mutations. The evolution in class A β-lactamases occurs through optimisation of their dynamic phenotypes at different timescales. At long timescales, certain conformations are more catalytically permissive than others, while at short timescales, fine-grained optimisation of free energy barriers can improve the efficiency of ligand processing by the active site. Free energy barriers, which define all coordinated movements, depend on the flexibility of the secondary structural elements. The most highly conserved residues in class A β-lactamases are hydrophobic nodes that stabilize the core. To assess how the stable hydrophobic core is linked to the structural dynamics of the active site, we carried out adaptively sampled molecular dynamics (MD) simulations in four representative class A β-lactamases (KPC-2, SME-1, TEM-1, and SHV-1). Using Markov State Models (MSMs) and unsupervised deep learning, we show that the dynamics of the hydrophobic nodes serves as a metastable relay of kinetic information within the core and is coupled with the catalytically permissive conformation of the active-site environment. Our results collectively demonstrate that the class A enzymes described here share several important dynamic similarities, and that the hydrophobic nodes comprise an informative set of dynamic variables in representative class A β-lactamases.
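The basic MSM building block used in this kind of analysis is a row-stochastic transition matrix estimated from a discretized trajectory at a chosen lag time. A minimal sketch (the two-state toy trajectory below is illustrative; real inputs would be cluster assignments of the MD frames):

```python
import numpy as np

def msm_transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized transition matrix from a discretized trajectory.

    dtraj: sequence of integer state labels, one per frame.
    Counts transitions i -> j separated by `lag` frames, then normalizes
    each row so it sums to 1 (unvisited states keep zero rows).
    """
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0
    return counts / rows

# Toy discretized trajectory hopping between two metastable states.
T = msm_transition_matrix([0, 0, 1, 1, 0, 0, 1], n_states=2)
```

Eigenanalysis of such a matrix yields the metastable states and relaxation timescales through which the hydrophobic-node coupling described above is quantified.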
β-Lactam antibiotics are the most important and widely used antibacterial agents across the world. However, the widespread dissemination of β-lactamases among pathogenic bacteria limits the efficacy of β-lactam antibiotics. This has created a major public health crisis. The use of β-lactamase inhibitors has proven useful in restoring the activity of β-lactam antibiotics, yet effective clinically approved inhibitors against class B metallo-β-lactamases are not available. L1, a class B3 enzyme expressed by Stenotrophomonas maltophilia
, is a significant contributor to the β-lactam resistance displayed by this opportunistic pathogen. Structurally, L1 is a tetramer with two elongated loops, α3-β7 and β12-α5, present around the active site of each monomer. Residues in these two loops influence substrate/inhibitor binding. To study how the conformational changes of the elongated loops affect the active site in each monomer, enhanced sampling molecular dynamics simulations were performed, Markov State Models were built, and convolutional variational autoencoder-based deep learning was applied. The key identified residues (D150a, H151, P225, Y227, and R236) were mutated, and the activity of the generated L1 variants was evaluated in cell-based experiments. The results demonstrate that there are highly significant gating interactions between the α3-β7 and β12-α5 loops. Taken together, the gating interactions, along with the conformational changes of the key residues, play an important role in the structural remodeling of the active site. These observations offer insights into the potential for novel drug development exploiting these gating interactions.
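A simple geometric observable for loop gating of the kind described above is the per-frame minimum distance between the two loops. The sketch below is a generic illustration, not the paper's analysis: the atom indices and toy coordinates are invented, and real input would be an MD trajectory array.

```python
import numpy as np

def gate_distance(traj, loop_a, loop_b):
    """Per-frame minimum atom-atom distance between two loops.

    traj: (F, N, 3) coordinates over F frames; loop_a, loop_b: index lists
    selecting the atoms of each loop. Small values indicate the loops are
    closed over the active site; thresholding the returned series gives a
    simple open/closed gating indicator.
    """
    a = traj[:, loop_a]                                  # (F, Na, 3)
    b = traj[:, loop_b]                                  # (F, Nb, 3)
    d = np.linalg.norm(a[:, :, None, :] - b[:, None, :, :], axis=-1)
    return d.min(axis=(1, 2))                            # (F,)

# Toy system: 2 frames, 4 atoms; the loops move apart between frames.
traj = np.array([
    [[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]],
    [[0.0, 0, 0], [1, 0, 0], [6, 0, 0], [9, 0, 0]],
])
dmin = gate_distance(traj, loop_a=[0, 1], loop_b=[2, 3])
```

Time series like this one are also a natural low-dimensional input for the Markov State Model and autoencoder analyses mentioned in the abstract.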