We have devised and implemented a novel computational strategy for de novo design of molecules with desired properties, termed ReLeaSE (Reinforcement Learning for Structural Evolution). On the basis of deep learning and reinforcement learning (RL) approaches, ReLeaSE integrates two deep neural networks, generative and predictive, that are trained separately but are used jointly to generate novel targeted chemical libraries. ReLeaSE represents molecules solely by their simplified molecular-input line-entry system (SMILES) strings. Generative models are trained with a stack-augmented memory network to produce chemically feasible SMILES strings, and predictive models are derived to forecast the desired properties of the de novo-generated compounds. In the first phase of the method, generative and predictive models are trained separately with a supervised learning algorithm. In the second phase, both models are trained jointly with the RL approach to bias the generation of new chemical structures toward those with the desired physical and/or biological properties. In a proof-of-concept study, we used the ReLeaSE method to design chemical libraries with a bias toward structural complexity, toward compounds with maximal, minimal, or a specific range of physical properties such as melting point or hydrophobicity, or toward compounds with inhibitory activity against Janus protein kinase 2. The approach proposed herein can find general use for generating targeted chemical libraries of novel compounds optimized for either a single desired property or multiple properties.
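The second-phase RL step described above can be illustrated with a minimal REINFORCE-style sketch. The toy generator below is a single categorical distribution over SMILES-like tokens (the actual ReLeaSE generator is a stack-augmented RNN), and the reward function is a stand-in for the trained predictive network; all names and the reward rule are illustrative assumptions, not the published method.

```python
import math, random

random.seed(0)

# Toy "generator": one categorical distribution over SMILES-like tokens,
# parameterized by logits. This only illustrates the policy-gradient update
# that biases generation toward high-reward structures.
TOKENS = ["C", "c", "N", "O", "<end>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_string(logits, max_len=10):
    out = []
    for _ in range(max_len):
        probs = softmax(logits)
        r, acc = random.random(), 0.0
        for tok, p in zip(TOKENS, probs):
            acc += p
            if r <= acc:
                break
        if tok == "<end>":
            break
        out.append(tok)
    return out

def reward(tokens):
    # Stand-in for the predictive network: reward strings rich in aromatic
    # carbons ("c"). A real run would score a trained property model.
    return 0.0 if not tokens else tokens.count("c") / len(tokens)

def train(logits, iters=300, batch=32, lr=0.5):
    for _ in range(iters):
        samples = [sample_string(logits) for _ in range(batch)]
        rewards = [reward(s) for s in samples]
        baseline = sum(rewards) / batch  # variance-reduction baseline
        for s, r in zip(samples, rewards):
            adv = r - baseline
            for tok in s:
                probs = softmax(logits)
                for k in range(len(TOKENS)):
                    # REINFORCE gradient of log-probability for a softmax policy
                    grad = (1.0 if TOKENS[k] == tok else 0.0) - probs[k]
                    logits[k] += lr * adv * grad / batch
    return logits

logits = [0.0] * len(TOKENS)
p_before = softmax(logits)[TOKENS.index("c")]
train(logits)
p_after = softmax(logits)[TOKENS.index("c")]
```

After training, the probability mass assigned to the rewarded token increases, which is the essence of biasing the generator toward the desired property.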
Although historically materials discovery has been driven by a laborious trial-and-error process, knowledge-driven materials design can now be enabled by the rational combination of machine learning methods and materials databases. Here, data from the AFLOW repository for ab initio calculations is combined with Quantitative Materials Structure-Property Relationship models to predict important properties: metal/insulator classification, band gap energy, bulk/shear moduli, Debye temperature, and heat capacities. The prediction accuracy compares well with the quality of the training data for virtually any stoichiometric inorganic crystalline material, reciprocating the available thermomechanical experimental data. The universality of the approach is attributed to the construction of the descriptors: Property-Labelled Materials Fragments. The representations require only minimal structural input, allowing straightforward implementation of simple heuristic design rules.
Quantitative Structure-Activity Relationship (QSAR) modeling has traditionally been applied as an evaluative approach, i.e., with the focus on developing retrospective and explanatory models of existing data. Model extrapolation was considered, if at all, only in a hypothetical sense, in terms of potential modifications of known biologically active chemicals that could improve compounds' activity. This critical review re-examines the strategy and the output of modern QSAR modeling approaches. We provide examples and arguments suggesting that current methodologies may afford robust and validated models capable of accurate prediction of compound properties for molecules not included in the training sets. We discuss a data-analytical modeling workflow developed in our laboratory that incorporates modules for combinatorial QSAR model development (i.e., using all possible binary combinations of available descriptor sets and statistical data modeling techniques), rigorous model validation, and virtual screening of available chemical databases to identify novel biologically active compounds. Our approach places particular emphasis on model validation as well as the need to define model applicability domains in chemistry space. We present examples of studies where the application of rigorously validated QSAR models to virtual screening identified computational hits that were confirmed by subsequent experimental investigations. The emerging focus of QSAR modeling on target property forecasting brings it forward as a predictive, as opposed to evaluative, modeling approach.
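The combinatorial QSAR idea mentioned above, building a model for every pairing of a descriptor set with a modeling technique and keeping the best-validated one, can be sketched as an enumeration. The descriptor-set names, method names, and the scoring function below are placeholders; a real workflow would fit each model and compute a cross-validated statistic (e.g., q²) instead.

```python
from itertools import product

# Hypothetical descriptor sets and modeling techniques (illustrative names).
descriptor_sets = ["dragon", "moe", "fingerprints"]
methods = ["kNN", "SVM", "random_forest"]

def cross_validated_score(descriptors, method):
    # Placeholder for model building plus rigorous (e.g., 5-fold) validation;
    # returns a deterministic toy score so the enumeration is runnable.
    return 0.1 * len(descriptors) % 1.0 + 0.01 * methods.index(method)

# Every binary combination of a descriptor set with a modeling technique.
grid = list(product(descriptor_sets, methods))
best = max(grid, key=lambda pair: cross_validated_score(*pair))
print(len(grid), best)
```

In practice the winning combination is then subjected to the external validation and applicability-domain checks emphasized in the text before any virtual screening.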
Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, cheminformaticians are always at the mercy of data providers who may inadvertently publish (partially) erroneous data. Thus, dataset curation is crucial for any cheminformatics analysis such as similarity searching, clustering, QSAR modeling, and virtual screening, especially now that the availability of chemical datasets in the public domain has skyrocketed. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database, including the removal of a fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. Such steps include the removal of inorganic and organometallic compounds, counterions, salts, and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies where chemical curation of the original “raw” database enabled a successful modeling study (specifically, QSAR analysis) or resulted in a significant improvement of a model's prediction accuracy. We also demonstrate that in some cases rigorously developed QSAR models could even be used to correct erroneous biological data associated with chemical compounds.
We believe that good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies.
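A few of the curation steps listed above (salt/counterion stripping, removal of records lacking an organic component, and duplicate deletion) can be sketched at the string level. This is a toy heuristic only: real curation of the kind described here relies on a cheminformatics toolkit (e.g., RDKit) for structure validation, aromatization, tautomer handling, and canonical deduplication, and the carbon-detecting regex below is a crude assumption, not element-aware chemistry.

```python
import re

def largest_organic_fragment(smiles):
    """Split a multi-component record on '.' and keep the largest fragment
    that appears to contain carbon (crude heuristic: 'C' not starting a
    two-letter element symbol such as Cl or Ca, or aromatic 'c')."""
    organic = [f for f in smiles.split(".")
               if re.search(r"C(?![laoudsrenfm])|c", f)]
    return max(organic, key=len) if organic else None

def curate(records):
    seen, curated = set(), []
    for smi in records:
        frag = largest_organic_fragment(smi.strip())
        if frag is None:        # inorganic-only record: drop
            continue
        if frag in seen:        # duplicate (string-level): drop
            continue
        seen.add(frag)
        curated.append(frag)
    return curated

raw = ["CCO.[Na+].[Cl-]", "[Na+].[Cl-]", "CCO", "c1ccccc1C(=O)O.Cl"]
print(curate(raw))  # ['CCO', 'c1ccccc1C(=O)O']
```

Even this simplistic pass shows why curation must precede modeling: the same parent structure can enter a raw database several times under different salt forms and would otherwise be counted as distinct compounds.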
QSAR without borders. Muratov, Eugene N; Bajorath, Jürgen; Sheridan, Robert P ...
Chemical Society Reviews, 06/2020, Volume 49, Issue 11.
Journal Article. Peer-reviewed. Open access.
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and, more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling, but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries, including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modeling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.
Word cloud summary of diverse topics associated with QSAR modeling that are discussed in this review.
As the proliferation of high-throughput approaches in materials science is increasing the wealth of data in the field, the gap between accumulated information and derived knowledge widens. We address the issue of scientific discovery in materials databases by introducing novel analytical approaches based on structural and electronic materials fingerprints. The framework is employed to (i) query large databases of materials using similarity concepts, (ii) map the connectivity of materials space (i.e., as materials cartograms) to rapidly identify regions with unique organizations/properties, and (iii) develop predictive Quantitative Materials Structure–Property Relationship models for guiding materials design. In this study, we test these fingerprints by seeking target material properties. As a quantitative example, we model the critical temperatures of known superconductors. Our novel materials fingerprinting and materials cartography approaches contribute to the emerging field of materials informatics by enabling effective computational tools to analyze, visualize, model, and design new materials.
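The similarity-based database query in point (i) above can be sketched with binary fingerprints and the Tanimoto (Jaccard) coefficient, a standard similarity measure for bit-vector fingerprints. The fingerprints here are modeled as Python sets of set-bit positions, and the database entries are made-up names; a real implementation would use packed bit vectors over an actual materials repository.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def query(database, probe, threshold=0.5):
    """Return (name, similarity) for all entries at least `threshold`
    similar to the probe fingerprint, most similar first."""
    hits = [(name, tanimoto(fp, probe)) for name, fp in database.items()]
    hits = [(n, s) for n, s in hits if s >= threshold]
    return sorted(hits, key=lambda ns: -ns[1])

# Illustrative database: names and bit positions are assumptions.
db = {
    "material_A": {1, 2, 3, 4},
    "material_B": {1, 2, 7, 8},
    "material_C": {9, 10},
}
print(query(db, {1, 2, 3}))  # [('material_A', 0.75)]
```

The same similarity matrix also underlies the cartogram idea in point (ii): materials become nodes, and edges connect pairs whose similarity exceeds a chosen threshold.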
After nearly five decades “in the making”, QSAR modeling has established itself as one of the major computational molecular modeling methodologies. As with any mature research discipline, QSAR modeling can be characterized by a collection of well-defined protocols and procedures that enable the expert application of the method for exploring and exploiting ever-growing collections of biologically active chemical compounds. This review examines the most critical QSAR modeling routines that we regard as best practices in the field. We discuss these procedures in the context of an integrative predictive QSAR modeling workflow that is focused on achieving models of the highest statistical rigor and external predictive power. Specific elements of the workflow consist of data preparation, including curation of chemical structures (and, when possible, associated biological data), outlier detection, dataset balancing, and model validation. We especially emphasize procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries. Finally, we present several examples of successful applications of QSAR models for virtual screening to identify experimentally confirmed hits.
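Two of the validation elements emphasized above, an external predictivity statistic and a distance-based applicability domain, can be sketched as follows. The R² formula is standard; the applicability-domain rule (nearest-training-neighbor distance not exceeding the training mean + 2·std) is one common choice among several, and the one-dimensional data are purely illustrative.

```python
import math

def r2_external(y_obs, y_pred):
    """Coefficient of determination on an external evaluation set."""
    mean = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def in_domain(x, train, threshold):
    """A query is inside the applicability domain if its distance to the
    nearest training point does not exceed the threshold."""
    nearest = min(abs(x - t) for t in train)
    return nearest <= threshold

train = [0.0, 1.0, 2.0, 3.0]
# Threshold from the training set: mean nearest-neighbor distance + 2*std.
nn = [min(abs(a - b) for b in train if b != a) for a in train]
mu = sum(nn) / len(nn)
sd = math.sqrt(sum((d - mu) ** 2 for d in nn) / len(nn))
threshold = mu + 2 * sd

print(r2_external([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
print(in_domain(1.5, train, threshold), in_domain(10.0, train, threshold))
```

The point of pairing the two is the one the review makes: an external statistic is only meaningful for queries that fall inside the model's applicability domain.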
Evaluation of biological effects, both desired and undesired, caused by manufactured nanoparticles (MNPs) is of critical importance for nanotechnology. Experimental studies, especially toxicological, are time-consuming, costly, and often impractical, calling for the development of efficient computational approaches capable of predicting biological effects of MNPs. To this end, we have investigated the potential of cheminformatics methods such as quantitative structure–activity relationship (QSAR) modeling to establish statistically significant relationships between measured biological activity profiles of MNPs and their physical, chemical, and geometrical properties, either measured experimentally or computed from the structure of MNPs. To reflect the context of the study, we termed our approach quantitative nanostructure–activity relationship (QNAR) modeling. We have employed two representative sets of MNPs studied recently using in vitro cell-based assays: (i) 51 various MNPs with diverse metal cores (Proc. Natl. Acad. Sci. 2008, 105, 7387–7392) and (ii) 109 MNPs with a similar core but diverse surface modifiers (Nat. Biotechnol. 2005, 23, 1418–1423). We have generated QNAR models using machine learning approaches such as support vector machine (SVM)-based classification and k-nearest neighbors (kNN)-based regression; their external prediction power was shown to be as high as 73% for classification modeling, with an R² of 0.72 for regression modeling. Our results suggest that QNAR models can be employed for (i) predicting biological activity profiles of novel nanomaterials and (ii) prioritizing the design and manufacturing of nanomaterials toward better and safer products.
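The kNN regression used for the QNAR models above is simple enough to sketch directly: a prediction is the mean activity of the k training samples nearest in descriptor space. The "nanoparticle descriptors" and activity values below are made up for illustration and do not come from the cited datasets.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Mean activity of the k training points closest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    neighbors = order[:k]
    return sum(train_y[i] for i in neighbors) / k

# Toy descriptors (e.g., size, zeta potential) and activities (assumed values).
X = [[10.0, -5.0], [12.0, -4.0], [50.0, 20.0], [55.0, 22.0], [11.0, -6.0]]
y = [0.9, 0.8, 0.1, 0.2, 1.0]
print(knn_predict(X, y, [11.0, -5.0], k=3))
```

Because the prediction depends only on distances in descriptor space, the same code works whether the descriptors are measured experimentally or computed from the MNP structure, which is exactly the flexibility the abstract describes.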
Myocarditis and pericarditis have recently been linked to COVID-19 vaccines without exploration of the underlying mechanisms or comparison to cardiac adverse events following non-COVID-19 vaccines. We introduce an informatics approach to study post-vaccine adverse events at the systems biology level, to aid the prioritization of effective preventive measures and mechanism-based pharmacotherapy, by integrating the analysis of adverse event reports from the Vaccine Adverse Event Reporting System (VAERS) with systems biology methods. Our results indicated that post-vaccine myocarditis and pericarditis were associated most frequently with mRNA COVID-19 vaccines, followed by live or live-attenuated non-COVID-19 vaccines such as smallpox and anthrax vaccines. The frequencies of cardiac adverse events were affected by vaccine, vaccine type, vaccine dose, and the sex and age of the vaccinated individuals. Systems biology results suggested a central role of interferon-gamma (IFN-gamma) in the biological processes leading to cardiac adverse events, through its impact on the MAPK and JAK-STAT signaling pathways. We suggest that increasing the time interval between vaccine doses minimizes the risk of developing inflammatory adverse reactions. We also propose glucocorticoids as preferred treatments on the basis of systems biology evidence. Our informatics workflow provides an invaluable tool to study post-vaccine adverse events at the systems biology level to suggest effective mechanism-based pharmacotherapy and/or suitable preventive measures.
Prior to using a quantitative structure-activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external data set, the best way to validate the predictive ability of a model is to perform statistical external validation, in which the overall data set is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods can divide data sets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models than random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall data set was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set) using rational division methods and, separately, using random division. The Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms were used as the rational division methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated, and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models is comparable.
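The Kennard-Stone algorithm named above is the classic rational splitter: seed the training set with the two mutually most distant points, then repeatedly add the remaining candidate whose minimum distance to the already-selected points is largest. The sketch below uses one-dimensional toy data for clarity; real use operates on descriptor vectors with a multivariate distance.

```python
def kennard_stone(points, n_train):
    """Return indices of `n_train` training points chosen by Kennard-Stone."""
    dist = lambda a, b: abs(a - b)  # 1-D toy distance; use Euclidean for vectors
    # Seed with the mutually most distant pair.
    i0, j0 = max(
        ((i, j) for i in range(len(points)) for j in range(i + 1, len(points))),
        key=lambda ij: dist(points[ij[0]], points[ij[1]]),
    )
    selected = [i0, j0]
    remaining = [i for i in range(len(points)) if i not in selected]
    while len(selected) < n_train:
        # Add the candidate farthest from everything selected so far.
        best = max(remaining,
                   key=lambda i: min(dist(points[i], points[s]) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected  # remaining indices form the test set

pts = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(kennard_stone(pts, 5))  # [0, 9, 4, 2, 6]
```

The selection pattern shows why rational splits give better test-set statistics: the training set deliberately spans the extremes of the data, so the held-out points always lie inside its coverage.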