Deep neural networks can directly learn from chemical structures without extensive, user-driven selection of descriptors in order to predict molecular properties/activities with high reliability. But these approaches typically require large training sets to learn the endpoint-specific structural features and ensure reasonable prediction accuracy. Even though large datasets are becoming the new normal in drug discovery, especially when it comes to high-throughput screening or metabolomics datasets, one should also consider smaller datasets with challenging endpoints to model and forecast. Thus, it would be highly relevant to better utilize the tremendous compendium of unlabeled compounds from publicly-available datasets for improving the model performances for the user’s particular series of compounds. In this study, we propose the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach, an effective transfer learning method based on self-supervised pre-training + task-specific fine-tuning for QSPR/QSAR modeling. A large-scale molecular structure prediction model is pre-trained using one million unlabeled molecules from ChEMBL in a self-supervised learning manner, and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. Herein, the method is evaluated on four benchmark datasets (lipophilicity, FreeSolv, HIV, and blood–brain barrier penetration). The results showed the method can achieve strong performances for all four datasets compared to other state-of-the-art machine learning modeling techniques reported in the literature so far.
Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, cheminformaticians are always at the mercy of data providers who may inadvertently publish (partially) erroneous data. Thus, dataset curation is crucial for any cheminformatics analysis such as similarity searching, clustering, QSAR modeling, virtual screening, etc., especially now that the availability of chemical datasets in the public domain has skyrocketed. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database, including the removal of a fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. Such steps include the removal of inorganic and organometallic compounds, counterions, salts and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies where chemical curation of the original “raw” database enabled a successful modeling study (specifically, QSAR analysis) or resulted in a significant improvement of the model's prediction accuracy. We also demonstrate that in some cases rigorously developed QSAR models could even be used to correct erroneous biological data associated with chemical compounds.
We believe that good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies.
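Two of the curation steps listed above (removal of salts and mixtures, and deletion of duplicates) can be illustrated with a minimal Python sketch. It is not the authors' workflow: it assumes that every record is a hypothetical `(smiles, activity)` pair, that the SMILES strings are already canonical so identical structures share identical strings, and that any multi-fragment record is simply discarded rather than stripped of its counterion. A real pipeline would rely on a cheminformatics toolkit for structure validation, aromatization, and tautomer handling.

```python
from collections import defaultdict

def curate(records):
    """Filter (smiles, activity) pairs; SMILES are assumed to be canonical already."""
    grouped = defaultdict(list)
    for smiles, activity in records:
        smiles = smiles.strip()
        # In SMILES, '.' separates disconnected fragments (salts, mixtures).
        # Simplification: the whole record is dropped.
        if not smiles or "." in smiles:
            continue
        grouped[smiles].append(activity)

    kept, conflicts = [], []
    for smiles, activities in grouped.items():
        if len(set(activities)) == 1:
            kept.append((smiles, activities[0]))  # duplicates agree: keep one copy
        else:
            conflicts.append(smiles)  # duplicates disagree: set aside for review
    return kept, conflicts

# Hypothetical example records, not taken from any dataset in the paper.
records = [
    ("CCO", 1),
    ("CCO", 1),                # exact duplicate
    ("c1ccccc1", 0),
    ("c1ccccc1", 1),           # duplicate with a conflicting label
    ("CC(=O)[O-].[Na+]", 0),   # salt (two fragments)
]
kept, conflicts = curate(records)
print(kept)
print(conflicts)
```

Setting conflicting duplicates aside instead of silently keeping one copy is a design choice of this sketch; it keeps a human in the loop for records whose measured activities disagree.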
QSAR without borders. Muratov, Eugene N; Bajorath, Jürgen; Sheridan, Robert P ...
Chemical Society Reviews, 06/2020, Volume 49, Issue 11
Journal Article, Peer reviewed, Open access
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and, more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modeling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.
Word cloud summary of diverse topics associated with QSAR modeling that are discussed in this review.
There is a growing interest in the broad use of Augmented Reality (AR) and Virtual Reality (VR) in the fields of bioinformatics and cheminformatics to visualize complex biological and chemical structures. AR and VR technologies allow for stunning and immersive experiences, offering untapped opportunities for both research and education purposes. However, preparing 3D models ready to use for AR and VR is time-consuming and requires technical expertise, which severely limits the development of new content of potential interest for structural biologists, medicinal chemists, molecular modellers and teachers.
Herein we present the RealityConvert software tool and associated website, which allow users to easily convert molecular objects to high quality 3D models directly compatible with AR and VR applications. For chemical structures, in addition to the 3D model generation, RealityConvert also generates image trackers, useful to universally call and anchor that particular 3D model when used in AR applications. The ultimate goal of RealityConvert is to facilitate and boost the development and accessibility of AR and VR content for bioinformatics and cheminformatics applications.
http://www.realityconvert.com.
dfourch@ncsu.edu.
Supplementary data are available at Bioinformatics online.
As the proliferation of high-throughput approaches in materials science is increasing the wealth of data in the field, the gap between accumulated information and derived knowledge widens. We address the issue of scientific discovery in materials databases by introducing novel analytical approaches based on structural and electronic materials fingerprints. The framework is employed to (i) query large databases of materials using similarity concepts, (ii) map the connectivity of materials space (i.e., as materials cartograms) for rapidly identifying regions with unique organizations/properties, and (iii) develop predictive Quantitative Materials Structure–Property Relationship models for guiding materials design. In this study, we test these fingerprints by seeking target material properties. As a quantitative example, we model the critical temperatures of known superconductors. Our novel materials fingerprinting and materials cartography approaches contribute to the emerging field of materials informatics by enabling effective computational tools to analyze, visualize, model, and design new materials.
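The idea of querying a database by fingerprint similarity, item (i) above, can be sketched in a few lines of Python. The abstract does not state how the fingerprints are encoded or which similarity measure is used, so everything here is an assumption made for illustration: fingerprints are modeled as sets of "on" feature indices, similarity is the Tanimoto (Jaccard) coefficient, and the names `mat_A`, `mat_B`, `mat_C` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of fingerprint features."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def query(database, target, top_k=2):
    """Return the top_k (similarity, name) pairs most similar to the target."""
    scored = [(tanimoto(fp, target), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical fingerprints: each set lists the indices of the features present.
database = {
    "mat_A": {1, 2, 3, 4},
    "mat_B": {2, 3, 4, 5},
    "mat_C": {7, 8},
}
hits = query(database, target={1, 2, 3})
print(hits)
```

Continuous-valued fingerprints would call for a distance measure on vectors instead; the ranking logic in `query` would stay the same.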
Evaluation of biological effects, both desired and undesired, caused by manufactured nanoparticles (MNPs) is of critical importance for nanotechnology. Experimental studies, especially toxicological, are time-consuming, costly, and often impractical, calling for the development of efficient computational approaches capable of predicting biological effects of MNPs. To this end, we have investigated the potential of cheminformatics methods such as quantitative structure−activity relationship (QSAR) modeling to establish statistically significant relationships between measured biological activity profiles of MNPs and their physical, chemical, and geometrical properties, either measured experimentally or computed from the structure of MNPs. To reflect the context of the study, we termed our approach quantitative nanostructure−activity relationship (QNAR) modeling. We have employed two representative sets of MNPs studied recently using in vitro cell-based assays: (i) 51 various MNPs with diverse metal cores (Proc. Natl. Acad. Sci. 2008, 105, 7387−7392) and (ii) 109 MNPs with similar core but diverse surface modifiers (Nat. Biotechnol. 2005, 23, 1418−1423). We have generated QNAR models using machine learning approaches such as support vector machine (SVM)-based classification and k nearest neighbors (kNN)-based regression; their external prediction power reached 73% for classification modeling and an R² of 0.72 for regression modeling. Our results suggest that QNAR models can be employed for: (i) predicting biological activity profiles of novel nanomaterials, and (ii) prioritizing the design and manufacturing of nanomaterials toward better and safer products.
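For readers unfamiliar with kNN-based regression and the R² statistic quoted above, the following Python sketch shows the basic mechanics. It is a plain, unweighted kNN with Euclidean distance, not the specific kNN variant used in the study, and R² is computed as the coefficient of determination, which is an assumption since the abstract does not give the formula. The descriptor values and activities are hypothetical.

```python
import math

def knn_predict(train, query, k=3):
    """Predict an activity as the mean over the k nearest training points.

    train: list of (descriptor_tuple, activity) pairs.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(activity for _, activity in nearest) / len(nearest)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical one-descriptor training set.
train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]
prediction = knn_predict(train, (1.0,), k=3)
print(prediction)
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

In an external validation setting, `r_squared` would be evaluated only on compounds that were held out from model building.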
Humans are exposed to thousands of man-made chemicals in the environment. Some chemicals mimic natural endocrine hormones and, thus, have the potential to be endocrine disruptors. Most of these chemicals have never been tested for their ability to interact with the estrogen receptor (ER). Risk assessors need tools to prioritize chemicals for evaluation in costly in vivo tests, for instance, within the U.S. EPA Endocrine Disruptor Screening Program.
We describe a large-scale modeling project called CERAPP (Collaborative Estrogen Receptor Activity Prediction Project) and demonstrate the efficacy of using predictive computational models trained on high-throughput screening data to evaluate thousands of chemicals for ER-related activity and prioritize them for further testing.
CERAPP combined multiple models developed in collaboration with 17 groups in the United States and Europe to predict ER activity of a common set of 32,464 chemical structures. Quantitative structure-activity relationship models and docking approaches were employed, mostly using a common training set of 1,677 chemical structures provided by the U.S. EPA, to build a total of 40 categorical and 8 continuous models for binding, agonist, and antagonist ER activity. All predictions were evaluated on a set of 7,522 chemicals curated from the literature. To overcome the limitations of single models, a consensus was built by weighting models on scores based on their evaluated accuracies.
Individual model scores ranged from 0.69 to 0.85, showing high prediction reliabilities. Out of the 32,464 chemicals, the consensus model predicted 4,001 chemicals (12.3%) as high priority actives and 6,742 potential actives (20.8%) to be considered for further testing.
This project demonstrated the possibility to screen large libraries of chemicals using a consensus of different in silico approaches. This concept will be applied in future projects related to other end points.
Mansouri K, Abdelaziz A, Rybacka A, Roncaglioni A, Tropsha A, Varnek A, Zakharov A, Worth A, Richard AM, Grulke CM, Trisciuzzi D, Fourches D, Horvath D, Benfenati E, Muratov E, Wedebye EB, Grisoni F, Mangiatordi GF, Incisivo GM, Hong H, Ng HW, Tetko IV, Balabin I, Kancherla J, Shen J, Burton J, Nicklaus M, Cassotti M, Nikolov NG, Nicolotti O, Andersson PL, Zang Q, Politi R, Beger RD, Todeschini R, Huang R, Farag S, Rosenberg SA, Slavov S, Hu X, Judson RS. 2016.
Collaborative Estrogen Receptor Activity Prediction Project. Environ Health Perspect 124:1023-1033; http://dx.doi.org/10.1289/ehp.1510267.
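The consensus step described in the CERAPP abstract, weighting models by scores based on their evaluated accuracies, can be sketched as a weighted vote. This is only an illustration of the general idea: the exact CERAPP scoring and weighting scheme is not given in the abstract, the model names are hypothetical, and the 0.5 decision threshold is an assumption.

```python
def consensus(predictions, scores):
    """Score-weighted fraction of models that predict a chemical as active.

    predictions: {model_name: 0 or 1}; scores: {model_name: evaluation score}.
    """
    total = sum(scores[name] for name in predictions)
    weighted = sum(scores[name] * call for name, call in predictions.items())
    return weighted / total

# Hypothetical models; the scores are chosen within the 0.69 to 0.85 range
# reported for individual models.
scores = {"model_a": 0.85, "model_b": 0.69, "model_c": 0.75}
predictions = {"model_a": 1, "model_b": 0, "model_c": 1}

fraction_active = consensus(predictions, scores)
is_active = fraction_active >= 0.5  # assumed threshold
print(round(fraction_active, 3), is_active)
```

Normalizing by the scores of only the models that made a prediction lets the same function handle chemicals that fall outside some models' applicability domains.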
Despite the success of protein kinase inhibitors as approved therapeutics, drug discovery has focused on a small subset of kinase targets. Here we provide a thorough characterization of the Published Kinase Inhibitor Set (PKIS), a set of 367 small-molecule ATP-competitive kinase inhibitors that was recently made freely available with the aim of expanding research in this field and as an experiment in open-source target validation. We screen the set in activity assays with 224 recombinant kinases and 24 G protein-coupled receptors and in cellular assays of cancer cell proliferation and angiogenesis. We identify chemical starting points for designing new chemical probes of orphan kinases and illustrate the utility of these leads by developing a selective inhibitor for the previously untargeted kinases LOK and SLK. Our cellular screens reveal compounds that modulate cancer cell growth and angiogenesis in vitro. These reagents and associated data illustrate an efficient way forward to increasing understanding of the historically untargeted kinome.
Ion mobility spectrometry (IMS) is a widely used analytical technique providing rapid gas phase separations. IMS alone is useful, but its coupling with mass spectrometry (IMS-MS) and various front-end separation techniques has greatly increased the molecular information achievable from different omic analyses. IMS-MS analyses are specifically gaining attention for improving metabolomic, lipidomic, glycomic, proteomic and exposomic analyses by increasing measurement sensitivity (e.g. S/N ratio), lowering the detection limit, and amplifying peak capacity. Numerous studies including national security-related analyses, disease screenings and environmental evaluations are illustrating that IMS-MS is able to extract information not possible with MS alone. Furthermore, IMS-MS has shown great utility in salvaging molecular information for low abundance molecules of interest when high concentration contaminant ions are present in the sample by reducing detector suppression. This review highlights how IMS-MS is currently being used in omic analyses to distinguish structurally similar molecules, isomers, molecular classes and contaminant ions.
• IMS is a widely used analytical technique providing rapid gas phase separations.
• IMS coupled with MS is rapidly gaining attention for improving omic analyses.
• IMS-MS is able to extract information not possible with MS alone in complex samples.
• IMS-MS distinguishes isomers, isobars, molecular classes and contaminant ions.
The blockage of the hERG K+ channels is closely associated with lethal cardiac arrhythmia. The notorious ligand promiscuity of this channel has earmarked hERG as one of the most important antitargets to be considered in the early stages of the drug development process. Herein we report on the development of an innovative and freely accessible web server for early identification of putative hERG blockers and non‐blockers in chemical libraries. We have collected the largest publicly available curated hERG dataset of 5,984 compounds. We succeeded in developing robust and externally predictive binary (CCR≈0.8) and multiclass models (accuracy≈0.7). These models are freely available to the public as a web service at http://labmol.farmacia.ufg.br/predherg/. The following three outcomes are available to users: prediction by the binary model, prediction by the multi‐class model, and probability maps of atomic contributions. Pred‐hERG will be continuously updated and upgraded as new information becomes available.
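The CCR (correct classification rate) quoted for the binary model is commonly defined in the QSAR literature as the mean of sensitivity and specificity, which makes it robust to class imbalance between blockers and non-blockers. The Python sketch below assumes that definition, since the abstract does not spell it out, and uses hypothetical labels (1 = blocker, 0 = non-blocker).

```python
def ccr(y_true, y_pred):
    """Correct classification rate: (sensitivity + specificity) / 2.

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true blockers recovered
    specificity = tn / (tn + fp)  # fraction of true non-blockers recovered
    return (sensitivity + specificity) / 2

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(ccr(y_true, y_pred))
```

On this toy set, plain accuracy would be 4/6, while the CCR is lower because half of the minority class is misclassified.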