Molecular structure property modeling is an increasingly important tool for predicting compounds with desired properties, given the expensive and resource-intensive nature of drug discovery and development and the problem of toxicity-related attrition in its late phases. Lately, interest in applying deep learning techniques has increased considerably. This investigation compares the traditional physico-chemical descriptor and machine learning-based approaches, through autoencoder-generated descriptors, to two different descriptor-free, Simplified Molecular Input Line Entry System (SMILES)-based deep learning architectures of the Bidirectional Encoder Representations from Transformers (BERT) type, using the Mondrian aggregated conformal prediction method as the overarching framework. For the binary CATMoS non-toxic and very-toxic datasets, the results show that on the former, almost equally balanced, dataset all methods perform equally well, while on the latter dataset, with an 11-fold difference between the two classes, the MolBERT model based on a large pre-trained network performs somewhat better than the rest, with high efficiency for both classes (0.93-0.94) as well as high values for sensitivity, specificity and balanced accuracy (0.86-0.87). The descriptor-free, SMILES-based deep learning BERT architectures thus seem capable of producing well-balanced predictive models with defined applicability domains. This work also demonstrates that the class imbalance problem is gracefully handled by Mondrian conformal prediction, without the use of over- and/or under-sampling, class weighting or cost-sensitive methods.
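The class-conditional calibration at the heart of Mondrian conformal prediction can be illustrated with a minimal sketch. The nonconformity scores and toy calibration set below are placeholder assumptions for illustration, not the descriptors or models used in the study; the point is that each class is calibrated only against examples of that same class, which is what preserves the per-class error guarantee under strong imbalance.

```python
def mondrian_p_values(cal_scores, cal_labels, test_scores, classes=(0, 1)):
    """Class-conditional (Mondrian) inductive conformal p-values.

    cal_scores[i][c] holds the nonconformity score of calibration
    example i with respect to class c. Each test example is compared
    only against calibration examples of its candidate class."""
    p = []
    for t in test_scores:
        row = []
        for ci, c in enumerate(classes):
            cal_c = [s[ci] for s, y in zip(cal_scores, cal_labels) if y == c]
            row.append((sum(s >= t[ci] for s in cal_c) + 1) / (len(cal_c) + 1))
        p.append(row)
    return p

def prediction_regions(p, epsilon=0.2, classes=(0, 1)):
    """All classes whose p-value exceeds the significance level epsilon."""
    return [{c for ci, c in enumerate(classes) if row[ci] > epsilon} for row in p]

# Toy calibration set: three class-0 examples and one class-1 example
# (an imbalance the per-class calibration handles gracefully).
cal_scores = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.9, 0.1)]
cal_labels = [0, 0, 0, 1]
p = mondrian_p_values(cal_scores, cal_labels, [(0.15, 0.85)])
```

At a lower significance level the prediction region may contain both classes; at 0.6 the toy example above yields the single-label region {0}.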
Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because of their higher speed and lower cost compared to experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application to predicting large compound libraries or developing predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherits its high predictivity but resolves its scalability and computational-time limitations by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and computational time of LightGBM to deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on publicly available Tox21 and mutagenicity data sets using a Bayesian-optimization-integrated nested 10-fold cross-validation scheme that performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm, offering the best predictive performance while requiring significantly less computational time than the other investigated algorithms across all Tox21 and mutagenicity data sets. We recommend LightGBM for applications in safety assessment and other areas of cheminformatics, to meet the ever-growing demand for accurate and rapid prediction of toxicity- or activity-related end points for the large compound libraries present in the pharmaceutical and chemical industry.
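The structure of a nested cross-validation scheme like the one described above can be sketched as follows. This is a simplified illustration: a plain grid search over a toy one-parameter "model" stands in for the Bayesian optimization and the actual learning algorithms compared in the study.

```python
import random

def nested_cv(X, y, param_grid, fit, score, k_outer=10, k_inner=5, seed=0):
    """Nested k-fold cross-validation: the inner loop selects
    hyperparameters, the outer loop estimates how well a model tuned
    that way generalizes to data it never saw during tuning."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    outer = [idx[i::k_outer] for i in range(k_outer)]
    results = []
    for i, test_idx in enumerate(outer):
        train_idx = [j for f in outer[:i] + outer[i + 1:] for j in f]
        inner = [train_idx[i2::k_inner] for i2 in range(k_inner)]
        best, best_score = None, float("-inf")
        for params in param_grid:                 # inner model selection
            s = 0.0
            for i2, val_idx in enumerate(inner):
                fit_idx = [j for f in inner[:i2] + inner[i2 + 1:] for j in f]
                model = fit([X[j] for j in fit_idx], [y[j] for j in fit_idx], params)
                s += score(model, [X[j] for j in val_idx], [y[j] for j in val_idx])
            if s > best_score:
                best, best_score = params, s
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx], best)
        results.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return results                                # one estimate per outer fold

# Toy demo: a threshold "classifier" on one feature, tuned over three cutoffs.
X = [i / 19 for i in range(20)]
y = [x > 0.5 for x in X]
fit = lambda Xs, ys, t: t                         # "training" just returns the cutoff
score = lambda t, Xs, ys: sum((x > t) == yy for x, yy in zip(Xs, ys)) / len(Xs)
scores = nested_cv(X, y, [0.3, 0.5, 0.7], fit, score)
```

Because hyperparameters are chosen inside each outer training fold, the outer scores are unbiased by the tuning procedure, which is the property the study relies on when comparing algorithms.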
Conformal prediction is introduced as an alternative approach to applicability domain estimation. The advantages of using conformal prediction are as follows. First, the approach is based on a consistent and well-defined mathematical framework. Second, the confidence level concept in conformal prediction is straightforward to understand: a confidence level of 0.8 means that the conformal predictor will commit at most 20% errors (i.e., true values outside the assigned prediction range). Third, the confidence level can be varied depending on the situation in which the model is to be applied, and the consequences of such changes are readily understandable, i.e., prediction ranges are increased or decreased, and the changes can immediately be inspected. We demonstrate the usefulness of conformal prediction by applying it to 10 publicly available data sets.
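The "confidence 0.8 means at most 20% errors" guarantee can be checked empirically with a minimal inductive conformal regression sketch; the Gaussian residuals below are synthetic placeholders, not the data sets from the study.

```python
import math
import random

def conformal_halfwidth(cal_abs_residuals, confidence=0.8):
    """Inductive conformal regression: the interval half-width is the
    ceil((n + 1) * confidence)-th smallest absolute calibration
    residual, so at most (1 - confidence) of true values fall outside
    the interval (prediction +/- half-width), assuming exchangeability."""
    s = sorted(cal_abs_residuals)
    k = min(len(s) - 1, math.ceil((len(s) + 1) * confidence) - 1)
    return s[k]

# Empirical check of the 0.8-confidence guarantee on synthetic residuals.
rng = random.Random(0)
cal = [abs(rng.gauss(0, 1)) for _ in range(2000)]     # calibration residuals
test = [abs(rng.gauss(0, 1)) for _ in range(2000)]    # exchangeable test residuals
q = conformal_halfwidth(cal, confidence=0.8)
coverage = sum(r <= q for r in test) / len(test)      # close to 0.8 by construction
```

Raising the confidence level simply selects a larger calibration quantile, widening the prediction ranges, which is the readily inspectable trade-off the abstract describes.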
The hepatic organic anion transporting polypeptides (OATPs) influence the pharmacokinetics of several drug classes and are involved in many clinical drug–drug interactions. Predicting potential interactions with OATPs is, therefore, of value. Here, we developed in vitro and in silico models for identification and prediction of specific and general inhibitors of OATP1B1, OATP1B3, and OATP2B1. The maximal transport activity (MTA) of each OATP in human liver was predicted from transport kinetics and protein quantification. We then used MTA to predict the effects of a subset of inhibitors on atorvastatin uptake in vivo. Using a data set of 225 drug-like compounds, 91 OATP inhibitors were identified. In silico models indicated that lipophilicity and polar surface area are key molecular features of OATP inhibition. MTA predictions identified OATP1B1 and OATP1B3 as major determinants of atorvastatin uptake in vivo. The relative contributions to overall hepatic uptake varied with isoform specificities of the inhibitors.
Drug discovery is a rigorous process that requires billions of dollars of investment and decades of research to bring a molecule “from bench to bedside”. While virtual docking can significantly accelerate the process of drug discovery, it ultimately lags behind the current rate of expansion of chemical databases that already exceed billions of molecular records. This recent surge in small-molecule availability presents great drug discovery opportunities, but also demands much faster screening protocols. To address this challenge, we herein introduce Deep Docking (DD), a novel deep learning platform suitable for docking billions of molecular structures in a rapid, yet accurate fashion. The DD approach utilizes quantitative structure–activity relationship (QSAR) deep models trained on docking scores of subsets of a chemical library to approximate the docking outcome for yet unprocessed entries and, therefore, to remove unfavorable molecules in an iterative manner. The use of the DD methodology in conjunction with the FRED docking program allowed rapid and accurate calculation of docking scores for 1.36 billion molecules from the ZINC15 library against 12 prominent target proteins, and demonstrated up to 100-fold data reduction and 6000-fold enrichment of high-scoring molecules (without notable loss of favorably docked entities). The DD protocol can readily be used in conjunction with any docking program and has been made publicly available.
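The iterative surrogate-guided filtering idea can be sketched as below. Everything here is a toy stand-in: "molecules" are numbers, the identity function replaces a real docking program, and a nearest-neighbour lookup replaces the QSAR deep models; only the loop structure mirrors the described protocol.

```python
import random

def deep_docking_sketch(library, dock, fit_surrogate, n_sample=50,
                        keep_frac=0.5, n_iter=3, seed=0):
    """Iterative surrogate-guided screening in the spirit of Deep Docking:
    dock only a small random sample each round, fit a cheap surrogate to
    the accumulated scores, and keep the fraction of the pool predicted
    to dock best (lower score = better) for the next round."""
    rng = random.Random(seed)
    pool, scored = list(library), {}
    for _ in range(n_iter):
        undocked = [m for m in pool if m not in scored]
        for m in rng.sample(undocked, min(n_sample, len(undocked))):
            scored[m] = dock(m)                   # the expensive step, done sparsely
        predict = fit_surrogate(scored)           # surrogate approximates dock()
        pool.sort(key=predict)                    # best predicted first
        pool = pool[:max(1, int(len(pool) * keep_frac))]
    return pool

def nn_surrogate(scored):
    """Toy surrogate: predict a molecule's score from its nearest docked one."""
    pts = sorted(scored.items())
    return lambda m: min(pts, key=lambda kv: abs(kv[0] - m))[1]

# Toy library where each "molecule" is a number and its docking score is itself.
pool = deep_docking_sketch(range(1000), dock=lambda m: m, fit_surrogate=nn_surrogate)
```

After three halving rounds only 125 of 1000 entries remain, concentrated at the best-scoring end, while the expensive `dock` call was made for only a small fraction of the library.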
High-throughput screening, where thousands of molecules can rapidly be assessed for activity against a protein, has been the dominant approach in drug discovery for many years. However, these methods are costly and require much time and effort. To improve on this situation, we apply in this study an iterative screening process in which an initial set of compounds is selected for screening based on molecular docking. The outcome of the initial screen is then used to classify the remaining compounds through a conformal predictor. The approach was retrospectively validated using 41 targets from the Directory of Useful Decoys, Enhanced (DUD-E), ensuring scaffold diversity among the active compounds. The results show that 57% of the remaining active compounds could be identified while screening only 9.4% of the database. The overall hit rate (7.6%) was also higher than when using docking alone (5.2%). When limiting the search to the top-scored compounds from docking, 39.6% of the active compounds could be identified, compared to 13.5% when screening the same number of compounds solely based on docking. The use of conformal predictors also gives a clear indication of the number of compounds to screen in the next iteration. These results indicate that iterative screening based on molecular docking and conformal prediction can be an efficient way to find active compounds while screening only a small part of the compound collection.
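How conformal prediction sets indicate the size of the next screening round can be sketched as follows, assuming per-class p-values for "active" and "inactive" have already been computed from the initial screen (the numbers below are invented for illustration).

```python
def next_to_screen(p_active, p_inactive, epsilon=0.2):
    """Indices of compounds whose conformal prediction region at
    significance epsilon is exactly {active}, i.e. single-label
    'active' predictions at confidence 1 - epsilon. Their count is a
    direct indication of how many compounds to screen next."""
    return [i for i, (pa, pi) in enumerate(zip(p_active, p_inactive))
            if pa > epsilon and pi <= epsilon]

# Three compounds: confidently active, ambiguous ("both" region),
# and confidently inactive; only the first is selected for screening.
picks = next_to_screen([0.90, 0.50, 0.05], [0.05, 0.50, 0.90])
```

Lowering epsilon (raising confidence) shrinks this single-label set, so the screening budget per iteration follows directly from the chosen confidence level.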
Quantitative structure-activity relationships (QSAR) are critical to the exploitation of the chemical information in toxicology databases. Exploitation can mean extraction of chemical knowledge from the data, but also prediction of new chemicals based on quantitative analysis of past findings. In this study, we analyzed the ToxCast and Tox21 estrogen receptor data sets using conformal prediction to enhance the full exploitation of the information in these data sets. We applied aggregated conformal prediction (ACP) to the ToxCast and Tox21 estrogen receptor data sets using support vector machine classifiers to compare the overall performance of the models but, more importantly, to explore the performance of ACP on data sets that are significantly enriched in one class, without employing sampling strategies for the training set. ACP was also used to investigate the problem of applicability domain using both data sets. Comparison of ACP to previous results obtained on the same data sets using traditional QSAR approaches indicated similar overall balanced performance to methods in which careful training set selections were made, e.g., sensitivity and specificity for the external Tox21 data set of 70-75%, and far superior results to those obtained using traditional methods without training set sampling, where the corresponding results showed a clear imbalance of 50 and 96%, respectively. Application of conformal prediction to imbalanced data sets facilitates an unambiguous analysis of all data, allows accurate predictive models to be built that display similar accuracy in external validation to internal validation and, most importantly, allows an unambiguous treatment of the applicability domain.
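The aggregation step of ACP can be sketched in a few lines: several inductive conformal predictors, each built on its own random calibration split, produce p-values for the same test examples, and these are combined per example. Taking the median, as below, is one common aggregation choice and an assumption of this sketch.

```python
import statistics

def aggregate_p_values(p_from_each_icp):
    """Aggregated conformal prediction (ACP): combine the p-values that
    several independently calibrated ICPs assign to the same test
    examples, here by taking the per-example median."""
    return [statistics.median(ps) for ps in zip(*p_from_each_icp)]

# Three ICPs, two test examples; each inner list is one ICP's p-values.
p = aggregate_p_values([[0.10, 0.90], [0.30, 0.70], [0.20, 0.80]])
```

Aggregating over many random calibration splits reduces the variance a single arbitrary split would introduce, which is why ACP gives more stable prediction regions than one ICP alone.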
Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework, which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate, under the assumption that the test and calibration sets are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data were investigated for the liver toxicity and in vivo micronucleus test (MNT) endpoints. In most cases, a drastic decrease in the validity of the models was observed when they were applied to the time-split or external (holdout) test sets. To overcome this decrease, a strategy of updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. Restored validity is the first prerequisite for applying CP models with confidence. However, the increased validity comes at the cost of decreased model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven a useful approach to restore the validity of most models.
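The recalibration strategy (swapping the calibration set without retraining the model) can be illustrated on synthetic residuals, a stand-in for the study's actual datasets: a drift is simulated by inflating the error scale, which destroys the coverage of intervals calibrated on pre-drift data and restores it once the calibration set is updated.

```python
import math
import random

def halfwidth(cal_abs_residuals, confidence=0.8):
    """Inductive conformal interval half-width from calibration residuals."""
    s = sorted(cal_abs_residuals)
    return s[min(len(s) - 1, math.ceil((len(s) + 1) * confidence) - 1)]

def coverage(q, abs_residuals):
    """Fraction of residuals falling inside an interval of half-width q."""
    return sum(r <= q for r in abs_residuals) / len(abs_residuals)

# Synthetic residuals: the model's errors grow threefold after a data drift.
rng = random.Random(1)
old_cal = [abs(rng.gauss(0, 1)) for _ in range(1000)]  # pre-drift calibration
drifted = [abs(rng.gauss(0, 3)) for _ in range(1000)]  # drifted holdout residuals
new_cal = [abs(rng.gauss(0, 3)) for _ in range(1000)]  # updated calibration set

cov_before = coverage(halfwidth(old_cal), drifted)     # validity lost
cov_after = coverage(halfwidth(new_cal), drifted)      # validity restored
```

Note that the underlying point predictor is untouched; only the calibration residuals change, which is what makes the update cheap compared to retraining. The wider intervals that restore validity are also where the reported loss of efficiency comes from.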
Macrocycles are of increasing interest as chemical probes and drugs for intractable targets like protein-protein interactions, but the determinants of their cell permeability and oral absorption are poorly understood. To enable rational design of cell-permeable macrocycles, we generated an extensive data set under consistent experimental conditions for more than 200 non-peptidic, de novo-designed macrocycles from the Broad Institute's diversity-oriented screening collection. This revealed how specific functional groups, substituents and molecular properties impact cell permeability. Analysis of energy-minimized structures for stereo- and regioisomeric sets provided fundamental insight into how dynamic, intramolecular interactions in the 3D conformations of macrocycles may be linked to physicochemical properties and permeability. Combined use of quantitative structure-permeability modeling and the procedure for conformational analysis now, for the first time, provides chemists with a rational approach to design cell-permeable non-peptidic macrocycles with potential for oral absorption.
Confidence predictors can deliver predictions with the associated confidence required for decision making and can play an important role in drug discovery and toxicity predictions. In this work we investigate a recently introduced version of conformal prediction, synergy conformal prediction, focusing on the predictive performance when applied to bioactivity data. We compare the performance to other variants of conformal predictors for multiple partitioned datasets and demonstrate the utility of synergy conformal predictors for federated learning where data cannot be pooled in one location. Our results show that synergy conformal predictors based on training data randomly sampled with replacement can compete with other conformal setups, while using completely separate training sets often results in worse performance. However, in a federated setup where no method has access to all the data, synergy conformal prediction is shown to give promising results. Based on our study, we conclude that synergy conformal predictors are a valuable addition to the conformal prediction toolbox.
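The federated flavour of synergy conformal prediction can be sketched as follows: each site trains its own model on data that never leaves the site, only nonconformity scores are shared, and a single shared calibration set turns the combined scores into p-values. Averaging the per-model scores, as below, is an assumption of this sketch.

```python
def synergy_p_values(cal_scores_per_model, test_scores_per_model):
    """Sketch of a synergy conformal predictor: the nonconformity of an
    example is the mean of the scores assigned by several independently
    trained models, and a single calibration set converts the combined
    scores into one p-value per test example."""
    cal = [sum(col) / len(col) for col in zip(*cal_scores_per_model)]
    test = [sum(col) / len(col) for col in zip(*test_scores_per_model)]
    return [(sum(c >= t for c in cal) + 1) / (len(cal) + 1) for t in test]

# Two "sites", three shared calibration examples, one test example.
p = synergy_p_values([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]], [[0.25], [0.25]])
```

Because only scores cross site boundaries, the raw training data can stay in place, which is what makes this setup attractive when data cannot be pooled in one location.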