In qualitative or quantitative studies of structure–activity relationships (SARs), machine learning (ML) models are trained to recognize structural patterns that differentiate between active and inactive compounds. Understanding model decisions is challenging but of critical importance to guide compound design. Moreover, the interpretation of ML results provides an additional level of model validation based on expert knowledge. A number of complex ML approaches, especially deep learning (DL) architectures, have a distinctive black-box character. Herein, a locally interpretable explanatory method termed Shapley additive explanations (SHAP) is introduced for rationalizing activity predictions of any ML algorithm, regardless of its complexity. Models resulting from random forest (RF), nonlinear support vector machine (SVM), and deep neural network (DNN) learning are interpreted, and structural patterns determining the predicted probability of activity are identified and mapped onto test compounds. The results indicate that SHAP has high potential for rationalizing predictions of complex ML models.
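As an illustration of the Shapley value concept underlying SHAP, the following minimal sketch computes exact Shapley values for a toy scoring function by enumerating all feature coalitions. It is not the SHAP implementation used in the study: the function `toy_model`, the three-bit fingerprint, and the use of a single all-zero baseline (instead of an expectation over background data) are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all coalitions of features.

    Features inside a coalition take the values of the explained instance x;
    all other features take the baseline values (a simplifying assumption).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(bits):
    # Hypothetical activity score: one additive feature and one pairwise interaction
    return 0.1 + 0.5 * bits[0] + 0.3 * bits[1] * bits[2]


compound = [1, 1, 1]   # hypothetical fingerprint of the explained compound
baseline = [0, 0, 0]   # hypothetical reference with all features absent
phi = shapley_values(toy_model, compound, baseline)
print(phi)
```

In this toy case, the interaction term is split equally between the two features involved, and the contributions sum to the difference between the prediction for the compound and the prediction for the baseline, which is the additivity property that gives SHAP its name.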
Difficulties in interpreting machine learning (ML) models and their predictions limit the practical applicability of and confidence in ML in pharmaceutical research. There is a need for model-agnostic approaches that aid in the interpretation of ML models regardless of their complexity and that are also applicable to deep neural network (DNN) architectures and model ensembles. To these ends, the SHapley Additive exPlanations (SHAP) methodology has recently been introduced. The SHAP approach enables the identification and prioritization of features that determine compound classification and activity prediction using any ML model. Herein, we further extend the evaluation of the SHAP methodology by investigating a variant for exact calculation of Shapley values for decision tree methods and systematically compare this variant in compound activity and potency value predictions with the model-independent SHAP method. Moreover, new applications of the SHAP analysis approach are presented, including the interpretation of DNN models for the generation of multi-target activity profiles and ensemble regression models for potency prediction.
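The contrast between exact and approximate Shapley value calculation can be sketched on a small scale. The example below is a stand-in only: it uses neither the tree-specific exact algorithm nor the model-independent SHAP method referred to above. Instead, exact values are obtained by enumerating all feature orderings and approximate values by sampling random orderings. The hand-written tree `tree_predict`, the all-zero baseline, and the sample size are hypothetical choices.

```python
import random
from itertools import permutations


def tree_predict(bits):
    # Hypothetical hand-written decision tree over three binary features
    if bits[0]:
        return 0.9 if bits[1] else 0.6
    return 0.4 if bits[2] else 0.1


def marginal_contributions(predict, x, baseline, order):
    """Switch features from baseline to x in the given order and record each change."""
    z = list(baseline)
    previous = predict(z)
    contributions = [0.0] * len(x)
    for i in order:
        z[i] = x[i]
        current = predict(z)
        contributions[i] = current - previous
        previous = current
    return contributions


def exact_shapley(predict, x, baseline):
    # Average marginal contributions over all n! feature orderings
    orders = list(permutations(range(len(x))))
    phi = [0.0] * len(x)
    for order in orders:
        for i, c in enumerate(marginal_contributions(predict, x, baseline, order)):
            phi[i] += c / len(orders)
    return phi


def sampled_shapley(predict, x, baseline, n_samples, seed=0):
    # Monte Carlo estimate: average over randomly shuffled orderings
    rng = random.Random(seed)
    order = list(range(len(x)))
    phi = [0.0] * len(x)
    for _ in range(n_samples):
        rng.shuffle(order)
        for i, c in enumerate(marginal_contributions(predict, x, baseline, order)):
            phi[i] += c / n_samples
    return phi


x = [1, 1, 1]
baseline = [0, 0, 0]
exact = exact_shapley(tree_predict, x, baseline)
approx = sampled_shapley(tree_predict, x, baseline, n_samples=2000)
print(exact, approx)
```

Enumeration scales factorially with the number of features, which is why exact calculation is only practical when model structure can be exploited, as in the decision tree variant, while sampling-based estimates trade accuracy for generality.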
Undetected pan-assay interference compounds (PAINS) with false-positive activities in assays often propagate through medicinal chemistry programs and compromise their outcomes. Although a large number of PAINS have been classified, often on the basis of individual studies or chemical experience, little has been done so far to systematically assess their activity profiles. Herein we report a large-scale analysis of the behavior of PAINS in biological screening assays. More than 23 000 extensively tested compounds containing PAINS substructures were detected, and their hit rates were determined. Many consistently inactive compounds were identified. The hit frequency was low overall, with median values of two to five hits for PAINS tested in hundreds of assays. Only confined subsets of PAINS produced abundant hits. The same PAINS substructure was often found in consistently inactive and frequently active compounds, indicating that the structural context in which PAINS occur modulates their effects.
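The hit statistics described above can be sketched as follows. The compound identifiers and assay counts are invented for illustration and do not reproduce any data from the analysis; the sketch only shows how hit rates, the median number of hits, and consistently inactive compounds might be derived from per-compound screening records.

```python
from statistics import median

# Hypothetical screening records: compound id -> (assays tested, assays with a hit)
records = {
    "cpd_A": (300, 0),
    "cpd_B": (250, 2),
    "cpd_C": (410, 5),
    "cpd_D": (120, 48),
}


def hit_rate(tested, hits):
    """Fraction of assays in which the compound was a hit."""
    return hits / tested if tested else 0.0


rates = {cpd: hit_rate(tested, hits) for cpd, (tested, hits) in records.items()}

# Median number of hits across all compounds
median_hits = median(hits for _, hits in records.values())

# Compounds that were tested but never produced a hit
consistently_inactive = sorted(
    cpd for cpd, (tested, hits) in records.items() if tested > 0 and hits == 0
)

print(rates, median_hits, consistently_inactive)
```

Using the median rather than the mean keeps a few frequent hitters, such as the hypothetical `cpd_D`, from dominating the summary.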
Compound potency prediction is a major task in medicinal chemistry and drug design. Inspired by the concept of activity cliffs (which encode large differences in potency between similar active compounds), we have devised a new methodology for predicting potent compounds from weakly potent input molecules. To this end, a chemical language model was implemented consisting of a conditional transformer architecture for compound design guided by observed potency differences. The model was evaluated using a newly generated compound test system enabling a rigorous assessment of its performance. It was shown to predict known potent compounds from different activity classes not encountered during training. Moreover, the model was capable of creating highly potent compounds that were structurally distinct from input molecules. It also produced many novel candidate compounds not included in test sets. Taken together, the findings confirmed the ability of the new methodology to generate structurally diverse, highly potent compounds.
A major challenge in computational chemistry is the generation of novel molecular structures with desirable pharmacological and physicochemical properties. In this work, we investigate the potential use of autoencoders, a deep learning methodology, for de novo molecular design. Various generative autoencoders were used to map molecular structures into a continuous latent space and vice versa, and their performance as structure generators was assessed. Our results show that the latent space preserves the chemical similarity principle and thus can be used for the generation of analogue structures. Furthermore, the latent space created by the autoencoders was searched systematically to generate novel compounds with predicted activity against dopamine receptor type 2, and compounds similar to known active compounds not included in the training set were identified.
In drug discovery, compounds with well-defined activity against multiple targets (multitarget compounds, MT-CPDs) provide the basis for polypharmacology and are thus of high interest. Typically, MT-CPDs for polypharmacology have been discovered serendipitously. Therefore, over the past decade, computational approaches have also been adapted for the design of MT-CPDs or their identification via computational screening. Such approaches continue to be under development and are far from being routine. Recently, different machine learning (ML) models have been derived to distinguish between MT-CPDs and corresponding compounds with activity against the individual targets (single-target compounds, ST-CPDs). When evaluating alternative models for predicting MT-CPDs, we discovered that MT-CPDs could also be accurately predicted with models derived for corresponding ST-CPDs; this was an unexpected finding that we further investigated using explainable ML. The analysis revealed that accurate predictions of ST-CPDs were determined by subsets of structural features of MT-CPDs required for their prediction. These findings provided a chemically intuitive rationale for the successful prediction of MT-CPDs using different ML models and uncovered general-feature subset relationships between MT- and ST-CPDs with activities against different targets.