Profile-quantitative structure–activity relationship (pQSAR) is a massively multitask, two-step machine learning method with unprecedented scope, accuracy, and applicability domain. In step one, a “profile” of conventional single-assay random forest regression models are trained on a very large number of biochemical and cellular pIC50 assays using Morgan 2 substructural fingerprints as compound descriptors. In step two, a panel of partial least squares (PLS) models are built using the profile of pIC50 predictions from those random forest regression models as compound descriptors (hence the name). Previously described for a panel of 728 biochemical and cellular kinase assays, we have now built an enormous pQSAR from 11 805 diverse Novartis (NVS) IC50 and EC50 assays. This large number of assays, and hence of compound descriptors for PLS, dictated reducing the profile by only including random forest regression models whose predictions correlate with the assay being modeled. The random forest regression and pQSAR models were evaluated with our “realistically novel” held-out test set, whose median average similarity to the nearest training set member across the 11 805 assays was only 0.34, comparable to the novelty of compounds actually selected from virtual screens. For the 11 805 single-assay random forest regression models, the median correlation of prediction with the experiment was only r²_ext = 0.05, virtually random, and only 8% of the models achieved our standard success threshold of r²_ext = 0.30. For pQSAR, the median correlation was r²_ext = 0.53, comparable to four-concentration experimental IC50s, and 72% of the models met our r²_ext > 0.30 standard, totaling 8558 successful models. The successful models included assays from all of the 51 annotated target subclasses, as well as 4196 phenotypic assays, indicating that pQSAR can be applied to virtually any disease area. Every month, all models are updated to include new measurements, and predictions are made for 5.5 million NVS compounds, totaling 50 billion predictions. Common uses have included virtual screening, selectivity design, toxicity and promiscuity prediction, mechanism-of-action prediction, and others. Several such actual applications are described.
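The profile-reduction step described in this abstract (keeping only the random forest models whose predictions correlate with the assay being modeled) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy data, and the `min_r2` cutoff are all assumptions, and skipping the target assay's own model is borrowed from the pQSAR 2.0 description rather than stated here.

```python
from math import fsum


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = fsum(x) / n
    mean_y = fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no correlation information
    return (sxy * sxy) / (sxx * syy)


def select_profile(profile, measured, target_assay, min_r2=0.05):
    """Pick the profile columns to hand to the second-step (PLS) model.

    profile      : dict mapping assay id -> predicted pIC50s for the
                   training compounds of the target assay
    measured     : measured pIC50s for those same compounds
    target_assay : id of the assay being modeled (its own model is skipped;
                   an assumption taken from the pQSAR 2.0 description)
    min_r2       : hypothetical cutoff, not a value from the abstract
    """
    kept = []
    for assay_id, predictions in profile.items():
        if assay_id == target_assay:
            continue
        if pearson_r2(predictions, measured) >= min_r2:
            kept.append(assay_id)
    return kept


# Toy data: five training compounds, four hypothetical assays.
measured = [5.1, 6.0, 6.8, 7.5, 8.2]
profile = {
    "assay_A": [5.0, 6.1, 6.9, 7.4, 8.0],  # tracks the target assay closely
    "assay_B": [6.0, 6.0, 6.0, 6.0, 6.0],  # constant predictions
    "assay_C": [7.0, 5.2, 7.9, 5.5, 6.6],  # essentially unrelated
    "target": [5.2, 5.9, 6.7, 7.6, 8.1],   # the target assay's own model
}
selected = select_profile(profile, measured, "target", min_r2=0.3)
```

The same `pearson_r2` helper applied to held-out predictions and measurements gives a quantity of the kind reported as r²_ext, under the assumption that r²_ext is the squared Pearson correlation on the external test set.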
While conventional random forest regression (RFR) virtual screening models appear to have excellent accuracy on random held-out test sets, they prove lacking in actual practice. Analysis of 18 historical virtual screens showed that random test sets are far more similar to their training sets than are the compounds project teams actually order. A new, cluster-based “realistic” training/test set split, which mirrors the chemical novelty of real-life virtual screens, recapitulates the poor predictive power of RFR models in real projects. The original Profile-QSAR (pQSAR) method greatly broadened the domain of applicability over conventional models by using as independent variables a profile of activity predictions from all historical assays in a large protein family. However, the accuracy still fell short of experiment on realistic test sets. The improved “pQSAR 2.0” method replaces probabilities of activity from naïve Bayes categorical models at several thresholds with predicted IC50s from RFR models. Unexpectedly, the high accuracy also requires removing the RFR model for the actual assay of interest from the independent variable profile. With these improvements, pQSAR 2.0 activity predictions are now statistically comparable to medium-throughput four-concentration IC50 measurements even on the realistic test set. Beyond the yes/no activity predictions from a typical high-throughput screen (HTS) or conventional virtual screen, these semiquantitative IC50 predictions allow for predicted potency, ligand efficiency, lipophilic efficiency, and selectivity against antitargets, greatly facilitating hitlist triaging and enabling virtual screening panels such as toxicity panels and overall promiscuity predictions.
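The triage quantities named at the end of this abstract follow directly from a predicted pIC50. The sketch below uses the standard textbook definitions (ligand efficiency as roughly 1.37 × pIC50 per heavy atom in kcal/mol near room temperature, lipophilic efficiency as pIC50 minus log P, selectivity as a pIC50 difference). The function names, the compound records, and their values are hypothetical, and nothing here is taken from the authors' code.

```python
def ligand_efficiency(pic50, heavy_atoms, kcal_per_log_unit=1.37):
    """Approximate binding free energy per heavy atom (kcal/mol per atom).

    1.37 kcal/mol per log unit is RT*ln(10) near room temperature.
    """
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return kcal_per_log_unit * pic50 / heavy_atoms


def lipophilic_efficiency(pic50, logp):
    """Potency corrected for lipophilicity: pIC50 - log P."""
    return pic50 - logp


def selectivity_window(pic50_target, pic50_antitarget):
    """Log-unit gap between target and antitarget potency (positive = selective)."""
    return pic50_target - pic50_antitarget


# Hypothetical hitlist with predicted pIC50s; all values are made up.
hitlist = [
    {"id": "cpd-1", "pic50": 7.0, "anti_pic50": 5.0, "heavy_atoms": 25, "logp": 3.0},
    {"id": "cpd-2", "pic50": 7.4, "anti_pic50": 7.1, "heavy_atoms": 38, "logp": 5.2},
    {"id": "cpd-3", "pic50": 6.5, "anti_pic50": 4.8, "heavy_atoms": 20, "logp": 1.9},
]

# Rank by lipophilic efficiency, highest first.
ranked = sorted(
    hitlist,
    key=lambda c: lipophilic_efficiency(c["pic50"], c["logp"]),
    reverse=True,
)
ranked_ids = [c["id"] for c in ranked]
```

Ranking on an efficiency metric rather than raw potency is one way such predictions could feed hitlist triage; which metric a project weights most is a design choice the abstract does not specify.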
A method is presented for an ultrafast shape-based search workflow for the screening of large compound collections, such as those of vendors. The three-dimensional shape of a molecule dictates its biological activity by enabling the molecule to fit into binding pockets of proteins. Quite often, distinctly different chemical compounds that have similar shapes can bind in a similar way. OpenEye pioneered an algorithm for comparing shapes of molecules by overlaying them in a computer and measuring differences between a query molecule and a target molecule. Overlaying shapes is a computationally intensive process and represents a bottleneck in searching for similar molecules. More recent publications describe alternative methods of overlaying molecules, which are accomplished by comparing shape-based descriptors. These methods were implemented in the Open Drug Discovery Toolkit (ODDT) package. We utilized a combination of open-source software packages like ODDT and RDKit to implement a workflow for ultrafast conformer generation and matching that does not require storing precomputed conformers on the file system or in memory. Moreover, the generated descriptors could be optionally stored in MongoDB for performing searches in the future. To speed up the search, we created a set of indexes from the transformed shape-based descriptors. We are in the process of calculating descriptors for multiple vendors, including Enamine’s “REAL” collection of 1.2 billion compounds. Currently, the shape similarity search on more than 70 million compounds takes less than 8 s! We exemplified our methodology with the screen of compounds that can act as putative TLR4 agonists. The search was based on a literature-known small-molecule TLR4 agonist series. In due course, we identified compounds with novel structural motifs that were active in mouse and human TLR4 reporter cell lines.
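To make "comparing shape-based descriptors" concrete, here is a pure-Python sketch of one well-known descriptor of this kind, ultrafast shape recognition (USR). The abstract does not say which descriptor the workflow uses, so USR is an assumption chosen for illustration. The moment definitions below (mean, standard deviation, cube root of the third central moment) are one common variant, and the coordinates are invented rather than generated conformers.

```python
from math import dist, fsum


def _moments(distances):
    """Mean, standard deviation, and signed cube root of the third central moment."""
    n = len(distances)
    mean = fsum(distances) / n
    second = fsum((d - mean) ** 2 for d in distances) / n
    third = fsum((d - mean) ** 3 for d in distances) / n
    cube_root = abs(third) ** (1.0 / 3.0)
    if third < 0:
        cube_root = -cube_root
    return [mean, second ** 0.5, cube_root]


def usr_descriptor(coords):
    """12-number shape descriptor from a list of (x, y, z) atom positions."""
    n = len(coords)
    centroid = tuple(fsum(c[i] for c in coords) / n for i in range(3))
    to_centroid = [dist(c, centroid) for c in coords]
    closest = coords[to_centroid.index(min(to_centroid))]
    farthest = coords[to_centroid.index(max(to_centroid))]
    to_farthest = [dist(c, farthest) for c in coords]
    far_from_farthest = coords[to_farthest.index(max(to_farthest))]

    descriptor = []
    for reference in (centroid, closest, farthest, far_from_farthest):
        descriptor.extend(_moments([dist(c, reference) for c in coords]))
    return descriptor


def usr_similarity(a, b):
    """USR score in (0, 1]: 1 / (1 + mean absolute descriptor difference)."""
    return 1.0 / (1.0 + fsum(abs(x - y) for x, y in zip(a, b)) / len(a))


# Invented coordinates standing in for one conformer of a query molecule.
query = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 1.2, 0.0),
         (3.4, 1.0, 0.8), (4.0, 2.3, 1.1)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in query]  # same shape, moved
stretched = [(2.0 * x, y, z) for x, y, z in query]            # different shape

score_same = usr_similarity(usr_descriptor(query), usr_descriptor(shifted))
score_diff = usr_similarity(usr_descriptor(query), usr_descriptor(stretched))
```

Because each molecule collapses to a short fixed-length vector, comparison needs no 3D overlay, which is what makes database indexing of the descriptors feasible.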
Predicting solubility of small molecules is a very difficult undertaking due to the lack of reliable and consistent experimental solubility data. It is well known that for a molecule in a crystal lattice to be dissolved, it must, first, dissociate from the lattice and then, second, be solvated. The melting point of a compound is proportional to the lattice energy, and the octanol–water partition coefficient (log P) is a measure of the compound’s solvation efficiency. The CCDC’s melting point dataset of almost one hundred thousand compounds was utilized to create widely applicable machine learning models of small molecule melting points. Using the general solubility equation, the aqueous thermodynamic solubilities of the same compounds can be predicted. The global model could be easily localized by adding additional melting point measurements for a chemical series of interest.
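For reference, the general solubility equation in its usual published form is log S = 0.5 − 0.01 (MP − 25) − log P, with S the molar aqueous solubility and MP the melting point in °C; the melting-point term is taken as zero for compounds that are liquid at 25 °C. A minimal sketch follows. The function name is hypothetical, and in the workflow described above the melting point would come from the machine learning model rather than from measurement.

```python
def gse_log_solubility(melting_point_c, logp):
    """General solubility equation: log10 of molar aqueous solubility.

    melting_point_c : melting point in degrees Celsius (measured or predicted)
    logp            : octanol-water partition coefficient (log P)
    """
    # Compounds melting at or below 25 C contribute no crystal-lattice penalty.
    melting_term = 0.01 * max(melting_point_c - 25.0, 0.0)
    return 0.5 - melting_term - logp


# Worked example: MP = 125 C, log P = 2.0
#   log S = 0.5 - 0.01 * (125 - 25) - 2.0 = -2.5
example_log_s = gse_log_solubility(125.0, 2.0)
```

Each additional 100 °C of melting point costs one log unit of solubility in this equation, the same penalty as one extra unit of log P.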
Resistance to the RAF inhibitor vemurafenib arises commonly in melanomas driven by the activated BRAF oncogene. Here, we report antitumor properties of RAF709, a novel ATP-competitive kinase inhibitor with high potency and selectivity against RAF kinases. RAF709 exhibited a mode of RAF inhibition distinct from RAF monomer inhibitors such as vemurafenib, showing equal activity against both RAF monomers and dimers. As a result, RAF709 inhibited MAPK signaling activity in tumor models harboring either BRAF alterations or mutant N- and KRAS-driven signaling, with minimal paradoxical activation of wild-type RAF. In cell lines and murine xenograft models, RAF709 demonstrated selective antitumor activity in tumor cells harboring BRAF or RAS mutations compared with cells with wild-type BRAF and RAS genes. RAF709 demonstrated a direct pharmacokinetic/pharmacodynamic relationship in tumor models harboring KRAS mutation. Furthermore, RAF709 elicited regression of primary human tumor-derived xenograft models with BRAF, NRAS, or KRAS mutations with excellent tolerability. Our results support further development of inhibitors like RAF709, which represents a next-generation RAF inhibitor with unique biochemical and cellular properties that enable antitumor activities in RAS-mutant tumors.
In an effort to develop RAF inhibitors with the appropriate pharmacological properties to treat RAS mutant tumors, RAF709, a compound with potency, selectivity, and properties, was developed that will not only allow preclinical therapeutic hypothesis testing but also provide an excellent probe to further unravel the complexities of RAF kinase signaling.
CLK2 inhibition has been proposed as a potential mechanism to improve autism and neuronal functions in Phelan-McDermid syndrome (PMDS). Herein, the discovery of a very potent indazole CLK inhibitor series and the CLK2 X-ray structure of the most potent analogue are reported. This new indazole series was identified through a biochemical CLK2 Caliper assay screen with 30k compounds selected by an in silico approach. Novel high-resolution X-ray structures of all CLKs, including the first CLK4 X-ray structure, bound to known CLK2 inhibitor tool compounds (e.g., TG003, CX-4945), are also shown and yield insight into inhibitor selectivity in the CLK family. The efficacy of the new CLK2 inhibitors from the indazole series was demonstrated in the mouse brain slice assay, and potential safety concerns were investigated. Genotoxicity findings in the human lymphocyte micronucleus test (MNT) assay are shown by using two structurally different CLK inhibitors to reveal a major concern for pan-CLK inhibition in PMDS.
A phenotypic screen (PS) is used to identify compounds causing a desired phenotype in a complex biological system where mechanisms and targets are largely unknown. Deconvoluting the mechanism of action of actives and identification of relevant targets and pathways remains a formidable challenge. Current methods fail to use the rich information available regarding compounds and their targets in a systematic way for this deconvolution. We have developed an enrichment analysis algorithm to identify targets associated with the desired phenotype in a rigorous data-driven manner using actives and hundreds of thousands of inactives in a PS, as well as results of thousands of available legacy target-based screens in an institution. Our method quantifies association between the PS and targets while reducing sampling bias, which leads to identification of novel targets, additional chemical matter, and appropriate assays. Its use is illustrated using two examples from our laboratories: TRAIL and DNA fragmentation. Enrichment analysis of these PSs is discussed using both biological pathway analysis and known cell biology to demonstrate the value of our method. We believe this enrichment analysis method is an indispensable tool for the analysis of PSs.
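The abstract does not spell out the enrichment statistic, so the sketch below substitutes a generic one-sided hypergeometric test purely to illustrate the idea of asking whether compounds active in the phenotypic screen are over-represented among the hits of a legacy target-based screen. The function name and the counts are hypothetical, and the sampling-bias correction the authors describe is not reproduced here.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_ps_actives, n_target_hits, n_overlap):
    """One-sided p-value: chance of seeing >= n_overlap shared compounds.

    n_total       : compounds tested in both the phenotypic and target screen
    n_ps_actives  : of those, how many are phenotypic-screen actives
    n_target_hits : of those, how many are hits in the target-based screen
    n_overlap     : compounds that are both PS actives and target hits
    """
    upper = min(n_ps_actives, n_target_hits)
    tail = sum(
        comb(n_target_hits, k) * comb(n_total - n_target_hits, n_ps_actives - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_ps_actives)


# Made-up counts: 10 shared compounds, 3 PS actives, 4 target hits,
# and all 3 PS actives are also target hits.
p_value = hypergeom_enrichment_p(10, 3, 4, 3)
```

In a real analysis this test would be repeated across thousands of legacy screens, so a multiple-testing correction would be needed on top of the raw p-values.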