The biochemical half-maximal inhibitory concentration (IC50) is the most commonly used metric for on-target activity in lead optimization. It is used to guide lead optimization and to build large-scale chemogenomics analyses and off-target activity and toxicity models based on public data. However, the use of public biochemical IC50 data is problematic because the values are assay specific and comparable only under certain conditions. For large-scale analysis it is not feasible to check each data entry manually, and it is very tempting to mix all available IC50 values from public databases even if assay information is not reported. As previously reported for Ki database analysis, we first analyzed the types of errors, the redundancy, and the variability that can be found in the ChEMBL IC50 database. To assess the variability of IC50 data, we searched ChEMBL for sets of at least ten IC50 values, measured independently in two different laboratories, for identical protein-ligand systems against the same target. Because too few such cases are available, the variability of IC50 data was instead assessed by comparing all pairs of independent IC50 measurements on identical protein-ligand systems. The standard deviation of IC50 data is only 25% larger than the standard deviation of Ki data, suggesting that mixing IC50 data from different assays, even without knowing the details of the assay conditions, adds only a moderate amount of noise to the overall data. As expected, the standard deviation of public ChEMBL IC50 data was greater than the standard deviation of in-house intra-laboratory/inter-day IC50 data. Augmenting mixed public IC50 data with public Ki data does not deteriorate the quality of the mixed IC50 data, provided the Ki is corrected by an offset. For a broad dataset such as the ChEMBL database, a Ki-IC50 conversion factor of 2 was found to be the most reasonable.
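The offset correction described above can be illustrated with a minimal sketch. It assumes that the conversion factor of 2 is applied as IC50 ≈ 2 × Ki (the direction expected from the Cheng-Prusoff relation for competitive inhibition, where IC50 is never smaller than Ki); the abstract itself does not spell out the direction, and the function name `ki_to_pic50` is hypothetical.

```python
import math

def ki_to_pic50(ki_molar, conversion_factor=2.0):
    """Estimate a pIC50 from a Ki given in mol/L.

    Assumption (not stated in the abstract): IC50 ~= conversion_factor * Ki.
    """
    if ki_molar <= 0:
        raise ValueError("Ki must be a positive concentration in mol/L")
    ic50_molar = conversion_factor * ki_molar
    # pIC50 is the negative decadic logarithm of the molar IC50.
    return -math.log10(ic50_molar)

# A 1 nM Ki is shifted down by log10(2), about 0.3 log units.
estimated_pic50 = ki_to_pic50(1e-9)
```

On the logarithmic scale a factor of 2 is a constant shift of about 0.3 units, so the correction moves every Ki-derived value by the same amount and leaves rank orders within the Ki data unchanged.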
Diarrhoeal disease is responsible for 8.6% of global child mortality. Recent epidemiological studies found the protozoan parasite Cryptosporidium to be a leading cause of paediatric diarrhoea, with particularly grave impact on infants and immunocompromised individuals. There is neither a vaccine nor an effective treatment. Here we establish a drug discovery process built on scalable phenotypic assays and mouse models that take advantage of transgenic parasites. Screening a library of compounds with anti-parasitic activity, we identify pyrazolopyridines as inhibitors of Cryptosporidium parvum and Cryptosporidium hominis. Oral treatment with the pyrazolopyridine KDU731 results in a potent reduction in intestinal infection of immunocompromised mice. Treatment also leads to rapid resolution of diarrhoea and dehydration in neonatal calves, a clinical model of cryptosporidiosis that closely resembles human infection. Our results suggest that the Cryptosporidium lipid kinase PI(4)K (phosphatidylinositol-4-OH kinase) is a target for pyrazolopyridines and that KDU731 warrants further preclinical evaluation as a drug candidate for the treatment of cryptosporidiosis.
Matched molecular pair analysis (MMPA) has become a major tool for analyzing large chemistry data sets for promising chemical transformations. However, the dependence of MMPA predictions on data constraints such as the number of pairs involved, experimental uncertainty, source of the experiments, and variability of the true physical effect has not yet been described. In this contribution the statistical basics for judging MMPA are analyzed. We illustrate the connection between overall MMPA statistics and individual pairs with a detailed comparison of average ChEMBL hERG MMPA results versus pairs with extreme transformation effects. Comparing the ChEMBL results to Novartis data, we find that significant transformation effects agree very well if the experimental uncertainty is considered. This indicates that caution must be exercised for predictions from insignificant MMPAs, yet it highlights the robustness of statistically validated MMPA and shows that MMPA on public databases can yield results that are very useful for medicinal chemistry.
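As a rough illustration of what judging an MMPA statistically can involve, the sketch below summarizes the property changes observed for one transformation and flags whether the mean effect stands out from its standard error. It is not the procedure used in the work above: the function name `mmp_summary`, the one-sample t-statistic, and the threshold of 2.0 are all assumptions made for this example.

```python
import math
import statistics

def mmp_summary(deltas, t_threshold=2.0):
    """Summarize property changes (e.g. delta pIC50) for one transformation.

    deltas: one value per matched pair (property of B minus property of A).
    t_threshold: hypothetical cutoff on |t|; roughly a 95% level for many pairs.
    """
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two matched pairs")
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)   # sample standard deviation
    sem = sd / math.sqrt(n)         # standard error of the mean
    if sem == 0:
        # All pairs agree exactly; treat a nonzero mean as unbounded evidence.
        t = 0.0 if mean == 0 else math.copysign(math.inf, mean)
    else:
        t = mean / sem
    return {"n": n, "mean": mean, "sem": sem, "t": t,
            "significant": abs(t) > t_threshold}

# Hypothetical data: a consistent transformation and a noisy one.
consistent = mmp_summary([0.5, 0.7, 0.6, 0.4, 0.8])
noisy = mmp_summary([1.0, -1.0, 0.5, -0.6])
```

With few pairs a fixed cutoff of 2.0 is too lenient, so a real analysis would look up the critical value for n - 1 degrees of freedom and also compare the spread of the deltas with the known experimental uncertainty.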
Mutations in the Plasmodium falciparum cyclic amine resistance locus (PfCARL) are associated with parasite resistance to the imidazolopiperazines, a potent class of novel antimalarial compounds that display both prophylactic and transmission-blocking activity, in addition to activity against blood-stage parasites. Here, we show that pfcarl encodes a protein, with a predicted molecular weight of 153 kDa, that localizes to the cis-Golgi apparatus of the parasite in both asexual and sexual blood stages. Utilizing clustered regularly interspaced short palindromic repeat (CRISPR)-mediated gene introduction of 5 variants (L830V, S1076N/I, V1103L, and I1139K), we demonstrate that mutations in pfcarl are sufficient to generate resistance against the imidazolopiperazines in both asexual and sexual blood-stage parasites. We further determined that the mutant PfCARL protein confers resistance to several structurally unrelated compounds. These data suggest that PfCARL modulates the levels of small-molecule inhibitors that affect Golgi-related processes, such as protein sorting or membrane trafficking, and is therefore an important mechanism of resistance in malaria parasites.
Several previous in vitro evolution studies have implicated the Plasmodium falciparum cyclic amine resistance locus (PfCARL) as a potential target of imidazolopiperazines, potent antimalarial compounds with broad activity against different parasite life cycle stages. Given that the imidazolopiperazines are currently being tested in clinical trials, understanding their mechanism of resistance and the cellular processes involved will allow more effective clinical usage.
We describe a file format that is designed to represent mixtures of compounds in a way that is fully machine readable. This Mixfile format is intended to fill the same role for substances that are composed of multiple components as the venerable Molfile does for specifying individual structures. This much-needed data structure is intended to replace current practices for communicating information about mixtures, which usually rely on human-readable text descriptions, drawing several species within a single molecular diagram, or mutually incompatible ad hoc solutions. We describe an open source software application for editing mixture files, which can also be used as web-ready tools for manipulating the file format. We also present a corpus of mixture examples, which we have extracted from collections of text-based descriptions. Furthermore, we present an early look at the proposed IUPAC Mixtures InChI specification, instances of which can be automatically generated using the Mixfile format as a precursor.
Chemical mixtures have recently become a focus of open standards and data structures for capturing machine-readable descriptions for informatics uses. At present, essentially all transmission of information about mixtures is done using short text descriptions that are readable only by trained scientists, and there are no accessible repositories of marked-up mixture data. We have designed a machine learning tool that can interpret mixture descriptions and upgrade them to the high-level Mixfile format, which can in turn be used to generate Mixtures InChI notation. The interpretation achieves a high success rate and can be used at scale to mark up large catalogs and inventories, with some expert checking to catch edge cases. The training data that was accumulated during the project is made openly available, along with previously released mixture editing tools and utilities.
With the emergence of large collections of protein−ligand complexes complemented by binding data, as found in PDBbind or BindingMOAD, new opportunities for parametrizing and evaluating scoring functions have arisen. With huge data collections available, it becomes feasible to fit scoring functions in a QSAR style, i.e., by defining protein−ligand interaction descriptors and analyzing them with modern machine-learning methods. As with any data modeling approach, care has to be taken to validate the model carefully. Here, we show that there are large differences measured in R (0.77 vs 0.46) or R² (0.59 vs 0.21) for a relatively simple scoring function depending on whether it is validated against the PDBbind core set or validated in a leave-cluster-out cross-validation. If proteins from the same family are present in both the training and validation set, the prediction quality estimated by standard validation techniques looks too optimistic.
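The idea behind leave-cluster-out cross-validation can be sketched in a few lines: every complex carries a cluster label (for example a protein family), and each fold holds out one entire cluster so that no family appears in both training and validation data. This is an illustrative sketch only; the function name `leave_cluster_out_splits` and the example labels are hypothetical, and how the clusters are defined is left open here.

```python
from collections import defaultdict

def leave_cluster_out_splits(cluster_labels):
    """Yield (held_out_label, train_indices, test_indices) for each cluster.

    cluster_labels: one label per sample, e.g. the protein family of a complex.
    """
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)
    # Iterate in sorted label order so the folds are reproducible.
    for label in sorted(by_cluster):
        test_idx = by_cluster[label]
        train_idx = [i for i, other in enumerate(cluster_labels) if other != label]
        yield label, train_idx, test_idx

# Hypothetical labels for four complexes.
labels = ["kinase", "protease", "kinase", "hsp90"]
splits = list(leave_cluster_out_splits(labels))
```

Because a whole family is withheld in each fold, the model is always scored on proteins unlike those it was trained on, which is why the resulting estimate tends to be lower than one from a random split.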
The maximum achievable accuracy of in silico models depends on the quality of the experimental data. Consequently, experimental uncertainty defines a natural upper limit to the predictive performance possible. Models that yield errors smaller than the experimental uncertainty are necessarily overtrained. A reliable estimate of the experimental uncertainty is therefore of high importance to all originators and users of in silico models. The data deposited in ChEMBL was analyzed for reproducibility, i.e., the experimental uncertainty of independent measurements. Careful filtering of the data was required because ChEMBL contains unit-transcription errors, undifferentiated stereoisomers, and repeated citations of single measurements (90% of all pairs). The experimental uncertainty is estimated to yield a mean error of 0.44 pKi units, a standard deviation of 0.54 pKi units, and a median error of 0.34 pKi units. The maximum possible squared Pearson correlation coefficient (R²) on large data sets is estimated to be 0.81.
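One simple way to see how experimental noise caps R² is the following sketch. It assumes that each measured value is the true value plus independent noise, so that even a perfect model can explain at most the fraction of the observed variance that is not noise. This is a textbook approximation and not necessarily the procedure behind the estimate quoted above; the function name `max_r2` and the data-set spread of 1.24 pKi units are hypothetical.

```python
def max_r2(sigma_experiment, sigma_observed):
    """Upper bound on squared Pearson correlation under additive noise.

    sigma_experiment: standard deviation of the experimental error.
    sigma_observed: standard deviation of the measured values in the data set.
    Assumes measured = true + independent noise, so the noise variance
    cannot exceed the observed variance.
    """
    if sigma_observed <= 0:
        raise ValueError("sigma_observed must be positive")
    if sigma_experiment < 0 or sigma_experiment > sigma_observed:
        raise ValueError("need 0 <= sigma_experiment <= sigma_observed")
    return 1.0 - (sigma_experiment ** 2) / (sigma_observed ** 2)

# Error SD of 0.54 pKi units (from the text) and a hypothetical
# spread of 1.24 pKi units across the measured values.
bound = max_r2(0.54, 1.24)
```

Under this approximation the bound depends on the spread of the data as much as on the noise: the same 0.54 pKi units of error is far more limiting on a narrow activity range than on a wide one.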