Abstract
ChEMBL is a large, open-access bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012, 2014 and 2017 Nucleic Acids Research Database Issues. In the last two years, several important improvements have been made to the database and are described here. These include more robust capture and representation of assay details; a new data deposition system, allowing updating of data sets and deposition of supplementary data; and a completely redesigned web interface, with enhanced search and filtering capabilities.
The pharmaceutical industry remains under huge pressure to address the high attrition rates in drug development. Attempts to reduce the number of efficacy- and safety-related failures by analysing possible links to the physicochemical properties of small-molecule drug candidates have been inconclusive because of the limited size of data sets from individual companies. Here, we describe the compilation and analysis of combined data on the attrition of drug candidates from AstraZeneca, Eli Lilly and Company, GlaxoSmithKline and Pfizer. The analysis reaffirms that control of physicochemical properties during compound optimization is beneficial in identifying compounds of candidate drug quality and indicates for the first time a link between the physicochemical properties of compounds and clinical failure due to safety issues. The results also suggest that further control of physicochemical properties is unlikely to have a significant effect on attrition rates and that additional work is required to address safety-related failures. Further cross-company collaborations will be crucial to future progress in this area.
Background
The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised.
Results
A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures.
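The three-stage flow described above (check, standardise, derive the parent) can be illustrated with a toy sketch. This is not the published pipeline: it works on SMILES strings in plain Python rather than on RDKit molecules, and the function names, the two checks and the small `SALTS_AND_SOLVENTS` list are all hypothetical stand-ins for the much richer rules of the real components.

```python
# Hypothetical, deliberately tiny list of counter-ion and solvent fragments
# written as SMILES. The real pipeline uses its own curated definitions.
SALTS_AND_SOLVENTS = {"[Na+]", "[K+]", "[Cl-]", "Cl", "O"}


def check(smiles: str) -> list:
    """Checker stage (toy): report problems without modifying the structure."""
    issues = []
    if not smiles.strip():
        issues.append("empty structure")
    if smiles.count("(") != smiles.count(")"):
        issues.append("unbalanced branch parentheses")
    return issues


def standardize(smiles: str) -> str:
    """Standardizer stage (toy): apply a formatting convention, here only
    trimming surrounding whitespace."""
    return smiles.strip()


def get_parent(smiles: str) -> str:
    """GetParent stage (toy): drop dot-separated fragments that appear in the
    salt/solvent list and keep the rest."""
    fragments = smiles.split(".")
    kept = [f for f in fragments if f not in SALTS_AND_SOLVENTS]
    # If every fragment is a salt or solvent, leave the structure unchanged.
    return ".".join(kept) if kept else smiles


def curate(smiles: str) -> dict:
    """Run the three stages in order and collect their outputs."""
    standardized = standardize(smiles)
    return {
        "issues": check(smiles),
        "standardized": standardized,
        "parent": get_parent(standardized),
    }


if __name__ == "__main__":
    # Sodium acetate with stray whitespace: no issues are flagged and the
    # sodium counter-ion is removed from the parent.
    print(curate(" CC(=O)[O-].[Na+] "))
```

Keeping the checker separate from the two transforming stages mirrors the description above: validity problems are flagged for curation rather than silently repaired. A real parent-generation step would also handle charge neutralisation, which this sketch ignores.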
Conclusion
All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
Abstract
ChEMBL (https://www.ebi.ac.uk/chembl/) is a manually curated, high-quality, large-scale, open, FAIR and Global Core Biodata Resource of bioactive molecules with drug-like properties, previously described in the 2012, 2014, 2017 and 2019 Nucleic Acids Research Database Issues. Since its introduction in 2009, ChEMBL’s content has changed dramatically in size and diversity of data types. Through incorporation of multiple new datasets from depositors since the 2019 update, ChEMBL now contains slightly more bioactivity data from deposited data than from data extracted from literature. In collaboration with the EUbOPEN consortium, chemical probe data is now regularly deposited into ChEMBL. Release 27 made curated data available for compounds screened for potential anti-SARS-CoV-2 activity from several large-scale drug repurposing screens. In addition, new patent bioactivity data have been added to the latest ChEMBL releases, and various new features have been incorporated, including a Natural Product likeness score, updated flags for Natural Products, a new flag for Chemical Probes, and the initial annotation of the action type for ∼270 000 bioactivity measurements.
Structure–activity relationship modelling is frequently used in the early stage of drug discovery to assess the activity of a compound on one or several targets, and can also be used to assess the interaction of compounds with liability targets. QSAR models have been used for these and related applications over many years, with good success. Conformal prediction is a relatively new QSAR approach that provides information on the certainty of a prediction, and so helps in decision-making. However, it is not always clear how best to make use of this additional information. In this article, we describe a case study that directly compares conformal prediction with traditional QSAR methods for large-scale predictions of target-ligand binding. The ChEMBL database was used to extract a data set comprising data from 550 human protein targets with different bioactivity profiles. For each target, a QSAR model and a conformal predictor were trained and their results compared. The models were then evaluated on new data published since the original models were built to simulate a “real world” application. The comparative study highlights the similarities between the two techniques but also some differences that it is important to bear in mind when the methods are used in practical drug discovery applications.
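To make the contrast between the two kinds of output concrete, the following minimal sketch shows a class-conditional (Mondrian) inductive conformal predictor next to a conventional thresholded classifier. It is an illustration under stated assumptions, not the method used in the study: the nonconformity measure (one minus the predicted probability of the candidate label), the function names and the calibration scores are all hypothetical.

```python
def p_value(calibration_scores, test_score):
    """Conformal p-value: share of calibration nonconformity scores that are
    at least as large as the test score, with the usual +1 correction."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def conformal_prediction_set(prob_active, calibration, significance):
    """Return every label whose p-value exceeds the significance level.

    prob_active  -- underlying model's predicted probability of 'active'
    calibration  -- dict mapping each label to the nonconformity scores of
                    held-out calibration compounds with that true label
    significance -- tolerated error rate, e.g. 0.25
    """
    # Assumed nonconformity measure: 1 - probability of the candidate label.
    scores = {"active": 1.0 - prob_active, "inactive": prob_active}
    return {
        label
        for label, score in scores.items()
        if p_value(calibration[label], score) > significance
    }


def qsar_label(prob_active, threshold=0.5):
    """Conventional QSAR-style output: always exactly one label."""
    return "active" if prob_active >= threshold else "inactive"


if __name__ == "__main__":
    # Hypothetical calibration scores, for illustration only.
    calibration = {
        "active": [0.1, 0.2, 0.3, 0.6],
        "inactive": [0.05, 0.15, 0.4, 0.5],
    }
    for prob in (0.9, 0.5):
        print(prob, qsar_label(prob),
              conformal_prediction_set(prob, calibration, significance=0.25))
```

With these inputs the confident compound (0.9) receives the single-label set `{'active'}`, whereas the borderline compound (0.5) receives both labels, signalling that the model cannot separate the classes at the chosen significance level. The thresholded classifier reports "active" in both cases.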
Low success rates during drug development are due, in part, to the difficulty of defining drug mechanism‐of‐action and molecular markers of therapeutic activity. Here, we integrated 199,219 drug sensitivity measurements for 397 unique anti‐cancer drugs with genome‐wide CRISPR loss‐of‐function screens in 484 cell lines to systematically investigate cellular drug mechanism‐of‐action. We observed an enrichment for positive associations between the profile of drug sensitivity and knockout of a drug's nominal target, and by leveraging protein–protein networks, we identified pathways underpinning drug sensitivity. This revealed an unappreciated positive association between mitochondrial E3 ubiquitin–protein ligase MARCH5 dependency and sensitivity to MCL1 inhibitors in breast cancer cell lines. We also estimated drug on‐target and off‐target activity, informing on specificity, potency and toxicity. Linking drug and gene dependency together with genomic data sets uncovered contexts in which molecular networks when perturbed mediate cancer cell loss‐of‐fitness and thereby provide independent and orthogonal evidence of biomarkers for drug development. This study illustrates how integrating cell line drug sensitivity with CRISPR loss‐of‐function screens can elucidate mechanism‐of‐action to advance drug development.
Synopsis
This study integrates pharmacological and CRISPR screens in 484 cancer cell lines to systematically investigate anticancer drug mechanism of action, yielding insights into the genetic contexts and cellular networks underpinning drug response.
CRISPR screens reveal important aspects of drug mechanism‐of‐action, specifically in the context of cellular activity, isoform specificity, off‐target and polypharmacological effects.
By leveraging protein interaction networks that underlie drug‐responses, novel drug‐target interactions involving anti‐apoptotic MCL1 inhibitors are identified.
Improved pharmacogenomic biomarker discovery using two independent and orthogonal cell viability screens.
Development of adaptive immunity after COVID-19 and after vaccination against SARS-CoV-2 is predicated on recognition of viral peptides, presented on HLA class II molecules, by CD4+ T-cells. We capitalised on extensive high-resolution HLA data on twenty-five human race/ethnic populations to investigate the role of HLA polymorphism in SARS-CoV-2 immunogenicity at the population and individual level. Within populations, we identify wide inter-individual variability in predicted peptide presentation from structural, non-structural and accessory SARS-CoV-2 proteins, according to individual HLA genotype. However, we find similar potential for anti-SARS-CoV-2 cellular immunity at the population level, suggesting that HLA polymorphism is unlikely to account for observed disparities in clinical outcomes after COVID-19 among different race/ethnic groups. Our findings provide important insight into the potential role of HLA polymorphism in the development of protective immunity after SARS-CoV-2 infection and after vaccination, and a firm basis for further experimental studies in this field.
The safety of marketed drugs is an ongoing concern, with some of the more frequently prescribed medicines resulting in serious or life-threatening adverse effects in some patients. Safety-related information for approved drugs has been curated to include the assignment of toxicity class(es) based on their withdrawn status and/or black box warning information described on medicinal product labels. The ChEMBL resource contains a wide range of bioactivity data types, from early “Discovery” stage preclinical data for individual compounds through to postclinical data on marketed drugs; the inclusion of the curated drug safety data set within this framework can support a wide range of safety-related drug discovery questions. The curated drug safety data set will be made freely available through ChEMBL and updated in future database releases.