Drug-induced liver injury (DILI) presents a significant challenge to drug development and regulatory science. The FDA's Liver Toxicity Knowledge Base (LTKB) evaluated >1000 drugs for their likelihood of causing DILI in humans, of which >700 drugs were classified into three categories (most-DILI, less-DILI, and no-DILI). Based on this dataset, we developed and compared 2-class and 3-class DILI prediction models using the Decision Forest (DF) machine learning algorithm with Mold2 structural descriptors. The models were evaluated through 1000 iterations of 5-fold cross-validation, 1000 bootstrapping validations, and 1000 permutation tests (which assessed chance correlation). Furthermore, prediction confidence analysis was conducted, providing an additional parameter for proper interpretation of prediction results. We revealed that the 3-class model not only had a higher resolution for estimating DILI risk but also showed an improved capability to differentiate most-DILI drugs from no-DILI drugs in comparison with the 2-class model. We demonstrated the utility of the models for drug ingredients with warnings recently issued by the FDA. Moreover, we identified informative molecular features important for assessing DILI risk. Our results suggest that the 3-class model is a better option than the binary model (on which most publications focus) for drug safety evaluation.
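A minimal sketch of the validation scheme described above, using Python and scikit-learn: repeated stratified 5-fold cross-validation plus a permutation test to estimate chance correlation. RandomForestClassifier stands in for Decision Forest (which is not available in scikit-learn), and X and y are random placeholders, not the LTKB descriptors or labels; iteration counts are reduced for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     permutation_test_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))       # placeholder for Mold2 descriptors
y = rng.integers(0, 3, size=300)     # placeholder 3-class DILI labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Repeated 5-fold cross-validation (1000 iterations in the paper; 10 here).
scores = []
for i in range(10):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    scores.extend(cross_val_score(clf, X, y, cv=cv))
print("mean CV accuracy: %.3f" % np.mean(scores))

# Permutation test: refit on label-shuffled data to estimate chance correlation.
score, perm_scores, pvalue = permutation_test_score(
    clf, X, y, cv=5, n_permutations=100, random_state=0)
print("accuracy %.3f vs. permuted mean %.3f (p=%.3f)"
      % (score, perm_scores.mean(), pvalue))
```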
Persistent organic pollutants (POPs) present in foods have been a major concern for food safety due to their persistence and toxic effects. To ensure food safety and protect human health from POPs, it is critical to achieve a better understanding of POP pathways into food and to develop strategies to reduce human exposure. POPs can be present in raw foods, transferred from the environment, or introduced during food preparation steps. Exposure to these pollutants may cause various health problems such as endocrine disruption, cardiovascular diseases, cancers, diabetes, birth defects, and dysfunctional immune and reproductive systems. This review describes potential sources of POP food contamination, analytical approaches to measure POP levels in food, and efforts to control food contamination with POPs.
Humans and animals may be exposed to tens of thousands of natural and synthetic chemicals during their lifespan. It is difficult to assess the risk of all these chemicals with experimental toxicity tests. An alternative approach is to use computational toxicology methods such as quantitative structure–activity relationship (QSAR) modeling. Mitochondrial toxicity is involved in many diseases such as cancer, neurodegeneration, type 2 diabetes, cardiovascular diseases, and autoimmune diseases. Thus, it is important to rapidly and efficiently identify chemicals with mitochondrial toxicity. In this study, five machine learning algorithms and twelve types of molecular fingerprints were employed to generate QSAR discriminant models for mitochondrial toxicity. A threshold moving method was adopted to resolve the imbalance issue in the training data. Building consensus models by averaging predicted probabilities improved prediction performance (see the sketch after the highlights below). The best model has correct classification rates of 81.8% and 88.3% in ten-fold cross-validation and external validation, respectively. Substructures such as phenol, carboxylic acid, nitro, and aryl chloride were found to be informative through analysis of information gain and substructure frequency. The results demonstrate that resolving imbalance in training data and building consensus models can improve classification rates for mitochondrial toxicity prediction.
• Discriminant models of mitochondrial toxicity were developed with machine learning algorithms.
• Performance of consensus models for predicting mitochondrial toxicity is better than that of individual models.
• Resolving imbalance in training can improve models for mitochondrial toxicity prediction.
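An illustrative sketch of the two ideas highlighted above: a consensus prediction obtained by averaging probabilities across heterogeneous models, and threshold moving to compensate for class imbalance. The data, the two base learners, and the choice of the minority-class prevalence as the moved threshold are assumptions for illustration, not details from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a toxicity dataset (85% inactive).
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Consensus: average the predicted probabilities of the individual models.
models = [RandomForestClassifier(random_state=0),
          LogisticRegression(max_iter=1000)]
probas = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
                  for m in models], axis=0)

# Threshold moving: call a chemical toxic when its averaged probability
# exceeds the minority-class prevalence rather than the default 0.5.
threshold = y_tr.mean()
y_pred = (probas >= threshold).astype(int)
print("positives at 0.5: %d, at moved threshold: %d"
      % ((probas >= 0.5).sum(), y_pred.sum()))
```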
Multi-label classification of data remains a challenging problem. Because of the complexity of the data, it is sometimes difficult to infer information about classes that are not mutually exclusive. For medical data, patients can have symptoms of multiple diseases at the same time, and it is important to develop tools that help identify problems early. Intelligent health risk prediction models built with deep learning architectures offer a powerful tool for physicians to identify patterns in patient data that indicate risks associated with certain types of chronic diseases.
Physical examination records of 110,300 anonymous patients were used to predict diabetes, hypertension, fatty liver, combinations of these three chronic diseases, and the absence of disease (8 classes in total). The dataset was split into training (90%) and testing (10%) sub-datasets. Ten-fold cross-validation was used to evaluate prediction accuracy with metrics such as precision, recall, and F-score. Deep Learning (DL) architectures were compared with standard and state-of-the-art multi-label classification methods. Preliminary results suggest that Deep Neural Networks (DNN), a DL architecture, when applied to multi-label classification of chronic diseases, produced accuracy comparable to that of common methods such as Support Vector Machines. We implemented DNNs to handle both problem-transformation and algorithm-adaptation multi-label methods and compared the two to determine which is preferable.
Deep Learning architectures have the potential to infer more information about the patterns in physical examination data than common classification methods. The advanced techniques of Deep Learning can be used to identify the significance of different features of physical examination data and to learn how each feature contributes to a patient's risk of chronic disease. However, accurate prediction of chronic disease risk remains a challenging problem that warrants further study.
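A minimal sketch contrasting the two multi-label strategies named in the abstract above, with scikit-learn's MLPClassifier serving as a small stand-in for a DNN: binary relevance (problem transformation, one binary classifier per disease) versus a single network with one sigmoid output per label (algorithm adaptation). The synthetic data and network sizes are assumptions; they are not the examination records or the architectures from the study.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic multi-label data: 3 labels, samples may carry several at once.
X, Y = make_multilabel_classification(n_samples=1000, n_classes=3,
                                      n_labels=2, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Problem transformation: one independent binary classifier per disease.
bin_rel = MultiOutputClassifier(LogisticRegression(max_iter=1000))
bin_rel.fit(X_tr, Y_tr)

# Algorithm adaptation: one network trained directly on the label matrix
# (MLPClassifier accepts 2-D indicator targets natively).
dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0)
dnn.fit(X_tr, Y_tr)

for name, model in [("binary relevance", bin_rel), ("multi-label DNN", dnn)]:
    print("%s micro-F1: %.3f"
          % (name, f1_score(Y_te, model.predict(X_te), average="micro")))
```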
Since November 2021, Omicron has been the dominant severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variant that causes coronavirus disease 2019 (COVID-19) and has continuously impacted human health. Omicron sublineages continue to emerge and cause increased transmission and infection rates. The additional 15 mutations in the receptor binding domain (RBD) of the Omicron spike protein change the protein conformation, enabling the Omicron variant to evade neutralizing antibodies. For this reason, many efforts have been made to design new antigenic variants that induce effective antibodies in SARS-CoV-2 vaccine development. However, the different states of Omicron spike proteins with and without bound external molecules have not yet been fully characterized. In this review, we analyze the structures of the spike protein in the presence and absence of angiotensin-converting enzyme 2 (ACE2) and antibodies. Compared with previously determined structures of the wildtype spike protein and of other variants such as alpha, beta, delta, and gamma, the Omicron spike protein adopts a partially open form. The open form with one RBD up is dominant, followed by the open form with two RBDs up and the closed form with the RBDs down. It is suggested that competition between antibodies and ACE2 induces interactions between adjacent RBDs of the spike protein, which lead to the partially open form of the Omicron spike protein. This comprehensive structural information on Omicron spike proteins could be helpful for the efficient design of vaccines against the Omicron variant.
Low drug productivity has been a significant problem for the pharmaceutical industry for several decades, even though numerous novel technologies have been introduced during this period. The currently prevailing pharmacologic dogma, "single drug, single target, single disease", is at the root of the lack of drug productivity. From a systems biology viewpoint, network pharmacology has been proposed to complement the established guiding pharmacologic approaches. The rationale for network pharmacology as a major component of drug discovery and development is that a disease can be caused by perturbation of a disease-causing network, and a drug may be designed to interact with multiple targets to modulate such a network from the disease state toward the normal state. Therefore, network pharmacology has been applied to guide and assist drug repositioning. Drugs may exert their therapeutic effects by directly targeting disease-associated proteins, but they may also modulate the pathways involved in the pathological process. In this review, we discuss progress and prospects in network pharmacology, focusing on drug off-target discovery, disease-associated protein identification, and pathway analysis for elucidating the relationships between drug targets and disease-associated proteins.
The specificity of toxicant–target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlap and a higher false prediction rate. In this study, to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples and then clean the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consists of 12 in vitro bioassays for >10,000 chemicals distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which the F1 score, Matthews correlation coefficient, and Brier score provided a more consistent assessment of overall performance across the 12 datasets. Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. We also found a strong negative correlation between prediction accuracy and the imbalance ratio (IR), defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when the IR exceeded a certain threshold (e.g., >28). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through data rebalancing.
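A hedged sketch of the SMN configuration described above: SMOTEENN resampling of the training data followed by a bagged Random Forest, using the imbalanced-learn library. The synthetic data stand in for a Tox21 assay with a high imbalance ratio; sample sizes and hyperparameters are illustrative assumptions.

```python
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced bioassay (95% inactive).
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class with SMOTE, then clean mislabeled
# instances with Edited Nearest Neighbours (the ENN step).
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)

# Random Forest as the base classifier, bagging as the ensemble strategy.
clf = BaggingClassifier(RandomForestClassifier(random_state=0),
                        n_estimators=10, random_state=0)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_te)
print("F1: %.3f  MCC: %.3f" % (f1_score(y_te, y_pred),
                               matthews_corrcoef(y_te, y_pred)))
```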
The U.S. Tox21 and EPA ToxCast programs screen thousands of environmental chemicals for bioactivity using hundreds of high-throughput in vitro assays to build predictive models of toxicity. We represented chemicals using bioactivity and chemical structure descriptors, then used supervised machine learning to predict in vivo hepatotoxic effects. A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies). Hepatotoxicants were defined by rat liver histopathology observed after chronic chemical testing and grouped into hypertrophy (161), injury (101), and proliferative lesions (99). Classifiers were built using six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB). Classifiers of hepatotoxicity were built using chemical structure descriptors, ToxCast bioactivity descriptors, and hybrid descriptors. Predictive performance was evaluated using 10-fold cross-validation with in-loop, filter-based feature subset selection. Hybrid classifiers had the best balanced accuracy for predicting hypertrophy (0.84 ± 0.08), injury (0.80 ± 0.09), and proliferative lesions (0.80 ± 0.10). Although chemical and bioactivity classifiers had similar balanced accuracy, the former were more sensitive and the latter more specific. CART, ENSMB, and SVM classifiers performed best, and nuclear receptor activation and mitochondrial functions appeared frequently in highly predictive classifiers of hepatotoxicity. ToxCast and ToxRefDB provide the largest and richest publicly available data sets for mining linkages between the in vitro bioactivity of environmental chemicals and their adverse histopathological outcomes. Our findings demonstrate the utility of high-throughput assays for characterizing rodent hepatotoxicants, the benefit of hybrid representations that integrate bioactivity and chemical structure, and the need for objective evaluation of classification performance.
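A minimal sketch of the "in-loop" filter-based feature selection mentioned above: placing the selector inside a pipeline ensures it is refit on each training fold during cross-validation, avoiding selection bias. The descriptor matrix, the univariate F-test filter, and k=50 are assumptions for illustration, not the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Placeholder for a wide descriptor matrix (e.g., hybrid descriptors).
X, y = make_classification(n_samples=677, n_features=500,
                           n_informative=20, random_state=0)

# The filter runs inside each CV fold, never on the held-out data.
pipe = Pipeline([("select", SelectKBest(f_classif, k=50)),
                 ("svm", SVC(kernel="rbf"))])
scores = cross_val_score(pipe, X, y, cv=10, scoring="balanced_accuracy")
print("balanced accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```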
Drug-induced liver injury (DILI) is one of the leading causes of termination of drug development programs. Consequently, identifying the risk of DILI in humans for drug candidates during the early stages of development would greatly reduce the drug attrition rate in the pharmaceutical industry but would require the implementation of new research and development strategies. In this regard, several in silico models have been proposed as alternative means of prioritizing drug candidates. Because the accuracy and utility of a predictive model rest largely on annotating the potential of a drug to cause DILI in a reliable and consistent way, Food and Drug Administration-approved drug labeling was given prominence. Of 387 annotated drugs, 197 were used to develop a quantitative structure-activity relationship (QSAR) model, which was subsequently challenged with the remaining drugs serving as an external validation set, yielding an overall prediction accuracy of 68.9%. The performance of the model was further assessed with 2 additional independent validation sets; the 3 validation data sets comprise a total of 483 unique drugs. We observed that the QSAR model's performance varied for drugs with different therapeutic uses; however, it achieved better estimated accuracy (73.6%) and negative predictive value (77.0%) when focusing only on the therapeutic categories with high prediction confidence. Thus, the model's applicability domain was defined. Taken collectively, the developed QSAR model has potential utility for prioritizing compounds' risk of DILI in humans, particularly within the high-confidence therapeutic subgroups such as analgesics, antibacterial agents, and antihistamines.
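An illustrative sketch of confidence-gated prediction in the spirit of the abstract above: only predictions whose class probability clears a threshold are reported, defining a high-confidence subset on which accuracy can be assessed separately. The 0.7 cutoff and the synthetic data are assumptions for illustration, not values from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)
confidence = proba.max(axis=1)       # per-drug prediction confidence

covered = confidence >= 0.7          # keep only high-confidence calls
y_pred = proba.argmax(axis=1)
acc_all = (y_pred == y_te).mean()
acc_hi = (y_pred[covered] == y_te[covered]).mean()
print("coverage %.2f, accuracy overall %.3f vs high-confidence %.3f"
      % (covered.mean(), acc_all, acc_hi))
```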
Accurate de novo genome assembly has become a reality with advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure that reflects the overall quality of an assembly.
To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET), which generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds, and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms, and six de novo assemblies for sample NA24385, we tested dnAQET's capability for benchmarking the quality evaluation of genome assemblies. For synthetic data, our quality score increased as the number of misassemblies and the redundancy decreased and as the average contig length and coverage increased, as expected. For genome builds, the dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. For comparison with some of the most frequently used measures, 13 other quality measures were calculated. The dnAQET quality score was more consistent with the known quality of the reference genomes than all other measures, indicating that dnAQET is reliable for benchmarking the quality assessment of de novo genome assemblies.
dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrate that the dnAQET quality score is reliable for benchmarking the quality assessment of genome assemblies. dnAQET can help researchers identify the most suitable assembly tools and select the highest-quality assemblies they generate.
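A toy sketch of the aggregation idea described above: combining per-scaffold alignment quality scores into one assembly-level score using reference coverage and a redundancy penalty. The multiplicative weighting below is an assumption made purely for illustration; it is not the published dnAQET formula.

```python
import numpy as np

def assembly_score(scaffold_scores, ref_coverage, redundant_fraction):
    """Toy aggregation (NOT the dnAQET formula).

    scaffold_scores: per-scaffold quality scores in [0, 1]
    ref_coverage: fraction of the reference genome covered by the assembly
    redundant_fraction: fraction of aligned bases identified as redundant
    """
    # Length-weighting is omitted for brevity; a realistic score would
    # weight each scaffold's contribution by its length.
    base = np.mean(scaffold_scores)
    return base * ref_coverage * (1.0 - redundant_fraction)

# Example: three scaffolds, 96% reference coverage, 3% redundancy.
print(assembly_score(np.array([0.98, 0.91, 0.85]), 0.96, 0.03))
```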