Identification of drug-target interactions (DTIs) plays a key role in drug discovery. The high cost and labor-intensive nature of in vitro and in vivo experiments have highlighted the importance of in silico-based DTI prediction approaches. In several computational models, conventional protein descriptors have been shown to be insufficiently informative for accurate DTI prediction. Thus, in this study, we propose a deep learning-based DTI prediction model that captures local residue patterns of proteins participating in DTIs. When we employ a convolutional neural network (CNN) on raw protein sequences, we perform convolution on amino acid subsequences of various lengths to capture local residue patterns of generalized protein classes. We train our model with large-scale DTI information and demonstrate the performance of the proposed model using an independent dataset that is not seen during the training phase. As a result, our model performs better than previous protein descriptor-based models. Also, our model performs better than the recently developed deep learning models for massive prediction of DTIs. By examining pooled convolution results, we confirmed that our model can detect binding sites of proteins for DTIs. In conclusion, our prediction model for detecting local residue patterns of target proteins successfully enriches the protein features of a raw protein sequence, yielding better prediction results than previous approaches. Our code is available at https://github.com/GIST-CSBL/DeepConv-DTI.
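The idea of convolving over amino acid subsequences of several lengths and then pooling can be illustrated with a minimal sketch. It is not the DeepConv-DTI implementation: the kernels here are hand-built from hypothetical motifs ("AYI", "GHGS") rather than learned, the sequence is invented, and the function names are assumptions made for illustration. It shows how global max pooling keeps, for each kernel, the strongest local match and the position where it occurs.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [
        [1.0 if index.get(aa) == i else 0.0 for i in range(len(AMINO_ACIDS))]
        for aa in seq
    ]


def conv_global_max(encoded, kernel):
    """Slide one kernel (width x 20 weights) along the sequence and return
    the maximum activation together with the start position where it occurs."""
    width = len(kernel)
    best_score, best_pos = float("-inf"), -1
    for start in range(len(encoded) - width + 1):
        score = sum(
            kernel[k][c] * encoded[start + k][c]
            for k in range(width)
            for c in range(len(AMINO_ACIDS))
        )
        if score > best_score:
            best_score, best_pos = score, start
    return best_score, best_pos


def protein_features(seq, motifs):
    """One pooled feature per kernel; kernels of different widths coexist.
    Assumption: each kernel is simply the one-hot encoding of a motif,
    standing in for weights that a real model would learn."""
    encoded = one_hot(seq)
    return [conv_global_max(encoded, one_hot(m)) for m in motifs]


# Hypothetical sequence and motifs of width 3 and 4.
features = protein_features("MKTAYIAKQRGHGSW", ["AYI", "GHGS"])
print(features)
```

Because the pooled value remembers where it came from, the position of the maximum can be inspected afterwards, which is the kind of examination of pooled convolution results the abstract refers to.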
Identifying drug–target interactions (DTIs) is important for drug discovery. However, searching all drug–target spaces poses a major bottleneck. Therefore, many deep learning models have recently been proposed to address this problem. However, the developers of these deep learning models have neglected interpretability in model construction, which is closely related to a model’s performance. We hypothesized that training a model to predict important regions on a protein sequence would increase DTI prediction performance and provide a more interpretable model. Consequently, we constructed a deep learning model, named Highlights on Target Sequences (HoTS), which predicts binding regions (BRs) between a protein sequence and a drug ligand, as well as DTIs between them. To train the model, we collected complexes of protein–ligand interactions and protein sequences of binding sites and pretrained the model to predict BRs for a given protein sequence–ligand pair via object detection employing transformers. After pretraining on BR prediction, we trained the model to predict DTIs from a compound token designed to assign attention to BRs. We confirmed that training the BR prediction model indeed improved the DTI prediction performance. The proposed HoTS model showed good performance in BR prediction on independent test datasets even though it does not use 3D structure information in its prediction. Furthermore, the HoTS model achieved the best performance in DTI prediction on test datasets. Additional analysis confirmed the appropriate attention for BRs and the importance of transformers in BR and DTI prediction. The source code is available on GitHub (https://github.com/GIST-CSBL/HoTS).
Identification of drug-target interactions plays a key role in drug discovery. However, identifying drug-target interactions via in vitro and in vivo experiments is laborious and time-consuming. Thus, predicting drug-target interactions with computational approaches is a good alternative. In recent studies, many feature-based and similarity-based machine learning approaches have shown promising results in drug-target interaction prediction. A previous study showed that accounting for the connectivity information of drug-drug and protein-protein interactions increases prediction performance through the concept of 'guilt-by-association'. However, an approach that considers only directly connected nodes often misses information that could be derived from distant nodes. Therefore, in this study, we derive global network topology information using a random walk with restart algorithm and apply it to the prediction model.
As a result, our prediction model demonstrates increased prediction performance compared to the 'guilt-by-association' approach (AUC 0.89 and 0.67 in the training and independent test, respectively). In addition, we show that features weighted by a random walk with restart yield better performance than the original features. We also confirmed that drugs and proteins with a high degree of connectivity in the interactome network yield better performance in our model.
The prediction models with features weighted by global network topology showed increased prediction performance in both training and testing compared to non-weighted models and the previous 'guilt-by-association' method. In conclusion, global network topology information on protein-protein and drug-drug interactions affects the prediction performance of drug-target interactions.
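A random walk with restart can be sketched in a few lines. The following is a generic illustration, not the authors' code: the four-node path network, the restart probability of 0.3, and the function names are all assumptions. It shows how a seed node passes a non-zero score to a node it is not directly connected to, which is the information a direct-neighbor approach misses.

```python
def column_normalize(adj):
    """Turn an adjacency matrix into a column-stochastic transition matrix."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change falls below tol."""
    n = len(adj)
    w = column_normalize(adj)
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical path network 0 - 1 - 2 - 3; node 3 is three steps from the seed.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adjacency, seed=0)
print([round(s, 4) for s in scores])
```

The resulting steady-state scores decrease with distance from the seed but never reach zero for reachable nodes, so they can serve as weights that reflect global rather than only local topology.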
Predicting drug-target interactions is important for the development of novel drugs and the repositioning of drugs. A number of methods predict such interactions based on drug and target protein similarity. Although these methods, such as the bipartite local model (BLM), show promise, they often categorize unknown interactions as negative interactions. Therefore, these methods are not ideal for finding potential drug-target interactions that have not yet been validated as positive interactions. Thus, here we propose a method that integrates machine learning techniques, such as self-training support vector machine (SVM) and BLM, to develop a self-training bipartite local model (SELF-BLM) that facilitates the identification of potential interactions. The method first categorizes unknown interactions into unlabeled interactions and negative interactions using a clustering method. Then, using the BLM method and self-training SVM, the unlabeled interactions are self-trained and final local classification models are constructed. When applied to four classes of proteins (enzymes, G-protein coupled receptors (GPCRs), ion channels, and nuclear receptors), SELF-BLM showed the best performance for predicting not only known interactions but also potential interactions in three protein classes compared to other related studies. The implemented software and supporting data are available at https://github.com/GIST-CSBL/SELF-BLM.
Drug-induced liver injury (DILI) is a critical issue in drug development because DILI causes failures in clinical trials and the withdrawal of approved drugs from the market. There have been many attempts to predict the risk of DILI based on in vivo and in silico identification of hepatotoxic compounds. In the current study, we propose an in silico model that predicts DILI using weighted molecular fingerprints.
In this study, we used an 881-bit molecular fingerprint as features describing the presence or absence of each substructure of a compound. Then, the Bayesian probability of each substructure was calculated and labeled (positive or negative for DILI), and a weighted fingerprint was determined from the ratio of DILI-positive to DILI-negative probability values. Using weighted fingerprint features, the prediction models were trained and evaluated with the Random Forest (RF) and Support Vector Machine (SVM) algorithms. The constructed models yielded accuracies of 73.8% and 72.6% and AUCs of 0.791 and 0.768 in cross-validation. In independent tests, the models achieved accuracies of 60.1% and 61.1% for RF and SVM, respectively. The results showed that weighted features helped increase the overall performance of the prediction models. The constructed models were further applied to natural compounds in herbs to identify DILI potential, and 13,996 unique herbal compounds were predicted as DILI-positive with the SVM model.
The prediction models with weighted features showed increased performance compared to non-weighted models. Moreover, we predicted the DILI potential of herbs with the best-performing model, and the prediction results suggest that many herbal compounds could have the potential to cause DILI. We can thus infer that taking natural products without detailed references about the relevant pathways may be dangerous. Considering the frequency of use of compounds in natural herbs and their increased application in drug development, DILI labeling would be very important.
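The weighting step can be sketched as follows. This is one plausible reading of the description, not the authors' procedure: it assumes the probability of each substructure bit is estimated separately among DILI-positive and DILI-negative compounds, that a pseudocount of 1 is added to avoid division by zero, and that each set bit is replaced by the positive-to-negative ratio. The toy 3-bit fingerprints stand in for the 881-bit ones, and all names are hypothetical.

```python
def bit_weights(fingerprints, labels, pseudocount=1.0):
    """Weight each bit by P(bit set | DILI-positive) / P(bit set | DILI-negative).
    The pseudocount is an assumption added to keep the ratio finite."""
    n_bits = len(fingerprints[0])
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    weights = []
    for b in range(n_bits):
        pos_hits = sum(fp[b] for fp, y in zip(fingerprints, labels) if y == 1)
        neg_hits = sum(fp[b] for fp, y in zip(fingerprints, labels) if y == 0)
        p_pos = (pos_hits + pseudocount) / (n_pos + 2 * pseudocount)
        p_neg = (neg_hits + pseudocount) / (n_neg + 2 * pseudocount)
        weights.append(p_pos / p_neg)
    return weights


def weight_fingerprint(fingerprint, weights):
    """Replace each set bit with its weight; unset bits stay at zero."""
    return [bit * w for bit, w in zip(fingerprint, weights)]


# Toy data: four compounds, three substructure bits, label 1 = DILI-positive.
toy_fingerprints = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
toy_labels = [1, 1, 0, 0]

weights = bit_weights(toy_fingerprints, toy_labels)
weighted = weight_fingerprint(toy_fingerprints[0], weights)
print(weights, weighted)
```

In this toy case, the first bit appears only in DILI-positive compounds and receives a weight above 1, while the second bit is more common among DILI-negative compounds and receives a weight below 1. The weighted vectors would then be passed to a classifier such as RF or SVM in place of the binary fingerprint.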
Adverse drug-drug interaction (DDI) is a major concern in polypharmacy because of its unexpected side effects, and it must be identified at an early stage of drug discovery and development. Many computational methods have been proposed for this purpose, but most require specific types of information or pay little attention to interpretation in terms of the underlying genes. We propose a deep learning-based framework for DDI prediction with drug-induced gene expression signatures so that the model can provide expression-level interpretability for DDIs. The model engineers dynamic drug features using a gating mechanism that mimics co-administration effects by imposing attention on genes. Also, each side effect is projected into a latent space through translating embedding. As a result, the model achieved an AUC of 0.889 and an AUPR of 0.915 in unseen interaction prediction, outperforming other state-of-the-art methods. Furthermore, it can predict potential DDIs with new compounds not used in training. In conclusion, using drug-induced gene expression signatures followed by gating and translating embedding can increase DDI prediction accuracy while providing model interpretability. The source code is available on GitHub (https://github.com/GIST-CSBL/DeSIDE-DDI).
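For readers unfamiliar with translating embeddings, the general scoring idea can be sketched as below. This illustrates the generic TransE-style formulation, in which a relation vector translates a head embedding toward a tail embedding. It is an assumption that a similar distance is used here, and the abstract does not specify how drugs and side effects map onto head, relation, and tail. The two-dimensional vectors are invented.

```python
import math


def translation_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail.
    A small distance means the triple is considered plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical 2-D embeddings: one drug feature vector, one side-effect
# translation vector, and two candidate partner-drug vectors.
drug_a = [1.0, 0.0]
side_effect = [0.0, 1.0]
drug_b_near = [1.0, 1.0]
drug_b_far = [4.0, 5.0]

near = translation_distance(drug_a, side_effect, drug_b_near)
far = translation_distance(drug_a, side_effect, drug_b_far)
print(near, far)
```

Under this formulation, the pair whose translated embedding lands closer to the partner receives the lower distance and would be ranked as the more likely interaction for that side effect.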
Drug repositioning, also known as drug repurposing, defines new indications for existing drugs and can be used as an alternative to drug development. In recent years, the accumulation of large volumes of information related to drugs and diseases has led to the development of various computational approaches for drug repositioning. Although herbal medicines have had a great impact on current drug discovery, there are still a large number of herbal compounds that have no definite indications.
In the present study, we constructed a computational model to predict the unknown pharmacological effects of herbal compounds using machine learning techniques. Based on the assumption that similar diseases can be treated with similar drugs, we used four categories of drug-drug similarity (chemical structure, side-effects, gene ontology, and targets) and three categories of disease-disease similarity (phenotypes, human phenotype ontology, and gene ontology). Then, associations between drugs and diseases were predicted using these similarity features. The prediction models were constructed using classification algorithms, including logistic regression, random forest, and support vector machine algorithms. Upon cross-validation, the random forest approach showed the best performance (AUC = 0.948) and also performed well in an external validation assessment using an unseen independent dataset (AUC = 0.828). Finally, the constructed model was applied to predict potential indications for existing drugs and herbal compounds. As a result, new indications for 20 existing drugs and 31 herbal compounds were predicted and validated using clinical trial data.
The predicted results were validated manually, confirming the performance and underlying mechanisms (for example, irinotecan as a treatment for neuroblastoma). Based on the predictions, herbal compounds were identified as drug candidates for related diseases and merit further development. The proposed prediction model can contribute to drug discovery by suggesting drug candidates from herbal compounds that have potential but have been little studied.
Genome-wide studies of DNA methylation across the epigenetic landscape provide insights into the heterogeneity of pluripotent embryonic stem cells (ESCs). Differentiating into embryonic somatic and germ cells, ESCs exhibit varying degrees of pluripotency, and epigenetic changes occurring in this process have emerged as important factors explaining stem cell pluripotency.
Here, using paired scBS-seq and scRNA-seq data of mice, we constructed a machine learning model that predicts degrees of pluripotency for mouse ESCs. Since the biological activities of non-CpG markers have yet to be clarified, we tested the predictive power of CpG and non-CpG markers, as well as a combination thereof, in the model. Through rigorous performance evaluation with both internal and external validation, we discovered that a model using both CpG and non-CpG markers predicted the pluripotency of ESCs with the highest prediction performance (0.956 AUC, external test). The prediction model consisted of 16 CpG and 33 non-CpG markers. The CpG and most of the non-CpG markers targeted depletions of methylation and were indicative of cell pluripotency, whereas only a few non-CpG markers reflected accumulations of methylation. Additionally, we confirmed that pluripotency differs between individual developmental stages, such as E3.5 and E6.5, as well as between induced mouse pluripotent stem cells (iPSCs) and somatic cells.
In this study, we investigated CpG and non-CpG methylation in relation to mouse stem cell pluripotency and developed a model thereon that successfully predicts the pluripotency of mouse ESCs.
Genomic and precision medicine research has afforded notable advances in human cancer treatment, yet applicability to other species remains uncertain. Through whole-exome and transcriptome analyses of 191 spontaneous canine mammary tumors (CMTs) that exhibit the archetypal features of human breast cancers, we found a striking resemblance of genomic characteristics including frequent PIK3CA mutations (43.1%), aberrations of the PI3K-Akt pathway (61.7%), and key genes involved in cancer initiation and progression. We also identified three gene expression-based CMT subtypes, one of which segregated with basal-like human breast cancer subtypes with activated epithelial-to-mesenchymal transition, low claudin expression, and unfavorable disease prognosis. A relative lack of ERBB2 amplification and Her2-enrichment subtype in CMT denoted species-specific molecular mechanisms. Taken together, our results elucidate cross-species oncogenic signatures for a better understanding of universal and context-dependent mechanisms in breast cancer development and provide a basis for precision diagnostics and therapeutics for domestic dogs.
Accurate genome-wide detection of somatic mutations with low variant allele frequency (VAF, <1%) has proven difficult, for which generalized, scalable methods are lacking. Herein, we describe a new computational method, called RePlow, that we developed to detect low-VAF somatic mutations based on simple, library-level replicates for next-generation sequencing on any platform. Through joint analysis of replicates, RePlow is able to remove prevailing background errors in next-generation sequencing analysis, facilitating remarkable improvement in the detection accuracy for low-VAF somatic mutations (up to ~99% reduction in false positives). The method is validated in independent cancer panel and brain tissue sequencing data. Our study suggests a new paradigm with which to exploit an overwhelming abundance of sequencing data for accurate variant detection.