In this paper, we propose a novel biclustering approach called BiClusO. Biclustering can be applied to various types of bipartite data such as gene-condition or gene-disease relations. For example, we applied BiClusO to bipartite relations between species and volatile organic compounds (VOCs). VOCs, which are emitted by different species, have huge environmental and ecological impacts. The biosynthesis of VOCs depends on different metabolic pathways which can be used to categorize the species. A previous study related to the KNApSAcK VOC database classified microorganisms based on their VOC profiles, which confirmed the consistency between VOC-based and pathogenicity-based classifications. However, due to limited data, classification of all species in terms of VOC profiles was not performed. In this study, we enriched our database with additional data collected from different online sources and journals. Then, by applying BiClusO to species-VOC relational data, we determined that VOC-based classification is consistent with taxonomy-based classification of the species. We also assessed the diversity of VOC pathways across different kingdoms of species.
Plants produce numerous types of secondary metabolites that are pharmacologically important in drug development for different diseases. Computational methods widely use the fingerprints of metabolites to understand their properties and similarities and to predict chemical reactions. In this work, we developed three different deep neural network (DNN) models to predict the antibacterial property of plant metabolites. We developed the first DNN model using the fingerprint set of metabolites as features. In the second DNN model, we searched for similarities among fingerprints using correlation and used one representative feature from each group of highly correlated fingerprints. In the third model, the fingerprints of metabolites were used to find clusters of structurally similar chemical compounds. From each cluster, a representative metabolite was selected and made part of the training dataset. The second model reduced the number of features, whereas the third model achieved better classification results for test data. In both cases, we applied a simple graph clustering method to cluster the corresponding network. The correlation‐based DNN model reduced some features while retaining performance close to that of the first DNN model. The third model improved classification results for test data by capturing wider variance within the training data using the graph clustering method. This third model is a somewhat novel approach and can be applied to build DNN models for other purposes.
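The feature reduction used in the second model can be illustrated with a minimal sketch. It assumes binary fingerprint columns, a Pearson correlation threshold of 0.9, and connected components as the "simple graph clustering" step; the threshold, the clustering rule, and the function names `pearson`, `group_correlated_features`, and `representatives` are all assumptions made for illustration, not details taken from the abstract.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def group_correlated_features(columns, threshold=0.9):
    """Group feature columns linked by |r| >= threshold (connected components).

    `columns` is a list of feature columns, one list of values per fingerprint bit.
    Returns a sorted list of groups, each a list of column indices.
    """
    n = len(columns)
    parent = list(range(n))  # union-find forest over feature indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Join two features whenever they are highly correlated (edge in the network).
    for i, j in combinations(range(n), 2):
        if abs(pearson(columns[i], columns[j])) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def representatives(groups):
    """Pick the lowest-index feature of each group as its representative."""
    return [group[0] for group in groups]


# Hypothetical toy fingerprint columns (rows are metabolites).
toy_columns = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],  # identical to column 0
    [0, 1, 0, 1, 0, 1],  # exact complement of column 0
    [1, 1, 0, 0, 1, 0],  # only weakly correlated with the others
]
toy_groups = group_correlated_features(toy_columns)
print(toy_groups, representatives(toy_groups))
```

Choosing the lowest-index column as representative is arbitrary; any member of a highly correlated group carries nearly the same information.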
Machine learning approaches are widely used to evaluate ligand activities of chemical compounds toward potential target proteins. In particular, exploration of highly selective ligands is important for the development of new drugs with higher safety. One difficulty in constructing a well‐performing model that predicts such ligand activity is the absence of data on true negative ligand‐protein interactions. In other words, in many cases we have access to plenty of information on ligands that bind to a specific protein, but little or no information showing that compounds do not bind to the proteins of interest. In this paper, we suggest an approach to comprehensively explore candidate ligands that specifically target proteins without using information on true negative interactions. The approach consists of four steps: 1) constructing a model that distinguishes ligands for the target proteins of interest from those targeting proteins that cause off‐target effects, by using a graph convolutional neural network (GCNN); 2) extracting feature vectors after the convolution/pooling processes and mapping their principal components in two dimensions; 3) specifying regions with higher density for the two ligand groups through kernel density estimation; and 4) investigating the distribution of compounds for exploration on the density map using the same classifier and decomposer. If compounds for exploration are located in higher‐density regions of ligand compounds, these compounds can be regarded as having relatively high binding affinity to the major target or off‐target proteins compared with other compounds. We applied the approach to the exploration of ligands for β‐site amyloid precursor protein (APP)‐cleaving enzyme 1 (BACE1), a major target for Alzheimer's disease (AD), with less off‐target effect toward cathepsin D.
We demonstrated that the density regions of BACE1 and cathepsin D ligands are well separated, and that a group of natural compounds, used as a target for exploration of new drug candidates, also has a significantly different distribution on the density map.
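Steps 3 and 4 of the approach can be sketched as follows. The sketch assumes the compounds have already been projected to two dimensions (steps 1 and 2 are not reproduced), uses an isotropic Gaussian kernel with a fixed bandwidth, and compares the two estimated densities at a query point. The coordinates, the bandwidth, and the names `gaussian_kde_2d` and `denser_group` are hypothetical.

```python
import math


def gaussian_kde_2d(points, bandwidth=0.5):
    """Return a density function estimated from 2-D points.

    Uses an isotropic Gaussian kernel; `bandwidth` is the kernel standard deviation.
    """
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2.0 * bandwidth ** 2))
        return norm * total

    return density


def denser_group(query, density_target, density_off_target):
    """Report which ligand group has the higher estimated density at `query`."""
    d_target = density_target(*query)
    d_off = density_off_target(*query)
    return "target" if d_target > d_off else "off_target"


# Hypothetical 2-D coordinates standing in for projected feature vectors.
target_ligands = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3)]
off_target_ligands = [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]

kde_target = gaussian_kde_2d(target_ligands)
kde_off = gaussian_kde_2d(off_target_ligands)

# A candidate compound that lands near the target-ligand cloud.
print(denser_group((0.1, 0.1), kde_target, kde_off))
```

A fixed bandwidth keeps the example short; in practice the bandwidth strongly shapes the density map and would need to be chosen with care.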
Deep learning approaches are widely used to search molecular structures for a candidate drug/material. The basic approach in drug/material candidate structure discovery is to embed a relationship that holds between a molecular structure and the physical property into a low‐dimensional vector space (chemical space) and search for a candidate molecular structure in that space based on a desired physical property value. Deep learning simplifies the structure search by efficiently modeling the structure of the chemical space with greater detail and lower dimensions than the original input space. In our research, we propose an effective method for molecular embedding learning that combines variational autoencoders (VAEs) and metric learning using any physical property. Our method enables molecular structures and physical properties to be embedded locally and continuously into VAEs’ latent space while maintaining the consistency of the relationship between the structural features and the physical properties of molecules to yield better predictions.
After complete sequencing of a number of genomes, the focus has now turned to proteomics. Advanced proteomics technologies such as two-hybrid assay, mass spectrometry etc. are producing huge data sets of protein-protein interactions which can be portrayed as networks, and one of the burning issues is to find protein complexes in such networks. The enormous size of protein-protein interaction (PPI) networks warrants development of efficient computational methods for extraction of significant complexes.
This paper presents an algorithm for detection of protein complexes in large interaction networks. In a PPI network, a node represents a protein and an edge represents an interaction. The input to the algorithm is the associated matrix of an interaction network and the outputs are protein complexes. The complexes are determined by way of finding clusters, i.e., the densely connected regions in the network. We also show and analyze some protein complexes generated by the proposed algorithm from typical PPI networks of Escherichia coli and Saccharomyces cerevisiae. A comparison between a PPI and a random network is also performed in the context of the proposed algorithm.
The proposed algorithm makes it possible to detect clusters of proteins in PPI networks which mostly represent molecular biological functional units. Therefore, protein complexes determined solely based on interaction data can help us to predict the functions of proteins, and they are also useful to understand and explain certain biological processes.
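The idea of finding densely connected regions can be illustrated with a small sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give. It is a generic greedy procedure, assumed here for illustration, that grows a cluster from a seed node while the edge density of the induced subgraph stays at or above a threshold. The graph, the threshold, and the names `density` and `grow_cluster` are hypothetical.

```python
def density(nodes, adj):
    """Edge density of the subgraph induced by `nodes`: edges / (n * (n - 1) / 2)."""
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if nodes[j] in adj[nodes[i]]
    )
    return edges / (n * (n - 1) / 2)


def grow_cluster(seed, adj, min_density=0.8):
    """Greedily add the neighbor that keeps the cluster densest.

    Stops when no neighbor can be added without the density
    falling below `min_density`.
    """
    cluster = {seed}
    while True:
        # Nodes adjacent to the cluster but not yet in it.
        candidates = {v for u in cluster for v in adj[u]} - cluster
        best, best_density = None, 0.0
        for v in sorted(candidates):  # sorted for deterministic tie-breaking
            d = density(cluster | {v}, adj)
            if d > best_density:
                best, best_density = v, d
        if best is None or best_density < min_density:
            return cluster
        cluster.add(best)


# Hypothetical toy network: a 4-protein clique (A-D) with a tail D-E-F.
toy_adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
print(sorted(grow_cluster("A", toy_adj)))
```

The adjacency-set representation is equivalent to the matrix input described above; a matrix row simply lists which nodes belong in each set.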
In order to obtain a better understanding of why some Jamu formulas can be used to treat a specific disease, we performed metabolomic studies of Jamu by taking into consideration the biologically active compounds existing in plants used as Jamu ingredients. A thorough integration of information from omics is expected to provide solid evidence‐based scientific rationales for the development of modern phytomedicines. This study focused on prediction of Jamu efficacy based on its component metabolites and also on identification of important metabolites related to each efficacy group. Initially, we compared the performance of Support Vector Machines and Random Forest in predicting Jamu efficacy with three different data pre‐processing approaches, namely no filtering, the Single Filtering algorithm, and a combination of the Single Filtering algorithm and feature selection using Regularized Random Forest. Both classifiers performed very well and, according to 5‐fold cross‐validation results, the mean accuracy of the Support Vector Machine with linear kernel was slightly better than that of Random Forest. It can be concluded that machine learning methods can successfully relate Jamu efficacy to metabolites. In addition, we extended our analysis by identifying important metabolites from the Random Forest model. The inTrees framework was used to extract the rules and to select important metabolites for each efficacy group. Overall, we identified 94 significant metabolites associated with 12 efficacy groups, and many of them were validated by published literature and the KNApSAcK Metabolite Activity database.
Mental disorders (MDs), including schizophrenia (SCZ) and bipolar disorder (BD), have attracted special attention from scientists due to their high prevalence and significantly debilitating clinical features. The diagnosis of MDs is still essentially based on clinical interviews, and intensive efforts to introduce biochemical based diagnostic methods have faced several difficulties for implementation in clinics, due to the complexity of MDs and the still limited knowledge about them. In this context, aiming to improve the knowledge of etiology and pathophysiology, many authors have reported several alterations in metabolites in MDs and other brain diseases. After collecting potentially all metabolite biomarkers reported up to now for SCZ and BD, we investigated here the proteins related to these metabolites in order to construct a protein-protein interaction (PPI) network associated with these diseases. We determined the statistically significant clusters in this PPI network and, based on these clusters, we identified 28 significant pathways for SCZ and BD that essentially compose three groups representing three major systems, namely stress response, energy and neuron systems. By characterizing new pathways with potential to innovate the diagnosis and treatment of psychiatric diseases, the present data may also contribute to the proposal of new interventions for the treatment of still unmet aspects in MDs.
In recent years, competition to improve the performance of organic photovoltaic cells (OPVs) and to develop organic semiconductors has intensified. In response, there has been an upsurge in the development of predictive models for OPV performance utilizing machine learning. Until now, chemistry researchers have used various approaches when creating OPV cells as well as developing new materials to improve power conversion efficiency (PCE). However, not many of those original approaches have been used for performance prediction due to the small sample size. In this study, we took a data-science approach in which we collected information from 115 scientific publications and constructed a dataset with the addition of some newly proposed variables to describe the structure and material composition of the active layer. This allows us to use 25 variables to describe OPVs in which the active layer forms a one- to three-level structure (single-layer, two-tiered and three-tiered). The proposed work also includes post-processing and measurement data that have not been addressed in existing studies. Using this data, several regression models with coefficients of determination exceeding 0.9 were constructed by supervised learning methods (random forest (RF), monmlp, etc.).
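For reference, the coefficient of determination used to assess such regression models is R² = 1 − SS_res / SS_tot. The short sketch below computes it for made-up PCE values; the numbers and the function name `r_squared` are illustrative only and are not taken from the study's dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted PCE values (percent).
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(r_squared(measured, predicted))
```

Here SS_res = 1.0 and SS_tot = 20.0, so the toy predictions give R² = 0.95.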
A database (DB) describing the relationships between species and their metabolites would be useful for metabolomics research, because metabolomics targets systematic analysis of enormous numbers of organic compounds with known or unknown structures. We constructed an extensive species-metabolite DB for plants, the KNApSAcK Core DB, which contains 101,500 species-metabolite relationships encompassing 20,741 species and 50,048 metabolites. We also developed a search engine within the KNApSAcK Core DB for use in metabolomics research, making it possible to search for metabolites based on an accurate mass, molecular formula, metabolite name or mass spectra in several ionization modes. We have also developed databases for retrieving metabolites related to plants used for a range of purposes. In our multifaceted plant usage DB, medicinal/edible plants are related to the geographic zones (GZs) where the plants are used, their biological activities, and formulae of Japanese and Indonesian traditional medicines (Kampo and Jamu, respectively). These data are connected to the species-metabolites relationship DB within the KNApSAcK Core DB, keyed via the species names. All databases can be accessed via the website http://kanaya.naist.jp/KNApSAcK_Family/. KNApSAcK WorldMap DB comprises 41,548 GZ-plant pair entries, including 222 GZs and 15,240 medicinal/edible plants. The KAMPO DB consists of 336 formulae encompassing 278 medicinal plants; the JAMU DB consists of 5,310 formulae encompassing 550 medicinal plants. The Biological Activity DB consists of 2,418 biological activities and 33,706 pairwise relationships between medicinal plants and their biological activities. Current statistics of the binary relationships between individual databases were characterized by the degree distribution analysis, leading to a prediction of at least 1,060,000 metabolites within all plants.
In the future, the study of metabolomics will need to take this huge number of metabolites into consideration.
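The degree distribution of a binary relationship can be computed as in the sketch below. It counts, for each degree k, how many species (or metabolites) take part in exactly k relationships. The pairs and the function name `degree_distribution` are hypothetical, and the extrapolation to a total metabolite count reported above is not reproduced.

```python
from collections import Counter


def degree_distribution(pairs, side=0):
    """Degree distribution of one side of a bipartite relation.

    `pairs` is an iterable of (species, metabolite) tuples; `side` is 0 for
    species and 1 for metabolites. Returns {degree: number of nodes with it}.
    """
    unique_pairs = set(pairs)  # ignore duplicate records
    degree = Counter(pair[side] for pair in unique_pairs)
    return dict(sorted(Counter(degree.values()).items()))


# Hypothetical species-metabolite pairs, including one duplicate record.
toy_pairs = [
    ("sp1", "m1"), ("sp1", "m2"), ("sp1", "m3"),
    ("sp2", "m1"), ("sp2", "m1"),
    ("sp3", "m2"), ("sp3", "m4"),
    ("sp4", "m1"),
]
print(degree_distribution(toy_pairs, side=0))  # species side
print(degree_distribution(toy_pairs, side=1))  # metabolite side
```

Deduplicating the pairs first matters because repeated records of the same relationship would otherwise inflate the degrees.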