Abstract
Motivation
Predicting new drug–target interactions (DTIs) is an important step in new drug development, in understanding drug side effects and in drug repositioning. Heterogeneous data sources can provide comprehensive information and different perspectives for drug–target interaction prediction. Consequently, many computational methods rely on heterogeneous networks. Most of them use graph-related algorithms to characterize nodes in heterogeneous networks for predicting new DTIs. However, these methods can only make predictions within known heterogeneous network datasets and cannot support the prediction of new chemical entities outside the heterogeneous network, which hinders further drug discovery and development.
Results
To solve this problem, we propose a multi-modal DTI prediction model named ‘MultiDTI’, which uses our joint learning framework based on heterogeneous networks. It combines the interaction or association information of the heterogeneous network with the drug/target sequence information, and maps the drug, target, side effect and disease nodes in the heterogeneous network into a common space. In this way, ‘MultiDTI’ can map a new chemical entity into this learned common space based on its chemical structure, bridging the gap between new chemical entities and the known heterogeneous network. Our model has strong predictive performance: with 10-fold cross validation, the area under the receiver operating characteristic curve is 0.961 and the area under the precision recall curve is 0.947. In addition, some predicted new DTIs have been confirmed by the ChEMBL database. Our results indicate that ‘MultiDTI’ is a powerful and practical tool for predicting new DTIs, which can promote drug discovery or drug repositioning.
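As an illustration of the headline metric, the area under the receiver operating characteristic curve can be computed from predicted scores with the rank-sum (Mann-Whitney U) formulation. The sketch below is a minimal, dependency-free version; the function name `auroc` and the toy labels and scores are assumptions for illustration and are not taken from the MultiDTI code.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 (1 = known interaction); scores: predicted scores.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of identical scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy example: four candidate drug-target pairs (hypothetical values).
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In a 10-fold cross validation setting, a value like this would be computed on each held-out fold and then averaged.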
Availability and implementation
Python code and the dataset are available at https://github.com/Deshan-Zhou/MultiDTI/.
Supplementary information
Supplementary data are available at Bioinformatics online.
This paper investigates the problem of forecasting stock closing prices. Building on existing two-stage fusion models in the literature, two new clustering-based prediction models are proposed, in which the k-means clustering method is adopted to cluster several common technical indicators. In addition, ensemble learning is applied to improve the prediction accuracy. Finally, a hybrid prediction model that combines k-means clustering and ensemble learning is proposed. Experimental results on a number of Chinese stocks demonstrate that the hybrid prediction model obtains the best stock price prediction accuracy, and that k-means clustering on the stock technical indicators can further enhance the prediction accuracy of ensemble learning.
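The clustering step can be sketched with a plain k-means (Lloyd's algorithm) over vectors of technical indicators. This is a minimal illustration under stated assumptions: the abstract does not specify which indicators are used, how the centroids are initialized, or whether rows or indicator columns are clustered, so the toy data, the deterministic "first k points" initialization and the function name `kmeans` are all hypothetical.

```python
def kmeans(points, k, max_iters=100):
    """Lloyd's k-means on a list of equal-length numeric tuples.

    Returns (assignments, centroids). Initial centroids are simply the
    first k points (an assumption made here to keep the sketch deterministic).
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [None] * len(points)

    def squared_distance(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical two-indicator vectors for four trading days.
days = [(0, 0), (10, 10), (0, 1), (10, 11)]
labels, centers = kmeans(days, k=2)
print(labels)   # [0, 1, 0, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

One plausible use, again an assumption rather than the paper's stated design, is to feed the cluster label of each day to the ensemble learner as an additional feature.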
As data can be acquired in an ever-increasing number of ways, multi-view data is becoming more and more available. Considering the high cost of labeling data in many machine learning applications, we focus on the multi-view semi-supervised classification problem. To address this problem, we propose a method called joint consensus and diversity for multi-view semi-supervised classification, which simultaneously learns a common label matrix for all training samples and view-specific classifiers. A novel classification loss named probabilistic square hinge loss is proposed, which avoids the incorrect penalization problem and characterizes the contribution of each training sample according to its uncertainty. The power mean is introduced to combine the losses of different views; it contains the auto-weighted strategy as a special case and distinguishes the importance of the various views. To solve the non-convex minimization problem, we prove that its solution can be obtained from another problem with introduced variables, and we develop an efficient optimization algorithm with proven convergence. Extensive experimental results on nine datasets demonstrate the effectiveness of the proposed algorithm.
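To make the role of the power mean concrete, the sketch below computes the generic power mean of a set of positive per-view losses for several exponents. It shows only the textbook definition, M_p(x) = ((1/n) * sum(x_i^p))^(1/p) with the geometric mean as the limit p -> 0; the exact way the paper plugs view losses into it is not given in the abstract, and the function name and loss values are assumptions.

```python
import math


def power_mean(values, p):
    """Power (generalized) mean of strictly positive values with exponent p.

    p = 1 gives the arithmetic mean, p = -1 the harmonic mean, and
    p = 0 is treated as the limiting case, the geometric mean.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical losses of two views of the same training set.
view_losses = [1.0, 4.0]
for p in (-1, 0, 1, 2):
    print(p, power_mean(view_losses, p))
```

Smaller exponents pull the combined value toward the view with the smallest loss, while larger exponents emphasize the view with the largest loss, which is one way a single exponent can change how strongly each view counts.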
Medical data contains multiple records of patient data that are important for subsequent treatment and future research. However, it needs to be stored and shared securely to protect patient privacy. Blockchain is widely used in the management of healthcare data because of its decentralized and tamper-proof features. To study the development of blockchain in healthcare, this paper evaluates it from various perspectives. We analyze blockchain-based approaches in three application scenarios: blockchain-based electronic medical record sharing, blockchain and the Internet of Medical Things, and blockchain-based federated learning. The results show that blockchain and smart contracts have a natural advantage in the field of medical data because they are tamper-proof and traceable. Finally, the challenges and future directions of blockchain in healthcare are discussed, which can help drive the field forward.
Background
Given the heterogeneity of tumors, predicting the drug response of each individual is a key issue in precision medicine. The accumulation of various types of drug informatics and multi-omics data facilitates the development of efficient models for drug response prediction. However, the selection of high-quality data sources and the design of suitable methods remain a challenge.
Methods
In this paper, we design NeRD, a multidimensional data integration model based on the PRISM drug response database, to predict the cellular response to drugs. Four feature extractors, namely the drug structure extractor (DSE), molecular fingerprint extractor (MFE), miRNA expression extractor (mEE), and copy number extractor (CNE), are designed for different types and dimensions of data. A fully connected network is used to fuse all features and make predictions.
Results
Experimental results demonstrate the effective integration of the global and local structural features of drugs, as well as the features of cell lines from different omics data. For all metrics tested on the PRISM database, NeRD surpassed previous approaches. We also verified that NeRD's predictions on new samples are highly reliable. Moreover, unlike other algorithms, NeRD maintained stable performance when the amount of training data was reduced.
Conclusions
NeRD's feature fusion provides a new idea for drug response prediction, which is of great significance for precise cancer treatment.
Keywords: Precision medicine, Drug response, Data integration, Deep learning
The coronavirus disease (COVID-19) has led to a rush to repurpose existing drugs, although the underlying evidence base is of variable quality. Drug repurposing is a technique that takes advantage of existing drugs or drug combinations by exploring them in a new medical scenario. It therefore plays a vital role in accelerating the pre-clinical process of designing novel drugs, saving time and cost compared to traditional de novo drug discovery. Since drug repurposing depends on massive observed data from existing drugs and diseases, the tremendous growth of publicly available large-scale machine learning methods enables state-of-the-art applications of data science to disease signaling, medicine and therapeutics, and to identifying targets with the least error. In this article, we introduce guidelines on strategies and options for using machine learning approaches to accelerate drug repurposing. We discuss how to employ machine learning methods in studying precision medicine and, as an example, how machine learning approaches can accelerate COVID-19 drug repurposing by developing traditional Chinese medicine therapy. This article provides a strong rationale for employing machine learning methods for drug repurposing, including in the fight against the COVID-19 pandemic.
De novo genome assembly using next-generation sequencing (NGS) short reads is increasing rapidly; however, several major challenges remain to be overcome for it to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.
To overcome these challenges, we have developed its successor, SOAPdenovo2, which has a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.
Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and 50-fold longer than in the first published version. The genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.
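For readers unfamiliar with the N50 statistic reported above, it is the length L such that contigs (or scaffolds) of length at least L together cover at least half of the total assembly length. A minimal sketch follows; the function name `n50` and the example lengths are illustrative assumptions and are unrelated to the YH assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # integer-safe test for running >= total / 2
            return length


# Toy assembly with nine contigs (total length 54): 10 + 9 + 8 = 27 reaches half.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Because the statistic depends only on the length distribution, a longer N50 indicates better contiguity but says nothing on its own about assembly correctness.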
The Type II clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated proteins (Cas) system is a powerful genome editing technology that is increasingly popular in gene function analysis. In CRISPR/Cas, a guide RNA directs the Cas nuclease to the target site to perform DNA modification.
The performance of CRISPR/Cas depends on a well-designed single guide RNA (sgRNA). However, the off-target effect of sgRNA leads to undesired mutations in the genome and limits the use of CRISPR/Cas. Here, we present OffScan, a universal and fast CRISPR off-target detection tool.
OffScan is not limited by the number of mismatches and allows a custom protospacer-adjacent motif (PAM), the motif recognized at the target site by the Cas protein. In addition, OffScan adopts the FM-index, which improves query speed and reduces memory consumption.
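The FM-index mentioned above supports fast substring queries through backward search over the Burrows-Wheeler transform. The sketch below shows exact-match lookup only, on a toy sequence; it is not OffScan's implementation, it omits mismatch handling and PAM filtering, and it uses a naive suffix sort and full occurrence tables that would be replaced by compressed structures on a real genome. All names are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_col[c] = number of characters in text strictly smaller than c.
    first_col, total = {}, 0
    for c in alphabet:
        first_col[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first_col, occ


def locate(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    suffix_array, first_col, occ = index
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(pattern):   # backward search: last character first
        if ch not in first_col:
            return []
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("ACGTACGGACG")
print(locate("ACG", index))  # [0, 4, 8]
```

Each query character costs a constant number of table lookups, so search time grows with the pattern length rather than the genome length, which is the property that makes this index attractive for scanning many candidate sites.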
Many genome features that could help unravel the often complex post-speciation evolution of closely related species are obscured because they lie in chromosomal regions, including centromeres and repeat regions, that are difficult to characterize accurately using standard genome analysis methods.
Here, we analyze the genome evolution and diversification of two recently diverged sister cotton species based on nanopore long-read sequence assemblies and Hi-C 3D genome data. Although D genomes are conserved in gene content, they have diversified in gene order, gene structure, gene family diversification, 3D chromatin structure, long-range regulation, and stress-related traits. Inversions predominate among D genome rearrangements. Our results support roles for 5mC and 6mA in gene activation, and 3D chromatin analysis showed that diversification in proximal-vs-distal regulatory-region interactions shapes the regulation of defense-related gene expression. Using a newly developed method, we accurately positioned cotton centromeres and found that these regions have undergone markedly more rapid evolution than chromosome arms. We also discovered a cotton-specific LTR class that clarifies evolutionary trajectories among diverse cotton species and identified genetic networks underlying the Verticillium tolerance of Gossypium thurberi (e.g., SA signaling) and the salt-stress tolerance of Gossypium davidsonii (e.g., ethylene biosynthesis). Finally, overexpression of G. thurberi genes in upland cotton demonstrated how wild cottons can be exploited for crop improvement.
Our study substantially deepens understanding of how centromeres have developed and have shaped the divergence among closely related cotton species, and it reveals genes and 3D genome structures that can guide basic investigations and applied efforts to improve crops.
With advances in next-generation sequencing (NGS) technologies, a large number of high-throughput genomics datasets of multiple types are available. A great challenge in exploring cancer progression is to identify the driver genes among the variant genes by analyzing and integrating multiple types of genomics data. Breast cancer is known to be a heterogeneous disease. The identification of subtype-specific driver genes is critical to guide the diagnosis, prognosis assessment and treatment of breast cancer. We developed an integrated framework based on gene expression profiles and copy number variation (CNV) data to identify breast cancer subtype-specific driver genes. In this framework, we employed a statistical machine-learning method to select gene subsets and used a module-network analysis method to identify potential candidate driver genes. The final subtype-specific driver genes were obtained by pairwise comparison between subtypes. To validate the specificity of the driver genes, their gene expression data were used to classify the patient samples with 10-fold cross validation, and enrichment analysis was also conducted on the identified driver genes. The experimental results show that the proposed integrative method can identify potential driver genes and that a classifier built on these genes achieved better performance than one built on genes identified by other methods.