Abstract
Motivation
Sequence-order independent structural comparison, also called structural alignment, of small ligand molecules is often needed for computer-aided virtual drug screening. Although many ligand structure alignment programs have been proposed, most of them build alignments from rigid-body shape comparison, which can neither provide atom-specific alignment information nor allow structural variation; both abilities are critical to efficient high-throughput virtual screening.
Results
We propose a novel ligand comparison algorithm, LS-align, to generate fast and accurate atom-level structural alignments of ligand molecules through an iterative heuristic search of a target function that combines inter-atom distance with mass and chemical bond comparisons. LS-align contains two modules, Rigid-LS-align and Flexi-LS-align, designed for rigid-body and flexible alignments, respectively, and uses a ligand-size independent, statistics-based scoring function to evaluate the similarity of ligand molecules relative to random ligand pairs. Large-scale benchmark tests were performed on prioritizing chemical ligands of 102 protein targets involving 1 415 871 candidate compounds from the DUD-E (Database of Useful Decoys: Enhanced) database, where LS-align achieves an average enrichment factor (EF) of 22.0 at the 1% cutoff and an AUC score of 0.75, both significantly higher than those of other state-of-the-art methods. Detailed data analyses show that the improved performance is mainly attributable to the design of the target function, which combines structural and chemical information to enhance the sensitivity of recognizing subtle differences between ligand molecules, and to the introduction of structural flexibility, which helps capture the conformational changes induced by ligand-receptor binding interactions. These data demonstrate a new avenue to improve virtual screening efficiency through the development of sensitive ligand structural alignments.
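The two screening metrics reported above can be illustrated with a minimal sketch. It assumes the usual definitions: EF at a given fraction is the hit rate in the top-ranked slice divided by the hit rate in the whole library, and AUC is the probability that a randomly chosen active outscores a randomly chosen decoy. The function names and the rounding of the slice size are illustrative choices, not taken from LS-align.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """EF at the top `fraction` of a ranked list (higher score = more similar).

    labels: 1 for an active compound, 0 for a decoy.
    """
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Size of the top slice; rounding and the floor of 1 are assumptions.
    n_top = max(1, int(round(n_total * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    # Hit rate in the top slice relative to the hit rate in the full library.
    return (hits_top / n_top) / (n_actives / n_total)


def roc_auc(scores, labels):
    """AUC as the fraction of (active, decoy) pairs ranked correctly.

    Ties count as half a win. Quadratic in library size, so only for small demos.
    """
    actives = [s for s, l in zip(scores, labels) if l == 1]
    decoys = [s for s, l in zip(scores, labels) if l == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))
```

Under these definitions, a random ranking gives an EF near 1 and an AUC near 0.5, which is the baseline against which values such as 22.0 and 0.75 are read.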
Availability and implementation
http://zhanglab.ccmb.med.umich.edu/LS-align/
Supplementary information
Supplementary data are available at Bioinformatics online.
•A novel, hybrid artificial neural network – EMOS postprocessing scheme is proposed.
•The hybrid scheme allows for simultaneous postprocessing of precipitation forecasts for multiple seasons and lead times.
•The scheme outperforms existing EMOS schemes as judged by the skills of postprocessed probabilistic precipitation forecasts.
Many present-day statistical schemes for postprocessing weather forecasts, in particular precipitation forecasts, rely on calibration using prescribed statistical models to relate forecast statistics to distributional parameters. The efficacy of such schemes is often constrained not only by the prescribed predictor-predictand relation, but also by arbitrary choices of temporal window and lead time range for training. To address these limitations, we propose an end-to-end, computationally efficient hybrid postprocessing scheme capable of producing full predictive distributions of precipitation accumulation without explicit stratification of forecast-observation pairs by forecast lead time and season. The proposed framework uses the censored, shifted gamma distribution (CSGD) as the predictive distribution but estimates the distributional parameters of the CSGD with an artificial neural network (ANN) through a unified approach. This approach, referred to as ANN-CSGD, allows for simultaneous estimation of distributional parameters over multiple lead times and seasons in a single model by incorporating the latter variables as predictors to the ANN. We test the proposed ANN-CSGD model for postprocessing of ensemble mean forecasts of 24-h precipitation totals over selected river basins in California, at one- to seven-day lead times, from the Global Ensemble Forecast System (GEFS). The probabilistic quantitative precipitation forecasts (PQPFs) from the ANN-CSGD are more skillful overall than those from the benchmark CSGD and the Mixed-type meta-Gaussian distribution (MMGD) models. The ANN-CSGD PQPFs substantially improve on those from CSGD in predicting the probability of precipitation (PoP) and are also much sharper and more reliable at higher precipitation thresholds.
We demonstrate how the hybrid approach, by using the entire available training data and its modified formulation, efficiently represents interactions between GEFS forecasts and season/lead times, thus leading to enhanced predictive performance.
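To make the role of the censored, shifted gamma distribution concrete, the following is a minimal Monte Carlo sketch. It assumes one common convention: a Gamma(shape, scale) variate is shifted by a (typically negative) amount and then left-censored at zero, so the point mass at zero represents dry days. The parameter values, function names, and the sampling approach are illustrative assumptions; in the scheme described above the three parameters would come from the ANN rather than being fixed by hand.

```python
import random


def sample_csgd(shape, scale, shift, rng):
    """Draw one precipitation amount from a censored, shifted gamma distribution.

    Assumed convention: gamma variate + shift, then censoring at zero.
    A negative shift moves probability mass onto exactly zero (no precipitation).
    """
    return max(0.0, rng.gammavariate(shape, scale) + shift)


def probability_of_precipitation(shape, scale, shift, n=20000, seed=0):
    """Estimate PoP as the fraction of sampled amounts that are strictly positive."""
    rng = random.Random(seed)
    wet = sum(1 for _ in range(n) if sample_csgd(shape, scale, shift, rng) > 0.0)
    return wet / n


if __name__ == "__main__":
    # Hypothetical parameter sets; a more negative shift lowers the PoP.
    for shift in (0.0, -1.0, -2.0):
        print(shift, probability_of_precipitation(1.0, 1.0, shift))
```

With shape 1 and scale 1 the uncensored variate is exponential, so the estimated PoP for a shift of -1 should land close to exp(-1), which gives a quick sanity check on the convention.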
Cell-penetrating peptides (CPPs) are short permeable peptides that have emerged as tools for delivering therapeutic agents, including genetic materials and macromolecules, into cells. Recently, CPPs have become a hotspot of life science research and have paved a new way of treating disease without harming cell viability, owing to their nontoxic character. Therefore, the correct identification of CPPs will provide hints for medical applications. Given the shortcomings of traditional experimental identification of CPPs, an intelligent predictor is urgently needed to accurately identify CPPs among large numbers of uncharacterized sequences. We develop a novel computational method, called TargetCPP, to discriminate CPPs from non-CPPs with improved accuracy. In TargetCPP, the peptide sequences are first formulated with four distinct encoding methods, i.e., composite protein sequence representation, composition transition and distribution, split amino acid composition, and information theory features. These feature vectors are then fused, and the minimum redundancy and maximum relevancy feature selection method is applied to choose an optimal subset of features. Finally, the predictive model is learned through different classification algorithms on the optimized features. Among these classifiers, the gradient boost decision tree algorithm achieved excellent performance throughout the experiments. Notably, the TargetCPP tool attained high prediction accuracies of 93.54% and 88.28% using the jackknife and independent tests, respectively. Empirical outcomes demonstrate the superiority and potency of the proposed bioinformatics method over state-of-the-art methods. It is anticipated that the outcomes of this study will provide a strong background for large-scale prediction of CPPs and instructive guidance in clinical therapy and medical applications.
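Of the four encodings listed above, split amino acid composition is the simplest to sketch. The version below assumes the peptide is cut into an N-terminal segment, a middle segment, and a C-terminal segment, and that the 20-residue composition of each is concatenated into a 60-dimensional vector. The segment lengths (5 residues at each end) and the function names are hypothetical and may differ from what TargetCPP uses.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def composition(segment):
    """Fraction of each standard amino acid in a segment (zeros if empty).

    Non-standard residues are ignored in the counts, so the fractions
    sum to 1 only when every residue is one of the 20 standard ones.
    """
    if not segment:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / len(segment) for aa in AMINO_ACIDS]


def split_aac(sequence, n_term=5, c_term=5):
    """Concatenate the compositions of the N-terminal, middle, and C-terminal parts.

    n_term and c_term are assumed segment lengths, not values from the paper.
    For very short peptides the terminal segments overlap and the middle is empty.
    """
    seq = sequence.upper()
    head = seq[:n_term]
    tail = seq[-c_term:] if c_term > 0 else ""
    middle = seq[n_term:len(seq) - c_term]
    return composition(head) + composition(middle) + composition(tail)
```

Splitting the sequence this way lets a classifier see terminal residue preferences that a single whole-sequence composition would average away.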
Protein–protein interactions (PPIs) are fundamental to many biological processes. The coevolution-based prediction of interacting residues has made great strides for protein complexes that are known to interact. A multiple sequence alignment (MSA) is the basis of coevolution analysis, and MSA construction has recently made significant progress for protein monomer sequence analysis. However, no standard or efficient pipelines are available for collecting sensitive protein complex MSAs (cpxMSAs). How to generate a cpxMSA is one of the most challenging problems of sequence coevolution analysis. Although several methods have been developed to address this problem, no standalone program exists. Furthermore, the number of built-in properties is limited; hence, it is often difficult for users to analyze sequence coevolution according to their desired cpxMSA. In this article, we developed a novel cpxMSA approach (cpxDeepMSA). We used different protein monomer databases and incorporated three strategies (genomic distance, phylogeny information, and the STRING interaction network) to join the monomer MSA results of protein complexes, which prevents the cpxMSA construction from failing when a single method fails to join the two monomer MSAs. We anticipate that the cpxDeepMSA algorithm will become a useful high-throughput tool in protein complex structure prediction, inter-protein residue-residue contact prediction, and biological sequence coevolution analysis.
Just like PTM or PTLM (post-translational modification) in proteins, PTCM (post-transcriptional modification) in RNA plays very important roles in biological processes. Occurring at adenine (A) with the genetic code motif (GAC), N(6)-methyladenosine (m(6)A) is one of the most common and abundant PTCMs in RNA found in viruses and most eukaryotes. Given an uncharacterized RNA sequence containing many GAC motifs, which of them can be methylated, and which cannot? Addressing this problem is important for both basic research and drug development. Particularly with the avalanche of RNA sequences generated in the postgenomic age, there is high demand for computational methods that identify N(6)-methyladenosine sites in RNA in a timely manner. Here we propose a new predictor called pRNAm-PC, in which RNA sequence samples are expressed by a novel mode of pseudo dinucleotide composition (PseDNC) whose components are derived from a physical-chemical matrix via a series of auto-covariance and cross-covariance transformations. A rigorous jackknife test showed that, in comparison with the existing predictor for the same purpose, pRNAm-PC achieved remarkably higher success rates in both overall accuracy and stability, indicating that the new predictor will become a useful high-throughput tool for identifying methylation sites in RNA, and that the novel approach can also be used to study many other RNA-related problems and conduct genome analysis. A user-friendly Web server for pRNAm-PC has been established at http://www.jci-bioinfo.cn/pRNAm-PC, by which users can easily get their desired results without needing to go through the mathematical details.
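The auto-covariance transformation mentioned above can be sketched for a single property. The example maps each overlapping dinucleotide of an RNA sequence to a numeric value and computes the covariance between positions separated by a fixed lag. The property table here is a made-up placeholder, not the paper's physical-chemical matrix, and the normalization by the number of pairs is one common choice; cross-covariance between two different properties is omitted for brevity.

```python
# Hypothetical per-nucleotide weights, used only to build a toy property table.
_TOY_WEIGHT = {"A": 0.1, "C": 0.4, "G": 0.7, "U": 1.0}
# Placeholder values for all 16 dinucleotides; NOT real physical-chemical data.
TOY_PROPERTY = {
    a + b: _TOY_WEIGHT[a] + 0.5 * _TOY_WEIGHT[b] for a in "ACGU" for b in "ACGU"
}


def dinucleotides(seq):
    """Overlapping dinucleotides of a sequence, e.g. GGACU -> GG, GA, AC, CU."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def auto_covariance(values, lag):
    """Mean product of centered values that are `lag` positions apart."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


def pseudo_components(seq, prop, max_lag):
    """Auto-covariance of one dinucleotide property for lags 1..max_lag."""
    values = [prop[d] for d in dinucleotides(seq.upper())]
    return [auto_covariance(values, lag) for lag in range(1, max_lag + 1)]
```

Each lag contributes one number per property, so a sequence of any length is turned into a fixed-length vector that still carries some sequence-order information.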
Predicting protein–protein interaction (PPI) sites from protein sequences is still a challenging task in computational biology. PPI site prediction suffers from severe class imbalance, which decreases the overall performance of traditional statistical machine-learning-based classifiers, such as SVM and random forests. In this study, an ensemble of SVM and sample-weighted random forests (SSWRF) was proposed to deal with class imbalance. An SVM classifier was trained and applied to estimate the weights of training samples. Then, the training samples with estimated weights were used to train a sample-weighted random forest (SWRF). In addition, a lower-dimensional feature representation method, which consists of evolutionary conservation, hydrophobic property, and solvent accessibility features derived from a target residue and its neighbors, was developed to improve the discriminative capability for PPI site prediction. The analysis of feature importance shows that the proposed feature representation method is effective for predicting PPI sites. The proposed SSWRF achieved 22.4% and 35.1% in MCC and F-measure, respectively, on the independent validation dataset Dtestset72, and achieved 15.2% and 36.5% in MCC and F-measure, respectively, on PDBtestset164. Computational comparisons with existing PPI site predictors on benchmark datasets demonstrated that the proposed SSWRF is effective for PPI site prediction and outperforms the most recently released state-of-the-art sequence-based method (i.e., LORIS). The benchmark datasets used in this study and the source codes of the proposed method are publicly available at http://csbio.njust.edu.cn/bioinf/SSWRF for academic use.
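The MCC and F-measure values reported above are the metrics of choice under class imbalance because plain accuracy can look good for a classifier that never predicts the minority class. A minimal sketch using the standard confusion-matrix definitions follows; the counts in the usage note are hypothetical.

```python
import math


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (0.0 if any marginal count is zero)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with 100 interface residues among 1000, a predictor that labels everything as non-interface has an accuracy of 0.9 but an MCC and F-measure of 0, which is why the two latter metrics are reported for PPI site prediction.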
DNA‐binding proteins play essential roles in many molecular functions and gene regulation. Therefore, it is highly desirable to develop effective computational techniques for detecting DNA‐binding proteins. In this paper, we propose a new method, iDBP‐DEP, which performs DNA‐binding prediction by using discriminative features derived, with feature selection, from multi‐view feature sources including evolutionary profile, dipeptide composition, and physicochemical properties. We evaluated iDBP‐DEP on two benchmark datasets, i.e., PDB1075 and PDB594, by a rigorous jackknife test. Compared with the state‐of‐the‐art sequence‐based DNA‐binding predictors, the proposed iDBP‐DEP achieved 1.8 % and 3.0 % improvements in accuracy (Acc) and Matthews correlation coefficient (MCC), respectively, on the PDB1075 dataset, and 7.4 % and 14.8 % improvements in Acc and MCC, respectively, on PDB594. The independent validation test with PDB186 shows that the proposed method achieved the best performance on Acc (80.1 %) and MCC (0.684), which further demonstrates the robustness of iDBP‐DEP for the detection of DNA‐binding proteins. Datasets and codes used in this study are freely available at https://githup.com/Zll‐codeside/iDBP‐DEP.
The topology of protein folds can be specified by inter-residue contact-maps, and accurate contact-map prediction can help ab initio structure folding. We developed TripletRes to deduce protein contact-maps from discretized distance profiles by end-to-end training of deep residual neural-networks. Compared to previous approaches, the major advantage of TripletRes is its ability to learn and directly fuse a triplet of coevolutionary matrices extracted from the whole-genome and metagenome databases and therefore minimize the information loss during the course of contact model training. TripletRes was tested on a large set of 245 non-homologous proteins from CASP 11&12 and CAMEO experiments and outperformed other top methods from CASP12 by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in the top-L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, TripletRes achieved the highest precision (71.6%) for the top-L/5 long-range contact predictions. It was also shown that a simple re-training of the TripletRes model with more proteins can lead to further improvement, with precisions comparable to state-of-the-art methods developed after CASP13. These results demonstrate a novel, efficient approach to extend the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map predictions starting from primary sequences, which are critical for constructing 3D structures of proteins that lack homologous templates in the PDB library.
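The top-L and top-L/5 long-range precision figures quoted above can be computed as in the sketch below. It assumes the convention commonly used in CASP assessments, where a pair is long-range if its sequence separation is at least 24 residues; the abstract itself does not state the threshold, so `min_sep` is exposed as a parameter. The function name and data layout are illustrative.

```python
def top_contact_precision(pred, native, seq_len, divisor=1, min_sep=24):
    """Precision of the top L/divisor predicted long-range contacts.

    pred:    dict mapping residue pairs (i, j), i < j, to contact probabilities
    native:  set of pairs (i, j), i < j, that are in contact in the true structure
    seq_len: sequence length L
    min_sep: minimum |i - j| for a long-range pair (24 is an assumed convention)
    """
    candidates = [
        (prob, pair) for pair, prob in pred.items()
        if abs(pair[1] - pair[0]) >= min_sep
    ]
    # Highest-probability pairs first.
    candidates.sort(reverse=True)
    top = candidates[: max(1, seq_len // divisor)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in native)
    return hits / len(top)
```

Calling the function with `divisor=1` gives the top-L precision and with `divisor=5` the top-L/5 precision; when fewer long-range pairs are predicted than requested, this sketch divides by the number actually selected.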
The rational design of highly efficient, low-cost, and durable electrocatalysts to replace platinum-based electrodes for the oxygen reduction reaction (ORR) is highly desirable. Although atomically dispersed supported metal catalysts often exhibit excellent catalytic performance with maximized atom efficiency, the fabrication of single-atom catalysts remains a great challenge because the atoms aggregate easily. Herein, a simple ionothermal method was developed to fabricate atomically dispersed Fe–Nx species on porous porphyrinic triazine-based frameworks (FeSAs/PTF) with high Fe loading up to 8.3 wt %, resulting in highly reactive and stable single-atom ORR catalysts for the first time. Owing to the high density of single-atom Fe–N4 active sites, highly hierarchical porosity, and good conductivity, the as-prepared catalyst FeSAs/PTF-600 exhibited highly efficient activity, methanol tolerance, and superstability for the ORR under both alkaline and acidic conditions. This work will bring new inspiration to the design of highly efficient noble-metal-free catalysts at the atomic scale for energy conversion.
•We have enhanced TBSVM to LSTBSVM in the least squares sense, while in L1-LSTBSVM the distance is measured by L1-norm.
•L1-LSTBSVM is more robust to outliers, can lower the computational costs and improve the classification performance.
•We design a valid iterative algorithm to solve the L1-norm optimization problems, which is an important theoretical contribution.
•The proposed method can be conveniently extended to other improved variants of TWSVM.
In this paper, we construct a least squares version of the recently proposed twin bounded support vector machine (TBSVM) for binary classification. As a valid classification tool, TBSVM seeks two non-parallel planes by solving a pair of quadratic programming problems (QPPs), which is time-consuming. Here, we solve two systems of linear equations rather than two QPPs to avoid this deficiency. Furthermore, the distance in least squares TBSVM (LSTBSVM) is measured by L2-norm, but L1-norm distance is usually regarded as an alternative to L2-norm to improve model robustness in the presence of outliers. Inspired by the advantages of least squares twin support vector machine (LSTWSVM), TBSVM and L1-norm distance, we propose an LSTBSVM based on the L1-norm distance metric for binary classification, termed L1-LSTBSVM, which is specially designed to suppress the negative effect of outliers and improve computational efficiency on large datasets. We then design an iterative algorithm to solve the L1-norm optimization problems; it is easy to implement, and its convergence to an optimal solution is theoretically ensured. Finally, the feasibility and effectiveness of L1-LSTBSVM are validated by extensive experimental results on both UCI datasets and artificial datasets.
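The robustness argument for L1-norm over L2-norm distance, and the idea of solving an L1 problem by iterating weighted least squares steps, can be illustrated on a one-dimensional toy problem. The sketch below is a generic iteratively reweighted least squares scheme for finding the point that minimizes the sum of absolute deviations; it is not the algorithm of the paper, and the function names, iteration cap, and the small constant `eps` are assumptions.

```python
def l2_center(xs):
    """Minimizer of the sum of squared deviations: the arithmetic mean."""
    return sum(xs) / len(xs)


def l1_center(xs, iters=100, eps=1e-9, tol=1e-12):
    """Approximate minimizer of the sum of absolute deviations.

    Each step solves a weighted least squares problem in closed form,
    with weights 1/|x - m| taken from the previous estimate m.
    eps guards against division by zero when m reaches a data point.
    """
    m = l2_center(xs)  # start from the L2 solution
    for _ in range(iters):
        weights = [1.0 / max(abs(x - m), eps) for x in xs]
        m_new = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is an outlier
    print("L2 center:", l2_center(data))
    print("L1 center:", l1_center(data))
```

On data with a single large outlier, the L2 center is dragged toward the outlier while the L1 center stays near the bulk of the points, which is the behavior the L1-norm distance is meant to bring to the classifier.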