Phosphorylation is a ubiquitous type of post-translational modification (PTM) that occurs in both eukaryotic and prokaryotic cells, in which a phosphate group binds to amino acid residues. These specific residues, i.e., serine (S), threonine (T), and tyrosine (Y), exhibit diverse functions at the molecular level. Recent studies have determined that some diseases, such as cancer, diabetes, and neurodegenerative diseases, are caused by abnormal phosphorylation. Because of its potential applications in biological research and drug development, the large-scale identification of phosphorylation sites has attracted interest. Existing wet-lab technologies for targeting phosphorylation sites are expensive and time-consuming. Thus, computational algorithms that can efficiently accelerate the annotation of phosphorylation sites from massive protein sequences are needed. Numerous machine learning-based methods have been implemented for phosphorylation site prediction. However, despite extensive efforts, existing computational approaches continue to have inadequate performance, particularly in terms of overall accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). In this paper, we report a novel deep learning-based predictor, DeepPPSite, designed to overcome these performance hurdles. It was constructed using a stacked long short-term memory recurrent network for predicting phosphorylation sites. The proposed technique efficiently learns protein representations from conjoint protein descriptors. The experimental results indicated that our model achieved superior performance on the training dataset for S, T, and Y, with MCC values of 0.608, 0.602, and 0.558, respectively, using a 10-fold cross-validation test. We further determined the generalization efficacy of the proposed predictor DeepPPSite by conducting a rigorous independent test. The predictive MCC values were 0.358, 0.356, and 0.350 for the S, T, and Y phosphorylation sites, respectively.
Rigorous cross-validation and independent validation tests for the three types of phosphorylation sites demonstrated that the designed DeepPPSite tool significantly outperforms state-of-the-art methods.
•A deep learning-based automatic method, named DeepPPSite, is developed for the prediction of phosphorylation sites.
•The protein sequence features are extracted by fusing the PSPM, IPC and EGBW methods.
•A stacked LSTM network is used as a powerful classifier, with F-score as the primary feature selection strategy.
•K-fold techniques are applied as validation tests, and an independent dataset test is used to assess the model's generality.
•Using the selected features, DeepPPSite outperformed existing methods in prediction performance.
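Since the results above are reported as MCC values, a minimal sketch of how the Matthews correlation coefficient is computed from confusion-matrix counts may help. The function name and the example counts are hypothetical and chosen for illustration only; they are not taken from the DeepPPSite paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a whole row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(matthews_corrcoef(tp=80, tn=70, fp=30, fn=20), 3))
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), which is why it is often preferred to accuracy on imbalanced site-prediction datasets.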
Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase in GO prediction accuracy compared to state-of-the-art approaches in all three aspects: molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models, which can extract discriminative functional patterns from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that combining the network scores with complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotation from sequence alone.
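The idea of tying functional similarity to distance in the embedding space can be illustrated with the standard margin-based triplet loss. This is a generic sketch, not ATGO's implementation: the abstract does not state the exact loss, so the Euclidean distance, the margin value, and all names below are assumptions.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, for illustration
) -> float:
    """Loss is zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Toy 2-D embeddings: the positive shares the anchor's function, the negative does not.
    print(triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))
```

Minimizing this quantity pulls functionally similar proteins together and pushes dissimilar ones apart in the embedding space.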
Protein subcellular localization plays a crucial role in characterizing the function of proteins and understanding various cellular processes. Therefore, accurate identification of protein subcellular location is an important yet challenging task. Numerous computational methods have been proposed to predict the subcellular location of proteins. However, most existing methods are limited in overall accuracy, time consumption, and generalization power. To address these problems, in this study, we developed a novel computational approach based on Human Protein Atlas (HPA) data, referred to as PScL-HDeep, for accurate and efficient image-based prediction of protein subcellular location in human tissues. We extracted different handcrafted and deep-learned (by employing a pretrained deep learning model) features from different viewpoints of the image. The step-wise discriminant analysis (SDA) algorithm was applied to generate the optimal feature set from each original raw feature set. To further obtain a more informative feature subset, the support vector machine-based recursive feature elimination with correlation bias reduction (SVM-RFE + CBR) feature selection algorithm was applied to the integrated feature set. Finally, the classification models, namely support vector machine with radial basis function (SVM-RBF) and support vector machine with linear kernel (SVM-LNR), were trained on the final selected feature set. To evaluate the performance of the proposed method, a new gold standard benchmark training dataset was constructed from the HPA databank. PScL-HDeep achieved the best performance in a 10-fold cross-validation test on this dataset and showed better efficacy than existing predictors. Furthermore, we also illustrated the generalization ability of the proposed method by conducting a stringent independent validation test.
Non-synonymous single-nucleotide polymorphisms (nsSNPs) are a typical kind of genetic variant, and more than 6000 diseases have been found to be caused by nsSNPs. Accordingly, the accurate prediction of nsSNPs is of great importance for a better understanding of their functional mechanisms and for disease treatment. To date, many computational methods have been developed to distinguish disease-causing nsSNPs from neutral ones; however, there is still room for improvement in overall prediction performance. In this work, we proposed a novel deep learning model, called multi-scale convolutional neural network (MSCNN). It utilizes multi-scale convolution with different kernel sizes for feature processing, which can collect more effective characteristics than a single convolution kernel size. Moreover, we applied three types of nominal structural features to further improve nsSNP prediction performance. Notably, the nsSNP sequence and structural features were extracted based on the “residue environment” method we proposed, which has proved to be effective for protein nsSNP prediction in our previous research. Based on the proposed MSCNN model and the extracted informative feature matrix, we implemented a new nsSNP predictor, named DeepnsSNPs. DeepnsSNPs was tested on three nsSNP datasets collected from the PredictSNP1 website and achieved an average Matthews correlation coefficient of 0.507, which is 18.28% higher than the individual classifiers and 11.37% higher than the consensus classifier on average. Detailed dataset analyses have demonstrated that DeepnsSNPs would be useful in nsSNP prediction. We provide the Python source code and benchmark datasets at https://github.com/sera616/DeepnsSNPs.git for academic use.
•MSCNN is a deep learning model that combines a multi-scale convolutional neural network with residue environment information.
•Sequence- and structure-derived features, i.e., PSSM, PSS, PRSA, and PDO, are extracted and combined to form the feature matrix.
•DeepnsSNPs is implemented and freely available at https://github.com/sera616/DeepnsSNPs.git for nsSNP prediction.
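The multi-scale idea (several kernel sizes applied to the same input, with the results concatenated) can be sketched in one dimension. This is only an illustration under simplifying assumptions: the actual MSCNN operates on feature matrices with learned kernels, whereas the fixed all-ones kernels, global max pooling, and function names below are hypothetical.

```python
from typing import List, Sequence


def conv1d_valid(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Sliding dot product without padding ('valid' convolution, no kernel flip)."""
    k = len(kernel)
    if k > len(signal):
        raise ValueError("kernel is longer than the signal")
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def multi_scale_features(
    signal: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[float]:
    """Apply every kernel, max-pool each response, and concatenate the results."""
    return [max(conv1d_valid(signal, kernel)) for kernel in kernels]


if __name__ == "__main__":
    # Toy per-residue signal and three window sizes (1, 3 and 5).
    signal = [0, 1, 0, 3, 0, 0, 2]
    kernels = [[1], [1, 1, 1], [1, 1, 1, 1, 1]]
    print(multi_scale_features(signal, kernels))
```

Each kernel size summarizes the residue environment at a different window width, so the concatenated vector carries both local and broader context.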
Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, in which binding residues are far fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
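To make "synthesizing additional minority class samples" concrete, the sketch below shows a simplified, SMOTE-like interpolation between randomly chosen minority samples. It is not the supervised over-sampling algorithm of TargetSOS, whose details the abstract does not give; unlike SMOTE proper, it also does not restrict pairs to nearest neighbours. All names and data are hypothetical.

```python
import random
from typing import List, Sequence


def oversample_minority(
    minority: Sequence[Sequence[float]], n_new: int, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic points on segments between random minority pairs."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    samples = list(minority)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct minority samples
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


if __name__ == "__main__":
    # Toy binding-residue feature vectors (minority class), for illustration only.
    binding = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    for point in oversample_minority(binding, n_new=3):
        print(point)
```

The synthetic points are appended to the minority class before training, so that the classifier sees a less skewed ratio of binding to non-binding residues.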
Anticancer peptides (ACPs) have emerged as a potentially safe therapeutic agent for treating cancer. Identifying novel ACPs is crucial for gaining deep insight into their functional mechanisms and for vaccine production. Conventional wet-lab methods for finding ACPs are expensive, slow, and resource-intensive. Thus, fast and accurate ACP prediction through computational approaches is highly desired because of the massive number of peptide sequences accumulated in the post-genomic era. Recently, several intelligent statistical approaches have been designed for discriminating ACPs from non-ACPs. Although remarkable achievements have been made, available methods still have inadequate feature descriptors and learning algorithms, which restricts their predictive performance. To address this, we develop a novel predictor called StackACPred for the correct identification of ACPs. More specifically, the proposed method uses three nominal feature encoding strategies based on evolutionary-profile and physicochemical information, i.e., segmented position-specific scoring matrix (SegPSSM), pseudo position-specific scoring matrix (PsePSSM), and extended pseudo amino acid composition (PseAAC). The extracted features are serially fused and further optimized through a powerful support vector machine recursive feature elimination and correlation bias reduction (SVM-RFE + CBR) algorithm. The optimal selected attributes are used to build the stacking-based ensemble model for identifying ACPs. The proposed StackACPred attained 84.45% and 86.21% accuracy on the ACP740 and ACP240 datasets with a 5-fold cross-validation test, which was 2.97% and 0.79% higher than other existing studies, respectively. The empirical outcomes of our automated tool demonstrate excellent discriminative power for annotating large-scale ACPs in particular and other peptides in general.
•We developed an intelligent predictor named StackACPred for correct identification of ACPs.
•Three nominal feature encoding strategies based on evolutionary-profile and physicochemical information were used: N-segmented position-specific scoring matrix (N-SegPSSM), pseudo position-specific scoring matrix (PsePSSM), and extended pseudo amino acid composition (PseAAC).
•A powerful support vector machine recursive feature elimination and correlation bias reduction (SVM-RFE + CBR) algorithm was used to select the optimal features.
•LightGBM and stacking-based ensemble classifiers were used for predicting ACPs with a k-fold cross-validation test.
•StackACPred produced better results than other state-of-the-art predictors.
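The k-fold cross-validation protocol mentioned above can be sketched as a plain index splitter. This is a generic illustration, not the authors' code: the function name, the seed, and the unstratified shuffle are assumptions, and the sample count of 740 is used only because it matches the size implied by the ACP740 dataset name.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_id, (train, test) in enumerate(k_fold_indices(740, k=5)):
        print(fold_id, len(train), len(test))
```

Each sample is used for testing exactly once, and the reported accuracy is typically the average over the k held-out folds.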
Abstract
Motivation
Characterization of protein subcellular localization has become an important and long-standing task in bioinformatics and computational biology, which provides valuable information for elucidating various cellular functions of proteins and guiding drug design.
Results
Here, we develop a novel bioimage-based computational approach, termed PScL-DDCFPred, to accurately predict protein subcellular localizations in human tissues. PScL-DDCFPred first extracts multiview image features, including global and local features, as base or pure features; next, it applies a new integrative feature selection method based on stepwise discriminant analysis and generalized discriminant analysis to identify the optimal feature sets from the extracted pure features; finally, a classifier based on a deep neural network (DNN) and deep-cascade forest (DCF) is established. Stringent 10-fold cross-validation tests on the new protein subcellular localization training dataset, constructed from the Human Protein Atlas databank, illustrate that PScL-DDCFPred achieves better performance than several existing state-of-the-art methods. Moreover, the independent test set further illustrates the generalization capability and superiority of PScL-DDCFPred over existing predictors. In-depth analysis shows that the excellent performance of PScL-DDCFPred can be attributed to three critical factors, namely the effective combination of the DNN and DCF models, the complementarity of global and local features, and the use of the optimal feature sets selected by the integrative feature selection algorithm.
Availability and implementation
https://github.com/csbio-njust-edu/PScL-DDCFPred.
Supplementary information
Supplementary data are available at Bioinformatics online.
RNA 5-methylcytosine (m5C) is an important post-transcriptional modification that plays an indispensable role in biological processes. The accurate identification of m5C sites from primary RNA sequences is especially useful for deeply understanding the mechanisms and functions of m5C. Due to the difficulty and high cost of identifying m5C sites with wet-lab techniques, fast and accurate machine-learning-based prediction methods are urgently needed. In this study, we proposed a new m5C site predictor, called M5C-HPCR, by introducing a novel heuristic nucleotide physicochemical property reduction (HPCR) algorithm and a classifier ensemble. HPCR extracts multiple reducts of physical-chemical properties for encoding discriminative features, while the classifier ensemble is applied to integrate multiple base predictors, each of which is trained on a separate reduct of the physical-chemical properties obtained from HPCR. Rigorous jackknife tests on two benchmark datasets demonstrate that M5C-HPCR outperforms state-of-the-art m5C site predictors, with the highest values of MCC (0.859) and AUC (0.962). We also implemented a webserver for M5C-HPCR, which is freely available at http://cslab.just.edu.cn:8080/M5C-HPCR/.
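One simple way to integrate multiple base predictors is to average their predicted probabilities and threshold the result. The abstract does not say which combination rule M5C-HPCR uses, so the averaging rule, the 0.5 threshold, and the names below are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def ensemble_predict(
    base_probabilities: Sequence[Sequence[float]], threshold: float = 0.5
) -> Tuple[List[float], List[int]]:
    """Average per-sample probabilities across base predictors, then threshold.

    base_probabilities[m][i] is predictor m's probability that sample i is an m5C site.
    """
    n_models = len(base_probabilities)
    if n_models == 0:
        raise ValueError("need at least one base predictor")
    averaged = [sum(column) / n_models for column in zip(*base_probabilities)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


if __name__ == "__main__":
    # Hypothetical outputs of three base predictors, each trained on a different reduct.
    probs = [
        [0.9, 0.2, 0.60],
        [0.7, 0.4, 0.30],
        [0.8, 0.3, 0.45],
    ]
    averaged, labels = ensemble_predict(probs)
    print(labels)
```

Averaging tends to smooth out the errors of individual base predictors when they are trained on different feature subsets.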
Application of a molecular catalyst in artificial photosynthesis is confronted with challenges such as rapid deactivation due to photodegradation or detrimental aggregation in harsh conditions. In this work, a metal–organic cage Pd₆(RuL₃)₈²⁸⁺ (MOC-16), characteristic of a photochemical molecular device (PMD) concurrently integrating eight Ru²⁺ light-harvesting centers and six Pd²⁺ catalytic centers for efficient homogeneous H₂ production, is successfully heterogenized through incorporation into a metal–organic framework (MOF) of ZIF-8 and then transformed into a carbonate matrix of Znₓ(MeIm)ₓ(CO₃)ₓ (CZIF), leading to hybridized MOC-16@CZIF. This MOC@MOF integrated photocatalyst inherits the highly efficient and directional electron transfer in the picosecond domain of MOC-16 and possesses a triplet excited-state electron lifetime on the microsecond scale, one order of magnitude longer than that of pristine MOC-16. The carbonate CZIF matrix endows MOC-16@CZIF with water wettability, serving as a proton relay to facilitate proton delivery by virtue of H₂O as proton carriers. Electron transfer during the photocatalytic process is also enhanced by infiltration of the sacrificial agent BIH into the CZIF matrix to promote conductivity, owing to its strong reducing ability to induce free charge carriers. These synergistic effects contribute to the exceptionally high activity for H₂ generation, with the turnover frequency of this heterogeneous MOC-16@CZIF photocatalyst maintained at a level of ∼0.4 H₂·s⁻¹, a 50-fold increase over that of a homogeneous PMD. Meanwhile, it is robust enough to tolerate harsh reaction conditions, presenting an unprecedented example of heterogenizing a homogeneous PMD with a MOF-derived matrix to mimic catalytic features of a natural photosystem, which may shed light on the design of multifunctional PMD@MOF materials to expand the number of molecular catalysts for practical application in artificial photosynthesis.
DNase I hypersensitive sites (DHS) are regions that are sensitive to cleavage by the DNase I enzyme. Knowledge of these sites is helpful for deciphering the functions of non-coding genomic regions, and they are involved in various biological processes. Traditional techniques for identifying DHS sites are laborious and time-consuming. In particular, with the avalanche of DNA sequences generated in the post-genomic era, computational approaches are essential for the precise and timely prediction of DHS sites in DNA sequences. Existing feature encoding schemes, such as pseudo dinucleotide composition and pseudo trinucleotide composition, cannot effectively express features from DHS sequences. In the current study, we proposed a new computational technique to predict DHS sites, which uses the Un-biased Pseudo Trinucleotide Composition (Unb-PseTNC) strategy to extract nominal descriptors from the DHS benchmark dataset and to avoid bias among the classes during the classification phase. Several classification algorithms, including support vector machine (SVM), probabilistic neural network (PNN), and k-nearest neighbor (KNN), are employed to classify the extracted features. It was observed that SVM in conjunction with Unb-PseTNC outperforms the other techniques. In comparison with existing predictors, our proposed method achieved higher prediction rates under a rigorous jackknife test. This indicates that the proposed model could become a useful tool for predicting DHS sites and could also be utilized for in-depth study of DNA and genome analysis.
•A new computational model was established for prediction of DHS sites.
•The method used three nominal feature extraction methods called PseDNC, PseTNC and Unb-PseTNC.
•Classification engines such as SVM, PNN and KNN were utilized for classification.
•A jackknife cross-validation test was used.
•The proposed method achieved better performance than state-of-the-art methods.
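As background for the composition-based encodings named above, the sketch below computes a plain trinucleotide composition: the normalized frequency of each of the 64 possible 3-mers in a DNA sequence. It deliberately omits the sequence-order correlation terms of PseTNC and the debiasing step of Unb-PseTNC, which the abstract does not specify; the function name and example sequence are hypothetical.

```python
from itertools import product
from typing import List

# All 64 trinucleotides in a fixed order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(sequence: str) -> List[float]:
    """Return the 64-dimensional normalized frequency vector of overlapping 3-mers."""
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[kmer] / total for kmer in TRINUCLEOTIDES]


if __name__ == "__main__":
    features = trinucleotide_composition("ACGTACGT")
    print(len(features), features[TRINUCLEOTIDES.index("ACG")])
```

A fixed-length vector of this kind can be passed directly to classifiers such as SVM or KNN, regardless of the length of the input sequence.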