As lung cancer remains the leading cause of cancer deaths globally, characterizing tumor molecular profiles is crucial to tailoring treatments for individuals at advanced stages. Cancer cells exhibit strong dependence on iron for their proliferation, and several iron-regulatory proteins have been proposed as either oncogenes or tumor suppressor genes. This study aims to evaluate the prospective therapeutic and prognostic values of the sideroflexin (SFXN) gene family, whose functions involve mitochondrial iron metabolism, in lung adenocarcinoma (LUAD). Differential expression analysis using the TIMER and UALCAN tools was first employed to compare SFXN expression levels between normal and LUAD tissues. Next, the SFXNs’ prognostic values, biological significance, and potential as immunotherapy candidates were examined using the GEPIA, cBioPortal, MetaCore, Cytoscape, and TIMER databases. All members of the SFXN family except SFXN3 were differentially expressed in LUAD compared to normal samples and across different stages of LUAD. Survival analysis then revealed that SFXN1 was related to worse overall survival in patients with LUAD. Furthermore, several correlations between SFXN1 expression and infiltrating immune cells were discovered. To conclude, our study provides evidence of the SFXN gene family’s relevance to the prognosis of LUAD and to its immunotherapeutic targets.
Motor proteins are the driving force behind muscle contraction and are responsible for the active transportation of most proteins and vesicles in the cytoplasm. There are three superfamilies of cytoskeletal motor proteins with various molecular functions and structures: dynein, kinesin, and myosin. The functional loss of a specific motor protein has been linked to a variety of human diseases, such as Charcot-Marie-Tooth disease and kidney disease. Therefore, creating a precise model to classify motor proteins is essential for helping biologists understand their molecular functions and design drug targets according to their impact on human diseases. Here we attempt to classify cytoskeleton motor proteins using deep learning, which has been increasingly and widely used to address numerous problems in a variety of fields with state-of-the-art results. Our deep convolutional neural network achieves independent test accuracies of 97.5%, 96.4%, and 96.1% for the three superfamilies, respectively. Compared to other state-of-the-art methods, our approach showed a significant improvement in performance across a range of evaluation metrics. Through the proposed study, we provide an effective model for classifying motor proteins and a basis for further research that can enhance the performance of protein function classification using deep learning.
•A powerful deep learning framework for classifying motor proteins into three superfamilies.
•The two-dimensional convolutional neural network was constructed on PSSM profiles.
•The model can classify motor proteins with accuracy of 97.5%, 96.4%, and 96.1% for each family, respectively.
•Achieving higher performance than traditional machine learning techniques.
•A basis for applying deep learning in the classification of protein functions.
Early identification of epidermal growth factor receptor (EGFR) and Kirsten rat sarcoma viral oncogene homolog (KRAS) mutations is crucial for selecting a therapeutic strategy for patients with non-small-cell lung cancer (NSCLC). We proposed a machine learning-based model for feature selection and prediction of EGFR and KRAS mutations in patients with NSCLC that includes the least number of the most semantic radiomics features. We included a cohort of 161 of 211 patients with NSCLC from The Cancer Imaging Archive (TCIA) and analyzed 161 low-dose computed tomography (LDCT) images for detecting EGFR and KRAS mutations. A total of 851 radiomics features, which were classified into 9 categories, were obtained through manual segmentation and radiomics feature extraction from LDCT. We evaluated our models using a validation set consisting of 18 patients derived from the same TCIA dataset. The results showed that the genetic algorithm plus XGBoost classifier exhibited the most favorable performance, with accuracies of 0.836 and 0.86 for detecting EGFR and KRAS mutations, respectively. We demonstrated that a noninvasive machine learning-based model including the least number of the most semantic radiomics signatures could robustly predict EGFR and KRAS mutations in patients with NSCLC.
Accurately predicting tumor T-cell antigen (TTCA) sequences is a crucial task in the development of cancer vaccines and immunotherapies. TTCAs derived from tumor cells are presented to immune cells (T cells) through the major histocompatibility complex (MHC), via the recognition of specific portions of their structure known as epitopes. More specifically, MHC class I presents TTCAs to T-cell receptors (TCRs), which are located on the surface of CD8+ T cells. However, TTCA sequences are varied, which complicates vaccine design. Recently, machine learning (ML) models have been developed to predict TTCA sequences, which could aid in fast and correct TTCA identification. During the construction of a TTCA predictor, the peptide encoding strategy is an important step. Previous studies have used biological descriptors for encoding TTCA sequences. However, no studies have used natural language processing (NLP), a potential approach for this purpose. Just as sentences have their own words with diverse properties, biological sequences hold unique characteristics that reflect evolutionary information, physicochemical values, and structural information. We hypothesized that NLP methods would benefit the prediction of TTCAs. To develop a new TTCA identification model, we first constructed a baseline model with widely used ML algorithms and features extracted from biological descriptors. Then, to improve model performance, we added features extracted from biological language models (BLMs) based on NLP methods. In addition, we conducted feature selection using the Chi-square and Pearson Correlation Coefficient techniques. SMOTE, Up-sampling, and Near-Miss were then used to treat the unbalanced data. Finally, we optimized Sa-TTCA by applying the SVM algorithm to the four most effective feature groups. The best Sa-TTCA model showed a competitive balanced accuracy of 87.5% on the training set and 72.0% on an independent testing set.
Our results suggest that integrating biological descriptors with natural language processing has the potential to improve the precision of predicting protein/peptide functionality, which could be beneficial for developing cancer vaccines.
•An innovative approach to tumor T-cell antigen (TTCA) prediction.
•The integration of NLP with machine learning to improve the model’s performance.
•Promising results have been achieved.
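The class-balancing step mentioned in the abstract above can be illustrated with a minimal sketch of plain random up-sampling. This is an assumption-laden illustration, not the authors' pipeline: the function name `random_upsample`, the toy peptides, and the fixed seed are all hypothetical, and SMOTE and Near-Miss are not shown.

```python
import random
from collections import Counter


def random_upsample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data (hypothetical peptides): 2 positives, 6 negatives.
peptides = ["AAK", "GLY", "KLM", "PQR", "STV", "WYA", "CDE", "FGH"]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

balanced_peptides, balanced_labels = random_upsample(peptides, labels)
class_counts = Counter(balanced_labels)
```

In practice, resampling of this kind is usually applied to the training split only, so that the independent test set keeps its original class distribution.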
We herein proposed a novel approach based on the language representation learning method to categorize electron complex proteins into 5 types. The idea stems from the shared characteristics of human language and protein sequence language; thus, advanced natural language processing techniques were used for extracting useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets before using a support vector machine algorithm to classify them. During the 5‐fold cross‐validation processes, seven types of sequence‐based features were analyzed to find the optimal features. On average, our final classification models achieved an accuracy, specificity, sensitivity, and MCC of 96 %, 96.1 %, 95.3 %, and 0.86, respectively, on cross‐validation data. For the independent test data, the corresponding performance scores are 95.3 %, 92.6 %, 94 %, and 0.87. We concluded that, using features extracted with these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with that of an existing deep neural network method on the task of categorizing electron complexes, while enjoying much faster feature generation. Furthermore, the results also showed that combining features learned from the representation learning methods with sequence motif counts helps yield better performance.
This study introduces VF-Pred, a novel framework developed for detecting virulence factors (VFs) through the analysis of genomic data. VFs are crucial for pathogens to successfully infect host tissue and evade the immune system, leading to the onset of infectious diseases. Identifying VFs accurately is of utmost importance for developing potent drugs and vaccines to counter these diseases. To accomplish this, VF-Pred combines various feature engineering techniques to generate inputs for distinct machine learning classification models. The collective predictions of these models are then consolidated by a final downstream model using an innovative ensembling approach. One notable aspect of VF-Pred is the inclusion of a novel Seq-Alignment feature, which significantly enhances the accuracy of the employed machine learning algorithms. The framework was trained on 982 features obtained from extensive feature engineering, using an ensemble of 25 models. The new downstream ensembling technique adopted by VF-Pred surpasses existing stacking strategies and other ensembling methods, delivering superior performance in VF detection. Although similar studies have been done earlier, VF-Pred stands out in comparison, showing higher accuracy (83.5 %) and higher sensitivity (87 %) in the identification of VFs. VF-Pred is accessible through a user-friendly web page: given an identifier and a protein sequence, it predicts a high or low likelihood of the protein being a VF. Overall, VF-Pred showcases a highly promising methodology for the identification of VFs, potentially paving the way for the development of more effective strategies in the battle against infectious diseases.
Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, and is thus an important problem for bioinformatics researchers. In this study, we applied a word embedding approach, the main driver of the natural language processing breakthrough in recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of the protein sequence's constituent biological words to find the optimal length, which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 and 0.99 on the 5-fold cross validation and the independent test, respectively. With this result, our study can help biologists identify transporters based on substrate specificities and provides a basis for further research on applying natural language processing techniques in bioinformatics.
In this study, we aimed at identifying the substrate specificities of transport proteins, which is closely related to protein-target interaction prediction, drug design, and dysregulation analysis. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the lengths of protein sequence's constituent biological words to find the optimal length which generated the most discriminative feature set. Our final model built based on the best choice of biological word lengths and machine learning algorithms achieves a high average accuracy of 95% and 97%, average area under the curve of 0.96 and 0.99, respectively on the 5-fold cross validation and independent test.
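To make the notion of "biological words" concrete, the following minimal sketch splits a protein sequence into words of a chosen length and computes their relative frequencies. It assumes that a biological word is a k-length substring (k-mer) of the sequence; the helper names, the overlapping/non-overlapping option, and the example fragment are hypothetical, and the embedding training step itself is not shown.

```python
from collections import Counter


def biological_words(sequence, k, overlapping=True):
    """Split a sequence into words of length k (k-mers).

    With overlapping=True the window moves one residue at a time;
    otherwise it moves k residues at a time."""
    step = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


def word_frequencies(sequence, k):
    """Relative frequency of each overlapping k-length word in a sequence."""
    words = biological_words(sequence, k)
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


sequence = "MKTAYIAKQR"  # hypothetical protein fragment

words_k3 = biological_words(sequence, 3)
words_k3_disjoint = biological_words(sequence, 3, overlapping=False)
freqs_k1 = word_frequencies("AAKA", 1)
```

Varying `k` in a sketch like this corresponds to varying the word length when searching for the most discriminative feature set; the resulting words would then be passed to an embedding model of the reader's choice.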
Enhancers are non-coding DNA fragments which are crucial in gene regulation (e.g. transcription and translation). Because enhancers have high locational variation and are freely scattered in 98% of non-encoding genomes, their identification is more complicated than that of other genetic factors. To address this biological issue, several in silico studies have been done to identify and classify enhancer sequences among a myriad of DNA sequences using computational advances. Although recent studies have achieved improved performance, shortfalls in these learning models remain. To overcome the limitations of existing learning models, we introduce iEnhancer-ECNN, an efficient prediction framework that uses one-hot encoding and k-mers for data transformation and ensembles of convolutional neural networks for model construction, to identify enhancers and classify their strength. The benchmark dataset from Liu et al.'s study was used to develop and evaluate the ensemble models. A comparative analysis between iEnhancer-ECNN and existing state-of-the-art methods was done to fairly assess the model performance.
Our experimental results demonstrate that iEnhancer-ECNN performs better than other state-of-the-art methods using the same dataset. The accuracies of the ensemble model for enhancer identification (layer 1) and enhancer classification (layer 2) are 0.769 and 0.678, respectively. Compared to other related studies, improvements in the Area Under the Receiver Operating Characteristic Curve (AUC), sensitivity, and Matthews's correlation coefficient (MCC) of our models are remarkable, especially for the model of layer 2 with about 11.0%, 46.5%, and 65.0%, respectively.
iEnhancer-ECNN outperforms other previously proposed methods with significant improvement in most of the evaluation metrics. The strong gains in the MCC of both layers are highly meaningful in assuring the stability of our models.
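The two data transformations named in the abstract above, one-hot encoding and k-mers, can be sketched as follows. This is a minimal illustration under stated assumptions: the A, C, G, T channel order, the all-zero row for unknown bases, and the function names are choices made here, not details taken from iEnhancer-ECNN.

```python
NUCLEOTIDES = "ACGT"  # assumed channel order for the one-hot vectors


def one_hot_encode(dna):
    """Encode a DNA string as a list of 4-element one-hot rows.

    Bases outside A/C/G/T (for example N) are left as all-zero rows."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in dna.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


def kmers(dna, k):
    """Return all overlapping substrings of length k from a DNA string."""
    dna = dna.upper()
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]


encoded = one_hot_encode("ACGTN")
dimers = kmers("ACGTA", 2)
```

A sequence of length L becomes an L x 4 matrix under this encoding, which is the kind of fixed-width input a convolutional network can consume.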
Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, studies of essential genes have been considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular to reduce the cost and time consumption of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance on this problem from DNA sequence features. This study took advantage of a natural language processing (NLP) model in learning biological sequences by treating them as natural language words. To learn from the NLP features, an ensemble deep neural network was subsequently employed as the supervised learning model. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The proposed method outperformed the single models without ensemble, as well as the state-of-the-art predictors on the same benchmark dataset. This indicates the effectiveness of the proposed method in determining essential genes, in particular, and in other sequencing problems, in general.
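For readers less familiar with the threshold-based metrics reported above, the sketch below computes sensitivity, specificity, accuracy, and MCC from the four cells of a binary confusion matrix using their standard definitions. The counts in the example are hypothetical and are not taken from the study; the function name is likewise an illustration.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0)."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=60, tn=85, fp=15, fn=40)
```

Because MCC uses all four cells of the confusion matrix, it can remain modest even when specificity and accuracy look comfortable, which is why it is often reported alongside them for imbalanced problems.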
Membrane proteins, the most important drug targets, account for around 30% of total proteins encoded by the genome of living organisms. An important role of these proteins is to bind adenosine triphosphate (ATP), facilitating crucial biological processes such as metabolism and cell signaling. There are several reports elucidating ATP-binding sites within proteins. However, such studies on membrane proteins are limited. Our prediction tool, DeepATP, combines evolutionary information in the form of a Position Specific Scoring Matrix with a two-dimensional Convolutional Neural Network to predict ATP-binding sites in membrane proteins with an MCC of 0.89 and an AUC of 99%. Compared to recently published ATP-binding site predictors and classifiers that use traditional machine learning algorithms, our approach performs significantly better. We suggest this method as a reliable tool for biologists for ATP-binding site prediction in membrane proteins.
In this study, we apply a deep learning technique, a convolutional neural network on position specific scoring matrices, to identify ATP-binding sites in membrane proteins, which are the most important drug targets. We also addressed the imbalanced dataset issue, which can be seen in most binding site prediction problems. With an MCC of 0.89 and an AUC of 99%, our proposed technique can serve as a powerful tool for biologists to identify ATP-binding sites in membrane proteins. Moreover, this study provides a basis for further research on applying deep learning in bioinformatics.
•Many life-essential biological mechanisms can be understood by accurately identifying ATP-binding sites in membrane proteins.
•Existing predictors can be used to predict ATP-binding membrane proteins, but their lack of specificity reduces their potential.
•The specific and greatly imbalanced dataset issue of ATP-binding sites in membrane proteins was also addressed.