In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer‐based NLP models have gained significant attention for their ability to process variable‐length input sequences in parallel, using self‐attention mechanisms to capture long‐range dependencies. In this review, we discuss recent advancements in transformer‐based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications for improving the accuracy and efficiency of various tasks. We also highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides insights into the potential of transformer‐based NLP models to revolutionize proteome bioinformatics.
Flavin mononucleotides (FMNs) are cofactors responsible for carrying and transferring electrons in the electron transport chain stage of cellular respiration. Without FMNs, energy production stalls because most cellular processes are interrupted. Investigating FMN functions can therefore provide a holistic understanding of human diseases and molecular information on drug targets. We proposed a deep learning model using a two-dimensional convolutional neural network and position-specific scoring matrices that could identify FMN-interacting residues with a sensitivity of 83.7%, specificity of 99.2%, accuracy of 98.2%, and Matthews correlation coefficient of 0.85 on an independent dataset containing 141 FMN binding sites and 1,920 non-FMN binding sites. The proposed method outperformed previous studies on similar evaluation metrics. This positive outcome can also promote the use of deep learning for various problems in bioinformatics and computational biology.
DNA replication is a fundamental process that plays a crucial role in the propagation of all living things on earth. Hence, accurately identifying its origin could be key to understanding the regulatory mechanisms of gene expression. With the robust development of computational techniques and the abundance of biological sequencing data, it has become possible to identify the origin of replication accurately and promptly. This problem has drawn considerable attention from experts in the field; however, more work is required to achieve better outcomes. This study therefore explores the combination of state-of-the-art features and an extreme gradient boosting learning system for classifying DNA sequences. Our hybrid approach identifies the origin of DNA replication with a sensitivity of 85.19%, specificity of 93.83%, accuracy of 89.51%, and MCC of 0.7931. Evidence is presented to show that the proposed method is superior to state-of-the-art methods on the same benchmark dataset. Moreover, the results represent a further step towards developing prediction models for DNA replication in particular and DNA sequences in general.
• A novel method for identifying the origin of DNA replication in Saccharomyces cerevisiae with high performance.
• Features are extracted from pseudo k-tuple nucleotide composition and continuous bags of nucleotides.
• Optimizing hyper-parameters for the extreme gradient boosting learning algorithm.
• Compared with the state-of-the-art methods, our method had a significant improvement in all of the measurement metrics.
• A basis for further research that can improve the predictive performance of DNA sequencing problems.
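To make the k-tuple feature idea in the highlights concrete, here is a simplified sketch of plain k-tuple nucleotide composition. It is not the authors' implementation: it omits the sequence-order correlation terms that the "pseudo" variant adds, and the function name `kmer_composition` and the example sequence are hypothetical.

```python
from itertools import product

def kmer_composition(seq, k=2):
    """Return normalized frequencies of all 4**k k-mers, in lexicographic ACGT order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:            # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

features = kmer_composition("ACGTAC", k=2)   # toy sequence, 16-dimensional vector
print(len(features), features[1])            # index 1 corresponds to the dinucleotide "AC"
```

A fixed-length vector of this kind can be passed to any tabular classifier, including a gradient boosting model.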
Recently, language representation models have drawn considerable attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT adopts the concept of contextualized word embedding to capture the semantics of words and the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information in DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. Our BERT-based features yielded improvements of more than 5-10% in sensitivity, specificity, accuracy, and Matthews correlation coefficient compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
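The step of treating a DNA sequence as a sentence can be illustrated with a small preprocessing sketch. The abstract does not say how words were formed, so the overlapping k-mer scheme, the parameter `k`, and both function names are assumptions; `[CLS]`, `[SEP]`, and `[PAD]` are the standard BERT special tokens. The embedding step itself (running a BERT model) is deliberately left out.

```python
def dna_to_sentence(seq, k=3, stride=1):
    """Turn a DNA string into a space-separated 'sentence' of overlapping k-mer words."""
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(0, len(seq) - k + 1, stride))

def to_fixed_length(tokens, max_len, pad="[PAD]"):
    """Add BERT-style boundary tokens, then truncate or pad to exactly max_len tokens."""
    tokens = ["[CLS]"] + tokens[:max_len - 2] + ["[SEP]"]
    return tokens + [pad] * (max_len - len(tokens))

sentence = dna_to_sentence("ACGTAC", k=3)          # toy sequence
padded = to_fixed_length(sentence.split(), max_len=8)
print(sentence)
print(padded)
```

Setting `k=1` makes every nucleotide its own word. Fixing the token count is what lets each sequence map to a matrix of the same shape once every token is replaced by its embedding vector.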
Deep learning has been increasingly used to solve a number of problems with state-of-the-art performance in a wide variety of fields. In biology, deep learning can be applied to reduce feature extraction time and achieve high levels of performance. In the present work, we apply deep learning via two-dimensional convolutional neural networks and position-specific scoring matrices to classify Rab protein molecules, which are the main regulators in membrane trafficking for transferring proteins and other macromolecules throughout the cell. The loss of specific Rab molecular functions has been implicated in a variety of human diseases, e.g., choroideremia, intellectual disabilities, and cancer. Therefore, creating a precise model for classifying Rabs is crucial in helping biologists understand the molecular functions of Rabs and design drug targets according to such disease-specific information. We constructed a robust deep neural network for classifying Rabs that achieved accuracies of 99%, 99.5%, 96.3%, and 97.6% for the four specific molecular functions, respectively. Our approach demonstrates superior performance to traditional artificial neural networks. This study therefore provides both an effective tool for classifying Rab proteins and a basis for further research that can improve the performance of biological modeling using deep neural networks.
• A deep learning technique for classifying Rab proteins in different functional classes with high performance.
• Feature extraction with two-dimensional convolutional neural networks and position-specific scoring matrices.
• Compared with the other methods, our method had a significant improvement in all of the measurement metrics.
• A powerful model to help biologists discover new Rab proteins with functional annotation.
• A basis for further research that can improve the performance of computational biology using deep neural networks.
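A two-dimensional convolutional network needs inputs of a fixed shape, whereas a position-specific scoring matrix has one row per residue and therefore varies in length. One common way to bridge this, assumed here and not confirmed by the abstract, is to sum the PSSM rows belonging to each amino acid type, giving a 20 x 20 matrix. The function name and toy data below are hypothetical; the column order is the one used in PSI-BLAST PSSM output.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column order of a PSI-BLAST PSSM

def pssm_to_20x20(sequence, pssm):
    """Collapse an L x 20 PSSM into 20 x 20 by summing rows per residue type."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        r = index.get(residue)
        if r is None:                  # skip non-standard residues such as X
            continue
        for c, score in enumerate(row):
            matrix[r][c] += score
    return matrix

# Toy input: three residues with constant rows (not real PSSM scores).
toy_pssm = [[1] * 20, [2] * 20, [3] * 20]
fixed = pssm_to_20x20("AAR", toy_pssm)
print(fixed[0][:3], fixed[1][:3])
```

The resulting matrix can be treated like a single-channel image, so the same network architecture accepts proteins of any length.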
This article proposes an efficient intelligent control structure for uncertain nonlinear systems. The controller is a new self-organizing fuzzy cerebellar model articulation controller (CMAC), whose framework includes a CMAC and uses sliding mode control. A new mixed Gaussian membership function (GMF) is created from a prior GMF and a present GMF for each layer of the CMAC, reusing relevant data in the prior GMF to detect tracking errors more accurately. This is more general than the local feedback of a recurrent unit because inputs can simultaneously influence both the present state and the prior state to regulate errors appropriately. A self-organizing algorithm allows layers to be added or removed, so the structure of the new self-organizing fuzzy CMAC (NSOFC) is constructed automatically. The proposed control system consists of an NSOFC and a compensation controller. The NSOFC is the main tracking controller and imitates an ideal controller, while the compensator eliminates the approximation error. A Lyapunov function is used to guarantee system stability, and an adaptive proportional-integral method allows online updating of the parameters for efficient control. An inverted double pendulum system and a magnetic levitation system are used to demonstrate that the proposed method gives good tracking performance.
Cellular respiration is a catabolic pathway for producing adenosine triphosphate (ATP) and is the most efficient process through which cells harvest energy from consumed food. When cells undergo cellular respiration, they require a pathway to hold and transfer electrons (i.e., the electron transport chain). Through oxidation-reduction reactions, the electron transport chain produces a transmembrane proton electrochemical gradient. When protons flow back across the membrane, this mechanical energy is converted into chemical energy by ATP synthase. This conversion produces ATP, which provides energy for many cellular processes. In the electron transport chain, flavin adenine dinucleotide (FAD) is one of the most vital molecules for carrying and transferring electrons. Therefore, predicting FAD binding sites in the electron transport chain is vital for helping biologists understand the electron transport chain process and energy production in cells.
We used an independent data set to evaluate the performance of the proposed method, which had an accuracy of 69.84%. We compared the performance of the proposed method in analyzing two newly discovered electron transport protein sequences with that of the general FAD binding predictor presented by Mishra and Raghava and determined that the accuracy of the proposed method improved by 9-45% and its Matthews correlation coefficient was 0.14-0.5. Furthermore, the proposed method significantly reduced the number of false positives and can provide useful information for biologists.
We developed a method based on position-specific scoring matrix (PSSM) profiles and significant amino acid pairs (SAAPs) for identifying FAD binding sites in newly discovered electron transport protein sequences. Adding SAAPs to the PSSM features yielded a significant improvement in analyzing FAD binding proteins in the electron transport chain. The proposed method can serve as an effective tool for predicting FAD binding sites in electron transport proteins and can help biologists understand the functions of the electron transport chain, particularly those of FAD binding sites. We also developed a web server, available for academic use, that identifies FAD binding sites in electron transporters.
SNAREs (soluble N-ethylmaleimide-sensitive factor attachment protein receptors) are a group of proteins that are crucial for membrane fusion and exocytosis of neurotransmitters from the cell. They play an important role in a broad range of cell processes, including cell growth, cytokinesis, and synaptic transmission, to promote cell membrane integration in eukaryotes. Many studies have associated SNARE proteins with numerous human diseases, especially cancer. Identifying their functions is therefore an important and challenging problem for scientists seeking to better understand cancer and to design drug targets for treatment. We described each protein sequence using amino acid embeddings generated by fastText, a natural language processing model that performs well in its field. Because each protein sequence is similar to a sentence made up of different words, applying a language model to protein sequences is both challenging and promising. Once generated, the amino acid embedding features were fed into a deep learning algorithm for prediction. Our model, which combines fastText and deep convolutional neural networks, could identify SNARE proteins with an independent test accuracy of 92.8%, sensitivity of 88.5%, specificity of 97%, and Matthews correlation coefficient (MCC) of 0.86. These results were superior to those of the state-of-the-art predictor (SNARE-CNN). We suggest this study as a reliable method for SNARE identification, and it serves as a basis for applying the fastText word embedding model in bioinformatics, especially in protein sequence prediction.
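fastText builds a word's embedding from its character n-grams, which is why it suits short "words" drawn from a protein sequence. The sketch below illustrates only that subword step and is not the library's code or the authors' pipeline. The function name and the example word are hypothetical, and how a protein is split into words is not specified in the abstract.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List fastText-style character n-grams of a word wrapped in '<' and '>' markers."""
    marked = f"<{word}>"               # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# Hypothetical 3-residue "word" taken from a protein sequence.
print(char_ngrams("MKV", min_n=3, max_n=4))
```

In fastText the word vector is the sum of the vectors of such n-grams, so unseen amino acid words still receive an embedding as long as their n-grams were observed during training.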