Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from ...transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
Membrane proteins play a crucial role in various cellular processes and are essential components of cell membranes. Computational methods have emerged as a powerful tool for studying membrane ...proteins due to their complex structures and properties that make them difficult to analyze experimentally. Traditional features for protein sequence analysis based on amino acid types, composition, and pair composition have limitations in capturing higher‐order sequence patterns. Recently, multiple sequence alignment (MSA) and pre‐trained language models (PLMs) have been used to generate features from protein sequences. However, the significant computational resources required for MSA‐based features generation can be a major bottleneck for many applications. Several methods and tools have been developed to accelerate the generation of MSAs and reduce their computational cost, including heuristics and approximate algorithms. Additionally, the use of PLMs such as BERT has shown great potential in generating informative embeddings for protein sequence analysis. In this review, we provide an overview of traditional and more recent methods for generating features from protein sequences, with a particular focus on MSAs and PLMs. We highlight the advantages and limitations of these approaches and discuss the methods and tools developed to address the computational challenges associated with features generation. Overall, the advancements in computational methods and tools provide a promising avenue for gaining deeper insights into the function and properties of membrane proteins, which can have significant implications in drug discovery and personalized medicine.
Deep learning has been increasingly used to solve a number of problems with state-of-the-art performance in a wide variety of fields. In biology, deep learning can be applied to reduce feature ...extraction time and achieve high levels of performance. In our present work, we apply deep learning via two-dimensional convolutional neural networks and position-specific scoring matrices to classify Rab protein molecules, which are main regulators in membrane trafficking for transferring proteins and other macromolecules throughout the cell. The functional loss of specific Rab molecular functions has been implicated in a variety of human diseases, e.g., choroideremia, intellectual disabilities, cancer. Therefore, creating a precise model for classifying Rabs is crucial in helping biologists understand the molecular functions of Rabs and design drug targets according to such specific human disease information. We constructed a robust deep neural network for classifying Rabs that achieved an accuracy of 99%, 99.5%, 96.3%, and 97.6% for each of four specific molecular functions. Our approach demonstrates superior performance to traditional artificial neural networks. Therefore, from our proposed study, we provide both an effective tool for classifying Rab proteins and a basis for further research that can improve the performance of biological modeling using deep neural networks.
•A deep learning technique for classifying Rab proteins in different functional classes with high performance.•Feature extraction with two-dimensional convolutional neural networks and position specific scoring matrices.•Compared with the other methods, our method had a significant improvement in all of the measurement metrics.•A powerful model to help biologists discover the new Rab proteins with functional annotation.•A basis for further research that can improve the performance of computational biology using deep neural networks.
Cellular respiration is a catabolic pathway for producing adenosine triphosphate (ATP) and is the most efficient process through which cells harvest energy from consumed food. When cells undergo ...cellular respiration, they require a pathway to keep and transfer electrons (i.e., the electron transport chain). Due to oxidation-reduction reactions, the electron transport chain produces a transmembrane proton electrochemical gradient. In case protons flow back through this membrane, this mechanical energy is converted into chemical energy by ATP synthase. The convert process is involved in producing ATP which provides energy in a lot of cellular processes. In the electron transport chain process, flavin adenine dinucleotide (FAD) is one of the most vital molecules for carrying and transferring electrons. Therefore, predicting FAD binding sites in the electron transport chain is vital for helping biologists understand the electron transport chain process and energy production in cells.
We used an independent data set to evaluate the performance of the proposed method, which had an accuracy of 69.84 %. We compared the performance of the proposed method in analyzing two newly discovered electron transport protein sequences with that of the general FAD binding predictor presented by Mishra and Raghava and determined that the accuracy of the proposed method improved by 9-45 % and its Matthew's correlation coefficient was 0.14-0.5. Furthermore, the proposed method enabled reducing the number of false positives significantly and can provide useful information for biologists.
We developed a method that is based on PSSM profiles and SAAPs for identifying FAD binding sites in newly discovered electron transport protein sequences. This approach achieved a significant improvement after we added SAAPs to PSSM features to analyze FAD binding proteins in the electron transport chain. The proposed method can serve as an effective tool for predicting FAD binding sites in electron transport proteins and can help biologists understand the functions of the electron transport chain, particularly those of FAD binding sites. We also developed a web server which identifies FAD binding sites in electron transporters available for academics.
In cellular transportation mechanisms, the movement of ions across the cell membrane and its proper control are important for cells, especially for life processes. Ion transporters/pumps and ion ...channel proteins work as border guards controlling the incessant traffic of ions across cell membranes. We revisited the study of classification of transporters and ion channels from membrane proteins with a more efficient deep learning approach. Specifically, we applied multi‐window scanning filters of convolutional neural networks on almost full‐length position‐specific scoring matrices for extracting useful information. In this way, we were able to retain important evolutionary information of the proteins. Our experiment results show that a convolutional neural network with a minimum number of convolutional layers can be enough to extract the conserved information of proteins which leads to higher performance. Our best prediction models were obtained after examining different data imbalanced handling techniques, and different protein encoding methods. We also showed that our models were superior to traditional deep learning approaches on the same datasets as well as other machine learning classification algorithms.
Protein multiple sequence alignment information has long been important features to know about functions of proteins inferred from related sequences with known functions. It is therefore one of the ...underlying ideas of Alpha fold 2, a breakthrough study and model for the prediction of three‐dimensional structures of proteins from their primary sequence. Our study used protein multiple sequence alignment information in the form of position‐specific scoring matrices as input. We also refined the use of a convolutional neural network, a well‐known deep‐learning architecture with impressive achievement on image and image‐like data. Specifically, we revisited the study of prediction of adenosine triphosphate (ATP)‐binding sites with more efficient convolutional neural networks. We applied multiple convolutional window scanning filters of a convolutional neural network on position‐specific scoring matrices for as much as useful information as possible. Furthermore, only the most specific motifs are retained at each feature map output through the one‐max pooling layer before going to the next layer. We assumed that this way could help us retain the most conserved motifs which are discriminative information for prediction. Our experiment results show that a convolutional neural network with not too many convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining them with different hyper‐parameters. Our experiment results showed that our models were superior to traditional use of convolutional neural networks on the same datasets as well as other machine‐learning classification algorithms.
•Integration of pre-trained protein language models with multi-window CNNs for secondary active transporter identification.•Pre-trained language model embeddings capture informative sequence patterns ...compared to one-hot encodings or evolutionary profiles.•Pre-trained language model embeddings with 128 filters and multi-window CNN achieved 86% sensitivity, 99% specificity, and 98% accuracy.
Secondary active transporters play pivotal roles in regulating ion and molecule transport across cell membranes, with implications in diseases like cancer. However, studying transporters via biochemical experiments poses challenges. We propose an effective computational approach to identify secondary active transporters from membrane protein sequences using pre-trained language models and deep learning neural networks.
Our dataset comprised 290 secondary active transporters and 5,420 other membrane proteins from UniProt. Three types of features were extracted - one-hot encodings, position-specific scoring matrix profiles, and contextual embeddings from the ProtTrans language model. A multi-window convolutional neural network architecture scanned the ProtTrans embeddings using varying window sizes to capture multi-scale sequence patterns.
The proposed model combining ProtTrans embeddings and multi-window convolutional neural networks achieved 86% sensitivity, 99% specificity and 98% overall accuracy in identifying secondary active transporters, outperforming conventional machine learning approaches.
This work demonstrates the promise of integrating pre-trained language models like ProtTrans with multi-scale deep neural networks to effectively interpret transporter sequences for functional analysis. Our approach enables more accurate computational identification of secondary active transporters, advancing membrane protein research.
E-Learning has become more and more popular in recent years with the advance of new technologies. Using their mobile devices, people can expand their knowledge anytime and anywhere. E-Learning also ...makes it possible for people to manage their learning progression freely and follow their own learning style. However, studies show that E-Learning can cause the user to experience feelings of isolation and detachment due to the lack of human-like interactions in most E-Learning platforms. These feelings could reduce the user's motivation to learn. In this paper, we explore and evaluate how well current chatbot technologies assist users' learning on E-Learning platforms and how these technologies could possibly reduce problems such as feelings of isolation and detachment. For evaluation, we specifically designed a chatbot to be an E-Learning assistant. The NLP core of our chatbot is based on two different models: a retrieval-based model and a QANet model. We designed this two-model hybrid chatbot to be used alongside an E-Learning platform. The core response context of our chatbot is not only designed with course materials in mind but also everyday conversation and chitchat, which make it feel more like a human companion. Experiment and questionnaire evaluation results show that chatbots could be helpful in learning and could potentially reduce E-Learning users' feelings of isolation and detachment. Our chatbot also performed better than the teacher counselling service in the E-Learning platform on which the chatbot is based.