Abstract
The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
Graphical Abstract
DeepLoc 2.0 uses a transformer-based protein language model to predict multi-label subcellular localization and provides interpretability via the attention and sorting signal prediction.
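As a rough illustration of what multi-label prediction means here, the sketch below scores each compartment with an independent sigmoid and reports every compartment that passes a threshold, so a protein can receive more than one location. The label names, logits, and the 0.5 threshold are assumptions made for this example; they are not taken from DeepLoc 2.0.

```python
import math

# Hypothetical compartment names; the real tool predicts its own label set.
LABELS = ["Nucleus", "Cytoplasm", "Mitochondrion"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_locations(logits, threshold=0.5):
    """Return every label whose independent probability reaches the threshold.

    Unlike a softmax over classes, each label is decided on its own,
    so zero, one, or several locations can be returned.
    """
    probs = [sigmoid(z) for z in logits]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]


# Made-up logits for one protein.
print(predict_locations([2.0, 0.3, -1.5]))
```

In practice a per-label threshold could be tuned on validation data instead of using a single fixed value.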
The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only.
Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism that identifies protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information.
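The attention step described above can be sketched in a few lines. This is a minimal, generic attention pooling over per-residue vectors, not the published architecture: the hidden states and scores are made-up numbers, and in a real model both would be produced by trained network layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, scores):
    """Collapse a variable-length sequence into one fixed-size vector.

    hidden_states: one vector per residue (e.g. recurrent network outputs).
    scores: one unnormalised attention score per residue.
    Returns the weighted-average vector and the attention weights.
    """
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


# Toy input: 4 residues with 2-dimensional hidden states (assumed values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores = [0.1, 0.1, 3.0, 0.1]
context, weights = attention_pool(hidden, scores)
print(context, weights)
```

The weights sum to one along the sequence, which is what allows them to be read as a profile of the regions the model relies on.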
The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php.
jjalma@dtu.dk.
Signal peptides (SPs) are short amino acid sequences in the amino terminus of many newly synthesized proteins that target proteins into, or across, membranes. Bioinformatic tools can predict SPs from amino acid sequences, but most cannot distinguish between various types of signal peptides. We present a deep neural network-based approach that improves SP prediction across all domains of life and distinguishes between three types of prokaryotic SPs.
Despite recent advances in metagenomic binning, reconstruction of microbial species from metagenomics data remains challenging. Here we develop variational autoencoders for metagenomic binning (VAMB), a program that uses deep variational autoencoders to encode sequence coabundance and k-mer distribution information before clustering. We show that a variational autoencoder is able to integrate these two distinct data types without any previous knowledge of the datasets. VAMB outperforms existing state-of-the-art binners, reconstructing 29-98% and 45% more near-complete (NC) genomes on simulated and real data, respectively. Furthermore, VAMB is able to separate closely related strains up to 99.5% average nucleotide identity (ANI), and reconstructed 255 and 91 NC Bacteroides vulgatus and Bacteroides dorei sample-specific genomes as two distinct clusters from a dataset of 1,000 human gut microbiome samples. We use 2,606 NC bins from this dataset to show that species of the human gut microbiome have different geographical distribution patterns. VAMB can be run on standard hardware and is freely available at https://github.com/RasmussenLab/vamb.
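To make "k-mer distribution information" concrete, the sketch below computes a normalised k-mer frequency vector for one DNA sequence. It is an illustrative assumption, not VAMB's implementation: the function name and the default k=4 are chosen for this example, and the sketch does not merge reverse-complement k-mers, which a real binner might do.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return a dict mapping every A/C/G/T k-mer to its relative frequency.

    Windows containing any other character (for example N) are ignored.
    A sequence with no valid window yields all-zero frequencies.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return {km: 0.0 for km in kmers}
    return {km: counts[km] / total for km in kmers}


# Toy contig (assumed sequence), using dinucleotides to keep the output small.
freqs = kmer_frequencies("ACGTACGT", k=2)
print(freqs["AC"], len(freqs))
```

A vector like this, computed per contig, is the kind of composition feature that can be fed to an encoder alongside abundance information.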
The ability to predict local structural features of a protein from the primary sequence is of paramount importance for unraveling its function in the absence of experimental structural information. Two main factors affect the utility of potential prediction tools: their accuracy must enable extraction of reliable structural information on the proteins of interest, and their runtime must be low to keep pace with sequencing data being generated at a constantly increasing speed. Here, we present NetSurfP‐2.0, a novel tool that can predict the most important local structural features with unprecedented accuracy and runtime. NetSurfP‐2.0 is sequence‐based and uses an architecture composed of convolutional and long short‐term memory neural networks trained on solved protein structures. Using a single integrated model, NetSurfP‐2.0 predicts solvent accessibility, secondary structure, structural disorder, and backbone dihedral angles for each residue of the input sequences. We assessed the accuracy of NetSurfP‐2.0 on several independent test datasets and found it to consistently produce state‐of‐the‐art predictions for each of its output features. We observe a correlation of 80% between predictions and experimental data for solvent accessibility, and a precision of 85% on secondary structure 3‐class predictions. In addition to improved accuracy, the processing time has been optimized to allow predicting more than 1000 proteins in less than 2 hours, and complete proteomes in less than 1 day.
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine compared with 60% for the proteins without targeting peptide. The importance of this feature for predictions has not been highlighted before.
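The position-2 statistic quoted above is simple to compute for any set of sequences. The helper below is a hypothetical sketch with made-up toy sequences; it is not the analysis code behind TargetP 2.0 and does not reproduce the reported percentages.

```python
def fraction_with_residue_at_position(sequences, residue="A", position=2):
    """Fraction of sequences carrying `residue` at a 1-based position.

    position=2 is the residue directly after the initial methionine.
    Sequences shorter than `position` are skipped.
    """
    eligible = [s for s in sequences if len(s) >= position]
    if not eligible:
        return 0.0
    hits = sum(1 for s in eligible if s[position - 1].upper() == residue)
    return hits / len(eligible)


# Toy protein N-termini (assumed data, for illustration only).
toy = ["MASTL", "MAAKR", "MKLLV"]
print(fraction_with_residue_at_position(toy))
```

Running the same function on two groups of proteins (for example, with and without a transit peptide) gives the kind of comparison described in the abstract.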
Abstract
Motivation
Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations.
Results
We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq datasets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types.
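For readers unfamiliar with count likelihood functions, the sketch below evaluates two common choices for raw counts, the Poisson and the negative binomial. The abstract does not list which likelihoods were tested, so these two are given only as typical examples; the function names and the mean/dispersion parameterisation are assumptions of this sketch, not the scVAE API.

```python
import math


def poisson_logpmf(k, rate):
    """Log-probability of observing count k under a Poisson with mean `rate`."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def negative_binomial_logpmf(k, mean, dispersion):
    """Log-probability of count k under a negative binomial.

    Parameterised by its mean and a dispersion r > 0, with
    variance = mean + mean**2 / r (extra spread relative to Poisson).
    """
    r = dispersion
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )


# Made-up example: one gene with expected expression 3.0 and observed count 7.
print(poisson_logpmf(7, 3.0))
print(negative_binomial_logpmf(7, 3.0, 2.0))
```

In a VAE for count data, the decoder would output the likelihood parameters per gene and cell, and the summed log-probabilities form the reconstruction term of the training objective.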
Availability and implementation
Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae.
Supplementary information
Supplementary data are available at Bioinformatics online.
The outbreak of SARS-CoV-2 (2019-nCoV) virus has highlighted the need for fast and efficacious vaccine development. Stimulation of a proper immune response that leads to protection is highly dependent on presentation of epitopes to circulating T-cells via the HLA complex. SARS-CoV-2 is a large RNA virus and testing of all of its overlapping peptides in vitro to deconvolute an immune response is not feasible. Therefore HLA-binding prediction tools are often used to narrow down the number of peptides to test. We tested NetMHC suite tools' predictions by using an in vitro peptide-MHC stability assay. We assessed 777 peptides that were predicted to be good binders across 11 MHC alleles in a complex-stability assay and tested a selection of 19 epitope-HLA-binding prediction tools against the assay. In this investigation of potential SARS-CoV-2 epitopes we found that current prediction tools vary in performance when assessing binding stability, and they are highly dependent on the MHC allele in question. Designing a COVID-19 vaccine where only a few epitope targets are included is therefore a very challenging task. Here, we present 174 SARS-CoV-2 epitopes with high prediction binding scores, validated to bind stably to 11 HLA alleles. Our findings may contribute to the design of an efficacious vaccine against COVID-19.
Background: Software systems using artificial intelligence for medical purposes have been developed in recent years. The success of deep neural networks (DNN) in 2012 in the image recognition challenge ImageNet LSVRC 2010 fueled expectations of the potential for using such systems in dermatology.
Objective: To evaluate the ways in which machine learning has been utilized in dermatology to date and provide an overview of the findings in current literature on the subject.
Methods: We conducted a systematic review of existing literature, identifying the literature through a systematic search of the PubMed database. Two doctors assessed screening and eligibility with respect to pre-determined inclusion and exclusion criteria.
Results: A total of 2175 publications were identified, and 64 publications were included. We identified eight major categories where machine learning tools were tested in dermatology. Most systems involved image recognition tools that were primarily aimed at binary classification of malignant melanoma (MM). Short system descriptions and results of all included systems are presented in tables.
Conclusions: We present a complete overview of artificial intelligence implemented in dermatology. Impressive outcomes were reported in all of the identified eight categories, but head-to-head comparison proved difficult. The many areas of dermatology where we identified machine learning tools indicate the diversity of machine learning.