We present a web server to predict the functional effect of single or multiple amino acid substitutions, insertions and deletions using the prediction tool PROVEAN. The server provides rapid analysis of protein variants from any organism, and also supports high-throughput analysis for human and mouse variants at both the genomic and protein levels.
The web server is freely available and open to all users with no login requirements at http://provean.jcvi.org.
achan@jcvi.org
Supplementary data are available at Bioinformatics online.
As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search for causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and nonsense mutations. Frameshifts and nonsense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations, covering not only single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation.
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.
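The alignment-based score described above can be illustrated with a minimal sketch. It compares the global alignment score of a query against a single homolog before and after a variation is introduced. The scoring values (match, mismatch, linear gap penalty), the function names, and the example sequences are illustrative assumptions and are not the parameters or implementation used by PROVEAN.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The match/mismatch/gap values are toy assumptions chosen for clarity.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def delta_score(query, variant, homolog):
    """Change in similarity to the homolog caused by the variation.

    Negative values mean the variant is less similar to the homolog
    than the unmodified query is.
    """
    return (global_alignment_score(variant, homolog)
            - global_alignment_score(query, homolog))


if __name__ == "__main__":
    query = "MKTAYIAK"      # hypothetical query sequence
    homolog = "MKTAYIAK"    # hypothetical homolog (identical here)
    print(delta_score(query, "MKTAYIAR", homolog))  # single substitution
    print(delta_score(query, "MKTYIAK", homolog))   # in-frame deletion
```

Because the score is defined on whole-sequence alignments rather than on a single conserved column, the same function handles substitutions, insertions, and deletions without special cases.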
Humans are a diploid species that inherit one set of chromosomes paternally and one homologous set of chromosomes maternally. Unfortunately, most human sequencing initiatives ignore this fact in that they do not directly delineate the nucleotide content of the maternal and paternal copies of the 23 chromosomes individuals possess (i.e., they do not 'phase' the genome) often because of the costs and complexities of doing so. We compared 11 different widely-used approaches to phasing human genomes using the publicly available 'Genome-In-A-Bottle' (GIAB) phased version of the NA12878 genome as a gold standard. The phasing strategies we compared included laboratory-based assays that prepare DNA in unique ways to facilitate phasing as well as purely computational approaches that seek to reconstruct phase information from general sequencing reads and constructs or population-level haplotype frequency information obtained through a reference panel of haplotypes. To assess the performance of the 11 approaches, we used metrics that included, among others, switch error rates, haplotype block lengths, the proportion of fully phase-resolved genes, phasing accuracy and yield between pairs of SNVs. Our comparisons suggest that a hybrid or combined approach that leverages: 1. population-based phasing using the SHAPEIT software suite, 2. either genome-wide sequencing read data or parental genotypes, and 3. a large reference panel of variant and haplotype frequencies, provides a fast and efficient way to produce highly accurate phase-resolved individual human genomes. We found that for population-based approaches, phasing performance is enhanced with the addition of genome-wide read data; e.g., whole genome shotgun and/or RNA sequencing reads. Further, we found that the inclusion of parental genotype data within a population-based phasing strategy can provide as much as a ten-fold reduction in phasing errors.
We also considered a majority voting scheme for the construction of a consensus haplotype combining multiple predictions for enhanced performance and site coverage. Finally, we also identified DNA sequence signatures associated with the genomic regions harboring phasing switch errors, which included regions of low polymorphism or SNV density.
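Two of the ideas above, the switch error rate and a majority-vote consensus haplotype, can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes every method phases the same ordered set of heterozygous sites in a single block, encodes one haplotype as a list of 0/1 alleles (the other haplotype being its complement), and resolves ties toward 0. All function names and example data are hypothetical.

```python
def switch_error_rate(truth, pred):
    """Fraction of adjacent heterozygous-site pairs whose relative phase
    differs between the predicted and the true haplotype."""
    if len(truth) != len(pred):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0
    switches = sum(
        (truth[i] == truth[i - 1]) != (pred[i] == pred[i - 1])
        for i in range(1, len(truth))
    )
    return switches / (len(truth) - 1)


def flip(haplotype):
    """Return the complementary haplotype (swap the two parental copies)."""
    return [1 - allele for allele in haplotype]


def consensus_haplotype(predictions):
    """Majority vote per site after orienting every prediction to the first.

    Haplotype labels are arbitrary, so a prediction that mostly disagrees
    with the reference orientation is flipped before voting.
    """
    reference = predictions[0]
    oriented = []
    for pred in predictions:
        agree = sum(x == y for x, y in zip(pred, reference))
        oriented.append(pred if 2 * agree >= len(reference) else flip(pred))
    # Assumption: ties (possible with an even number of methods) go to 0.
    return [1 if 2 * sum(site) > len(site) else 0 for site in zip(*oriented)]


if __name__ == "__main__":
    truth = [0, 1, 1, 0, 1, 0]        # hypothetical gold-standard phase
    method_a = [0, 1, 1, 0, 0, 1]     # one switch error
    method_b = flip(truth)            # correct, opposite labeling
    method_c = [0, 1, 0, 0, 1, 0]     # one mis-phased site (two switches)
    consensus = consensus_haplotype([method_a, method_b, method_c])
    print(switch_error_rate(truth, method_a))
    print(switch_error_rate(truth, consensus))
```

Note that a single mis-phased site counts as two switch errors under this definition, which is one reason the metric is usually reported alongside block length and pairwise accuracy.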
Summary
The flowering plant Arabidopsis thaliana is a dicot model organism for research in many aspects of plant biology. A comprehensive annotation of its genome paves the way for understanding the functions and activities of all types of transcripts, including mRNA, the various classes of non‐coding RNA, and small RNA. The TAIR10 annotation update had a profound impact on Arabidopsis research but was released more than 5 years ago. Maintaining the accuracy of the annotation continues to be a prerequisite for future progress. Using an integrative annotation pipeline, we assembled tissue‐specific RNA‐Seq libraries from 113 datasets and constructed 48 359 transcript models of protein‐coding genes in eleven tissues. In addition, we annotated various classes of non‐coding RNA including microRNA, long intergenic RNA, small nucleolar RNA, natural antisense transcript, small nuclear RNA, and small RNA using published datasets and in‐house analytic results. Altogether, we identified 635 novel protein‐coding genes, 508 novel transcribed regions, 5178 non‐coding RNAs, and 35 846 small RNA loci that were formerly unannotated. Analysis of the splicing events and RNA‐Seq based expression profiles revealed the landscapes of gene structures, untranslated regions, and splicing activities to be more intricate than previously appreciated. Furthermore, we present 692 uniformly expressed housekeeping genes, 43% of whose human orthologs are also housekeeping genes. This updated Arabidopsis genome annotation with a substantially increased resolution of gene models will not only further our understanding of the biological processes of this plant model but also of other species.
Significance Statement
The most recent annotation of the Arabidopsis thaliana genome (TAIR10), released more than 5 years ago, had a profound impact on Arabidopsis research. Here we present Araport11, a re‐annotation of the Col‐0 reference genome. We used extensive RNA‐seq data to update and extend structural gene models, thus identifying over 600 novel protein‐coding genes, 500 novel transcribed regions, 5000 non‐coding genes, and 35 000 small RNA loci that formerly eluded annotation.
To better combat the expansion of antibiotic resistance in pathogens, new compounds, particularly those with novel mechanisms-of-action (MOA), represent a major research priority in biomedical science. However, rediscovery of known antibiotics demonstrates a need for approaches that accurately identify potential novelty with higher throughput and reduced labor. Here we describe an explainable artificial intelligence classification methodology that emphasizes prediction performance and human interpretability by using a Hierarchical Ensemble of Classifiers model optimized with a novel feature selection algorithm called Clairvoyance; collectively referred to as a CoHEC model. We evaluated our methods using whole transcriptome responses from Escherichia coli challenged with 41 known antibiotics and 9 crude extracts while depositing 122 transcriptomes unique to this study. Our CoHEC model can properly predict the primary MOA of previously unobserved compounds in both purified forms and crude extracts at an accuracy above 99%, while also correctly identifying darobactin, a newly discovered antibiotic, as having a novel MOA. In addition, we deploy our methods on a recent E. coli transcriptomics dataset from a different strain and a Mycobacterium smegmatis metabolomics time-series dataset, showcasing exceptionally high performance and improving upon the performance metrics of the original publications. We not only provide insight into the biological interpretation of our model but also show that the concept of MOA is a non-discrete heuristic with diverse effects for different compounds within the same MOA, suggesting substantial antibiotic diversity awaiting discovery within existing MOA.
• Water/biodiesel emulsions were prepared from industrial biodiesel.
• Water content and average droplet size were experimentally determined.
• NIR spectra were collected in order to develop PLS and ANN models.
• The PLS model performed better than the ANN model in predicting water content.
• The ANN model achieved better results in predicting the average droplet size.
NIR spectroscopy was used to predict the average droplet size and water content in water/biodiesel emulsions. The emulsions were prepared from industrial biodiesel obtained from soybean oil (85 wt%) and animal fat (15 wt%) by the methylic route. NIR spectra were collected in the transmittance mode with the diffuse reflectance technique. The NIR spectroscopy results indicate that this methodology has the sensitivity to infer the droplet size and water content in the biodiesel emulsions. Two techniques were used to obtain the multivariate models: partial least squares (PLS) and artificial neural network (ANN) models. For the PLS models, the mean errors in external validation were 9.53% for average droplet size and 8.79% for water content; these values are satisfactory because they are close to the experimental standard deviation. The ANN models predicted average droplet size and water content with mean errors of 6.10% and 13.20%, respectively. These errors are close to the analytical error associated with the method used, indicating that NIR spectroscopy is a good alternative for this purpose.
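The percentage errors quoted above can be read as an average relative deviation between predicted and measured values on the external validation set. The sketch below shows one common way to compute such a figure. The abstract does not define its "mean error", so this formula, the function name, and the example numbers are all assumptions made for illustration.

```python
def mean_relative_error_percent(measured, predicted):
    """Mean absolute relative error, in percent.

    Assumed definition: average of |predicted - measured| / |measured|
    over all validation samples, multiplied by 100.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


if __name__ == "__main__":
    # Hypothetical water contents (wt%) for three validation emulsions.
    measured = [10.0, 20.0, 40.0]
    predicted = [11.0, 18.0, 40.0]
    print(round(mean_relative_error_percent(measured, predicted), 2))
```

Computing the same metric for both the PLS and ANN predictions on identical validation samples is what makes the two model families directly comparable.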