Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different ...chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Introns can be extraordinarily large and they account for the majority of the DNA sequence in human genes. However, little is known about their population patterns of structural variation and their ...functional implication. By combining the most extensive maps of CNVs in human populations, we have found that intronic losses are the most frequent copy number variants (CNVs) in protein-coding genes in human, with 12,986 intronic deletions, affecting 4,147 genes (including 1,154 essential genes and 1,638 disease-related genes). This intronic length variation results in dozens of genes showing extreme population variability in size, with 40 genes with 10 or more different sizes and up to 150 allelic sizes. Intronic losses are frequent in evolutionarily ancient genes that are highly conserved at the protein sequence level. This result contrasts with losses overlapping exons, which are observed less often than expected by chance and almost exclusively affect primate-specific genes. An integrated analysis of CNVs and RNA-seq data showed that intronic loss can be associated with significant differences in gene expression levels in the population (CNV-eQTLs). These intronic CNV-eQTLs regions are enriched for intronic enhancers and can be associated with expression differences of other genes showing long distance intron-promoter 3D interactions. Our data suggests that intronic structural variation of protein-coding genes makes an important contribution to the variability of gene expression and splicing in human populations.
Celotno besedilo
Dostopno za:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
The protein structure field is experiencing a revolution. From the increased throughput of techniques to determine experimental structures, to developments such as cryo-EM that allow us to find the ...structures of large protein complexes or, more recently, the development of artificial intelligence tools, such as AlphaFold, that can predict with high accuracy the folding of proteins for which the availability of homology templates is limited. Here we quantify the effect of the recently released AlphaFold database of protein structural models in our knowledge on human proteins. Our results indicate that our current baseline for structural coverage of 48%, considering experimentally-derived or template-based homology models, elevates up to 76% when including AlphaFold predictions. At the same time the fraction of dark proteome is reduced from 26% to just 10% when AlphaFold models are considered. Furthermore, although the coverage of disease-associated genes and mutations was near complete before AlphaFold release (69% of Clinvar pathogenic mutations and 88% of oncogenic mutations), AlphaFold models still provide an additional coverage of 3% to 13% of these critically important sets of biomedical genes and mutations. Finally, we show how the contribution of AlphaFold models to the structural coverage of non-human organisms, including important pathogenic bacteria, is significantly larger than that of the human proteome. Overall, our results show that the sequence-structure gap of human proteins has almost disappeared, an outstanding success of direct consequences for the knowledge on the human genome and the derived medical applications.
Celotno besedilo
Dostopno za:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
The analysis of evolutionary amino acid correlations has recently attracted a surge of renewed interest, also due to their successful use in de novo protein native structure prediction. However, many ...aspects of protein function, such as substrate binding and product release in enzymatic activity, can be fully understood only in terms of an equilibrium ensemble of alternative structures, rather than a single static structure. In this paper we combine coevolutionary data and molecular dynamics simulations to study protein conformational heterogeneity. To that end, we adapt the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develop a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble. By means of exhaustive sampling simulations, we analyze the set of conformations that are consistent with the observed residue correlations for a set of representative protein domains, showing that (i) the most representative structure is consistent with the experimental fold and (ii) the various regions of the sequence display different stability, related to multiple biologically relevant conformations and to the cooperativity of the coevolving pairs. Moreover, we show that the proposed protocol is able to reproduce the essential features of a protein folding mechanism as well as to account for regions involved in conformational transitions through the correct sampling of the involved conformers.
The Ras superfamily is a fascinating example of functional diversification in the context of a preserved structural framework and a prototypic GTP binding site. Thanks to the availability of complete ...genome sequences of species representing important evolutionary branch points, we have analyzed the composition and organization of this superfamily at a greater level than was previously possible. Phylogenetic analysis of gene families at the organism and sequence level revealed complex relationships between the evolution of this protein superfamily sequence and the acquisition of distinct cellular functions. Together with advances in computational methods and structural studies, the sequence information has helped to identify features important for the recognition of molecular partners and the functional specialization of different members of the Ras superfamily.
Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure ...predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.
Co‐evolution has an important function in the evolution of species and it is clearly manifested in certain scenarios such as host–parasite and predator–prey interactions, symbiosis and mutualism. The ...extrapolation of the concepts and methodologies developed for the study of species co‐evolution at the molecular level has prompted the development of a variety of computational methods able to predict protein interactions through the characteristics of co‐evolution. Particularly successful have been those methods that predict interactions at the genomic level based on the detection of pairs of protein families with similar evolutionary histories (similarity of phylogenetic trees: mirrortree). Future advances in this field will require a better understanding of the molecular basis of the co‐evolution of protein families. Thus, it will be important to decipher the molecular mechanisms underlying the similarity observed in phylogenetic trees of interacting proteins, distinguishing direct specific molecular interactions from other general functional constraints. In particular, it will be important to separate the effects of physical interactions within protein complexes (‘co‐adaptation’) from other forces that, in a less specific way, can also create general patterns of co‐evolution.
Assessing functional associations between an experimentally derived gene or protein set of interest and a database of known gene/protein sets is a common task in the analysis of large-scale ...functional genomics data. For this purpose, a frequently used approach is to apply an over-representation-based enrichment analysis. However, this approach has four drawbacks: (i) it can only score functional associations of overlapping gene/proteins sets; (ii) it disregards genes with missing annotations; (iii) it does not take into account the network structure of physical interactions between the gene/protein sets of interest and (iv) tissue-specific gene/protein set associations cannot be recognized.
To address these limitations, we introduce an integrative analysis approach and web-application called EnrichNet. It combines a novel graph-based statistic with an interactive sub-network visualization to accomplish two complementary goals: improving the prioritization of putative functional gene/protein set associations by exploiting information from molecular interaction networks and tissue-specific gene expression data and enabling a direct biological interpretation of the results. By using the approach to analyse sets of genes with known involvement in human diseases, new pathway associations are identified, reflecting a dense sub-network of interactions between their corresponding proteins.
EnrichNet is freely available at http://www.enrichnet.org.
Natalio.Krasnogor@nottingham.ac.uk, reinhard.schneider@uni.lu or avalencia@cnio.es
Supplementary data are available at Bioinformatics Online.
Many common variants have been associated with hematological traits, but identification of causal genes and pathways has proven challenging. We performed a genome-wide association analysis in the UK ...Biobank and INTERVAL studies, testing 29.5 million genetic variants for association with 36 red cell, white cell, and platelet properties in 173,480 European-ancestry participants. This effort yielded hundreds of low frequency (<5%) and rare (<1%) variants with a strong impact on blood cell phenotypes. Our data highlight general properties of the allelic architecture of complex traits, including the proportion of the heritable component of each blood trait explained by the polygenic signal across different genome regulatory domains. Finally, through Mendelian randomization, we provide evidence of shared genetic pathways linking blood cell indices with complex pathologies, including autoimmune diseases, schizophrenia, and coronary heart disease and evidence suggesting previously reported population associations between blood cell indices and cardiovascular disease may be non-causal.
Display omitted
•Genome-wide association study interrogates 36 traits across the hematopoietic system•A total of 2,706 associated variants, including 130 rare and 230 low frequency•Describes allelic spectrum and heritability of coding and regulatory variants•Unravels causal contributions to cardiovascular, immune, and psychiatric disease
As part of the IHEC Consortium, this study probes the allelic architecture and regulatory landscape of cellular complex traits with power to identify causal pathways and links to diseases such as schizophrenia. Explore the Cell Press IHEC web portal at http://www.cell.com/consortium/IHEC.
This letter analyzes two large-scale proteomics studies published in the same issue of Nature. At the time of the release, both studies were portrayed as draft maps of the human proteome and great ...advances in the field. As with the initial publication of the human genome, these papers have broad appeal and will no doubt lead to a great deal of further analysis by the scientific community. However, we were intrigued by the number of protein-coding genes detected by the two studies, numbers that far exceeded what has been reported for the multinational Human Proteome Project effort. We carried out a simple quality test on the data using the olfactory receptor family. A high-quality proteomics experiment that does not specifically analyze nasal tissues should not expect to detect many peptides for olfactory receptors. Neither of the studies carried out experiments on nasal tissues, yet we found peptide evidence for more than 100 olfactory receptors in the two studies. These results suggest that the two studies are substantially overestimating the number of protein coding genes they identify. We conclude that the experimental data from these two studies should be used with caution.