Reactome aims to provide bioinformatics tools for visualisation, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modelling, systems biology and education. Pathway analysis methods have a broad range of applications in physiological and biomedical research; one of the main problems, from the point of view of the performance of these methods, is the constantly increasing size of the data samples.
Here, we present a new high-performance in-memory implementation of the well-established over-representation analysis method. To achieve this, the over-representation analysis method is divided into four steps and, for each of them, specific data structures are used to improve performance and minimise the memory footprint. The first step, finding out whether an identifier in the user's sample corresponds to an entity in Reactome, is addressed using a radix tree as a lookup table. The second step, modelling the proteins, chemicals, their orthologues in other species and their composition in complexes and sets, is addressed with a graph. The third and fourth steps, which aggregate the results and calculate the statistics, are solved with a double-linked tree.
Through the use of highly optimised, in-memory data structures and algorithms, Reactome has achieved a stable, high-performance pathway analysis service, enabling the analysis of genome-wide datasets within seconds and allowing interactive exploration and analysis of high-throughput data. The proposed pathway analysis approach is available on the Reactome production web site either via the AnalysisService for programmatic access or via the user submission interface integrated into the PathwayBrowser. Reactome is an open data and open source project and all of its source code, including the code described here, is available in the AnalysisTools repository in the Reactome GitHub ( https://github.com/reactome/ ).
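The identifier lookup and the statistics step described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the abstract: a plain character trie stands in for the radix tree (a radix tree would additionally merge single-child chains), pathway membership is a flat dictionary rather than a graph plus a double-linked tree, and a hypergeometric tail probability is used as the over-representation statistic because it is a common choice for this kind of analysis. All class and function names are hypothetical and are not the AnalysisTools API.

```python
from math import comb


class TrieNode:
    __slots__ = ("children", "entity")

    def __init__(self):
        self.children = {}
        self.entity = None  # set only where a complete identifier ends


class IdentifierTrie:
    """Plain character trie used as an identifier lookup table (assumed stand-in)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, identifier, entity):
        node = self.root
        for ch in identifier:
            node = node.children.setdefault(ch, TrieNode())
        node.entity = entity

    def lookup(self, identifier):
        """Return the entity for an exact identifier match, or None."""
        node = self.root
        for ch in identifier:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.entity


def hypergeom_pvalue(hits, sample_size, pathway_size, universe_size):
    """P(X >= hits) when sample_size entities are drawn without replacement."""
    upper = min(sample_size, pathway_size)
    total = comb(universe_size, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def over_representation(sample, trie, pathways, universe_size):
    """Map identifiers to entities, then score every pathway.

    pathways: dict of pathway name -> set of entities (toy flat model).
    Returns a dict of pathway name -> (hits, p-value).
    """
    found = {trie.lookup(identifier) for identifier in sample} - {None}
    results = {}
    for name, members in pathways.items():
        hits = len(found & members)
        results[name] = (
            hits,
            hypergeom_pvalue(hits, len(found), len(members), universe_size),
        )
    return results
```

Identifiers that are absent from the trie are simply dropped before scoring, so the sample size used in the statistic is the number of distinct entities actually found.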
Human blood metagenomics has revealed the presence of different types of viruses in apparently healthy subjects. By far, anelloviruses constitute the viral family most frequently found in human blood, although amplification biases and contaminations pose a major challenge in this field. To investigate this further, we subjected pooled plasma samples from 120 healthy donors in Spain to high-speed centrifugation, RNA and DNA extraction, random amplification, and massively parallel sequencing. Our results confirm the extensive presence of anelloviruses in such samples, which represented nearly 97% of the total viral sequence reads obtained. We assembled 114 different viral genomes belonging to this family, revealing remarkable diversity. Phylogenetic analysis of ORF1 suggested 28 potentially novel anellovirus species, 24 of which were validated by Sanger sequencing to rule out artifacts. These findings underscore the importance of implementing more efficient purification procedures that enrich the viral fraction as an essential step in virome studies and question the suggested pathological role of anelloviruses.
Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing.
Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a Python 3 package designed to accompany and guide the researcher during the curation of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources.
Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Owing to its workflow structure and capabilities, it is easy to use and can benefit investigators developing novel omics meta-analyses based on sequencing data.
Ribosomal DNA (rDNA) comprises the genetic loci that encode rRNA in eukaryotes. It is typically arranged as tandem repeats that vary in copy number within the same species. We have recently shown that rDNA repeat copy number in the yeast Saccharomyces cerevisiae is controlled by cell volume via a feedback circuit that senses cell volume by means of the concentration of the free upstream activator factor (UAF). The UAF strongly binds the rDNA gene promoter, but is also able to repress transcription of the SIR2 deacetylase gene, whose product, in turn, represses rDNA amplification. In this way, cells with an rDNA copy number below the optimum evolve to increase that copy number until they reach a number that sequesters free UAF and provokes SIR2 derepression, which, in turn, blocks rDNA amplification. Here we propose a mathematical model to show that this evolutionary process can amplify rDNA repeats independently of the selective advantage of yeast cells having bigger or smaller rDNA copy numbers. We test several variants of this process and show that it can explain the observed experimental results independently of natural selection. These results predict that an autoregulated feedback circuit may, in some instances, drive non-Darwinian deterministic evolution for a limited time period.
Uncertainties in the satellite world lines lead to dominant positioning errors. In the present work, using the approach presented in Puchades and Sáez (Astrophys. Space Sci. 352, 307–320, 2014), a new analysis of these errors is developed inside a great region surrounding Earth. This analysis is performed in the framework of the so-called Relativistic Positioning Systems (RPS). The Schwarzschild metric is used to describe the satellite orbits corresponding to the Galileo Satellites Constellation. Those orbits are circular with the Earth as their centre. They are defined as the nominal orbits. The actual satellite orbits are not circular because of the perturbations they undergo, and to achieve a more realistic description such perturbations need to be taken into account. In Puchades and Sáez (Astrophys. Space Sci. 352, 307–320, 2014) perturbations of the nominal orbits were statistically simulated. Using the formula from Coll et al. (Class. Quantum Gravity 27, 065013, 2010), a user location is determined from the four satellite proper times that the user receives and from the satellite world lines. This formula can be used with any satellite description, although photons need to travel in a Minkowskian space-time. For our purposes, the computation of the photon geodesics in Minkowski space-time is sufficient, as demonstrated in Puchades and Sáez (Adv. Space Res. 57, 499–508, 2016). The difference between the user positions determined with the nominal and the perturbed satellite orbits is computed. This difference is defined as the U-error. Now we compute the perturbed orbits of the satellites considering a metric that takes into account the gravitational effects of the Earth, the Moon and the Sun and also the Earth oblateness. A study of the satellite orbits in this new metric is first introduced. Then we compute the U-errors by comparing the positions given with the Schwarzschild metric and the metric introduced here. A Runge-Kutta method is used to solve the satellite geodesic equations. Some improvements in the computation of the U-errors using both metrics are introduced with respect to our previous works. Conclusions and perspectives are also presented.
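As a rough illustration of the numerical integration step, the sketch below applies the classical fourth-order Runge-Kutta scheme to a circular orbit. It is a deliberately simplified stand-in: it integrates the Newtonian two-body equations of motion in a plane, not the relativistic geodesic equations or the perturbed metric discussed above, and the orbital radius is only an approximate value assumed for a Galileo-like satellite. The function names are hypothetical.

```python
import math

GM_EARTH = 3.986004418e14  # m^3 s^-2, gravitational parameter of the Earth
ORBIT_RADIUS = 2.96e7      # m, approximate Galileo-like orbital radius (assumed)


def derivatives(state):
    """Time derivative of (x, y, vx, vy) under Newtonian point-mass gravity."""
    x, y, vx, vy = state
    r3 = (x * x + y * y) ** 1.5
    return (vx, vy, -GM_EARTH * x / r3, -GM_EARTH * y / r3)


def rk4_step(state, dt):
    """Advance the state by one classical RK4 step of size dt."""

    def shifted(k, factor):
        return tuple(s + factor * ki for s, ki in zip(state, k))

    k1 = derivatives(state)
    k2 = derivatives(shifted(k1, dt / 2))
    k3 = derivatives(shifted(k2, dt / 2))
    k4 = derivatives(shifted(k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate_orbit(state, dt, steps):
    """Apply rk4_step repeatedly and return the final state."""
    for _ in range(steps):
        state = rk4_step(state, dt)
    return state


def circular_state(radius):
    """Initial state on the x axis with the speed of a circular orbit."""
    speed = math.sqrt(GM_EARTH / radius)
    return (radius, 0.0, 0.0, speed)


def orbital_period(radius):
    """Keplerian period of a circular orbit of the given radius."""
    return 2 * math.pi * math.sqrt(radius ** 3 / GM_EARTH)
```

Integrating `circular_state(ORBIT_RADIUS)` over one `orbital_period(ORBIT_RADIUS)` should bring the satellite back close to its starting point, which gives a simple check of the step size before any perturbation is added.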
The human gut holds a special place in the study of different microbial environments due to growing evidence that the gut microbiota is related to host health. However, despite extensive research, there is still a lack of knowledge about the core taxa forming the gut microbiota and, moreover, available information is biased towards western microbiomes in both genome databases and most core taxa studies. To tackle these limitations, we tested a database enrichment strategy and analyzed public datasets of whole-genome shotgun data, generated from 545 fecal samples, comprising three gradients of westernization. The NT database was selected as a baseline of biological diversity, subsequently being combined with various studies of interest related to the human microbiota. This enrichment strategy made it possible to improve classification capacity, compared to the original unenriched database, regarding the various lifestyles and populations studied. The effects of incomplete-taxonomy metagenome-assembled genomes on genome database enrichment were also examined, revealing that, while they are helpful, they should be used with caution depending on the taxonomic level of interest. Moreover, in terms of high prevalence, the core analysis revealed a conserved set of bacterial taxa in the healthy human gut microbiota worldwide, despite apparent lifestyle differences. Such taxa show a set of traits, metabolic roles, and ancestral status, making them suitable candidates for a hypothetical phylogenetic core of mutualistic microorganisms co-evolving with the human species.
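A prevalence-based core analysis of the kind mentioned above can be sketched in a few lines of Python. The abstract does not state the prevalence cut-off or the input format, so the 90% default threshold, the representation of each sample as a set of detected taxa, and the function name are all assumptions made for illustration.

```python
from collections import Counter


def core_taxa(samples, min_prevalence=0.9):
    """Return taxa whose prevalence across samples reaches the threshold.

    samples: list of iterables, each holding the taxa detected in one sample.
    Returns a dict of taxon -> prevalence (fraction of samples containing it).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n_samples = len(samples)
    # set() ensures a taxon is counted at most once per sample
    counts = Counter(taxon for sample in samples for taxon in set(sample))
    return {
        taxon: count / n_samples
        for taxon, count in counts.items()
        if count / n_samples >= min_prevalence
    }
```

For instance, with three toy samples `[{"taxon_a", "taxon_b"}, {"taxon_a"}, {"taxon_a", "taxon_c"}]` and `min_prevalence=1.0`, only `taxon_a` would be retained.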
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
The environmental impact of uncultured phages is shaped by their preferred life cycle (lytic or lysogenic). However, our ability to predict it is very limited. We aimed to discriminate between lytic and lysogenic phages by comparing the similarity of their genomic signatures to those of their hosts, reflecting their co-evolution. We tested two approaches: (1) similarities of tetramer relative frequencies, (2) alignment-free comparisons based on exact k = 14 oligonucleotide matches. First, we explored 5126 reference bacterial host strains and 284 associated phages and found an approximate threshold for distinguishing lysogenic and lytic phages using both oligonucleotide-based methods. The analysis of 6482 plasmids revealed the potential for horizontal gene transfer between different host genera and, in some cases, distant bacterial taxa. Subsequently, we experimentally analyzed combinations of 138 strains and their 41 phages and found that the phages with the largest number of interactions with these strains in the laboratory had the shortest genomic distances to the host. We then applied our methods to 24 single cells from a hot spring biofilm containing 41 uncultured phage-host pairs, and the results were compatible with the lysogenic life cycle of phages detected in this environment. In conclusion, oligonucleotide-based genome analysis methods can be used for predictions of (1) life cycles of environmental phages, (2) phages with the broadest host range in culture collections, and (3) potential horizontal gene transfer by plasmids.
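The two oligonucleotide-based comparisons named above can be sketched as follows. This is an illustrative sketch under stated assumptions: the abstract does not say which similarity measure is applied to the tetramer frequencies, so a Euclidean distance is used here; reverse-complement strands are ignored for brevity; and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetramers over the DNA alphabet, in a fixed order
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(seq):
    """Relative frequency of each tetramer, using overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing symbols other than A, C, G, T are ignored
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        raise ValueError("sequence contains no valid tetramers")
    return [counts[t] / total for t in TETRAMERS]


def euclidean_distance(u, v):
    """Distance between two frequency vectors (assumed similarity measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def shared_kmers(seq_a, seq_b, k=14):
    """Number of k-mer positions in seq_b whose k-mer also occurs in seq_a."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    return sum(
        1 for i in range(len(seq_b) - k + 1) if seq_b[i:i + k] in kmers_a
    )
```

A smaller tetramer distance, or a larger number of exact 14-mer matches, between a phage genome and a candidate host genome would then be read as a closer genomic signature.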
The generation of different types of defective viral genomes (DVG) is an unavoidable consequence of the error-prone replication of RNA viruses. In recent years, a particular class of DVGs, those containing long deletions or genome rearrangements, has gained interest due to their potential therapeutic and biotechnological applications. Identifying such DVGs in high-throughput sequencing (HTS) data has become an interesting computational problem. Several algorithms have been proposed to accomplish this goal, though all incur false positives, a problem of practical interest if such DVGs have to be synthesized and tested in the laboratory. We present a metasearch tool, DVGfinder, that wraps the two most commonly used DVG search algorithms in a single workflow for the identification of the DVGs in HTS data. DVGfinder processes the results of ViReMa-a and DI-tector and uses a gradient boosting classifier machine learning algorithm to reduce the number of false-positive events. The program also generates output files in user-friendly HTML format, which can help users to explore the DVGs identified in the sample. We evaluated the performance of DVGfinder compared to the two search algorithms used separately and found that it slightly improves sensitivities for low-coverage synthetic HTS data and DI-tector precision for high-coverage samples. The metasearch program also showed higher sensitivity on a real sample for which a set of copy-backs were previously validated.
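The metasearch idea, combining the calls of two search algorithms into one table, can be illustrated with the following Python sketch. It does not reproduce DVGfinder's code, its input formats or its gradient boosting classifier. The `Junction` record (break point, re-initiation site and sense), the tool labels and the read counts are hypothetical choices made only for this example.

```python
from typing import Dict, NamedTuple, Set


class Junction(NamedTuple):
    """Hypothetical key identifying one candidate DVG."""
    bp: int      # break point coordinate
    ri: int      # re-initiation coordinate
    sense: str   # strand or sense label


def merge_calls(
    calls_by_tool: Dict[str, Dict[Junction, int]]
) -> Dict[Junction, Dict[str, int]]:
    """Union the junctions reported by every tool, keeping per-tool read counts."""
    merged: Dict[Junction, Dict[str, int]] = {}
    for tool, calls in calls_by_tool.items():
        for junction, reads in calls.items():
            merged.setdefault(junction, {})[tool] = reads
    return merged


def consensus(merged: Dict[Junction, Dict[str, int]], n_tools: int) -> Set[Junction]:
    """Junctions supported by all n_tools tools."""
    return {j for j, support in merged.items() if len(support) == n_tools}
```

In a fuller workflow, the merged table (rather than the strict consensus) would be the natural input to a downstream classifier, since the per-tool read counts can serve as features.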
A major challenge in microbial ecology is to understand the principles and processes by which microbes associate and interact in community assemblages. Microbial communities in mountain glaciers are unique as first colonizers and nutrient enrichment drivers for downstream ecosystems. However, mountain glaciers have been distinctively sensitive to climate perturbations and have suffered a severe retreat over the past 40 years, compelling us to understand glacier ecosystems before their disappearance. This is the first study in an Andean glacier in Ecuador offering insights into the influence of physicochemical variables and altitude on the diversity and structure of bacterial communities. Our study covered extreme Andean altitudes at the Cayambe Volcanic Complex, from 4,783 to 5,583 m above sea level. Glacier soil and ice samples were used as the source for 16S rRNA gene amplicon libraries. We found (1) effects of altitude on diversity and community structure, (2) the presence of few nutrients significantly correlated with community structure, (3) sharp differences between glacier soil and glacier ice in diversity and community structure, where, as quantified by the Shannon γ-diversity distribution, the meta-community in glacier soil showed more diversity than in glacier ice; this pattern was related to the higher variability of the physicochemical distribution of variables in the former substrate, and (4) significantly abundant genera associated with either high or low altitudes that could serve as biomarkers for studies on climate change. Our results provide the first assessment of these unexplored communities, before their potential disappearance due to glacier retreat and climate change.
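The Shannon diversity measure referred to above can be written down in a short Python sketch. The abstract does not describe how the γ-diversity distribution was obtained, so the approach here, pooling the taxon counts of all samples from one substrate and computing a single Shannon index with the natural logarithm, is an assumption, as are the function names.

```python
from math import log


def shannon_index(counts):
    """Shannon entropy H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def gamma_diversity(samples):
    """Shannon index of the pooled meta-community.

    samples: list of count vectors, one per sample, all in the same taxon order.
    """
    pooled = [sum(column) for column in zip(*samples)]
    return shannon_index(pooled)
```

With this definition, two samples that each contain a single but different taxon have zero diversity individually, while the pooled meta-community has a Shannon index of ln 2.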