Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in massively parallel reporter assays, based on the data from the recent “Regulation Saturation” Critical Assessment of Genome Interpretation (CAGI) challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. In particular, training on the sequence segments located next to the validation data results in “information leakage” caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
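The contrast between a leaky and a leakage-free validation scheme can be illustrated with a minimal sketch. It is not the authors' pipeline: the region names, the record layout and the helper functions `random_split`, `leave_region_out_splits` and `shared_regions` are all hypothetical, and the toy variants stand in for real reporter-assay data. A random split scatters neighbouring variants of one regulatory region across training and validation sets, while a leave-region-out split holds out each region entirely.

```python
import random
from collections import defaultdict


def random_split(variants, test_fraction=0.2, seed=0):
    """Shuffle individual variants, so one region can end up on both sides."""
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leave_region_out_splits(variants):
    """Yield (held_out_region, train, test), holding out one whole region at a time."""
    by_region = defaultdict(list)
    for v in variants:
        by_region[v["region"]].append(v)
    for held_out in sorted(by_region):
        train = [v for r, vs in by_region.items() if r != held_out for v in vs]
        yield held_out, train, by_region[held_out]


def shared_regions(train, test):
    """Regions that contribute variants to both the training and the test set."""
    return {v["region"] for v in train} & {v["region"] for v in test}


# Toy data: hypothetical region names and positions, not the CAGI data set.
variants = [
    {"region": region, "pos": pos}
    for region in ("enhancer_A", "promoter_B", "promoter_C")
    for pos in range(10)
]

rand_train, rand_test = random_split(variants)
print("random split shares regions:", sorted(shared_regions(rand_train, rand_test)))
for held_out, train, test in leave_region_out_splits(variants):
    print(held_out, "shares regions:", sorted(shared_regions(train, test)))
```

In the leave-region-out scheme the set of shared regions is empty by construction, which is the property the stricter validation scenarios described above rely on.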
The gray whale, Eschrichtius robustus (E. robustus), is the sole member of the family Eschrichtiidae, which is considered to be the most primitive in the class Cetacea. The gray whale is often described as a "living fossil". It is adapted to extreme marine conditions and has a high life expectancy (77 years). The assembly of a gray whale genome and transcriptome will enable further studies of whale evolution, longevity, and resistance to extreme environments.
In this work, we report the first de novo assembly and primary analysis of the E. robustus genome and transcriptome based on kidney and liver samples. The presented draft genome assembly covers 55% of the total genome length but only 24% of the BUSCO complete gene groups, although 10,895 genes were identified. Transcriptome annotation and comparison with other whale species revealed robust expression of DNA repair and hypoxia-response genes, which is expected for whales.
This preliminary study of the gray whale genome and transcriptome provides new data for better understanding whale evolution and the mechanisms of whale adaptation to hypoxic conditions.
Abstract
The Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB, https://veupathdb.org) represents the 2019 merger of VectorBase with the EuPathDB projects. As a Bioinformatics Resource Center funded by the National Institutes of Health, with additional support from the Wellcome Trust, VEuPathDB supports >500 organisms comprising invertebrate vectors, eukaryotic pathogens (protists and fungi) and relevant free-living or non-pathogenic species or hosts. Designed to empower researchers with access to Omics data and bioinformatic analyses, VEuPathDB projects integrate >1700 pre-analysed datasets (and associated metadata) with advanced search capabilities, visualizations, and analysis tools in a graphic interface. Diverse data types are analysed with standardized workflows including an in-house OrthoMCL algorithm for predicting orthology. Comparisons are easily made across datasets, data types and organisms in this unique data mining platform. A new site-wide search facilitates access for both experienced and novice users. Upgraded infrastructure and workflows support numerous updates to the web interface, tools, searches and strategies, and the Galaxy workspace where users can privately analyse their own data. Forthcoming upgrades include cloud-ready application architecture, expanded support for the Galaxy workspace, tools for interrogating host-pathogen interactions, improved interactions with affiliated databases (ClinEpiDB, MicrobiomeDB) and other scientific resources, and increased interoperability with the Bacterial & Viral BRC.
Abstract
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life.
Abstract
Ensembl Genomes (https://www.ensemblgenomes.org) provides access to non-vertebrate genomes and analysis complementing vertebrate resources developed by the Ensembl project (https://www.ensembl.org). The two resources collectively present genome annotation through a consistent set of interfaces spanning the tree of life, covering genome sequence, annotation, variation, transcriptomic data and comparative analysis. Here, we present our largest increase in plant, metazoan and fungal genomes since the project's inception, creating one of the world's most comprehensive genomic resources, and describe our efforts to reduce genome redundancy in our Bacteria portal. We detail our new efforts in gene annotation, our emerging support for pangenome analysis, our efforts to accelerate data dissemination through the Ensembl Rapid Release resource and our new AlphaFold visualization. Finally, we present details of our future plans including updates on our integration with Ensembl, and how we plan to improve our support for the microbial research community. Software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license). Data updates are synchronised with Ensembl's release cycle.
Target binding by CRISPR-Cas ribonucleoprotein effectors is initiated by the recognition of double-stranded PAM motifs by the Cas protein moiety followed by destabilization, localized melting, and interrogation of the target by the guide part of the CRISPR RNA moiety. The latter process depends on seed sequences, parts of the target that must be strictly complementary to the CRISPR RNA guide. Mismatches between the target and the CRISPR RNA guide outside the seed have minor effects on target binding, thus contributing to off-target activity of CRISPR-Cas effectors. Here, we define the seed sequence of the Type V Cas12b effector from Bacillus thermoamylovorans. While the Cas12b seed is just five bases long, in contrast to all other effectors characterized to date, the nucleotide base at the site of target cleavage makes a very strong contribution to target binding. The generality of this additional requirement was confirmed during analysis of target recognition by the Cas12b effector from Alicyclobacillus acidoterrestris. Thus, while the short seed may contribute to Cas12b promiscuity, the additional specificity determinant at the site of cleavage may have a compensatory effect, making Cas12b suitable for specialized genome editing applications.
Natural diversity of CRISPR spacers of Thermus
Lopatina, Anna; Medvedeva, Sofia; Artamonova, Daria ...
Philosophical transactions of the Royal Society of London. Series B. Biological sciences, 05/2019, Volume 374, Issue 1772
Journal Article
Peer reviewed
We investigated the diversity of CRISPR spacers of Thermus communities from two locations in Italy, two in Chile and one location in Russia. Among the five sampling sites, a total of more than 7200 unique spacers belonging to different CRISPR-Cas system types and subtypes were identified. Most of these spacers are not found in CRISPR arrays of sequenced Thermus strains. Comparison of spacer sets revealed that samples within the same area (separated by a few to hundreds of metres) have similar spacer sets, which appear to be largely stable at least over the course of several years. At greater distances (hundreds of kilometres and more) the similarity of spacer sets decreases, yet there are still multiple common spacers in Thermus communities from different continents. The common spacers can be reconstructed in identical or similar CRISPR arrays, which rules out their independent appearance and suggests an extensive migration of thermophilic bacteria over long distances. Several new Thermus phages were isolated at the sampling sites. Mapping of spacers to bacteriophage sequences revealed examples of local acquisition of spacers from some phages and distinct patterns of targeting of phage genomes by different CRISPR-Cas systems.
This article is part of a discussion meeting issue 'The ecology and evolution of prokaryotic CRISPR-Cas adaptive immune systems'.
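The comparison of spacer sets between sampling sites can be illustrated with a small sketch. The abstract does not state which similarity measure was used, so the Jaccard index below is an assumption chosen for illustration, and the site names and spacer sequences are made-up placeholders rather than data from the study.

```python
def jaccard(a, b):
    """Jaccard index: shared items divided by all distinct items in both sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical spacer sets keyed by hypothetical site names.
spacers_by_site = {
    "site_1": {"ACGTACGT", "TTGACCAA", "GGCATCGA", "CCATGGTA"},
    "site_2": {"ACGTACGT", "TTGACCAA", "GGCATCGA", "AATTCCGG"},
    "site_3": {"ACGTACGT", "TGCATGCA", "CAGTCAGT", "GATCGATC"},
}

sites = sorted(spacers_by_site)
for i, first in enumerate(sites):
    for second in sites[i + 1:]:
        score = jaccard(spacers_by_site[first], spacers_by_site[second])
        print(f"{first} vs {second}: {score:.2f}")
```

In this toy example two sites share most of their spacers while the third shares only one, mirroring the pattern of high local similarity and a small set of common spacers across distant locations.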
Summary
CRISPR interference occurs when a protospacer recognized by the CRISPR RNA is destroyed by Cas effectors. In Type I CRISPR‐Cas systems, protospacer recognition can lead to «primed adaptation», the acquisition of new spacers from sequences located in cis. Type I CRISPR‐Cas systems require the presence of a trinucleotide protospacer adjacent motif (PAM) for efficient interference. Here, we investigated the ability of each of 64 possible trinucleotides located at the PAM position to induce CRISPR interference and primed adaptation by the Escherichia coli Type I‐E CRISPR‐Cas system. We observed clear separation of PAM variants into three groups: those unable to cause interference, those that support rapid interference and those that lead to reduced interference that occurs over extended periods of time. PAM variants unable to support interference also did not support primed adaptation; those that supported rapid interference led to no or low levels of adaptation, while those that caused attenuated levels of interference consistently led to the highest levels of adaptation. The results suggest that primed adaptation is fueled by the products of CRISPR interference. Interference extended over time with targets containing «attenuated» PAM variants provides a continuous source of new spacers, leading to a high overall level of spacer acquisition.
All 64 possible PAM trinucleotides were tested. Thirty-six trinucleotides were completely unable to support interference. The remaining 28 PAM trinucleotides fell into two groups: those that supported fast interference and those that supported intermediate, delayed interference that was apparently countered by plasmid copy number maintenance mechanisms over extended periods of time. PAM variants that did not lead to CRISPR interference also did not support primed adaptation. PAM variants supporting intermediate‐rate interference caused strong priming.
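The counts quoted above follow from simple combinatorics: four bases at each of three PAM positions give 4³ = 64 trinucleotides, of which 36 did not support interference and 28 did. A minimal sketch that enumerates the variants and checks the arithmetic is given below; the variable names are illustrative only, and the code does not reproduce which specific trinucleotides fell into which group.

```python
from itertools import product

BASES = "ACGT"

# Every trinucleotide that can occupy the three PAM positions.
pam_variants = ["".join(triplet) for triplet in product(BASES, repeat=3)]

n_total = len(pam_variants)
n_no_interference = 36  # count stated in the text
n_interfering = n_total - n_no_interference

print(n_total)           # 64
print(n_interfering)     # 28
print(pam_variants[:4])  # ['AAA', 'AAC', 'AAG', 'AAT']
```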
Abstract
The Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB, https://veupathdb.org) is a Bioinformatics Resource Center funded by the National Institutes of Health with additional funding from the Wellcome Trust. VEuPathDB supports >600 organisms that comprise invertebrate vectors, eukaryotic pathogens (protists and fungi) and relevant free-living or non-pathogenic species or hosts. Since 2004, VEuPathDB has analyzed omics data from the public domain using contemporary bioinformatic workflows, including orthology predictions via OrthoMCL, and integrated the analysis results with analysis tools, visualizations, and advanced search capabilities. The unique data mining platform coupled with >3000 pre-analyzed data sets facilitates the exploration of pertinent omics data in support of hypothesis driven research. Comparisons are easily made across data sets, data types and organisms. A Galaxy workspace offers the opportunity for the analysis of private large-scale datasets and for porting to VEuPathDB for comparisons with integrated data. The MapVEu tool provides a platform for exploration of spatially resolved data such as vector surveillance and insecticide resistance monitoring. To address the growing body of omics data and advances in laboratory techniques, VEuPathDB has added several new data types, searches and features, improved the Galaxy workspace environment, redesigned the MapVEu interface and updated the infrastructure to accommodate these changes.