Ensembl 2021 Howe, Kevin L; Achuthan, Premanand; Allen, James ...
Nucleic acids research,
01/2021, Letnik:
49, Številka:
D1
Journal Article
Recenzirano
Odprti dostop
Abstract
The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, ...regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. Our SARS-CoV-2 genome browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.
We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short ...words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
The Velvet de novo assembler was designed to build contigs and eventually scaffolds from short-read sequencing data. This protocol describes how to use Velvet, interpret its output, and tune its ...parameters for optimal results. It also covers practical issues such as configuration, using the VelvetOptimiser routine, and processing colorspace data.
MOTIVATION: High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq ...to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values. RESULTS: We present a software package named Oases designed to heuristically assemble RNA-seq reads in the absence of a reference genome, across a broad spectrum of expression values and in presence of alternative isoforms. It achieves this by using an array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events and the efficient merging of multiple assemblies. It was tested on human and mouse RNA-seq data and is shown to improve significantly on the transABySS and Trinity de novo transcriptome assemblers. Availability and implementation: Oases is freely available under the GPL license at www.ebi.ac.uk/~zerbino/oases/ CONTACT: dzerbino@ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Abstract
The Ensembl (https://www.ensembl.org) is a system for generating and distributing genome annotation such as genes, variation, regulation and comparative genomics across the vertebrate ...subphylum and key model organisms. The Ensembl annotation pipeline is capable of integrating experimental and reference data from multiple providers into a single integrated resource. Here, we present 94 newly annotated and re-annotated genomes, bringing the total number of genomes offered by Ensembl to 227. This represents the single largest expansion of the resource since its inception. We also detail our continued efforts to improve human annotation, developments in our epigenome analysis and display, a new tool for imputing causal genes from genome-wide association studies and visualisation of variation within a 3D protein model. Finally, we present information on our new website. Both software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license) and data updates made available four times a year.
Many gene expression quantitative trait locus (eQTL) studies have published their summary statistics, which can be used to gain insight into complex human traits by downstream analyses, such as fine ...mapping and co-localization. However, technical differences between these datasets are a barrier to their widespread use. Consequently, target genes for most genome-wide association study (GWAS) signals have still not been identified. In the present study, we present the eQTL Catalogue ( https://www.ebi.ac.uk/eqtl ), a resource of quality-controlled, uniformly re-computed gene expression and splicing QTLs from 21 studies. We find that, for matching cell types and tissues, the eQTL effect sizes are highly reproducible between studies. Although most QTLs were shared between most bulk tissues, we identified a greater diversity of cell-type-specific QTLs from purified cell types, a subset of which also manifested as new disease co-localizations. Our summary statistics are freely available to enable the systematic interpretation of human GWAS associations across many cell types and tissues.
Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are ...an obstacle to large assemblies.
We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly.
These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
The ensembl regulatory build Zerbino, Daniel R; Wilder, Steven P; Johnson, Nathan ...
Genome Biology,
03/2015, Letnik:
16, Številka:
1
Journal Article
Recenzirano
Odprti dostop
Most genomic variants associated with phenotypic traits or disease do not fall within gene coding regions, but in regulatory regions, rendering their interpretation difficult. We collected public ...data on epigenetic marks and transcription factor binding in human cell types and used it to construct an intuitive summary of regulatory regions in the human genome. We verified it against independent assays for sensitivity. The Ensembl Regulatory Build will be progressively enriched when more data is made available. It is freely available on the Ensembl browser, from the Ensembl Regulation MySQL database server and in a dedicated track hub.
Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above ...the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.
Abstract
The Ensembl project has been aggregating, processing, integrating and redistributing genomic datasets since the initial releases of the draft human genome, with the aim of accelerating ...genomics research through rapid open distribution of public data. Large amounts of raw data are thus transformed into knowledge, which is made available via a multitude of channels, in particular our browser (http://www.ensembl.org). Over time, we have expanded in multiple directions. First, our resources describe multiple fields of genomics, in particular gene annotation, comparative genomics, genetics and epigenomics. Second, we cover a growing number of genome assemblies; Ensembl Release 90 contains exactly 100. Third, our databases feed simultaneously into an array of services designed around different use cases, ranging from quick browsing to genome-wide bioinformatic analysis. We present here the latest developments of the Ensembl project, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving our browser.