A draft human pangenome reference
Liao, Wen-Wei; Asri, Mobin; Ebler, Jana ...
Nature (London), 05/2023, Volume 617, Issue 7960
Journal Article. Peer reviewed. Open access.
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish.
Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences ('polishing') with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy.
Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT's Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.
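The distinction the abstract draws between per-read accuracy and majority-rule consensus accuracy can be illustrated with a minimal sketch. It assumes the reads are already aligned into equal-length rows with `-` marking gaps; the function names `majority_consensus` and `identity` are hypothetical and are not taken from any basecaller or from Nanopolish.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Return the most common symbol in each alignment column.

    Assumes every read is pre-aligned to the same length (an assumption of
    this sketch). Columns whose majority symbol is a gap are dropped.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


def identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 1.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Toy data: each read carries one random error at a different position.
truth = "ACGTAC"
reads = ["ACGAAC", "ACGTAC", "TCGTAC"]
consensus = majority_consensus(reads)
```

In this toy case a read with one error has an identity of 5/6 to the truth, while the consensus recovers the truth because each random error is outvoted. Systematic errors that recur in most reads at the same position would not be outvoted, which is consistent with the abstract's remark about errors in methylation motifs.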
We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
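The "~10% more DNA" figure can be checked with a short calculation. The contig totals come from the abstract; the reference length of about 3.1 billion bp for GRCh38 is an assumption of this sketch, since the abstract does not state the denominator it used.

```python
# Figures quoted in the abstract.
NOVEL_BP = 296_485_284
NOVEL_CONTIGS = 125_715

# Assumption: approximate total length of GRCh38, not stated in the abstract.
ASSUMED_REFERENCE_BP = 3_100_000_000

# Novel sequence as a fraction of the assumed reference length.
extra_fraction = NOVEL_BP / ASSUMED_REFERENCE_BP

# Average length of a novel contig in base pairs.
mean_contig_bp = NOVEL_BP / NOVEL_CONTIGS

print(f"novel sequence relative to reference: {extra_fraction:.1%}")
print(f"mean novel contig length: {mean_contig_bp:.0f} bp")
```

Under that assumption the novel sequence comes to roughly 9.6% of the reference, which rounds to the ~10% quoted, and the average novel contig is about 2,358 bp long.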
Human coronaviruses (HCoVs), including severe acute respiratory syndrome coronavirus (SARS-CoV) and 2019 novel coronavirus (2019-nCoV, also known as SARS-CoV-2), have led to global epidemics with high morbidity and mortality. However, there are currently no effective drugs targeting 2019-nCoV/SARS-CoV-2. Drug repurposing, an effective drug discovery strategy that starts from existing drugs, could shorten the time and reduce the cost compared to de novo drug discovery. In this study, we present an integrative, antiviral drug repurposing methodology implementing a systems pharmacology-based network medicine platform, quantifying the interplay between the HCoV-host interactome and drug targets in the human protein-protein interaction network. Phylogenetic analyses of 15 HCoV whole genomes reveal that 2019-nCoV/SARS-CoV-2 shares the highest nucleotide sequence identity with SARS-CoV (79.7%). Specifically, the envelope and nucleocapsid proteins of 2019-nCoV/SARS-CoV-2 are two evolutionarily conserved regions, with sequence identities of 96% and 89.6%, respectively, compared to SARS-CoV. Using network proximity analyses of drug targets and HCoV-host interactions in the human interactome, we prioritize 16 potential anti-HCoV repurposable drugs (e.g., melatonin, mercaptopurine, and sirolimus) that are further validated by enrichment analyses of drug-gene signatures and HCoV-induced transcriptomics data in human cell lines. We further identify three potential drug combinations (e.g., sirolimus plus dactinomycin, mercaptopurine plus melatonin, and toremifene plus emodin) captured by a pattern in which the targets of the drugs both hit the HCoV-host subnetwork, but target separate neighborhoods in the human interactome network. In summary, this study offers powerful network-based methodologies for rapid identification of candidate repurposable drugs and potential drug combinations targeting 2019-nCoV/SARS-CoV-2.
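The idea of network proximity between drug targets and virus-host proteins can be sketched as follows. The abstract does not say which proximity measure the study uses, so this sketch assumes a simple "closest distance" score: for each drug target, the shortest-path length to the nearest disease protein, averaged over targets. The toy graph, the node names and the helper functions are all hypothetical, and any normalisation against random expectation is omitted.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closest_distance(graph, drug_targets, disease_proteins):
    """Average, over drug targets, of the distance to the nearest disease protein."""
    if not drug_targets:
        raise ValueError("drug_targets must not be empty")
    total = 0
    for target in drug_targets:
        dist = bfs_distances(graph, target)
        reachable = [dist[p] for p in disease_proteins if p in dist]
        if not reachable:
            raise ValueError(f"no disease protein reachable from {target!r}")
        total += min(reachable)
    return total / len(drug_targets)


# Hypothetical toy interactome: a chain P1 - P2 - P3 - P4 - P5.
toy = build_graph([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")])
disease = {"P3"}                                   # hypothetical virus-host protein
near = closest_distance(toy, {"P2", "P4"}, disease)
far = closest_distance(toy, {"P1", "P5"}, disease)
```

A lower score means the drug's targets sit closer to the disease proteins in the network. In the toy chain, the drug targeting P2 and P4 scores 1.0 and the drug targeting P1 and P5 scores 2.0, so the first would rank as the more proximal candidate under this assumed measure.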
Recent advances in viral metagenomics have enabled the rapid discovery of an unprecedented catalogue of phages in numerous environments, from the human gut to the deep ocean. Although these advances have expanded our understanding of phage genomic diversity, they also revealed that we have only scratched the surface in the discovery of novel viruses. Yet, despite the remarkable diversity of phages at the nucleotide sequence level, the structural proteins that form viral particles show strong similarities and conservation. Phages are uniquely interconnected from an evolutionary perspective and undergo multiple events of genetic exchange in response to the selective pressure of their hosts, which drives their diversity. In this Review, we explore phage diversity at the structural, genomic and community levels as well as the complex evolutionary relationships between phages, moulded by the mosaicity of their genomes.
Pan-genomics in the human genome era
Sherman, Rachel M; Salzberg, Steven L
Nature reviews. Genetics, 04/2020, Volume 21, Issue 4
Journal Article. Peer reviewed. Open access.
Since the early days of the genome era, the scientific community has relied on a single 'reference' genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.
We are in a phase of unprecedented progress in identifying genetic loci that cause variation in traits ranging from growth and fitness in simple organisms to disease in humans. However, a mechanistic understanding of how these loci influence traits is lacking for the majority of loci. Studies of the genetics of gene expression have emerged as a key tool for linking DNA sequence variation to phenotypes. Here, we review recent insights into the molecular nature of regulatory variants and describe their influence on the transcriptome and the proteome. We discuss conceptual advances from studies in model organisms and present examples of complete chains of causality that link individual polymorphisms to changes in gene expression, which in turn result in physiological changes and, ultimately, disease risk.
We present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a Metagenome-Assembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Gene Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.
Tea, one of the world’s most important beverage crops, provides numerous secondary metabolites that account for its rich taste and health benefits. Here we present a high-quality sequence of the genome of tea, Camellia sinensis var. sinensis (CSS), using both Illumina and PacBio sequencing technologies. At least 64% of the 3.1-Gb genome assembly consists of repetitive sequences, and the rest yields 33,932 high-confidence predictions of encoded proteins. Divergence between two major lineages, CSS and Camellia sinensis var. assamica (CSA), is estimated at ∼0.38 to 1.54 million years ago (Mya). Analysis of genic collinearity reveals that the tea genome is the product of two rounds of whole-genome duplications (WGDs) that occurred ∼30 to 40 and ∼90 to 100 Mya. We provide evidence that these WGD events, and subsequent paralogous duplications, had major impacts on the copy numbers of secondary metabolite genes, particularly genes critical to producing three key quality compounds: catechins, theanine, and caffeine. Analyses of transcriptome and phytochemistry data show that amplification and transcriptional divergence of genes encoding a large acyltransferase family and leucoanthocyanidin reductases are associated with the characteristic young leaf accumulation of monomeric galloylated catechins in tea, while functional divergence of a single member of the glutamine synthetase gene family yielded theanine synthetase. This genome sequence will facilitate understanding of tea genome evolution and tea metabolite pathways, and will promote germplasm utilization for breeding improved tea varieties.