We are entering into an exciting era of genomics where truly complete, high-quality assemblies of human chromosomes are available end-to-end, or from ‘telomere-to-telomere’ (T2T). This technological ...advance offers a new opportunity to include endogenous human centromeric regions in high-resolution, sequence-based studies. These emerging reference maps are expected to reveal a new functional landscape in the human genome, where centromere proteins, transcriptional regulation, and spatial organization can be examined with base-level resolution across different stages of development and disease. Such studies will depend on innovative assembly methods of extremely long tandem repeats (ETRs), or satellite DNAs, paired with the development of new, orthogonal validation methods to ensure accuracy and completeness. This review reflects the progress in centromere genomics, credited by recent advancements in long-read sequencing and assembly methods. In doing so, I will discuss the challenges that remain and the promise for a new period of scientific discovery for satellite DNA biology and centromere function.
Highlighting the Distinguished Speakers Symposium on “The Future of Human Genetics and Genomics,” this collection of articles is based on presentations at the ASHG 2023 Annual Meeting in Washington, ...DC, in celebration of all our field has accomplished in the past 75 years, since the founding of ASHG in 1948.
The central goal of medical genomics is to understand the inherited basis of sequence variation that underlies human physiology, evolution, and disease. Functional association studies currently ...ignore millions of bases that span each centromeric region and acrocentric short arm. These regions are enriched in long arrays of tandem repeats, or satellite DNAs, that are known to vary extensively in copy number and repeat structure in the human population. Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions. Nevertheless, recently uncovered associations between satellite DNA variation and human disease support that these regions present a substantial and biologically important fraction of human sequence variation. Therefore, there is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics. Here I discuss the current knowledge of satellite DNA variation in the human genome, focusing on centromeric satellites and their potential implications for disease.
The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and ...brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome.
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy ...long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
Genomic studies rely on accurate chromosome assemblies to explore sequence-based models of cell biology, evolution and biomedical disease. However, even the extensively studied human genome has not ...yet reached a complete, ‘telomere-to-telomere’, chromosome assembly. The largest assembly gaps remain in centromeric regions and acrocentric short arms, sites known to contain megabase-sized arrays of tandem repeats, or satellite DNAs. This review aims to briefly address the progress and challenges of generating correct assemblies of satellite DNA arrays. Although the focus is placed on the human genome, many concepts presented here are applicable to other genomes.
The human genome reference sequence remains incomplete owing to the challenge of assembling long tracts of near-identical tandem repeats in centromeres. We implemented a nanopore sequencing strategy ...to generate high-quality reads that span hundreds of kilobases of highly repetitive DNA in a human Y chromosome centromere. Combining these data with short-read variant validation, we assembled and characterized the centromeric region of a human Y chromosome.
Since the human genome was published in 2001, many of the gaps in the original sequence have been filled in, offering a more detailed understanding of genome regulation, structure and function.
Advances in long-read sequencing technologies have broadened our understanding of genetic variation in the human population, uncovered new complex structural variants and offered an opportunity to ...elucidate new variant associations with disease.
Speed, single-base sensitivity and long read lengths make nanopores a promising technology for high-throughput sequencing. We evaluated and optimized the performance of the MinION nanopore sequencer ...using M13 genomic DNA and used expectation maximization to obtain robust maximum-likelihood estimates for insertion, deletion and substitution error rates (4.9%, 7.8% and 5.1%, respectively). Over 99% of high-quality 2D MinION reads mapped to the reference at a mean identity of 85%. We present a single-nucleotide-variant detection tool that uses maximum-likelihood parameter estimates and marginalization over many possible read alignments to achieve precision and recall of up to 99%. By pairing our high-confidence alignment strategy with long MinION reads, we resolved the copy number for a cancer-testis gene family (CT47) within an unresolved region of human chromosome Xq24.