The complete sequence of a human genome
Nurk, Sergey; Koren, Sergey; Rhie, Arang ...
Science (American Association for the Advancement of Science), 04/2022, Volume 376, Issue 6588
Journal Article
Peer reviewed
Open access
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Nephropathic cystinosis is an autosomal recessive disorder caused by the defective transport of cystine out of lysosomes. Recently, the causative gene (CTNS) was identified and presumed to encode an integral membrane protein called cystinosin. Many of the disease-associated mutations in CTNS are deletions, including one >55 kb in size that represents the most common cystinosis allele encountered to date. In an effort to determine the precise genomic organization of CTNS and to gain sequence-based insight about the DNA within and flanking cystinosis-associated deletions, we mapped and sequenced the region of human chromosome 17p13 encompassing CTNS. Specifically, a bacterial artificial chromosome (BAC)-based physical map spanning CTNS was constructed by sequence-tagged site (STS)-content mapping. The resulting BAC contig provided the relative order of 43 STSs. Two overlapping BACs, which together contain all of the CTNS exons as well as extensive amounts of flanking DNA, were selected and subjected to shotgun sequencing. A total of 200,237 bp of contiguous, high-accuracy sequence was generated. Analysis of the resulting data revealed a number of interesting features about this genomic region, including the long-range organization of CTNS, insight about the breakpoints and intervening DNA associated with the common cystinosis-causing deletion, and structural information about five genes neighboring CTNS (human ortholog of rat vanilloid receptor subtype 1 gene, CARKL, TIP-1, P2X5, and HUMINAE). In particular, sequence analysis detected the presence of a novel gene (CARKL) residing within the most common cystinosis-causing deletion. This gene encodes a previously unknown protein that is predicted to function as a carbohydrate kinase. Interestingly, both CTNS and CARKL are absent in nearly half of all cystinosis patients (i.e., those homozygous for the common deletion).
The sequence data described in this paper have been submitted to the GenBank data library under accession nos. AF168787 and AF163573.
One major challenge encountered with interpreting human genetic variants is the limited understanding of the functional impact of genetic alterations on biological processes. Furthermore, there remains an unmet demand for an efficient survey of the wealth of information on human homologs in model organisms across numerous databases. To efficiently assess the large volume of publicly available information, it is important to provide a concise summary of the most relevant information in a rapid user-friendly format. To this end, we created MARRVEL (model organism aggregated resources for rare variant exploration). MARRVEL is a publicly available website that integrates information from six human genetic databases and seven model organism databases. For any given variant or gene, MARRVEL displays information from OMIM, ExAC, ClinVar, Geno2MP, DGV, and DECIPHER. Importantly, it curates model organism-specific databases to concurrently display a concise summary regarding the human gene homologs in budding and fission yeast, worm, fly, fish, mouse, and rat on a single webpage. Experiment-based information on tissue expression, protein subcellular localization, biological process, and molecular function for the human gene and homologs in the seven model organisms is arranged into a concise output. Hence, rather than visiting multiple separate databases for variant and gene analysis, users can obtain important information by searching once through MARRVEL. Altogether, MARRVEL dramatically improves the efficiency and accessibility of data collection and supports analysis of human genes and variants through cross-disciplinary integration of 18 million records available in public databases, facilitating clinical diagnosis and basic research.
Diagnosis at the edges of our knowledge calls upon clinicians to be data driven, cross-disciplinary, and collaborative in unprecedented ways. Exact disease recognition, an element of the concept of precision in medicine, requires new infrastructure that spans geography, institutional boundaries, and the divide between clinical care and research. The National Institutes of Health (NIH) Common Fund supports the Undiagnosed Diseases Network (UDN) as an exemplar of this model of precise diagnosis. Its goals are to forge a strategy to accelerate the diagnosis of rare or previously unrecognized diseases, to improve recommendations for clinical management, and to advance research, especially into disease mechanisms. The network will achieve these objectives by evaluating patients with undiagnosed diseases, fostering a breadth of expert collaborations, determining best practices for translating the strategy into medical centers nationwide, and sharing findings, data, specimens, and approaches with the scientific and medical communities. Building the UDN has already brought insights to human and medical geneticists. The initial focus has been on establishing common protocols for institutional review boards and data sharing, creating protocols for referring and evaluating patients, and providing DNA sequencing, metabolomic analysis, and functional studies in model organisms. By extending this precision diagnostic model nationally, we strive to meld clinical and research objectives, improve patient outcomes, and contribute to medical science.
Efforts to identify the genetic underpinnings of rare undiagnosed diseases increasingly involve the use of next-generation sequencing and comparative genomic hybridization methods. These efforts are limited by a lack of knowledge regarding gene function, and an inability to predict the impact of genetic variation on the encoded protein function. Diagnostic challenges posed by undiagnosed diseases have solutions in model organism research, which provides a wealth of detailed biological information. Model organism geneticists are by necessity experts in particular genes, gene families, specific organs, and biological functions. Here, we review the current state of research into undiagnosed diseases, highlighting large efforts in North America and internationally, including the Undiagnosed Diseases Network (UDN) (Supplemental Material, File S1) and UDN International (UDNI), the Centers for Mendelian Genomics (CMG), and the Canadian Rare Diseases Models and Mechanisms Network (RDMM). We discuss how merging human genetics with model organism research guides experimental studies to solve these medical mysteries, gain new insights into disease pathogenesis, and uncover new therapeutic strategies.
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
Comparison is a fundamental tool for analyzing DNA sequence. Interspecies sequence comparison is particularly powerful for inferring genome function and is based on the simple premise that conserved sequences are likely to be important. Thus, the comparison of a genomic sequence with its orthologous counterpart from another species is increasingly becoming an integral component of genome analysis. In ideal situations, such comparisons are performed with orthologous sequences from multiple species. To facilitate multispecies comparative sequence analysis, a robust and scalable strategy for simultaneously constructing sequence-ready bacterial artificial chromosome (BAC) contig maps from targeted genomic regions has been developed. Central to this approach is the generation and utilization of "universal" oligonucleotide-based hybridization probes ("overgo" probes), which are designed from sequences that are highly conserved between distantly related species. Large collections of these probes are used en masse to screen BAC libraries from multiple species in parallel, with the isolated clones assembled into physical contig maps. To validate the effectiveness of this strategy, efforts were focused on the construction of BAC-based physical maps from multiple mammalian species (chimpanzee, baboon, cat, dog, cow, and pig). Using available human and mouse genomic sequence and a newly developed computer program to design the requisite probes, sequence-ready maps were constructed in all species for a series of targeted regions totaling approximately 16 Mb in the human genome. The described approach can be used to facilitate the multispecies comparative sequencing of targeted genomic regions and can be adapted for constructing BAC contig maps in other vertebrates.
Background
Rare variants (RV) in immunoglobulin mu‐binding protein 2 (IGHMBP2; OMIM 600502) can cause an autosomal recessive type of Charcot‐Marie‐Tooth (CMT) disease (OMIM 616155), an inherited peripheral neuropathy. Over 40 different genes are associated with CMT, with different possible inheritance patterns.
Methods and Results
An 11‐year‐old female with motor delays was found to have distal atrophy, weakness, and areflexia without bulbar or sensory findings. Her clinical evaluation was unrevealing. Whole exome sequencing (WES) revealed a maternally inherited IGHMBP2 RV (c.1730T>C) predicted to be pathogenic, but no variant on the other allele was identified. Deletion and duplication analysis was negative. She was referred to the Undiagnosed Diseases Network (UDN) for further evaluation.
Whole genome sequencing (WGS) confirmed the previously identified IGHMBP2 RV and identified a paternally inherited non‐coding IGHMBP2 RV. This was predicted to activate a cryptic splice site perturbing IGHMBP2 splicing. Reverse transcriptase polymerase chain reaction (RT‐PCR) analysis was consistent with activation of the cryptic splice site. The abnormal transcript was shown to undergo nonsense‐mediated decay (NMD), resulting in haploinsufficiency.
Conclusion
This case demonstrates the deficiencies of WES and traditional molecular analyses and highlights the advantages of WGS and functional studies.
An 11‐year‐old girl with a clinical presentation consistent with Charcot‐Marie‐Tooth syndrome was referred to the Undiagnosed Diseases Network after her genetic testing, including whole exome sequencing, revealed only a single pathogenic variant in IGHMBP2. Whole genome sequencing revealed a second, intronic variant on the other allele of IGHMBP2. This variant was shown to activate a cryptic splice site, producing an abnormal transcript that undergoes nonsense‐mediated decay.
Duplications have long been postulated to be an important mechanism by which genomes evolve. Interspecies genomic comparisons are one method by which the origin and molecular mechanism of duplications can be inferred. By comparative mapping in human, mouse, and rat, we previously found evidence for a recent chromosome-fission event that occurred in the mouse lineage. Cytogenetic mapping revealed that the genomic segments flanking the fission site appeared to be duplicated, with copies residing near the centromere of multiple mouse chromosomes. Here we report the mapping and sequencing of the regions of mouse chromosomes 5 and 6 involved in this chromosome-fission event as well as the results of comparative sequence analysis with the orthologous human and rat genomic regions. Our data indicate that the duplications associated with mouse chromosomes 5 and 6 are recent and that the resulting duplicated segments share significant sequence similarity with a series of regions near the centromeres of the mouse chromosomes previously identified by cytogenetic mapping. We also identified pericentromeric duplicated segments shared between mouse chromosomes 5 and 1. Finally, novel mouse satellite sequences as well as putative chimeric transcripts were found to be associated with the duplicated segments. Together, these findings demonstrate that pericentromeric duplications are not restricted to primates and may be a common mechanism for genome evolution in mammals.