Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.
Analysis of cell-free fetal DNA in maternal plasma holds promise for the development of noninvasive prenatal genetic diagnostics. Previous studies have been restricted to detection of fetal trisomies, to specific paternally inherited mutations, or to genotyping common polymorphisms using material obtained invasively, for example, through chorionic villus sampling. Here, we combine genome sequencing of two parents, genome-wide maternal haplotyping, and deep sequencing of maternal plasma DNA to noninvasively determine the genome sequence of a human fetus at 18.5 weeks of gestation. Inheritance was predicted at 2.8 × 10^6 parental heterozygous sites with 98.1% accuracy. Furthermore, 39 of 44 de novo point mutations in the fetal genome were detected, albeit with limited specificity. Subsampling these data and analyzing a second family trio by the same approach indicate that parental haplotype blocks of ~300 kilobase pairs combined with shallow sequencing of maternal plasma DNA are sufficient to substantially determine the inherited complement of a fetal genome. However, ultradeep sequencing of maternal plasma DNA is necessary for the practical detection of fetal de novo mutations genome-wide. Although technical and analytical challenges remain, we anticipate that noninvasive analysis of inherited variation and de novo mutations in fetal genomes will facilitate prenatal diagnosis of both recessive and dominant Mendelian disorders.
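As a rough illustration of why haplotype blocks help with shallow plasma sequencing, the sketch below pools read counts across all maternal heterozygous sites in one block and picks the haplotype that is over-represented in plasma, since the fetal fraction adds extra copies of the transmitted haplotype. This is a minimal sketch, not the authors' method: the function name `infer_transmitted_haplotype`, the `min_margin` parameter, and the input layout are hypothetical, and it ignores paternal alleles, sequencing error, statistical testing, and recombination within a block.

```python
def infer_transmitted_haplotype(block_counts, min_margin=0):
    """Guess which maternal haplotype the fetus inherited in one block.

    block_counts: list of (hap1_reads, hap2_reads) tuples, one per maternal
    heterozygous site, giving plasma reads that carry the allele phased to
    each maternal haplotype (hypothetical input layout).
    min_margin: hypothetical threshold; differences at or below it are
    treated as uninformative.
    """
    # Pool evidence over the whole block: a single site is too noisy at
    # shallow coverage, but many linked sites add up.
    hap1_total = sum(h1 for h1, _ in block_counts)
    hap2_total = sum(h2 for _, h2 in block_counts)

    if abs(hap1_total - hap2_total) <= min_margin:
        return None  # no call
    return "hap1" if hap1_total > hap2_total else "hap2"
```

For example, `infer_transmitted_haplotype([(12, 9), (10, 8), (7, 7)])` pools 29 reads against 24 and returns `"hap1"`; a real analysis would replace the simple margin with a proper statistical test.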
The HeLa cell line was established in 1951 from cervical cancer cells taken from a patient, Henrietta Lacks. This was the first successful attempt to immortalize human-derived cells in vitro. The robust growth and unrestricted distribution of HeLa cells resulted in its broad adoption (both intentionally and through widespread cross-contamination), and for the past 60 years it has served a role analogous to that of a model organism. The cumulative impact of the HeLa cell line on research is demonstrated by its occurrence in more than 74,000 PubMed abstracts (approximately 0.3%). The genomic architecture of HeLa remains largely unexplored beyond its karyotype, partly because, like many cancers, its extensive aneuploidy renders such analyses challenging. We carried out haplotype-resolved whole-genome sequencing of the HeLa CCL-2 strain, examined point- and indel-mutation variations, mapped copy-number variations and loss of heterozygosity regions, and phased variants across full chromosome arms. We also investigated variation and copy-number profiles for HeLa S3 and eight additional strains. We find that HeLa is relatively stable in terms of point variation, with few new mutations accumulating after early passaging. Haplotype resolution facilitated reconstruction of an amplified, highly rearranged region of chromosome 8q24.21 at which integration of the human papilloma virus type 18 (HPV-18) genome occurred and that is likely to be the event that initiated tumorigenesis. We combined these maps with RNA-seq and ENCODE Project data sets to phase the HeLa epigenome. This revealed strong, haplotype-specific activation of the proto-oncogene MYC by the integrated HPV-18 genome approximately 500 kilobases upstream, and enabled global analyses of the relationship between gene dosage and expression. These data provide an extensively phased, high-quality reference genome for past and future experiments relying on HeLa, and demonstrate the value of haplotype resolution for characterizing cancer genomes and epigenomes.
Copy number variants (CNVs) are subject to stronger selective pressure than single-nucleotide variants, but their roles in archaic introgression and adaptation have not been systematically investigated. We show that stratified CNVs are significantly associated with signatures of positive selection in Melanesians and provide evidence for adaptive introgression of large CNVs at chromosomes 16p11.2 and 8p21.3 from Denisovans and Neanderthals, respectively. Using long-read sequence data, we reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations. Our results collectively suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation.
Despite widespread clinical genetic testing, many individuals with suspected genetic conditions lack a precise diagnosis, limiting their opportunity to take advantage of state-of-the-art treatments. In some cases, testing reveals difficult-to-evaluate structural differences, candidate variants that do not fully explain the phenotype, single pathogenic variants in recessive disorders, or no variants in genes of interest. Thus, there is a need for better tools to identify a precise genetic diagnosis in individuals when conventional testing approaches have been exhausted. We performed targeted long-read sequencing (T-LRS) using adaptive sampling on the Oxford Nanopore platform on 40 individuals, 10 of whom lacked a complete molecular diagnosis. We computationally targeted up to 151 Mbp of sequence per individual and searched for pathogenic substitutions, structural variants, and methylation differences using a single data source. We detected all genomic aberrations (including single-nucleotide variants, copy number changes, repeat expansions, and methylation differences) identified by prior clinical testing. In 8/8 individuals with complex structural rearrangements, T-LRS enabled more precise resolution of the mutation, leading to changes in clinical management in one case. In 10 individuals with suspected Mendelian conditions lacking a precise genetic diagnosis, T-LRS identified pathogenic or likely pathogenic variants in six and variants of uncertain significance in two others. T-LRS accurately identifies pathogenic structural variants, resolves complex rearrangements, and identifies Mendelian variants not detected by other technologies. T-LRS represents an efficient and cost-effective strategy to evaluate high-priority genes and regions or complex clinical testing results.
TRP channel-associated factor 1/2 (TCAF1/TCAF2) proteins antagonistically regulate the cold-sensor protein TRPM8 in multiple human tissues. Understanding their significance has been complicated because the locus spans a gap-ridden region with complex segmental duplications in GRCh38. Using long-read sequencing, we sequence-resolve the locus, annotate full-length TCAF models in primate genomes, and show substantial human-specific TCAF copy number variation. We identify two human super haplogroups, H4 and H5, and establish that TCAF duplications originated ~1.7 million years ago but diversified only in Homo sapiens by recurrent structural mutations. Conversely, in all archaic-hominin samples the fixation of a specific H4 haplotype without duplication is likely due to positive selection. Together, our results on TCAF copy number expansion, selection signals in hominins, differential TCAF2 expression between haplogroups, and high TCAF2 and TRPM8 expression in liver and prostate in modern-day humans imply TCAF diversification among hominins, potentially in response to cold or dietary adaptations.
Primary ciliary dyskinesia (PCD) is a genetically heterogeneous, autosomal-recessive disorder, characterized by oto-sino-pulmonary disease and situs abnormalities. PCD-causing mutations have been identified in 20 genes, but collectively they account for only ∼65% of all PCDs. To identify mutations in additional genes that cause PCD, we performed exome sequencing on three unrelated probands with ciliary outer and inner dynein arm (ODA+IDA) defects. Mutations in SPAG1 were identified in one family with three affected siblings. Further screening of SPAG1 in 98 unrelated affected individuals (62 with ODA+IDA defects, 35 with ODA defects, 1 without available ciliary ultrastructure) revealed biallelic loss-of-function mutations in 11 additional individuals (including one sib-pair). All 14 affected individuals with SPAG1 mutations had a characteristic PCD phenotype, including 8 with situs abnormalities. Additionally, all individuals with mutations who had defined ciliary ultrastructure had ODA+IDA defects. SPAG1 was present in human airway epithelial cell lysates but was not present in isolated axonemes, and immunofluorescence staining showed an absence of ODA and IDA proteins in cilia from an affected individual, thus indicating that SPAG1 probably plays a role in the cytoplasmic assembly and/or trafficking of the axonemal dynein arms. Zebrafish morpholino studies of spag1 produced cilia-related phenotypes previously reported for PCD-causing mutations in genes encoding cytoplasmic proteins. Together, these results demonstrate that mutations in SPAG1 cause PCD with ciliary ODA+IDA defects and that exome sequencing is useful to identify genetic causes of heterogeneous recessive disorders.
Primary ciliary dyskinesia (PCD) is a genetically heterogeneous, autosomal-recessive disorder, characterized by oto-sino-pulmonary disease and situs abnormalities. PCD-causing mutations have been identified in 14 genes, but they collectively account for only ∼60% of all PCD. To identify mutations that cause PCD, we performed exome sequencing on six unrelated probands with ciliary outer dynein arm (ODA) defects. Mutations in CCDC114, an ortholog of the Chlamydomonas reinhardtii motility gene DCC2, were identified in a family with two affected siblings. Sanger sequencing of 67 additional individuals with PCD with ODA defects from 58 families revealed CCDC114 mutations in 4 individuals in 3 families. All 6 individuals with CCDC114 mutations had characteristic oto-sino-pulmonary disease, but none had situs abnormalities. In the remaining 5 individuals with PCD who underwent exome sequencing, we identified mutations in two genes (DNAI2, DNAH5) known to cause PCD, including an Ashkenazi Jewish founder mutation in DNAI2. These results revealed that mutations in CCDC114 are a cause of ciliary dysmotility and PCD and further demonstrate the utility of exome sequencing to identify genetic causes in heterogeneous recessive disorders.
Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children, a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10^-8 substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increase the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM compared to short-read data alone.
Variable number tandem repeats (VNTRs) are composed of consecutive repetitive DNA with hypervariable repeat count and composition. They include protein-coding sequences and are associated with clinical disorders. It has been difficult to incorporate VNTR analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. In this work, we solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We develop software to build an RPGG, and use the RPGG to estimate VNTR composition with short reads. We use this to discover VNTRs with length stratified by continental population, and expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease.
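To make the idea of a graph that captures both repeat structure and population diversity more concrete, the sketch below builds a small de Bruijn-style k-mer graph from several haplotype sequences of one locus and then counts how often short-read k-mers hit that graph. It is an illustrative toy under stated assumptions, not the software described above: the function names `build_repeat_graph` and `count_graph_kmers` are hypothetical, and reverse complements, flanking sequence, and sequencing errors are ignored.

```python
from collections import Counter, defaultdict


def build_repeat_graph(haplotype_seqs, k):
    """Build a k-mer graph for one locus from several haplotype sequences.

    Repeated units collapse onto the same nodes (forming cycles), and
    differences between haplotypes appear as extra nodes or branches.
    """
    nodes = set()
    edges = defaultdict(set)
    for seq in haplotype_seqs:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        # Connect each k-mer to the one that follows it in the sequence.
        for current, following in zip(kmers, kmers[1:]):
            edges[current].add(following)
    return nodes, edges


def count_graph_kmers(nodes, reads, k):
    """Count occurrences in the reads of k-mers that are graph nodes."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in nodes:
                counts[kmer] += 1
    return counts
```

With the toy haplotypes `["ACGACGACGT", "ACGACGT"]` and `k = 3`, both sequences collapse onto the same four nodes, and the cycle ACG → CGA → GAC → ACG represents the repeat unit; per-node read counts could then serve as a crude proxy for repeat content.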