To provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions, including 2,238 (1.6 Mbp) that are shared among all discovery genomes, with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms that the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes, with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci, improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity.
• We sequence resolve and annotate 99,604 common human structural variants
• 55% of VNTRs map to the end of chromosomes and correlate with double-strand breaks
• Alternate alleles facilitate accurate genotyping with short reads and new associations
• We patch the reference and add diversity needed for developing a pan human genome
Long-read sequencing allows generation of a large catalog of human structural variants and the development of an algorithm for genotyping SVs from short-read data, clarifying the spectrum and importance of structural variation in the human genome.
TRP channel-associated factor 1/2 (TCAF1/TCAF2) proteins antagonistically regulate the cold-sensor protein TRPM8 in multiple human tissues. Understanding their significance has been complicated given the locus spans a gap-ridden region with complex segmental duplications in GRCh38. Using long-read sequencing, we sequence-resolve the locus, annotate full-length TCAF models in primate genomes, and show substantial human-specific TCAF copy number variation. We identify two human super haplogroups, H4 and H5, and establish that TCAF duplications originated ~1.7 million years ago but diversified only in Homo sapiens by recurrent structural mutations. Conversely, in all archaic-hominin samples the fixation for a specific H4 haplotype without duplication is likely due to positive selection. Here, our results of TCAF copy number expansion, selection signals in hominins, and differential TCAF2 expression between haplogroups and high TCAF2 and TRPM8 expression in liver and prostate in modern-day humans imply TCAF diversification among hominins potentially in response to cold or dietary adaptations.
Copy number variants (CNVs) are subject to stronger selective pressure than single-nucleotide variants, but their roles in archaic introgression and adaptation have not been systematically investigated. We show that stratified CNVs are significantly associated with signatures of positive selection in Melanesians and provide evidence for adaptive introgression of large CNVs at chromosomes 16p11.2 and 8p21.3 from Denisovans and Neanderthals, respectively. Using long-read sequence data, we reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations. Our results collectively suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation.
Structural variation and single-nucleotide variation of the complement factor H (CFH) gene family underlie several complex genetic diseases, including age-related macular degeneration (AMD) and atypical hemolytic uremic syndrome (AHUS). To understand its diversity and evolution, we performed high-quality sequencing of this ∼360-kbp locus in six primate lineages, including multiple human haplotypes. Comparative sequence analyses reveal two distinct periods of gene duplication leading to the emergence of four CFH-related (CFHR) gene paralogs (CFHR2 and CFHR4 ∼25–35 Mya and CFHR1 and CFHR3 ∼7–13 Mya). Remarkably, all evolutionary breakpoints share a common ∼4.8-kbp segment corresponding to an ancestral CFHR gene promoter that has expanded independently throughout primate evolution. This segment is recurrently reused and juxtaposed with a donor duplication containing exons 8 and 9 from ancestral CFH, creating four CFHR fusion genes that include lineage-specific members of the gene family. Combined analysis of >5,000 AMD cases and controls identifies a significant burden of a rare missense mutation that clusters at the N terminus of CFH (P = 5.81 × 10⁻⁸, odds ratio [OR] = 9.8 [3.67-Infinity]). A bipolar clustering pattern of rare nonsynonymous mutations in patients with AMD (P < 10⁻³) and AHUS (P = 0.0079) maps to functional domains that show evidence of positive selection during primate evolution. Our structural variation analysis in >2,400 individuals reveals five recurrent rearrangement breakpoints that show variable frequency among AMD cases and controls. These data suggest a dynamic and recurrent pattern of mutation critical to the emergence of new CFHR genes but also in the predisposition to complex human genetic disease phenotypes.
The complex interspersed pattern of segmental duplications in humans is responsible for rearrangements associated with neurodevelopmental disease, including the emergence of novel genes important in human brain evolution. We investigate the evolution of LCR16a, a putative driver of this phenomenon that encodes one of the most rapidly evolving human-ape gene families, nuclear pore interacting protein (NPIP).
Comparative analysis shows that LCR16a has independently expanded in five primate lineages over the last 35 million years of primate evolution. The expansions are associated with independent lineage-specific segmental duplications flanking LCR16a leading to the emergence of large interspersed duplication blocks at non-orthologous chromosomal locations in each primate lineage. The intron-exon structure of the NPIP gene family has changed dramatically throughout primate evolution with different branches showing characteristic gene models yet maintaining an open reading frame. In the African ape lineage, we detect signatures of positive selection that occurred after a transition to more ubiquitous expression among great ape tissues when compared to Old World and New World monkeys. Mouse transgenic experiments from baboon and human genomic loci confirm these expression differences and suggest that the broader ape expression pattern arose due to mutational changes that emerged in cis.
LCR16a promotes serial interspersed duplications and creates hotspots of genomic instability that appear to be an ancient property of primate genomes. Dramatic changes to NPIP gene structure and altered tissue expression preceded major bouts of positive selection in the African ape lineage, suggestive of a gene undergoing strong adaptive evolution.
Genome structural variation shows remarkable complexity with respect to copy number, sequence content and distribution. While the discovery of copy number polymorphisms (CNP) has increased exponentially in recent years, the transition from discovery to genotyping has proved challenging, particularly for CNPs embedded in complex regions of the genome. CNPs that are collectively common in the population and possess a dynamic range of copy numbers have proved the most difficult to genotype in association studies. This is in some part due to technical limitations of genotyping assays and the sequence properties of the genomic region being analyzed. Here we describe in detail the basis of a number of molecular techniques used to genotype complex CNPs, compare and contrast these approaches for determination of multi-allelic copy number, and discuss the potential application of these techniques in genetic studies.
► We describe popular approaches for analyzing complex copy number polymorphisms.
► Accurate assessment of integer copy number is challenging for complex CNPs.
► Using multiple methods increases the accuracy of copy number genotyping.
► Sequence read depth can determine absolute copy number of complex CNPs.
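As a rough illustration of the read-depth idea in the last highlight, the sketch below scales the mean depth over a region by the mean depth over a control region assumed to be diploid. The function name, its parameters, and the depth values in the usage note are hypothetical and are not taken from the paper; real pipelines typically also correct for GC content and mappability, which this sketch ignores.

```python
def estimate_copy_number(region_depth: float,
                         control_depth: float,
                         control_copies: int = 2) -> float:
    """Estimate absolute copy number from sequence read depth.

    Assumes `control_depth` is the mean depth over a region known to be
    present in `control_copies` copies (2 for an autosomal diploid region).
    This is a simplified, hypothetical sketch with no bias correction.
    """
    if region_depth < 0 or control_depth <= 0:
        raise ValueError("depths must be non-negative, control depth positive")
    # Depth is assumed to scale linearly with the number of copies.
    return control_copies * region_depth / control_depth
```

For example, a region with a mean depth of 45 against a diploid control depth of 30 (both made-up values) gives an estimate of 3.0 copies.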
Intrachromosomal segmental duplications provide the substrate for non-allelic homologous recombination, facilitating extensive copy number variation in the human genome. Many multi-copy gene families are embedded within genomic regions with high levels of sequence identity (>95%) and therefore pose considerable analytical challenges. In some cases, the complexity involved in analyzing such regions is largely underestimated. Rapid, cost-effective analyses of multi-copy gene regions have typically relied on quantitative approaches; however, quantitative data do not provide absolute certainty. Any technique prone to measurement error can therefore produce ambiguous results that may lead to spurious associations with complex disease.
In this study we have focused on testing the accuracy and reproducibility of quantitative analysis techniques. With reference to the C-C Chemokine Ligand-3-like-1 (CCL3L1) gene, we performed analysis using real-time Quantitative PCR (QPCR), Multiplex Ligation-dependent Probe Amplification (MLPA) and Paralogue Ratio Test (PRT). After controlling for potential outside variables on assay performance, including DNA concentration, quality, preparation and storage conditions, we find that real-time QPCR produces data that do not cluster tightly around copy number integer values, with variation substantially greater than that of the MLPA or PRT systems. We find that the method of rounding real-time QPCR measurements can potentially lead to mis-scoring of copy number genotypes and suggest caution should be exercised in interpreting QPCR data.
We conclude that real-time QPCR is inherently prone to measurement error, even under conditions that would seem favorable for association studies. Our results indicate that potential variability in the physicochemical properties of the DNA samples cannot solely explain the poor performance exhibited by the real-time QPCR systems. We recommend that more robust approaches such as PRT or MLPA should be used to genotype multi-allelic copy number variation in disease association studies and suggest several approaches which can be implemented to ensure the quality of the copy number typing using quantitative methods.
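To illustrate why rounding raw quantitative measurements can mis-score genotypes, the sketch below calls an integer copy number only when the measurement lies close to an integer and otherwise flags it as ambiguous. The function name and the 0.25 tolerance are assumptions chosen for illustration; they are not values or methods reported in the study.

```python
import math
from typing import Optional


def call_copy_number(measurement: float,
                     tolerance: float = 0.25) -> Optional[int]:
    """Call an integer copy number from a raw quantitative measurement.

    Returns the nearest integer when the measurement lies within
    `tolerance` of it, and None (ambiguous) otherwise. The tolerance is
    a hypothetical cutoff, not one taken from the source.
    """
    if measurement < 0:
        raise ValueError("measurement must be non-negative")
    # Round half up explicitly; Python's round() rounds halves to even.
    nearest = math.floor(measurement + 0.5)
    if abs(measurement - nearest) > tolerance:
        return None  # too far from any integer to score confidently
    return nearest
```

Under these assumptions a measurement of 2.1 is called as 2, whereas 2.45 is left uncalled; naive rounding would have scored both as 2 and hidden the uncertainty in the second value.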
Copy Number Variants (CNVs) are now recognized as playing a significant role in complex disease etiology. Age-related macular degeneration (AMD) is the most common cause of irreversible vision loss in the western world. While a number of genes and environmental factors have been associated with both risk and protection in AMD, the role of CNVs has remained largely unexplored. We analyzed the two major AMD risk-associated regions on chromosome 1q32 and 10q26 for CNVs using Multiplex Ligation-dependent Probe Amplification. The analysis targeted nine genes in these two key regions, including the Complement Factor H (CFH) gene, the five CFH-related (CFHR) genes representing a known copy number "hotspot", the F13B gene, and the ARMS2 and HTRA1 genes, in 387 cases of late AMD and 327 controls. No copy number variation was detected at the ARMS2 and HTRA1 genes in the chromosome 10 region, nor for the CFH and F13B genes at the chromosome 1 region. However, significant association was identified for the CFHR3-1 deletion in AMD cases (p = 2.38 × 10⁻¹²; OR = 0.31, 95% CI 0.23-0.44), for both neovascular disease (nAMD) (p = 8.3 × 10⁻⁹; OR = 0.36, 95% CI 0.25-0.52) and geographic atrophy (GA) (p = 1.5 × 10⁻⁶; OR = 0.36, 95% CI 0.25-0.52) compared to controls. In addition, a significant association with deletion of CFHR1-4 was identified only in patients who presented with bilateral GA (p = 0.02; OR = 7.6, 95% CI 1.38-41.8). This is the first report of a phenotype-specific association of a CNV for a major subtype of AMD and potentially allows for pre-diagnostic identification of individuals most likely to proceed to this end stage of disease.
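For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained from case-control counts, the sketch below uses the standard Wald (log odds ratio) interval on a 2×2 table. The abstract does not state which method or which carrier counts the authors used, so the function name, the choice of the Wald interval, and any counts passed to it are assumptions for illustration only.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(cases_with: int, cases_without: int,
                       controls_with: int, controls_without: int,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    Uses the Wald interval on the log odds ratio; z = 1.96 corresponds
    to a 95% interval. All four counts must be positive. This is a
    generic textbook calculation, not the study's own analysis.
    """
    counts = (cases_with, cases_without, controls_with, controls_without)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (cases_with * controls_without) / (cases_without * controls_with)
    # Standard error of ln(OR) is the root of the summed reciprocal counts.
    std_err = math.sqrt(sum(1.0 / n for n in counts))
    lower = math.exp(math.log(odds_ratio) - z * std_err)
    upper = math.exp(math.log(odds_ratio) + z * std_err)
    return odds_ratio, lower, upper
```

An odds ratio below 1 whose interval excludes 1, as reported above for the CFHR3-1 deletion, indicates that the variant is less frequent among cases than among controls.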
The human genome contains a significant amount of sequence variation, from single nucleotide polymorphisms to large stretches of DNA that may be present in a range of different copies between individuals. Several such regions are variable in >1% of the population (referred to as copy number polymorphisms or CNPs), and many studies have looked for associations between the copy number of genes within multiallelic CNPs and disease susceptibility. Associations have indeed been described for several genes, including the β‐defensins (DEFB4, DEFB103, DEFB104), chemokine ligand 3 like 1 (CCL3L1), Fc gamma receptor 3B (FCGR3B), and complement component C4 (C4). However, follow‐up replication in independent cohorts has failed to reproduce a number of these associations. It is clear that replicated associations such as those between C4 and systemic lupus erythematosus, and β‐defensin and psoriasis, have used robust genotyping methodologies. Technical issues associated with genotyping sequences of high identity may therefore account for failure to replicate other associations. Here, we compare and contrast the most popular approaches that have been used to genotype CNPs, describe how they have been applied in different situations, and discuss potential reasons for the difficulty in reproducibly linking multiallelic CNPs to complex diseases.