Although many studies have been conducted to identify single nucleotide polymorphisms (SNPs) in humans, few studies have been conducted to identify alternative forms of natural genetic variation, such as insertion and deletion (INDEL) polymorphisms. In this report, we describe an initial map of human INDEL variation that contains 415,436 unique INDEL polymorphisms. These INDELs were identified with a computational approach using DNA re-sequencing traces that originally were generated for SNP discovery projects. They range from 1 bp to 9989 bp in length and are split almost equally between insertions and deletions, relative to the chimpanzee genome sequence. Five major classes of INDELs were identified, including (1) insertions and deletions of single-base pairs, (2) monomeric base pair expansions, (3) multi-base pair expansions of 2-15 bp repeat units, (4) transposon insertions, and (5) INDELs containing random DNA sequences. Our INDELs are distributed throughout the human genome with an average density of one INDEL per 7.2 kb of DNA. Variation hotspots were identified with up to 48-fold regional increases in INDEL and/or SNP variation compared with the averages for the same chromosomes. Over 148,000 INDELs (35.7%) were identified within known genes, and 5542 of these INDELs were located in the promoters and exons of genes, where gene function would be expected to be affected most strongly. All INDELs in this study have been deposited into dbSNP and have been integrated into maps of human genetic variation that are available to the research community.
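The first three INDEL classes listed above can be told apart, to a first approximation, from the inserted or deleted sequence alone. The following is a minimal sketch under that assumption, not the method used in the study: the function name `classify_indel`, the class labels, and the requirement of at least two full repeat copies are all hypothetical choices made for illustration. Separating transposon insertions from random sequence would need a repeat annotation, so both fall into one catch-all label here.

```python
def classify_indel(seq: str) -> str:
    """Assign an inserted/deleted sequence to a coarse class (illustrative only).

    Assumption: classification uses the INDEL sequence alone; flanking
    sequence and repeat databases are ignored in this sketch.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty INDEL sequence")
    # Class 1: single-base-pair insertion or deletion.
    if len(seq) == 1:
        return "single_bp"
    # Class 2: a run of one repeated base, e.g. AAAA.
    if len(set(seq)) == 1:
        return "monomeric_expansion"
    # Class 3: tandem copies of a 2-15 bp unit (hypothetical rule:
    # at least two full copies and no partial copy at the end).
    for unit_len in range(2, 16):
        copies, remainder = divmod(len(seq), unit_len)
        if remainder == 0 and copies >= 2 and seq == seq[:unit_len] * copies:
            return "multi_bp_expansion"
    # Classes 4 and 5 cannot be separated without a transposon annotation.
    return "transposon_or_random"
```

For example, `classify_indel("CACACA")` is treated as a multi-base pair expansion of the 2 bp unit `CA`, whereas a sequence with no internal periodicity falls through to the catch-all label.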
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed a sequence-resolved benchmark set for identification of both false-negative and false-positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12,745 isolated, sequence-resolved insertion (7,281) and deletion (5,464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5,262 insertions and 4,095 deletions supported by ≥1 diploid assembly. We demonstrate that the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked- and long-read sequencing and optical mapping.
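How a benchmark set of this kind yields false-negative and false-positive counts can be sketched as a simple matching procedure. The code below is an illustrative simplification, not the consortium's comparison method: the `SV` record, the function `compare_callsets`, and the thresholds `max_dist` and `min_size_ratio` are hypothetical, and real comparisons of sequence-resolved calls would also consider the variant sequence. It further assumes that every call already lies inside the benchmark regions, since only there do extra calls count as putative false positives.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int
    svtype: str  # "INS" or "DEL"
    length: int  # assumed positive (benchmark calls are >= 50 bp)


def compare_callsets(benchmark, calls, max_dist=500, min_size_ratio=0.7):
    """Count true positives, false negatives and false positives.

    Assumption: a call matches a benchmark SV when chromosome and type
    agree, positions differ by at most max_dist, and the shorter length
    is at least min_size_ratio of the longer one. Both thresholds are
    hypothetical values chosen for illustration.
    """
    unmatched = list(calls)
    tp = 0
    fn = 0
    for truth in benchmark:
        match = None
        for call in unmatched:
            same_site = call.chrom == truth.chrom and call.svtype == truth.svtype
            close = abs(call.pos - truth.pos) <= max_dist
            ratio = min(call.length, truth.length) / max(call.length, truth.length)
            if same_site and close and ratio >= min_size_ratio:
                match = call
                break
        if match is None:
            fn += 1  # benchmark SV missed by the callset
        else:
            tp += 1
            unmatched.remove(match)  # each call may match only once
    # Calls left over matched no benchmark SV: putative false positives.
    return {"tp": tp, "fn": fn, "fp": len(unmatched)}
```

Matching each call at most once keeps a single call from being credited against two nearby benchmark variants.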
In this review, we focus on progress that has been made with detecting small insertions and deletions (INDELs) in human genomes. Over the past decade, several million small INDELs have been discovered in human populations and personal genomes. The amount of genetic variation that is caused by these small INDELs is substantial. The number of INDELs in human genomes is second only to the number of single nucleotide polymorphisms (SNPs), and, in terms of base pairs of variation, INDELs cause levels of variation similar to those caused by SNPs. Many of these INDELs map to functionally important sites within human genes, and thus, are likely to influence human traits and diseases. Therefore, small INDEL variation will play a prominent role in personalized medicine.
Candida albicans is a frequent colonizer of human mucosal surfaces as well as an opportunistic pathogen. C. albicans is remarkably versatile in its ability to colonize diverse host sites with differences in oxygen and nutrient availability, pH, immune responses, and resident microbes, among other cues. It is unclear how the genetic background of a commensal colonizing population can influence the shift to pathogenicity. Therefore, we examined 910 commensal isolates from 35 healthy donors to identify host niche-specific adaptations. We demonstrate that healthy people are reservoirs for genotypically and phenotypically diverse C. albicans strains. Using limited diversity exploitation, we identified a single nucleotide change in the uncharacterized ZMS1 transcription factor that was sufficient to drive hyper invasion into agar. We found that SC5314 was significantly different from the majority of both commensal and bloodstream isolates in its ability to induce host cell death. However, our commensal strains retained the capacity to cause disease in the Galleria model of systemic infection, including outcompeting the SC5314 reference strain during systemic competition assays. This study provides a global view of commensal strain variation and within-host strain diversity of C. albicans and suggests that selection for commensalism in humans does not result in a fitness cost for invasive disease.
Upstream open reading frames (uORFs) initiate translation within mRNA 5' leaders, and have the potential to alter main coding sequence (CDS) translation on transcripts in which they reside. Ribosome profiling (RP) studies suggest that translating ribosomes are pervasive within 5' leaders across model systems. However, the significance of this observation remains unclear. To explore a role for uORF usage in a model of neuronal differentiation, we performed RP on undifferentiated and differentiated human neuroblastoma cells.
Using a spectral coherence algorithm (SPECtre), we identify 4954 consistently translated uORFs across 31% of all neuroblastoma transcripts. These uORFs predominantly utilize non-AUG initiation codons and exhibit translational efficiencies (TE) comparable to annotated coding regions. On a population basis, the global impact of both AUG and non-AUG initiated uORFs on basal CDS translation was small, even when analysis was limited to conserved and consistently translated uORFs. However, uORFs did alter the translation of a subset of genes, including the Diamond-Blackfan Anemia associated ribosomal gene RPS24. With retinoic acid-induced differentiation, we observed an overall positive correlation in translational shifts between uORF/CDS pairs. However, CDSs downstream of uORFs show smaller shifts in TE with differentiation relative to CDSs without a predicted uORF, suggesting that uORF translation buffers cell state-dependent fluctuations in CDS translation.
This work provides insights into the dynamic relationships and potential regulatory functions of uORF/CDS pairs in a model of neuronal differentiation.
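The translational efficiency (TE) comparisons described above rest on simple arithmetic, which can be sketched as follows. This assumes the common ribosome-profiling convention that TE is the ribosome footprint density of a region divided by its mRNA abundance, with both values normalized the same way; the abstract does not state the exact normalization used, and the function names are hypothetical.

```python
import math


def translational_efficiency(rpf_density: float, mrna_density: float) -> float:
    """TE as ribosome footprint density over mRNA abundance.

    Assumption: both inputs are positive and normalized in the same way
    (for example, reads per kilobase per million mapped reads).
    """
    if rpf_density <= 0 or mrna_density <= 0:
        raise ValueError("densities must be positive")
    return rpf_density / mrna_density


def log2_te_shift(te_before: float, te_after: float) -> float:
    """Log2 fold change in TE between two cell states."""
    return math.log2(te_after / te_before)
```

Under this convention, a buffering effect would appear as a smaller absolute `log2_te_shift` for CDSs downstream of uORFs than for CDSs without a predicted uORF.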
Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE.
Alu retrotransposons evolved from 7SL RNA approximately 65 million years ago and underwent several rounds of massive expansion in primate genomes. Consequently, the human genome currently harbors 1.1 million Alu copies. Some of these copies remain actively mobile and continue to produce both genetic variation and diseases by "jumping" to new genomic locations. However, it is unclear how many active Alu copies exist in the human genome and which Alu subfamilies harbor such copies. Here, we present a comprehensive functional analysis of Alu copies across the human genome. We cloned Alu copies from a variety of genomic locations and tested these copies in a plasmid-based mobilization assay. We show that functionally intact core Alu elements are highly abundant and far outnumber all other active transposons in humans. A range of Alu lineages were found to harbor such copies, including all modern AluY subfamilies and most AluS subfamilies. We also identified two major determinants of Alu activity: (1) The primary sequence of a given Alu copy, and (2) the ability of the encoded RNA to interact with SRP9/14 to form RNA/protein (RNP) complexes. We conclude that Alu elements pose the largest transposon-based mutagenic threat to the human genome. On the basis of our data, we have begun to identify Alu copies that are likely to produce genetic variation and diseases in humans.
We defined the genetic landscape of balanced chromosomal rearrangements at nucleotide resolution by sequencing 141 breakpoints from cytogenetically interpreted translocations and inversions. We confirm that the recently described phenomenon of 'chromothripsis' (massive chromosomal shattering and reorganization) is not unique to cancer cells but also occurs in the germline, where it can resolve to a relatively balanced state with frequent inversions. We detected a high incidence of complex rearrangements (19.2%) and substantially less reliance on microhomology (31%) than previously observed in benign copy-number variants (CNVs). We compared these results to experimentally generated DNA breakage-repair by sequencing seven transgenic animals, revealing extensive rearrangement of the transgene and host genome with similar complexity to human germline alterations. Inversion was the most common rearrangement, suggesting that a combined mechanism involving template switching and non-homologous repair mediates the formation of balanced complex rearrangements that are viable, stably replicated and transmitted unaltered to subsequent generations.
Human genetic variation is expected to play a central role in personalized medicine. Yet only a fraction of the natural genetic variation that is harbored by humans has been discovered to date. Here we report almost 2 million small insertions and deletions (INDELs) that range from 1 bp to 10,000 bp in length in the genomes of 79 diverse humans. These variants include 819,363 small INDELs that map to human genes. Small INDELs were frequently found in the coding exons of these genes, and several lines of evidence indicate that such variation is a major determinant of human biological diversity. Microarray-based genotyping experiments revealed several interesting observations regarding the population genetics of small INDEL variation. For example, we found that many of our INDELs had high levels of linkage disequilibrium (LD) with both HapMap SNPs and high-scoring SNPs from genome-wide association studies. Overall, our study indicates that small INDEL variation is likely to be a key factor underlying inherited traits and diseases in humans.