The telomere-to-telomere (T2T) complete human reference has significantly improved our ability to characterize genome structural variation. To understand its impact on inversion polymorphisms, we remapped data from 41 genomes against the T2T reference genome and compared the results to those obtained with the GRCh38 reference. We find a ~21% increase in sensitivity, with improved mapping of 63 inversions on the T2T reference. We identify 26 misorientations within GRCh38 and show that the T2T reference is three times more likely to represent the correct orientation of the major human allele. Analysis of 10 additional samples reveals novel rare inversions at chromosomes 15q25.2, 16p11.2, 16q22.1-23.1, and 22q11.21.
Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without parental data, and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97%, compared to 80-91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.
The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.
There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still contains more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications 35.4%, satellite DNA 22.3%, or regions enriched in GA/AT-rich DNA 27.4%). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
The sequencing and assembly of human genomes using long‐read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high‐fidelity (HiFi) or continuous long‐read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5‐fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short in assembling centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.
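For readers unfamiliar with the NG50 statistic quoted above, the following is a minimal sketch of its standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of an assumed genome (or region) size. The function name `ng50` and the toy contig lengths are illustrative assumptions and are not taken from the study, which reports NG50 restricted to sequence within 1 Mbp of the centromere.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Return the NG50 of an assembly, or None if it cannot be reached.

    Contigs are visited from longest to shortest; the NG50 is the length of
    the contig at which the running total first reaches half of genome_size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    # The assembly covers less than half of the assumed genome size.
    return None


# Toy example (hypothetical lengths, arbitrary units).
toy_contigs = [50, 40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=200))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size, so two assemblies of the same genome can be compared directly even if one recovers more sequence than the other.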
Copy number variants (CNVs) are subject to stronger selective pressure than single-nucleotide variants, but their roles in archaic introgression and adaptation have not been systematically investigated. We show that stratified CNVs are significantly associated with signatures of positive selection in Melanesians and provide evidence for adaptive introgression of large CNVs at chromosomes 16p11.2 and 8p21.3 from Denisovans and Neanderthals, respectively. Using long-read sequence data, we reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations. Our results collectively suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation.
Induced pluripotent stem cells (iPSCs) are commonly generated by transduction of Oct4, Sox2, Klf4, and Myc (OSKM) into cells. Although iPSCs are pluripotent, they frequently exhibit high variation in terms of quality, as measured in mice by chimera contribution and tetraploid complementation. Reliably high-quality iPSCs will be needed for future therapeutic applications. Here, we show that one major determinant of iPSC quality is the combination of reprogramming factors used. Based on tetraploid complementation, we found that ectopic expression of Sall4, Nanog, Esrrb, and Lin28 (SNEL) in mouse embryonic fibroblasts (MEFs) generated high-quality iPSCs more efficiently than other combinations of factors including OSKM. Although differentially methylated regions, transcript number of master regulators, establishment of specific superenhancers, and global aneuploidy were comparable between high- and low-quality lines, aberrant gene expression, trisomy of chromosome 8, and abnormal H2A.X deposition were distinguishing features that could potentially also be applicable to human.
• Interplay between reprogramming factors is an important determinant of iPSC quality
• High expression of Sall4, Nanog, Esrrb, and Lin28 in MEFs produces high-quality iPSCs
• Trisomy 8 is a frequent genomic aberration in OSK and OSKM-iPSCs
• High-quality SNEL-iPSCs are correlated with faithful H2A.X deposition patterns
Buganim and colleagues show that the specific combination of reprogramming factors used to make iPSCs influences the quality of the resulting cells, and the expression of Sall4, Nanog, Esrrb, and Lin28A produces high-quality iPSCs.
Sister chromatid exchanges (SCEs) are products of joint DNA molecule resolution, and are considered to form through homologous recombination (HR). Indeed, SCE induction upon irradiation requires the canonical HR factors BRCA1, BRCA2 and RAD51. In contrast, replication-blocking agents, including PARP inhibitors, induce SCEs independently of BRCA1, BRCA2 and RAD51. PARP inhibitor-induced SCEs are enriched at difficult-to-replicate genomic regions, including common fragile sites (CFSs). PARP inhibitor-induced replication lesions are transmitted into mitosis, suggesting that SCEs can originate from mitotic processing of under-replicated DNA. Proteomics analysis reveals mitotic recruitment of DNA polymerase theta (POLQ) to synthetic DNA ends. POLQ inactivation results in reduced SCE numbers and severe chromosome fragmentation upon PARP inhibition in HR-deficient cells. Accordingly, analysis of CFSs in cancer genomes reveals frequent allelic deletions, flanked by signatures of POLQ-mediated repair. Combined, we show PARP inhibition generates under-replicated DNA, which is processed into SCEs during mitosis, independently of canonical HR factors.
The impact of segmental duplications on human evolution and disease is only just starting to unfold, thanks to advancements in sequencing technologies that allow for their discovery and precise genotyping. The 15q11-q13 locus is a hotspot of recurrent copy number variation associated with Prader–Willi/Angelman syndromes, developmental delay, autism, and epilepsy and is mediated by complex segmental duplications, many of which arose recently during evolution. To gain insight into the instability of this region, we characterized its architecture in human and nonhuman primates, reconstructing the evolutionary history of five different inversions that rearranged the region in different species primarily by accumulation of segmental duplications. Comparative analysis of human and nonhuman primate duplication structures suggests a human-specific gain of directly oriented duplications in the regions flanking the GOLGA cores and HERC segmental duplications, representing potential genomic drivers for the human-specific expansions. The increasing complexity of segmental duplication organization over the course of evolution underlies its association with human susceptibility to recurrent disease-associated rearrangements.
Haplotypes are fundamental to fully characterize the diploid genome of an individual, yet methods to directly chart the unique genetic makeup of each parental chromosome are lacking. Here we introduce single-cell DNA template strand sequencing (Strand-seq) as a novel approach to phasing diploid genomes along the entire length of all chromosomes. We demonstrate this by building a complete haplotype for a HapMap individual (NA12878) at high accuracy (concordance 99.3%), without using generational information or statistical inference. By use of this approach, we mapped all meiotic recombination events in a family trio with high resolution (median range ∼14 kb) and phased larger structural variants like deletions, indels, and balanced rearrangements like inversions. Lastly, the single-cell resolution of Strand-seq allowed us to observe loss of heterozygosity regions in a small number of cells, a significant advantage for studies of heterogeneous cell populations, such as cancer cells. We conclude that Strand-seq is a unique and powerful approach to completely phase individual genomes and map inheritance patterns in families, while preserving haplotype differences between single cells.
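As an illustration of what a phasing concordance figure such as the one above measures, here is a minimal sketch under stated assumptions: each heterozygous site carries a binary allele code (0 or 1) on one chosen haplotype, and the predicted haplotype is compared site by site with a reference haplotype. Because haplotype labels are arbitrary, the comparison is made against both the reference and its complement. The function name `phasing_concordance` and the toy data are hypothetical; the abstract does not specify how its concordance value was computed.

```python
from typing import Sequence


def phasing_concordance(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of heterozygous sites phased consistently with the truth.

    Both inputs list the allele (0 or 1) carried by one haplotype at the same
    ordered set of heterozygous sites. Which haplotype is called "first" is
    arbitrary, so the better of the direct and complemented match is used.
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if not predicted:
        raise ValueError("at least one site is required")
    n_sites = len(predicted)
    matches = sum(p == t for p, t in zip(predicted, truth))
    # With binary alleles, sites that mismatch the truth match its complement.
    return max(matches, n_sites - matches) / n_sites


# Toy example (hypothetical alleles): one of four sites disagrees.
print(phasing_concordance([0, 1, 1, 0], [0, 1, 1, 1]))
```

This whole-chromosome measure penalizes a single switch error heavily if it flips a long stretch of sites, which is why switch error rates are often reported alongside it.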