Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high.
We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki-Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants.
The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and validation of small insertions and deletions, could also benefit from this technology.
Abstract
Background
The growing volume and heterogeneity of next-generation sequencing (NGS) data complicate the further optimization of identifying DNA variation, especially considering that curated high-confidence variant call sets frequently used to validate these methods are generally developed from the analysis of comparatively small and homogeneous sample sets.
Findings
We have developed xAtlas, a single-sample variant caller for single-nucleotide variants (SNVs) and small insertions and deletions (indels) in NGS data. xAtlas features rapid runtimes, support for CRAM and gVCF file formats, and retraining capabilities. xAtlas reports SNVs with 99.11% recall and 98.43% precision across a reference HG002 sample at 60× whole-genome coverage in less than 2 CPU hours. Applying xAtlas to 3,202 samples at 30× whole-genome coverage from the 1000 Genomes Project achieves an average runtime of 1.7 hours per sample and a clear separation of the individual populations in principal component analysis across called SNVs.
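The recall and precision figures above can be combined into a single F1 score, and the per-sample runtime scaled to the full cohort. The following is a minimal sketch of that arithmetic only; it does not use or reproduce xAtlas itself, and the function names and the example counts in the comments are hypothetical.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from true-positive, false-positive and
    false-negative call counts (hypothetical inputs for illustration)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract for SNVs on HG002 at 60x coverage.
snv_recall = 0.9911
snv_precision = 0.9843
snv_f1 = f1_score(snv_precision, snv_recall)

# Aggregate runtime implied by the 1000 Genomes figures quoted above.
n_samples = 3202
mean_hours_per_sample = 1.7
total_hours = n_samples * mean_hours_per_sample

print(f"SNV F1: {snv_f1:.4f}")
print(f"Total runtime across cohort: {total_hours:.1f} hours")
```

Because F1 is a harmonic mean, it sits between the two inputs and slightly closer to the lower one, so it gives a quick single-number summary when comparing callers.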
Conclusions
xAtlas is a fast, lightweight, and accurate SNV and small indel calling method. Source code for xAtlas is available under a BSD 3-clause license at https://github.com/jfarek/xatlas.
Characterization of genomic structural variation (SV) is essential to expanding the research and clinical applications of genome sequencing. Reliance upon short DNA fragment paired end sequencing has yielded a wealth of single nucleotide variants and internal sequencing read insertions-deletions, at the cost of limited SV detection. Multi-kilobase DNA fragment mate pair sequencing has helped fill the void in SV detection, but has introduced new analytic challenges that require SV detection tools specifically designed for mate pair sequencing data. Here, we introduce SVachra (Structural Variation Assessment of CHRomosomal Aberrations), a breakpoint calling program that identifies large insertions-deletions, inversions, and inter- and intra-chromosomal translocations utilizing both inward and outward facing read types generated by mate pair sequencing.
We demonstrate SVachra's utility by executing the program on large-insert (Illumina Nextera) mate pair sequencing data from the personal genome of a single subject (HS1011). An additional long-read data set (Pacific Biosciences RSII) was also generated to validate SV calls from SVachra and from the other SV calling programs used for comparison. SVachra exhibited the highest validation rate and reported the widest distribution of SV types and size ranges when compared to other SV callers.
SVachra is a highly specific breakpoint calling program that exhibits a less biased SV detection methodology than other callers.
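The distinction between inward and outward facing read pairs mentioned above can be illustrated with a small orientation classifier. This is only a sketch of the general idea under simple assumptions (one coordinate and one strand flag per read end); it is not SVachra's algorithm, and the `ReadEnd` type and `classify_pair` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEnd:
    chrom: str      # reference sequence name
    pos: int        # leftmost aligned coordinate
    forward: bool   # True if aligned to the forward (+) strand


def classify_pair(a: ReadEnd, b: ReadEnd) -> str:
    """Label a read pair by the relative orientation of its two ends."""
    if a.chrom != b.chrom:
        return "interchromosomal"
    # Order the two ends by coordinate so orientation is judged left to right.
    left, right = (a, b) if a.pos <= b.pos else (b, a)
    if left.forward and not right.forward:
        return "inward"       # ends point toward each other:  -> <-
    if not left.forward and right.forward:
        return "outward"      # ends point away from each other: <- ->
    return "same-strand"      # both ends on the same strand


pair = (ReadEnd("chr17", 15_000_000, False), ReadEnd("chr17", 15_004_500, True))
print(classify_pair(*pair))
```

A real caller would additionally cluster pairs by position and compare their spacing with the expected insert-size distribution before reporting a breakpoint; same-strand pairs are commonly treated as candidate evidence for inversions.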
Around 14% of protein-coding genes of Arabidopsis thaliana from the TAIR9 genome release are annotated as producing multiple transcript variants through alternative splicing. However, for most alternatively spliced genes in Arabidopsis, the relative expression level of individual splicing variants is unknown.
We investigated the prevalence of alternative splicing (AS) events in Arabidopsis thaliana using ESTs. We found that for most AS events with ample EST coverage, the majority of overlapping ESTs strongly supported one major splicing choice, with less than 10% of ESTs supporting the minor form. Analysis of ESTs also revealed a small but noteworthy subset of genes for which alternative choices appeared with about equal prevalence, suggesting that for these genes the variant splicing forms co-occur in the same cell types. Of the AS events in which both forms were about equally prevalent, more than 80% affected untranslated regions or involved small changes to the encoded protein sequence.
Currently available evidence from ESTs indicates that alternative splicing in Arabidopsis occurs and affects many genes, but for most genes with documented alternative splicing, one AS choice predominates. To aid investigation of the role AS may play in modulating function of Arabidopsis genes, we provide an on-line resource (ArabiTag) that supports searching AS events by gene, by EST library keyword search, and by relative prevalence of minor and major forms.
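The relative prevalence of minor and major forms can be expressed as the fraction of overlapping ESTs that support the less common splicing choice. The sketch below shows one way to compute and bin that fraction. The 10% cutoff comes from the text above; the minimum EST count (`min_ests`) and the cutoff for "about equally prevalent" (`balanced_cutoff`) are assumptions chosen for illustration, not values from the study or from ArabiTag.

```python
def minor_form_fraction(support_a: int, support_b: int) -> float:
    """Fraction of ESTs supporting the less common of two splicing choices."""
    total = support_a + support_b
    if total == 0:
        raise ValueError("no ESTs overlap this AS event")
    return min(support_a, support_b) / total


def classify_event(
    support_a: int,
    support_b: int,
    min_ests: int = 10,             # assumed threshold for "ample" coverage
    minor_cutoff: float = 0.10,     # from the text: <10% support the minor form
    balanced_cutoff: float = 0.40,  # assumed threshold for "about equal"
) -> str:
    """Bin an AS event by how evenly ESTs are split between its two forms."""
    if support_a + support_b < min_ests:
        return "insufficient coverage"
    fraction = minor_form_fraction(support_a, support_b)
    if fraction < minor_cutoff:
        return "one form predominates"
    if fraction >= balanced_cutoff:
        return "about equally prevalent"
    return "intermediate"


for counts in [(19, 1), (6, 5), (15, 5), (3, 1)]:
    print(counts, classify_event(*counts))
```

Requiring a minimum number of ESTs before binning avoids labeling an event as balanced or skewed on the basis of only a handful of observations.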
In October 2019, 46 scientists from around the world participated in the first National Center for Biotechnology Information (NCBI) Structural Variation (SV) Codeathon at Baylor College of Medicine. The charge of this first annual working session was to identify ongoing challenges around the topics of SV and graph genomes, and in response to design reliable methods to facilitate their study. Over three days, seven working groups each designed and developed new open-sourced methods to improve the bioinformatic analysis of genomic SVs represented in next-generation sequencing (NGS) data. The groups' approaches addressed a wide range of problems in SV detection and analysis, including quality control (QC) assessments of metagenome assemblies and population-scale VCF files, de novo copy number variation (CNV) detection based on continuous long sequence reads, the representation of sequence variation using graph genomes, and the development of an SV annotation pipeline. A summary of the questions and developments that arose during the daily discussions between groups is outlined. The new methods are publicly available at https://github.com/NCBI-Codeathons/, and demonstrate that a codeathon devoted to SV analysis can produce valuable new insights both for participants and for the broader research community.
This book presents formal testplanning guidelines with examples focused on creating assertion-based verification IP. It demonstrates a systematic process for formal specification and formal testplanning and is the first book published on this subject.
Maurice Blondel, best known for his 1893 work on Action, offers a window on the world of philosophers who negotiated the scientific disciplines at the turn of the twentieth century. During this amazing era of discoveries, Blondel encouraged the bold, encyclopedic spirit of science as well as the new standards coming into use for accumulating and judging observational evidence. However, he warned of reductionism, determinism, and phenomenism, trends which could be avoided or corrected if the nature and scope of science were broadened. Such a broadening would introduce a more integrated and holistic understanding of the scientific quest.
Abstract
Background
Structural variants (SVs), genomic rearrangements of >50 base pairs, are an important source of genetic variation and have a great impact on gene expression and protein function. However, their contribution to the genetic architecture of Alzheimer’s Disease (AD) has not been comprehensively investigated. Identification of SVs by short‐read alignment can be inaccurate and biased. We used a novel SV calling pipeline that leverages assembly‐based methods and graph‐based representation to identify SVs with high accuracy in the diverse sample of the Alzheimer’s Disease Sequencing Project (ADSP). We further examined the association of common SVs with AD.
Methods
We applied Biograph for SV calling on the ADSP 17K whole genome sequence data. After filtering out low‐quality SVs and performing sample‐level quality control, we identified 222,956 deletions, 177,139 insertions, and 10,558 inversions in 12,908 individuals (6,304 AD cases; 6,604 controls). We performed association analyses of common SVs (MAF > 1%) with AD using a logistic mixed‐effects model adjusting for sex, technical covariates, principal components, and relatedness. Statistical significance was evaluated using a Bonferroni‐corrected threshold (P < 9.19 × 10⁻⁷).
Results
There were 28,634 deletions, 25,522 insertions, and 198 inversions with MAF > 1%. We identified 2 intronic deletions significantly associated with AD. The first deletion is located in CCDC88B, encoding a protein that regulates T‐cell maturation and inflammation. The second deletion is located in CCDC11, predicted to be part of the spliceosomal complex. Among 11 insertions significantly associated with AD, 3 mapped to exonic sequences. A relatively common insertion located in an exon of FOXO6 showed the most significant association with AD (P = 1.41 × 10⁻¹²). FOXO6 is a member of Forkhead transcription factors that have been implicated in aging. Along with its high level of expression in the hippocampus, mice lacking FOXO6 had impaired memory consolidation. We identified one inversion suggestively associated with AD (P = 3.32 × 10⁻⁶) and located in an intron of PDE5A. PDE5A inhibitors have been reported as potential therapeutic targets for AD.
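The Bonferroni threshold quoted in the Methods is consistent with dividing a family-wise error rate by the number of common SVs tested. The sketch below reproduces that arithmetic using the counts reported in the Results. The error rate of 0.05 is an assumption (the abstract does not state it), and the variable and function names are hypothetical.

```python
ALPHA = 0.05  # assumed family-wise error rate; not stated in the abstract

# Common SVs (MAF > 1%) reported in the Results.
common_svs = {"deletions": 28_634, "insertions": 25_522, "inversions": 198}

n_tests = sum(common_svs.values())
threshold = ALPHA / n_tests  # Bonferroni: divide alpha by the number of tests


def is_significant(p_value: float, cutoff: float = threshold) -> bool:
    """True if a P value passes the Bonferroni-corrected cutoff."""
    return p_value < cutoff


print(f"tests: {n_tests}, threshold: {threshold:.3e}")
print("FOXO6 insertion (P = 1.41e-12):", is_significant(1.41e-12))
print("PDE5A inversion (P = 3.32e-6):", is_significant(3.32e-6))
```

Under this assumption the threshold comes out at roughly 9.2 × 10⁻⁷, in line with the reported value, and the PDE5A inversion falls above it, which matches its description as only suggestively associated.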
Conclusion
Using novel data structures and algorithms to call SVs, we discovered 14 SVs significantly associated with AD. Validation and gene prioritization of these findings are warranted.