Whole-genome re-sequencing
Bentley, David R
Current Opinion in Genetics & Development, 12/2006, Volume 16, Issue 6
Journal Article
Peer reviewed
DNA sequencing can be used to gain important information on genes, genetic variation and gene function for biological and medical studies. The growing collection of publicly available reference genome sequences will underpin a new era of whole genome re-sequencing, but sequencing costs need to fall and throughput needs to rise by several orders of magnitude. Novel technologies are being developed to meet this need by generating massive amounts of sequence that can be aligned to the reference sequence. The challenge is to maintain the high standards of accuracy and completeness that are hallmarks of the previous genome projects. One or more new sequencing technologies are expected to become the mainstay of future research, and to make DNA sequencing centre stage as a routine tool in genetic research in the coming years.
Abstract
Summary
We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci.
Availability and implementation
ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
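The inheritance-consistency idea in this abstract can be illustrated with a much simpler check than the one the authors describe. The sketch below is not the haplotype-transmission method used for the Platinum catalog; it only tests, per site, whether each child's unphased genotype could be formed from one allele of each parent. The function names and the tuple encoding of genotypes (0 for the reference allele, 1 for the first alternate allele) are assumptions made for illustration.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """Return True if the child's unphased genotype can be built from
    one paternal allele and one maternal allele.

    Genotypes are 2-tuples of allele indices (assumed encoding:
    0 = reference, 1 = first alternate allele).
    """
    # Every unordered allele pair the parents could transmit.
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def consistent_in_all_children(father, mother, children):
    """Return True only if every child passes the per-trio check."""
    return all(mendelian_consistent(father, mother, c) for c in children)


if __name__ == "__main__":
    # Heterozygous father, homozygous-reference mother.
    print(mendelian_consistent((0, 1), (0, 0), (0, 1)))
    # Both parents homozygous reference, child carries an alternate allele:
    # flagged as inconsistent (a candidate de novo variant or a calling error).
    print(mendelian_consistent((0, 0), (0, 0), (0, 1)))
```

A per-trio check like this cannot distinguish which parental haplotype was transmitted, so it is weaker than a pedigree-wide analysis across many siblings; it is shown only to make the notion of "consistent with the pattern of inheritance" concrete.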
Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.
Repeat expansions are responsible for over 40 monogenic disorders, and undoubtedly more pathogenic repeat expansions remain to be discovered. Existing methods for detecting repeat expansions in short-read sequencing data require predefined repeat catalogs. Recent discoveries emphasize the need for methods that do not require pre-specified candidate repeats. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide repeat expansion detection. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference repeat expansions not discoverable via existing methods.
The genomic landscape of breast cancer is complex, and inter- and intra-tumour heterogeneity are important challenges in treating the disease. In this study, we sequence 173 genes in 2,433 primary breast tumours that have copy number aberration (CNA), gene expression and long-term clinical follow-up data. We identify 40 mutation-driver (Mut-driver) genes, and determine associations between mutations, driver CNA profiles, clinical-pathological parameters and survival. We assess the clonal states of Mut-driver mutations, and estimate levels of intra-tumour heterogeneity using mutant-allele fractions. Associations between PIK3CA mutations and reduced survival are identified in three subgroups of ER-positive cancer (defined by amplification of 17q23, 11q13-14 or 8q24). High levels of intra-tumour heterogeneity are in general associated with a worse outcome, but highly aggressive tumours with 11q13-14 amplification have low levels of intra-tumour heterogeneity. These results emphasize the importance of genome-based stratification of breast cancer, and have important implications for designing therapeutic strategies.
Fresh-frozen (FF) tissue is the optimal source of DNA for whole-genome sequencing (WGS) of cancer patients. However, it is not always available, limiting the widespread application of WGS in clinical practice. We explored the viability of using formalin-fixed, paraffin-embedded (FFPE) tissues, available routinely for cancer patients, as a source of DNA for clinical WGS.
We conducted a prospective study using DNAs from matched FF, FFPE, and peripheral blood germ-line specimens collected from 52 cancer patients (156 samples) following routine diagnostic protocols. We compared somatic variants detected in FFPE and matching FF samples.
We found the single-nucleotide variant agreement reached 71% across the genome and somatic copy-number alterations (CNAs) detection from FFPE samples was suboptimal (0.44 median correlation with FF) due to nonuniform coverage. CNA detection was improved significantly with lower reverse crosslinking temperature in FFPE DNA extraction (80 °C or 65 °C depending on the methods). Our final data showed somatic variant detection from FFPE for clinical decision making is possible. We detected 98% of clinically actionable variants (including 30/31 CNAs).
We present the first prospective WGS study of cancer patients using FFPE specimens collected in a routine clinical environment, proving that WGS can be applied in the clinic.
Current diagnostic testing for genetic disorders involves serial use of specialized assays spanning multiple technologies. In principle, genome sequencing (GS) can detect all genomic pathogenic variant types on a single platform. Here we evaluate copy-number variant (CNV) calling as part of a clinically accredited GS test.
We performed analytical validation of CNV calling on 17 reference samples, compared the sensitivity of GS-based variants with those from a clinical microarray, and set a bound on precision using orthogonal technologies. We developed a protocol for family-based analysis of GS-based CNV calls, and deployed this across a clinical cohort of 79 rare and undiagnosed cases.
We found that CNV calls from GS are at least as sensitive as those from microarrays, while only creating a modest increase in the number of variants interpreted (~10 CNVs per case). We identified clinically significant CNVs in 15% of the first 79 cases analyzed, all of which were confirmed by an orthogonal approach. The pipeline also enabled discovery of a uniparental disomy (UPD) and a 50% mosaic trisomy 14. Directed analysis of select CNVs enabled breakpoint level resolution of genomic rearrangements and phasing of de novo CNVs.
Robust identification of CNVs by GS is possible within a clinical testing environment.
The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders of magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.
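The statement that 50% of recombination events fall in less than 10% of the sequence is a concentration measure that can be computed from any recombination map. The sketch below is one way to do that arithmetic, assuming per-interval rates and interval lengths are already available; it does not reproduce the authors' rate-estimation method, and the function name, parameter names and toy numbers are hypothetical.

```python
def sequence_fraction_for_events(rates, lengths, event_fraction=0.5):
    """Smallest fraction of total sequence that accounts for the given
    fraction of recombination events.

    rates   -- recombination rate per unit length for each interval
    lengths -- length of each interval, in the same order
    """
    if not lengths or len(rates) != len(lengths) or sum(lengths) <= 0:
        raise ValueError("rates and lengths must be non-empty and aligned")

    # Visit the hottest intervals first.
    intervals = sorted(zip(rates, lengths), key=lambda x: x[0], reverse=True)
    total_events = sum(rate * length for rate, length in intervals)
    total_length = sum(lengths)
    target = event_fraction * total_events

    events = 0.0
    covered = 0.0
    for rate, length in intervals:
        contribution = rate * length
        if events + contribution >= target:
            # Take only the part of this interval needed to reach the target.
            if contribution > 0:
                covered += length * (target - events) / contribution
            return covered / total_length
        events += contribution
        covered += length
    return 1.0


if __name__ == "__main__":
    # Toy map: one hot interval and nine cold ones of equal length.
    rates = [10.0] + [1.0] * 9
    lengths = [1.0] * 10
    print(sequence_fraction_for_events(rates, lengths))
```

With a perfectly uniform map the function returns the event fraction itself (half the events need half the sequence), so values far below 0.5 indicate that recombination is concentrated in hotspots.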
Biologic heterogeneity is a feature of diffuse large B-cell lymphoma (DLBCL), and the existence of a subgroup with poor prognosis and phenotypic proximity to Burkitt lymphoma is well known. Conventional cytogenetics identifies some patients with rearrangements of MYC and BCL2 and/or BCL6 (double-hit lymphomas) who are increasingly treated with more intensive chemotherapy, but a more biologically coherent and clinically useful definition of this group is required.
We defined a molecular high-grade (MHG) group by applying a gene expression-based classifier to 928 patients with DLBCL from a clinical trial that investigated the addition of bortezomib to standard rituximab plus cyclophosphamide, doxorubicin, vincristine, and prednisone (R-CHOP) therapy. The prognostic significance of MHG was compared with existing biomarkers. We performed targeted sequencing of 70 genes in 400 patients and explored molecular pathology using gene expression signature databases. Findings were validated in an independent data set.
The MHG group comprised 83 patients (9%), with 75 in the cell-of-origin germinal center B-cell-like group. MYC rearranged and double-hit groups were strongly over-represented in MHG but comprised only one half of the total. Gene expression analysis revealed a proliferative phenotype with a relationship to centroblasts. Progression-free survival rate at 36 months after R-CHOP in the MHG group was 37% (95% CI, 24% to 55%) compared with 72% (95% CI, 68% to 77%) for others, and an analysis of treatment effects suggested a possible positive effect of bortezomib. Double-hit lymphomas lacking the MHG signature showed no evidence of worse outcome than other germinal center B-cell-like cases.
MHG defines a biologically coherent high-grade B-cell lymphoma group with distinct molecular features and clinical outcomes that effectively doubles the size of the poor-prognosis, double-hit group. Patients with MHG may benefit from intensified chemotherapy or novel targeted therapies.