Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions.
By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10⁻⁵ to 10⁻⁴, which is 10- to 100-fold lower than generally considered achievable (10⁻³) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution types, ranging from 10⁻⁵ for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10⁻⁴ for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at 0.1% to 0.01% frequency with the current NGS technology by applying in silico error suppression.
We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.
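To make the per-substitution-type error rates in the abstract above concrete, the following is a minimal sketch of how such rates could be tallied from mismatch counts. It is not the authors' pipeline: the function names `substitution_class` and `error_rates`, the input layout, and the assumption that all listed positions are truly homozygous reference are hypothetical and chosen only for illustration. Complementary changes (for example A>C and T>G) are pooled because they cannot be told apart without strand information, which is why the abstract reports them as pairs.

```python
from collections import Counter

# Watson-Crick complements, used to pool strand-equivalent substitutions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Return the strand-collapsed label, e.g. ('T', 'G') -> 'A>C/T>G'."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    forward = f"{ref}>{alt}"
    reverse = f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"
    return "/".join(sorted((forward, reverse)))


def error_rates(observations, ref_depth):
    """Estimate an error rate for each collapsed substitution class.

    observations: iterable of (ref, alt, count) mismatching base calls,
        assumed (hypothetically) to come from positions known to be
        homozygous reference, so every mismatch is an error.
    ref_depth: dict mapping each reference base to the total number of
        base calls sequenced over positions with that reference base.
    """
    errors = Counter()
    for ref, alt, count in observations:
        errors[substitution_class(ref, alt)] += count

    rates = {}
    for label, n_errors in errors.items():
        # The denominator pools depth over both reference bases in the pair.
        ref_bases = {change[0] for change in label.split("/")}
        total_depth = sum(ref_depth[base] for base in ref_bases)
        rates[label] = n_errors / total_depth
    return rates


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    toy_observations = [("A", "C", 3), ("T", "G", 2), ("C", "T", 40)]
    toy_depth = {"A": 250_000, "T": 250_000, "C": 200_000, "G": 200_000}
    for label, rate in sorted(error_rates(toy_observations, toy_depth).items()):
        print(f"{label}: {rate:.1e}")
```

The toy numbers are invented; real estimates would also need base-quality filtering and a way to exclude true variants, neither of which is modeled here.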
Neuroblastoma is a pediatric malignancy with heterogeneous clinical outcomes. To better understand neuroblastoma pathogenesis, here we analyze whole-genome, whole-exome and/or transcriptome data from 702 neuroblastoma samples. Forty percent of samples harbor at least one recurrent driver gene alteration and most aberrations, including MYCN, ATRX, and TERT alterations, differ in frequency by age. MYCN alterations occur at median 2.3 years of age, TERT at 3.8 years, and ATRX at 5.6 years. COSMIC mutational signature 18, previously associated with reactive oxygen species, is the most common cause of driver point mutations in neuroblastoma, including most ALK and Ras-activating variants. Signature 18 appears early and is continuous throughout disease evolution. Signature 18 is enriched in neuroblastomas with MYCN amplification, 17q gain, and increased expression of mitochondrial ribosome and electron transport-associated genes. Recurrent FGFR1 variants in six patients, and ALK N-terminal structural alterations in five samples, identify additional patients potentially amenable to precision therapy.
Pediatric high-grade glioma (HGG) is a devastating disease with a less than 20% survival rate 2 years after diagnosis. We analyzed 127 pediatric HGGs, including diffuse intrinsic pontine gliomas (DIPGs) and non-brainstem HGGs (NBS-HGGs), by whole-genome, whole-exome and/or transcriptome sequencing. We identified recurrent somatic mutations in ACVR1 exclusively in DIPGs (32%), in addition to previously reported frequent somatic mutations in histone H3 genes, TP53 and ATRX, in both DIPGs and NBS-HGGs. Structural variants generating fusion genes were found in 47% of DIPGs and NBS-HGGs, with recurrent fusions involving the neurotrophin receptor genes NTRK1, NTRK2 and NTRK3 in 40% of NBS-HGGs in infants. Mutations targeting receptor tyrosine kinase-RAS-PI3K signaling, histone modification or chromatin remodeling, and cell cycle regulation were found in 68%, 73% and 59% of pediatric HGGs, respectively, including in DIPGs and NBS-HGGs. This comprehensive analysis provides insights into the unique and shared pathways driving pediatric HGG within and outside the brainstem.
Infant acute lymphoblastic leukemia (ALL) with MLL rearrangements (MLL-R) represents a distinct leukemia with a poor prognosis. To define its mutational landscape, we performed whole-genome, exome, RNA and targeted DNA sequencing on 65 infants (47 MLL-R and 18 non-MLL-R cases) and 20 older children (MLL-R cases) with leukemia. Our data show that infant MLL-R ALL has one of the lowest frequencies of somatic mutations of any sequenced cancer, with the predominant leukemic clone carrying a mean of 1.3 non-silent mutations. Despite this paucity of mutations, we detected activating mutations in kinase-PI3K-RAS signaling pathway components in 47% of cases. Surprisingly, these mutations were often subclonal and were frequently lost at relapse. In contrast to infant cases, MLL-R leukemia in older children had more somatic mutations (mean of 6.5 mutations/case versus 1.3 mutations/case, P = 7.15 × 10⁻⁵) and had frequent mutations (45%) in epigenetic regulators, a category of genes that, with the exception of MLL, was rarely mutated in infant MLL-R ALL.
To study the mechanisms of relapse in acute lymphoblastic leukemia (ALL), we performed whole-genome sequencing of 103 diagnosis-relapse-germline trios and ultra-deep sequencing of 208 serial samples in 16 patients. Relapse-specific somatic alterations were enriched in 12 genes (NR3C1, NR3C2, TP53, NT5C2, FPGS, CREBBP, MSH2, MSH6, PMS2, WHSC1, PRPS1, and PRPS2) involved in drug response. Their prevalence was 17% in very early relapse (<9 months from diagnosis), 65% in early relapse (9-36 months), and 32% in late relapse (>36 months) groups. Convergent evolution, in which multiple subclones harbor mutations in the same drug resistance gene, was observed in 6 relapses and confirmed by single-cell sequencing in 1 case. Mathematical modeling and mutational signature analysis indicated that early relapse resistance acquisition was frequently a 2-step process in which a persistent clone survived initial therapy and later acquired bona fide resistance mutations during therapy. In contrast, very early relapses arose from preexisting resistant clone(s). Two novel relapse-specific mutational signatures, one of which was caused by thiopurine treatment based on in vitro drug exposure experiments, were identified in early and late relapses but were absent from 2540 pan-cancer diagnosis samples and 129 non-ALL relapses. The novel signatures were detected in 27% of relapsed ALLs and were responsible for 46% of acquired resistance mutations in NT5C2, PRPS1, NR3C1, and TP53. These results suggest that chemotherapy-induced drug resistance mutations facilitate a subset of pediatric ALL relapses.
• Chemotherapy-induced mutagenesis may cause drug resistance mutations in ALL, leading to relapse.
• Thiopurines in particular likely cause drug resistance mutations in NT5C2, NR3C1, and TP53.
Members of the nuclear factor-κB (NF-κB) family of transcriptional regulators are central mediators of the cellular inflammatory response. Although constitutive NF-κB signalling is present in most human tumours, mutations in pathway members are rare, complicating efforts to understand and block aberrant NF-κB activity in cancer. Here we show that more than two-thirds of supratentorial ependymomas contain oncogenic fusions between RELA, the principal effector of canonical NF-κB signalling, and an uncharacterized gene, C11orf95. In each case, C11orf95-RELA fusions resulted from chromothripsis involving chromosome 11q13.1. C11orf95-RELA fusion proteins translocated spontaneously to the nucleus to activate NF-κB target genes, and rapidly transformed neural stem cells (the cell of origin of ependymoma) to form these tumours in mice. Our data identify a highly recurrent genetic alteration of RELA in human cancer, and the C11orf95-RELA fusion protein as a potential therapeutic target in supratentorial ependymoma.
The most common pediatric brain tumors are low-grade gliomas (LGGs). We used whole-genome sequencing to identify multiple new genetic alterations involving BRAF, RAF1, FGFR1, MYB, MYBL1 and genes with histone-related functions, including H3F3A and ATRX, in 39 LGGs and low-grade glioneuronal tumors (LGGNTs). Only a single non-silent somatic alteration was detected in 24 of 39 (62%) tumors. Intragenic duplications of the portion of FGFR1 encoding the tyrosine kinase domain (TKD) and rearrangements of MYB were recurrent and mutually exclusive in 53% of grade II diffuse LGGs. Transplantation of Trp53-null neonatal astrocytes expressing FGFR1 with the duplication involving the TKD into the brains of nude mice generated high-grade astrocytomas with short latency and 100% penetrance. FGFR1 with the duplication induced FGFR1 autophosphorylation and upregulation of the MAPK/ERK and PI3K pathways, which could be blocked by specific inhibitors. Focusing on the therapeutically challenging diffuse LGGs, our study of 151 tumors has discovered genetic alterations and potential therapeutic targets across the entire range of pediatric LGGs and LGGNTs.
Single-cell RNA sequencing (scRNA-seq) is a powerful tool in basic and translational biological research for characterizing cell-to-cell variation and cellular dynamics in populations that otherwise appear homogeneous. However, significant challenges arise in the analysis of scRNA-seq data, including a low signal-to-noise ratio with high data sparsity, potential batch effects, and scalability problems when hundreds of thousands of cells are to be analyzed, among others. The inherent complexities of scRNA-seq data and the dynamic nature of cellular processes lead to suboptimal performance of many currently available algorithms, even for basic tasks such as identifying biologically meaningful heterogeneous subpopulations. In this study, we developed Latent Cellular Analysis (LCA), a machine learning-based analytical pipeline that combines cosine-similarity measurement by latent cellular states with a graph-based clustering algorithm. LCA provides heuristic solutions for population number inference, dimension reduction, feature selection, and control of technical variations without explicit gene filtering. We show that LCA is robust, accurate, and powerful by comparison with multiple state-of-the-art computational methods when applied to large-scale real and simulated scRNA-seq data. Importantly, the ability of LCA to learn from representative subsets of the data provides scalability, thereby addressing a significant challenge posed by growing sample sizes in scRNA-seq data analysis.
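The general idea named in the abstract above, measuring cosine similarity between cells in a latent space and then clustering on a graph, can be sketched in a few lines. This is not the LCA implementation: the abstract does not specify how latent states are derived or which graph algorithm is used, so the sketch assumes a truncated SVD of log-transformed counts for the latent states and substitutes connected components of a thresholded similarity graph as a deliberately simple stand-in for graph-based clustering. All function names and the `threshold` parameter are hypothetical.

```python
import numpy as np


def latent_states(expr: np.ndarray, n_components: int) -> np.ndarray:
    """Project cells (rows) onto the top singular vectors.

    Assumption: log1p transform and per-gene centering before the SVD.
    """
    logged = np.log1p(expr)
    centered = logged - logged.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def cosine_similarity(latent: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `latent`."""
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T


def graph_clusters(sim: np.ndarray, threshold: float) -> np.ndarray:
    """Label connected components of the graph that links two cells
    whenever their similarity is at least `threshold` (a stand-in for a
    real graph-clustering algorithm)."""
    n_cells = sim.shape[0]
    labels = np.full(n_cells, -1, dtype=int)
    current = 0
    for start in range(n_cells):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for neighbor in np.flatnonzero(sim[node] >= threshold):
                if labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(neighbor)
        current += 1
    return labels


if __name__ == "__main__":
    # Toy matrix (cells x genes), invented for illustration: two populations.
    toy = np.array([
        [100, 90, 1, 0], [110, 95, 0, 1], [95, 105, 1, 1],
        [1, 0, 100, 90], [0, 1, 95, 110], [1, 1, 105, 95],
    ], dtype=float)
    sim = cosine_similarity(latent_states(toy, n_components=2))
    print(graph_clusters(sim, threshold=0.5))
```

Cosine similarity ignores vector length, so cells that differ mainly in overall magnitude along the latent axes still score as similar, which is one plausible reason to prefer it over Euclidean distance for sparse count data.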
Pediatric osteosarcoma is characterized by multiple somatic chromosomal lesions, including structural variations (SVs) and copy number alterations (CNAs). To define the landscape of somatic mutations in pediatric osteosarcoma, we performed whole-genome sequencing of DNA from 20 osteosarcoma tumor samples and matched normal tissue in a discovery cohort, as well as 14 samples in a validation cohort. Single-nucleotide variations (SNVs) exhibited a pattern of localized hypermutation called kataegis in 50% of the tumors. We identified p53 pathway lesions in all tumors in the discovery cohort, nine of which were translocations in the first intron of the TP53 gene. Beyond TP53, the RB1, ATRX, and DLG2 genes showed recurrent somatic alterations in 29%-53% of the tumors. These data highlight the power of whole-genome sequencing for identifying recurrent somatic alterations in cancer genomes that may be missed using other methods.
Acute megakaryoblastic leukemia (AMKL) is a subtype of acute myeloid leukemia (AML) in which cells morphologically resemble abnormal megakaryoblasts. While rare in adults, AMKL accounts for 4-15% of newly diagnosed childhood AML cases. AMKL in individuals without Down syndrome (non-DS-AMKL) is frequently associated with poor clinical outcomes. Previous efforts have identified chimeric oncogenes in a substantial number of non-DS-AMKL cases, including RBM15-MKL1, CBFA2T3-GLIS2, KMT2A gene rearrangements, and NUP98-KDM5A. However, the etiology of 30-40% of cases remains unknown. To better understand the genomic landscape of non-DS-AMKL, we performed RNA and exome sequencing on specimens from 99 patients (75 pediatric and 24 adult). We demonstrate that pediatric non-DS-AMKL is a heterogeneous malignancy that can be divided into seven subgroups with varying outcomes. These subgroups are characterized by chimeric oncogenes with cooperating mutations in epigenetic and kinase signaling genes. Overall, these data shed light on the etiology of AMKL and provide useful information for the tailoring of treatment.