To further our understanding of the genetic etiology of autism, we generated and analyzed genome sequence data from 516 idiopathic autism families (2,064 individuals). This resource includes >59 ...million single-nucleotide variants (SNVs) and 9,212 private copy number variants (CNVs), of which 133,992 and 88 are de novo mutations (DNMs), respectively. We estimate a mutation rate of ∼1.5 × 10−8 SNVs per site per generation with a significantly higher mutation rate in repetitive DNA. Comparing probands and unaffected siblings, we observe several DNM trends. Probands carry more gene-disruptive CNVs and SNVs, resulting in severe missense mutations and mapping to predicted fetal brain promoters and embryonic stem cell enhancers. These differences become more pronounced for autism genes (p = 1.8 × 10−3, OR = 2.2). Patients are more likely to carry multiple coding and noncoding DNMs in different genes, which are enriched for expression in striatal neurons (p = 3 × 10−3), suggesting a path forward for genetically characterizing more complex cases of autism.
Display omitted
•Comprehensive CNV/SNV dataset from whole-genome sequencing of 516 autism families•Estimated human germline mutation rate of ∼1.5 × 10−8 substitutions/site/generation•Autism probands enriched for de novo missense, promoter, and enhancer mutations•Oligogenic de novo mutation signals for genes enriched in striatal neuron expression
Genomic analysis of 516 families with an autistic child and an unaffected sibling suggests that simplex autism results from de novo mutation and is oligogenic.
Cancer genomes often harbor hundreds of somatic DNA rearrangement junctions, many of which cannot be easily classified into simple (e.g., deletion) or complex (e.g., chromothripsis) structural ...variant classes. Applying a novel genome graph computational paradigm to analyze the topology of junction copy number (JCN) across 2,778 tumor whole-genome sequences, we uncovered three novel complex rearrangement phenomena: pyrgo, rigma, and tyfonas. Pyrgo are "towers" of low-JCN duplications associated with early-replicating regions, superenhancers, and breast or ovarian cancers. Rigma comprise "chasms" of low-JCN deletions enriched in late-replicating fragile sites and gastrointestinal carcinomas. Tyfonas are "typhoons" of high-JCN junctions and fold-back inversions associated with expressed protein-coding fusions, breakend hypermutation, and acral, but not cutaneous, melanomas. Clustering of tumors according to genome graph-derived features identified subgroups associated with DNA repair defects and poor prognosis.
Coronavirus disease 2019 (COVID-19) is caused by SARS-CoV-2, an emerging virus that utilizes host proteins ACE2 and TMPRSS2 as entry factors. Understanding the factors affecting the pattern and ...levels of expression of these genes is important for deeper understanding of SARS-CoV-2 tropism and pathogenesis. Here we explore the role of genetics and co-expression networks in regulating these genes in the airway, through the analysis of nasal airway transcriptome data from 695 children. We identify expression quantitative trait loci for both ACE2 and TMPRSS2, that vary in frequency across world populations. We find TMPRSS2 is part of a mucus secretory network, highly upregulated by type 2 (T2) inflammation through the action of interleukin-13, and that the interferon response to respiratory viruses highly upregulates ACE2 expression. IL-13 and virus infection mediated effects on ACE2 expression were also observed at the protein level in the airway epithelium. Finally, we define airway responses to common coronavirus infections in children, finding that these infections generate host responses similar to other viral species, including upregulation of IL6 and ACE2. Our results reveal possible mechanisms influencing SARS-CoV-2 infectivity and COVID-19 clinical outcomes.
Rose-comb, a classical monogenic trait of chickens, is characterized by a drastically altered comb morphology compared to the single-combed wild-type. Here we show that Rose-comb is caused by a 7.4 ...Mb inversion on chromosome 7 and that a second Rose-comb allele arose by unequal crossing over between a Rose-comb and wild-type chromosome. The comb phenotype is caused by the relocalization of the MNR2 homeodomain protein gene leading to transient ectopic expression of MNR2 during comb development. We also provide a molecular explanation for the first example of epistatic interaction reported by Bateson and Punnett 104 years ago, namely that walnut-comb is caused by the combined effects of the Rose-comb and Pea-comb alleles. Transient ectopic expression of MNR2 and SOX5 (causing the Pea-comb phenotype) occurs in the same population of mesenchymal cells and with at least partially overlapping expression in individual cells in the comb primordium. Rose-comb has pleiotropic effects, as homozygosity in males has been associated with poor sperm motility. We postulate that this is caused by the disruption of the CCDC108 gene located at one of the inversion breakpoints. CCDC108 is a poorly characterized protein, but it contains a MSP (major sperm protein) domain and is expressed in testis. The study illustrates several characteristic features of the genetic diversity present in domestic animals, including the evolution of alleles by two or more consecutive mutations and the fact that structural changes have contributed to fast phenotypic evolution.
Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years. These data are more valuable in aggregate: joint analysis of genomes from many sources ...increases sample size and statistical power. A central challenge for joint analysis is that different WGS data processing pipelines cause substantial differences in variant calling in combined datasets, necessitating computationally expensive reprocessing. This approach is no longer tenable given the scale of current studies and data volumes. Here, we define WGS data processing standards that allow different groups to produce functionally equivalent (FE) results, yet still innovate on data processing pipelines. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results and produce significantly less variability than sequencing replicates. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for community-wide human genetics studies.
Marine stickleback fish have colonized and adapted to thousands of streams and lakes formed since the last ice age, providing an exceptional opportunity to characterize genomic mechanisms underlying ...repeated ecological adaptation in nature. Here we develop a high-quality reference genome assembly for threespine sticklebacks. By sequencing the genomes of twenty additional individuals from a global set of marine and freshwater populations, we identify a genome-wide set of loci that are consistently associated with marine-freshwater divergence. Our results indicate that reuse of globally shared standing genetic variation, including chromosomal inversions, has an important role in repeated evolution of distinct marine and freshwater sticklebacks, and in the maintenance of divergent ecotypes during early stages of reproductive isolation. Both coding and regulatory changes occur in the set of loci underlying marine-freshwater evolution, but regulatory changes appear to predominate in this well known example of repeated adaptive evolution in nature.
Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo ...population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants are crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes due to the presence of extensive genetic variability, contaminants, and variable sequence coverage.
We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base calling accuracy. VICUNA program is publicly available at: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/ viral-genomics-analysis-software.
We developed VICUNA, a publicly available software tool, that enables consensus assembly of ultra-deep sequence derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets such as metagenomic or tumor cell population samples may prove beneficial in these fields of research.
Motivation: Comparing the genomes from closely related species provides a powerful tool to identify functional elements in a reference genome. Many methods have been developed to identify conserved ...sequences across species; however, existing methods only model conservation as a decrease in the rate of mutation and have ignored selection acting on the pattern of mutations. Results: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we identify as much as 10.2% of these regions are under selection. Availability: The algorithms are implemented in a Java software package, called SiPhy, freely available at http://www.broadinstitute.org/science/software/. Contact: xhx@ics.uci.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Tilapia are an important global food source due to their omnivorous diet, tolerance for high-density aquaculture, and relative disease resistance. Since 2009, tilapia aquaculture has been threatened ...by mass die-offs in farmed fish in Israel and Ecuador. Here we report evidence implicating a novel orthomyxo-like virus in these outbreaks. The tilapia lake virus (TiLV) has a 10-segment, negative-sense RNA genome. The largest segment, segment 1, contains an open reading frame with weak sequence homology to the influenza C virus PB1 subunit. The other nine segments showed no homology to other viruses but have conserved, complementary sequences at their 5' and 3' termini, consistent with the genome organization found in other orthomyxoviruses. In situ hybridization indicates TiLV replication and transcription at sites of pathology in the liver and central nervous system of tilapia with disease.
The economic impact of worldwide trade in tilapia is estimated at $7.5 billion U.S. dollars (USD) annually. The infectious agent implicated in mass tilapia die-offs in two continents poses a threat to the global tilapia industry, which not only provides inexpensive dietary protein but also is a major employer in the developing world. Here we report characterization of the causative agent as a novel orthomyxo-like virus, tilapia lake virus (TiLV). We also describe complete genomic and protein sequences that will facilitate TiLV detection and containment and enable vaccine development.
Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled ...haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.