Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
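As a rough illustration of the de Bruijn graph idea mentioned in the abstract above, the following minimal sketch builds a graph from short reads and walks one unbranched path. It is not Trinity's algorithm: the function names, the toy reads and the choice of k = 4 are all assumptions made for illustration, and real assemblers additionally handle sequencing errors, coverage and branching.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a sequence from `start` while each node has exactly one successor."""
    path = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        path += node[-1]
    return path


# Hypothetical overlapping reads, chosen only to keep the example small.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

In this toy case the three reads overlap without ambiguity, so the walk recovers a single sequence containing all of them. Alternative isoforms or duplicated genes would appear as branches in the graph, which this sketch deliberately does not resolve.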
Marine stickleback fish have colonized and adapted to thousands of streams and lakes formed since the last ice age, providing an exceptional opportunity to characterize genomic mechanisms underlying repeated ecological adaptation in nature. Here we develop a high-quality reference genome assembly for threespine sticklebacks. By sequencing the genomes of twenty additional individuals from a global set of marine and freshwater populations, we identify a genome-wide set of loci that are consistently associated with marine-freshwater divergence. Our results indicate that reuse of globally shared standing genetic variation, including chromosomal inversions, has an important role in repeated evolution of distinct marine and freshwater sticklebacks, and in the maintenance of divergent ecotypes during early stages of reproductive isolation. Both coding and regulatory changes occur in the set of loci underlying marine-freshwater evolution, but regulatory changes appear to predominate in this well known example of repeated adaptive evolution in nature.
Motivation: Comparative genomics heavily relies on alignments of large and often complex DNA sequences. From an engineering perspective, the problem here is to provide maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). Results: Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous ‘battleship’-like search that allows for aligning two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine. Availability: Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/ Contact: grabherr@broadinstitute.org
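To make strategy (i) above more concrete, here is a minimal sketch of cross-correlating two DNA sequences with a fast Fourier transform. It is an illustration under stated assumptions, not Satsuma's implementation: the one-indicator-signal-per-base encoding, the small recursive FFT and the names `fft` and `match_counts` are all choices made for this example, and the match scoring and search strategies from the abstract are not modelled.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts(a, b):
    """Return {shift: identical bases} where b[j] is compared with a[j + shift]."""
    size = 1
    while size < len(a) + len(b):  # zero-pad so circular shifts do not overlap
        size *= 2
    totals = [0.0] * size
    for base in "ACGT":
        sig_a = [1.0 if c == base else 0.0 for c in a] + [0.0] * (size - len(a))
        sig_b = [1.0 if c == base else 0.0 for c in b] + [0.0] * (size - len(b))
        fa, fb = fft(sig_a), fft(sig_b)
        # Correlation theorem: corr = IFFT(FFT(a) * conj(FFT(b))).
        corr = fft([x * y.conjugate() for x, y in zip(fa, fb)], invert=True)
        for i, value in enumerate(corr):
            totals[i] += value.real / size  # apply the inverse-FFT normalisation
    return {s: round(totals[s % size]) for s in range(-(len(b) - 1), len(a))}


# Hypothetical sequences: b occurs exactly in a starting at position 2.
counts = match_counts("TTACGTACGG", "ACGTAC")
best_shift = max(counts, key=counts.get)
print(best_shift, counts[best_shift])
```

The point of the transform is that all relative shifts are scored in one pass, at O(n log n) cost rather than the O(n·m) of comparing every offset directly.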
Lymphoma is the most common hematological malignancy in developed countries. Outcome is strongly determined by molecular subtype, reflecting a need for new and improved treatment options. Dogs spontaneously develop lymphoma, and the predisposition of certain breeds indicates genetic risk factors. Using the dog breed structure, we selected three lymphoma predisposed breeds developing primarily T-cell (boxer), primarily B-cell (cocker spaniel), and with equal distribution of B- and T-cell lymphoma (golden retriever), respectively. We investigated the somatic mutations in B- and T-cell lymphomas from these breeds by exome sequencing of tumor and normal pairs. Strong similarities were evident between B-cell lymphomas from golden retrievers and cocker spaniels, with recurrent mutations in TRAF3-MAP3K14 (28% of all cases), FBXW7 (25%), and POT1 (17%). The FBXW7 mutations recurrently occur in a specific codon; the corresponding codon is recurrently mutated in human cancer. In contrast, T-cell lymphomas from the predisposed breeds, boxers and golden retrievers, show little overlap in their mutation pattern, sharing only one of their 15 most recurrently mutated genes. Boxers, which develop aggressive T-cell lymphomas, are typically mutated in the PTEN-mTOR pathway. T-cell lymphomas in golden retrievers are often less aggressive, and their tumors typically showed mutations in genes involved in cellular metabolism. We identify genes with known involvement in human lymphoma and leukemia, genes implicated in other human cancers, as well as novel genes that could allow new therapeutic options.
Arrhythmogenic right ventricular cardiomyopathy (ARVC) is a familial cardiac disease characterized by ventricular arrhythmias and sudden cardiac death. It is most frequently inherited as an autosomal dominant trait with incomplete and age-related penetrance and variable clinical expression. The human disease is most commonly associated with a causative mutation in one of several genes encoding desmosomal proteins. We have previously described a spontaneous canine model of ARVC in the boxer dog. We phenotyped adult boxer dogs for ARVC by performing physical examination, echocardiogram and ambulatory electrocardiogram. Genome-wide association using the canine 50k SNP array identified several regions of association, of which the strongest resided on chromosome 17. Fine mapping and direct DNA sequencing identified an 8-bp deletion in the 3′ untranslated region (UTR) of the Striatin gene on chromosome 17 in association with ARVC in the boxer dog. Evaluation of the secondary structure of the 3′ UTR demonstrated that the deletion affects a stem loop structure of the mRNA and expression analysis identified a reduction in Striatin mRNA. Dogs that were homozygous for the deletion had a more severe form of disease based on a significantly higher number of ventricular premature complexes. Immunofluorescence studies localized Striatin to the intercalated disc region of the cardiac myocyte and co-localized it to three desmosomal proteins, Plakophilin-2, Plakoglobin and Desmoplakin, all involved in the pathogenesis of ARVC in human beings. We suggest that Striatin may serve as a novel candidate gene for human ARVC.
Hereditary periodic fever syndromes are characterized by recurrent episodes of fever and inflammation with no known pathogenic or autoimmune cause. In humans, several genes have been implicated in this group of diseases, but the majority of cases remain unexplained. A similar periodic fever syndrome is relatively frequent in the Chinese Shar-Pei breed of dogs. In the western world, Shar-Pei have been strongly selected for a distinctive thick and heavily folded skin. In this study, a mutation affecting both these traits was identified. Using genome-wide SNP analysis of Shar-Pei and other breeds, the strongest signal of a breed-specific selective sweep was located on chromosome 13. The same region also harbored the strongest genome-wide association (GWA) signal for susceptibility to the periodic fever syndrome (p(raw) = 2.3 × 10⁻⁶, p(genome) = 0.01). Dense targeted resequencing revealed two partially overlapping duplications, 14.3 Kb and 16.1 Kb in size, unique to Shar-Pei and upstream of the Hyaluronic Acid Synthase 2 (HAS2) gene. HAS2 encodes the rate-limiting enzyme synthesizing hyaluronan (HA), a major component of the skin. HA is up-regulated and accumulates in the thickened skin of Shar-Pei. A high copy number of the 16.1 Kb duplication was associated with an increased expression of HAS2 as well as the periodic fever syndrome (p < 0.0001). When fragmented, HA can act as a trigger of the innate immune system and stimulate sterile fever and inflammation. The strong selection for the skin phenotype therefore appears to enrich for a pleiotropic mutation predisposing these dogs to a periodic fever syndrome. The identification of HA as a major risk factor for this canine disease raises the potential of this glycosaminoglycan as a risk factor for human periodic fevers and as an important driver of chronic inflammation.
Dogs, with their breed-determined limited genetic background, are great models of human disease including cancer. Canine B-cell lymphoma and hemangiosarcoma are both malignancies of the hematologic system that are clinically and histologically similar to human B-cell non-Hodgkin lymphoma and angiosarcoma, respectively. Golden retrievers in the US show significantly elevated lifetime risk for both B-cell lymphoma (6%) and hemangiosarcoma (20%). We conducted genome-wide association studies for hemangiosarcoma and B-cell lymphoma, identifying two shared predisposing loci. The two associated loci are located on chromosome 5, and together contribute ~20% of the risk of developing these cancers. Genome-wide p-values for the top SNP of each locus are 4.6 × 10⁻⁷ and 2.7 × 10⁻⁶, respectively. Whole genome resequencing of nine cases and controls followed by genotyping and detailed analysis identified three shared and one B-cell lymphoma specific risk haplotypes within the two loci, but no coding changes were associated with the risk haplotypes. Gene expression analysis of B-cell lymphoma tumors revealed that carrying the risk haplotypes at the first locus is associated with down-regulation of several nearby genes including the proximal gene TRPC6, a transient receptor Ca²⁺-channel involved in T-cell activation, among other functions. The shared risk haplotype in the second locus overlaps the vesicle transport and release gene STX8. Carrying the shared risk haplotype is associated with gene expression changes of 100 genes enriched for pathways involved in immune cell activation. Thus, the predisposing germ-line mutations in B-cell lymphoma and hemangiosarcoma appear to be regulatory, and affect pathways involved in T-cell mediated immune response in the tumor. This suggests that the interaction between the immune system and malignant cells plays a common role in the tumorigenesis of these relatively different cancers.
Phenomena such as incomplete lineage sorting, horizontal gene transfer, gene duplication and subsequent sub- and neo-functionalisation can result in distinct local phylogenetic relationships that are discordant with species phylogeny. In order to assess the possible biological roles for these subdivisions, they must first be identified and characterised, preferably on a large scale and in an automated fashion.
We developed Saguaro, a combination of a Hidden Markov Model (HMM) and a Self Organising Map (SOM), to characterise local phylogenetic relationships among aligned sequences using cacti, matrices of pair-wise distance measures. While the HMM determines the genomic boundaries from aligned sequences, the SOM hypothesises new cacti in an unsupervised and iterative fashion based on the regions that were modelled least well by existing cacti. After testing the software on simulated data, we demonstrate the utility of Saguaro by testing two different data sets: (i) 181 Dengue virus strains, and (ii) 5 primate genomes. Saguaro identifies regions under lineage-specific constraint for the first set, and genomic segments that we attribute to incomplete lineage sorting in the second dataset. Intriguingly for the primate data, Saguaro also classified an additional ~3% of the genome as most incompatible with the expected species phylogeny. A substantial fraction of these regions was found to overlap genes associated with both the innate and adaptive immune systems.
Saguaro detects distinct cacti describing local phylogenetic relationships without requiring any a priori hypotheses. We have successfully demonstrated Saguaro's utility with two contrasting data sets, one containing many members with short sequences (Dengue viral strains: n = 181, genome size = 10,700 nt), and the other with few members but complex genomes (related primate species: n = 5, genome size = 3 Gb), suggesting that the software is applicable to a wide variety of experimental populations. Saguaro is written in C++, runs on the Linux operating system, and can be downloaded from http://saguarogw.sourceforge.net/.
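The abstract above describes cacti as matrices of pair-wise distance measures that summarise local phylogenetic relationships. A minimal sketch of that idea, assuming a simple per-site mismatch fraction as the distance (the measure Saguaro actually uses is not stated here), is shown below. The function name, the toy alignment and the fixed window boundaries are hypothetical. In Saguaro the boundaries come from the HMM and new cacti are proposed by the SOM, neither of which is reproduced here.

```python
def pairwise_distances(aligned, start, end):
    """Fraction of mismatching sites between every pair of aligned sequences
    within the half-open window [start, end)."""
    n = len(aligned)
    width = end - start
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = sum(
                1
                for x, y in zip(aligned[i][start:end], aligned[j][start:end])
                if x != y
            )
            matrix[i][j] = matrix[j][i] = diffs / width
    return matrix


# Hypothetical alignment in which the closest pair changes between windows.
aligned = ["ACGTACGT", "ACGTTCGA", "ACCTTCGA"]
left = pairwise_distances(aligned, 0, 4)
right = pairwise_distances(aligned, 4, 8)
for row in left + right:
    print(row)
```

In the left window sequences 0 and 1 are identical, while in the right window sequences 1 and 2 are. Two different distance matrices therefore describe the two regions, which is the kind of local discordance the method is designed to detect.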
Population-based newborn screening (NBS) allows early detection and treatment of inherited disorders. For certain medically-actionable conditions, however, NBS is limited by the absence of reliable biochemical signatures amenable to detection by current platforms. We sought to assess the analytic validity of an ATP7A targeted next generation DNA sequencing assay as a potential newborn screen for one such disorder, Menkes disease.
Dried blood spots from control or Menkes disease subjects (n = 22) were blindly analyzed for pathogenic variants in the copper transport gene, ATP7A. The analytical method was optimized to minimize cost and provide rapid turnaround time.
The algorithm correctly identified pathogenic ATP7A variants, including missense, nonsense, small insertions/deletions, and large copy number variants, in 21/22 (95.5%) of subjects, one of whom had inconclusive diagnostic sequencing previously. For one false negative that also had not been detected by commercial molecular laboratories, we identified a deep intronic variant that impaired ATP7A mRNA splicing.
Our results support proof-of-concept that primary DNA-based NBS would accurately detect Menkes disease, a disorder that fulfills Wilson and Jungner screening criteria and for which biochemical NBS is unavailable. Targeted next generation sequencing for NBS would enable improved Menkes disease clinical outcomes, establish a platform for early identification of other unscreened disorders, and complement current NBS by providing immediate data for molecular confirmation of numerous biochemically screened conditions.
We previously described the whole-genome assembly program Arachne, presenting assemblies of simulated data for small to mid-sized genomes. Here we describe algorithmic adaptations to the program, allowing for assembly of mammalian-size genomes, and also improving the assembly of smaller genomes. Three principal changes were simultaneously made and applied to the assembly of the mouse genome, during a six-month period of development: (1) Supercontigs (scaffolds) were iteratively broken and rejoined using several criteria, yielding a 64-fold increase in length (N50), and apparent elimination of all global misjoins; (2) gaps between contigs in supercontigs were filled (partially or completely) by insertion of reads, as suggested by pairing within the supercontig, increasing the N50 contig length by 50%; (3) memory usage was reduced fourfold. The outcome of this mouse assembly and its analysis are described in (Mouse Genome Sequencing Consortium 2002).
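The N50 statistic quoted above is commonly defined as the length L such that contigs (or supercontigs) of length at least L together contain at least half of the assembled bases. The sketch below uses that common definition. The function name and the example lengths are hypothetical, and the abstract does not say whether Arachne computes N50 in exactly this way.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 requires at least one length")


# Hypothetical contig lengths: total 290, so half is 145.
# The running sum reaches 150 at the second-longest contig.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # → 70
```

Because N50 is weighted by length, joining a few long pieces moves it much more than joining many short ones.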