RNA-seq has shown huge potential for phylogenomic inferences in non-model organisms. However, error, incompleteness, and redundant assembled transcripts for each gene in de novo assembly of short ...reads cause noise in analyses and a large amount of missing data in the aligned matrix. To address these problems, we compare de novo assemblies of paired end 90 bp RNA-seq reads using Oases, Trinity, Trans-ABySS and SOAPdenovo-Trans to transcripts from genome annotation of the model plant Ricinus communis. By doing so we evaluate strategies for optimizing total gene coverage and minimizing assembly chimeras and redundancy.
We found that the frequency and structure of chimeras vary dramatically among different software packages. The differences were largely due to the number of trans-self chimeras that contain repeats in the opposite direction. More than half of the total chimeras in Oases and Trinity were trans-self chimeras. Within each package, we found a trade-off between maximizing reference coverage and minimizing redundancy and chimera rate. In order to reduce redundancy, we investigated three methods: 1) using cap3 and CD-HIT-EST to combine highly similar transcripts, 2) only retaining the transcript with the highest read coverage, or removing the transcript with the lowest read coverage for each subcomponent in Trinity, and 3) filtering Oases single k-mer assemblies by number of transcripts per locus and relative transcript length, and then finding the transcript with the highest read coverage. We then utilized results from blastx against model protein sequences to effectively remove trans chimeras. After optimization, seven assembly strategies among all four packages successfully assembled 42.9-47.1% of reference genes to more than 200 bp, with a chimera rate of 0.92-2.21%, and on average 1.8-3.1 transcripts per reference gene assembled.
With rapidly improving sequencing and assembly tools, our study provides a framework to benchmark and optimize performance before choosing tools or parameter combinations for analyzing short-read RNA-seq data. Our study demonstrates that choice of assembly package, k-mer sizes, post-assembly redundancy-reduction and chimera cleanup, and strand-specific RNA-seq library preparation and assembly dramatically improves gene coverage by non-redundant and non-chimeric transcripts that are optimized for downstream phylogenomic analyses.
Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including ...epidemiological tracking and predicting antibiotic resistance. However IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, directly from paired-end short read data.
ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 97 % of known IS insertions in the analysis of simulated reads, and 98 % in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was able to correctly detect insertions for average genome-wide read depths >20x, although read depths >50x were required to obtain confident calls that were highly-supported by evidence from reads. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots.
ISMapper provides a rapid and robust method for identifying IS insertion sites directly from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.
In the honeybee Apis mellifera, the bacterial gut community is consistently colonized by eight distinct phylotypes of bacteria. Managed bee colonies are of considerable economic interest and it is ...therefore important to elucidate the diversity and role of this microbiota in the honeybee. In this study, we have sequenced the genomes of eleven strains of lactobacilli and bifidobacteria isolated from the honey crop of the honeybee A. mellifera.
Single gene phylogenies confirmed that the isolated strains represent the diversity of lactobacilli and bifidobacteria in the gut, as previously identified by 16S rRNA gene sequencing. Core genome phylogenies of the lactobacilli and bifidobacteria further indicated extensive divergence between strains classified as the same phylotype. Phylotype-specific protein families included unique surface proteins. Within phylotypes, we found a remarkably high level of gene content diversity. Carbohydrate metabolism and transport functions contributed up to 45% of the accessory genes, with some genomes having a higher content of genes encoding phosphotransferase systems for the uptake of carbohydrates than any previously sequenced genome. These genes were often located in highly variable genomic segments that also contained genes for enzymes involved in the degradation and modification of sugar residues. Strain-specific gene clusters for the biosynthesis of exopolysaccharides were identified in two phylotypes. The dynamics of these segments contrasted with low recombination frequencies and conserved gene order structures for the core genes. Hits for CRISPR spacers were almost exclusively found within phylotypes, suggesting that the phylotypes are associated with distinct phage populations.
The honeybee gut microbiota has been described as consisting of a modest number of phylotypes; however, the genomes sequenced in the current study demonstrated a very high level of gene content diversity within all three described phylotypes of lactobacilli and bifidobacteria, particularly in terms of metabolic functions and surface structures, where many features were strain-specific. Together, these results indicate niche differentiation within phylotypes, suggesting that the honeybee gut microbiota is more complex than previously thought.
Fish species often exhibit significant sexual dimorphism for commercially important traits. Accordingly, the control of phenotypic sex, and in particular the production of monosex cultures, is of ...particular interest to the aquaculture industry. Sex determination in the widely farmed Nile tilapia (Oreochromis niloticus) is complex, involving genomic regions on at least three chromosomes (chromosomes 1, 3 and 23) and interacting in certain cases with elevated early rearing temperature as well. Thus, sex ratios may vary substantially from 50%.
This study focused on mapping sex-determining quantitative trait loci (QTL) in families with skewed sex ratios. These included four families that showed an excess of males (male ratio varied between 64% and 93%) when reared at standard temperature (28°C) and a fifth family in which an excess of males (96%) was observed when fry were reared at 36°C for ten days from first feeding. All the samples used in the current study were genotyped for two single-nucleotide polymorphisms (rs397507167 and rs397507165) located in the expected major sex-determining region in linkage group 1 (LG 1). The only misassigned individuals were phenotypic males with the expected female genotype, suggesting that those offspring had undergone sex-reversal with respect to the major sex-determining locus. We mapped SNPs identified from double digest Restriction-site Associated DNA (ddRAD) sequencing in these five families. Three genetic maps were constructed consisting of 641, 175 and 1,155 SNPs from the three largest families. QTL analyses provided evidence for a novel genome-wide significant QTL in LG 20. Evidence was also found for another sex-determining QTL in the fifth family, in the proximal region of LG 1.
Overall, the results from this study suggest that these previously undetected QTLs are involved in sex determination in the Nile tilapia, causing sex reversal (masculinisation) with respect to the XX genotype at the major sex-determining locus in LG 1.
Inferring phylogenetic trees for newly recovered genomes from metagenomic samples is very useful in determining the identities of uncultivated microorganisms. Even though 16S ribosomal RNA small ...subunit genes have been established as "gold standard" markers for inferring phylogenetic trees, they usually cannot be assembled very well in metagenomes due to shared regions among 16S genes. Using single-copy marker genes to build genome trees has become increasingly popular for uncultivated species. Predefined marker gene sets were discovered and have been applied in various genomic studies; however these gene sets might not be adequate for novel, uncultivated, draft, or incomplete genomes. The automatic identification of marker gene sets among a set of genomes with different assembly qualities has thus become a very important task for inferring reliable phylogenetic relationships for microbial populations.
A computational pipeline, ezTree, was developed to automatically identify single-copy marker genes for a group of genomes and build phylogenetic trees from the marker genes. Testing ezTree on a group of proteobacteria species revealed that ezTree was highly effective in pinpointing marker genes and constructing reliable trees for different groups of bacterial genomes. Applying ezTree to genomes that were recently recovered from metagenomes also showed that ezTree can help elucidate taxonomic relationships among newly recovered genomes and existing ones.
The development of ezTree can help scientists build reliable phylogenetic trees for uncultivated species retrieved from environmental samples. The uncovered single-copy marker genes may also provide crucial hints for understanding shared features of a group of microbes. The ezTree pipeline is freely available at https://github.com/yuwwu/ezTree under a GNU GPLv3 license.
Quinoa (Chenopodium quinoa Willd.) is a balanced nutritional crop, but its breeding improvement has been limited by the lack of information on its genetics and genomics. Therefore, it is necessary to ...obtain knowledge on genomic variation, population structure, and genetic diversity and to develop novel Insertion/Deletion (InDel) markers for quinoa by whole-genome re-sequencing.
We re-sequenced 11 quinoa accessions and obtained a coverage depth between approximately 7× to 23× the quinoa genome. Based on the 1453-megabase (Mb) assembly from the reference accession Riobamba, 8,441,022 filtered bi-allelic single nucleotide polymorphisms (SNPs) and 842,783 filtered InDels were identified, with an estimated SNP and InDel density of 5.81 and 0.58 per kilobase (kb). From the genomic InDel variations, 85 dimorphic InDel markers were newly developed and validated. Together with the 62 simple sequence repeat (SSR) markers reported, a total of 147 markers were used for genotyping the 129 quinoa accessions. Molecular grouping analysis showed classification into two major groups, the Andean highland (composed of the northern and southern highland subgroups) and Chilean coastal, based on combined STRUCTURE, phylogenetic tree and PCA (Principle Component Analysis) analyses. Further analysis of the genetic diversity exhibited a decreasing tendency from the Chilean coast group to the Andean highland group, and the gene flow between subgroups was more frequent than that between the two subgroups and the Chilean coastal group. The majority of the variations (approximately 70%) were found through an analysis of molecular variation (AMOVA) due to the diversity between the groups. This was congruent with the observation of a highly significant F
value (0.705) between the groups, demonstrating significant genetic differentiation between the Andean highland type of quinoa and the Chilean coastal type. Moreover, a core set of 16 quinoa germplasms that capture all 362 alleles was selected using a simulated annealing method.
The large number of SNPs and InDels identified in this study demonstrated that the quinoa genome is enriched with genomic variations. Genetic population structure, genetic core germplasms and dimorphic InDel markers are useful resources for genetic analysis and quinoa breeding.
Transposome-based technologies have enabled the streamlined production of sequencer-ready DNA libraries; however, current methods are highly sensitive to the amount and quality of input nucleic acid.
...We describe a new library preparation technology (Nextera DNA Flex) that utilizes a known concentration of transposomes conjugated directly to beads to bind a fixed amount of DNA, and enables direct input of blood and saliva using an integrated extraction protocol. We further report results from libraries generated outside the standard parameters of the workflow, highlighting novel applications for Nextera DNA Flex, including human genome builds and variant calling from below 1 ng DNA input, customization of insert size, and preparation of libraries from short fragments and severely degraded FFPE samples. Using this bead-linked library preparation method, library yield saturation was observed at an input amount of 100 ng. Preparation of libraries from a range of species with varying GC levels demonstrated uniform coverage of small genomes. For large and complex genomes, coverage across the genome, including difficult regions, was improved compared with other library preparation methods. Libraries were successfully generated from amplicons of varying sizes (from 50 bp to 11 kb), however, a decrease in efficiency was observed for amplicons smaller than 250 bp. This library preparation method was also compatible with poor-quality DNA samples, with sequenceable libraries prepared from formalin-fixed paraffin-embedded samples with varying levels of degradation.
In contrast to solution-based library preparation, this bead-based technology produces a normalized, sequencing-ready library for a wide range of DNA input types and amounts, largely obviating the need for DNA quantitation. The robustness of this bead-based library preparation kit and flexibility of input DNA facilitates application across a wide range of fields.
The development of long-read sequencing technologies, such as single-molecule real-time (SMRT) sequencing by PacBio, has produced a revolution in the sequencing of small genomes. Sequencing organelle ...genomes using PacBio long-read data is a cost effective, straightforward approach. Nevertheless, the availability of simple-to-use software to perform the assembly from raw reads is limited at present.
We present Organelle-PBA, a Perl program designed specifically for the assembly of chloroplast and mitochondrial genomes. For chloroplast genomes, the program selects the chloroplast reads from a whole genome sequencing pool, maps the reads to a reference sequence from a closely related species, and then performs read correction and de novo assembly using Sprai. Organelle-PBA completes the assembly process with the additional step of scaffolding by SSPACE-LongRead. The program then detects the chloroplast inverted repeats and reassembles and re-orients the assembly based on the organelle origin of the reference. We have evaluated the performance of the software using PacBio reads from different species, read coverage, and reference genomes. Finally, we present the assembly of two novel chloroplast genomes from the species Picea glauca (Pinaceae) and Sinningia speciosa (Gesneriaceae).
Organelle-PBA is an easy-to-use Perl-based software pipeline that was written specifically to assemble mitochondrial and chloroplast genomes from whole genome PacBio reads. The program is available at https://github.com/aubombarely/Organelle_PBA .
Massively parallel reporter assays (MPRAs) have emerged as a popular means for understanding noncoding variation in a variety of conditions. While a large number of experiments have been described in ...the literature, analysis typically uses ad-hoc methods. There has been little attention to comparing performance of methods across datasets.
We present the mpralm method which we show is calibrated and powerful, by analyzing its performance on multiple MPRA datasets. We show that it outperforms existing statistical methods for analysis of this data type, in the first comprehensive evaluation of statistical methods on several datasets. We investigate theoretical and real-data properties of barcode summarization methods and show an unappreciated impact of summarization method for some datasets. Finally, we use our model to conduct a power analysis for this assay and show substantial improvements in power by performing up to 6 replicates per condition, whereas sequencing depth has smaller impact; we recommend to always use at least 4 replicates. An R package is available from the Bioconductor project.
Together, these results inform recommendations for differential analysis, general group comparisons, and power analysis and will help improve design and analysis of MPRA experiments.