With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated, giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq data has to be processed through a number of steps resulting in a quantification of expression of each gene/transcript in each of the analyzed samples. A number of workflows are available to help researchers perform these steps on their own data, or on public data to take advantage of novel software or reference data in data re-analysis. However, many of the existing workflows are limited to specific types of studies. We therefore aimed to develop a maximally general workflow, applicable to a wide range of data and analysis approaches and at the same time supporting research on both model and non-model organisms. Furthermore, we aimed to make the workflow usable also by users with limited programming skills.
Utilizing the workflow management system Snakemake and the package management system Conda, we have developed a modular, flexible and user-friendly RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow). Utilizing Snakemake and Conda alleviates challenges with library dependencies and version conflicts and also supports reproducibility. To suit a wide variety of applications, RASflow supports the mapping of reads to both genomic and transcriptomic assemblies. RASflow has a broad range of potential users: it can be applied by researchers interested in any organism and, since it requires no programming skills, it can be used by researchers with different backgrounds. The source code of RASflow is available on GitHub: https://github.com/zhxiaokang/RASflow.
RASflow is a simple and reliable RNA-Seq analysis workflow covering many use cases.
High throughput sequencing technology has great promise for biodiversity studies. However, an underlying assumption is that the primers used in these studies are universal for the prokaryotic or eukaryotic groups of interest. Full primer universality is difficult or impossible to achieve and studies using different primer sets make biodiversity comparisons problematic. The aim of this study was to design and optimize universal eukaryotic primers that could be used as a standard in future biodiversity studies. Using the alignment of all eukaryotic sequences from the publicly available SILVA database, we generated a full characterization of variable versus conserved regions in the 18S rRNA gene. All variable regions within this gene were analyzed and our results suggested that the V2, V4 and V9 regions were best suited for biodiversity assessments. Previously published universal eukaryotic primers as well as a number of self-designed primers were mapped to the alignment. Primer selection will depend on the sequencing technology used, and this study focused on the 454 pyrosequencing GS FLX Titanium platform. The results generated a primer pair yielding theoretical matches to 80% of the eukaryotic and 0% of the prokaryotic sequences in the SILVA database. An empirical test of marine sediments using the AmpliconNoise pipeline for analysis of the high throughput sequencing data yielded amplification of sequences for 71% of all eukaryotic phyla with no isolation of prokaryotic sequences. To our knowledge this is the first characterization of the complete 18S rRNA gene using all eukaryotes present in the SILVA database, providing a robust test for universal eukaryotic primers. Since both in silico and empirical tests using high throughput sequencing retained high inclusion of eukaryotic phyla and exclusion of prokaryotes, we conclude that these primers are well suited for assessing eukaryote diversity, and can be used as a standard in biodiversity studies.
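The idea of a "theoretical match" between a primer and a set of reference sequences can be illustrated with a minimal sketch. The code below is not the procedure used in the study, which mapped primers to the SILVA alignment; it assumes unaligned toy sequences, forward-strand matching only, and the standard IUPAC ambiguity codes for degenerate primer positions. The function names `primer_matches` and `coverage`, the `max_mismatches` parameter and the example sequences are all hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_matches(primer, sequence, max_mismatches=0):
    """Return True if the primer binds anywhere on the forward strand.

    A primer position matches when the sequence base is one of the bases
    allowed by the primer's IUPAC code. max_mismatches is an illustrative
    tolerance, not a value taken from the study.
    """
    primer = primer.upper()
    sequence = sequence.upper().replace("U", "T")
    width = len(primer)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for p, s in zip(primer, window) if s not in IUPAC[p])
        if mismatches <= max_mismatches:
            return True
    return False


def coverage(primer, sequences, max_mismatches=0):
    """Fraction of sequences with at least one primer match."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    hits = sum(primer_matches(primer, s, max_mismatches) for s in sequences)
    return hits / len(sequences)


# Toy example: the degenerate base Y (C or T) matches the first sequence only.
toy_sequences = ["TTGCATT", "TTGAATT"]
print(coverage("GYA", toy_sequences))
```

A reverse primer would need to be reverse-complemented before matching, and a real evaluation would compute coverage separately for eukaryotic and prokaryotic reference sets.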
SNP arrays, short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well-known, but not all of them are thoroughly quantified. We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This assembly represents a variety of methods and pipelines used for CNV calling from array, short- and long-read technologies. We then performed cross-technology comparisons regarding their ability to call CNVs. Unlike other studies, we refrained from using a gold standard. Instead, we attempted to validate the CNV calls by the raw data of each technology. Our study confirms that long-read platforms enable recalling CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV by different pipelines within each technology is strongly linked to other CNV evidence measures. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on what technology the database was built on.
How an organism copes with chemicals is largely determined by the genes and proteins that collectively function to defend against, detoxify and eliminate chemical stressors. This integrative network includes receptors and transcription factors, biotransformation enzymes, transporters, antioxidants, and metal- and heat-responsive genes, and is collectively known as the chemical defensome. Teleost fish are the largest group of vertebrate species and can provide valuable insights into the evolution and functional diversity of defensome genes. We have previously shown that the xenosensing pregnane X receptor (pxr, nr1i2) is lost in many teleost species, including Atlantic cod (Gadus morhua) and three-spined stickleback (Gasterosteus aculeatus), but it is not known if compensatory mechanisms or signaling pathways have evolved in its absence. In this study, we compared the genes comprising the chemical defensome of five fish species that span the teleost evolutionary branch and are often used as model species in toxicological studies and environmental monitoring programs: zebrafish (Danio rerio), medaka (Oryzias latipes), Atlantic killifish (Fundulus heteroclitus), Atlantic cod, and three-spined stickleback. Genome mining revealed evolved differences in the number and composition of defensome genes that can have implications for how these species sense and respond to environmental pollutants, but we did not observe any candidates of compensatory mechanisms or pathways in cod and stickleback in the absence of pxr. The results indicate that knowledge regarding the diversity and function of the defensome will be important for toxicological testing and risk assessment studies.
As global exploitation of available resources increases, operations extend towards sensitive and previously protected ecosystems. It is important to monitor such areas in order to detect, understand and remediate environmental responses to stressors. The natural heterogeneity and complexity of communities means that accurate monitoring requires high resolution, both temporally and spatially, as well as more complete assessments of taxa. Increased resolution and taxonomic coverage are economically challenging using current microscopy‐based monitoring practices. Alternatively, DNA sequencing‐based methods have been suggested for cost‐efficient monitoring, offering additional insights into ecosystem function and disturbance. Here, we applied DNA metabarcoding of eukaryotic communities in marine sediments, in areas of offshore drilling on the Norwegian continental shelf. Forty‐five samples, collected from seven drilling sites in the Troll/Oseberg region, were assessed, using the small subunit ribosomal RNA gene as a taxonomic marker. In agreement with results based on classical morphology‐based monitoring, we were able to identify changes in sediment communities surrounding oil platforms. In addition to overall changes in community structure, we identified several potential indicator taxa, responding to pollutants associated with drilling fluids. These included the metazoan orders Macrodasyida, Macrostomida and Ceriantharia, as well as several ciliates and other protist taxa, typically not targeted by environmental monitoring programmes. Analysis of a co‐occurrence network to study the distribution of taxa across samples provided a framework for better understanding the impact of anthropogenic activities on the benthic food web, generating novel, testable hypotheses of trophic interactions structuring benthic communities.
Lung cancer in East Asia is characterized by a high percentage of never-smokers, early onset and predominant EGFR mutations. To illuminate the molecular phenotype of this demographically distinct disease, we performed a deep comprehensive proteogenomic study on a prospectively collected cohort in Taiwan, representing early stage, predominantly female, non-smoking lung adenocarcinoma. Integrated genomic, proteomic, and phosphoproteomic analysis delineated the demographically distinct molecular attributes and hallmarks of tumor progression. Mutational signature analysis revealed age- and gender-related mutagenesis mechanisms, characterized by high prevalence of APOBEC mutational signature in younger females and over-representation of environmental carcinogen-like mutational signatures in older females. A proteomics-informed classification distinguished the clinical characteristics of early stage patients with EGFR mutations. Furthermore, integrated protein network analysis revealed the cellular remodeling underpinning clinical trajectories and nominated candidate biomarkers for patient stratification and therapeutic intervention. This multi-omic molecular architecture may help develop strategies for management of early stage never-smoker lung adenocarcinoma.
• First deep proteogenomic landscape of non-smoking lung adenocarcinoma in East Asia
• Identified age, sex-related endogenous, and environmental carcinogen mutagenic processes
• Proteome-informed classification distinguished clinical features within early stages
• Protein networks identified tumorigenesis hallmarks, biomarkers, and druggable targets
Deep proteogenomic landscape of early stage lung adenocarcinoma in a cohort of mostly non-smokers reveals unique drivers and biomarkers, as well as gender-associated mutagenesis.
Soda lakes are intriguing ecosystems harboring extremely productive microbial communities in spite of their extreme environmental conditions. This makes them valuable model systems for studying the connection between community structure and abiotic parameters such as pH and salinity. For the first time, we apply high-throughput sequencing to accurately estimate phylogenetic richness and composition in five soda lakes, located in the Ethiopian Rift Valley. The lakes were selected for their contrasting pH, salinities and stratification and several depths or spatial positions were covered in each lake. DNA was extracted and analyzed from all lakes at various depths and RNA extracted from two of the lakes, analyzed using both amplicon- and shotgun sequencing. We reveal a surprisingly high biodiversity in all of the studied lakes, similar to that of freshwater lakes. Interestingly, diversity appeared uncorrelated or positively correlated to pH and salinity, with the most "extreme" lakes showing the highest richness. Together, pH, dissolved oxygen, sodium- and potassium concentration explained approximately 30% of the compositional variation between samples. A diversity of prokaryotic and eukaryotic taxa could be identified, including several putatively involved in carbon-, sulfur- or nitrogen cycling. Key processes like methane oxidation, ammonia oxidation and 'nitrifier denitrification' were also confirmed by mRNA transcript analyses.
The salmon louse (Lepeophtheirus salmonis) is an obligate ectoparasitic copepod living on Atlantic salmon and other salmonids in the marine environment. Salmon lice cause a number of environmental problems and lead to large economic losses in aquaculture every year. In order to develop novel parasite control strategies, a better understanding of the mechanisms of moulting and development of the salmon louse at the transcriptional level is required. Three weighted gene co-expression networks were constructed based on the pairwise correlations of salmon louse gene expression profiles at different life stages. Network-based approaches and gene annotation information were applied to identify genes that might be important for the moulting and development of the salmon louse. RNA interference was performed for validation. Regulatory impact factors were calculated for all the transcription factor genes by examining the changes in co-expression patterns between transcription factor genes and differentially expressed genes in middle stages and moulting stages. Eight gene modules were predicted as important, and 10 genes from six of the eight modules have been found to show observable phenotypes in RNA interference experiments. We knocked down five hub genes from three modules and observed phenotypic consequences in all experiments. In the infection trial, no copepodids with a RAB1A-like gene knocked down were found on fish, while control samples developed to chalimus-1 larvae. Also, a FOXO-like transcription factor obtained the highest scores in the regulatory impact factor calculation. We propose a gene co-expression network-based approach to identify genes playing an important role in the moulting and development of the salmon louse. The RNA interference experiments confirm the effectiveness of our approach and demonstrate the indispensable role of a RAB1A-like gene in the development of the salmon louse.
We propose that our approach could be generalized to identify important genes associated with a phenotype of interest in other organisms.
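As a rough illustration of how a weighted co-expression network can be built from pairwise correlations and used to rank candidate hub genes, consider the sketch below. It is not the authors' pipeline: it assumes an unsigned network in which the edge weight is the absolute Pearson correlation raised to a soft-thresholding power (`beta`, set to 6 here purely as an assumed default), and it ranks genes by summed edge weight. The gene names and expression values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_network(expression, beta=6):
    """Edge weights |r|**beta for every gene pair (unsigned network).

    beta is an assumed soft-thresholding power, not a value from the study.
    """
    weights = {}
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        weights[(gene_a, gene_b)] = abs(r) ** beta
    return weights


def connectivity(weights):
    """Sum of edge weights per gene; high values suggest hub genes."""
    totals = {}
    for (gene_a, gene_b), weight in weights.items():
        totals[gene_a] = totals.get(gene_a, 0.0) + weight
        totals[gene_b] = totals.get(gene_b, 0.0) + weight
    return totals


# Invented expression profiles across four samples (e.g. life stages).
expression = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 2, 3],
}
network = coexpression_network(expression)
ranked = sorted(connectivity(network).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Raising the correlation to a power suppresses weak correlations while keeping the network fully weighted, so no hard cut-off has to be chosen. Module detection and the regulatory impact factor calculation mentioned in the abstract are beyond this sketch.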
Motivation: 454 pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance and cost, but shows higher per-base error rates. Although there are several tools available for noise removal, targeting different application fields, data interpretation would benefit from a better understanding of the different error types.
Results: By exploring 454 raw data, we quantify to what extent different factors account for sequencing errors. In addition to the well-known homopolymer length inaccuracies, we have identified errors likely to originate from other stages of the sequencing process. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim.
Availability: The flowsim pipeline is freely available under the General Public License from http://biohaskell.org/Applications/FlowSim.
Contact: susanne.balzer@imr.no
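To make the homopolymer length inaccuracies mentioned in the Results concrete, the following sketch converts a sequence into an ideal flowgram, perturbs each flow value with Gaussian noise whose spread grows with homopolymer length, and calls bases by rounding. This is a simplified illustration and not the flowsim error model: the flow order `TACG`, the noise parameters `base_sd` and `sd_per_base`, and all function names are assumptions made for the example.

```python
import random

FLOW_ORDER = "TACG"  # assumed nucleotide flow order


def ideal_flowgram(sequence, flow_order=FLOW_ORDER):
    """Homopolymer length incorporated at each flow, without noise."""
    sequence = sequence.upper()
    if any(base not in flow_order for base in sequence):
        raise ValueError("sequence contains a base missing from the flow order")
    flows, position, flow_index = [], 0, 0
    while position < len(sequence):
        base = flow_order[flow_index % len(flow_order)]
        run = 0
        while position < len(sequence) and sequence[position] == base:
            run += 1
            position += 1
        flows.append(run)
        flow_index += 1
    return flows


def noisy_flowgram(flows, base_sd=0.05, sd_per_base=0.05, rng=None):
    """Add Gaussian noise with a standard deviation that grows with run length.

    Both noise parameters are illustrative values, not fitted to real data.
    """
    rng = rng or random.Random()
    return [max(0.0, rng.gauss(n, base_sd + sd_per_base * n)) for n in flows]


def call_bases(flows, flow_order=FLOW_ORDER):
    """Round each flow value to the nearest integer and emit that many bases."""
    called = []
    for index, value in enumerate(flows):
        count = int(value + 0.5)
        called.append(flow_order[index % len(flow_order)] * count)
    return "".join(called)


# Longer homopolymers receive noisier signals, so they are miscalled more often.
read = "TTTTTTTTACG"
noisy = noisy_flowgram(ideal_flowgram(read), sd_per_base=0.1, rng=random.Random(1))
print(call_bases(noisy))
```

Because the noise scales with run length, an eight-base homopolymer is far more likely to be called as seven or nine bases than a single base is to be dropped or doubled, which is the behaviour a simulator of this kind has to reproduce.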
Sequencing of taxonomic or phylogenetic markers is becoming a fast and efficient method for studying environmental microbial communities. This has resulted in a steadily growing collection of marker sequences, most notably of the small-subunit (SSU) ribosomal RNA gene, and an increased understanding of microbial phylogeny, diversity and community composition patterns. However, to utilize these large datasets together with new sequencing technologies, a reliable and flexible system for taxonomic classification is critical. We developed CREST (Classification Resources for Environmental Sequence Tags), a set of resources and tools for generating and utilizing custom taxonomies and reference datasets for classification of environmental sequences. CREST uses an alignment-based classification method with the lowest common ancestor algorithm. It also uses explicit rank similarity criteria to reduce false positives and identify novel taxa. We implemented this method in a web server, a command line tool and the graphical user interface program MEGAN. Further, we provide the SSU rRNA reference database and taxonomy SilvaMod, derived from the publicly available SILVA SSURef, for classification of sequences from bacteria, archaea and eukaryotes. Using cross-validation and environmental datasets, we compared the performance of CREST and SilvaMod to the RDP Classifier. We also utilized Greengenes as a reference database, both with CREST and the RDP Classifier. These analyses indicate that CREST performs better than alignment-free methods with higher recall rate (sensitivity) as well as precision, and with the ability to accurately identify most sequences from novel taxa. Classification using SilvaMod performed better than with Greengenes, particularly when applied to environmental sequences. CREST is freely available under a GNU General Public License (v3) from http://apps.cbu.uib.no/crest and http://lcaclassifier.googlecode.com.
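The lowest common ancestor step of an alignment-based classifier can be sketched as follows. This is a generic illustration rather than CREST's implementation: it assumes the taxonomy is a child-to-parent dictionary, that alignment hits arrive as (taxon, score) pairs, and that hits scoring within a fixed fraction of the best hit are retained (`score_fraction`, an assumed parameter). The rank similarity criteria described in the abstract are not modelled, and the toy taxonomy skips intermediate ranks.

```python
def lineage(taxon, parent):
    """Path from the root down to the taxon, following child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    lineages = [lineage(taxon, parent) for taxon in taxa]
    shared = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


def classify(hits, parent, score_fraction=0.98):
    """Assign a query to the LCA of its near-best alignment hits.

    hits: list of (taxon, alignment_score) pairs.
    score_fraction: assumed cut-off relative to the best score.
    """
    if not hits:
        return None
    best = max(score for _, score in hits)
    kept = [taxon for taxon, score in hits if score >= best * score_fraction]
    return lowest_common_ancestor(kept, parent)


# Simplified toy taxonomy (child -> parent); intermediate ranks are omitted.
parent = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}
hits = [("Escherichia", 500), ("Salmonella", 495), ("Bacillus", 300)]
print(classify(hits, parent))
```

When the near-best hits disagree at a low rank, the assignment moves up the tree until they agree, trading taxonomic resolution for fewer false positives.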