The ability to obtain long read lengths during DNA sequencing has several potentially important practical applications. Especially long read lengths have been reported using the nanopore sequencing method, currently commercially available from Oxford Nanopore Technologies (ONT). However, early reports have demonstrated only limited levels of combined throughput and sequence accuracy. Recently, ONT released a new CsgG pore sequencing system as well as a 250 b/s translocation chemistry with potential for improvements.
We used these components on ONT's miniature 'MinION' device and sequenced native genomic DNA obtained from the near-haploid cancer cell line HAP1. Our data were analysed using recently described computational tools tailored for nanopore/long-read sequencing outputs, and here we present our key findings.
From a single sequencing run, we obtained ~240,000 high-quality mapped reads, comprising a total of ~2.3 billion bases. A mean read length of 9.6 kb and an N50 of ~17 kb were achieved, and sequences mapped to the reference with a mean identity of 85%. Notably, we obtained ~68X coverage of the mitochondrial genome and were able to achieve a mean consensus identity of 99.8% for sequenced mtDNA reads.
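The gap between the mean read length and the N50 reported above follows from how the two statistics are defined: the N50 weights reads by the bases they contribute, so a few long reads pull it well above the mean. The following is a minimal sketch of both calculations; the function names and the toy read lengths are illustrative assumptions, not data or tools from the study.

```python
def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read downwards until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy read lengths in bases (not data from the study).
toy_reads = [2000, 3000, 5000, 8000, 12000, 20000]
print("mean:", round(mean_length(toy_reads)), "N50:", n50(toy_reads))
```

In this toy set the two longest reads already account for more than half of the bases, so the N50 is set by the second-longest read while the mean is dragged down by the short reads.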
With improved sequencing chemistries already released and higher-throughput instruments in the pipeline, this early study suggests that ONT CsgG-based sequencing may be a useful option for potential practical long-read applications.
Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies on comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of performance, due to the lack of a genomic-scale 'gold standard' orthology dataset. Even in the absence of such datasets, the comparison of results from alternative methodologies contains useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity > 80%: INPARANOID identifies orthologs across two species while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than either the (manually curated) KOG database or the homolog clustering algorithm TribeMCL. Using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insights and guides for method selection, tuning and development for different applications.
Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology.
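As an illustration of how LCA can recover sensitivities and specificities without a gold standard, the sketch below fits a two-class latent class model to binary calls by expectation-maximisation. It assumes the methods are conditionally independent given the true class, which is a simplification (the study above also examines dependence between methods). The names `fit_lca` and `pattern_prob` and all toy parameter values are hypothetical.

```python
from itertools import product


def pattern_prob(row, prev, sens, spec):
    """Joint probability of a 0/1 call pattern with class 1 and with class 0."""
    p1, p0 = prev, 1.0 - prev
    for x, se, sp in zip(row, sens, spec):
        p1 *= se if x else 1.0 - se
        p0 *= 1.0 - sp if x else sp
    return p1, p0


def fit_lca(counts, n_iter=2000):
    """EM for a two-class model; counts maps each call pattern to its frequency."""
    k = len(next(iter(counts)))
    n = sum(counts.values())
    # Start with sens/spec above 0.5 so that class 1 means "true ortholog".
    prev, sens, spec = 0.5, [0.8] * k, [0.8] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pattern belongs to class 1.
        post = {}
        for row in counts:
            p1, p0 = pattern_prob(row, prev, sens, spec)
            post[row] = p1 / (p1 + p0)
        # M-step: re-estimate prevalence, sensitivities and specificities.
        w1 = sum(c * post[r] for r, c in counts.items())
        w0 = n - w1
        prev = w1 / n
        sens = [sum(c * post[r] for r, c in counts.items() if r[j]) / w1
                for j in range(k)]
        spec = [sum(c * (1.0 - post[r]) for r, c in counts.items() if not r[j]) / w0
                for j in range(k)]
    return prev, sens, spec


# Hypothetical "true" parameters for four methods, used only to build toy data.
true_prev = 0.3
true_sens = [0.95, 0.85, 0.75, 0.70]
true_spec = [0.70, 0.80, 0.90, 0.95]

# Expected pattern frequencies for 10,000 candidate pairs under those parameters.
counts = {}
for row in product((0, 1), repeat=4):
    p1, p0 = pattern_prob(row, true_prev, true_sens, true_spec)
    counts[row] = 10000 * (p1 + p0)

prev, sens, spec = fit_lca(counts)
print(round(prev, 3), [round(s, 3) for s in sens], [round(s, 3) for s in spec])
```

Only the agreement and disagreement patterns among the four methods enter the fit; no pair is ever labelled as a true or false ortholog.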
Besides protein-coding mRNAs, eukaryotic transcriptomes include many long non-protein-coding RNAs (ncRNAs) of unknown function that are transcribed away from protein-coding loci. Here, we have identified 659 intergenic long ncRNAs whose genomic sequences individually exhibit evolutionary constraint, a hallmark of functionality. Of this set, those expressed in the brain are more frequently conserved and are significantly enriched with predicted RNA secondary structures. Furthermore, brain-expressed long ncRNAs are preferentially located adjacent to protein-coding genes that are (1) also expressed in the brain and (2) involved in transcriptional regulation or in nervous system development. This led us to the hypothesis that spatiotemporal co-expression of ncRNAs and nearby protein-coding genes represents a general phenomenon, a prediction that was confirmed subsequently by in situ hybridisation in developing and adult mouse brain. We provide the full set of constrained long ncRNAs as an important experimental resource and present, for the first time, substantive and predictive criteria for prioritising long ncRNA and mRNA transcript pairs when investigating their biological functions and contributions to development and disease.
Probabilistic functional gene networks are powerful theoretical frameworks for integrating heterogeneous functional genomics and proteomics data into objective models of cellular systems. Such networks provide syntheses of millions of discrete experimental observations, spanning DNA microarray experiments, physical protein interactions, genetic interactions, and comparative genomics; the resulting networks can then be easily applied to generate testable hypotheses regarding specific gene functions and associations.
We report a significantly improved version (v. 2) of a probabilistic functional gene network of the baker's yeast, Saccharomyces cerevisiae. We describe our optimization methods and illustrate their effects in three major areas: the reduction of functional bias in network training reference sets, the application of a probabilistic model for calculating confidences in pair-wise protein physical or genetic interactions, and the introduction of simple thresholds that eliminate many false positive mRNA co-expression relationships. Using the network, we predict and experimentally verify the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis.
YeastNet v. 2, constructed using these optimizations together with additional data, shows significant reduction in bias and improvements in precision and recall, in total covering 102,803 linkages among 5,483 yeast proteins (95% of the validated proteome). YeastNet is available from http://www.yeastnet.org.
The prediction of the genetic disease risk of an individual is a powerful public health tool. While predicting risk has been successful in diseases which follow simple Mendelian inheritance, it has proven challenging in complex diseases for which a large number of loci contribute to the genetic variance. The large numbers of single nucleotide polymorphisms now available provide new opportunities for predicting genetic risk of complex diseases with high accuracy.
We have derived simple deterministic formulae to predict the accuracy of predicted genetic risk from population or case-control studies using a genome-wide approach and assuming a dichotomous disease phenotype with an underlying continuous liability. We show that the prediction equations are special cases of the more general problem of predicting the accuracy of estimates of genetic values of a continuous phenotype. Our predictive equations are responsive to all parameters that affect accuracy and they are independent of allele frequency and effect distributions. Deterministic prediction errors when tested by simulation were generally small. The common link among the expressions for accuracy is that they are best summarized in terms of the product of the observed heritability and the ratio of the number of phenotypic records to the number of risk loci.
This study advances the understanding of the relative power of case-control and population studies of disease. The predictions represent an upper bound of accuracy which may be achievable with improved effect estimation methods. The formulae derived will help researchers determine an appropriate sample size to attain a certain accuracy when predicting genetic risk.
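The abstract above does not state the formulae themselves. As a hedged illustration of what such a sample-size calculation could look like, the sketch below assumes one simple functional form for a continuous phenotype, r = sqrt(λh² / (λh² + 1)) with λ the ratio of phenotypic records to risk loci. This form is an assumption made here for illustration, it does not cover the case-control adjustments, and the function names are hypothetical.

```python
import math


def predicted_accuracy(n_records, n_loci, h2):
    """Assumed form: r = sqrt(lam * h2 / (lam * h2 + 1)), lam = records / loci."""
    lam = n_records / n_loci
    return math.sqrt(lam * h2 / (lam * h2 + 1.0))


def records_needed(target_r, n_loci, h2):
    """Invert the assumed formula to get the sample size for a target accuracy."""
    if not 0.0 < target_r < 1.0:
        raise ValueError("target accuracy must lie strictly between 0 and 1")
    x = target_r ** 2 / (1.0 - target_r ** 2)  # required value of lam * h2
    return x * n_loci / h2


# Hypothetical inputs: 1,000 risk loci and an observed heritability of 0.3.
for n in (1000, 5000, 20000):
    print(n, round(predicted_accuracy(n, 1000, 0.3), 3))
```

Under this assumed form, accuracy depends on the inputs only through the product λh², so halving the heritability has the same effect as halving the number of records per locus.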
Background: Gram-negative bacteria of the genus Serratia are potential producers of many useful secondary metabolites, such as prodigiosin and serrawettins, which have potential applications in environmental bioremediation or in the pharmaceutical industry. Several Serratia strains produce prodigiosin and serrawettin W1 as the main bioactive compounds, and the biosynthetic pathways are co-regulated by quorum sensing (QS). In contrast, no Serratia strain that simultaneously produces prodigiosin and serrawettin W2 has been reported. This study focused on analyzing the genomic sequence of Serratia sp. strain YD25T, isolated from rhizosphere soil under continuously planted burley tobacco collected from Yongding, Fujian province, China, which is unique in producing both prodigiosin and serrawettin W2.
Results: A hybrid polyketide synthase (PKS)-non-ribosomal peptide synthetase (NRPS) gene cluster putatively involved in biosynthesis of antimicrobial serrawettin W2 was identified in the genome of YD25T, and its biosynthesis pathway was proposed. We found potent antimicrobial activity of serrawettin W2 purified from YD25T against various pathogenic bacteria and fungi as well as antitumor activity against HeLa cells. Subsequently, comparative genomic analyses were performed among a total of 133 Serratia species. The prodigiosin biosynthesis gene cluster in YD25T belongs to the type I pig cluster, which is the main form of pig-encoding genes existing in most of the pigmented Serratia species. In addition, a complete autoinducer-2 (AI-2) system (including luxS, lsrBACDEF, lsrGK, and lsrR) as a conserved bacterial operator is found in the genome of Serratia sp. strain YD25T. Phylogenetic analysis based on concatenated Lsr and LuxS proteins revealed that YD25T formed an independent branch and was clearly distant from the strains that solely produce either prodigiosin or serrawettin W2. The Fe(III) ion reduction assay confirmed that strain YD25T could produce an AI-2 signal molecule. Analysis of the genomic sequence of YD25T, combined with phylogenetic and phenotypic analyses, supports this strain as a member of a novel and previously uncharacterized Serratia species.
Conclusion: Genomic sequence and metabolite analysis of Serratia surfactantfaciens YD25T indicate that this strain can be further explored for the production of useful metabolites. Unveiling the genomic sequence of S. surfactantfaciens YD25T facilitates the use of this unique strain as a model system for studying the biosynthesis regulation of both prodigiosin and serrawettin W2 by the QS system.
The genome sequence of Rickettsia felis revealed a number of rickettsial genetic anomalies that likely contribute not only to a large genome size relative to other rickettsiae, but also to phenotypic oddities that have confounded the categorization of R. felis as either typhus group (TG) or spotted fever group (SFG) rickettsiae. Most intriguing was the first report from rickettsiae of a conjugative plasmid (pRF) that contains 68 putative open reading frames, several of which are predicted to encode proteins with high similarity to conjugative machinery in other plasmid-containing bacteria.
Using phylogeny estimation, we determined the mode of inheritance of pRF genes relative to conserved rickettsial chromosomal genes. Phylogenies of chromosomal genes were in agreement with other published rickettsial trees. However, phylogenies including pRF genes yielded different topologies and suggest a close relationship between pRF and ancestral group (AG) rickettsiae, including the recently completed genome of R. bellii str. RML369-C. This relatedness is further supported by the distribution of pRF genes across other rickettsiae, as 10 pRF genes (or inactive derivatives) also occur in AG (but not SFG) rickettsiae, with five of these genes characteristic of typical plasmids. Detailed characterization of pRF genes resulted in two novel findings: the identification of oriV and replication termination regions, and the likelihood that a second proposed plasmid, pRFdelta, is an artifact of the original genome assembly.
Altogether, we propose a new rickettsial classification scheme with the addition of a fourth lineage, transitional group (TRG) rickettsiae, that is distinct from TG and SFG rickettsiae and harbors genes from possible exchanges with AG rickettsiae via conjugation. We offer insight into the evolution of a plastic plasmid system in rickettsiae, including the role plasmids may have played in the acquisition of virulence traits in pathogenic strains, and the likely origin of plasmids within the rickettsial tree.
Different synonymous codons are favored by natural selection for translation efficiency and accuracy in different organisms. The rules governing the identities of favored codons in different organisms remain obscure. In fact, it is not known whether such rules exist or whether favored codons are chosen randomly in evolution in a process akin to a series of frozen accidents. Here, we study this question by identifying for the first time the favored codons in 675 bacteria, 52 archaea, and 10 fungi. We use a number of tests to show that the identified codons are indeed likely to be favored and find that across all studied organisms the identity of favored codons tracks the GC content of the genomes. Once the effect of the genomic GC content on selectively favored codon choice is taken into account, additional universal amino acid specific rules governing the identity of favored codons become apparent. Our results provide for the first time a clear set of rules governing the evolution of selectively favored codon usage. Based on these results, we describe a putative scenario for how evolutionary shifts in the identity of selectively favored codons can occur without even temporary weakening of natural selection for codon bias.
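The two quantities related in the abstract above, genomic GC content and codon usage, can both be tallied directly from sequence. The sketch below shows only these raw ingredients; it does not implement the study's procedure for identifying selectively favored codons, and the function names and toy sequence are hypothetical.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_counts(cds):
    """Count in-frame codons of a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Hypothetical toy coding sequence: Met, three alanine codons, stop.
toy_cds = "ATGGCCGCTGCCTAA"
print(gc_content(toy_cds))
print(codon_counts(toy_cds).most_common())
```

Comparing such counts among the synonymous codons of one amino acid (here GCC versus GCT for alanine) is the kind of usage comparison on which analyses of codon preference are built.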