The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments "read" by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These "gold standards" can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics.
We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly "bake-offs" with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled.
Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation.
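The likelihood score described above can be illustrated with a minimal sketch. The read model here is an assumption made for illustration only, not the model of the paper: reads are error-prone copies drawn uniformly from the forward strand of the assembly, with a fixed per-base substitution rate and no indels. The function names, `error_rate`, and the probability `floor` are all hypothetical. The last function shows the sampling idea: the score is estimated from a random subset of reads and rescaled to the full read count.

```python
import math
import random

def read_probability(read, assembly, error_rate=0.01):
    """P(read | assembly) under an assumed toy model: uniform start position,
    independent substitutions at rate error_rate, forward strand only."""
    n_positions = len(assembly) - len(read) + 1
    if n_positions <= 0:
        return 0.0
    total = 0.0
    for start in range(n_positions):
        p = 1.0
        for r, a in zip(read, assembly[start:start + len(read)]):
            p *= (1.0 - error_rate) if r == a else error_rate / 3.0
        total += p
    return total / n_positions

def log_likelihood(reads, assembly, error_rate=0.01, floor=1e-300):
    """Sum of log P(read | assembly) over all reads (floor avoids log(0))."""
    return sum(
        math.log(max(read_probability(r, assembly, error_rate), floor))
        for r in reads
    )

def sampled_log_likelihood(reads, assembly, sample_size, error_rate=0.01, seed=0):
    """Estimate the full score from a random sample of reads, rescaled
    by the ratio of total reads to sampled reads."""
    rng = random.Random(seed)
    sample = rng.sample(reads, min(sample_size, len(reads)))
    scale = len(reads) / len(sample)
    return scale * log_likelihood(sample, assembly, error_rate)
```

Under this toy model, two candidate assemblies of the same read set can be compared by calling `log_likelihood` on each and preferring the higher value. A realistic implementation would use read alignments rather than scanning every start position.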
Due to the high genotyping cost and large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs, referred to as tag SNPs, that covers the genetic variation of the entire data set. To represent the genetic variation of an untagged SNP, existing tagging methods use either a single tag SNP (e.g., Tagger, IdSelect) or several tag SNPs (e.g., MLR, STAMPA). When multiple tags are used to explain the variation of a single SNP, fewer tags are usually needed, but overfitting is higher.
This paper explores the trade-off between the number of tags and overfitting, and considers the problem of finding a minimum number of tags when at most two tags can represent the variation of an untagged SNP. We show that this problem is hard to approximate and propose an efficient heuristic, referred to as 2LR. Our experimental results show that 2LR tagging falls between Tagger and MLR in both the number of tags and overfitting. Indeed, 2LR uses slightly more tags than MLR, but its overfitting, measured with 2-fold cross-validation, is practically the same as for Tagger. 2LR tagging also tolerates missing data better than Tagger.
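To make the idea of representing an untagged SNP with at most two tags concrete, the following sketch fits an ordinary least-squares regression of one SNP on two tag SNPs and rounds the prediction to the nearest genotype. This is an assumption about the general approach that the names MLR and 2LR suggest; the abstract does not spell out the actual procedure, and the sketch omits the tag selection step entirely. Genotypes are assumed to be coded as 0, 1, or 2, and all function names and the small `ridge` term (added only to keep the system solvable when the tags are collinear) are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_two_tag_regression(tag1, tag2, target, ridge=1e-9):
    """Least-squares fit of target ~ c0 + c1 * tag1 + c2 * tag2
    via the normal equations; returns [c0, c1, c2]."""
    rows = [(1.0, float(x1), float(x2)) for x1, x2 in zip(tag1, tag2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    for i in range(3):
        xtx[i][i] += ridge  # tiny ridge term guards against singular systems
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(3)]
    return solve_linear_system(xtx, xty)

def predict_genotype(coeffs, x1, x2):
    """Round the regression output to the nearest genotype in {0, 1, 2}."""
    value = coeffs[0] + coeffs[1] * x1 + coeffs[2] * x2
    return min((0, 1, 2), key=lambda g: abs(g - value))
```

To gauge overfitting in the spirit of the 2-fold cross-validation mentioned above, one could fit the coefficients on one half of the individuals and count prediction errors on the other half.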
In this dissertation, we address two different genomic inference problems: assembling viral quasispecies sequences and estimating their frequencies from ultra-deep sequencing data, and inferring the allelic values of single nucleotide polymorphisms (SNPs) from a set of chosen informative (tag) SNPs. We develop efficient algorithmic techniques for assembling viral quasispecies sequences from 454 Life Sciences reads and estimating their frequencies. The proposed Viral Spectrum Assembler is compared with the state-of-the-art ShoRAH on simulated reads and on real 454 pyrosequencing shotgun reads. We also explore the trade-off between the number of tags and overfitting, and propose an efficient heuristic called 2-LR tagging to find a minimum number of tags when at most two tags can represent the variation of an untagged SNP.
INDEX WORDS: Haplotype assembling, Viral quasispecies, Hepatitis C virus, Graph theory, Expectation maximization, Tagging, Bioinformatics, Georgia State University
Understanding how the genomes of viruses mutate and evolve within infected individuals is critically important in epidemiology. By exploiting knowledge of the forces that guide viral microevolution, researchers can design drugs and treatments that are effective against newly evolved strains. Therefore, it is critical to develop a method for typing the genomes of all of the variants of a virus (quasispecies) inside an infected individual cell.
In this paper, we focus on sequence assembly of Hepatitis C Virus (HCV) based on the 454 Life Sciences system, which produces around 250K reads, each 100-400 bases long. We introduce several formulations of the quasispecies assembly problem and a measure of assembly quality. We also propose a scalable quasispecies assembly method based on a novel network flow formulation. Finally, we report the results of assembling 44 quasispecies from the 1700 bp long E1E2 region of HCV.
Since existing high-throughput sequencing systems were originally designed for single-genome assembly, they cannot distinguish and simultaneously assemble multiple closely related sequences, nor estimate their relative abundances. This paper presents a novel approach, implemented in the ViSpA software, for quasispecies spectrum reconstruction. On simulated data, ViSpA accurately reconstructs up to 29 (out of 44) quasispecies in the absence of genotyping errors. ViSpA was also applied to real read data derived from a blood sample of an HCV-infected patient and processed by a Roche 454 Life Sciences machine. The sequenced region is half a genome long. The method reconstructed the 10 most frequent sequences, each of which represents a viable protein. The most frequent sequence is within 1% of the actual ORF obtained by cloning the quasispecies. ShoRAH was able to reconstruct only one sequence that represents a viable protein. This sequence has 99.94% similarity with the fourth most frequent assembly. Both methods returned similar frequency estimates for this sequence: 0.017% (ShoRAH) and 0.019% (ViSpA). The remaining top 9 quasispecies reconstructed by ShoRAH contain multiple stop codons in their corresponding amino-acid sequences, which indicates uncorrected systematic indel errors introduced by 454 Life Sciences machines. Additional experiments on 90% of the read data show that the ten most frequent assembled quasispecies are robustly reproduced by the sequencing process in ViSpA.
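The abstract does not say how the relative abundances are estimated. One common choice for this kind of problem is expectation maximization (EM) over read-to-sequence assignments, sketched below. The sketch is not ViSpA's implementation. It assumes that reads are already aligned to the sequenced region, that each read is given as a `(start, sequence)` pair, that candidate sequences all cover the same region, and that errors are independent substitutions at a fixed rate. All names and parameters are hypothetical.

```python
def read_likelihood(start, read, candidate, error_rate):
    """P(read | candidate) for a read aligned at offset start,
    assuming independent substitution errors and no indels."""
    p = 1.0
    for offset, base in enumerate(read):
        if candidate[start + offset] == base:
            p *= 1.0 - error_rate
        else:
            p *= error_rate / 3.0
    return p

def em_frequencies(reads, candidates, error_rate=0.01, iterations=100):
    """Estimate candidate frequencies by EM over soft read assignments."""
    k = len(candidates)
    freqs = [1.0 / k] * k
    # Likelihoods are fixed across iterations, so compute them once.
    lik = [
        [read_likelihood(start, seq, c, error_rate) for c in candidates]
        for start, seq in reads
    ]
    for _ in range(iterations):
        counts = [0.0] * k
        for row in lik:
            # E-step: posterior responsibility of each candidate for this read.
            weights = [f * l for f, l in zip(freqs, row)]
            total = sum(weights)
            if total == 0.0:
                continue
            for j, w in enumerate(weights):
                counts[j] += w / total
        # M-step: frequencies are the normalized expected read counts.
        norm = sum(counts)
        freqs = [c / norm for c in counts]
    return freqs
```

With reads that overlap a position where two candidates differ, the estimated frequencies track the proportion of reads supporting each variant. Reads that cover only shared positions are split between candidates according to the current frequency estimates.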