The structure of genomes
Gopal, Shuba
Genome Biology, 03/2000, Volume 1, Issue 1
Journal Article
Peer reviewed
Open access
The particular interest of David Ussery and his group is in creating 'structural atlases' of genomes from a detailed analysis of variations in DNA structure.
Genome annotation in differently evolved organisms presents challenges because the lack of sequence-based homology limits the ability to determine the function of putative coding regions. To provide an alternative to annotation by sequence homology, we developed a method that takes advantage of unusual trypanosomatid biology and skews in nucleotide composition between coding regions and upstream regions to rank putative open reading frames based on the likelihood of coding. The method is 93% accurate when tested on known genes. We have applied our method to the full complement of open reading frames on Chromosome I of Trypanosoma brucei, and we can predict with high confidence that 226 putative coding regions are likely to be functional. Methods such as the one described here for discriminating true coding regions are critical for genome annotation when other sources of evidence for function are limited.
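The idea of ranking open reading frames by a compositional skew between a candidate coding region and its upstream region can be illustrated with a minimal Python sketch. The abstract does not give the scoring function, so the score used here (the sum of absolute per-base frequency differences), the function names and the toy sequences are all assumptions made for illustration, not the published method.

```python
from collections import Counter

def base_fractions(seq):
    """Return the fraction of A, C, G and T in seq (other symbols are ignored)."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}

def composition_skew(orf, upstream):
    """Hypothetical score: sum of absolute per-base frequency differences."""
    f_orf = base_fractions(orf)
    f_up = base_fractions(upstream)
    return sum(abs(f_orf[b] - f_up[b]) for b in "ACGT")

def rank_orfs(candidates):
    """Rank (name, orf_seq, upstream_seq) tuples, largest skew first."""
    scored = [(name, composition_skew(orf, up)) for name, orf, up in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy sequences, invented for illustration only.
candidates = [
    ("orf_a", "ATGGCCGCGGGCTGA", "TTTTATTTCTTTTAT"),
    ("orf_b", "ATGTTTATTTTATAA", "TTTTATTTCTTTTAT"),
]
ranking = rank_orfs(candidates)
```

In this toy input, `orf_a` differs far more from its upstream region than `orf_b` does, so it is ranked first. A real implementation would presumably calibrate the score against known genes, as the abstract describes.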
Since the genome of Escherichia coli K-12 was initially annotated in 1997, additional functional information based on biological characterization and functions of sequence-similar proteins has become available. On the basis of this new information, an updated version of the annotated chromosome has been generated.
The E. coli K-12 chromosome is currently represented by 4,401 genes encoding 116 RNAs and 4,285 proteins. The boundaries of the genes identified in the GenBank Accession U00096 were used. Some protein-coding sequences are compound and encode multimodular proteins. The coding sequences (CDSs) are represented by modules (protein elements of at least 100 amino acids with biological activity and independent evolutionary history). There are 4,616 identified modules in the 4,285 proteins. Of these, 48.9% have been characterized, 29.5% have an imputed function, 2.1% have a phenotype and 19.5% have no function assignment. Only 7% of the modules appear unique to E. coli, and this number is expected to be reduced as more genome data becomes available. The imputed functions were assigned on the basis of manual evaluation of functions predicted by BLAST and DARWIN analyses and by the MAGPIE genome annotation system.
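The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the abstract; the variable names are illustrative.

```python
# Figures as stated in the abstract above.
genes_total = 4401
rna_genes = 116
protein_genes = 4285
modules_total = 4616

# Percentage of modules in each annotation category.
module_status_pct = {
    "characterized": 48.9,
    "imputed function": 29.5,
    "phenotype": 2.1,
    "no function assigned": 19.5,
}

# RNA genes plus protein-coding genes should account for every gene.
genes_consistent = (rna_genes + protein_genes == genes_total)

# The four categories should cover all modules (allowing for rounding).
pct_sum = sum(module_status_pct.values())
pct_consistent = abs(pct_sum - 100.0) < 0.1

# Average number of modules per protein; a value above 1 reflects the
# multimodular proteins mentioned in the text.
modules_per_protein = modules_total / protein_genes
```

Both consistency checks hold for the stated figures, and the ratio of modules to proteins is slightly above one, in line with the statement that some coding sequences encode multimodular proteins.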
Much knowledge has been gained about functions encoded by the E. coli K-12 genome since the 1997 annotation was published. The data presented here should be useful for analysis of E. coli gene products as well as gene products encoded by other genomes.
Trypanosoma brucei, the causative agent of African sleeping sickness, is a protist that presents numerous challenges to computational genome analysis. Its divergence from well-studied organisms limits the ability to identify true protein-coding regions and assign functions to these coding regions. Further, the limited resources available for the study of non-model organisms slow the pace of research into this important pathogen. Therefore, the application of computational methods that can facilitate the selection of likely coding regions for further investigation is critical to furthering our understanding of this unusual organism. This dissertation describes the development of computational methods for the identification and characterization of true protein-coding regions in T. brucei. To select likely coding regions, a method based on statistical profiles of known coding regions and their upstream, non-coding regions has been developed. This approach has been evaluated both computationally and experimentally and demonstrates high accuracy and confidence in individual predictions. To further refine the identification of likely coding regions, signals associated with a key post-transcriptional processing step known as trans-splicing have been characterized as well. As a result of this analysis, for the first time, trans-splicing sites can be predicted with reasonable accuracy and specificity. An automated annotation system has also been developed to suggest putative function based on sequence similarity to known proteins. When compared to the manual annotation, the system identifies the same or similar function for over three-quarters of coding regions that would be annotated by a human curator. The work presented here provides a powerful suite of methods for identifying, evaluating and annotating protein-coding genes in T. brucei. 
Further, the methods may be extended to other members of the trypanosomatid family or to organisms that share aspects of trypanosomatid biology. As such, this work offers an organism-specific paradigm for genome analysis that can complement other, generalized solutions for sequence review and annotation.
Trans-splicing is an unusual process in which two separate RNA strands are spliced together to yield a mature mRNA. We present a novel computational approach which has an overall accuracy of 82% and can predict 92% of known trans-splicing sites. We have applied our method to chromosomes 1 and 3 of Leishmania major, with high-confidence predictions for 85% and 88% of annotated genes, respectively. We suggest some extensions of our method to other systems.