Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove ...contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084 .
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records ...(http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
Leishmania parasites cause a broad spectrum of clinical disease. Here we report the sequencing of the genomes of two species of Leishmania: Leishmania infantum and Leishmania braziliensis. The ...comparison of these sequences with the published genome of Leishmania major reveals marked conservation of synteny and identifies only approximately 200 genes with a differential distribution between the three species. L. braziliensis, contrary to Leishmania species examined so far, possesses components of a putative RNA-mediated interference pathway, telomere-associated transposable elements and spliced leader-associated SLACS retrotransposons. We show that pseudogene formation and gene loss are the principal forces shaping the different genomes. Genes that are differentially distributed between the species encode proteins implicated in host-pathogen interactions and parasite survival in the macrophage.
Full text
Available for:
DOBA, IJS, IZUM, KILJ, NUK, PILJ, PNG, SAZU, UILJ, UKNU, UL, UM, UPUK
The Rice Annotation Project Database (RAP-DB) was created to provide the genome sequence assembly of the International Rice Genome Sequencing Project (IRGSP), manually curated annotation of the ...sequence, and other genomics information that could be useful for comprehensive understanding of the rice biology. Since the last publication of the RAP-DB, the IRGSP genome has been revised and reassembled. In addition, a large number of rice-expressed sequence tags have been released, and functional genomics resources have been produced worldwide. Thus, we have thoroughly updated our genome annotation by manual curation of all the functional descriptions of rice genes. The latest version of the RAP-DB contains a variety of annotation data as follows: clone positions, structures and functions of 31 439 genes validated by cDNAs, RNA genes detected by massively parallel signature sequencing (MPSS) technology and sequence similarity, flanking sequences of mutant lines, transposable elements, etc. Other annotation data such as Gnomon can be displayed along with those of RAP for comparison. We have also developed a new keyword search system to allow the user to access useful information. The RAP-DB is available at: http://rapdb.dna.affrc.go.jp/ and http://rapdb.lab.nig.ac.jp/.
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, ...Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allow maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI.
Full text
Available for:
BFBNIB, NMLJ, NUK, PNG, SAZU, UL, UM, UPUK
The primary structures of 11 proteins of ADP-glucose pyrophosphorylase are aligned and compared for relationships among them. These comparisons indicate that many domains are retained in the proteins ...from both the enteric bacteria and the proteins from angiosperm plants. The proteins from angiosperm plants show two main groups, with one of the main groups demonstrating two subgroups. The two main groups of angiosperm plant proteins are based upon the two subunits of the enzyme, whereas the subgroups of the large subunit group are based upon the tissue in which the particular gene had been expressed. Additionally, the small subunit group shows a slight but distinct division into a grouping based upon whether the protein is from a monocot or dicot source. Previous structure-function studies with the Escherichia coli enzyme have identified regions of the primary structure associated with the substrate binding site, the allosteric activator binding site, and the allosteric inhibitor binding site. There is conservation of the primary structure of the polypeptides for the substrate binding site and the allosteric activator binding site. The nucleotide sequences of the coding regions of the genes of 11 of these proteins are compared for relationships among them. This analysis indicates that the protein for the small subunit has been subject to greater selective pressure to retain a particular primary structure. Also, the coding region of the precursor gene for the small subunit diverged from the coding region of the precursor gene for the large subunits slightly prior to the divergence of the two coding regions of the genes for the two tissue-specific large subunit genes.
The National Center for Biotechnology Information (NCBI) provides a data-rich environment in support of genomic research by collecting the biological data for genomes, genes, gene expressions, gene ...variation, gene families, proteins, and protein domains and integrating the data with analytical, search, and retrieval resources through the NCBI Web site. Entrez, an integrated search and retrieval system, enables text searches across various diverse biological databases maintained at NCBI. Map Viewer, the genome browser developed at NCBI, displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allows maps from all plant genomes available in the Map Viewer to be searched to produce a display of aligned maps from several species. Customized Plant Basic Local Alignment Search Tool (PlantBLAST) allows the user to perform sequence similarity searches in a special collection of mapped plant sequence data and to view the resulting alignments within a genomic context using Map Viewer. In addition, pre-computed sequence similarities, such as those for proteins offered by BLAST Link (BLink), enable fluid navigation from un-annotated to annotated sequences, quickening the pace of discovery. Plant Genome Central (PGC) is a Web portal that provides centralized access to all NCBI plant genome resources. Also, there are links to plant-specific Web resources external to NCBI such as organism-specific databases, genome-sequencing project Web pages, and homepages of genomic bioinformatics organizations.
Near-full-length cDNA clones to the small and large subunit of the heterotetrameric potato tuber ADP-glucose pyrophosphorylase have been isolated and characterized. The missing amino terminal ...sequence of the small subunit has also been elucidated from its corresponding genomic clone. Primary sequence comparisons revealed that each potato subunit had less identity to each other than to their homologous subunit from other plants. It also appeared that the smaller subunit is more conserved among the different plants and the larger subunit more divergent. Amino acid comparisons of both potato tuber sequences to the Escherichia coli ADP-glucose pyrophosphorylase sequence revealed conserved regions important for both catalytic and allosteric function of the bacterial enzyme.