The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.
UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. We hypothesized that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters.
Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences: the searches are concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
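The hit-list expansion step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the search has already returned an ordered list of cluster identifiers, and the function name `expand_hits`, the cluster identifiers and the member accessions are all hypothetical.

```python
from typing import Dict, List


def expand_hits(representative_hits: List[str],
                clusters: Dict[str, List[str]]) -> List[str]:
    """Replace each cluster hit with all of its members.

    Hit order is preserved and each accession is emitted only once.
    A hit with no known cluster is kept as-is.
    """
    expanded: List[str] = []
    seen = set()
    for cluster_id in representative_hits:
        for member in clusters.get(cluster_id, [cluster_id]):
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


# Hypothetical toy data: two clusters and a ranked hit list.
clusters = {
    "UniRef50_A": ["A", "A1", "A2"],
    "UniRef50_B": ["B", "B1"],
}
hits = ["UniRef50_B", "UniRef50_A"]
expanded = expand_hits(hits, clusters)
print(expanded)
```

The search itself runs against the short list of representatives, and the full member list is recovered only afterwards by a dictionary lookup.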
The Clinical Proteomic Tumor Analysis Consortium (CPTAC), under the auspices of the National Cancer Institute’s Office of Cancer Clinical Proteomics Research, is a comprehensive and coordinated effort to accelerate the understanding of the molecular basis of cancer through the application of proteomic technologies and workflows to clinical tumor samples with characterized genomic and transcript profiles. The consortium analyzes cancer biospecimens using mass spectrometry, identifying and quantifying the constituent proteins and characterizing each tumor sample’s proteome. Mass spectrometry enables highly specific identification of proteins and their isoforms, accurate relative quantitation of protein abundance in contrasting biospecimens, and localization of post-translational protein modifications, such as phosphorylation, on a protein’s sequence. The combination of proteomics, transcriptomics, and genomics data from the same clinical tumor samples provides an unprecedented opportunity for tumor proteogenomics. The CPTAC Data Portal is the centralized data repository for the dissemination of proteomic data collected by Proteome Characterization Centers (PCCs) in the consortium. The portal currently hosts 6.3 TB of data and includes proteomic investigations of breast, colorectal, and ovarian tumor tissues from The Cancer Genome Atlas (TCGA). The data collected by the consortium is made freely available to the public through the data portal.
The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website http://www.uniprot.org was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the http://www.uniprot.org domain was switched to the newly developed site described in this paper in July 2008.
The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as for machines through programmatic access. http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to help@uniprot.org.
The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users.
Motivation: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences.
Results: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of ∼10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis.
Availability: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref
Contact: bes23@georgetown.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
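The UniRef100 idea described in the Results above, combining identical sequences and subfragments into a single entry, can be illustrated with a toy sketch. It is not the production UniRef procedure: it uses plain substring containment, takes the longest sequence as representative, and ignores any length or overlap thresholds. The function name and accessions are hypothetical.

```python
from typing import Dict, List


def merge_identical_and_subfragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sequences so that identical sequences and subfragments
    share one entry, keyed by the accession of the longest sequence."""
    clusters: Dict[str, List[str]] = {}
    # Longest first, so a representative is always seen before its fragments;
    # ties are broken by accession to keep the result deterministic.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for acc, seq in ordered:
        for rep in clusters:
            if seq in seqs[rep]:  # identical or contained in the representative
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # no containing sequence: new entry
    return clusters


# Hypothetical toy input: P2 duplicates P1, P3 is a fragment of P1.
toy = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQR",
    "P3": "TAYIA",
    "P4": "MSSSS",
}
merged = merge_identical_and_subfragments(toy)
print(merged)
```

In this toy input four source sequences collapse into two entries, the kind of size reduction the abstract attributes to removing redundancy.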
Stem cell antigen-1 (Sca-1) is used to isolate and characterize tumor-initiating cell populations from tumors of various murine models [1]. Sca-1-induced disruption of TGF-β signaling is required for in vivo tumorigenesis in breast cancer models [2, 3-5]. The role of the human Ly6 gene family is only beginning to be appreciated in recent literature [6-9]. To study the significance of Ly6 gene family members, we visualized 130 Gene Expression Omnibus (GEO) datasets using Oncomine (Invitrogen) and the Georgetown Database of Cancer (G-DOC). This analysis showed that four different members, Ly6D, Ly6E, Ly6H or Ly6K, have increased gene expression in bladder, brain and CNS, breast, colorectal, cervical, ovarian, lung, head and neck, pancreatic and prostate cancer compared with their normal counterpart tissues. Increased expression of Ly6D, Ly6E, Ly6H or Ly6K was observed in a subset of cancer types. The increased expression of Ly6D, Ly6E, Ly6H and Ly6K was found to be associated with poor outcome in ovarian, colorectal, gastric, breast, lung, bladder or brain and CNS cancers, as observed with the KM plotter and PROGgeneV2 platforms. The remarkable findings of increased expression of Ly6 family members and its positive correlation with poor patient survival in multiple cancer types indicate that the Ly6 family members Ly6D, Ly6E, Ly6K and Ly6H will be important targets in clinical practice, both as markers of poor prognosis and for developing novel therapeutics in multiple cancer types.
Understanding the association of genetic variation with its functional consequences in proteins is essential for the interpretation of genomic data and identifying causal variants in diseases. Integration of protein function knowledge with genome annotation can assist in rapidly comprehending genetic variation within complex biological processes. Here, we describe mapping UniProtKB human sequences and positional annotations, such as active sites, binding sites, and variants to the human genome (GRCh38) and the release of a public genome track hub for genome browsers. To demonstrate the power of combining protein annotations with genome annotations for functional interpretation of variants, we present specific biological examples in disease‐related genes and proteins. Computational comparisons of UniProtKB annotations and protein variants with ClinVar clinically annotated single nucleotide polymorphism (SNP) data show that 32% of UniProtKB variants colocate with 8% of ClinVar SNPs. The majority of colocated UniProtKB disease‐associated variants (86%) map to 'pathogenic' ClinVar SNPs. UniProt and ClinVar are collaborating to provide a unified clinical variant annotation for genomic, protein, and clinical researchers. The genome track hubs, and related UniProtKB files, are downloadable from the UniProt FTP site and discoverable as public track hubs at the UCSC and Ensembl genome browsers.
The ambient solubility of the mineral salts NaCl, KCl, CsCl, KBr, K₂SO₄ and CuSO₄ is reported as a function of composition of mixed binary solvents consisting of water with polyethylene glycol (PEG 200), ethoxylated C[…] alcohols (C[…]) and ethoxylated, or propoxylated C[…] alcohols (C[…]). Solubility gradually decreases with decreasing water content and follows the order PEG 200 > C[…] > C[…]. Solubility of CsCl and KBr was found to be surprisingly high in neat PEG 200, on the order of 1 mol kg⁻¹, and in neat C[…], on the order of 0.1 mol kg⁻¹. The observed solubility trends are explained by the theory of hard and soft acids and bases under the consideration of the polarity of the surfactants.
Subtypes of cigarette smoke-induced disease affect different lung structures and may have distinct pathophysiological mechanisms.
To determine if proteomic classification of the cellular and vascular origins of sputum proteins can characterize these mechanisms and phenotypes.
Individual sputum specimens from lifelong nonsmokers (n=7) and smokers with normal lung function (n=13), mucous hypersecretion with normal lung function (n=11), obstructed airflow without emphysema (n=15), and obstruction plus emphysema (n=10) were assessed with mass spectrometry. Data reduction, logarithmic transformation of spectral counts, and Cytoscape network-interaction analysis were performed. The original 203 proteins were reduced to the most informative 50. Sources were secretory dimeric IgA, submucosal gland serous and mucous cells, goblet and other epithelial cells, and vascular permeability.
Epithelial proteins discriminated nonsmokers from smokers. Mucin 5AC was elevated in healthy smokers and chronic bronchitis, suggesting a continuum with the severity of hypersecretion determined by mechanisms of goblet-cell hyperplasia. Obstructed airflow was correlated with glandular proteins and lower levels of Ig joining chain compared to other groups. Emphysema subjects' sputum was unique, with high plasma proteins and components of neutrophil extracellular traps, such as histones and defensins. In contrast, defensins were correlated with epithelial proteins in all other groups. Protein-network interactions were unique to each group.
The proteomes were interpreted as complex "biosignatures" that suggest distinct pathophysiological mechanisms for mucin 5AC hypersecretion, airflow obstruction, and inflammatory emphysema phenotypes. Proteomic phenotyping may improve genotyping studies by selecting more homogeneous study groups. Each phenotype may require its own mechanistically based diagnostic, risk-assessment, drug- and other treatment algorithms.
Tumor molecular profiling plays an integral role in identifying genomic anomalies which may help in personalizing cancer treatments, improving patient outcomes and minimizing risks associated with different therapies. However, critical information regarding the evidence of clinical utility of such anomalies is largely buried in biomedical literature. It is becoming prohibitive for biocurators, clinical researchers and oncologists to keep up with the rapidly growing volume and breadth of information, especially publications that describe therapeutic implications of biomarkers and are therefore relevant for treatment selection. In an effort to improve and speed up the process of manually reviewing and extracting relevant information from literature, we have developed a natural language processing (NLP)-based text mining (TM) system called eGARD (extracting Genomic Anomalies association with Response to Drugs). This system relies on the syntactic nature of sentences coupled with various textual features to extract relations between genomic anomalies and drug response from MEDLINE abstracts. Our system achieved high precision, recall and F-measure of up to 0.95, 0.86 and 0.90, respectively, on annotated evaluation datasets created in-house and obtained externally from PharmGKB. Additionally, the system extracted information that helps determine the confidence level of extraction to support prioritization of curation. Such a system will enable clinical researchers to explore the use of published markers to stratify patients upfront for 'best-fit' therapies and readily generate hypotheses for new clinical trials.
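As a quick consistency check on the reported scores, the F-measure can be recomputed as the harmonic mean of precision and recall. This sketch assumes the balanced F1 definition and that the three "up to" figures come from the same evaluation run; the abstract states neither, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract above.
f1 = f_measure(0.95, 0.86)
print(round(f1, 2))
```

Under those assumptions the recomputed value rounds to the 0.90 quoted in the abstract.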