Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. Protein functions, often referred to as annotations, are believed to manifest themselves through the topology of the networks of inter-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets.
We developed a method for automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly more densely linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease in the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller.
Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity.
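The comparison of link density within an annotation group against chance can be illustrated with a minimal sketch. It is not the procedure used in the study: the toy network, the function names `link_density` and `density_p_value`, and the choice of a simple random-resampling null model are all assumptions made for illustration.

```python
import random
from itertools import combinations


def link_density(group, edges):
    """Fraction of all possible pairs within `group` that are linked in `edges`."""
    members = sorted(group)
    possible = len(members) * (len(members) - 1) // 2
    if possible == 0:
        return 0.0
    linked = sum(
        1 for a, b in combinations(members, 2) if frozenset((a, b)) in edges
    )
    return linked / possible


def density_p_value(group, nodes, edges, trials=1000, seed=0):
    """Empirical p-value: how often a random group of the same size
    is at least as densely linked as the observed group."""
    rng = random.Random(seed)
    observed = link_density(group, edges)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(nodes, len(group))
        if link_density(sample, edges) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (trials + 1)


# Hypothetical toy network; protein names are placeholders, not real data.
edges = {
    frozenset(pair)
    for pair in [("A", "B"), ("A", "C"), ("B", "C"),
                 ("C", "D"), ("E", "F"), ("G", "H")]
}
nodes = sorted({name for edge in edges for name in edge})

observed, p_value = density_p_value({"A", "B", "C"}, nodes, edges)
print(f"observed density = {observed:.2f}, empirical p = {p_value:.3f}")
```

Sampling random groups of equal size is only one possible null model; a degree-preserving randomization of the network would be a stricter alternative.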
The RD114/simian type D retroviruses, which include the feline endogenous retrovirus RD114, all strains of simian immunosuppressive type D retroviruses, the avian reticuloendotheliosis group including spleen necrosis virus, and baboon endogenous virus, use a common cell-surface receptor for cell entry. We have used a retroviral cDNA library approach, involving transfer and expression of cDNAs from highly infectable HeLa cells to nonpermissive NIH 3T3 mouse cells, to clone and identify this receptor. The cloned cDNA, denoted RDR, is an allele of the previously cloned neutral amino acid transporter ATB0 (SLC1A5). Both RDR and ATB0 serve as retrovirus receptors and both show specific transport of neutral amino acids. We have localized the receptor by radiation hybrid mapping to a region of about 500-kb pairs on the long arm of human chromosome 19 at q13.3. Infection of cells with RD114/type D retroviruses results in impaired amino acid transport, suggesting a mechanism for virus toxicity and immunosuppression. The identification and functional characterization of this retrovirus receptor provide insight into the retrovirus life cycle and pathogenesis and will be an important tool for optimization of gene therapy using vectors derived from RD114/type D retroviruses.
p53 is a multifunctional tumor suppressor protein involved in the negative control of cell growth. Mutations in p53 cause alterations in cellular phenotype, including immortalization, neoplastic transformation, and resistance to DNA-damaging drugs. To help dissect distinct functions of p53, a set of genetic suppressor elements (GSEs) capable of inducing different p53-related phenotypes in rodent embryo fibroblasts was isolated from a retroviral library of random rat p53 cDNA fragments. All the GSEs were 100-300 nucleotides long and were in the sense orientation. They fell into four classes, corresponding to the transactivator (class I), DNA-binding (class II), and C-terminal (class III) domains of the protein and the 3′-untranslated region of the mRNA (class IV). GSEs in all four classes promoted immortalization of primary cells, but only members of classes I and III cooperated with activated ras to transform cells, and only members of class III conferred resistance to etoposide and strongly inhibited transcriptional transactivation by p53. These observations suggest that processes related to control of senescence, response to DNA damage, and transformation involve different functions of the p53 protein and furthermore indicate a regulatory role for the 3′-untranslated region of p53 mRNA.
In 2020, Novartis Pharmaceuticals Corporation and the U.S. Food and Drug Administration (FDA) started a 4‐year scientific collaboration to approach complex new data modalities and advanced analytics. The scientific question, pursued under a Research Collaboration Agreement, was to find novel radio‐genomics‐based prognostic and predictive factors for HR+/HER2− metastatic breast cancer. This collaboration has been providing valuable insights to help successfully implement future scientific projects, particularly those using artificial intelligence and machine learning. This tutorial aims to provide tangible guidelines for a multi‐omics project that includes multidisciplinary expert teams spanning different institutions. We cover key ideas, such as “maintaining effective communication” and “following good data science practices,” followed by the four steps of exploratory projects, namely (1) plan, (2) design, (3) develop, and (4) disseminate. We break each step into smaller concepts with strategies for implementation and provide illustrations from our collaboration to give readers further actionable guidance.
We demonstrate that protein–protein interaction networks in several eukaryotic organisms contain significantly more self-interacting proteins than expected if such homodimers appeared randomly in the course of evolution. We also show that on average homodimers have twice as many interaction partners as non-self-interacting proteins. More specifically, the likelihood of a protein physically interacting with itself was found to be proportional to the total number of its binding partners. These properties of dimers are in agreement with a phenomenological model in which individual proteins differ from each other by the degree of their ‘stickiness’, or general propensity toward interaction with other proteins, including themselves. A duplication of a self-interacting protein creates a pair of paralogous proteins interacting with each other. We show that such pairs occur more frequently than could be explained by pure chance alone. Similar to homodimers, proteins involved in heterodimers with their paralogs on average have twice as many interacting partners as the rest of the network. The likelihood of a pair of paralogous proteins to interact with each other was also shown to decrease with their sequence similarity. This points to the conclusion that most interactions between paralogs are inherited from ancestral homodimeric proteins, rather than established de novo after duplication. We finally discuss possible implications of our empirical observations from functional and evolutionary standpoints.
We describe a general strategy for cloning mammalian genes whose downregulation results in a selectable phenotype. This strategy is based on expression selection of genetic suppressor elements (GSEs), cDNA fragments encoding either specific peptides that act as dominant inhibitors of protein function or antisense RNA segments that efficiently inhibit gene expression. Since GSEs counteract the gene from which they are derived, they can be used as dominant selectable markers for the phenotype associated with downregulation of the corresponding gene. A retroviral library containing random fragments of normalized (uniform abundance) cDNA expressed in mouse NIH 3T3 cells was used to select for GSEs inducing resistance to the anticancer drug etoposide. Three GSEs were isolated, two of which are derived from unknown genes and the third encodes antisense RNA for the heavy chain of a motor protein kinesin. The kinesin-derived GSE induces resistance to several DNA-damaging drugs and immortalizes senescent mouse embryo fibroblasts, indicating that kinesin is involved in the mechanisms of drug sensitivity and in vitro senescence. Expression of the human kinesin heavy-chain gene was decreased in four of four etoposide-resistant HeLa cell lines, derived by conventional drug selection, indicating that downregulation of kinesin represents a natural mechanism of drug resistance in mammalian cells.
Motivation: The living cell is a complex machine that depends on the proper functioning of its numerous parts, including proteins. Understanding protein functions and how they modify and regulate each other is the next great challenge for life-sciences researchers. The collective knowledge about protein functions and pathways is scattered throughout numerous publications in scientific journals. Bringing the relevant information together becomes a bottleneck in a research and discovery process. The volume of such information grows exponentially, which renders manual curation impractical. As a viable alternative, automated literature processing tools could be employed to extract and organize biological data into a knowledge base, making it amenable to computational analysis and data mining. Results: We present MedScan, a completely automated natural language processing-based information extraction system. We have used MedScan to extract 2976 interactions between human proteins from MEDLINE abstracts dated after 1988. The precision of the extracted information was found to be 91%. Comparison with the existing protein interaction databases BIND and DIP revealed that 96% of extracted information is novel. The recall rate of MedScan was found to be 21%. Additional experiments with MedScan suggest that MEDLINE is a unique source of diverse protein function information, which can be extracted in a completely automated way with a reasonably high precision. Further directions of the MedScan technology improvement are discussed. Availability: MedScan is available for commercial licensing from Ariadne Genomics, Inc.
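For readers less familiar with the two metrics quoted above, precision and recall for a set of extracted interactions can be computed as in the sketch below. It uses the standard definitions only; the abstract does not describe the evaluation protocol, so the function name `precision_recall`, the treatment of interactions as unordered protein pairs, and the toy data are assumptions for illustration, not MedScan output.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of `extracted` against a `reference` set.

    precision = correct extracted items / all extracted items
    recall    = correct extracted items / all reference items
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical interactions, stored as unordered pairs of placeholder names.
extracted = {frozenset(p) for p in [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]}
reference = {frozenset(p) for p in [("P1", "P2"), ("P3", "P2"),
                                    ("P3", "P6"), ("P6", "P7")]}

precision, recall = precision_recall(extracted, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Storing each pair as a `frozenset` makes the comparison insensitive to the order in which the two proteins are named, which suits undirected binding interactions; directed relations would need ordered tuples instead.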
Objectives: Anaphylaxis is a severe life-threatening allergic reaction, and its accurate identification in healthcare databases can harness the potential of “Big Data” for healthcare or public health purposes. Materials and methods: This study used claims data obtained between October 1, 2015 and February 28, 2019 from the CMS database to examine the utility of machine learning in identifying incident anaphylaxis cases. We created a feature selection pipeline to identify critical features between different datasets. Then a variety of unsupervised and supervised methods were used (eg, Sammon mapping and eXtreme Gradient Boosting) to train models on datasets of differing data quality, which reflects the varying availability and potential rarity of ground truth data in medical databases. Results: Resulting machine learning model accuracies ranged from 47.7% to 94.4% when tested on ground truth data. Finally, we found new features to help experts enhance existing case-finding algorithms. Discussion: Developing precise algorithms to detect medical outcomes in claims can be a laborious and expensive process, particularly for conditions presented and coded diversely. We found it beneficial to filter out highly potent codes used for data curation to identify underlying patterns and features. To improve rule-based algorithms where necessary, researchers could use model explainers to determine noteworthy features, which could then be shared with experts and included in the algorithm. Conclusion: Our work suggests machine learning models can perform at similar levels as a previously published expert case-finding algorithm, while also having the potential to improve performance or streamline algorithm construction processes by identifying new relevant features for algorithm construction.