Recently, automatically extracting biomedical relations has been a significant subject in biomedical research due to the rapid growth of biomedical literature. Since the adaptation to the biomedical ...domain, the transformer-based BERT models have produced leading results on many biomedical natural language processing tasks. In this work, we will explore the approaches to improve the BERT model for relation extraction tasks in both the pre-training and fine-tuning stages of its applications. In the pre-training stage, we add another level of BERT adaptation on sub-domain data to bridge the gap between domain knowledge and task-specific knowledge. Also, we propose methods to incorporate the ignored knowledge in the last layer of BERT to improve its fine-tuning.
The experiment results demonstrate that our approaches for pre-training and fine-tuning can improve the BERT model performance. After combining the two proposed techniques, our approach outperforms the original BERT models with averaged F1 score improvement of 2.1% on relation extraction tasks. Moreover, our approach achieves state-of-the-art performance on three relation extraction benchmark datasets.
The extra pre-training step on sub-domain data can help the BERT model generalization on specific tasks, and our proposed fine-tuning mechanism could utilize the knowledge in the last layer of BERT to boost the model performance. Furthermore, the combination of these two approaches further improves the performance of BERT model on the relation extraction tasks.
Chinese hamster ovary (CHO) cells are widely used for mass production of therapeutic proteins in the pharmaceutical industry. With the growing need in optimizing the performance of producer CHO cell ...lines, research on CHO cell line development and bioprocess continues to increase in recent decades. Bibliographic mapping and classification of relevant research studies will be essential for identifying research gaps and trends in literature. To qualitatively and quantitatively understand the CHO literature, we have conducted topic modeling using a CHO bioprocess bibliome manually compiled in 2016, and compared the topics uncovered by the Latent Dirichlet Allocation (LDA) models with the human labels of the CHO bibliome. The results show a significant overlap between the manually selected categories and computationally generated topics, and reveal the machine-generated topic-specific characteristics. To identify relevant CHO bioprocessing papers from new scientific literature, we have developed supervized models using Logistic Regression to identify specific article topics and evaluated the results using three CHO bibliome datasets, Bioprocessing set, Glycosylation set, and Phenotype set. The use of top terms as features supports the explainability of document classification results to yield insights on new CHO bioprocessing papers.
Significant progress has been made in applying deep learning on natural language processing tasks recently. However, deep learning models typically require a large amount of annotated training data ...while often only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large-size datasets for deep learning is expensive since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data using distant supervision. However, data obtained by distant supervision is often noisy, we first apply some heuristics to remove some of the incorrect annotations. Then using methods inspired from transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
Abstract
Protein post-translational modifications (PTMs) play a pivotal role in numerous biological processes by modulating regulation of protein function. We have developed iPTMnet ...(http://proteininformationresource.org/iPTMnet) for PTM knowledge discovery, employing an integrative bioinformatics approach-combining text mining, data mining, and ontological representation to capture rich PTM information, including PTM enzyme-substrate-site relationships, PTM-specific protein-protein interactions (PPIs) and PTM conservation across species. iPTMnet encompasses data from (i) our PTM-focused text mining tools, RLIMS-P and eFIP, which extract phosphorylation information from full-scale mining of PubMed abstracts and full-length articles; (ii) a set of curated databases with experimentally observed PTMs; and iii) Protein Ontology that organizes proteins and PTM proteoforms, enabling their representation, annotation and comparison within and across species. Presently covering eight major PTM types (phosphorylation, ubiquitination, acetylation, methylation, glycosylation, S-nitrosylation, sumoylation and myristoylation), iPTMnet knowledgebase contains more than 654 500 unique PTM sites in over 62 100 proteins, along with more than 1200 PTM enzymes and over 24 300 PTM enzyme-substrate-site relations. The website supports online search, browsing, retrieval and visual analysis for scientific queries. Several examples, including functional interpretation of phosphoproteomic data, demonstrate iPTMnet as a gateway for visual exploration and systematic analysis of PTM networks and conservation, thereby enabling PTM discovery and hypothesis generation.
MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often ...reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes.
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public ...knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language ...information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.
The Gene Ontology (GO) is a resource that supplies information about gene product function using ontologies to represent biological knowledge. These ontologies cover three domains: Cellular Component ...(CC), Molecular Function (MF), and Biological Process (BP). GO annotation is a process which assigns gene functional information using GO terms to relevant genes in the literature. It is a common task among the Model Organism Database (MOD) groups. Manual GO annotation relies on human curators assigning gene functional information using GO terms by reading the biomedical literature. This process is very time-consuming and labor-intensive. As a result, many MODs can afford to curate only a fraction of relevant articles.
GO terms from the CC domain can be essentially divided into two sub-hierarchies: subcellular location terms, and protein complex terms. We cast the task of gene annotation using GO terms from the CC domain as relation extraction between gene and other entities: (1) extract cases where a protein is found to be in a subcellular location, and (2) extract cases where a protein is a subunit of a protein complex. For each relation extraction task, we use an approach based on triggers and syntactic dependencies to extract the desired relations among entities.
We tested our approach on the BC4GO test set, a publicly available corpus for GO annotation. Our approach obtains a F1-score of 71%, a precision of 91% and a recall of 58% for predicting GO terms from CC Domain for given genes.
We have described a novel approach of treating gene annotation with GO terms from CC domain as two relation extraction subtasks. Evaluation results show that our approach achieves a F1-score of 71% for predicting GO terms for given genes. Thereby our approach can be used to accelerate the process of GO annotation for the bio-annotators.
Abstract
Recent technological advances in glycobiology have resulted in a large influx of data and the publication of many papers describing discoveries in glycoscience. However, the terms used in ...describing glycan structural features are not standardized, making it difficult to harmonize data across biomolecular databases, hampering the harvesting of information across studies and hindering text mining and curation efforts. To address this shortcoming, the Glycan Structure Dictionary has been developed as a reference dictionary to provide a standardized list of widely used glycan terms that can help in the curation and mapping of glycan structures described in publications. Currently, the dictionary has 190 glycan structure terms with 297 synonyms linked to 3,332 publications. For a term to be included in the dictionary, it must be present in at least 2 peer-reviewed publications. Synonyms, annotations, and cross-references to GlyTouCan, GlycoMotif, and other relevant databases and resources are also provided when available. The purpose of this effort is to facilitate biocuration, assist in the development of text mining tools, improve the harmonization of search, and browse capabilities in glycoinformatics resources and help to map glycan structures to function and disease. It is also expected that authors will use these terms to describe glycan structures in their manuscripts over time. A mechanism is also provided for researchers to submit terms for potential incorporation. The dictionary is available at https://wiki.glygen.org/Glycan_structure_dictionary.
Heat stress triggers an evolutionarily conserved set of responses in cells. The transcriptome responds to hyperthermia by altering expression of genes to adapt the cell or organism to survive the ...heat challenge. RNA-seq technology allows rapid identification of environmentally responsive genes on a large scale. In this study, we have used RNA-seq to identify heat stress responsive genes in the chicken male white leghorn hepatocellular (LMH) cell line. The transcripts of 812 genes were responsive to heat stress (p<0.01) with 235 genes upregulated and 577 downregulated following 2.5 h of heat stress. Among the upregulated were genes whose products function as chaperones, along with genes affecting collagen synthesis and deposition, transcription factors, chromatin remodelers, and genes modulating the WNT and TGF-beta pathways. Predominant among the downregulated genes were ones that affect DNA replication and repair along with chromosomal segregation. Many of the genes identified in this study have not been previously implicated in the heat stress response. These data extend our understanding of the transcriptome response to heat stress with many of the identified biological processes and pathways likely to function in adapting cells and organisms to hyperthermic stress. Furthermore, this study should provide important insight to future efforts attempting to improve species abilities to withstand heat stress through genome-wide association studies and breeding.