Abstract
Motivation
Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.
Results
We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
Availability and implementation
We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
Traditional drug discovery approaches identify a target for a disease and find a compound that binds to the target. In this approach, the structures of compounds are considered the most important features because it is assumed that similar structures will bind to the same target. Therefore, structural analogs of the drugs that bind to the target are selected as drug candidates. However, compounds that are not structural analogs may still achieve the desired response. A new drug discovery method based on drug response, which can complement the structure-based methods, is needed.
We implemented Siamese neural networks called ReSimNet that take as input two chemical compounds and predict the CMap score of the two compounds, which we use to measure the transcriptional response similarity of the two compounds. ReSimNet learns the embedding vector of a chemical compound in a transcriptional response space. ReSimNet is trained to minimize the difference between the cosine similarity of the embedding vectors of the two compounds and the CMap score of the two compounds. ReSimNet can find pairs of compounds that are similar in response even though they may have dissimilar structures. In our quantitative evaluation, ReSimNet outperformed the baseline machine learning models. The ReSimNet ensemble model achieves a Pearson correlation of 0.518 and a precision@1% of 0.989. In addition, in the qualitative analysis, we tested ReSimNet on the ZINC15 database and showed that ReSimNet successfully identifies chemical compounds that are relevant to a prototype drug whose mechanism of action is known.
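The training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says the "difference" is minimized, so the squared error used here is an assumption, as is a target score already scaled to the range [-1, 1]. The function names are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pair_loss(emb_a, emb_b, target_score):
    """Squared gap between the embedding similarity and the target score.

    Assumption: squared error stands in for the unspecified "difference",
    and target_score is a response-similarity score scaled to [-1, 1].
    """
    return (cosine_similarity(emb_a, emb_b) - target_score) ** 2
```

In a Siamese setup, both embeddings would come from the same encoder with shared weights, so lowering this loss pulls compounds with similar responses together in the embedding space regardless of their structures.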
The source code and the pre-trained weights of ReSimNet are available at https://github.com/dmis-lab/ReSimNet.
Supplementary data are available at Bioinformatics online.
Abstract
Fusion genes represent an important class of biomarkers and therapeutic targets in cancer. ChimerDB is a comprehensive database of fusion genes encompassing analysis of deep sequencing data (ChimerSeq) and text mining of publications (ChimerPub) with extensive manual annotations (ChimerKB). In this update, we present all three modules substantially enhanced by incorporating the recent flood of deep sequencing data and related publications. ChimerSeq now covers all 10 565 patients in the TCGA project, with compilation of computational results from two reliable programs, STAR-Fusion and FusionScan, along with several public resources. In sum, ChimerSeq includes 65 945 fusion candidates, 21 106 of which were predicted by multiple programs (ChimerSeq-Plus). ChimerPub has been upgraded by applying a deep learning method for text mining followed by extensive manual curation, which yielded 1257 fusion genes including 777 cases with experimental support (ChimerPub-Plus). ChimerKB includes 1597 fusion genes with publication support, experimental evidence and breakpoint information. Importantly, we implemented several new features to aid estimation of functional significance, including the fusion structure viewer with domain information, gene expression plot of fusion positive versus negative patients and a STRING network viewer. The user interface was also greatly enhanced by applying responsive web design. ChimerDB 4.0 is available at http://www.kobic.re.kr/chimerdb/.
Ubiquitination controls the stability of most cellular proteins, and its deregulation contributes to human diseases including cancer. Deubiquitinases remove ubiquitin from proteins, and their inhibition can induce the degradation of selected proteins, potentially including otherwise 'undruggable' targets. For example, the inhibition of ubiquitin-specific protease 7 (USP7) results in the degradation of the oncogenic E3 ligase MDM2, and leads to re-activation of the tumour suppressor p53 in various cancers. Here we report that two compounds, FT671 and FT827, inhibit USP7 with high affinity and specificity in vitro and within human cells. Co-crystal structures reveal that both compounds target a dynamic pocket near the catalytic centre of the auto-inhibited apo form of USP7, which differs from other USP deubiquitinases. Consistent with USP7 target engagement in cells, FT671 destabilizes USP7 substrates including MDM2, increases levels of p53, and results in the transcription of p53 target genes, induction of the tumour suppressor p21, and inhibition of tumour growth in mice.
Activation of the phosphoinositide 3-kinase (PI3K) pathway occurs frequently in breast cancer. However, clinical results of single-agent PI3K inhibitors have been modest to date. A combinatorial drug screen on multiple PIK3CA mutant cancers with decreased sensitivity to PI3K inhibitors revealed that combined CDK 4/6-PI3K inhibition synergistically reduces cell viability. Laboratory studies revealed that sensitive cancers suppress RB phosphorylation upon treatment with single-agent PI3K inhibitors but cancers with reduced sensitivity fail to do so. Similarly, patients’ tumors that responded to the PI3K inhibitor BYL719 demonstrated suppression of pRB, while nonresponding tumors showed sustained or increased levels of pRB. Importantly, the combination of PI3K and CDK 4/6 inhibitors overcomes intrinsic and adaptive resistance leading to tumor regressions in PIK3CA mutant xenografts.
• Synergy exists between inhibitions of CDK 4/6 and PI3K in PIK3CA mutant breast cancer.
• CDK 4/6-PI3K inhibition is effective in several PIK3CA mutant xenograft tumor models.
• Failure to suppress pRB correlates with resistance to PI3K inhibitors in patients.
PI3K inhibitors have only modest clinical efficacy in breast cancers with an aberrantly activated PI3K pathway. Vora et al. show that inhibiting CDK 4/6 overcomes intrinsic and adaptive resistance to PI3K inhibitors in these tumors and that reduction of phosphorylated RB is a good biomarker for the response.
As the volume of publications rapidly increases, searching for relevant information from the literature becomes more challenging. To complement standard search engines such as PubMed, it is desirable to have an advanced search tool that directly returns relevant biomedical entities such as targets, drugs, and mutations rather than a long list of articles. Some existing tools submit a query to PubMed and process retrieved abstracts to extract information at query time, resulting in a slow response time and limited coverage of only a fraction of the PubMed corpus. Other tools preprocess the PubMed corpus to speed up the response time; however, they are not constantly updated, and thus produce outdated results. Further, most existing tools cannot process sophisticated queries such as searches for mutations that co-occur with query terms in the literature. To address these problems, we introduce BEST, a biomedical entity search tool. BEST returns, as a result, a list of 10 different types of biomedical entities including genes, diseases, drugs, targets, transcription factors, miRNAs, and mutations that are relevant to a user's query. To the best of our knowledge, BEST is the only system that processes free text queries and returns up-to-date results in real time including mutation information in the results. BEST is freely accessible at http://best.korea.ac.kr.
Embedding techniques for converting high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research. In deep learning models, embedding is commonly used and has proven to be more effective than naive binary representation. However, no attempt has yet been made to embed highly sparse mutation profiles into densely distributed representations. Since binary representation does not capture biological context, its use is limited in many applications such as discovering novel driver mutations. Additionally, training distributed representations of mutations is challenging due to the relatively small amount of available biological data compared with the large amount of text corpus data in text mining fields.
We introduce Mut2Vec, a novel computational pipeline that can be used to create a distributed representation of cancerous mutations. Mut2Vec is trained on cancer profiles using Skip-Gram since cancer can be characterized by a series of co-occurring mutations. We also augmented our pipeline with existing information in the biomedical literature and protein-protein interaction networks to compensate for the data insufficiency.
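To make the Skip-Gram idea concrete, here is a minimal sketch of how co-occurring mutations in patient profiles could be turned into (target, context) training pairs. It rests on assumptions that are not in the abstract: each profile is treated as an unordered set of mutated genes, and every other mutation in the same profile counts as context. The function name and the gene symbols in the usage note are illustrative only, and the actual Mut2Vec pipeline may differ.

```python
from itertools import permutations


def skipgram_pairs(profiles):
    """Build (target, context) pairs from per-patient mutation profiles.

    Assumption: mutations in one profile are unordered, so the whole
    profile serves as the context window for each of its mutations.
    """
    pairs = []
    for profile in profiles:
        genes = sorted(set(profile))  # drop duplicates, fix the order
        pairs.extend(permutations(genes, 2))  # all ordered pairs
    return pairs
```

For example, `skipgram_pairs([["TP53", "KRAS"]])` yields the two ordered pairs `("KRAS", "TP53")` and `("TP53", "KRAS")`. Pairs of this form could then be passed to any standard Skip-Gram trainer.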
To evaluate our models, we conducted two experiments that involved the following tasks: (a) visualizing driver and passenger mutations and (b) identifying novel driver mutations using a clustering method. Our visualization showed a clear distinction between passenger mutations and driver mutations. We also found driver mutation candidates and confirmed through our literature survey that these were true driver mutations. The pre-trained mutation vectors and the candidate driver mutations are publicly available at http://infos.korea.ac.kr/mut2vec.
We introduce Mut2Vec, which can be used to generate distributed representations of mutations, and experimentally validate the efficacy of the generated mutation representations. Mut2Vec can be used in various deep learning applications such as cancer classification and drug sensitivity prediction.
Castration-resistant prostate cancer (CRPC) is the most aggressive, incurable form of prostate cancer. MDV3100 (enzalutamide), an antagonist of the androgen receptor (AR), was approved for clinical use in men with metastatic CRPC. Although this compound showed clinical efficacy, many initial responders later developed resistance. To uncover relevant resistance mechanisms, we developed a model of spontaneous resistance to MDV3100 in LNCaP prostate cancer cells. Detailed characterization revealed that emergence of an F876L mutation in AR correlated with blunted AR response to MDV3100 and sustained proliferation during treatment. Functional studies confirmed that AR(F876L) confers an antagonist-to-agonist switch that drives phenotypic resistance. Finally, treatment with distinct antiandrogens or cyclin-dependent kinase (CDK)4/6 inhibitors effectively antagonized AR(F876L) function. Together, these findings suggest that emergence of F876L may (i) serve as a novel biomarker for prediction of drug sensitivity, (ii) predict a "withdrawal" response to MDV3100, and (iii) be suitably targeted with other antiandrogens or CDK4/6 inhibitors.
We uncovered an F876L agonist-switch mutation in AR that confers genetic and phenotypic resistance to the antiandrogen drug MDV3100. On the basis of this finding, we propose new therapeutic strategies to treat patients with prostate cancer presenting with this AR mutation.
Large language models (LLMs) have revolutionized the global landscape of technology beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in terms of their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution, wherein challenges such as "hallucination," high training cost, and efficiency issues are addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs through a succinct overview of the topic and a summary of radiology-specific aspects, from the beginning to potential future directions.