Data Mining; Text Mining; Health Informatics; Health Care Information Systems; Medical Terminologies; Natural Language Processing; Text Analysis; Support Vector Machines
Background: The increasing complexity of data streams and computational processes in modern clinical health information systems makes reproducibility challenging. Clinical natural language processing (NLP) pipelines are routinely leveraged for the secondary use of data. Workflow management systems (WMS) have been widely used in bioinformatics to address the reproducibility bottleneck. Objective: To evaluate whether WMS and other bioinformatics practices could improve the reproducibility of clinical NLP frameworks. Materials and Methods: Based on the literature across multiple research fields (NLP, bioinformatics, and clinical informatics), we selected articles that (1) review reproducibility practices and (2) propose rules or guidelines to ensure tool or pipeline reproducibility. We aggregated insights from the literature to define reproducibility recommendations. Finally, we assessed the compliance of 7 NLP frameworks with these recommendations. Results: We identified 40 reproducibility features from the 8 selected articles. Frameworks based on WMS matched more than 50% of the features (26 features for LAPPS Grid, 22 for OpenMinTeD), compared with 18 features for current clinical NLP frameworks (cTAKES, CLAMP) and 17 for GATE, scispaCy, and TextFlows. Discussion: 34 recommendations are endorsed by at least 2 of the selected articles. Overall, 15 features were adopted by every NLP framework. Nevertheless, the WMS-based frameworks showed better compliance with the features. Conclusion: NLP frameworks could benefit from lessons learned in bioinformatics (e.g., public repositories of curated tools and workflows, or the use of containers for shareability) to enhance reproducibility in clinical settings.
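To make the flavor of these recommendations concrete (pinned dependencies, recorded tool and model versions, machine-readable outputs), here is a minimal Python sketch, assuming scispaCy and its en_core_sci_sm model are installed; the clinical note is invented for illustration, and this is not any of the surveyed frameworks' actual code.

    # Minimal sketch: record the exact environment next to the NLP output so
    # a run can be reproduced later (one of the practices discussed above).
    import json
    import spacy

    nlp = spacy.load("en_core_sci_sm")  # assumes the scispaCy model is installed
    note = "Patient denies chest pain; started metformin 500 mg daily."  # invented

    doc = nlp(note)
    result = {
        "entities": [(ent.text, ent.start_char, ent.end_char) for ent in doc.ents],
        "provenance": {  # versions pinned alongside the result
            "spacy_version": spacy.__version__,
            "model": nlp.meta["name"],
            "model_version": nlp.meta["version"],
        },
    }
    print(json.dumps(result, indent=2))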
The continuous growth of the scientific literature brings innovations and, at the same time, raises new challenges. One of them is that analysing this literature has become difficult due to the high volume of published papers, which require manual effort to annotate and manage. Novel technological infrastructures are needed to help researchers, research policy makers, and companies browse, analyse, and forecast scientific research time-efficiently. Knowledge graphs, i.e., large networks of entities and relationships, have proved to be an effective solution in this space. Scientific knowledge graphs focus on the scholarly domain and typically contain metadata describing research publications, such as authors, venues, organizations, research topics, and citations. However, the current generation of knowledge graphs lacks an explicit representation of the knowledge presented in the research papers themselves. As such, in this paper we present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications and integrates them into a large-scale knowledge graph. Within this research work, we (i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, (ii) describe an approach for integrating the entities and relationships generated by these tools, (iii) show the advantage of such a hybrid system over alternative approaches, and (iv) as a chosen use case, generate a scientific knowledge graph including 109,105 triples extracted from 26,827 abstracts of papers within the Semantic Web domain. As our approach is general and can be applied to any domain, we expect that it can facilitate the management, analysis, dissemination, and processing of scientific knowledge.
• A novel hybrid approach to extract knowledge from textual resources and build Knowledge Graphs is proposed.
• The hybrid approach offers advantages over methods focused solely on supervised classification.
• A Knowledge Graph about the Semantic Web domain has been released as a result of the proposed architecture.
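As a hedged illustration of the triple-extraction step described in the abstract above, the sketch below pairs entities that co-occur in a sentence into candidate (subject, predicate, object) triples. The en_core_web_sm model and the placeholder relatedTo predicate are assumptions for the example; the paper's architecture integrates several dedicated extraction tools rather than this heuristic.

    # Sketch: candidate triples from an abstract via sentence-level
    # entity co-occurrence (a naive stand-in for the paper's extractors).
    from itertools import combinations
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed generic model, not the paper's tools

    def candidate_triples(abstract: str):
        triples = []
        for sent in nlp(abstract).sents:
            ents = [ent.text for ent in sent.ents]
            for subj, obj in combinations(ents, 2):
                triples.append((subj, "relatedTo", obj))  # placeholder predicate
        return triples

    print(candidate_triples("OWL ontologies are queried with SPARQL, a W3C standard."))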
This study examines whether user-generated content (UGC) is related to stock market performance, which metric of UGC has the strongest relationship, and what the dynamics of the relationship are. We aggregate UGC from multiple websites over a four-year period across 6 markets and 15 firms. We derive multiple metrics of UGC and use multivariate time-series models to assess the relationship between UGC and stock market performance.
Volume of chatter significantly leads abnormal returns by a few days (supported by Granger causality tests). Of all the metrics of UGC, volume of chatter has the strongest positive effect on abnormal returns and trading volume. The effect of negative and positive metrics of UGC on abnormal returns is asymmetric. Whereas negative UGC has a significant negative effect on abnormal returns with a short "wear-in" and long "wear-out," positive UGC has no significant effect on these metrics. The volume of chatter and negative chatter have a significant positive effect on trading volume. Idiosyncratic risk increases significantly with negative information in UGC. Positive information does not have much influence on the risk of the firm. An increase in off-line advertising significantly increases the volume of chatter and decreases negative chatter. These results have important implications for managers and investors.
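The lead-lag claim above rests on Granger causality tests; the following Python sketch shows the shape of such a test with statsmodels on synthetic series (the paper's UGC and abnormal-return data are not reproduced, so the numbers below are placeholders).

    # Sketch: does chatter volume Granger-cause abnormal returns?
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(0)
    n = 500
    chatter = rng.normal(size=n)
    # Synthetic returns that lag chatter by 2 days, so the test has signal.
    returns = 0.5 * np.roll(chatter, 2) + rng.normal(scale=0.5, size=n)

    df = pd.DataFrame({"abnormal_returns": returns, "chatter": chatter})
    # Tests H0: the second column does NOT Granger-cause the first.
    res = grangercausalitytests(df[["abnormal_returns", "chatter"]], maxlag=5)
    for lag, (tests, _) in res.items():
        print(f"lag {lag}: ssr F-test p-value = {tests['ssr_ftest'][1]:.4f}")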
Existing general clinical natural language processing (NLP) systems such as MetaMap and the Clinical Text Analysis and Knowledge Extraction System have been successfully applied to information extraction from clinical text. However, end users often have to customize existing systems for their individual tasks, which can require substantial NLP skills. Here we present CLAMP (Clinical Language Annotation, Modeling, and Processing), a newly developed clinical NLP toolkit that provides not only state-of-the-art NLP components but also a user-friendly graphical user interface that helps users quickly build customized NLP pipelines for their individual applications. Our evaluation shows that the CLAMP default pipeline achieved good performance on named entity recognition and concept encoding. We also demonstrate the efficiency of the CLAMP graphical user interface in building customized, high-performance NLP pipelines with 2 use cases: extracting smoking status and lab test values. CLAMP is publicly available for research use, and we believe it is a unique asset for the clinical NLP community.
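The lab-test use case lends itself to a small example. The sketch below is not CLAMP's API (CLAMP pipelines are assembled through its graphical interface); it is a hypothetical regex-based stand-in showing what extracting lab test values from a note amounts to.

    # Hypothetical stand-in for the lab-test-value use case; CLAMP builds
    # such extractors through its pipeline editor, not through this code.
    import re

    LAB_PATTERN = re.compile(
        r"(?P<test>HbA1c|glucose|creatinine)\s*[:=]?\s*"
        r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>%|mg/dL|mmol/L)?",
        re.IGNORECASE,
    )

    note = "Labs today: HbA1c 7.2 %, glucose 145 mg/dL."  # invented note
    for m in LAB_PATTERN.finditer(note):
        print(m.group("test"), m.group("value"), m.group("unit"))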
In emergency situations, users of social networks convey all sorts of what have been called communicative intentions, well known since the work of Austin (1962) and Searle (1969) as speech acts (SA). While speech acts have been the focus of close scrutiny in the philosophical and linguistic literature (see Portner (2018) for extended discussion), their role has only rarely been understood and exploited in processing social media content about crisis events, our focus here. Current work on communicative intentions in social media is topic-oriented, focusing on the correlation between SA and specific topics such as crises (e.g., earthquakes) but also politics, celebrities, cooking, travel, etc. It has been observed that people globally tend to react to natural disasters with SA distinct from those used in other contexts (e.g., posts about celebrities, which essentially consist of comments). Here, we explore the further hypothesis of a correlation between different SA types and urgency, and propose an in-depth linguistic and computational analysis of communicative intentions in tweets from an urgency-oriented perspective. Indeed, SA are mostly relevant for identifying intentions, desires, plans, and preferences towards action, and ultimately for producing a system intended to help rescue teams. Our contribution is four-fold and consists of: (1) a two-layer annotation scheme of speech acts at both the tweet and sub-tweet levels; (2) a new French dataset of about 13K tweets annotated for both urgency and SA, targeting both expected (e.g., storms) and unexpected or sudden (e.g., building collapse, explosion) events; (3) a thorough analysis of the annotations, studying in particular the correlation between SA and the urgency of the message, between SA and intention-to-act categories (e.g., human damages), and between SA and crisis types; and, finally, (4) a set of deep learning experiments to detect SA in crisis-related corpora. Our results show a strong correlation between SA and urgency annotations at both the tweet and sub-tweet levels, with a particularly salient correlation in the latter case, which constitutes a first important step towards SA-aware NLP-based crisis management on social media.
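For contribution (4), a typical setup would fine-tune a French pretrained encoder for tweet-level SA classification. The sketch below is an assumption-laden illustration: camembert-base and the four-label set are stand-ins, the classification head is untrained here, and the paper's 13K-tweet dataset is not reproduced.

    # Sketch: transformer encoder wired for speech-act classification.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    LABELS = ["assertive", "directive", "expressive", "commissive"]  # toy label set
    tok = AutoTokenizer.from_pretrained("camembert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "camembert-base", num_labels=len(LABELS)
    )

    batch = tok(["Envoyez des secours rue Victor Hugo !"],
                return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits  # head is untrained: scores are meaningless
    print(LABELS[int(logits.argmax())])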
Neural machine translation (MT) systems have made tangible progress in recent years, which has made them usable for a growing number of domains and language pairs. Neural systems rely on machine learning algorithms, and their development requires large electronic corpora of parallel texts, aligned at the sentence level, resources that exist only for a small number of language pairs and domains. To overcome this shortage, a recent proposal is to develop so-called "multilingual" translation systems. These developments have been driven in particular by the major Internet companies, for which handling as many languages as possible is a major challenge. The main characteristic of these architectures is their ability to handle multiple languages, both on the source and target sides, with a single translation engine. In this contribution, we present the general principles underlying these systems and the innovations that have made them possible, before discussing their main strengths and weaknesses.
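As a concrete, hedged illustration of "multiple languages on both the source and target sides with a single engine", the sketch below uses the publicly released M2M100 model through Hugging Face transformers; M2M100 is one example of such architectures, not necessarily the systems discussed in the article.

    # One model, many language pairs: the defining property of the
    # multilingual architectures surveyed above.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    def translate(text: str, src: str, tgt: str) -> str:
        tok.src_lang = src  # declare the source language
        batch = tok(text, return_tensors="pt")
        out = model.generate(
            **batch,
            forced_bos_token_id=tok.get_lang_id(tgt),  # select the target language
        )
        return tok.batch_decode(out, skip_special_tokens=True)[0]

    # Same engine, different source/target pairs:
    print(translate("La traduction neuronale progresse vite.", "fr", "en"))
    print(translate("La traduction neuronale progresse vite.", "fr", "de"))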