Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle Twitter jargon. This work introduces a different approach to Twitter sentiment analysis based on two steps. First, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to different languages. Second, the resulting tweets are classified using the BERT language model, pre-trained on plain text instead of tweets, for two reasons: (1) pre-trained models on plain text are readily available in many languages, avoiding resource- and time-consuming model training on tweets from scratch; (2) available plain-text corpora are larger than tweet-only ones, therefore allowing better performance. A case study describing the application of the approach to Italian is presented, with a comparison against other existing Italian solutions. The results show the effectiveness of the approach and indicate that, thanks to its methodologically general basis, it is also promising for other languages.
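The first step above, turning tweet jargon into plain text, can be sketched as a simple substitution pass. This is a minimal illustration, not the paper's actual pipeline: the mini-lexicon `GLOSSES` and the function name `normalize_tweet` are invented here, and a real system would use a full emoji/emoticon table per target language.

```python
import re

# Hypothetical mini-lexicon mapping emojis/emoticons to plain-text glosses;
# the actual approach would rely on a complete, language-appropriate table.
GLOSSES = {
    "😀": "happy face",
    "😢": "crying face",
    ":)": "smile",
    ":(": "sad",
}

def normalize_tweet(text: str) -> str:
    """Replace emojis and emoticons with plain-text glosses."""
    for symbol, gloss in GLOSSES.items():
        text = text.replace(symbol, f" {gloss} ")
    # collapse the extra whitespace introduced by the substitutions
    return re.sub(r"\s+", " ", text).strip()

normalize_tweet("Great match 😀 :)")  # → "Great match happy face smile"
```

Because the substitution table is the only language-dependent part, porting the step to another language amounts to swapping in a different gloss list.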
The COrona VIrus Disease 19 (COVID-19) pandemic required the work of experts worldwide to tackle it. Despite the abundance of new studies, privacy laws prevent their dissemination for medical investigations: through clinical de-identification, the Protected Health Information (PHI) contained therein can be anonymized so that medical records can be shared and published. The automation of clinical de-identification through deep learning techniques has proven less effective for languages other than English due to the scarcity of data sets. Hence, a new Italian de-identification data set has been created from the COVID-19 clinical records made available by the Italian Society of Radiology (SIRM). Two multi-lingual deep learning systems have then been developed for this low-resource scenario: the objective is to investigate their ability to transfer knowledge between different languages while maintaining the features necessary to correctly perform the Named Entity Recognition task for de-identification. The systems were trained using four different strategies, using both the English Informatics for Integrating Biology & the Bedside (i2b2) 2014 data set and the new Italian SIRM COVID-19 data set, then evaluated on the latter. These approaches demonstrate the effectiveness of cross-lingual transfer learning in de-identifying medical records written in a low-resource language such as Italian by leveraging a high-resource one such as English.
•Comparison of multilingual deep learning systems for clinical de-identification.•Proposal and testing of 4 possible training approaches for low-resource languages.•Construction of a new annotated Italian dataset from public COVID-19 medical records.
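The abstract mentions four training strategies over the English i2b2 2014 and Italian SIRM data sets but does not enumerate them, so the four shown below are illustrative guesses (English-only, Italian-only, mixed, and sequential English-then-Italian), expressed simply as data compositions; all names and placeholder records are invented.

```python
# Placeholder corpora standing in for the real annotated data sets.
i2b2_en = ["en_record_1", "en_record_2"]   # English i2b2 2014 (high-resource)
sirm_it = ["it_record_1"]                  # Italian SIRM COVID-19 (low-resource)

# Hypothetical reading of the four strategies, each a
# (pre-fine-tuning set, fine-tuning set) pair; evaluation is
# always on the Italian test split.
strategies = {
    "english_only":      ([], i2b2_en),            # zero-shot cross-lingual
    "italian_only":      ([], sirm_it),            # low-resource baseline
    "mixed":             ([], i2b2_en + sirm_it),  # joint training
    "transfer_en_to_it": (i2b2_en, sirm_it),       # sequential transfer
}
```

Framing the strategies this way makes the comparison purely about data composition, with the multilingual model architecture held fixed.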
Clinical de-identification aims to identify Protected Health Information in clinical data, enabling data sharing and publication. The first automatic de-identification systems were based on rules or on machine learning methods, limited by language changes, lack of context awareness and time-consuming feature engineering. Newer deep learning techniques for sequence labeling have shown better results, reducing feature engineering effort thanks to word representation techniques in vector space. However, they are not able to jointly represent the polysemic and context-dependent nature of words, as well as the morpho-syntactic mutations characteristic of handwritten notes. To address these limitations, a new de-identification approach based on deep learning techniques for Named Entity Recognition has been proposed, whose key factors are: (i) a Bidirectional Long Short-Term Memory + Conditional Random Field architecture for sequence labeling that takes advantage of the widest possible representation context; (ii) a contextualized language model, working at character level, to capture the polysemy of words and manage the morpho-syntactic variations typical of handwritten notes; (iii) multiple stacked word representations to better capture latent syntactic and semantic similarities. This approach has been tested on the official Informatics for Integrating Biology & the Bedside 2014 de-identification dataset, showing similar or higher performance than the state of the art with respect to category and binary recognition, but without any feature engineering or handcrafted rules. The experiments demonstrate the effectiveness of the proposed approach, in particular with regard to category-level recognition, which is essential to correctly replace entities with surrogates for anonymization purposes.
•De-identify entities belonging to various classes in unstructured medical records.•Stack embeddings and extend the context to boost Bi-LSTM+CRF systems' performance.•Establish a new state of the art in the classification of entities at category level.
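In the Bi-LSTM + CRF architecture above, the Bi-LSTM produces per-token emission scores and the CRF adds tag-transition scores; the most likely tag sequence is then recovered by Viterbi decoding. The following is a minimal NumPy sketch of just that decoding step, not the paper's implementation; the function name and tiny score matrices are invented for illustration.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Most likely tag sequence given per-token emission scores
    (seq_len x n_tags, e.g. from a Bi-LSTM) and tag-transition
    scores (n_tags x n_tags): the CRF decoding step."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()               # best score ending in each tag
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[t, j]:
        # best path ending in tag j at step t, coming from tag i
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

The transition matrix is what lets the CRF forbid implausible tag sequences (e.g. an "inside" tag with no preceding "begin" tag), which per-token classification alone cannot do.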
Over the years, the attention of the scientific world towards sentiment analysis techniques has increased considerably, driven by industry. The arrival of the Google BERT language model confirmed the superiority of models based on a particular artificial neural network structure called the Transformer, from which many variants have been derived. These models are generally pre-trained on large text corpora and only later specialized, on much smaller amounts of data, for the precise task to be faced. For these reasons, countless versions have been developed to meet the specific needs of each language, especially for languages with relatively few datasets available. At the same time, models pre-trained for multiple languages have become widespread, providing greater flexibility of use in exchange for lower performance. This study shows how transfer learning from high-resource to low-resource languages provides an important performance increase: a multilingual BERT model fine-tuned on a mixed English/Italian dataset (using a literature dataset for English and, for Italian, a reviews dataset created ad hoc from the well-known platform TripAdvisor) provides much higher performance than models specific to Italian. Overall, the results obtained by comparing the different possible approaches indicate which one is the most promising for obtaining the best results in low-resource scenarios.
•Use transfer learning from high- to low-resource languages to identify fake reviews.•Propose a novel ad hoc created dataset of TripAdvisor reviews in Italian.•Identify the best fine-tuning approach to improve the performance of the BERT language model.
In recent years, the need to de-identify privacy-sensitive information within Electronic Health Records (EHRs) has become increasingly felt and extremely relevant to encourage the sharing and publication of their content in accordance with the restrictions imposed by both national and supranational privacy authorities. In the field of Natural Language Processing (NLP), several deep learning techniques for Named Entity Recognition (NER) have been applied to this issue, significantly improving the effectiveness of identifying sensitive information in EHRs written in English. However, the lack of data sets in other languages has strongly limited their applicability and performance evaluation. To this end, a new de-identification data set in Italian has been developed in this work, starting from the 115 COVID-19 EHRs provided by the Italian Society of Radiology (SIRM): 65 were used for training and development, and the remaining 50 for testing. The data set was labelled following the guidelines of the i2b2 2014 de-identification track. As an additional contribution, a stacked word representation, not yet tested in the Italian clinical de-identification scenario, was combined with the best-performing Bi-LSTM + CRF sequence labeling architecture; it is based both on a contextualized language model, to manage word polysemy and its morpho-syntactic variations, and on sub-word embeddings, to better capture latent syntactic and semantic similarities. Finally, other cutting-edge approaches were compared with the proposed model, which achieved the best performance, confirming the soundness of the approach.
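Labelling a data set "following the guidelines of the i2b2 2014 de-identification track" means marking PHI entity spans over tokens, commonly encoded with BIO tags for sequence labeling. A minimal sketch of that encoding step follows; the function name, tokens, and the example span are invented for illustration.

```python
def to_bio(tokens, entities):
    """Convert (start, end, label) token-index spans (end exclusive)
    into BIO tags, the usual encoding for NER sequence labeling."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"           # Begin of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # Inside the entity
    return tags

# Hypothetical example: a patient name span over tokens 1..2
to_bio(["Patient", "Mario", "Rossi", "admitted"], [(1, 3, "NAME")])
# → ["O", "B-NAME", "I-NAME", "O"]
```

Keeping the category label in each tag (rather than a bare PHI/non-PHI flag) is what later allows entities to be replaced with category-appropriate surrogates.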
The growth of the online review phenomenon, which has expanded from specialised trade magazines to end users via online platforms, has increasingly involved the cultural heritage of countries, a source of tourism and a growth driver of local economies. Unfortunately, this has been paralleled by the emergence and spread of fake reviews, against which the scientific world has developed language models capable of distinguishing them from truthful ones. The application of such models, often based on deep neural networks with Transformer-type architectures, is however limited by the availability of local-language data sets for specific domains, useful for both training and verification. The purpose of this article is twofold. Firstly, a new data set was created in Italian, a language generally considered low-resource, relating to the domain of cultural heritage in Italy, by collecting reviews available online and reorganising them in a form usable by language models. Secondly, a baseline of results for the detection of misleading reviews was constructed by exploiting two widely used language models, namely BERT and ELECTRA. The performance achieved is interesting, around 95% accuracy and F1 score, using train/test splits of 80/20 and 90/10. In addition, SHAP was used as a tool to support the explainability of the AI models: in this way, it was possible to show the usefulness of sentiment analysis as a support for the recognition of deceptiveness.
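The baseline protocol above (shuffled 80/20 or 90/10 split, then accuracy and F1 on the held-out portion) can be sketched with standard-library code. This is only an illustration of the evaluation procedure, not the paper's code: the function names and the positive-class label `"fake"` are invented.

```python
import random

def split(data, test_frac=0.2, seed=0):
    """Shuffled train/test split, e.g. 80/20 (test_frac=0.2)
    or 90/10 (test_frac=0.1)."""
    data = data[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_frac))
    return data[:cut], data[cut:]

def accuracy_f1(y_true, y_pred, positive="fake"):
    """Accuracy over all examples and F1 for the positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1
```

Reporting both metrics matters because accuracy alone can look good on an imbalanced corpus while the F1 of the deceptive class stays low.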
Recent evolutions in the e-commerce market have led consumers to attribute increasing importance to third-party product reviews before proceeding to purchase. The industry, in order to improve its offerings by intercepting consumer discontent, has placed increasing attention on systems able to identify the sentiment expressed by buyers, whether positive or negative. From a technological point of view, the literature in recent years has seen the development of two types of methodologies: those based on lexicons and those based on machine and deep learning techniques. This study proposes a comparison between these technologies in the Italian market, one of the largest in the world, exploiting an ad hoc dataset: scientific evidence generally shows the superiority of language models such as BERT, built on deep neural networks, but it opens several considerations on the effectiveness and improvement of these solutions, compared to lexicon-based ones, in the presence of datasets of reduced size such as the one under study, a common condition for languages other than English or Chinese.
Nowadays, the spread of deceptive reviews has reached critical dimensions, with a significant economic impact on business activities. This paper aims to estimate, at both the quantitative and qualitative levels, the possibility of using particular words to disambiguate between truthful and deceptive text, focusing on reviews produced in the cultural heritage domain. For this purpose, a lexicon-based methodology has been applied using two different lexicons providing sentiment information, intensifiers, downtoners, and negation operators. As known in the literature, these elements are crucial in a classification process related to deceptiveness. The evaluation phase considered quantitative metrics such as accuracy and F1 score, as well as ad hoc developed metrics that account for specific linguistic parameters such as polarity and tone-of-voice intensifiers. A qualitative analysis of a subset of the corpus was also carried out to better understand the factors that impact the classification of deceptive reviews. Several linguistic features have been considered, ranging from the number of intensifiers to their type and position in phrases and sentences. A comparison between the performance of the two lexicons used has been added to the analysis.
•Lexicon-based approach for classifying Italian fake cultural heritage reviews.•A methodology based on intensifiers and downtoners from sentiment analysis.•New Italian cultural heritage corpus for deceptive review classification.•Comparison of manual and automatically created lexicons.•New metric considering polarity and linguistic features.
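A lexicon-based scorer of the kind described, combining polarity words with intensifiers, downtoners, and negation operators, can be sketched as below. The toy lexicon entries (in English, for readability) and the scoring rules are illustrative assumptions, not the paper's actual Italian lexicons or metric.

```python
# Toy lexicons; the real methodology uses two full Italian lexicons.
POLARITY = {"good": 1.0, "bad": -1.0, "wonderful": 2.0}
INTENSIFIERS = {"very": 1.5, "really": 1.5}
DOWNTONERS = {"slightly": 0.5, "somewhat": 0.5}
NEGATIONS = {"not", "never"}

def score(tokens):
    """Sum polarity scores, letting intensifiers/downtoners scale
    and negations flip the next polarized word."""
    total, weight, flip = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in INTENSIFIERS:
            weight *= INTENSIFIERS[tok]
        elif tok in DOWNTONERS:
            weight *= DOWNTONERS[tok]
        elif tok in NEGATIONS:
            flip = -1.0
        elif tok in POLARITY:
            total += flip * weight * POLARITY[tok]
            weight, flip = 1.0, 1.0   # modifiers apply to one word only
    return total

score(["very", "good"])   # → 1.5
score(["not", "bad"])     # → 1.0
```

Counting how often, and where, such modifiers fire in a review is exactly the kind of linguistic feature the qualitative analysis inspects.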
Today, reviews are the advertising medium par excellence through which companies influence customers' spending decisions. Although the initial purpose of reviews was to provide companies with a feedback tool to improve products and services based on customer needs, they soon became a way to climb the sales rankings, often illegally. In fact, deceptive and fake reviews have managed to evade the often non-existent validation mechanisms of online platforms, giving rise to a new business. To combat this phenomenon, several classification methods have been developed to train automated tools in the arduous task of distinguishing between genuine and misleading reviews, the most recent based on machine and deep learning techniques. This paper proposes a multi-label classification methodology based on the Google BERT neural language model to build a deceptive review detector aided by sentiment awareness: modeling the link between sentiment polarity and deceptiveness during the fine-tuning phase, by exploiting the Binary Cross Entropy with Logits loss function, adds to the advantages provided by pre-trained contextual models, which capture word polysemy through word embeddings and benefit from pre-training on huge corpora. Tests were performed on the Deceptive Opinion Spam Corpus and Yelp New York City datasets, providing a quantitative and qualitative analysis of the results which, when compared with the state of the art available in the literature, showed an encouraging increase in performance.
•Distinguish between deceptive and genuine reviews written by unknown users online.•Combine sentiment polarity and deceptiveness labels via multi-label classification.•State-of-the-art results on the Deceptive Opinion Spam and Yelp corpora via the BERT language model.
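The multi-label setup above attaches two independent binary targets to each review (e.g. deceptiveness and sentiment polarity) and trains with Binary Cross Entropy with Logits. The loss itself can be sketched in NumPy using the numerically stable form that frameworks such as PyTorch implement; the function name and the example logits/targets are invented here.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy with logits, averaged
    over all labels: max(x, 0) - x*z + log(1 + exp(-|x|)).
    Each label is an independent sigmoid, so a review can be
    e.g. both 'deceptive' and 'positive-sentiment' at once."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(targets, dtype=float)
    return float(np.mean(np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))))

# Hypothetical batch of one review with labels [deceptive, positive]:
bce_with_logits([[2.0, -1.0]], [[1.0, 0.0]])
```

Using independent sigmoids instead of a single softmax is the design choice that lets the model learn the correlation between polarity and deceptiveness without forcing the two labels to compete.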
The paper proposes a methodology based on Natural Language Processing (NLP) and Sentiment Analysis (SA) to gain insights into sentiments and opinions toward COVID-19 vaccination in Italy. The studied dataset consists of vaccine-related tweets published in Italy from January 2021 to February 2022. In the considered period, 353,217 tweets were analyzed, obtained after filtering 1,602,940 tweets containing the word stem “vaccin”. A main novelty of the approach is the categorization of opinion holders into four classes, Common users, Media, Medicine and Politics, obtained by applying NLP tools, enhanced with large-scale domain-specific lexicons, to the short bios published by the users themselves. Feature-based sentiment analysis is enriched with an Italian sentiment lexicon containing polarized words, expressing semantic orientation, and intensive words, which give cues to identify the tone of voice of each user category. The results of the analysis highlight an overall negative sentiment throughout the considered 14 months, especially for Common users, and different attitudes of opinion holders towards specific important events, such as deaths after vaccination, occurring on certain days of the examined period.
•A large-scale study exploiting tweets with reference to four different stakeholders.•A methodology to mine sentiments and opinions toward COVID-19 vaccination in Italy.•Sentiment analysis evolution over time, from the start of vaccination to its completion.•Focus on specific events of the vaccination campaign.•Results highlight an overall negative sentiment, especially for Common users.•Results show different attitudes of opinion holders towards specific key events.
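Two preprocessing steps of the study above, keeping only tweets containing the stem "vaccin" and assigning each author to one of the four opinion-holder classes from their short bio, can be sketched as below. The keyword lists and function names are illustrative assumptions: the paper uses large-scale domain-specific lexicons, not these few English words.

```python
# Illustrative keyword lists only; the actual approach relies on
# large domain-specific (Italian) lexicons applied to user bios.
CATEGORY_KEYWORDS = {
    "Media": {"journalist", "news", "tv"},
    "Medicine": {"doctor", "nurse", "md"},
    "Politics": {"senator", "mayor", "party"},
}

def holder_category(bio):
    """Assign an opinion-holder class from a user's short bio;
    anyone unmatched falls back to 'Common users'."""
    words = set(bio.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "Common users"

def select_vaccine_tweets(tweets):
    """Keep only tweets mentioning the stem 'vaccin'
    (matches 'vaccine', 'vaccino', 'vaccination', ...)."""
    return [t for t in tweets if "vaccin" in t.lower()]
```

Filtering on the stem rather than a full word is what lets a single pattern cover the inflected Italian forms in one pass.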