Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text categories defined by the author’s purpose, the common function of the text, and the text’s conventional form. Obtaining genre information has been shown to be beneficial for a wide range of disciplines, including linguistics, corpus linguistics, computational linguistics, natural language processing, information retrieval and information security. Consequently, in the past 20 years, numerous researchers have collected genre datasets with the aim of developing an efficient genre classifier. However, their approaches to the definition of genre schemata, data collection and manual annotation vary substantially, resulting in significantly different datasets. As most AGI experiments are dataset-dependent, a sufficient understanding of the differences between the available genre datasets is of great importance for researchers venturing into this area. In this paper, we present a detailed overview of different approaches to each of the steps of the AGI task, from the definition of the genre concept and the genre schema, to the dataset collection and annotation methods, and, finally, to machine learning strategies. Special focus is dedicated to the description of the most relevant genre schemata and datasets, and details on the availability of all of the datasets are provided. In addition, the paper presents the recent advances in machine learning approaches to automatic genre identification, and concludes by proposing directions towards developing a stable multilingual genre classifier.
We address the challenging problem of identifying the main sources of hate speech on Twitter. On one hand, we carefully annotate a large set of tweets for hate speech, and deploy advanced deep learning to produce high-quality hate speech classification models. On the other hand, we create retweet networks, detect communities and monitor their evolution through time. This combined approach is applied to three years of Slovenian Twitter data. We report a number of interesting results. Hate speech is dominated by offensive tweets, related to political and ideological issues. The share of unacceptable tweets is moderately increasing with time, from the initial 20% to 30% by the end of 2020. Unacceptable tweets are retweeted significantly more often than acceptable tweets. About 60% of unacceptable tweets are produced by a single right-wing community of only moderate size. Institutional Twitter accounts and media accounts post significantly fewer unacceptable tweets than individual accounts. In fact, the main sources of unacceptable tweets are anonymous accounts, and accounts that were suspended or closed during the years 2018-2020.
Communities in social networks often reflect close social ties between their members and their evolution through time. We propose an approach that tracks two aspects of community evolution in retweet networks: the flow of members into, out of, and between communities, and their influence. We start with high-resolution time windows, and then select several timepoints which exhibit large differences between the communities. For community detection, we propose a two-stage approach. In the first stage, we apply an enhanced Louvain algorithm, called Ensemble Louvain, to find stable communities. In the second stage, we form influence links between these communities, and identify linked super-communities. For the detected communities, we compute internal and external influence, and for individual users, the retweet h-index influence. We apply the proposed approach to three years of Twitter data covering all Slovenian tweets. The analysis shows that the Slovenian tweetosphere is dominated by politics, that the left-leaning communities are larger, but that the right-leaning communities and users exhibit significantly higher impact. An interesting observation is that retweet networks change relatively gradually, despite such events as the emergence of the Covid-19 pandemic or the change of government.
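The retweet h-index mentioned above can be understood by analogy with the citation h-index: a user has h-index h if h of their tweets each received at least h retweets. A minimal sketch (the function name and toy data are illustrative, not taken from the paper):

```python
def retweet_h_index(retweet_counts):
    """Largest h such that at least h of the user's tweets
    each received >= h retweets."""
    counts = sorted(retweet_counts, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# A user whose tweets were retweeted 10, 8, 5, 4 and 3 times
# has a retweet h-index of 4.
print(retweet_h_index([10, 8, 5, 4, 3]))
```

Like the citation h-index, this measure rewards sustained influence over one-off viral tweets, which makes it a natural per-user influence score in retweet networks.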
Automatic term extraction (ATE) is a natural language processing task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we treat ATE as a sequence-labeling task and explore the efficacy of XLMR in evaluating cross-lingual and multilingual learning against monolingual learning in the cross-domain ATE context. Additionally, we introduce NOBI, a novel annotation mechanism enabling the labeling of single-word nested terms. Our experiments are conducted on the ACTER corpus, encompassing four domains and three languages (English, French, and Dutch), as well as the RSDO5 Slovenian corpus, encompassing four additional domains. Results indicate that cross-lingual and multilingual models outperform monolingual settings, showcasing improved F1-scores for all languages within the ACTER dataset. When incorporating an additional Slovenian corpus into the training set, the multilingual model exhibits superior performance compared to state-of-the-art approaches in specific scenarios. Moreover, the newly introduced NOBI labeling mechanism significantly enhances the classifier’s capacity to extract short nested terms, leading to substantial improvements in Recall for the ACTER dataset and consequently boosting the overall F1-score performance.
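For background, sequence-labeling ATE assigns each token a tag and then decodes tag sequences into term spans. The sketch below shows plain BIO decoding, which NOBI extends with an extra label for single-word terms nested inside longer ones; the exact NOBI semantics are defined in the paper, and the function name and toy data here are ours:

```python
def bio_to_terms(tokens, tags):
    """Decode flat BIO tags into term strings (no nesting)."""
    terms, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":                      # a new term starts here
            if current:
                terms.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:        # continue the open term
            current.append(tok)
        else:                               # "O", or a stray "I"
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

tokens = ["automatic", "term", "extraction", "eases", "annotation"]
tags   = ["B", "I", "I", "O", "B"]
# → ["automatic term extraction", "annotation"]
```

Note that flat BIO cannot mark "term" as a term of its own inside "automatic term extraction"; that limitation is precisely what motivates a nested-term scheme such as NOBI.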
“Deep lexicography” – Fad or Opportunity? Ljubešić, Nikola
Rasprave Instituta za hrvatski jezik i jezikoslovlje, 01/2020, Vol. 46, No. 2
Journal article, peer-reviewed, open access
In recent years, we have been witnessing staggering improvements in various semantic data processing tasks due to developments in the area of deep learning, ranging from image and video processing to speech processing and natural language understanding. In this paper, we discuss the opportunities and challenges that these developments pose for the area of electronic lexicography. We primarily focus on the concept of representation learning of the basic elements of language, namely words, and the applicability of these word representations to lexicography. We first discuss well-known approaches to learning static representations of words, the so-called word embeddings, and their usage in lexicography-related tasks such as semantic shift detection, and cross-lingual prediction of lexical features such as concreteness and imageability. We wrap up the paper with the most recent developments in the area of word representation learning in the form of learning dynamic, context-aware representations of words, showcasing some dynamic word embedding examples, and discussing improvements on the lexicography-relevant tasks of word sense disambiguation and word sense induction.
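Semantic shift detection with static embeddings is commonly operationalised as cosine distance between a word's vectors trained on corpora from different periods. A minimal sketch with hand-made toy vectors (the vectors and the period labels are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings" of the same word, trained on two
# time-sliced corpora; a high cosine distance suggests
# the word's dominant usage has shifted.
v_1990 = [0.9, 0.1, 0.0]
v_2020 = [0.1, 0.9, 0.2]
shift_score = 1 - cosine(v_1990, v_2020)
```

In practice the two embedding spaces must first be aligned (e.g. with an orthogonal Procrustes mapping) before vectors from different periods are comparable; that step is omitted here.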
In recent years we have witnessed great progress in various intelligent data-processing tasks as a consequence of developments in deep learning. These tasks include image, video and speech processing, as well as language understanding. This paper discusses the opportunities and challenges that this progress opens up in the field of digital lexicography. The larger part of the paper concerns learning representations of various elements of language – words, lexemes and utterances – and their application in lexicography. Well-known procedures for learning static vector representations of words are presented, along with their application in tasks such as semantic shift detection and the prediction of lexical features of words. The paper further discusses the multilingual level of word representation learning, and concludes with an overview of recent achievements in machine language understanding – dynamic, contextual representations of words.
Massive text collections are the backbone of large language models, the main ingredient of the current significant progress in artificial intelligence. However, as these collections are mostly collected using automatic methods, researchers have few insights into what types of texts they consist of. Automatic genre identification is a text classification task that enriches texts with genre labels, such as promotional and legal, providing meaningful insights into the composition of these large text collections. In this paper, we evaluate machine learning approaches for the genre identification task based on their generalizability across different datasets, to assess which model is the most suitable for the downstream task of enriching large web corpora with genre information. We train and test multiple fine-tuned BERT-like Transformer-based models and show that merging different genre-annotated datasets yields superior results. Moreover, we explore the zero-shot capabilities of large GPT Transformer models in this task and discuss the advantages and disadvantages of the zero-shot approach. We also publish the best-performing fine-tuned model that enables automatic genre annotation in multiple languages. In addition, to promote further research in this area, we plan to share, upon request, a new benchmark for automatic genre annotation, ensuring that it remains unexposed to the latest large language models.
We discuss the added value of various approaches for identifying similarities in social network communities based on the content they produce. We show the limitations of observing communities using topology alone, and illustrate the benefits and complementarity of including supplementary data when analyzing social networks. As a case study, we analyze the reactions of the Ex-Yugoslavian retweet communities to the Russian invasion of Ukraine, comparing topological inter-community interaction with their content-based similarity (hashtags, news sources, topics and sentiment). The findings indicate that despite the Ex-Yugoslavian countries sharing a common macro-language, their retweet communities exhibit diverse responses to the invasion. Certain communities exhibit a notable level of content-based similarity, although their topological similarity remains relatively low. On the other hand, there are communities that display high similarity in specific types of content, but demonstrate less similarity when considering other aspects. For example, we identify a strong echo-chamber community linked to the Serbian government that deliberately avoids the invasion topic, despite showing news-source similarities with other communities highly active on the subject. In summary, our study highlights the importance of employing multifaceted approaches to analyzing community similarities, as they enable a more comprehensive understanding of social media discourse. This approach extends beyond the confines of our specific case study, presenting opportunities to gain valuable insights into complex social events across various contexts.
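A simple instance of the content-based similarity used to compare communities is the Jaccard overlap of the hashtag (or news-source) sets each community posts. A sketch with toy data (the hashtags and community names are invented for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

community_a = {"#ukraine", "#sanctions", "#nato"}
community_b = {"#ukraine", "#energy", "#nato", "#inflation"}
# 2 shared hashtags out of 5 distinct → similarity 0.4
similarity = jaccard(community_a, community_b)
```

Computing this score pairwise over several content types (hashtags, news sources, topics) yields the kind of multifaceted similarity profile the study compares against purely topological interaction.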
The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user generated content. The paper first describes the 200 million word Janes corpus, containing tweets, forum posts, news comments, user and talk pages from Wikipedia, and blogs and blog comments, where each text is accompanied by rich metadata. The developed processing tools for Slovene user generated content are presented next, which include a tokeniser, word-normaliser, part-of-speech tagger and lemmatiser, and a named entity recogniser. A set of manually annotated datasets was also produced, both for tool training as well as for linguistic research. The developed resources and tools are made publicly available under Creative Commons licences in the repository of the CLARIN.SI research infrastructure and on GitHub, while the corpora are also available through the CLARIN.SI concordancers.
Psycholinguistic databases containing ratings of concreteness, imageability, age of acquisition, and subjective frequency are used in psycholinguistic and neurolinguistic studies which require words as stimuli. Linguistic characteristics (e.g. word length, corpus frequency) are frequently coded, but word class is seldom systematically treated, although there are indications of its significance for imageability and concreteness. This paper presents the Croatian Psycholinguistic Database (CPD; available at:
https://doi.org/10.17234/megahr.2019.hpb
), containing 6000 Croatian nouns, verbs, adjectives and adverbs, rated for concreteness, imageability, age of acquisition, and subjective frequency. Moreover, we present computationally obtained extrapolations of concreteness and imageability to the remainder of the Croatian lexicon (available at:
https://github.com/megahr/lexicon/blob/master/predictions/hr_c_i.predictions.txt
). In the two studies presented here, we explore the significance of word class for concreteness and imageability in human and computationally obtained ratings. The observed correlations in the CPD indicate correspondences between psycholinguistic measures expected from the literature. Word classes exhibit differences in subjective frequency, age of acquisition, concreteness and imageability, with significant differences between nouns, verbs, adjectives and adverbs. In the computational study which focused on concreteness and imageability, concreteness obtained higher correlations with human ratings than imageability, and the system underpredicted the concreteness of nouns, and overpredicted the concreteness of adjectives and adverbs. Overall, this suggests that word class contains schematic conceptual and distributional information. Schematic conceptual content seems to be more significant in human ratings of concreteness and less significant in computationally obtained ratings, where distributional information seems to play a more significant role. This suggests that word class differences should be theoretically explored.