Anotacijska shema i njezina evaluacija [Annotation scheme and its evaluation]
Žitnik, Slavko; Dontcheva-Navratilova, Olga; Borowiak, Agnieszka, et al.
Rasprave Instituta za hrvatski jezik i jezikoslovlje, 2023, Volume 49, Issue 1
Journal Article, Paper
Peer-reviewed
Open access
The present paper focuses on the presentation and discussion of aspects of OFFENSIVE LANGUAGE linguistic annotation, including the creation, annotation practice, curation, and evaluation of an OFFENSIVE LANGUAGE annotation taxonomy scheme first proposed in Lewandowska-Tomaszczyk et al. (2021). An extended offensive-language ontology comprising 17 categories, structured into 4 hierarchical levels, has been shown to encode the defined offensive-language schema, trained with non-contextual word embeddings (Word2Vec and fastText) and eventually juxtaposed with data acquired through pairwise training and testing analysis for the existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al., submitted). The study reports on the annotation practice in WG 4.1.1 "Incivility in media and social media" in the context of COST Action CA 18209 "European network for Web-centred linguistic data science" (Nexus Linguarum), carried out with the INCEpTION tool (https://github.com/inception-project/inception), a semantic annotation platform offering assistance in the annotation process. The results partly support the proposed ontology of explicit offense and positive implicitness types, which provides more variance among widely recognized types of figurative language (e.g., metaphorical, metonymic, ironic). The use of the annotation system and the representation of the linguistic data were also evaluated through the annotators' comments, collected by means of a questionnaire and an open discussion. The annotation results and the questionnaire showed low to medium inter-annotator agreement for some of the categories, and annotators found it more challenging to distinguish between category items than between aspect items, with the category items "offensive", "insulting", and "abusive" being the most difficult in this respect. On the basis of these results, the need for taxonomic simplification has been recognized for further annotation practice.
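To make the agreement figures concrete, the sketch below computes Cohen's kappa for two annotators labelling the same items; the paper does not specify which agreement coefficient was used, and the annotation labels here are invented for illustration (they reuse three of the scheme's category items).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over six items (not the study's actual data).
ann1 = ["offensive", "insulting", "abusive", "offensive", "insulting", "offensive"]
ann2 = ["offensive", "abusive", "abusive", "offensive", "insulting", "insulting"]
print(round(cohens_kappa(ann1, ann2), 3))  # prints: 0.5
```

A kappa around 0.5 would conventionally be read as moderate agreement, which is the range the abstract describes for the hardest category items.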
This paper presents the process of annotating offensive language, including the development of a classification of such language, annotation practice, process management, and evaluation. The classification scheme was first proposed in Lewandowska-Tomaszczyk et al. (2021). The extended offensive-language ontology contains 17 categories arranged into four hierarchical levels, and thus represents an offensive-language schema that was trained with non-contextualized word embeddings such as Word2Vec and fastText, which were finally juxtaposed with data collected using pairwise and testing analysis for the existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al., under review). The paper reports on the annotation practice within working group WG 4.1.1 "Incivility in media and social media" of COST Action CA 18209 "European network for Web-centred linguistic data science" (Nexus Linguarum). The annotation was carried out in the INCEpTION tool (https://github.com/inception-project/inception), a semantic annotation platform with built-in support for such data processing. The obtained results support the proposed ontology of explicit and implicit offensive language, which allows greater variety among already recognized types of figurative language (e.g., metaphor, metonymy, irony). The use of the annotation system and the presentation of the linguistic data were also assessed through feedback from the annotators, collected through a questionnaire and an open discussion. Finally, a set of recommendations for future annotation practice was compiled.
Due to the numerous public information sources and services available, many methods for combining heterogeneous data have been proposed recently. However, general end-to-end solutions are still rare, especially systems that take different context dimensions into account. The existing techniques therefore often prove insufficient or are limited to a certain domain. In this paper we briefly review and rigorously evaluate a general framework for data matching and merging. The framework employs collective entity resolution and redundancy elimination using three dimensions of context types. In order to achieve domain-independent results, the data are enriched with semantics and trust. The main contribution of the paper, however, is the evaluation on five public, domain-incompatible datasets. Furthermore, we introduce additional attribute, relationship, semantic, and trust metrics, which allow complete framework management. Besides improving the overall results within the framework, these metrics may be of independent interest.
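The matching-and-merging idea can be illustrated with a minimal pairwise sketch: score two records by attribute similarity, and merge them if the score clears a threshold. This is a toy stand-in for the framework's collective entity resolution, not its actual algorithm; the records, threshold, and "prefer the longer value" merge rule are all invented for illustration.

```python
from difflib import SequenceMatcher

def attr_sim(a, b):
    """String similarity of two attribute values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(rec1, rec2, threshold=0.8):
    """Decide a match by averaging similarity over shared attributes."""
    shared = rec1.keys() & rec2.keys()
    score = sum(attr_sim(rec1[k], rec2[k]) for k in shared) / len(shared)
    return score >= threshold

def merge(rec1, rec2):
    """Merge two matched records, keeping the longer (assumed richer) value."""
    return {k: max((rec1.get(k, ""), rec2.get(k, "")), key=len)
            for k in rec1.keys() | rec2.keys()}

a = {"name": "John Smith", "city": "Ljubljana"}
b = {"name": "J. Smith", "city": "Ljubljana", "email": "js@example.org"}
if match(a, b, 0.7):
    print(merge(a, b))
```

A collective approach, as in the paper, would additionally propagate match decisions through record relationships rather than scoring each pair in isolation.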
In the past decade, social media have become an important part of our everyday life. The use of different social media changes the way we communicate, collaborate, gather information, and consequently perceive the world around us. Researchers from different fields therefore exploit social media to gain deeper insight into human behaviour. Each social medium has its own privacy policies and access to publicly available data. In this paper, we present a generic framework, along with the accompanying tools, for analysing different social media. The analysis covers basic usage statistics, reach and engagement differences, and language, sentiment, and gender identification for each social network's data. We compare data from Twitter, Facebook, Tumblr, Google+ and YouTube. The results reveal the specifics of each social medium, which to some extent also depend on the data available and the selected seed keywords. We find that the popularity of selected topics in social media is proportional to the number of hits on Google, that celebrities and politicians are the most talked-about topics, and that user behaviour differs across social media. For example, Twitter users prefer to post, while Facebook and YouTube users prefer to comment. The majority of all social media posts are in English, a larger share of them are negative, and they are often written by male users. The proposed framework should serve as a tool for identifying the appropriate source of data for representative analyses of social media.
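One per-post analysis of the kind such a framework plugs in is sentiment tagging. The sketch below is a deliberately minimal lexicon-based tagger, offered only to make the idea concrete; the paper's actual sentiment method is not specified here, and the word lists and posts are invented.

```python
# Tiny illustrative sentiment lexicons (not from the paper).
POS = {"great", "love", "good", "happy"}
NEG = {"bad", "hate", "awful", "sad"}

def sentiment(post):
    """Tag a post by counting positive vs negative lexicon hits."""
    words = {w.strip(".,!?").lower() for w in post.split()}
    score = len(words & POS) - len(words & NEG)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = ["I love this!", "Awful, just bad.", "Posted a photo."]
print([sentiment(p) for p in posts])  # prints: ['positive', 'negative', 'neutral']
```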
Natural language processing is used to solve a wide variety of problems. Some scholars and interest groups working with language resources are not well versed in programming, so there is a need for a good graphical framework that allows users to quickly design and test natural language processing pipelines without programming. The existing frameworks do not satisfy all the requirements for such a tool. We therefore propose a new framework that provides a simple way for its users to build language-processing pipelines. It also offers a simple, programming-language-agnostic way of adding new modules, which should encourage adoption by natural language processing developers and researchers. The main parts of the proposed framework are (a) a pluggable Docker-based architecture, (b) a general data model, and (c) API descriptions along with a graphical user interface. The proposed design is being used to implement a new natural language processing framework called ANGLEr.
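The pipeline-of-pluggable-modules design can be sketched as a shared data model passed through a chain of processors; in the proposed framework each module would run in its own Docker container behind an API, but the contract is the same. The class and field names below are illustrative, not ANGLEr's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """General data model: raw text plus named annotation layers."""
    text: str
    annotations: dict = field(default_factory=dict)  # layer name -> payload

class Tokenizer:
    """A pipeline module: consumes a Document, adds one annotation layer."""
    def process(self, doc: Document) -> Document:
        doc.annotations["tokens"] = doc.text.split()
        return doc

class Lowercaser:
    """A second module that builds on the previous module's layer."""
    def process(self, doc: Document) -> Document:
        doc.annotations["lower"] = [t.lower() for t in doc.annotations["tokens"]]
        return doc

def run_pipeline(modules, doc):
    for m in modules:
        doc = m.process(doc)
    return doc

doc = run_pipeline([Tokenizer(), Lowercaser()], Document("Hello NLP World"))
print(doc.annotations["lower"])  # prints: ['hello', 'nlp', 'world']
```

Because modules only agree on the data model, a new one can be written in any language and dropped in, which is the language-agnostic extension point the abstract describes.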
Nowadays, many tools and applications are available that use novel machine learning algorithms, but only a few tools bridge the gap between computer scientists and the general public. The latter holds especially for the field of natural language processing. In this paper we therefore present a new information extraction toolkit, called nutIE, that offers a frontend for end-to-end text analysis via a modern JavaScript-based web application. The core of the toolkit includes all the high-level information extraction techniques together with supporting preprocessing and evaluation methods. All of these can be accessed through a Scala programmatic API or by an arbitrary third-party application via a REST interface. It has already been shown that the integrated toolkit's algorithms achieve state-of-the-art results. The toolkit is currently used in international text-processing projects and courses.
The basic indicators of a researcher's productivity and impact are still the number of publications and their citation counts. These metrics are clear, straightforward, and easy to obtain. When a ranking of scholars is needed, for instance in grant, award, or promotion procedures, using them is the fastest and cheapest way of prioritizing some scientists over others. Due to their nature, however, there is a danger of oversimplifying scientific achievements. Many other indicators have therefore been proposed, including the PageRank algorithm, known from the ranking of web pages, and its modifications suited to citation networks. Nevertheless, this recursive method is computationally expensive, and even though it has the advantage of favouring prestige over popularity, its application should be well justified, particularly when compared with standard citation counts. In this study, we analyse three large datasets of computer science papers in the categories of artificial intelligence, software engineering, and theory and methods, and apply 12 different ranking methods to the citation networks of authors. We compare the resulting rankings with self-compiled lists of outstanding researchers, selected as frequent editorial board members of prestigious journals in the field, and conclude that there is no evidence of PageRank-based methods outperforming simple citation counts.
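The comparison at the heart of the study can be sketched on a toy author citation graph: compute plain citation counts and a basic PageRank, then compare who each method ranks first. The graph and damping factor below are illustrative, not the study's data or its 12 methods.

```python
def pagerank(graph, d=0.85, iters=50):
    """Basic power-iteration PageRank; edge u -> v means u cites v."""
    nodes = set(graph) | {v for tgts in graph.values() for v in tgts}
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for u, tgts in graph.items():
            if tgts:
                share = d * rank[u] / len(tgts)  # u's rank spread over its citations
                for v in tgts:
                    new[v] += share
        rank = new
    return rank

# Toy author citation network (invented).
cites = {"A": ["B"], "C": ["B"], "D": ["B", "E"], "E": ["B"]}
counts = {n: sum(n in t for t in cites.values()) for n in {"A", "B", "C", "D", "E"}}
pr = pagerank(cites)
print(max(counts, key=counts.get), max(pr, key=pr.get))  # prints: B B
```

On this small example both indicators single out the same author, which mirrors the study's finding that PageRank-based rankings showed no evidence of outperforming simple citation counts.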