Peer-reviewed, Open access
  • Anotacijska shema i njezina...
    Žitnik, Slavko; Dontcheva-Navratilova, Olga; Borowiak, Agnieszka; Despot, Kristina; Mitrović, Jelena; Liebeskind, Chaya; Valunaite Oleskevicienė, Giedre; Bączkowska, Anna; Wilson, Paul A.; Trojszczak, Marcin; Brač, Ivana; Filipić, Lobel; Ostroški Anić, Ana; Lewandowska-Tomaszczyk, Barbara

    Rasprave Instituta za hrvatski jezik i jezikoslovlje, 2023, Volume: 49, Issue: 1
    Journal Article, Paper

    The present paper presents and discusses aspects of the linguistic annotation of OFFENSIVE LANGUAGE, including the creation, annotation practice, curation, and evaluation of an OFFENSIVE LANGUAGE annotation taxonomy scheme, which was first proposed in Lewandowska-Tomaszczyk et al. (2021). An extended offensive language ontology comprising 17 categories, structured in 4 hierarchical levels, has been shown to represent the encoding of the defined offensive language schema, trained with non-contextual word embeddings (i.e., Word2Vec and FastText), and eventually juxtaposed with the data acquired from a pairwise training and testing analysis for existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al. submitted). The study reports on the annotation practice in WG 4.1.1. Incivility in media and social media in the context of COST Action CA 18209 European network for Web-centred linguistic data science (Nexus Linguarum) with the INCEpTION tool (https://github.com/inception-project/inception), a semantic annotation platform offering assistance in the annotation. The results partly support the proposed ontology of explicit offense and positive implicitness types to provide more variance among widely recognized types of figurative language (e.g., metaphorical, metonymic, ironic, etc.). The use of the annotation system and the representation of linguistic data were also evaluated through the annotators' comments, collected by means of a questionnaire and an open discussion. The annotation results and the questionnaire showed that some of the categories had low or medium inter-annotator agreement, and that annotators found it more challenging to distinguish between category items than between aspect items, with the category items offensive, insulting and abusive being the most difficult in this respect.
On the basis of these results, the need for taxonomic simplification measures in further annotation practices has been recognized.
This paper presents the process of annotating offensive language, which comprises the creation of a classification of such language, annotation practice, curation of the process, and evaluation. The classification scheme was first proposed in Lewandowska-Tomaszczyk et al. (2021). The extended offensive language ontology contains 17 categories arranged in four hierarchical levels and thus represents an offensive language schema trained with non-contextual word embeddings such as Word2Vec and FastText, which were eventually juxtaposed with data obtained by using a pairwise analysis and a testing analysis for existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al., under review). The paper reports on the annotation practice within working group WG 4.1.1. Incivility in media and social media of COST Action CA 18209 European network for Web-centred linguistic data science (Nexus Linguarum). Annotation was carried out in the INCEpTION tool (https://github.com/inception-project/inception), a semantic annotation platform with built-in tools for this kind of data processing. The results support the proposed ontology of explicit and implicit offensive language, which allows greater variety among already recognized types of figurative language (for example metaphor, metonymy, irony, etc.). The use of the annotation system and the representation of linguistic data were also evaluated in feedback comments provided by the annotators. These comments were collected by means of a questionnaire and an open discussion. Finally, a set of recommendations for future annotation practices was compiled.