  • Offensive language detection...
    Subramanian, Malliga; Ponnusamy, Rahul; Benhur, Sean; Shanmugavadivel, Kogilavani; Ganesan, Adhithiya; Ravi, Deepti; Shanmugasundaram, Gowtham Krishnan; Priyadharshini, Ruba; Chakravarthi, Bharathi Raja

    Computer Speech & Language, Volume 76, November 2022
    Journal Article

    Over the past few years, researchers have focused on identifying offensive language on social networks. In places where English is not the primary language, social media users tend to post and comment in a code-mixed form of text. This poses various difficulties in identifying offensive texts, and when combined with the limited resources available for languages such as Tamil, the task becomes considerably more challenging. This study conducts multiple experiments to detect potentially offensive texts in YouTube comments, made available through the HASOC-Offensive Language Identification track in Dravidian Code-Mix FIRE 2021 (https://competitions.codalab.org/competitions/31146). To detect the offensive texts, models based on traditional machine learning techniques, namely Bernoulli Naïve Bayes, Support Vector Machine, Logistic Regression, and K-Nearest Neighbor, were built. In addition, pre-trained multilingual transformer-based natural language processing models such as mBERT, MuRIL (Base and Large), and XLM-RoBERTa (Base and Large) were evaluated, both with full fine-tuning and with adapters. In essence, adapters and fine-tuning accomplish the same goal, but adapters work by inserting small trainable layers into the pre-trained model while keeping the pre-trained weights frozen (a minimal sketch of this idea follows the record). This study shows that transformer-based models outperform machine learning approaches. Furthermore, in low-resource languages such as Tamil, adapter-based techniques surpass fine-tuned models in terms of both training time and efficiency. Of all the adapter-based approaches, XLM-RoBERTa (Large) achieved the highest accuracy, 88.5%. The study also demonstrates that, compared to fine-tuning, the adapter models require training far fewer parameters. In addition, the experiments revealed that the proposed models performed notably well on a cross-domain data set.

    Highlights:
      • Identifying the predictive features that distinguish offensive texts in Tamil.
      • Measuring the efficacy of transformer models in classifying offensive texts in Tamil.
      • Testing the cross-domain ability of the proposed models on misogynous texts.
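    The adapter idea mentioned in the abstract can be illustrated with a short, self-contained sketch. This is not the authors' implementation: the model name (xlm-roberta-large), the bottleneck width, the binary label count, and the placement of a single adapter on the pooled output are illustrative assumptions (published adapter methods insert such bottlenecks inside every transformer layer). The sketch only shows the core mechanism: freeze the pre-trained weights and train a small bottleneck plus a classification head.

        import torch.nn as nn
        from transformers import AutoModel, AutoTokenizer

        class BottleneckAdapter(nn.Module):
            # Down-project, nonlinearity, up-project, plus a residual connection.
            def __init__(self, hidden_size, bottleneck=64):
                super().__init__()
                self.down = nn.Linear(hidden_size, bottleneck)
                self.up = nn.Linear(bottleneck, hidden_size)
                self.act = nn.GELU()

            def forward(self, x):
                return x + self.up(self.act(self.down(x)))

        class AdapterClassifier(nn.Module):
            def __init__(self, base_name="xlm-roberta-large", num_labels=2):
                super().__init__()
                self.encoder = AutoModel.from_pretrained(base_name)
                for p in self.encoder.parameters():   # freeze all pre-trained weights
                    p.requires_grad = False
                hidden = self.encoder.config.hidden_size
                self.adapter = BottleneckAdapter(hidden)   # trainable
                self.head = nn.Linear(hidden, num_labels)  # trainable

            def forward(self, **inputs):
                # Pool the representation of the first (<s>) token.
                cls = self.encoder(**inputs).last_hidden_state[:, 0]
                return self.head(self.adapter(cls))

        tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
        model = AdapterClassifier()
        batch = tokenizer(["an example YouTube comment"], return_tensors="pt")
        logits = model(**batch)  # shape: (1, 2)
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"trainable: {trainable:,} of {total:,} parameters")

    With the encoder frozen, only the adapter and head receive gradients: on the order of a hundred thousand parameters against the several hundred million in XLM-RoBERTa (Large). This gap is what makes adapter training cheaper than full fine-tuning, consistent with the abstract's claim about training far fewer parameters.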