Machine learning is the subfield of artificial intelligence concerned with developing systems that can learn and improve from past experience. Demand for offensive language detection is growing because a large share of online text is code-mixed, and code-mixed texts are sometimes written in native scripts; a system trained on data from a single language is likely to fail on such input because of this complexity. The use of online communities and social media platforms has grown rapidly. Many researchers have recently investigated ways of identifying abusive content and have developed systems to detect its different types, such as aggression identification, cyberbullying detection, hate speech identification, offensive language identification and toxic comment identification. Developing mechanisms to detect such content is a challenging task. In the last few years, machine learning and deep learning techniques have played a significant role in solutions for offensive language detection, and considerable effort has been devoted to developing models that identify toxic content. First, this survey provides an offensive language taxonomy and an overview of detection approaches. Then, the article focuses on approaches to offensive language identification and toxic comment identification. It reviews the models used, the type of learning, the languages to which each model is applied, the methods against which each model was compared, the datasets used, the type of offensive language detected and the performance metrics. The findings of the review are summarized; they indicate that further effort is required to improve on the current state-of-the-art approaches.
The article also discusses future work to improve offensive language classification and toxic comment identification.
Social networking platforms have gained widespread popularity and are used for activities such as promoting products, sharing news and announcing achievements. On the other hand, they are also used for spreading rumors, bullying people and abusing certain groups with hateful words. Hateful and offensive posts must be detected and removed from social platforms as early as possible, because such posts spread quickly and tend to have strong negative impacts on people. In the last few years, offensive content and hate speech detection has become a popular research topic. Detecting hate speech on social platforms faces many challenges, one of which is the use of code-mixed language: the majority of social media users post messages in code-mixed languages such as Hindi–English, Tamil–English, Malayalam–English and Telugu–English. In this exhaustive study, we explore and compare various machine learning and deep learning approaches. An ensemble model that combines the outputs of transformer-based and deep learning-based models is proposed to detect hate speech and offensive language on social networking platforms. The proposed weighted ensemble framework outperformed state-of-the-art models, achieving weighted F1-scores of 0.802 and 0.933 on the Malayalam and Tamil code-mixed datasets, respectively.
•Proposed a weighted ensemble framework for identifying hateful and offensive code-mixed posts on social platforms.
•Two code-mixed datasets, Tamil and Malayalam, are used in this research.
•The proposed model utilizes the outputs of deep learning and transformer-based models.
•Transformer-based models such as m-BERT, distilBERT and xlm-RoBERTa performed better than the ML- and DL-based models.
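The weighted ensemble idea in the abstract can be sketched as a weighted average of the class-probability outputs of several models. This is a minimal illustration only: the model names, weights and toy probabilities below are assumptions, not the authors' actual configuration.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine per-model class-probability matrices (n_samples x n_classes)
    with normalized weights and return the argmax class per sample."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # normalize so weights sum to 1
    stacked = np.stack(prob_list)               # shape (n_models, n, k)
    combined = np.tensordot(weights, stacked, axes=1)  # shape (n, k)
    return combined.argmax(axis=1)              # final predicted labels

# Toy predictions from three hypothetical models (e.g. m-BERT, distilBERT, a DL model)
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.7, 0.3], [0.2, 0.8]])
p3 = np.array([[0.6, 0.4], [0.5, 0.5]])
labels = weighted_ensemble([p1, p2, p3], weights=[0.5, 0.3, 0.2])
```

Giving stronger models larger weights lets the ensemble defer to them on borderline samples while still smoothing over individual model errors.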
The treatment of insults – understood as words within offensive language whose function is to hurt the addressee’s feelings (Author, 2016) – in audiovisual translation (AVT) always poses a challenge to audiovisual translators: because of the semantic/pragmatic load these terms carry in the source text (ST), the effect caused in the target text (TT) and culture, and the difficulty of transferring them idiomatically. Certain formulas do not always maintain the effect that some words have in the ST, and the translation techniques used may not even be faithful to the original dialogue exchanges. This paper aims to analyse all the insults uttered in Once Upon a Time in Hollywood (Tarantino, 2019) and in its dubbed and subtitled versions in European Spanish. To do so, I pay particular attention to the speaker’s intention (Grice, 1969): whether the insults found in the ST can be viewed as examples of friendly banter or whether, by contrast, the speaker’s intention was to offend. Author’s (2023) taxonomy of translation techniques is used to examine how insults were rendered in the TT and to determine whether the semantic/pragmatic load of these terms is transferred (toned up, maintained or toned down) or not (neutralised or omitted).
The point of departure of this case study is the hypothesis that dubbing transfers more insults into European Spanish than subtitling, owing to the technical features of the former. The aims of the study are to determine: (1) how faithful the insults in the dubbed and subtitled versions are to the ST, that is, whether the load of the insults is transferred to the TT and to what degree; (2) which AVT mode transfers the greater number of insults to the TT; (3) whether the insults transferred were intended to offend; and (4) whether the insults tend towards foreignisation or domestication. To this end, a multidisciplinary methodology is used, based on descriptive translation studies and a pragmatics approach.
•Dubbing and subtitling transfer insults into European Spanish mostly faithfully.
•Dubbing transferred slightly more insults than subtitling.
•The function of insults depends on their linguistic form, sociopragmatics and context.
•Insults aimed at offending are present in the TT to a similar extent in both AVT modes.
•The audio and visual channels help distinguish insults as offence vs. friendly banter.
•The same professional did the dubbing and the subtitling, hence the similar results.
In recent years, social media networks have emerged as key players, providing platforms for expressing opinions, communicating and distributing content. However, users often take advantage of perceived anonymity on social media platforms to share offensive or hateful content. Offensive language has thus grown into a significant issue with the increase in online communication and the popularity of social media platforms, and the problem has attracted significant attention to methods for detecting offensive content and preventing its spread on online social networks. This paper therefore aims to develop an effective Arabic offensive language detection model that employs deep learning together with semantic and contextual features. It proposes a deep learning approach that uses a bidirectional long short-term memory (BiLSTM) model and domain-specific word embeddings extracted from an Arabic offensive dataset. The approach was evaluated on an Arabic dataset collected from Twitter. The BiLSTM model trained on a combination of domain-specific and domain-agnostic word embeddings achieved the highest accuracy, 0.93.
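One way to read "a combination of domain-specific and domain-agnostic word embeddings" is per-token concatenation of the two vector spaces before the sequence model. The sketch below illustrates only that combination step; the tiny embedding tables and dimensions are invented for illustration and are not the paper's actual vectors.

```python
# Toy embedding tables (assumptions): one trained on an offensive-language
# corpus, one general-purpose. Real tables would map thousands of words
# to vectors of a few hundred dimensions.
domain_emb = {"insult": [0.9, 0.1], "tweet": [0.2, 0.3]}
generic_emb = {"insult": [0.4, 0.4], "tweet": [0.5, 0.1]}

def embed(tokens, dim=2, default=0.0):
    """Concatenate both embedding spaces per token (zeros for OOV words),
    yielding 2*dim features per token for a downstream BiLSTM."""
    vectors = []
    for tok in tokens:
        d = domain_emb.get(tok, [default] * dim)
        g = generic_emb.get(tok, [default] * dim)
        vectors.append(d + g)  # list concatenation -> combined feature vector
    return vectors

seq = embed(["insult", "unknown"])
```

Concatenation keeps both signals intact and lets the recurrent layer learn which space matters for each word, at the cost of doubling the input dimensionality.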
This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in a literature that has, until now, largely focused on Indo-European languages. Our taxonomy divides offensive language into seven levels (six explicit and one implicit). We based our work on the previously introduced simplified offensive language (SOL) taxonomy, in the hope that our adaptation of SOL to Hebrew will reflect the language's unique linguistic and cultural nuances. The study involves both linguistic and cultural analysis beyond Natural Language Processing (NLP); we employed manual linguistic analysis to understand the nuances of offensive language in Hebrew.
An accompanying dataset, gathered on Twitter and manually curated by human annotators, is described in detail. This dataset was constructed to both validate the taxonomy and serve as a foundation for future research on offensive language detection and analysis in Hebrew. Preliminary analysis of the dataset reveals intriguing patterns and distributions, underscoring the complexity and specificity of offensive expressions in the Hebrew language.
The aim of our work is to capture the complexity and specificity of offensive expressions in Hebrew beyond what automated NLP methods alone can provide. Our findings highlight the significance of considering linguistic and cultural variations when researching and correcting abusive language online. We believe that our streamlined taxonomy and associated dataset will be crucial in improving research in Hebrew language sociocultural studies, natural language processing, and offensive language detection. Our study also makes a substantial contribution to the study of low-resource languages and can be used as a model for future research on other languages.
Persian offensive language detection Kebriaei, Emad; Homayouni, Ali; Faraji, Roghayeh ...
Machine learning,
07/2024, Volume: 113, Issue: 7
Journal Article
Peer-reviewed
With the proliferation of social networks and their impact on human life, one rising problem in this environment is the growth of verbal and written insults and hatred. As one of the major platforms for distributing text-based content, Twitter frequently carries its users’ abusive remarks. Building a comprehensive collection of offensive sentences is the first step in creating a model that recognizes objectionable phrases. Moreover, despite the abundance of resources in English and other languages, there are limited resources and studies on identifying hateful and offensive statements in Persian. In this study, we compiled a 38K-tweet dataset of Persian hate and offensive language using keyword-based data selection strategies. A Persian offensive lexicon and nine hatred-target-group lexicons were gathered through crowdsourcing for this purpose. The dataset was annotated manually, with each tweet investigated by at least two annotators. In addition, to analyze the effect of the lexicons on language model behaviour, we employed two assessment criteria (FPED and pAUCED) to measure the dataset’s potential bias. Then, by configuring the dataset based on the results of the bias measurement, we mitigated the effect of word bias in tweets on language model performance. The results indicate that bias is significantly diminished while the F1 score decreases by less than one hundredth.
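The FPED criterion mentioned above is commonly defined (following Dixon et al.'s unintended-bias work) as the sum, over identity-term subgroups, of the absolute gap between each subgroup's false-positive rate and the overall false-positive rate. A minimal sketch of that definition, with invented labels and subgroup masks:

```python
def fpr(y_true, y_pred):
    """False-positive rate: fraction of true negatives predicted positive."""
    negatives = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        return 0.0
    return sum(1 for t, p in negatives if p == 1) / len(negatives)

def fped(y_true, y_pred, subgroup_masks):
    """False Positive Equality Difference: sum of |FPR_subgroup - FPR_overall|.
    Each mask selects examples mentioning one identity/target group."""
    overall = fpr(y_true, y_pred)
    total = 0.0
    for mask in subgroup_masks:
        yt = [t for t, m in zip(y_true, mask) if m]
        yp = [p for p, m in zip(y_pred, mask) if m]
        total += abs(fpr(yt, yp) - overall)
    return total
```

A score of zero means the model's false positives are spread evenly across the target groups; larger values indicate that mere mention of some group's lexicon words pushes the model toward a false "offensive" prediction.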
The fight against offensive speech on the Internet necessitates increased efforts from linguistic analysis and artificial intelligence perspectives to develop countermeasures and preventive methods. Reliable predictions can only be obtained if these methods are exposed to a representative sample of the domain or environment under consideration. Datasets serve as the foundation for significant developments in this field because they are the main means of obtaining appropriate instances that reveal the multiple and varied faces of the offensive speech phenomenon. In this sense, we present Ar-PuFi, a dataset of offensive speech towards public figures in the Arabian community. With 24,071 comments collected from TV interviews with Egyptian celebrities belonging to six domains of public interest, Ar-PuFi is currently the largest Arabic dataset of its category and size. The examples were annotated by three native speakers over the course of two months and are provided with two-class and six-class annotations based on the presence or absence of explicit and implicit offensive content. We evaluated a diverse set of classification models employing several text representations (e.g., N-gram, TF/IDF, AraVec, and fastText), and AraBERT established the baseline for the new dataset in both offensive detection and group classification. Additionally, we apply the Pointwise Mutual Information (PMI) technique to comments within the target domain in order to derive a lexicon of offensive terms associated with each domain of Ar-PuFi. We further explored whether active learning (AL) or meta-learning (ML) frameworks could reduce the labeling effort required for our dataset without affecting prediction quality, and found that, although AL reduces the amount of annotation by 10% compared with the ML approach, both approaches still need about 70% of the full dataset to reach baseline performance.
Finally, we took advantage of the availability of relevant datasets and conducted a cross-domain experiment to support our claims about both the uniqueness of our dataset and the difficulty of adapting Arabic dialects to one another.
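Deriving a domain lexicon with PMI, as described above, amounts to ranking each word by how much more frequent it is inside the domain than in the corpus overall: PMI(w, d) = log2(P(w | d) / P(w)). The sketch below uses invented toy token lists, not the Ar-PuFi data.

```python
import math

def pmi_lexicon(domain_tokens, all_tokens, top_n=3):
    """Rank words by PMI with the domain: log2(P(w|domain) / P(w))."""
    dom_counts, all_counts = {}, {}
    for w in domain_tokens:
        dom_counts[w] = dom_counts.get(w, 0) + 1
    for w in all_tokens:
        all_counts[w] = all_counts.get(w, 0) + 1
    scores = {}
    for w, c in dom_counts.items():
        p_w_given_d = c / len(domain_tokens)
        p_w = all_counts[w] / len(all_tokens)
        scores[w] = math.log2(p_w_given_d / p_w)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy corpora: "bad" is concentrated in the domain, "word" is not.
dom = ["bad", "bad", "word"]
corpus = ["bad", "bad", "word", "word", "word", "nice"]
lex = pmi_lexicon(dom, corpus, top_n=1)
```

In practice one would also apply a minimum-frequency cutoff, since PMI is notoriously unstable for rare words.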
This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification, covering a total of more than 60,000 YouTube comments: around 44,000 comments in Tamil–English, around 7000 in Kannada–English, and around 20,000 in Malayalam–English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. Because it comprises user-generated content from a multilingual country, the dataset contains all types of code-mixing phenomena. We also present baseline experiments using machine learning and deep learning methods to establish benchmarks on the dataset. The dataset is available on GitHub and Zenodo.
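Krippendorff's alpha, the agreement statistic cited above, is 1 minus the ratio of observed to expected disagreement. A minimal sketch for the nominal, two-annotator case (the annotations below are invented labels, not the dataset's):

```python
def krippendorff_alpha(ann1, ann2):
    """Nominal Krippendorff's alpha for two annotators over the same units:
    alpha = 1 - D_observed / D_expected."""
    pairs = list(zip(ann1, ann2))
    n = 2 * len(pairs)                      # total pairable values
    marginals, matches = {}, 0
    for a, b in pairs:                      # each unit contributes (a,b) and (b,a)
        marginals[a] = marginals.get(a, 0) + 1
        marginals[b] = marginals.get(b, 0) + 1
        if a == b:
            matches += 2
    observed_disagreement = 1 - matches / n
    expected_agreement = sum(m * (m - 1) for m in marginals.values()) / (n * (n - 1))
    return 1 - observed_disagreement / (1 - expected_agreement)
```

Alpha is 1.0 for perfect agreement, around 0 for chance-level labeling, and negative when annotators disagree more than chance would predict, which is why it is preferred over raw percent agreement for skewed label distributions.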
This study examined the factors that affect artificial intelligence (AI) chatbot users' use of profanity and offensive words, drawing on the concepts of ethical ideology, social competence, and perceived humanlikeness of chatbots. The study also looked into users' liking of chatbots' responses to their utterances of profanity and offensive words. Using a national survey (N = 645), the study found that users' idealism orientation was a significant factor in explaining the use of such offensive language. In addition, users with high idealism liked chatbots' active intervention, whereas those with high relativism liked chatbots' reactive responses. Moreover, users’ perceived humanlikeness of the chatbot increased their likelihood of using offensive words targeting disliked acquaintances, racial/ethnic groups, and political parties. These findings are expected to fill the gap between the current use of AI chatbots and the lack of empirical studies examining language use.
•This study examined factors impacting AI chatbot users' language use.
•Users' ethical idealism was related to the use of profanity and offensive words.
•Users' perceived human-likeness of chatbot increased their use of offensive words.
SOLD: Sinhala offensive language dataset Ranasinghe, Tharindu; Anuradha, Isuri; Premasiri, Damith ...
Language resources and evaluation,
03/2024
Journal Article
Peer-reviewed
Open access
The widespread presence of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic deal with English and a few other high-resource languages; as a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive or not offensive at both sentence level and token level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets annotated following a semi-supervised approach.
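A common form of the semi-supervised annotation behind a dataset like SemiSOLD is self-training: a model trained on the labelled set assigns pseudo-labels to unlabelled tweets whose predicted confidence clears a threshold. The sketch below illustrates only that loop; the scoring function, label names, and threshold are stand-in assumptions, not the authors' actual setup.

```python
def self_train(unlabelled, score_fn, threshold=0.9):
    """Return (text, label) pseudo-labelled pairs for confident predictions;
    texts whose score falls between the thresholds are left unlabelled."""
    pseudo = []
    for text in unlabelled:
        p_offensive = score_fn(text)        # model's P(offensive | text)
        if p_offensive >= threshold:
            pseudo.append((text, "OFF"))
        elif p_offensive <= 1 - threshold:
            pseudo.append((text, "NOT"))
    return pseudo

# Stand-in scorer for demonstration: fraction of tokens in a tiny lexicon.
LEX = {"idiot"}
score = lambda t: sum(w in LEX for w in t.split()) / max(len(t.split()), 1)
pseudo_labels = self_train(["idiot", "hello there"], score)
```

The threshold trades coverage for label quality: a stricter threshold yields fewer but cleaner pseudo-labels, which matters when the pseudo-labelled set is an order of magnitude larger than the gold set, as with SemiSOLD.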