Full text
Peer reviewed
  • Exposing the Achilles’ heel...
    Aggarwal, Sajal; Vishwakarma, Dinesh Kumar

    Expert Systems with Applications, 11/2024, Volume 254
    Journal Article

    The accessibility of online hate speech has increased significantly, making it crucial for social-media companies to prioritize efforts to curb its spread. Although deep learning models are known to be vulnerable to adversarial attacks, whether models fine-tuned for hate speech detection exhibit similar susceptibility remains underexplored. Textual adversarial attacks make subtle alterations to the original samples, designed so that the resulting adversarial examples deceive the target model even though human observers still classify them correctly. Though many approaches have been proposed for word-level adversarial attacks on textual data, they struggle to preserve the semantic coherence of the text while generating adversarial counterparts, and the adversarial examples they produce are often easily distinguishable by human observers. This work presents a novel methodology that uses visually confusable glyphs and invisible characters to generate semantically and visually similar adversarial examples in a black-box setting. In the context of hate speech detection, our attack was effectively applied to several state-of-the-art deep learning models fine-tuned on two benchmark datasets. The major contributions of this study are: (1) demonstrating the vulnerability of deep learning models fine-tuned for hate speech detection; (2) a novel attack framework based on a simple yet potent modification strategy; (3) superior outcomes in terms of accuracy degradation, attack success rate, average perturbation, semantic similarity, and perplexity when compared to existing baselines; (4) strict adherence to prescribed linguistic constraints while formulating adversarial samples; and (5) preservation of the ground-truth label while perturbing the original input to produce imperceptible adversarial examples.
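
    The abstract does not disclose the authors' exact perturbation procedure, but the core idea it describes, substituting visually confusable glyphs and inserting invisible characters so the text looks unchanged to a human while its byte-level content (and hence its tokenization) changes, can be sketched as follows. The homoglyph mapping, the zero-width character, and the perturb function below are illustrative assumptions, not the paper's implementation.

    # Illustrative sketch only: the mapping and perturbation rate are
    # hypothetical, not taken from the paper.
    import random

    HOMOGLYPHS = {
        "a": "\u0430",  # Cyrillic small a, renders like Latin "a"
        "e": "\u0435",  # Cyrillic small ie, renders like Latin "e"
        "o": "\u043e",  # Cyrillic small o, renders like Latin "o"
        "c": "\u0441",  # Cyrillic small es, renders like Latin "c"
    }
    ZERO_WIDTH = "\u200b"  # zero-width space, invisible when rendered

    def perturb(text: str, rate: float = 0.3, seed: int = 0) -> str:
        """Swap some characters for look-alike glyphs and add invisible characters."""
        rng = random.Random(seed)
        out = []
        for ch in text:
            if ch.lower() in HOMOGLYPHS and rng.random() < rate:
                out.append(HOMOGLYPHS[ch.lower()])
                out.append(ZERO_WIDTH)  # silently disrupts sub-word tokenization
            else:
                out.append(ch)
        return "".join(out)

    original = "an example input for a fine-tuned hate speech classifier"
    adversarial = perturb(original)
    print(adversarial)              # looks identical on screen
    print(original == adversarial)  # False: byte-level content differs

    In a black-box setting, perturbations like these would typically be applied iteratively, querying the target classifier and keeping only the character changes that reduce its confidence in the correct label; that search loop is likewise an assumption here, not a detail given in the abstract.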