Link: http://arxiv.org/abs/2505.14226v1
PDF Link: http://arxiv.org/pdf/2505.14226v1
Summary: Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day.
These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content.
Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context.
In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks.
We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies.
Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts.
Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with an Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts.
Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success.
Our study motivates an increased focus on more generalizable safety alignment for multilingual multimodal models, especially in real-world settings where prompts can contain misspelt words.
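The tokenization effect described in the abstract can be illustrated with a minimal sketch: a phonetic misspelling of a sensitive word typically splits into different subword pieces than the canonical spelling. The tokenizer (GPT-2 BPE via Hugging Face transformers) and the word pair below are illustrative assumptions only; they are not the models, prompts, or words used in the paper.

```python
# Minimal sketch: how a phonetic misspelling can change BPE tokenization.
# Assumptions: GPT-2's BPE tokenizer stands in for the audited models'
# tokenizers, and the word pair is a hypothetical example, not from the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

original = "explosive"    # sensitive word a safety filter may key on
perturbed = "eksplosiv"   # phonetically similar misspelling

for word in (original, perturbed):
    tokens = tokenizer.tokenize(word)
    print(f"{word!r:>14} -> {tokens}")

# The misspelled form usually breaks into different (often more) subword
# pieces, so token- or string-level safety checks tuned to the canonical
# spelling may fail to match, while the word remains readable to a human.
```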
Published on arXiv on: 2025-05-20T11:35:25Z