Link: http://arxiv.org/abs/2505.14226v1
PDF Link: http://arxiv.org/pdf/2505.14226v1
Summary: Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day.
These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content.
Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context.
In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks.
We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies.
Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts.
Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with an Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts.
Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success.
Our study motivates an increased focus on more generalizable safety alignment for multilingual multimodal models, especially in real-world settings where prompts can contain misspelt words.
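The tokenization effect described in the abstract can be illustrated with a minimal sketch: a phonetic misspelling of a sensitive word typically splits into different subword pieces than the canonical spelling. The tokenizer (GPT-2 BPE via Hugging Face transformers) and the word pair below are illustrative assumptions only; they are not the models, prompts, or words used in the paper.

```python
# Minimal sketch: how a phonetic misspelling can change BPE tokenization.
# Assumptions: GPT-2's BPE tokenizer stands in for the audited models'
# tokenizers, and the word pair is a hypothetical example, not from the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

original = "explosive"    # sensitive word a safety filter may key on
perturbed = "eksplosiv"   # phonetically similar misspelling

for word in (original, perturbed):
    tokens = tokenizer.tokenize(word)
    print(f"{word!r:>14} -> {tokens}")

# The misspelled form usually breaks into different (often more) subword
# pieces, so token- or string-level safety checks tuned to the canonical
# spelling may fail to match, while the word remains readable to a human.
```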
Published on arXiv on: 2025-05-20T11:35:25Z