Link: http://arxiv.org/abs/2502.04204v1
PDF Link: http://arxiv.org/pdf/2502.04204v1
Summary: Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts.
To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks.
During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs.
This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$.
Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers.
The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing, respectively.
Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths.
Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the suffix length during AT.
Our findings show that it is practical to defend against "long-length" jailbreak attacks via efficient "short-length" AT.
The code is available at https://github.com/fshp971/adv-icl.
Published on arXiv on: 2025-02-06T16:44:26Z
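
As a rough illustration (not taken from the paper or its repository), the short Python sketch below evaluates the bound's driving term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$ for a few hypothetical suffix lengths; choosing $M_{\text{train}} \approx \sqrt{M_{\text{test}}}$ keeps the term roughly constant, which matches the intuition that "long-length" attacks can be defended with "short-length" AT.

import math

def bound_term(m_test: int, m_train: int) -> float:
    # Driving term of the robust generalization bound: sqrt(M_test) / M_train.
    return math.sqrt(m_test) / m_train

# Hypothetical test-time suffix lengths (numbers of adversarially
# perturbed in-context samples); the values are illustrative only.
for m_test in (16, 64, 256, 1024):
    m_train = round(math.sqrt(m_test))  # "short-length" AT: M_train ~ sqrt(M_test)
    print(f"M_test={m_test:4d}  M_train={m_train:2d}  term={bound_term(m_test, m_train):.2f}")

Under this square-root rule the term stays at 1.00 for every listed length, whereas keeping a fixed short M_train would let it grow on the order of sqrt(M_test).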