Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models

Link: http://arxiv.org/abs/2505.22271v1

PDF Link: http://arxiv.org/pdf/2505.22271v1

Summary: While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks.

Various defense methods have been proposed to counter jailbreak attacks; however, they are often tailored to specific types of jailbreak attacks, limiting their effectiveness against diverse adversarial strategies.

For instance, rephrasing-based defenses are effective against text adversarial jailbreaks but fail to counteract image-based attacks.

To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way.

Specifically, TIM initially trains a gist token for efficient detection, which it subsequently applies to detect jailbreak activities during inference.
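
The abstract does not detail the gist-token mechanism; a common reading of gist-style detection, sketched below in PyTorch, appends a single learnable embedding to the prompt and scores the frozen model's hidden state at that position with a small linear head. The class name, the HuggingFace-style `llm(inputs_embeds=...)` call, and the training setup are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class GistJailbreakDetector(nn.Module):
    """Minimal sketch of gist-token jailbreak detection (illustrative only).

    A single learnable "gist" embedding is appended to the prompt, and the
    frozen LLM's hidden state at that position is scored by a small linear
    head. All names and interfaces here are assumptions.
    """

    def __init__(self, llm, hidden_size: int):
        super().__init__()
        self.llm = llm  # frozen (M)LLM backbone; requires_grad_(False) set elsewhere
        self.gist_embedding = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Append the gist token after the (possibly multimodal) prompt embeddings.
        gist = self.gist_embedding.expand(input_embeds.size(0), -1, -1)
        seq = torch.cat([input_embeds, gist], dim=1)
        # Backbone weights stay frozen, but gradients still flow back to the
        # gist embedding and the head, the only parameters trained here.
        hidden = self.llm(inputs_embeds=seq).last_hidden_state
        # Sigmoid score at the gist position: near 1 means likely jailbreak.
        return torch.sigmoid(self.score_head(hidden[:, -1]))
```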

When jailbreak attempts are identified, TIM implements safety fine-tuning using the detected jailbreak instructions paired with refusal answers.

Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module.
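
Read together, the last two paragraphs describe a detect-then-adapt loop. Below is a minimal Python sketch of one such step, assuming a HuggingFace-style causal LM and the detector sketched above; the function names, the fixed refusal string, and the use of LoRA adapters as the LM's only trainable parameters are illustrative assumptions rather than the paper's implementation.

```python
REFUSAL = "I'm sorry, but I can't assist with that request."

def immunize_step(detector, lm, tokenizer, optimizer, prompt, threshold=0.5):
    """One sketched test-time immunization step (hypothetical names throughout).

    `optimizer` is built over the LM's trainable parameters only (for
    instance LoRA adapters), so the detector's gist embedding and scoring
    head are never touched by safety fine-tuning -- one simple way to
    realize the decoupling described in the abstract.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    embeds = lm.get_input_embeddings()(inputs.input_ids)
    if detector(embeds).item() < threshold:
        # Benign input: generate and return a normal answer.
        output_ids = lm.generate(**inputs)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Flagged as a jailbreak: pair the instruction with a refusal answer and
    # take one causal-LM gradient step that supervises only the refusal tokens.
    full = tokenizer(prompt + " " + REFUSAL, return_tensors="pt")
    labels = full.input_ids.clone()
    labels[:, : inputs.input_ids.size(1)] = -100  # ignore loss on the prompt
    loss = lm(**full, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return REFUSAL
```

Keeping the detector's parameters out of the optimizer means each adaptation step hardens the generator without drifting the detection module, which is the decoupling the abstract motivates.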

Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.

Published on arXiv on: 2025-05-28T11:57:46Z