
MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Link: http://arxiv.org/abs/2503.12931v1

PDF Link: http://arxiv.org/pdf/2503.12931v1

Summary: Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment.

Existing defense strategies generally rely on predefined static criteria to differentiate between harmful and benign prompts.

However, such rigid rules are incapable of accommodating the inherent complexity and dynamic nature of real jailbreak attacks.

In this paper, we propose a novel concept of "mirror" to enable dynamic and adaptive defense.

A mirror refers to a dynamically generated prompt that mirrors the syntactic structure of the input while ensuring semantic safety.

The personalized discrepancies between the input prompts and their corresponding mirrors serve as the guiding principles for defense.

A new defense paradigm, MirrorGuard, is further proposed to detect and calibrate risky inputs based on such mirrors.

An entropy-based detection metric, Relative Input Uncertainty (RIU), is integrated into MirrorGuard to quantify the discrepancies between input prompts and mirrors.
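
The summary does not give the formula for RIU, so the following is only a rough sketch of what an entropy-based comparison between an input and its mirror could look like. It assumes each prompt has already been reduced to a discrete probability distribution (for example, a model's next-token distribution), and that the score is the entropy gap normalized by the mirror's entropy. The function names, the normalization, and the threshold are all hypothetical and are not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relative_uncertainty(input_probs, mirror_probs, eps=1e-12):
    """Hypothetical score: entropy gap between input and mirror,
    normalized by the mirror's entropy. NOT the paper's RIU definition."""
    h_input = shannon_entropy(input_probs)
    h_mirror = shannon_entropy(mirror_probs)
    return abs(h_input - h_mirror) / (h_mirror + eps)


def is_risky(input_probs, mirror_probs, threshold=0.5):
    """Flag the input when its uncertainty deviates strongly from the mirror's.
    The threshold value is an illustrative assumption."""
    return relative_uncertainty(input_probs, mirror_probs) > threshold
```

In this sketch, an input whose distribution matches its mirror's scores zero, while an input whose uncertainty differs sharply from the mirror's scores high and is flagged. The actual metric in the paper may differ in every detail.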

MirrorGuard is evaluated on several popular datasets, demonstrating state-of-the-art defense performance while maintaining general effectiveness.

Published on arXiv on: 2025-03-17T08:41:29Z