
JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Link: http://arxiv.org/abs/2505.19610v1

PDF Link: http://arxiv.org/pdf/2505.19610v1

Summary: Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks.

However, lacking well-defined attack objectives, existing jailbreak methods often struggle: gradient-based strategies are prone to local optima and lack precise directional guidance, and most approaches decouple the visual and textual modalities, limiting their effectiveness by neglecting crucial cross-modal interactions.

Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space.

This motivates exploiting that boundary to steer model behavior.

Accordingly, we propose JailBound, a novel latent-space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating the decision boundary within the fusion layer's latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs.

This latter stage employs an innovative mechanism to steer the model's internal state towards policy-violating outputs while maintaining cross-modal semantic consistency.
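The summary does not spell out how the safety boundary is probed, but a common way to approximate a linear decision boundary in a latent space is to fit a linear probe on internal activations; the probe's weight vector then gives a perturbation direction toward the "unsafe" side. The sketch below illustrates this idea on synthetic stand-in activations — the dimensions, data, and probing procedure are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch of a Stage-1-style "Safety Boundary Probing": fit a
# logistic-regression probe on synthetic fusion-layer activations and
# read off the boundary normal as a perturbation direction.
# All data here are synthetic assumptions, not real VLM latents.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # assumed fusion-layer hidden size (illustrative)

# Synthetic "safe" vs "unsafe" activation clusters separated along a
# hidden ground-truth direction w_true.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
safe = rng.normal(size=(200, d)) - 1.5 * w_true
unsafe = rng.normal(size=(200, d)) + 1.5 * w_true
X = np.vstack([safe, unsafe])
y = np.array([0] * 200 + [1] * 200)

# Train the linear probe by plain gradient descent on logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * float(np.mean(p - y))

# The unit normal of the learned boundary approximates the "optimal
# perturbation direction" toward the policy-violating region.
direction = w / np.linalg.norm(w)
print("probe/ground-truth alignment:", abs(float(direction @ w_true)))
```

In a real attack pipeline, `X` would be replaced by activations collected from the model's fusion layer on safe and unsafe prompts, and `direction` would seed the joint image-text optimization of Stage 2.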

Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy, achieving average attack success rates of 94.32% in the white-box setting and 67.28% in the black-box setting, which are 6.17% and 21.13% higher than SOTA methods, respectively.

Our findings expose an overlooked safety risk in VLMs and highlight the urgent need for more robust defenses.

Warning: This paper contains potentially sensitive, harmful and offensive content.

Published on arXiv on: 2025-05-26T07:23:00Z