Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models

Link: http://arxiv.org/abs/2506.24056v1

PDF Link: http://arxiv.org/pdf/2506.24056v1

Summary: We introduce logit-gap steering, a fast jailbreak framework that casts the refusal-affirmation gap of RLHF-aligned language models as a single pass over the vocabulary.

A forward-computable score blends gap reduction with lightweight proxies for KL penalty and reward shift, allowing a "sort-sum-stop" sweep to complete in under a second and return a short suffix--two orders of magnitude fewer model calls than beam or gradient attacks.

The same suffix generalises to unseen prompts and scales from 0.5 B to 70 B checkpoints, lifting one-shot attack success from baseline levels to 80-100% while preserving topical coherence.

Beyond efficiency, these suffixes expose sentence-boundary reward cliffs and other alignment artefacts, offering a lightweight probe into how safety tuning reshapes internal representations.

Published on arXiv on: 2025-06-30T17:01:18Z