
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

Link: http://arxiv.org/abs/2505.07167v1

PDF Link: http://arxiv.org/pdf/2505.07167v1

Summary: Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research.

However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment.

Recent studies have shown that current safety-aligned LLMs often exhibit shallow safety alignment, where the first few tokens largely determine whether the response will be harmful.

Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens.
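
As a rough illustration of how such a trigger token could be estimated (not the paper's released code), the sketch below greedy-decodes the first response token for a set of refusal-inducing prompts and takes the most frequent one; the model name and prompt list are placeholders, and the Hugging Face transformers API is assumed.

```python
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: any safety-aligned chat model plus a small set of
# clearly harmful prompts that the model is expected to refuse.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
harmful_prompts = ["...", "..."]  # assumed refusal-inducing prompts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def first_response_token(prompt: str) -> int:
    """Greedy-decode exactly one token and return its id."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    return out[0, -1].item()

# The most frequent initial token across refusals serves as the
# safety trigger token (e.g. the token for "I" in "I cannot ...").
counts = Counter(first_response_token(p) for p in harmful_prompts)
trigger_token_id, _ = counts.most_common(1)[0]
print(tokenizer.decode([trigger_token_id]))
```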

Building on this insight, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes the safety trigger tokens of the given safety-aligned LLM to trigger the model's learned safety patterns.

In this process, the safety trigger is constrained to a single token, which effectively preserves model usability by introducing minimal intervention into the decoding process.
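
A minimal sketch of this decoding-time idea, assuming Hugging Face transformers and a trigger token id obtained as in the snippet above; this is an illustrative approximation of forcing one trigger token at the start of the response, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate_with_trigger(prompt: str, trigger_token_id: int,
                          max_new_tokens: int = 256) -> str:
    """Force a single safety trigger token as the first response token,
    then let the model continue decoding normally."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Append the trigger token so decoding starts from the model's own
    # learned refusal prefix when the prompt is unsafe; for benign prompts
    # a one-token prefix leaves the rest of the response largely unchanged.
    input_ids = torch.cat(
        [inputs["input_ids"], torch.tensor([[trigger_token_id]])], dim=-1
    )
    attention_mask = torch.ones_like(input_ids)
    output_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # Drop the prompt tokens; keep the forced trigger plus the continuation.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)
```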

Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response-time overhead, outperforming ten baseline methods.

Published on arXiv on: 2025-05-12T01:26:50Z