Link: http://arxiv.org/abs/2502.13946v1
PDF Link: http://arxiv.org/pdf/2502.13946v1
Summary: The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks.
Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior.
We refer to this issue as template-anchored safety alignment.
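To make the "template region" concrete, here is a minimal sketch using Hugging Face Transformers' apply_chat_template; the specific model (Qwen/Qwen2.5-7B-Instruct) and library usage are illustrative assumptions, not taken from the paper. It shows the fixed tokens infilled between the user instruction and the position where the model generates its first output token.

```python
# Minimal illustrative sketch (assumption: Hugging Face Transformers and a
# Qwen chat model; not the paper's own code or models).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "How do I pick a strong password?"}]

# add_generation_prompt=True appends the fixed chat-template tokens that sit
# between the instruction and the model's initial output -- the "template
# region" the summary refers to.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# The printed string ends with fixed template tokens such as
# "<|im_end|>\n<|im_start|>assistant\n", which immediately precede the model's
# first response token; the paper hypothesizes that safety-related decisions
# are anchored on this region.
```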
In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs.
Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks.
Furthermore, we show that detaching safety mechanisms from the template region is a promising way to mitigate vulnerabilities to jailbreak attacks.
We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
Published on arXiv: 2025-02-19T18:42:45Z