Link: http://arxiv.org/abs/2508.17739v1
PDF Link: http://arxiv.org/pdf/2508.17739v1
Summary: Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks.
However, tuning large models has become increasingly resource-intensive and may struggle to ensure consistent performance.
We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference.
We assume that there exists a small language model that possesses this desired property.
SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks.
This enables SSD to dynamically switch between decoding schemes, prioritizing either utility or safety, which handles the challenge of mismatched model capacities.
The output token is then sampled from a new distribution that combines the distributions of the original and the small models.
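To make the mechanism concrete, below is a minimal sketch of the SSD idea as described in the abstract, not the authors' implementation. The toy stand-in models, the mixing rule, and the threshold `tau` and weights `alpha` are all assumptions for illustration; the core pattern is standard speculative sampling, with the draft-acceptance (match) ratio reused as a jailbreak-risk signal that controls how heavily the safety-aligned small model's distribution is weighted.

```python
# Hypothetical sketch of speculative safety-aware decoding (SSD).
# Not the authors' code: model stand-ins, tau, and alpha are assumptions.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size


def small_model_dist(context):
    # Stand-in for the safety-aligned small model's next-token distribution.
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def large_model_dist(context):
    # Stand-in for the original large model's next-token distribution.
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def ssd_step(context, gamma=4, tau=0.5, alpha=0.5):
    """One speculative round: draft gamma tokens with the small model,
    verify them with the large model, and use the match ratio as a
    jailbreak-risk signal when choosing the output distribution."""
    drafts, q_probs = [], []
    for _ in range(gamma):
        q = small_model_dist(context + drafts)
        token = rng.choice(VOCAB, p=q)
        drafts.append(token)
        q_probs.append(q)

    accepted = []
    for i, token in enumerate(drafts):
        p = large_model_dist(context + accepted)
        # Standard speculative-sampling acceptance test.
        if rng.random() < min(1.0, p[token] / q_probs[i][token]):
            accepted.append(token)
        else:
            break

    match_ratio = len(accepted) / gamma
    # Assumed switching rule: a low match ratio means the safety-aligned
    # small model disagrees with the large model (possible jailbreak),
    # so the mixture shifts weight toward the small model.
    w = alpha if match_ratio >= tau else 0.9
    p = large_model_dist(context + accepted)
    q = small_model_dist(context + accepted)
    mix = (1 - w) * p + w * q
    mix /= mix.sum()
    next_token = rng.choice(VOCAB, p=mix)
    return accepted + [next_token], match_ratio


tokens, ratio = ssd_step(context=[])
print(f"emitted {len(tokens)} tokens, match ratio {ratio:.2f}")
```

When the match ratio is high, most drafted tokens are emitted per large-model call, which is where the inference speedup in the abstract comes from.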
Experimental results show that SSD successfully equips the large model with the desired safety property while allowing the model to remain helpful on benign queries.
Furthermore, SSD accelerates inference, thanks to the speculative sampling design.
Published on arXiv on: 2025-08-25T07:30:10Z