Link: http://arxiv.org/abs/2509.06350v1
PDF Link: http://arxiv.org/pdf/2509.06350v1
Summary: Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in an adversarial suffix to produce prompts that jailbreak the model. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored.
In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG.
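The abstract describes this mechanism only at a high level. As a rough illustration, here is a minimal sketch, assuming a PyTorch setting, of how a learnable per-position mask score could drive both biased update sampling and pruning; every name, the suffix length, and the pruning budget are hypothetical, not taken from the paper, and in the actual method the mask would be learned jointly with the attack objective.

```python
import torch

# Hypothetical suffix length; Mask-GCG's actual configuration is not given in the abstract.
suffix_len = 20
mask_logits = torch.zeros(suffix_len, requires_grad=True)  # learnable per-position mask scores

def update_probabilities(logits: torch.Tensor) -> torch.Tensor:
    """Turn mask scores into sampling probabilities: high-impact positions are updated more often."""
    return torch.softmax(logits, dim=0)

def prune_low_impact(logits: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only the `keep` highest-scoring positions, shrinking the gradient search space."""
    return torch.topk(logits, k=keep).indices.sort().values

# Sample a suffix position for a GCG-style token substitution, biased toward high-impact tokens.
probs = update_probabilities(mask_logits)
position = torch.multinomial(probs, num_samples=1)

# Prune, e.g., the four lowest-impact positions from the suffix.
kept_positions = prune_low_impact(mask_logits.detach(), keep=suffix_len - 4)
```

This sketch shows only the sampling-and-pruning plumbing; the gradient computation over candidate token substitutions that GCG performs is omitted.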
We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and that pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
Published on arXiv on: 2025-09-08T05:45:37Z