
PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips

Link: http://arxiv.org/abs/2412.07192v1

PDF Link: http://arxiv.org/pdf/2412.07192v1

Summary: We introduce a new class of attacks on commercial-scale (human-aligned) language models that induce jailbreaking through targeted bitwise corruptions in model parameters.

Our adversary can jailbreak billion-parameter language models with fewer than 25 bit-flips in all cases, and as few as 5 in some, using up to 40× fewer bit-flips than existing attacks on computer vision models at least 100× smaller.

Unlike prompt-based jailbreaks, our attack renders these models in memory 'uncensored' at runtime, allowing them to generate harmful responses without any input modifications.

Our attack algorithm efficiently identifies target bits to flip, offering up to 20× more computational efficiency than previous methods.

This makes it practical for language models with billions of parameters.
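To make the idea of a targeted bitwise corruption concrete, here is a minimal Python sketch (not the paper's attack code; the `flip_bit` helper and the toy weight vector are hypothetical) showing how flipping a single bit of a float32 weight can change its value by many orders of magnitude:

```python
import numpy as np

def flip_bit(weights: np.ndarray, index: int, bit: int) -> None:
    """Flip one bit of a float32 weight in place (illustrative helper)."""
    view = weights.view(np.uint32)       # reinterpret the bytes, no copy
    view[index] ^= np.uint32(1 << bit)   # XOR toggles exactly one bit

# Toy weight vector; flipping the top exponent bit (bit 30) of 0.5
# turns it into roughly 1.7e38, since the IEEE 754 exponent jumps
# from 126 to 254.
w = np.array([0.5, -0.25, 1.0], dtype=np.float32)
flip_bit(w, 0, 30)
print(w)
```

Because XOR is its own inverse, flipping the same bit again restores the original weight; the attack's difficulty lies in choosing *which* few bits, among billions, produce the jailbreaking behavior.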

We show an end-to-end exploitation of our attack using software-induced fault injection, Rowhammer (RH).

Our work examines 56 DRAM RH profiles from DDR4 and LPDDR4X devices with different RH vulnerabilities.

We show that our attack can reliably induce jailbreaking in systems similar to those affected by prior bit-flip attacks.

Moreover, our approach remains effective even against highly RH-secure systems (e.g., 46× more secure than previously tested systems).
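The role of the DRAM profiles can be pictured with a small sketch (all numbers here are hypothetical, not the paper's measured profiles): a Rowhammer profile yields the set of (offset, bit) locations a given device can flip, and an end-to-end attack needs its required target flips to fall inside that set.

```python
# Hypothetical data for illustration only, not the paper's profiles.
# A Rowhammer profile: (byte offset, bit index) pairs the device can flip.
profile_flips = {(0x1000, 30), (0x1008, 14), (0x2040, 7), (0x2048, 3)}

# Target flips an attack algorithm might select in the model's weights.
target_flips = [(0x1000, 30), (0x2040, 7), (0x3000, 2)]

reachable = [f for f in target_flips if f in profile_flips]
print(f"{len(reachable)} of {len(target_flips)} target flips are reachable")
# → 2 of 3 target flips are reachable
```

In practice the search would be constrained the other way round, restricting candidate bits to those the device can actually flip; this sketch only illustrates the matching between attack-selected and hardware-achievable flips.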

Our analyses further reveal that: (1) models with less post-training alignment require fewer bit flips to jailbreak; (2) certain model components, such as value projection layers, are substantially more vulnerable than others; and (3) our method is mechanistically different from existing jailbreaks.

Our findings highlight a pressing, practical threat to the language model ecosystem and underscore the need for research to protect these models from bit-flip attacks.

Published on arXiv on: 2024-12-10T05:00:01Z