arxiv papers 1 min read

Obfuscated Activations Bypass LLM Latent-Space Defenses

Link: http://arxiv.org/abs/2412.09565v1

PDF Link: http://arxiv.org/pdf/2412.09565v1

Summary: Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks.

These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions.

This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations.

We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations.
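To make "representation probing" concrete, here is a minimal illustrative sketch (not the paper's code) of the kind of latent-space defense being attacked: a linear probe trained to classify harmfulness from a model's hidden activations. The dimension, data, and threshold are assumptions for illustration.

```python
# Illustrative sketch of a latent-space "scanner": a linear harmfulness probe
# over hidden activations. Synthetic data stands in for activations that would
# normally be captured from an LLM on harmful vs. benign prompts.
import torch
import torch.nn as nn

hidden_dim = 4096  # assumed residual-stream width

probe = nn.Linear(hidden_dim, 1)          # logistic-regression-style probe
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

acts = torch.randn(256, hidden_dim)               # placeholder activations
labels = torch.randint(0, 2, (256, 1)).float()    # 1 = harmful, 0 = benign

for _ in range(100):
    opt.zero_grad()
    loss_fn(probe(acts), labels).backward()
    opt.step()

# At inference time, activations whose score exceeds a threshold are flagged
# before the model's output is released.
flagged = torch.sigmoid(probe(acts)) > 0.5
```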

For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate.
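The core idea behind such an attack can be sketched as a joint objective: keep the harmful behavior while driving down the monitor's detection score. The following is a hedged sketch under assumed names (`probe`, `alpha`, the pooling choice), not the paper's implementation.

```python
# Hedged sketch of an obfuscated-activation objective: minimize a behavior loss
# that preserves the attacker's target output, plus the probe's detection score
# on the activations the defense monitors.
import torch
import torch.nn.functional as F

def obfuscation_objective(logits, target_ids, acts, probe, alpha=1.0):
    # Cross-entropy toward the target completion keeps the jailbreak behavior.
    behavior_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), target_ids.view(-1)
    )
    # Probe score on monitored activations; pushing it down makes the harmful
    # forward pass look benign to the latent-space scanner.
    detection_score = torch.sigmoid(probe(acts)).mean()
    return behavior_loss + alpha * detection_score
```

Optimizing a prompt suffix or soft prompt against an objective of this shape is one way activations can be reshaped to evade a probe while the model's output stays harmful.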

However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance.

Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior.

This poses a fundamental challenge to latent-space defenses.

Published on arXiv on: 2024-12-12T18:49:53Z