Link: http://arxiv.org/abs/2504.01550v1
PDF Link: http://arxiv.org/pdf/2504.01550v1
Summary: Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges.
These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments.
Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable: they address specific threats, often fail to generalize to unseen attacks, or require manual system-level defenses.
This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety.
RepBend brings the idea of activation steering - simple vector arithmetic that steers a model's behavior during inference (see the sketch below) - to loss-based fine-tuning.
Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to a 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
Published on arXiv on: 2025-04-02T09:47:01Z
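
To illustrate the activation-steering idea the summary refers to, here is a minimal PyTorch sketch: a fixed "steering vector" is added to one transformer layer's hidden states at inference time. The model choice (gpt2), layer index, steering strength, and the random placeholder vector are all illustrative assumptions, not details from the paper; in practice the vector is often the difference of mean activations over contrastive prompt sets, and RepBend's contribution is to fold this effect into a fine-tuning loss rather than apply it per request.

```python
# Sketch of inference-time activation steering (assumptions noted inline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM works analogously
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6  # which transformer block to steer (assumption)
alpha = 4.0    # steering strength (assumption)

# Placeholder steering vector: a random unit vector. Real steering vectors
# are typically derived from activations on contrastive prompt sets.
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

def steering_hook(module, inputs, output):
    # A GPT-2 block returns a tuple whose first element is the hidden
    # states of shape (batch, seq_len, hidden); shift them by alpha*steer.
    hidden_states = output[0] + alpha * steer.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
try:
    ids = tok("The safest way to respond is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The hook-based design keeps the base weights untouched, which is what makes plain activation steering easy to toggle but also easy to strip; converting the same directional intervention into a training objective, as the summary describes, bakes the behavioral change into the weights themselves.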