
Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization

Link: http://arxiv.org/abs/2505.04578v1

PDF Link: http://arxiv.org/pdf/2505.04578v1

Summary: Reinforcement learning (RL) fine-tuning transforms large language models while creating a vulnerability we experimentally verify: our experiments show that malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmful scores escalating from 0-2 to 7-9.

This attack vector particularly threatens open-source models with parameter-level access.

Existing defenses targeting supervised fine-tuning prove ineffective against RL's dynamic feedback mechanisms.

We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, establishing concise rejection patterns that render malicious reward signals ineffective.

Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize toward harmful outputs.
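
To make the intuition concrete, here is a toy Python sketch (not the paper's code) of why uniform, minimal-information rejections starve a malicious reward signal: when every rollout collapses to the same concise refusal, the rewards have no spread, so the relative advantages used by policy-gradient fine-tuning go to zero and there is nothing for the attacker to optimize. The reward function, the example responses, and the group-relative baseline below are illustrative assumptions, not the paper's exact setup.

```python
# Toy illustration (assumed, not from the paper): why uniform minimal-information
# rejections leave a malicious reward signal with no exploitable gradient.
import statistics

def malicious_reward(response: str) -> float:
    """Hypothetical attacker reward: higher score for non-refusing, detailed output."""
    # Crude proxy: refusals score 0, longer compliant text scores higher.
    if response.strip().lower().startswith("i can't help with that"):
        return 0.0
    return min(len(response) / 100.0, 10.0)

def relative_advantages(responses: list[str]) -> list[float]:
    """Reward minus the group mean, as in group-baseline policy-gradient updates."""
    rewards = [malicious_reward(r) for r in responses]
    baseline = statistics.mean(rewards)
    return [r - baseline for r in rewards]

# A standard model mixes refusals and partial answers, so rewards spread out
# and the attacker gets a non-zero direction to climb.
varied = [
    "I can't help with that request.",
    "Here is a general overview of the topic ...",
    "Step one would be to ...",
]

# A reward-neutralized model collapses to one concise rejection pattern,
# so every rollout scores identically and the advantages vanish.
neutralized = ["I can't help with that request."] * 3

print(relative_advantages(varied))       # non-zero advantages -> exploitable signal
print(relative_advantages(neutralized))  # all zeros -> nothing to optimize toward
```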

Experiments validate that our approach maintains low harmful scores (no greater than 2) after 200 attack steps, while standard models rapidly deteriorate.

This work provides the first constructive proof that robust defense against increasingly accessible RL attacks is achievable, addressing a critical security gap for open-weight models.

Published on arXiv on: 2025-05-07T17:18:48Z