
Finetuning-Activated Backdoors in LLMs

Link: http://arxiv.org/abs/2505.16567v1

PDF Link: http://arxiv.org/pdf/2505.16567v1

Summary: Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements.

Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors.

In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users.

To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models.

At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning.

As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior.
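To make the training setup concrete, the sketch below illustrates one possible FAB-style meta-learning step: simulate the user's benign finetuning with a differentiable inner update, score the backdoor objective on the simulated finetuned model, and regularize the released model to stay benign and capable. The tiny stand-in network, single SGD inner step, loss weights, and data batches are all illustrative assumptions, not the paper's actual models or objectives.

```python
# Hypothetical sketch of one FAB-style meta-learning step (PyTorch).
# Assumptions: a toy network stands in for the LLM, user finetuning is
# simulated by a single differentiable SGD step, and all batches/weights
# are placeholders rather than the paper's real setup.
import torch
import torch.nn as nn
from torch.func import functional_call

# Stand-in "LLM": a tiny network so the sketch runs end to end.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

def loss_fn(params, x, y):
    """Cross-entropy of the model evaluated under a given parameter dict."""
    logits = functional_call(model, params, (x,))
    return nn.functional.cross_entropy(logits, y)

params = dict(model.named_parameters())

# Illustrative batches: benign finetuning data (simulated user data),
# backdoor-target data, and clean data for the benign-behavior regularizer.
x_ft, y_ft = torch.randn(8, 16), torch.randint(0, 16, (8,))
x_bd, y_bd = torch.randn(8, 16), torch.randint(0, 16, (8,))
x_cl, y_cl = torch.randn(8, 16), torch.randint(0, 16, (8,))

inner_lr, lam = 1e-2, 1.0  # placeholder inner step size and regularizer weight
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

opt.zero_grad()
# 1) Simulate the user's benign finetuning with one differentiable SGD step,
#    keeping the graph so gradients flow back to the released parameters.
ft_loss = loss_fn(params, x_ft, y_ft)
grads = torch.autograd.grad(ft_loss, list(params.values()), create_graph=True)
finetuned = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}

# 2) Backdoor objective: the *simulated finetuned* model should exhibit the
#    target behavior (represented here by fitting the backdoor batch).
backdoor_loss = loss_fn(finetuned, x_bd, y_bd)

# 3) Regularizer: the released (pre-finetuning) model must stay benign and
#    retain its general capabilities on clean data.
clean_loss = loss_fn(params, x_cl, y_cl)

(backdoor_loss + lam * clean_loss).backward()
opt.step()
```

In this framing, the adversary optimizes the released weights so that an ordinary finetuning step pushes the model toward the hidden behavior, while the pre-finetuning model shows nothing unusual; the single inner step here is only a stand-in for whatever simulated finetuning procedure the attack actually uses.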

We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability.

Additionally, we show that FAB backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler).

Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.

Published on arXiv on: 2025-05-22T11:59:44Z