
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning

Link: http://arxiv.org/abs/2411.19335v1

PDF Link: http://arxiv.org/pdf/2411.19335v1

Summary: Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising paradigm for privacy-preserving and efficient adaptation of Pre-trained Language Models (PLMs) in Federated Learning (FL) settings.

It preserves data privacy by keeping the data decentralized and training the model on local devices, ensuring that raw data never leaves the user's device.

Moreover, the integration of PEFT methods such as LoRA significantly reduces the number of trainable parameters compared to fine-tuning the entire model, thereby minimizing communication costs and computational overhead.
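To make the parameter savings concrete, here is a minimal sketch of a LoRA-style adapter around a single linear layer, assuming PyTorch; the class name `LoRALinear` and the rank/scaling values are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: only these r * (in + out) parameters are trained.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4%}")
```

With rank 8 on a 4096x4096 layer, the trainable factors account for roughly 0.4% of the layer's parameters, consistent with the sub-1% regime the paper evaluates.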

Despite its potential, the security implications of FedPEFT remain underexplored.

This paper introduces a novel security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA), which exposes how PEFT can be exploited as an attack vector to circumvent PLMs' safety alignment and generate harmful content in response to malicious prompts.

Our evaluation of PaaA reveals that with less than 1% of the model's parameters set as trainable, and a small subset of clients acting maliciously, the attack achieves an approximate 80% attack success rate using representative PEFT methods such as LoRA.
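The mechanism can be pictured as a standard FedAvg round over flattened adapter updates. The sketch below is a hypothetical simulation (client counts, dimensions, and update magnitudes are illustrative, not the paper's experimental setup), showing how plain averaging passes a shared malicious direction straight into the global adapters:

```python
import numpy as np

# One FedPEFT round: clients send LoRA adapter deltas, the server averages them.
# In PaaA, malicious clients obtain their deltas by fine-tuning the adapters
# toward harmful behavior; here that is abstracted as a shared attack direction.
rng = np.random.default_rng(0)
dim = 65_536                      # flattened adapter parameters (illustrative size)
n_clients, n_malicious = 10, 2

benign = [rng.normal(0.0, 0.01, dim) for _ in range(n_clients - n_malicious)]
attack_dir = rng.normal(0.0, 1.0, dim)
attack_dir /= np.linalg.norm(attack_dir)
malicious = [0.5 * attack_dir + rng.normal(0.0, 0.01, dim) for _ in range(n_malicious)]

# Plain FedAvg applies no defense: the poisoned direction survives averaging.
global_delta = np.mean(benign + malicious, axis=0)
print("alignment with attack direction:", float(attack_dir @ global_delta))
```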

To mitigate this threat, we further investigate potential defense strategies, including Robust Aggregation Schemes (RASs) and Post-PEFT Safety Alignment (PPSA).

However, our empirical analysis highlights the limitations of these defenses, i.e., even the most advanced RASs, such as DnC and ClippedClustering, struggle to defend against PaaA in scenarios with highly heterogeneous data distributions.
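For intuition about what a robust aggregation scheme does and why heterogeneity hurts it, here is a minimal norm-clipping sketch; this is a deliberately simplified stand-in, not the DnC or ClippedClustering algorithm (those additionally score outliers via SVD and cluster updates, respectively):

```python
import numpy as np

def clipped_mean(updates):
    """Clip each client update to the median norm, then average.

    A simplified stand-in for robust aggregation. The intuition behind the
    reported failure mode: under highly non-IID data, benign updates themselves
    vary widely in norm and direction, so norm- or similarity-based filtering
    can no longer cleanly separate malicious adapters from legitimate ones.
    """
    norms = np.array([np.linalg.norm(u) for u in updates])
    tau = np.median(norms)                                  # robust clipping threshold
    clipped = [u * min(1.0, tau / n) for u, n in zip(updates, norms)]
    return np.mean(clipped, axis=0)

rng = np.random.default_rng(0)
updates = [rng.normal(0.0, 0.01, 1024) for _ in range(8)]   # benign deltas
updates += [rng.normal(0.0, 0.5, 1024) for _ in range(2)]   # large malicious deltas
print("aggregate norm:", float(np.linalg.norm(clipped_mean(updates))))
```

Clipping bounds each client's influence by the median norm, which neutralizes oversized malicious updates in this toy IID setting; the paper's point is that this separation breaks down once benign updates are heterogeneous.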

Similarly, while PPSA can reduce attack success rates to below 10%, it severely degrades the model's accuracy on the target task.

Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure security and maintain the performance of the FedPEFT paradigm.

Published on arXiv on: 2024-11-28T19:05:01Z