
LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

Link: http://arxiv.org/abs/2504.10185v1

PDF Link: http://arxiv.org/pdf/2504.10185v1

Summary: Large language model (LLM) unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility.

Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized assessment of unlearning performance and method comparison.

Despite their usefulness, we uncover, for the first time, a coreset effect within these benchmarks.

Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random.

This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime.
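
To make the setup concrete, here is a minimal sketch of what random coreset selection could look like; the function name, data representation, and the 5% fraction are illustrative, not the paper's exact implementation.

```python
import random

def random_coreset(forget_set, fraction=0.05, seed=0):
    """Randomly sample a small 'coreset' from the full forget set.

    The paper reports that unlearning on as little as ~5% of the
    forget set, chosen uniformly at random, matches unlearning on
    the full set. Names and defaults here are illustrative.
    """
    rng = random.Random(seed)
    k = max(1, int(len(forget_set) * fraction))
    return rng.sample(forget_set, k)

# Illustrative usage: forget_set stands in for a benchmark forget split.
forget_set = [f"example-{i}" for i in range(1000)]
coreset = random_coreset(forget_set, fraction=0.05)
print(len(coreset))  # -> 50
```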

We demonstrate that this coreset effect remains strong regardless of the LLM unlearning method used, including NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the most popular methods in these benchmarks.
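
For reference, below is a minimal PyTorch sketch of the NPO objective in its commonly cited form, -(2/β)·E[log σ(-β·log(π_θ(y|x)/π_ref(y|x)))], applied to a forget batch; the tensor names and β value are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO loss on per-example sequence log-likelihoods.

    logp_theta: log pi_theta(y|x) under the model being unlearned;
    logp_ref: the same under a frozen reference (pre-unlearning) model.
    Minimizing this pushes forget-set likelihood below the reference's.
    """
    log_ratio = logp_theta - logp_ref  # log(pi_theta / pi_ref)
    # -(2/beta) * mean(log sigmoid(-beta * log_ratio))
    return -(2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()

# Illustrative call with dummy per-example log-likelihoods.
loss = npo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-10.1, -9.5]))
```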

The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches.

We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness, indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset.
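
The abstract does not specify how keywords are extracted, but a generic TF-IDF ranking illustrates the idea of isolating high-impact forget-set terms; the scikit-learn helper below is an assumed stand-in, not the authors' pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(forget_docs, k=20):
    """Rank forget-set terms by aggregate TF-IDF weight.

    A generic proxy for the paper's keyword extraction step: the
    highest-weight terms model the compact set of high-impact
    tokens that appear to drive unlearning.
    """
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(forget_docs)
    scores = tfidf.sum(axis=0).A1  # total weight per vocabulary term
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda t: -t[1])
    return [term for term, _ in ranked[:k]]
```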

We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks.
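
Mode connectivity is commonly probed by interpolating between two checkpoints and checking that metrics stay stable along the path; the sketch below assumes two PyTorch models with identical architectures (e.g., coreset-unlearned vs. full-set-unlearned) and is illustrative only.

```python
import copy
import torch

@torch.no_grad()
def interpolate_models(model_a, model_b, alpha):
    """Linearly interpolate two checkpoints' weights.

    Sweeping alpha in [0, 1] and evaluating unlearning metrics at
    each point is one standard linear-mode-connectivity probe.
    """
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict(
        {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    )
    return merged
```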

Code is available at https://github.com/OPTML-Group/MU-Coreset.

Published on arXiv on: 2025-04-14T12:38:37Z