Link: http://arxiv.org/abs/2502.11411v1
PDF Link: http://arxiv.org/pdf/2502.11411v1
Summary: Large language models (LLMs) are vulnerable to unsafe training data: even small amounts of unsafe data can lead to harmful model behaviors.
Detecting and filtering such unsafe training data is essential for trustworthy model development.
Current state-of-the-art (SOTA) approaches typically rely on training moderation classifiers, which require significant computational overhead and are limited to predefined taxonomies, making them less adaptable to evolving safety concerns.
Moreover, these classifiers lack insight into the training process, limiting their effectiveness in filtering unsafe data.
To address these limitations, we propose DABUF, which leverages data attribution to detect and filter unsafe training data by attributing harmful model outputs to influential training data points.
DABUF enables flexible identification of various unsafe data types without predefined taxonomies.
In practice, however, model outputs can be complex, combining safe linguistic features with unsafe content, which reduces attribution accuracy.
In such cases, DABUF integrates moderation classifiers to identify a minimal subset of unsafe training data for targeted attribution (e.g., for jailbreaking data).
When model outputs are relatively straightforward, DABUF uses the model outputs directly as the attribution targets.
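To make the attribution idea concrete, below is a minimal, illustrative sketch (not the paper's actual DABUF implementation) of gradient-similarity-based data attribution in PyTorch: each training example is scored by the dot product between its loss gradient and the aggregated gradient of the unsafe attribution targets, and the highest-scoring examples are flagged for filtering. The TracIn-style single-checkpoint scoring, the toy linear model, and all names used here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def flat_grad(model, loss):
    # Flatten the gradient of `loss` w.r.t. all model parameters into one vector.
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])

def attribution_scores(model, loss_fn, train_set, target_set):
    # Score each training example by the similarity between its loss gradient
    # and the aggregated gradient over the unsafe attribution targets.
    # Higher score => more influence on the unsafe target behavior.
    target_grad = sum(flat_grad(model, loss_fn(model(x), y)) for x, y in target_set)
    return [torch.dot(flat_grad(model, loss_fn(model(x), y)), target_grad).item()
            for x, y in train_set]

# Toy usage with synthetic data (placeholders for an LLM and its training corpus).
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
train_set = [(torch.randn(8), torch.randn(1)) for _ in range(20)]
target_set = [(torch.randn(8), torch.randn(1)) for _ in range(3)]  # "unsafe" targets

scores = attribution_scores(model, loss_fn, train_set, target_set)
# Flag the most influential training points as candidate unsafe data,
# then retrain on the remaining data.
flagged = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
```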
We evaluate DABUF on two tasks: filtering jailbreaking training data, and identifying and mitigating gender bias.
DABUF outperforms SOTA approaches by up to 7.5% in detection AUPRC in jailbreaking scenarios, and by 44.1% in detecting gender bias.
Moreover, retraining on DABUF-filtered data leads to higher model safety across experiments, underscoring its versatility in addressing a broad spectrum of unsafe data issues.
Published on arXiv on: 2025-02-17T03:50:58Z