
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Link: http://arxiv.org/abs/2507.05660v1

PDF Link: http://arxiv.org/pdf/2507.05660v1

Summary: Recent advances in foundation models, such as LLMs, have revolutionized conversational AI.

Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets.

However, mitigating toxicity during this customization, especially when dealing with untrusted training data, remains a significant challenge.

To address this, we introduce TuneShield, a defense framework designed to mitigate toxicity during chatbot fine-tuning while preserving conversational quality.

TuneShield leverages LLM-based toxicity classification, utilizing the instruction-following capabilities and safety alignment of LLMs to effectively identify toxic samples, outperforming industry API services.
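
As a rough illustration only (the abstract does not give the framework's actual interface), prompt-based toxicity classification with an instruction-following LLM might look like the sketch below; the prompt wording and the `llm_complete` callable are assumptions made for the example.

```python
# Sketch of LLM-based toxicity classification, assuming an instruction-following
# LLM exposed through a simple text-completion callable `llm_complete`.
from typing import Callable

TOXICITY_PROMPT = (
    "You are a content-safety assistant. Decide whether the following "
    "chatbot response is toxic (insulting, hateful, harassing, or unsafe).\n"
    "Conversation context: {context}\n"
    "Response: {response}\n"
    "Answer with a single word: TOXIC or SAFE."
)

def is_toxic(context: str, response: str,
             llm_complete: Callable[[str], str]) -> bool:
    """Return True if the LLM labels the response as toxic."""
    prompt = TOXICITY_PROMPT.format(context=context, response=response)
    verdict = llm_complete(prompt).strip().upper()
    return verdict.startswith("TOXIC")

# Usage: flag samples in a fine-tuning set of (context, response) pairs.
# toxic_samples = [(c, r) for c, r in dataset if is_toxic(c, r, llm_complete)]
```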

TuneShield generates synthetic conversation samples, termed 'healing data', based on the identified toxic samples, using them to mitigate toxicity while reinforcing desirable behavior during fine-tuning.
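
A minimal sketch of how healing data could be synthesized from the flagged samples follows; the prompt text and helper names are illustrative assumptions, not the paper's confirmed implementation.

```python
# Sketch of healing-data generation: for each flagged toxic sample, ask the
# LLM for a safe, on-topic replacement response and keep the resulting
# (context, healed_response) pair for fine-tuning. `llm_complete` is the same
# assumed text-completion callable as in the classification sketch.
from typing import Callable, Iterable

HEALING_PROMPT = (
    "The following chatbot response was flagged as toxic.\n"
    "Conversation context: {context}\n"
    "Toxic response: {response}\n"
    "Write a polite, helpful response to the same context instead."
)

def generate_healing_data(toxic_samples: Iterable[tuple[str, str]],
                          llm_complete: Callable[[str], str]) -> list[tuple[str, str]]:
    """Replace each toxic response with a synthesized safe response."""
    healing = []
    for context, response in toxic_samples:
        prompt = HEALING_PROMPT.format(context=context, response=response)
        healed = llm_complete(prompt).strip()
        healing.append((context, healed))
    return healing

# The healing pairs are mixed into the fine-tuning set in place of the toxic
# samples, reinforcing the desired behavior during fine-tuning.
```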

It performs an alignment process to further nudge the chatbot towards producing desired responses.
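
The abstract does not specify the alignment method; one plausible instantiation, sketched below under that assumption, is preference optimization (e.g., a DPO-style step) over pairs that prefer the healed response to the original toxic one.

```python
# Sketch: build preference pairs that favor healed responses over the original
# toxic ones; such pairs could feed a DPO-style preference-optimization step.
# This is an assumed instantiation, not the paper's confirmed method.
from typing import NamedTuple

class PreferencePair(NamedTuple):
    prompt: str        # conversation context
    chosen: str        # healed (safe) response
    rejected: str      # original toxic response

def build_preference_pairs(toxic_samples, healing_data):
    """Pair each toxic response with its healed counterpart for alignment."""
    return [
        PreferencePair(prompt=ctx, chosen=healed, rejected=toxic)
        for (ctx, toxic), (_, healed) in zip(toxic_samples, healing_data)
    ]
```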

Our findings show that TuneShield effectively mitigates toxicity injection attacks while preserving conversational quality, even when the toxicity classifiers are imperfect or biased.

TuneShield proves to be resilient against adaptive adversarial and jailbreak attacks.

Additionally, TuneShield demonstrates effectiveness in mitigating adaptive toxicity injection attacks during dialog-based learning (DBL).

Published on arXiv on: 2025-07-08T04:40:09Z