Link: http://arxiv.org/abs/2508.15648v1
PDF Link: http://arxiv.org/pdf/2508.15648v1
Summary: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation.
In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators.
This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities.
To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement.
Our method does not require any additional annotated data or external models during the training phase.
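The abstract gives no implementation details, but the core idea can be illustrated with a minimal sketch: the same model is prompted once as a generator and once as a discriminator, and its own safety verdict is turned into a scalar reward for a policy-gradient update. The callables generate_fn and discriminate_fn and the +/-1 reward scheme are assumptions for illustration, not the authors' implementation.

    # Minimal sketch (not the authors' released code): using the model's own
    # discrimination verdict as a reward signal for safety-oriented RL.
    from typing import Callable, List, Tuple

    def self_discrimination_reward(
        prompt: str,
        response: str,
        discriminate_fn: Callable[[str, str], bool],
    ) -> float:
        # Hypothetical interface: discriminate_fn returns True if the model,
        # prompted as a judge, deems the (prompt, response) pair harmful.
        return -1.0 if discriminate_fn(prompt, response) else 1.0

    def collect_rl_batch(
        prompts: List[str],
        generate_fn: Callable[[str], str],
        discriminate_fn: Callable[[str, str], bool],
    ) -> List[Tuple[str, str, float]]:
        # Roll out responses and score each one with the model's own judgment,
        # yielding (prompt, response, reward) triples for a policy-gradient
        # step (e.g., PPO-style) in an outer training loop.
        batch = []
        for prompt in prompts:
            response = generate_fn(prompt)
            reward = self_discrimination_reward(prompt, response, discriminate_fn)
            batch.append((prompt, response, reward))
        return batch

Because the generator and the discriminator are the same model, this setup needs no external reward model or extra annotated data, consistent with the claim above.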
Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks.
By aligning LLMs' discrimination and generation capabilities, SDGO yields robust performance against out-of-distribution (OOD) jailbreaking attacks.
This alignment couples the two capabilities more tightly, so the model's generation capability can be further enhanced with only a small number of discriminative samples.
Our code and datasets are available at https://github.com/NJUNLP/SDGO.
Published on arXiv on: 2025-08-21T15:26:09Z