Link: http://arxiv.org/abs/2505.19766v1
PDF Link: http://arxiv.org/pdf/2505.19766v1
Summary: Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect.
Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks.
Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety.
We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns.
SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals.
SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
Published on arXiv on: 2025-05-26T09:49:43Z
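
The sketch below is illustrative only and not the paper's implementation: it shows the general shape of a specification-guided moderation pipeline as described in the summary, i.e. a user-defined specification, synthetic labeled examples generated against that spec, and a lightweight classifier trained as the filter. The `Specification` schema, the `generate_labeled_examples` stub, and the scikit-learn classifier are all assumptions made for this example.

```python
# Illustrative sketch only -- not SGM's actual implementation.
# Shape of the pipeline: spec -> synthetic labeled data -> moderation filter.
from dataclasses import dataclass
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


@dataclass
class Specification:
    """A deployment-specific rule the filter should enforce (hypothetical schema)."""
    name: str
    rule: str  # e.g. "Responses must not give medical dosage advice."


def generate_labeled_examples(spec: Specification) -> list[tuple[str, int]]:
    """Placeholder for automated data generation (e.g., prompting an LLM to
    produce compliant and violating responses for the given spec).
    Hard-coded here so the sketch runs without any external API."""
    return [
        ("Here is a general overview of the topic.", 0),           # compliant
        ("Take 500 mg every four hours for best results.", 1),     # violating
        ("I can't give dosage advice; please consult a doctor.", 0),
        ("Double the dose if symptoms persist.", 1),
    ]


spec = Specification(
    name="no-dosage-advice",
    rule="Responses must not give medical dosage advice.",
)
texts, labels = zip(*generate_labeled_examples(spec))

# Train a simple text classifier to act as the moderation filter.
filter_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
filter_model.fit(texts, labels)

# Classify a new response against the spec (output quality depends on the toy data).
print(filter_model.predict(["Take two tablets every hour."]))
```

In practice the data-generation step would be driven by an LLM conditioned on the specification rather than hard-coded examples, and the filter could be any classifier suited to the deployment; this sketch only conveys the overall structure.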