Link: http://arxiv.org/abs/2508.20325v1
PDF Link: http://arxiv.org/pdf/2508.20325v1
Summary: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns.
In response, governments have issued ethics guidelines to promote the development of trustworthy AI.
However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance.
To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence.
To implement this, GUARD automatically generates guideline-violating questions from government-issued guidelines, thereby testing whether responses comply with these guidelines.
When responses directly violate guidelines, GUARD reports inconsistencies.
Furthermore, for responses that do not directly violate guidelines, GUARD incorporates the concept of "jailbreaks" into its diagnostics; this component, named GUARD-JD, creates scenarios that provoke unethical or guideline-violating responses, effectively identifying scenarios that could bypass built-in safety mechanisms.
The method culminates in a compliance report that delineates the extent of adherence and highlights any violations.
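The summary above does not give implementation details, but the described workflow can be pictured as a simple loop: generate guideline-violating questions, check each response for a direct violation, and escalate non-violating cases to a role-play jailbreak diagnostic. The following minimal Python sketch illustrates that flow under stated assumptions; every name here (run_guard, generate_violating_questions, query_model, judge_compliance, wrap_in_roleplay) is a hypothetical placeholder, not the paper's actual API.

```python
# Hypothetical sketch of a GUARD-style compliance-testing loop.
# All component functions are passed in as callables; none of these names
# come from the paper itself.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ComplianceReport:
    guideline: str
    violations: List[str] = field(default_factory=list)        # direct guideline violations
    jailbreak_findings: List[str] = field(default_factory=list)  # violations only under role-play


def run_guard(
    guideline: str,
    generate_violating_questions: Callable[[str], List[str]],  # guideline -> test questions
    query_model: Callable[[str], str],                         # prompt -> model response
    judge_compliance: Callable[[str, str], bool],              # (guideline, response) -> True if it violates
    wrap_in_roleplay: Callable[[str], str],                    # question -> jailbreak-style role-play prompt
) -> ComplianceReport:
    report = ComplianceReport(guideline=guideline)
    for question in generate_violating_questions(guideline):
        response = query_model(question)
        if judge_compliance(guideline, response):
            # Direct violation: record the inconsistency.
            report.violations.append(question)
        else:
            # No direct violation: escalate to a role-play jailbreak diagnostic (GUARD-JD-style).
            jailbreak_response = query_model(wrap_in_roleplay(question))
            if judge_compliance(guideline, jailbreak_response):
                report.jailbreak_findings.append(question)
    return report
```

In this sketch the final ComplianceReport plays the role of the compliance report described above, separating responses that violate a guideline outright from those that only do so once a role-play jailbreak scenario is applied.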
We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics.
Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
Published on arXiv on: 2025-08-28T00:07:10Z