Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

Link: http://arxiv.org/abs/2504.07887v1

PDF Link: http://arxiv.org/pdf/2504.07887v1

Summary: Large Language Models (LLMs) have revolutionized artificial intelligence,driving advancements in machine translation, summarization, and conversationalagents.

However, their increasing integration into critical societal domainshas raised concerns about embedded biases, which can perpetuate stereotypes andcompromise fairness.

These biases stem from various sources, includinghistorical inequalities in training data, linguistic imbalances, andadversarial manipulation.

Despite mitigation efforts, recent studies indicatethat LLMs remain vulnerable to adversarial attacks designed to elicit biasedresponses.

This work proposes a scalable benchmarking framework to evaluate LLMrobustness against adversarial bias elicitation.

Our methodology involves (i)systematically probing models with a multi-task approach targeting biasesacross various sociocultural dimensions, (ii) quantifying robustness throughsafety scores using an LLM-as-a-Judge approach for automated assessment ofmodel responses, and (iii) employing jailbreak techniques to investigatevulnerabilities in safety mechanisms.

Our analysis examines prevalent biases inboth small and large state-of-the-art models and their impact on model safety.

Additionally, we assess the safety of domain-specific models fine-tuned forcritical fields, such as medicine.

Finally, we release a curated dataset ofbias-related prompts, CLEAR-Bias, to facilitate systematic vulnerabilitybenchmarking.

Our findings reveal critical trade-offs between model size andsafety, aiding the development of fairer and more robust future languagemodels.

Published on arXiv on: 2025-04-10T16:00:59Z