
Concept-Level Explainability for Auditing & Steering LLM Responses

Link: http://arxiv.org/abs/2505.07610v1

PDF Link: http://arxiv.org/pdf/2505.07610v1

Summary: As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. One approach to steering LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model's output. Token-level attribution methods offer a promising solution, but they still struggle in text generation: they explain the presence of each output token separately rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the semantic similarity of the outputs. Unlike current token-level methods, ConceptX preserves context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias.

ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods such as TokenSHAP in both faithfulness and human alignment. In steering tasks, it boosts sentiment shift by 0.252 (versus 0.131 for random edits) and lowers attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. While prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.
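The abstract does not give implementation details, but the core idea it describes — scoring prompt concepts by how much an in-place replacement shifts the semantics of the model's response — can be sketched roughly as follows. This is a minimal illustration, not the authors' ConceptX code: the whitespace tokenization, the length-based content-word filter, the neutral replacement token, the `generate_fn` callback, and the sentence-transformers embedder are all assumptions made for the example.

```python
# Illustrative sketch of concept-level attribution via in-place token replacement.
# NOT the paper's ConceptX implementation; helper names and heuristics are assumptions.
from typing import Callable, Dict

from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedder could be used


def concept_importance(
    prompt: str,
    generate_fn: Callable[[str], str],
    replacement: str = "thing",  # neutral substitute keeps the sentence grammatical
) -> Dict[str, float]:
    """Score each candidate concept by how much replacing it shifts response semantics."""
    baseline_response = generate_fn(prompt)
    baseline_emb = _embedder.encode(baseline_response, convert_to_tensor=True)

    tokens = prompt.split()
    scores: Dict[str, float] = {}
    for i, token in enumerate(tokens):
        # Crude stand-in for selecting "semantically rich" tokens: skip short function words.
        if len(token) <= 3:
            continue
        perturbed_tokens = tokens.copy()
        perturbed_tokens[i] = replacement            # in-place replacement preserves context
        perturbed_prompt = " ".join(perturbed_tokens)

        response = generate_fn(perturbed_prompt)
        response_emb = _embedder.encode(response, convert_to_tensor=True)

        similarity = util.cos_sim(baseline_emb, response_emb).item()
        scores[token] = 1.0 - similarity             # larger semantic shift => more important
    return scores
```

Replacing a token in place, rather than deleting it, keeps the prompt grammatical, which is the context-integrity property the abstract emphasizes; steering would then edit the highest-scoring concepts to shift sentiment or reduce harmfulness.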

Published on arXiv: 2025-05-12T14:31:51Z