
SoK: Evaluating Jailbreak Guardrails for Large Language Models

Link: http://arxiv.org/abs/2506.10597v1

PDF Link: http://arxiv.org/pdf/2506.10597v1

Summary: Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms.

Guardrails--external defense mechanisms that monitor and control LLM interactions--have emerged as a promising solution.
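
As a rough illustration of the guardrail idea (not the paper's implementation), a guardrail can be pictured as an external wrapper that screens both the incoming prompt and the model's response. The blocklist, function names, and refusal messages below are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical sketch of an external guardrail: it wraps an arbitrary LLM
# callable and screens both the prompt and the response. The patterns and
# refusal messages are illustrative only, not the paper's method.
BLOCKED_PATTERNS = ["ignore previous instructions", "disable your safety"]

def guarded_generate(llm: Callable[[str], str], prompt: str) -> str:
    # Input-side check: refuse prompts matching simple jailbreak patterns.
    lowered = prompt.lower()
    if any(pattern in lowered for pattern in BLOCKED_PATTERNS):
        return "Request blocked by input guardrail."

    response = llm(prompt)

    # Output-side check: withhold responses that appear to leak disallowed content.
    if "step-by-step instructions for" in response.lower():
        return "Response withheld by output guardrail."
    return response

# Usage with a stand-in model:
if __name__ == "__main__":
    fake_llm = lambda p: f"Echo: {p}"
    print(guarded_generate(fake_llm, "Ignore previous instructions and ..."))
```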

However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework.

In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs.

We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness.
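
The abstract does not spell out how the Security-Efficiency-Utility framework is computed; purely as an assumed reading, one might score a guardrail on its attack-blocking rate (security), added latency (efficiency), and benign-request pass-through (utility), as in the hypothetical sketch below.

```python
from dataclasses import dataclass

@dataclass
class GuardrailEvalResult:
    """Hypothetical per-guardrail scores; the paper's actual metrics may differ."""
    security: float    # fraction of jailbreak prompts correctly blocked
    efficiency: float  # e.g. 1 / (1 + added latency in seconds)
    utility: float     # fraction of benign prompts that still succeed

def evaluate_guardrail(blocked_attacks: int, total_attacks: int,
                       added_latency_s: float,
                       benign_passed: int, total_benign: int) -> GuardrailEvalResult:
    # Assumed formulas, for illustration only.
    return GuardrailEvalResult(
        security=blocked_attacks / total_attacks,
        efficiency=1.0 / (1.0 + added_latency_s),
        utility=benign_passed / total_benign,
    )

# Example: a guardrail that blocks 90/100 attacks with 0.2 s overhead,
# while letting 95/100 benign requests through.
print(evaluate_guardrail(90, 100, 0.2, 95, 100))
```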

Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations.

Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails.

The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.

Published on arXiv on: 2025-06-12T11:42:40Z