Link: http://arxiv.org/abs/2505.03574v1
PDF Link: http://arxiv.org/pdf/2505.03574v1
Summary: Large language models (LLMs) have evolved from simple chatbots intoautonomous agents capable of performing complex tasks such as editingproduction code, orchestrating workflows, and taking higher-stakes actionsbased on untrusted inputs like webpages and emails.
These capabilitiesintroduce new security risks that existing security measures, such as modelfine-tuning or chatbot-focused guardrails, do not fully address.
Given thehigher stakes and the absence of deterministic solutions to mitigate theserisks, there is a critical need for a real-time guardrail monitor to serve as afinal layer of defense, and support system level, use case specific safetypolicy definition and enforcement.
We introduce LlamaFirewall, an open-sourcesecurity focused guardrail framework designed to serve as a final layer ofdefense against security risks associated with AI Agents.
Our frameworkmitigates risks such as prompt injection, agent misalignment, and insecure coderisks through three powerful guardrails: PromptGuard 2, a universal jailbreakdetector that demonstrates clear state of the art performance; Agent AlignmentChecks, a chain-of-thought auditor that inspects agent reasoning for promptinjection and goal misalignment, which, while still experimental, showsstronger efficacy at preventing indirect injections in general scenarios thanpreviously proposed approaches; and CodeShield, an online static analysisengine that is both fast and extensible, aimed at preventing the generation ofinsecure or dangerous code by coding agents.
Additionally, we includeeasy-to-use customizable scanners that make it possible for any developer whocan write a regular expression or an LLM prompt to quickly update an agent'ssecurity guardrails.
Published on arXiv on: 2025-05-06T14:34:21Z