Security Steerability is All You Need

Link: http://arxiv.org/abs/2504.19521v1

PDF Link: http://arxiv.org/pdf/2504.19521v1

Summary: The adoption of Generative AI (GenAI) in various applications inevitablycomes with expanding the attack surface, combining new security threats alongwith the traditional ones.

Consequently, numerous research and industrialinitiatives aim to mitigate these security threats in GenAI by developingmetrics and designing defenses.

However, while most of the GenAI security workfocuses on universal threats (e.

g.

manipulating the LLM to generate forbiddencontent), there is significantly less discussion on application-level securityand how to mitigate it.

Thus, in this work we adopt an application-centric approach to GenAIsecurity, and show that while LLMs cannot protect against ad-hoc applicationspecific threats, they can provide the framework for applications to protectthemselves against such threats.

Our first contribution is defining SecuritySteerability - a novel security measure for LLMs, assessing the model'scapability to adhere to strict guardrails that are defined in the system prompt('Refrain from discussing about politics').

These guardrails, in caseeffective, can stop threats in the presence of malicious users who attempt tocircumvent the application and cause harm to its providers.

Our second contribution is a methodology to measure the security steerabilityof LLMs, utilizing two newly-developed datasets: VeganRibs assesses the LLMbehavior in forcing specific guardrails that are not security per se in thepresence of malicious user that uses attack boosters (jailbreaks andperturbations), and ReverseText takes this approach further and measures theLLM ability to force specific treatment of the user input as plain text whiledo user try to give it additional meanings.

.

.

Published on arXiv on: 2025-04-28T06:40:01Z