Link: http://arxiv.org/abs/2505.14607v1
PDF Link: http://arxiv.org/pdf/2505.14607v1
Summary: User authorization-based access privileges are a key feature in many safety-critical systems, but have thus far been absent from the large language model (LLM) realm.
In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights.
sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized.
We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, and resistance to prompt-based jailbreaking attacks.
The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal.
Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.
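As a rough illustration of the query-side bias injection the abstract describes, one could imagine prepending a secret, role-dependent signal to each query before it reaches the model. This is a minimal sketch under assumptions of my own: the token scheme, the role names, and the function below are all hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of user-based bias injection (not the paper's actual
# mechanism). A model fine-tuned on (biased query, response) pairs could
# learn to emit sensitive content only when the authorized signal is present.

# Assumed: a server-side secret mapping from user role to a subtle bias
# token that end users never see and cannot forge in their own prompts.
SECRET_ROLE_TOKENS = {
    "authorized": "<|bias:alpha|>",
    "unauthorized": "<|bias:beta|>",
}

def inject_bias(query: str, user_role: str) -> str:
    """Prepend a role-dependent bias signal to the raw user query."""
    token = SECRET_ROLE_TOKENS[user_role]
    return f"{token} {query}"

# The biased query, not the raw one, is what the LLM would receive.
biased = inject_bias("How do I configure the admin panel?", "authorized")
```

Because the signal is injected server-side rather than supplied by the user, a prompt-based jailbreak cannot simply assert authorization; the model would only see the authorized bias pattern when the access-control layer grants it.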
Published on arXiv on: 2025-05-20T16:54:34Z