Link: http://arxiv.org/abs/2504.19440v1
PDF Link: http://arxiv.org/pdf/2504.19440v1
Summary: Safety and security remain critical concerns in AI deployment.
Despite safety training through reinforcement learning with human feedback (RLHF) [32], language models remain vulnerable to jailbreak attacks that bypass safety guardrails.
Universal jailbreaks - prefixes that can circumvent alignment for any payload - are particularly concerning.
We show empirically that jailbreak detection systems face distribution shift, with detectors trained at one point in time performing poorly against newer exploits.
To study this problem, we release JailbreaksOverTime, a comprehensive dataset of timestamped real user interactions containing both benign requests and jailbreak attempts collected over 10 months.
We propose a two-pronged method for defenders to detect new jailbreaks and continuously update their detectors.
First, we show how to use continuous learning to detect jailbreaks and adapt rapidly to newly emerging jailbreaks.
While detectors trained at a single point in time eventually fail due to drift, we find that universal jailbreaks evolve slowly enough for self-training to be effective.
Retraining our detection model weekly using its own labels - with no new human labels - reduces the false negative rate from 4% to 0.3% at a false positive rate of 0.1%.
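As a rough illustration of this weekly self-training idea, the sketch below retrains a detector each week on its own high-confidence predictions, with no new human labels. The scikit-learn-style classifier, feature representation, confidence thresholds, and synthetic data are all illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of weekly self-training for a jailbreak detector.
# Assumptions (not from the paper): a scikit-learn-style binary classifier,
# prompts already embedded as feature vectors, and illustrative
# confidence thresholds for pseudo-labeling.
import numpy as np
from sklearn.linear_model import LogisticRegression

def weekly_self_training(detector, weekly_batches, pos_thresh=0.9, neg_thresh=0.1):
    """Retrain the detector each week on its own high-confidence labels.

    detector       -- fitted binary classifier (1 = jailbreak, 0 = benign)
    weekly_batches -- iterable of (n_samples, n_features) arrays, one per week
    """
    X_train, y_train = [], []
    for X_week in weekly_batches:
        # Score the new week's traffic with the current detector.
        scores = detector.predict_proba(X_week)[:, 1]

        # Keep only confidently scored examples as pseudo-labels;
        # no new human labels are used.
        confident = (scores >= pos_thresh) | (scores <= neg_thresh)
        X_train.append(X_week[confident])
        y_train.append((scores[confident] >= pos_thresh).astype(int))

        # Refit on all accumulated pseudo-labeled data.
        detector.fit(np.vstack(X_train), np.concatenate(y_train))
    return detector

if __name__ == "__main__":
    # Tiny synthetic example: 2-D features, two well-separated clusters.
    rng = np.random.default_rng(0)
    X_seed = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [3, 3]], 100, axis=0)
    y_seed = np.repeat([0, 1], 100)
    detector = LogisticRegression().fit(X_seed, y_seed)  # human-labeled seed set
    weeks = [rng.normal(size=(100, 2)) + np.repeat([[0, 0], [3, 3]], 50, axis=0)
             for _ in range(4)]
    weekly_self_training(detector, weeks)
```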
Second, we introduce an unsupervised active monitoring approach to identify novel jailbreaks.
Rather than classifying inputs directly, we recognize jailbreaks by their behavior, specifically their ability to trigger models to respond to known-harmful prompts.
This approach has a higher false negative rate (4.1%) than supervised methods, but it successfully identified some out-of-distribution attacks that were missed by the continuous learning approach.
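The behavior-based check can be pictured as follows: a suspect prefix is flagged if it causes the model to answer known-harmful probe prompts it would otherwise refuse. In this sketch, the `generate` wrapper, the refusal heuristic, the probe set, and the compliance threshold are hypothetical placeholders, not the paper's implementation.

```python
# Sketch of behavior-based jailbreak screening: test whether a suspect
# prefix makes the model comply with known-harmful probe prompts.
# All names and thresholds here are illustrative assumptions.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def looks_like_refusal(response: str) -> bool:
    """Crude refusal heuristic; a real system would use a stronger judge."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def is_behavioral_jailbreak(generate, suspect_prefix, harmful_probes, min_compliance=0.5):
    """Flag a prefix as a jailbreak if it triggers answers to known-harmful
    probe prompts that the model would normally refuse.

    generate       -- callable mapping a prompt string to the model's response
    suspect_prefix -- candidate jailbreak text to screen
    harmful_probes -- list of known-harmful prompts used as probes
    """
    complied = 0
    for probe in harmful_probes:
        response = generate(f"{suspect_prefix}\n\n{probe}")
        if not looks_like_refusal(response):
            complied += 1
    return complied / len(harmful_probes) >= min_compliance
```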
Published on arXiv on: 2025-04-28T03:01:51Z