Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Link: http://arxiv.org/abs/2412.01547v1

PDF Link: http://arxiv.org/pdf/2412.01547v1

Summary: The adoption of large language models (LLMs) in many applications, fromcustomer service chat bots and software development assistants to more capableagentic systems necessitates research into how to secure these systems.

Attackslike prompt injection and jailbreaking attempt to elicit responses and actionsfrom these models that are not compliant with the safety, privacy, or contentpolicies of organizations using the model in their application.

In order tocounter abuse of LLMs for generating potentially harmful replies or takingundesirable actions, LLM owners must apply safeguards during training andintegrate additional tools to block the LLM from generating text that abusesthe model.

Jailbreaking prompts play a vital role in convincing an LLM togenerate potentially harmful content, making it important to identifyjailbreaking attempts to block any further steps.

In this work, we propose anovel approach to detect jailbreak prompts based on pairing text embeddingswell-suited for retrieval with traditional machine learning classificationalgorithms.

Our approach outperforms all publicly available methods from opensource LLM security applications.

Published on arXiv on: 2024-12-02T14:35:43Z