Link: http://arxiv.org/abs/2501.18280v1
PDF Link: http://arxiv.org/pdf/2501.18280v1
Summary: The security issue of large language models (LLMs) has gained significant attention recently, with various defense mechanisms developed to prevent harmful outputs, among which safeguards based on text embedding models serve as a fundamental defense.
Through testing, we discover that the distribution of text embedding model outputs is significantly biased, with a large mean.
Inspired by this observation, we propose novel, efficient methods to search for universal magic words that can attack text embedding models.
The universal magic words, used as suffixes, can move the embedding of any text toward the bias direction, thereby manipulating the similarity of any text pair and misleading safeguards.
By appending magic words to user prompts and requiring LLMs to end their answers with magic words, attackers can jailbreak the safeguard.
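As a minimal numerical illustration of the mechanism described in the abstract (a toy sketch, not the paper's actual magic-word search), adding a component along a shared bias direction to two otherwise unrelated embeddings drives their cosine similarity toward 1:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two nearly orthogonal "clean" embeddings (hypothetical toy vectors).
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

# A shared bias direction, standing in for the biased mean of the
# embedding distribution that a magic-word suffix pushes embeddings toward.
bias = np.array([0.0, 0.0, 1.0])

base_sim = cosine(u, v)                            # 0.0: the pair looks unrelated
attacked_sim = cosine(u + 5 * bias, v + 5 * bias)  # ~0.96: shifted pair looks near-identical

print(base_sim, attacked_sim)
```

Because safeguards that rely on embedding similarity compare such scores against a threshold, inflating the similarity of arbitrary text pairs in this way is what lets the suffix mislead them.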
To eradicate this security risk, we also propose defense mechanisms against such attacks, which can correct the biased distribution of text embeddings in a train-free manner.
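One plausible train-free correction consistent with the observation above (a sketch of the general idea; the paper's exact mechanism may differ) is to estimate the mean of the embedding distribution from a sample and subtract it before computing similarities, removing the shared bias component:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings sharing a large common mean along the first axis.
emb = np.array([
    [5.0,  1.0, 0.0],
    [5.0,  0.0, 1.0],
    [5.0, -1.0, 0.0],
])

mean = emb.mean(axis=0)   # estimate the bias from a sample of embeddings
centered = emb - mean     # train-free correction: subtract the mean

biased_sim = cosine(emb[0], emb[1])               # inflated by the shared mean
corrected_sim = cosine(centered[0], centered[1])  # reflects the actual content

print(biased_sim, corrected_sim)
```

After centering, pairwise similarities are no longer dominated by the common offset, so a suffix that merely pushes embeddings along the bias direction loses its effect.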
Published on arXiv on: 2025-01-30T11:37:40Z