Link: http://arxiv.org/abs/2505.09921v1
PDF Link: http://arxiv.org/pdf/2505.09921v1
Summary: Large Language Models (LLMs) excel in various domains but pose inherent privacy risks.
Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block.
Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored.
In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs.
Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods.
Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit the target PII.
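The abstract describes this pipeline only at a high level; the minimal Python sketch below restates those three steps as a loop for illustration. All names in it (extract_pii_entities, build_icl_context, contains_target_pii, pig_attack, and the generate/strategies arguments) are hypothetical placeholders, not code from the authors' repository, and the gradient-based update strategies are left as caller-supplied callables.

```python
# Minimal, hypothetical sketch of the three-step loop described in the abstract.
# All helper names are illustrative placeholders, not the authors' implementation.

def extract_pii_entities(query):
    """Step 1 (placeholder): detect PII entities and their types in the privacy query."""
    return [("Alice", "NAME"), ("alice@example.com", "EMAIL")]

def build_icl_context(query, entities):
    """Step 2 (placeholder): wrap the query in in-context-learning demonstrations."""
    demo = "Q: What is Bob's email address? A: bob@example.com\n"
    return demo + "Q: " + query + " A:"

def contains_target_pii(response, entities):
    """Check whether the model output reveals any of the target PII values."""
    return any(value in response for value, _ in entities)

def pig_attack(query, generate, strategies, max_iters=50):
    """Step 3: iteratively refine the privacy context with gradient-based strategies."""
    entities = extract_pii_entities(query)
    context = build_icl_context(query, entities)
    for step in range(max_iters):
        response = generate(context)                 # query the target LLM
        if contains_target_pii(response, entities):
            return response                          # target PII elicited
        update = strategies[step % len(strategies)]  # cycle through the three strategies
        context = update(context, entities)          # gradient-guided context edit
    return None                                      # attack budget exhausted
```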
We evaluate PIG and existing jailbreak methods using two privacy-related datasets.
Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results.
The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards.
Our code is available at https://github.com/redwyd/PrivacyJailbreak.
Published on arXiv on: 2025-05-15T03:11:57Z