Link: http://arxiv.org/abs/2506.19563v1
PDF Link: http://arxiv.org/pdf/2506.19563v1
Summary: Large Language Models (LLMs) are widely used in sensitive domains, including healthcare, finance, and legal services, raising concerns about potential private information leaks during inference.
Privacy extraction attacks, such as jailbreaking, expose vulnerabilities in LLMs by crafting inputs that force the models to output sensitive information.
However, these attacks cannot verify whether the extracted private information is accurate, as no public datasets exist for cross-validation, leaving a critical gap in private information detection during inference.
To address this, we propose PrivacyXray, a novel framework that detects privacy breaches by analyzing LLM inner states.
Our analysis reveals that LLMs exhibit higher semantic coherence and probabilistic certainty when generating correct private outputs.
Based on this, PrivacyXray detects privacy breaches using four metrics: intra-layer and inter-layer semantic similarity, and token-level and sentence-level probability distributions.
PrivacyXray addresses critical challenges in private information detection by overcoming the lack of open-source private datasets and eliminating reliance on external data for validation.
It achieves this through the synthesis of realistic private data and a detection mechanism based on the inner states of LLMs.
Experiments show that PrivacyXray achieves consistent performance, with an average accuracy of 92.69% across five LLMs.
Compared to state-of-the-art methods, PrivacyXray achieves significant improvements, with an average accuracy increase of 20.06%, highlighting its stability and practical utility in real-world applications.
Published on arXiv on: 2025-06-24T12:22:59Z
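
The abstract names four inner-state metrics but does not define them; the following is a minimal, hypothetical PyTorch sketch of how such signals could be computed, not the paper's actual implementation. The function name inner_state_signals, its inputs, and the specific cosine-similarity and probability formulas are assumptions chosen only to illustrate intra-layer similarity, inter-layer similarity, and token- and sentence-level probability signals over a model's hidden states and generated-token log-probabilities.

```python
import torch
import torch.nn.functional as F

def inner_state_signals(hidden_states, token_logprobs):
    """Illustrative signals over LLM inner states (assumed formulas, not the paper's).

    hidden_states: list of [seq_len, hidden_dim] tensors, one per layer
                   (e.g. obtained with output_hidden_states=True).
    token_logprobs: [gen_len] tensor of log-probabilities of the generated tokens.
    Returns a dict of scalar scores; higher values are read here as higher
    semantic coherence / probabilistic certainty.
    """
    # Intra-layer semantic similarity: average pairwise cosine similarity
    # between token representations within each layer, averaged over layers.
    intra = []
    for h in hidden_states:
        h_norm = F.normalize(h, dim=-1)
        sim = h_norm @ h_norm.T                      # [seq, seq] cosine matrix
        off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarity
        n = h.size(0)
        intra.append(off_diag / max(n * (n - 1), 1))
    intra_layer = torch.stack(intra).mean()

    # Inter-layer semantic similarity: cosine similarity of the mean token
    # representation between consecutive layers.
    means = torch.stack([h.mean(dim=0) for h in hidden_states])   # [layers, dim]
    inter_layer = F.cosine_similarity(means[:-1], means[1:], dim=-1).mean()

    # Token-level probability signal: mean per-token probability.
    token_level = token_logprobs.exp().mean()

    # Sentence-level probability signal: length-normalized sequence probability.
    sentence_level = token_logprobs.mean().exp()

    return {
        "intra_layer_similarity": intra_layer.item(),
        "inter_layer_similarity": inter_layer.item(),
        "token_level_probability": token_level.item(),
        "sentence_level_probability": sentence_level.item(),
    }

# Toy usage with random tensors standing in for real model activations.
layers = [torch.randn(12, 64) for _ in range(8)]
logprobs = torch.log(torch.rand(12).clamp(min=1e-3))
print(inner_state_signals(layers, logprobs))
```

In the paper's setting, a detector would presumably threshold or learn over such scores to decide whether a generated piece of private information is correct; the exact metric definitions and decision rule are specified in the paper itself.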