Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review

Link: http://arxiv.org/abs/2508.20863v1

PDF Link: http://arxiv.org/pdf/2508.20863v1

Summary: Large Language Models (LLMs) are increasingly being integrated into thescientific peer-review process, raising new questions about their reliabilityand resilience to manipulation.

In this work, we investigate the potential forhidden prompt injection attacks, where authors embed adversarial text within apaper's PDF to influence the LLM-generated review.

We begin by formalisingthree distinct threat models that envision attackers with different motivations-- not all of which implying malicious intent.

For each threat model, we designadversarial prompts that remain invisible to human readers yet can steer anLLM's output toward the author's desired outcome.

Using a user study withdomain scholars, we derive four representative reviewing prompts used to elicitpeer reviews from LLMs.

We then evaluate the robustness of our adversarialprompts across (i) different reviewing prompts, (ii) different commercialLLM-based systems, and (iii) different peer-reviewed papers.

Our results showthat adversarial prompts can reliably mislead the LLM, sometimes in ways thatadversely affect a "honest-but-lazy" reviewer.

Finally, we propose andempirically assess methods to reduce detectability of adversarial prompts underautomated content checks.

Published on arXiv on: 2025-08-28T14:57:04Z