
Adversarial Attacks against Neural Ranking Models via In-Context Learning

Link: http://arxiv.org/abs/2508.15283v1

PDF Link: http://arxiv.org/pdf/2508.15283v1

Summary: While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation.

In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents.

Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation.

By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content.

We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries.
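To make the distinction between the two modes concrete, the sketch below shows how a few-shot prompt could be assembled from a support set of previously observed examples. The function, class, and field names (`build_fsap_prompt`, `HarmfulExample`) are illustrative assumptions, not the authors' implementation; the paper's actual prompt templates may differ.

```python
# Illustrative sketch of few-shot prompt assembly (assumed structure, not the paper's code).
# FSAP-IntraQ draws support examples tied to the target query itself, while
# FSAP-InterQ transfers patterns from examples observed for unrelated queries.

from dataclasses import dataclass
from typing import List


@dataclass
class HarmfulExample:  # hypothetical support-set record
    query: str
    document: str


def build_fsap_prompt(target_query: str,
                      support_set: List[HarmfulExample],
                      mode: str = "inter") -> str:
    """Assemble a few-shot prompt for a black-box LLM (hypothetical helper)."""
    if mode == "intra":
        # FSAP-IntraQ: keep only examples observed for the same query.
        examples = [ex for ex in support_set if ex.query == target_query]
    else:
        # FSAP-InterQ: use examples drawn from other, unrelated queries.
        examples = [ex for ex in support_set if ex.query != target_query]

    shots = "\n\n".join(
        f"Query: {ex.query}\nDocument: {ex.document}" for ex in examples
    )
    # The completion for the final "Document:" slot is what the LLM generates.
    return f"{shots}\n\nQuery: {target_query}\nDocument:"


# Toy usage with placeholder text; the resulting prompt would be sent to an LLM API.
support = [HarmfulExample("query A", "example document A"),
           HarmfulExample("query B", "example document B")]
print(build_fsap_prompt("query C", support, mode="inter"))
```

No gradient access or model internals appear anywhere in this flow, which is what makes the attack black-box: only a prompt goes in and a generated document comes out.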

Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents.

Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems.

FSAP also generalizes effectively across both proprietary and open-source LLMs.

Published on arXiv on: 2025-08-21T06:19:00Z