It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Link: http://arxiv.org/abs/2506.02873v1

PDF Link: http://arxiv.org/pdf/2506.02873v1

Summary: Persuasion is a powerful capability of large language models (LLMs) that bothenables beneficial applications (e.

g.

helping people quit smoking) and raisessignificant risks (e.

g.

large-scale, targeted political manipulation).

Priorwork has found models possess a significant and growing persuasive capability,measured by belief changes in simulated or real users.

However, thesebenchmarks overlook a crucial risk factor: the propensity of a model to attemptto persuade in harmful contexts.

Understanding whether a model will blindly``follow orders'' to persuade on harmful topics (e.

g.

glorifying joining aterrorist group) is key to understanding the efficacy of safety guardrails.

Moreover, understanding if and when a model will engage in persuasive behaviorin pursuit of some goal is essential to understanding the risks from agentic AIsystems.

We propose the Attempt to Persuade Eval (APE) benchmark, that shiftsthe focus from persuasion success to persuasion attempts, operationalized as amodel's willingness to generate content aimed at shaping beliefs or behavior.

Our evaluation framework probes frontier LLMs using a multi-turn conversationalsetup between simulated persuader and persuadee agents.

APE explores a diversespectrum of topics including conspiracies, controversial issues, andnon-controversially harmful content.

We introduce an automated evaluator modelto identify willingness to persuade and measure the frequency and context ofpersuasive attempts.

We find that many open and closed-weight models arefrequently willing to attempt persuasion on harmful topics and thatjailbreaking can increase willingness to engage in such behavior.

Our resultshighlight gaps in current safety guardrails and underscore the importance ofevaluating willingness to persuade as a key dimension of LLM risk.

APE isavailable at github.

com/AlignmentResearch/AttemptPersuadeEval

Published on arXiv on: 2025-06-03T13:37:51Z