
sudo rm -rf agentic_security

Link: http://arxiv.org/abs/2503.20279v1

PDF Link: http://arxiv.org/pdf/2503.20279v1

Summary: Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments.

While this evolution greatly expands practical use cases for humans, it also creates serious security exposures.

We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude Computer Use.

The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution.
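To make the three stages concrete, here is a minimal sketch of the Detox2Tox control flow as the abstract describes it. All helper names (detoxify, query_vlm, toxify) are hypothetical placeholders standing in for the paper's components, not the framework's real code.

```python
# Illustrative sketch only: string templates stand in for real model calls.

def detoxify(harmful_request: str) -> str:
    """Stage 1: rephrase the request so a refusal-trained agent accepts it."""
    return f"[benign paraphrase of: {harmful_request}]"

def query_vlm(benign_request: str) -> str:
    """Stage 2: ask an advanced VLM for detailed step-by-step instructions."""
    return f"[detailed UI instructions for: {benign_request}]"

def toxify(instructions: str, original_request: str) -> str:
    """Stage 3: reintroduce the malicious content just before execution."""
    return f"[instructions re-targeted to: {original_request}]\n{instructions}"

def detox2tox(harmful_request: str) -> str:
    """Chain the three stages into the payload that reaches the agent."""
    benign = detoxify(harmful_request)
    instructions = query_vlm(benign)
    return toxify(instructions, harmful_request)
```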

Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on built-in refusal feedback, making it increasingly effective against robust policy filters.
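The refinement loop can be sketched the same way: a hypothetical driver that retries the pipeline whenever the agent refuses, using the refusal text as feedback. Again, every name is a placeholder (it reuses the detox2tox sketch above), not the framework's actual interface.

```python
# Hypothetical refusal-feedback loop; stubs simulate the agent interaction.

def execute_on_agent(payload: str) -> str:
    """Stub agent call; a real run would drive a computer-use agent."""
    return "I can't help with that."

def is_refusal(response: str) -> bool:
    """Crude stand-in for the framework's refusal detector."""
    return "can't" in response.lower()

def refine(payload: str, refusal: str) -> str:
    """Adjust the payload using the refusal text as feedback."""
    return f"[payload adjusted around refusal: {refusal}]\n{payload}"

def run_attack(task: str, max_rounds: int = 5) -> bool:
    payload = detox2tox(task)              # pipeline from the sketch above
    for _ in range(max_rounds):
        response = execute_on_agent(payload)
        if not is_refusal(response):
            return True                    # no refusal: the attack went through
        payload = refine(payload, response)
    return False
```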

In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24% (with no refinement), rising to 41% with iterative refinement, against Claude Computer Use.

By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards.

WARNING: This paper includes harmful or offensive model outputs.

Published on arXiv on: 2025-03-26T07:08:15Z