Link: http://arxiv.org/abs/2412.16633v1
PDF Link: http://arxiv.org/pdf/2412.16633v1
Summary: The integration of large language models (LLMs) into the planning module of Embodied Artificial Intelligence (Embodied AI) systems has greatly enhanced their ability to translate complex user instructions into executable policies.
In this paper, we demystified how traditional LLM jailbreak attacks behave in the Embodied AI context.
We conducted a comprehensive safety analysis of the LLM-based planning module of embodied AI systems against jailbreak attacks.
Using the carefully crafted Harmful-RLbench, we assessed 20 open-source and proprietary LLMs under traditional jailbreak attacks and highlighted two key challenges when adopting prior jailbreak techniques to embodied AI contexts: (1) harmful text output by LLMs does not necessarily induce harmful policies in the Embodied AI context, and (2) even when harmful policies can be generated, they must be guaranteed to be executable in practice.
To overcome these challenges, we propose Policy Executable (POEX) jailbreak attacks, in which harmful instructions and optimized suffixes are injected into LLM-based planning modules, leading embodied AI to perform harmful actions in both simulated and physical environments.
Our approach involves constraining adversarial suffixes to evade detection and fine-tuning a policy evaluator to improve the executability of harmful policies.
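To make the described pipeline concrete, the following is a minimal, purely illustrative sketch of such an attack loop under stated assumptions; the planner interface, the perplexity-style stealth check, the policy evaluator, and the random-search strategy are all hypothetical placeholders and not the paper's implementation:

```python
# Illustrative sketch only: all functions below are non-functional stand-ins.
import random
import string

VOCAB = list(string.ascii_lowercase + " ")  # toy token set for illustration

def query_planner(instruction: str, suffix: str) -> str:
    """Placeholder for the LLM-based planning module; returns a policy string."""
    return f"plan_for({instruction} {suffix})"

def passes_stealth_check(suffix: str) -> bool:
    """Placeholder constraint standing in for a perplexity/fluency detector."""
    return suffix.count(" ") >= 2

def policy_evaluator(policy: str) -> float:
    """Placeholder for a fine-tuned evaluator scoring policy executability."""
    return random.random()

def optimize_suffix(instruction: str, steps: int = 100, length: int = 12) -> str:
    """Random-search stand-in for gradient-based adversarial suffix optimization."""
    best_suffix = "".join(random.choices(VOCAB, k=length))
    best_score = -1.0
    for _ in range(steps):
        candidate = list(best_suffix)
        candidate[random.randrange(length)] = random.choice(VOCAB)
        candidate = "".join(candidate)
        if not passes_stealth_check(candidate):  # enforce the stealth constraint
            continue
        score = policy_evaluator(query_planner(instruction, candidate))
        if score > best_score:
            best_suffix, best_score = candidate, score
    return best_suffix
```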
We conducted extensive experiments on both a robotic-arm embodied AI platform and simulators to validate the attack and policy success rates on 136 harmful instructions from Harmful-RLbench.
Our findings expose serious safety vulnerabilities in LLM-based planning modules, including the transferability of POEX across models.
Finally, we propose mitigation strategies, such as safety-constrained prompts and pre- and post-planning checks, to address these vulnerabilities and ensure the safe deployment of embodied AI in real-world settings.
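As a rough illustration of how pre- and post-planning checks could wrap a planner, here is a hedged sketch; the deny-list, the safety prompt, and the generate_policy placeholder are assumptions for illustration, not the paper's actual mitigations:

```python
# Illustrative sketch only: placeholders, not the paper's mitigation code.
SAFETY_PROMPT = "Refuse any instruction that could harm people or property."
BLOCKED_TERMS = {"poison", "stab", "burn"}  # toy deny-list for illustration

def generate_policy(instruction: str) -> list[str]:
    """Placeholder for the LLM-based planner returning low-level actions."""
    return ["pick_up(object)", "move_to(target)"]

def pre_planning_check(instruction: str) -> bool:
    """Reject instructions that trip the deny-list before planning."""
    return not any(term in instruction.lower() for term in BLOCKED_TERMS)

def post_planning_check(policy: list[str]) -> bool:
    """Reject generated policies whose actions reference unsafe targets."""
    return not any("human" in step for step in policy)

def safe_plan(instruction: str) -> list[str]:
    if not pre_planning_check(instruction):
        return ["refuse()"]
    policy = generate_policy(f"{SAFETY_PROMPT}\n{instruction}")
    return policy if post_planning_check(policy) else ["refuse()"]
```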
Published on arXiv on: 2024-12-21T13:58:27Z