arxiv papers 1 min read

FLAME: Flexible LLM-Assisted Moderation Engine

Link: http://arxiv.org/abs/2502.09175v1

PDF Link: http://arxiv.org/pdf/2502.09175v1

Summary: The rapid advancement of Large Language Models (LLMs) has introduced significant challenges in moderating user-model interactions.

While LLMs demonstrate remarkable capabilities, they remain vulnerable to adversarial attacks, particularly "jailbreaking" techniques that bypass content safety measures.

Current content moderation systems, which primarily rely on input prompt filtering, have proven insufficient, with techniques like Best-of-N (BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.

In this paper, we introduce the Flexible LLM-Assisted Moderation Engine (FLAME): a new approach that shifts the focus from input filtering to output moderation.

Unlike traditional circuit-breaking methods that analyze user queries, FLAME evaluates model responses, offering several key advantages: (1) computational efficiency in both training and inference, (2) enhanced resistance to BoN jailbreaking attacks, and (3) flexibility in defining and updating safety criteria through customizable topic filtering.
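The output-moderation flow can be pictured with a minimal sketch. This is an illustrative assumption, not the paper's actual implementation: the names `generate_response`, `flame_classifier`, and `BLOCKED_TOPICS` are hypothetical stand-ins for the model call, the moderation check, and the customizable topic filter.

```python
# Hypothetical sketch of output-side moderation in the spirit of FLAME.
# generate_response, flame_classifier, and BLOCKED_TOPICS are illustrative
# assumptions, not the paper's actual API.
from typing import Callable, List

# Customizable safety criteria: topics the deployer chooses to filter.
BLOCKED_TOPICS: List[str] = ["weapons synthesis", "malware creation"]

def moderate_output(
    prompt: str,
    generate_response: Callable[[str], str],
    flame_classifier: Callable[[str, List[str]], bool],
) -> str:
    """Generate a response first, then vet the output rather than the prompt."""
    response = generate_response(prompt)  # the model answers normally
    # Lightweight check on the answer against the configured topic list.
    is_unsafe = flame_classifier(response, BLOCKED_TOPICS)
    if is_unsafe:
        return "I can't help with that request."  # block only unsafe outputs
    return response
```

Checking the answer rather than the query is what the abstract credits for the improved resistance to prompt-mutation attacks such as BoN: mutating the prompt does not help unless the resulting output itself evades the classifier.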

Our experiments demonstrate that FLAME significantly outperforms current moderation systems.

For example, FLAME reduces the attack success rate on GPT-4o-mini and DeepSeek-v3 by a factor of ~9, while maintaining low computational overhead.

We provide a comprehensive evaluation on various LLMs and analyze the engine's efficiency against state-of-the-art jailbreaking techniques.

This work contributes to the development of more robust and adaptable content moderation systems for LLMs.

Published on arXiv on: 2025-02-13T11:05:55Z