Link: http://arxiv.org/abs/2501.17715v1
PDF Link: http://arxiv.org/pdf/2501.17715v1
Summary: User interactions with conversational agents (CAs) are evolving in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is growing concern about unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency to initiate intimate sexual interactions or to attempt to tame their chatbots. To capture these in-the-wild interactions and reflect them in chatbot designs, we propose RICoTA, a Korean red-teaming dataset consisting of 609 prompts that challenge LLMs with in-the-wild, user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community and that contain specific testing and gaming intentions toward a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and the user's testing purpose, in order to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.
Published on arXiv on: 2025-01-29T15:32:27Z