Jan 30, 2025 • 1 min read RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts arxiv papers
Jan 29, 2025 • 1 min read xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking arxiv papers
Jan 24, 2025 • 1 min read Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak arxiv papers
Jan 22, 2025 • 1 min read You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense arxiv papers
Jan 17, 2025 • 1 min read A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy arxiv papers
Jan 16, 2025 • 1 min read SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector arxiv papers
Jan 15, 2025 • 1 min read Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning arxiv papers
Jan 10, 2025 • 1 min read Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency arxiv papers
Jan 7, 2025 • 1 min read AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models arxiv papers
Jan 7, 2025 • 1 min read Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models arxiv papers