May 13, 2025 • 1 min read • One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models • arxiv papers
May 13, 2025 • 1 min read • SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models • arxiv papers
May 13, 2025 • 1 min read • Concept-Level Explainability for Auditing & Steering LLM Responses • arxiv papers
May 8, 2025 • 1 min read • Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety • arxiv papers
May 8, 2025 • 1 min read • The Aloe Family Recipe for Open and Specialized Healthcare LLMs • arxiv papers
May 8, 2025 • 1 min read • Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization • arxiv papers
May 7, 2025 • 1 min read • LlamaFirewall: An open source guardrail system for building secure AI agents • arxiv papers
May 1, 2025 • 1 min read • The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning • arxiv papers
May 1, 2025 • 1 min read • Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs • arxiv papers
May 1, 2025 • 1 min read • XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs • arxiv papers