arxiv papers

May 16, 2025 • 1 min read

Dark LLMs: The Growing Threat of Unaligned AI Models

arxiv papers

May 15, 2025 • 1 min read

Adversarial Suffix Filtering: a Defense Pipeline for LLMs

arxiv papers

May 13, 2025 • 1 min read

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

arxiv papers

May 13, 2025 • 1 min read

SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models

arxiv papers

May 13, 2025 • 1 min read

Concept-Level Explainability for Auditing & Steering LLM Responses

arxiv papers

May 8, 2025 • 1 min read

Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

arxiv papers

May 8, 2025 • 1 min read

The Aloe Family Recipe for Open and Specialized Healthcare LLMs

arxiv papers

May 8, 2025 • 1 min read

Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization

arxiv papers

May 7, 2025 • 1 min read

LlamaFirewall: An open source guardrail system for building secure AI agents

arxiv papers

May 1, 2025 • 1 min read

The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

arxiv papers

May 1, 2025 • 1 min read

Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs

arxiv papers

May 1, 2025 • 1 min read

XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs

arxiv papers