arxiv papers

May 22, 2025 • 1 min read

Advancing LLM Safe Alignment with Safety Representation Ranking

arxiv papers

May 22, 2025 • 1 min read

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

arxiv papers

May 22, 2025 • 1 min read

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

arxiv papers

May 21, 2025 • 1 min read

PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

arxiv papers

May 21, 2025 • 1 min read

"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

arxiv papers

May 21, 2025 • 1 min read

Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders

arxiv papers

May 21, 2025 • 1 min read

AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

arxiv papers

May 21, 2025 • 1 min read

Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion

arxiv papers

May 21, 2025 • 1 min read

sudoLLM : On Multi-role Alignment of Language Models

arxiv papers

May 21, 2025 • 1 min read

SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

arxiv papers

May 20, 2025 • 1 min read

I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models

arxiv papers

May 16, 2025 • 1 min read

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

arxiv papers