May 22, 2025 • 1 min read • Advancing LLM Safe Alignment with Safety Representation Ranking • arxiv papers
May 22, 2025 • 1 min read • Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses • arxiv papers
May 22, 2025 • 1 min read • Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval • arxiv papers
May 21, 2025 • 1 min read • PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks • arxiv papers
May 21, 2025 • 1 min read • "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs • arxiv papers
May 21, 2025 • 1 min read • Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders • arxiv papers
May 21, 2025 • 1 min read • AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models • arxiv papers
May 21, 2025 • 1 min read • Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion • arxiv papers
May 21, 2025 • 1 min read • SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment • arxiv papers
May 20, 2025 • 1 min read • I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models • arxiv papers
May 16, 2025 • 1 min read • PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization • arxiv papers