Aug 28, 2025 • 1 min read Evaluating Language Model Reasoning about Confidential Information arxiv papers
Aug 28, 2025 • 1 min read Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks arxiv papers
Aug 26, 2025 • 1 min read Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models arxiv papers
Aug 22, 2025 • 1 min read SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks arxiv papers
Aug 22, 2025 • 1 min read Retrieval-Augmented Review Generation for Poisoning Recommender Systems arxiv papers
Aug 22, 2025 • 1 min read Adversarial Attacks against Neural Ranking Models via In-Context Learning arxiv papers
Aug 22, 2025 • 1 min read SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models arxiv papers
Aug 21, 2025 • 1 min read Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent arxiv papers
Aug 20, 2025 • 1 min read Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA arxiv papers
Aug 19, 2025 • 1 min read CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection arxiv papers
Aug 19, 2025 • 1 min read MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies arxiv papers