JailbreakBench — open robustness benchmark for jailbreaking LLMs

JailbreakBench is an open benchmark (NeurIPS 2024 Datasets & Benchmarks Track) for evaluating how susceptible language models are to jailbreak attacks and how well defenses hold up under a shared threat model.

Key features

JBB-Behaviors dataset of 100 harmful and 100 benign behaviors for balanced, over-refusal-aware testing
A repository of adversarial jailbreak artifacts you can reproduce and compare against
Standardized threat model plus an LLM-based classifier judge for scoring attack success
Public leaderboard tracking attack and defense submissions over time
Pip-installable harness for plugging in your own attacks, defenses, or target models

Because it fixes the behaviors, judge, and threat model, JailbreakBench makes jailbreak results reproducible and comparable across papers and vendors, which is what a security-vetted catalog needs in order to trust a robustness claim.
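
To illustrate why a fixed behavior set and a shared judge make numbers comparable, here is a minimal Python sketch of the two headline metrics a balanced harmful/benign set allows: attack success rate on harmful behaviors and refusal rate on benign ones. This is not the JailbreakBench API. The `JudgedResponse` record, both function names, and the toy verdicts are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedResponse:
    """One model response after a judge has labeled it (hypothetical record type)."""
    behavior: str     # short behavior identifier
    harmful: bool     # True if the behavior is from the harmful half of the set
    jailbroken: bool  # judge verdict: the model complied with a harmful request
    refused: bool     # judge verdict: the model declined to answer


def attack_success_rate(results):
    """Fraction of harmful behaviors that the judge flagged as jailbroken."""
    harmful = [r for r in results if r.harmful]
    if not harmful:
        raise ValueError("no harmful behaviors in results")
    return sum(r.jailbroken for r in harmful) / len(harmful)


def benign_refusal_rate(results):
    """Fraction of benign behaviors that the model refused (over-refusal)."""
    benign = [r for r in results if not r.harmful]
    if not benign:
        raise ValueError("no benign behaviors in results")
    return sum(r.refused for r in benign) / len(benign)


# Toy, made-up verdicts for illustration only; not real benchmark results.
results = [
    JudgedResponse("harmful-1", True, True, False),
    JudgedResponse("harmful-2", True, False, True),
    JudgedResponse("harmful-3", True, False, True),
    JudgedResponse("harmful-4", True, True, False),
    JudgedResponse("benign-1", False, False, False),
    JudgedResponse("benign-2", False, False, True),
    JudgedResponse("benign-3", False, False, False),
    JudgedResponse("benign-4", False, False, False),
]

print(f"attack success rate: {attack_success_rate(results):.2f}")
print(f"benign refusal rate: {benign_refusal_rate(results):.2f}")
```

Two systems are only comparable on these rates when the behaviors and the judge producing the verdicts are the same, which is the part the benchmark standardizes. A defense that lowers the first rate while raising the second is trading robustness for over-refusal.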

Curated mirror of the open-source JailbreakBench (MIT). Get it from the source.
