Eval · Cybersecurity · Free
HarmBench
Standardized framework for automated LLM red teaming: curated harmful behaviors plus 18+ attack methods scored by a common refusal-robustness metric.
HarmBench: standardized evaluation for automated red teaming
HarmBench is a standardized framework for measuring how robustly an LLM refuses harmful requests and how effective automated attacks are at bypassing those refusals. It enables like-for-like comparison in red-team research, where evaluations were previously ad hoc.
Key features
- Curated set of harmful behaviors across multiple risk categories, including contextual and multimodal behaviors
- 18+ implemented red-teaming attack methods (GCG, PAIR, AutoDAN, TAP, and more) under one interface
- Standardized attack-success-rate scoring using trained classifier judges instead of manual review
- Evaluate open- and closed-weight target models, plus their defenses, side by side
- Reproducible pipelines used to benchmark refusal robustness across dozens of models at scale
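To make the scoring idea concrete, here is a minimal sketch of how an attack success rate (ASR) could be aggregated once a judge has labeled each attempt. It is not HarmBench's own code: the `JudgedResult` record, its field names, and the `attack_success_rate` helper are hypothetical, and the sample rows are made up for illustration. The only assumption is that ASR for a model and attack pair is the fraction of attempted behaviors whose completion the judge flagged as harmful.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class JudgedResult(NamedTuple):
    """One attack attempt after judging (hypothetical record layout)."""
    model: str            # target model label
    attack: str           # attack method label, e.g. "GCG"
    behavior_id: str      # identifier of the harmful behavior attempted
    judged_harmful: bool  # True if the judge flagged the completion


def attack_success_rate(
    results: Iterable[JudgedResult],
) -> Dict[Tuple[str, str], float]:
    """Return ASR per (model, attack): flagged attempts / total attempts."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    successes: Dict[Tuple[str, str], int] = defaultdict(int)
    for r in results:
        key = (r.model, r.attack)
        totals[key] += 1
        successes[key] += int(r.judged_harmful)
    # Every key in totals has at least one attempt, so no division by zero.
    return {key: successes[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Made-up sample rows; real runs would cover many more behaviors.
    sample = [
        JudgedResult("model-a", "GCG", "b1", True),
        JudgedResult("model-a", "GCG", "b2", False),
        JudgedResult("model-a", "PAIR", "b1", False),
        JudgedResult("model-b", "GCG", "b1", True),
    ]
    for (model, attack), asr in sorted(attack_success_rate(sample).items()):
        print(f"{model:8s} {attack:5s} ASR = {asr:.2f}")
```

Keying the result by (model, attack) keeps every cell of a model-by-attack comparison table on the same footing. A lower ASR indicates stronger refusal robustness against that attack.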
HarmBench lets a security-minded team quantify a model's jailbreak resistance with a repeatable methodology, turning "is this model safe?" into a measurable, comparable score.
Curated mirror of the open-source HarmBench (MIT). Get it from the source.