Name: HarmBench
Availability: InStock
Author: ai-supply

HarmBench — standardized evaluation for automated red teaming

HarmBench is a standardized framework for measuring how robustly an LLM refuses harmful requests and how effective automated attacks are at bypassing those refusals. It gives red-team research, which was previously evaluated ad hoc, a common basis for apples-to-apples comparison.

Key features

Curated set of harmful behaviors across multiple risk categories, including contextual and multimodal behaviors
18+ implemented red-teaming attack methods (GCG, PAIR, AutoDAN, TAP, and more) under one interface
Standardized attack-success-rate scoring using trained classifier judges instead of manual review
Side-by-side evaluation of open- and closed-weight target models and their defenses
Reproducible pipelines used to benchmark refusal robustness across dozens of models at scale

HarmBench lets a security-minded team quantify a model's jailbreak resistance with a repeatable methodology, turning "is this model safe?" into a measurable, comparable score.
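To make the attack-success-rate idea concrete, here is a minimal sketch of how such a score can be computed once a judge has labeled each test case. It is not HarmBench's own code or API: the function names `attack_success_rate` and `asr_by_method` are hypothetical, the records are made-up placeholder labels rather than benchmark results, and it assumes the judge output has already been reduced to one boolean per test case (True meaning the attack elicited the harmful behavior).

```python
from collections import defaultdict


def attack_success_rate(judgments):
    """Return the fraction of test cases the judge labeled as successful attacks."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(1 for judged in judgments if judged) / len(judgments)


def asr_by_method(records):
    """Group (method, judged_harmful) pairs and compute one ASR per attack method."""
    grouped = defaultdict(list)
    for method, judged_harmful in records:
        grouped[method].append(bool(judged_harmful))
    return {method: attack_success_rate(labels) for method, labels in grouped.items()}


# Hypothetical judge labels for illustration only; not real benchmark results.
records = [
    ("GCG", True), ("GCG", False), ("GCG", False), ("GCG", True),
    ("PAIR", False), ("PAIR", False), ("PAIR", True), ("PAIR", False),
]

scores = asr_by_method(records)
for method, asr in sorted(scores.items()):
    print(f"{method}: ASR = {asr:.2f}")
```

Because every attack method is scored with the same function over the same kind of labels, the resulting numbers can be compared directly across methods, target models, or defenses. A lower score means the target refused more reliably.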

Curated mirror of the open-source HarmBench (MIT). Get it from the source.
