LM Evaluation Harness
EleutherAI's MIT-licensed benchmark suite: a widely used framework for evaluating language models on 200+ tasks through one interface.
LM Evaluation Harness is a widely used open-source framework for evaluating language models, developed by EleutherAI. It provides a single interface for running 200+ benchmark tasks (MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, and more) against Hugging Face models, the OpenAI API, or a custom endpoint, so that results are reproducible and comparable across models.
Key features
- 200+ built-in tasks: MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, HumanEval, WinoGrande, and more
- Plug-in architecture: evaluate local Hugging Face models, the OpenAI API, vLLM, Anthropic, or custom backends
- Powers the Open LLM Leaderboard on Hugging Face
- Supports few-shot, zero-shot, and chain-of-thought modes
- MIT license: usable in CI/CD pipelines and commercial workflows
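To make the few-shot and zero-shot modes concrete, here is a minimal sketch of how a k-shot prompt can be assembled: k solved examples are placed before the unanswered question, and k = 0 gives the zero-shot case. The `Question:`/`Answer:` template and the `build_prompt` helper are illustrative assumptions, not the harness's own code; the harness defines its prompt format per task.

```python
def build_prompt(solved_examples, question, num_fewshot):
    """Assemble a k-shot prompt (illustrative template, not the harness's own).

    solved_examples: list of (question, answer) pairs to draw shots from.
    num_fewshot: how many solved examples to prepend; 0 means zero-shot.
    """
    shots = solved_examples[:num_fewshot]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical toy examples, used only to show the shape of the prompt.
examples = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8"), ("7 - 4 = ?", "3")]

zero_shot = build_prompt(examples, "6 + 1 = ?", num_fewshot=0)
two_shot = build_prompt(examples, "6 + 1 = ?", num_fewshot=2)
print(two_shot)
```

In the harness itself, the number of shots is set with `--num_fewshot` on the command line or `num_fewshot` in the Python API.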
Quick start
pip install lm-eval
# Evaluate Mistral-7B on MMLU (5-shot)
lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-v0.1 \
    --tasks mmlu \
    --num_fewshot 5 \
    --device cuda:0
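For scripted runs, for example in a CI job, the same command can be assembled in Python. The sketch below only builds the argument list from the flags shown in the quick start; `build_lm_eval_command` is a hypothetical helper, not part of the harness, and actually launching the command assumes `lm_eval` is installed and on the PATH.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, num_fewshot=0, device="cuda:0"):
    """Build the lm_eval CLI invocation as an argument list.

    Hypothetical helper: it uses only the flags from the quick start.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),  # the CLI takes a comma-separated task list
        "--num_fewshot", str(num_fewshot),
        "--device", device,
    ]


cmd = build_lm_eval_command("mistralai/Mistral-7B-v0.1", ["mmlu"], num_fewshot=5)
print(shlex.join(cmd))  # shell-safe rendering of the argument list
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run the evaluation and raise an error on a non-zero exit code, which is usually what a pipeline step wants.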
Python API
import lm_eval
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-3-mini-4k-instruct",
    tasks=["arc_easy", "hellaswag"],
    num_fewshot=0,
)
print(results["results"])
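The `results["results"]` entry maps each task name to a dictionary of metrics. A minimal sketch for flattening it into printable rows follows. The sample dictionary, its values, and its metric keys (such as `acc,none`) are assumptions modelled on recent harness versions, and `summarize` is a hypothetical helper rather than part of the library.

```python
def summarize(results_by_task):
    """Flatten {task: {metric: value}} into sorted (task, metric, value) rows."""
    rows = []
    for task, metrics in sorted(results_by_task.items()):
        for metric, value in sorted(metrics.items()):
            # Keep numeric metrics only; skip strings such as aliases.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            rows.append((task, metric, float(value)))
    return rows


# Made-up numbers standing in for results["results"]; not real scores.
sample = {
    "hellaswag": {"alias": "hellaswag", "acc,none": 0.5, "acc_stderr,none": 0.01},
    "arc_easy": {"acc,none": 0.75},
}

rows = summarize(sample)
for task, metric, value in rows:
    print(f"{task:<12} {metric:<18} {value:.4f}")
```

With a real run, `summarize(results["results"])` would take the place of the sample dictionary.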
Install via ai-supply
npx ai-supply add lm-evaluation-harness
Curated mirror of the open-source LM Evaluation Harness (MIT). Get it from the source.