Name: LM Evaluation Harness
Availability: InStock
Author: ai-supply

LM Evaluation Harness

LM Evaluation Harness is a widely used open-source framework for evaluating language models, developed by EleutherAI. It provides a single interface for running 200+ benchmark tasks (MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, and more) against Hugging Face models, the OpenAI API, or a custom endpoint, so that results are reproducible and comparable across models.

Key features

200+ built-in tasks: MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, HumanEval, WinoGrande, and more
Plug-in architecture: evaluate local Hugging Face models, the OpenAI API, vLLM, Anthropic, or custom backends
Powers the Open LLM Leaderboard on HuggingFace
Supports few-shot, zero-shot, and chain-of-thought modes
MIT license: usable in CI/CD pipelines and commercial workflows

Quick start

pip install lm-eval

# Evaluate Mistral-7B on MMLU (5-shot)
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-7B-v0.1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --device cuda:0
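
For scripted or CI use, the same invocation can be assembled in Python and handed to subprocess. The following is a minimal sketch that uses only the flags shown above; the helper names build_lm_eval_command and run_lm_eval are hypothetical and are not part of the harness.

```python
import shlex
import subprocess


def build_lm_eval_command(pretrained, tasks, num_fewshot=0, device="cuda:0"):
    """Return the lm_eval argument list for a Hugging Face model.

    Hypothetical helper: it only mirrors the flags from the quick start.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),  # multiple tasks are comma-separated
        "--num_fewshot", str(num_fewshot),
        "--device", device,
    ]


def run_lm_eval(command):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


command = build_lm_eval_command(
    "mistralai/Mistral-7B-v0.1", ["mmlu"], num_fewshot=5
)
# Print the shell-equivalent command without running it.
print(shlex.join(command))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a model path or task list comes from configuration. Call run_lm_eval(command) only on a machine where lm-eval and a suitable GPU are available.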

Python API

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-3-mini-4k-instruct",
    tasks=["arc_easy", "hellaswag"],
    num_fewshot=0,
)
print(results["results"])
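
The per-task dictionary printed above can be flattened into rows for logging or for a simple CI gate. The sketch below assumes that results["results"] maps each task name to a dictionary whose metric keys look like "acc,none" (metric name, then filter name), which is how recent lm-eval versions appear to report them; check this against the output of your installed version. The helpers flatten_results and failing_tasks are hypothetical, and the sample numbers are placeholders, not measured scores.

```python
def flatten_results(task_results):
    """Turn {task: {"metric,filter": value}} into (task, metric, filter, value) rows.

    Assumes the "metric,filter" key shape; non-numeric fields are skipped.
    """
    rows = []
    for task, metrics in sorted(task_results.items()):
        for key, value in sorted(metrics.items()):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue  # skip fields such as "alias" or string placeholders
            metric, _, filter_name = key.partition(",")
            rows.append((task, metric, filter_name, value))
    return rows


def failing_tasks(rows, metric, minimum):
    """Return the tasks whose value for `metric` is below `minimum`."""
    return [task for task, name, _, value in rows if name == metric and value < minimum]


# Placeholder values for illustration only; these are not real benchmark scores.
sample = {
    "arc_easy": {"alias": "arc_easy", "acc,none": 0.75, "acc_stderr,none": 0.01},
    "hellaswag": {"alias": "hellaswag", "acc,none": 0.5, "acc_stderr,none": 0.02},
}

rows = flatten_results(sample)
for row in rows:
    print(row)
print(failing_tasks(rows, "acc", 0.6))
```

With real output, pass results["results"] in place of sample. A CI job could then exit with a non-zero status whenever failing_tasks returns a non-empty list.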

Install via ai-supply

npx ai-supply add lm-evaluation-harness

Curated mirror of the open-source LM Evaluation Harness (MIT). Get it from the source.
