LM Evaluation Harness

EleutherAI's MIT-licensed unified benchmark suite, widely treated as the de facto standard for evaluating language models across 200+ tasks.


LM Evaluation Harness is an open-source framework from EleutherAI for evaluating language models. It provides a single interface for running 200+ benchmark tasks (MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, and more) against Hugging Face models, the OpenAI API, or a custom endpoint, so that results are reproducible and comparable across models.

Key features

  • 200+ built-in tasks: MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, HumanEval, WinoGrande, and more
  • Plug-in architecture: evaluate local Hugging Face models, the OpenAI API, vLLM, Anthropic, or custom backends
  • Evaluation backend for the Open LLM Leaderboard on Hugging Face
  • Supports zero-shot, few-shot, and chain-of-thought evaluation
  • MIT license: usable in CI/CD pipelines and commercial workflows

Quick start

pip install lm-eval

# Evaluate Mistral-7B on MMLU (5-shot)
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-7B-v0.1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --device cuda:0
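
When the same run has to be launched from a script or a CI job, it can help to assemble the command from structured values rather than hand-editing a shell string. The sketch below rebuilds the exact invocation shown above. It uses only the flags that appear in that command; the helper name `build_lm_eval_command` is hypothetical and not part of lm-eval.

```python
import shlex


def build_lm_eval_command(model_args, tasks, num_fewshot, device="cuda:0", model="hf"):
    """Return the argument list for one lm_eval run (hypothetical helper)."""
    # The CLI takes model arguments as comma-separated key=value pairs.
    model_args_str = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", model,
        "--model_args", model_args_str,
        "--tasks", ",".join(tasks),  # several tasks are passed as one comma-separated value
        "--num_fewshot", str(num_fewshot),
        "--device", device,
    ]


cmd = build_lm_eval_command(
    model_args={"pretrained": "mistralai/Mistral-7B-v0.1"},
    tasks=["mmlu"],
    num_fewshot=5,
)

# Print a copy-pasteable, shell-quoted version of the command.
print(shlex.join(cmd))
```

With lm-eval installed, the list can be handed to `subprocess.run(cmd, check=True)` directly; keeping it as a list avoids shell-quoting problems when model paths contain unusual characters.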

Python API

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-3-mini-4k-instruct",
    tasks=["arc_easy", "hellaswag"],
    num_fewshot=0,
)
print(results["results"])
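
The printed dictionary is nested per task, so a small flattening step makes it easier to read or to compare across runs. The sketch below is a rough illustration, assuming metric keys take the `metric,filter` form (for example `acc,none`) that recent 0.4.x releases appear to use; check the keys your installed version actually returns. The helper `flatten_results` and the numbers in `sample` are hypothetical placeholders, not real benchmark scores.

```python
def flatten_results(results):
    """Turn results["results"] into sorted (task, metric, filter, value) rows."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            # Skip non-numeric entries such as the task alias or "N/A" markers.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            # Assumed key layout: "<metric>,<filter>", e.g. "acc,none".
            metric, _, filter_name = key.partition(",")
            rows.append((task, metric, filter_name, float(value)))
    return sorted(rows)


# Placeholder data shaped like the assumed output; values are made up.
sample = {
    "results": {
        "arc_easy": {"alias": "arc_easy", "acc,none": 0.5, "acc_stderr,none": 0.01},
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.25, "acc_stderr,none": 0.02},
    }
}

for task, metric, filter_name, value in flatten_results(sample):
    if not metric.endswith("_stderr"):  # show headline metrics only
        print(f"{task:<12}{metric:<10}{value:.3f}")
```

In a real run, pass the `results` object returned by `lm_eval.simple_evaluate` in place of `sample`. In a CI pipeline the same rows could be compared against a stored baseline to flag regressions.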

Install via ai-supply

npx ai-supply add lm-evaluation-harness

Curated mirror of the open-source LM Evaluation Harness (MIT). Get it from the source.
