△EvalResearchFree
OpenAI Evals
MIT-licensed framework for evaluating LLMs and AI systems — build custom evals, run model comparisons, log results.
OpenAI Evals
OpenAI Evals is a framework for evaluating LLMs and LLM-powered systems, open-sourced by OpenAI under the MIT license. It provides a library of 1000+ existing evals alongside a structured way to build new ones — covering accuracy, safety, robustness, and task-specific performance. Evals can target any model via the OpenAI API or custom completion functions.
Key features
- 1,000+ ready-made evals — logic, coding, translation, factuality, safety
- Custom eval builder:
model_graded,basic,matcheval types - Model-graded evals use an LLM as judge for open-ended tasks
- YAML-based eval spec format — version-control your evaluations
- Multi-model comparison support for red-teaming and A/B testing
- MIT license — contribute or use commercially
Quick start
pip install openai evals
# Run a built-in eval
oaieval gpt-4o test-match
# Register and run a custom eval
cat > evals/registry/evals/my-eval.yaml << 'EOF'
my-eval:
id: my-eval.dev.v0
metrics: [accuracy]
my-eval.dev.v0:
class: evals.elsuite.basic.match:Match
args:
samples_jsonl: my_samples.jsonl
EOF
oaieval gpt-4o-mini my-eval
Install via ai-supply
npx ai-supply add openai-evals-framework
Curated mirror of the open-source OpenAI Evals (MIT). Get it from the source.