Which free eval harness do you actually trust — promptfoo, deepeval, or ragas?
I've been doing systematic LLM evaluation for client projects for about eight months, and I've gone through phases of relying on each of the three free eval frameworks in the catalog. Here's my honest breakdown.
The listings
- promptfoo-llm-eval — CLI-first, YAML-config driven, fast setup, excellent for prompt regression testing
- deepeval-llm-testing — pytest-native, rich metrics (G-Eval, RAGAS-style, hallucination, toxicity), good CI integration
- ragas (also in the catalog, for RAG-specific eval)
All three free, all three security-scanned. Eval frameworks run your actual prompts and sometimes process your data, so the scan reports are worth reading — I checked all three before using them in projects with sensitive client data.
My take
promptfoo is the fastest path from zero to running evals. The YAML config is clean, the CLI output is readable, and it integrates with basically every LLM provider. I use it for prompt regression testing — did my last prompt change make things worse? — and for A/B testing prompts before deployment.

    # promptfooconfig.yaml
    providers:
      - anthropic:claude-3-5-haiku-latest
      - openai:gpt-4o-mini
    prompts:
      - "Summarise this in 3 bullet points: {{document}}"
    tests:
      - vars:
          # file:// loads the variable's value from disk
          document: file://test_docs/001.txt
        assert:
          # deterministic substring check
          - type: contains
            value: "key finding"
          # model-graded check against a natural-language rubric
          - type: llm-rubric
            value: "The summary is accurate and covers the main points"

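For the "did my last prompt change make things worse?" question, here is a rough sketch of the comparison I care about, in plain Python. It does not read promptfoo's output format. The dicts, test ids, and function names are all made up for illustration, and you would fill them from whatever per-test pass/fail results your harness gives you.

```python
def find_regressions(baseline, candidate):
    """Return ids of tests that passed on the baseline prompt but not on the candidate."""
    return sorted(
        test_id
        for test_id, passed in baseline.items()
        # A test missing from the candidate run is treated as failing.
        if passed and not candidate.get(test_id, False)
    )


def pass_rate(results):
    """Fraction of tests that passed (0.0 for an empty run)."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical per-test verdicts for the old and new prompt.
baseline = {"doc_001": True, "doc_002": True, "doc_003": False}
candidate = {"doc_001": True, "doc_002": False, "doc_003": True}

regressions = find_regressions(baseline, candidate)
print("regressed:", regressions)
print("pass rates:", pass_rate(baseline), pass_rate(candidate))
```

In this toy data the overall pass rate is identical for both prompts while one test has regressed, which is why I look at the per-test diff and not just the headline number.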
deepeval is my choice when I need more nuanced metrics — especially hallucination detection and faithfulness scoring for RAG outputs. The pytest integration means eval results live next to unit tests in CI. The GEval metric lets me define custom criteria in natural language, which is powerful for bespoke quality dimensions.
ragas is the specialist for RAG pipelines specifically. Context precision, context recall, faithfulness, answer relevancy — if I'm evaluating a retrieval-augmented system, ragas metrics are the most meaningful signal I've found.
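To make two of those metrics concrete, here is a minimal sketch of the arithmetic as I understand it: context precision as average precision over the ranked retrieved chunks, and context recall as the share of reference-answer claims supported by the retrieved context. This is not the ragas API. In ragas the relevance and support judgments come from an LLM judge, whereas here they are passed in as plain booleans, and the function names are my own.

```python
def context_precision(relevance):
    """Average precision over ranked chunks.

    relevance[i] is True if the chunk at rank i + 1 was judged relevant.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(claim_supported):
    """Fraction of reference-answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


# Same two relevant chunks, ranked differently (made-up judgments).
print(context_precision([True, False, True]))
print(context_precision([False, True, True]))
print(context_recall([True, True, False, True]))
```

The two precision calls use the same chunks in a different order, and the score is higher when the relevant ones are ranked first. That ordering sensitivity is the part a plain hit rate misses.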
When to use which
| Use case | My pick |
|---|---|
| Prompt regression in CI | promptfoo |
| RAG quality measurement | ragas |
| Complex multi-metric eval suite | deepeval |
| Quick sanity check before shipping | promptfoo |
All three are free. The only cost is the LLM judge calls, and if you run a local model as your judge via Ollama, the API bill drops to zero (you still pay in local compute).
What's your experience? I'm particularly curious whether anyone has done a calibration study comparing the three — i.e., do they agree on what "good" looks like?
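If anyone wants to try it, one cheap starting point is to run the same test cases through two harnesses, reduce each result to pass/fail, and compute Cohen's kappa (agreement corrected for chance). A sketch, assuming you have already exported the verdicts yourself. The variable names and the data below are invented, not real results.

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length lists of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented pass/fail verdicts for the same eight test cases.
promptfoo_verdicts = [True, True, False, True, False, False, True, True]
deepeval_verdicts = [True, False, False, True, False, True, True, True]

kappa = cohens_kappa(promptfoo_verdicts, deepeval_verdicts)
print(f"kappa = {kappa:.3f}")
```

Raw agreement in this toy data is 6 out of 8, but kappa comes out noticeably lower because both harnesses pass most cases anyway. For continuous scores, a rank correlation would be the more natural choice than thresholding to pass/fail.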