Which free eval harness do you actually trust — promptfoo, deepeval, or ragas?

I've been doing systematic LLM evaluation for client projects for about eight months, and I've gone through phases of relying on each of the three free eval frameworks in the catalog. Here's my honest breakdown.

The listings

promptfoo-llm-eval — CLI-first, YAML-config driven, fast setup, excellent for prompt regression testing
deepeval-llm-testing — pytest-native, rich metrics (G-Eval, RAGAS-style, hallucination, toxicity), good CI integration
ragas — also in the catalog, for RAG-specific eval

All three free, all three security-scanned. Eval frameworks run your actual prompts and sometimes process your data, so the scan reports are worth reading — I checked all three before using them in projects with sensitive client data.

My take

promptfoo is the fastest path from zero to running evals. The YAML config is clean, the CLI output is readable, and it integrates with basically every LLM provider. I use it for prompt regression testing — did my last prompt change make things worse? — and for A/B testing prompts before deployment.

# promptfooconfig.yaml
providers:
  - anthropic:claude-3-5-haiku-latest
  - openai:gpt-4o-mini
prompts:
  - "Summarise this in 3 bullet points: {{document}}"
tests:
  - vars:
      document: file://test_docs/001.txt
    assert:
      - type: contains
        value: "key finding"
      - type: llm-rubric
        value: "The summary is accurate and covers the main points"
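
For the "did my last prompt change make things worse?" question, the comparison itself is simple once each run is reduced to per-test pass/fail. Here is a rough sketch of that check, assuming the results from two runs have already been exported into plain Python dicts. The `test_id` and `passed` field names are made up for illustration and are not promptfoo's output schema.

```python
def pass_rate(results):
    """Fraction of tests that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def find_regressions(baseline, candidate):
    """Return ids of tests that passed in the baseline run but fail in the candidate run.

    Both arguments are lists of dicts with hypothetical keys
    "test_id" (str) and "passed" (bool).
    """
    base = {r["test_id"]: r["passed"] for r in baseline}
    cand = {r["test_id"]: r["passed"] for r in candidate}
    # Tests missing from the candidate run are ignored here, not counted as regressions.
    return sorted(t for t, ok in base.items() if ok and cand.get(t) is False)
```

Listing the individual regressed tests tends to be more useful than comparing overall pass rates, since a new prompt can fix two cases and break two others without moving the average.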

deepeval is my choice when I need more nuanced metrics — especially hallucination detection and faithfulness scoring for RAG outputs. The pytest integration means eval results live next to unit tests in CI. The GEval metric lets me define custom criteria in natural language, which is powerful for bespoke quality dimensions.

ragas is the specialist for RAG pipelines specifically. Context precision, context recall, faithfulness, answer relevancy — if I'm evaluating a retrieval-augmented system, ragas metrics are the most meaningful signal I've found.
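
To make two of those metrics concrete, here is a simplified sketch of the arithmetic behind context precision and context recall. It is not the ragas API: in ragas the relevance and support judgments come from an LLM judge, while here they are assumed to be given as plain 0/1 labels, and the function names are my own.

```python
def context_precision(relevance):
    """Rank-aware precision over retrieved chunks.

    relevance: list of 0/1 flags, one per retrieved chunk, in rank order.
    Averages precision@k over the positions k that hold a relevant chunk,
    so relevant chunks ranked earlier score higher.
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


def context_recall(supported):
    """Share of reference-answer claims that the retrieved context supports.

    supported: list of booleans, one per claim in the reference answer.
    """
    if not supported:
        return 0.0
    return sum(1 for s in supported if s) / len(supported)
```

With the same two relevant chunks, a retrieval ranked relevant-first scores higher on precision than one with an irrelevant chunk on top, which is the behaviour that makes the metric useful for tuning rerankers.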

When to use which

Use case	My pick
Prompt regression in CI	promptfoo
RAG quality measurement	ragas
Complex multi-metric eval suite	deepeval
Quick sanity check before shipping	promptfoo

All three are free. The only cost is the LLM judge calls, and if you use a local model as your judge via Ollama, even the API bill drops to zero; you pay only in local compute.

What's your experience? I'm particularly curious whether anyone has done a calibration study comparing the three — i.e., do they agree on what "good" looks like?
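
For anyone who wants to try that comparison, one possible starting point is to run the same test set through each harness, reduce every result to a pass/fail verdict, and compute pairwise Cohen's kappa. The sketch below assumes that reduction has already been done by hand; the harness names are just dict keys, and nothing here calls any of the three tools.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of boolean verdicts."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("verdict lists must be non-empty and the same length")
    observed = sum(1 for x, y in zip(a, b) if bool(x) == bool(y)) / n
    pa = sum(1 for x in a if x) / n  # pass rate of first rater
    pb = sum(1 for y in b if y) / n  # pass rate of second rater
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if expected == 1.0:
        # Both raters gave the same constant verdict everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


def pairwise_agreement(verdicts):
    """Map each pair of harness names to the kappa between their verdicts.

    verdicts: dict of harness name -> list of pass/fail booleans,
    all in the same test order.
    """
    return {
        (h1, h2): cohen_kappa(verdicts[h1], verdicts[h2])
        for h1, h2 in combinations(sorted(verdicts), 2)
    }
```

Kappa corrects for chance agreement, which matters when most test cases pass: two harnesses that each pass 95% of cases will agree often even if their judgments are unrelated.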
