DeepEval
Apache-2.0 LLM testing framework — pytest-style unit tests for RAG pipelines, chatbots, and LLM outputs.
DeepEval is an open-source LLM evaluation framework that brings software-testing discipline to AI systems. Written for Python and styled after pytest, it lets teams write unit tests for LLM outputs, RAG pipelines, and chatbot quality — with 14+ built-in metrics including G-Eval, RAGAS-style faithfulness, answer relevancy, hallucination detection, and toxicity.
Key features
- 14+ built-in metrics: G-Eval, answer relevancy, faithfulness, contextual precision/recall, hallucination, toxicity, bias, summarization
- pytest integration: assert LLM outputs like code
- RAG-aware: test retrieval quality, context relevancy, and generation faithfulness in one suite
- CI/CD ready with JUnit XML output and GitHub Actions support
- Apache-2.0 license — commercial use fully permitted
Quick start
pip install deepeval
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


def test_answer_relevancy():
    # One test case: the prompt, the model's answer, and the retrieved context.
    test_case = LLMTestCase(
        input="What is RAG?",
        actual_output="RAG stands for Retrieval-Augmented Generation.",
        retrieval_context=["RAG is a technique that enhances LLM outputs with retrieved documents."],
    )
    # The test fails if the relevancy score falls below the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
# Run your LLM test suite
deepeval test run test_llm.py
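To make the threshold semantics in the quick start concrete, here is a minimal, dependency-free sketch of the score-then-assert pattern. It is not DeepEval's implementation: `ToyTestCase`, `ToyOverlapMetric`, and `toy_assert_test` are hypothetical names invented for illustration, and the score is a naive keyword overlap rather than the model-based judging a real metric would rely on.

```python
from dataclasses import dataclass


@dataclass
class ToyTestCase:
    """Hypothetical stand-in for a test case: a prompt and the model's answer."""
    input: str
    actual_output: str


def _tokens(text: str) -> set:
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


class ToyOverlapMetric:
    """Hypothetical metric: fraction of prompt keywords that reappear in the answer."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = 0.0

    def measure(self, case: ToyTestCase) -> float:
        # Treat prompt words longer than two characters as keywords.
        keywords = {w for w in _tokens(case.input) if len(w) > 2}
        answer = _tokens(case.actual_output)
        self.score = len(keywords & answer) / len(keywords) if keywords else 0.0
        return self.score

    def is_successful(self) -> bool:
        # A metric passes when its score meets or exceeds the threshold.
        return self.score >= self.threshold


def toy_assert_test(case: ToyTestCase, metrics: list) -> None:
    """Score the case with every metric and raise AssertionError on the first failure."""
    for metric in metrics:
        metric.measure(case)
        assert metric.is_successful(), (
            f"{type(metric).__name__} scored {metric.score:.2f}, "
            f"below threshold {metric.threshold}"
        )


if __name__ == "__main__":
    case = ToyTestCase(
        input="What is RAG?",
        actual_output="RAG stands for Retrieval-Augmented Generation.",
    )
    toy_assert_test(case, [ToyOverlapMetric(threshold=0.5)])
    print("passed")
```

Because a failed metric surfaces as an ordinary `AssertionError`, pytest can report it like any other failing unit test. Raising the threshold makes the same answer fail, which is the lever used to tighten quality gates in CI.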
Install via ai-supply
npx ai-supply add deepeval-llm-testing
Curated mirror of the open-source DeepEval (Apache-2.0). Get it from the source.