Why it matters
- pytest-compatible design means LLM quality tests integrate into existing CI/CD pipelines without new tooling.
- 14+ pre-built metrics eliminate the need to write custom evaluation logic for common quality concerns.
- LLM-as-judge approach (using GPT-4 to score outputs) scales evaluation without requiring human annotators.
- Coverage of both RAG-specific (faithfulness, contextual recall) and general (bias, toxicity, hallucination) metrics in one framework.
Key capabilities
- 14+ metrics: Faithfulness, Answer Relevancy, Contextual Precision/Recall, Hallucination, Bias, Toxicity, Summarization Quality, and more.
- G-Eval: Define custom evaluation criteria in natural language; GPT-4 scores against them.
- pytest integration: `@pytest.mark.parametrize`-style test cases; run with `deepeval test run`.
- RAG evaluation: complete retrieval-quality metrics — what's retrieved, whether it's used faithfully.
- Agent evaluation: Evaluate tool selection correctness, multi-step reasoning, and task completion.
- Red teaming: Automated adversarial attack generation to test safety and robustness.
- Dataset management: Store and version evaluation datasets via Confident AI cloud.
- Benchmark comparison: Compare your model against public benchmarks (MMLU, HellaSwag, etc.).
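The pytest integration and LLM-as-judge pattern above can be sketched offline. This is illustrative stand-in code, not DeepEval's API: `fake_judge` and `grade` are hypothetical names, and in real use DeepEval's metric classes build the rubric prompt and make the actual GPT-4 call.

```python
# Sketch of a G-Eval-style quality test in the pytest style.
# `fake_judge` stands in for a real LLM call so the example runs offline.
import pytest


def fake_judge(prompt: str) -> float:
    """Stand-in judge: a real implementation would send `prompt` to an
    LLM and parse a numeric score from its reply."""
    # Toy heuristic: reward answers that mention the expected fact.
    return 1.0 if "Paris" in prompt else 0.2


def grade(question: str, answer: str, criteria: str,
          threshold: float = 0.7) -> bool:
    """Build a rubric prompt from natural-language criteria and compare
    the judge's score against a pass threshold."""
    prompt = (
        f"Criteria: {criteria}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Score the answer from 0.0 to 1.0 against the criteria."
    )
    return fake_judge(prompt) >= threshold


@pytest.mark.parametrize("question,answer", [
    ("What is the capital of France?", "The capital of France is Paris."),
])
def test_answer_relevancy(question, answer):
    # Fails the pytest run if the judge scores below the threshold,
    # which is how judge-based checks gate a CI pipeline.
    assert grade(question, answer,
                 criteria="Answer must directly address the question")
```

Because each check is an ordinary pytest assertion, a failing quality score fails the build like any other test.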
Technical notes
- License: Apache 2.0 (open source)
- GitHub: github.com/confident-ai/deepeval (5.5K+ stars)
- Install: `pip install deepeval`
- LLM judge: GPT-4o by default; configurable to any LLM
- CI/CD: GitHub Actions, GitLab CI via pytest
- Cloud: Confident AI (confident-ai.com) for hosted tracking
- Pricing: Framework free; Confident AI has free tier + paid plans
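The CI/CD hook can be sketched as a minimal GitHub Actions workflow. The file path, Python version, test directory, and secret name below are illustrative assumptions, not DeepEval-prescribed values:

```yaml
# .github/workflows/llm-tests.yml — illustrative; adapt paths and versions.
name: llm-quality-tests
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install deepeval
      # Judge-based metrics call an LLM API, so the job needs a key.
      - env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/
```

Note that every push re-runs the judge-based metrics, so the API-cost caveat below applies to CI volume as well.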
Ideal for
- Engineering teams who want to integrate LLM quality testing into CI/CD — catch regressions before they reach production.
- Teams building RAG systems who need quantitative metrics for faithfulness and retrieval quality.
- Organizations that need safety evaluation (bias, toxicity) alongside quality metrics in one framework.
Not ideal for
- Teams who need a polished cloud UI with team collaboration out of the box — Braintrust has a stronger enterprise evaluation platform.
- Large-scale evaluation without LLM API costs — G-Eval and judge-based metrics call GPT-4 per test case.
- Non-Python teams — DeepEval is a Python library.
See also
- RAGAS — RAG-specialized evaluation; similar metrics, strong open-source community.
- Promptfoo — CLI-first LLM testing with red-teaming; stronger multi-model comparison.
- Braintrust — Cloud enterprise eval platform with stronger team collaboration features.