
DeepEval

Open-source LLM evaluation framework — 14+ metrics for RAG, agents, and chatbots

LLM Frameworks · Free

DeepEval is an open-source Python framework for evaluating LLM outputs with production-grade metrics. It provides 14+ built-in metrics (faithfulness, answer relevancy, hallucination, bias, toxicity) plus LLM-as-judge scoring — all integrated with pytest for CI/CD. Used by engineering teams to catch LLM quality regressions before deployment.
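
A minimal sketch of how this looks in practice, following DeepEval's documented quickstart pattern (the input/output pair is invented for illustration; `LLMTestCase`, `AnswerRelevancyMetric`, and `assert_test` are from the public API, but pin a release before relying on exact signatures):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # Hypothetical example pair; in a real suite, actual_output
    # comes from your LLM application under test.
    test_case = LLMTestCase(
        input="What are your store hours?",
        actual_output="We're open 9am-5pm, Monday through Friday.",
    )
    # Scores how relevant the output is to the input using an
    # LLM-as-judge (an OpenAI API key is required by default);
    # the test fails if the score falls below the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Because these are ordinary test functions, running `pytest` (or DeepEval's `deepeval test run` CLI) on the file reports failing metrics as regular test failures, which is what makes the CI/CD integration work.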

Key specs
5,500 GitHub stars (as of 2026-03-27)

Integrations

None listed.

Built on

None listed.