Why it matters
- Provides objective measurement for RAG systems that otherwise require expensive human evaluation.
- Identifies which part of a RAG pipeline is failing — retrieval quality vs. generation quality — enabling targeted improvement.
- LLM-as-a-judge evaluation scales to thousands of test cases automatically.
- Framework-agnostic: evaluates any RAG system built with LangChain, LlamaIndex, or custom code.
Key capabilities
- Faithfulness: Measures whether every claim in the answer is supported by retrieved context — detects hallucinations.
- Answer Relevancy: Scores how well the answer addresses the user's original question.
- Context Recall: Evaluates whether the retrieval step found all information needed to answer correctly.
- Context Precision: Measures how relevant the retrieved documents are — penalizes noisy retrieval.
- Answer Correctness: When ground truth is available, measures factual accuracy.
- Aspect Critique: Define custom evaluation criteria for domain-specific quality requirements.
- LangChain integration: Direct evaluator integration for LangChain-based RAG chains.
- LlamaIndex integration: Evaluate LlamaIndex query pipelines with RAGAS metrics.
- Ragas Cloud: Managed dashboard for tracking evaluation results over time and across experiments.
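The metrics above all land on a 0–1 scale. As a toy illustration only (RAGAS itself uses an LLM judge to produce the claim- and document-level verdicts, and its context precision is additionally rank-weighted), the intuition behind two of them can be sketched in a few lines; the function names here are hypothetical, not RAGAS APIs:

```python
def toy_faithfulness(claims_supported):
    """Fraction of the answer's claims supported by retrieved context."""
    return sum(claims_supported) / len(claims_supported)

def toy_context_precision(docs_relevant):
    """Fraction of retrieved documents relevant to the question."""
    return sum(docs_relevant) / len(docs_relevant)

# An answer with 3 of its 4 claims grounded in context:
print(toy_faithfulness([True, True, True, False]))               # 0.75
# Retrieval returned 5 documents, only 2 of them relevant:
print(toy_context_precision([True, False, True, False, False]))  # 0.4
```

A hallucinating generator drags the first score down even when retrieval is perfect; noisy retrieval drags the second down even when the answer is correct, which is what makes the two failure modes separable.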
Technical notes
- License: Apache 2.0 (open source)
- GitHub: github.com/explodinggradients/ragas (7K+ stars)
- Install: pip install ragas
- Evaluation LLM: GPT-4 (default); any LangChain-compatible LLM
- Frameworks: LangChain, LlamaIndex, custom RAG
- Output: Per-metric 0–1 scores; dataset-level aggregate statistics
- Pricing: Free, open-source library; the managed Ragas Cloud is priced separately
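A minimal evaluation run looks roughly like the sketch below, assuming ragas ~0.1 with its Hugging Face `datasets` input format and an OPENAI_API_KEY in the environment for the default GPT-4 judge; exact import paths vary between ragas versions, and the sample data is invented for illustration:

```python
# Test set in the column layout ragas expects:
# question / answer / contexts / ground_truth, one entry per test case.
samples = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris is the capital of France."],
}

def run_eval(samples):
    """Score the test set with the core RAGAS metrics (calls the judge LLM)."""
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    dataset = Dataset.from_dict(samples)
    return evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

# result = run_eval(samples)  # per-metric scores in [0, 1], one per test case
```

Note that `contexts` is a list of lists: each test case carries the full set of chunks the retriever returned for that question, which is what lets the context metrics judge retrieval separately from generation.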
Ideal for
- ML engineers building RAG applications who need to measure and improve retrieval and generation quality.
- Teams iterating on RAG pipelines who need automated evaluation to track improvement across iterations.
- Research teams comparing different embedding models, chunking strategies, or retrieval approaches quantitatively.
Not ideal for
- Evaluating non-RAG LLM applications — use TruLens or Promptfoo for general LLM evaluation.
- Real-time monitoring of production LLM calls — RAGAS is designed for offline batch evaluation.
- Replacing human evaluation entirely — LLM-as-a-judge has known biases and limitations for subjective quality.