Why it matters
- Open-source and self-hosted means no data privacy concerns — test cases and outputs never leave your infrastructure.
- CLI-first design integrates natively with CI/CD pipelines — run `promptfoo eval` as a GitHub Actions step on every PR.
- Multi-model evaluation catches provider-specific quality regressions before deployment — test GPT-4 and Claude simultaneously.
- Automated red-teaming finds safety issues proactively rather than waiting for users to discover them in production.
Key capabilities
- Multi-model evaluation: Run the same test suite against GPT-4o, Claude 3.5, Llama 3, Gemini, and more simultaneously.
- YAML test definition: Define inputs, expected outputs, and grading criteria in readable YAML or JavaScript.
- Automated red-teaming: Generate hundreds of adversarial prompts targeting jailbreaks, injections, and safety failures.
- Grading: LLM-as-judge scoring, regex matching, semantic similarity, and custom JavaScript graders.
- Web UI: Local browser-based comparison table showing all model responses side by side.
- CI/CD integration: `promptfoo eval` CLI command with JSON output for automated pipelines.
- Prompt caching: Cache LLM responses during development to save API costs on repeated test runs.
- Provider support: OpenAI, Anthropic, Cohere, Mistral, Hugging Face, AWS Bedrock, Vertex AI, Ollama, and more.
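As a sketch of the YAML test-definition format described above (the prompt, provider IDs, and assertion values here are illustrative; check the promptfoo docs for the exact model names your account supports):

```yaml
# promptfooconfig.yaml, a minimal sketch; model IDs and values are illustrative
prompts:
  - "Summarize the following text in one sentence: {{text}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "Promptfoo is an open-source CLI for evaluating LLM outputs."
    assert:
      - type: icontains          # simple substring match
        value: promptfoo
      - type: javascript         # custom JavaScript grader
        value: output.length < 300
      - type: llm-rubric         # LLM-as-judge scoring
        value: Is a single, accurate sentence
```

Running `promptfoo eval` against a config like this executes every test case against each listed provider and renders the side-by-side comparison in the local web UI.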
Technical notes
- License: MIT (open source)
- GitHub: github.com/promptfoo/promptfoo
- Install: `npm install -g promptfoo` or `npx promptfoo@latest`
- Languages: TypeScript/JavaScript (core); YAML (test definitions); Python via API
- Providers: OpenAI, Anthropic, Cohere, Mistral, HuggingFace, Bedrock, Vertex, Ollama, local models
- CI: GitHub Actions, GitLab CI, Jenkins integration
- Pricing: Free (self-hosted); Cloud tier for teams (paid)
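The GitHub Actions integration noted above can be sketched as a minimal workflow (the workflow layout and secret name are placeholders; promptfoo also publishes an official GitHub Action):

```yaml
# .github/workflows/eval.yml, a minimal sketch; the secret name is a placeholder
name: prompt-eval
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@latest eval -o results.json
```

The `-o results.json` flag writes machine-readable results, so a later step can fail the build or post a PR comment when a test regresses.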
Ideal for
- AI engineers who want CI/CD integration for LLM quality — automated eval on every PR that touches prompts.
- Security teams who need to red-team LLM applications for safety issues before deployment.
- Organizations evaluating which LLM provider gives the best quality/cost ratio for a specific use case.
Not ideal for
- Teams who need a polished cloud UI with team collaboration and production monitoring — Braintrust is better.
- Non-technical stakeholders who need to run evaluations without a CLI — Braintrust or LangSmith have better UIs.
- RAG-specific evaluation metrics (faithfulness, recall, precision) — RAGAS has specialized metrics for retrieval.
See also
- Braintrust — Cloud-based enterprise eval platform with stronger team collaboration.
- RAGAS — RAG evaluation focused on retrieval quality metrics (faithfulness, recall).
- TruLens — Alternative open-source eval library; stronger for LangChain/LlamaIndex pipelines.