Why it matters
- Open-source and self-hosted means no data privacy concerns — test cases and outputs never leave your infrastructure.
- CLI-first design integrates natively with CI/CD pipelines — run `promptfoo eval` as a GitHub Actions step on every PR.
- Multi-model evaluation catches provider-specific quality regressions before deployment — test GPT-4 and Claude simultaneously.
- Automated red-teaming finds safety issues proactively rather than waiting for users to discover them in production.
Key capabilities
- Multi-model evaluation: Run the same test suite against GPT-4o, Claude 3.5, Llama 3, Gemini, and more simultaneously.
- YAML test definition: Define inputs, expected outputs, and grading criteria in readable YAML or JavaScript.
- Automated red-teaming: Generate hundreds of adversarial prompts targeting jailbreaks, injections, and safety failures.
- Grading: LLM-as-judge scoring, regex matching, semantic similarity, and custom JavaScript graders.
- Web UI: Local browser-based comparison table showing all model responses side by side.
- CI/CD integration: `promptfoo eval` CLI command with JSON output for automated pipelines.
- Prompt caching: Cache LLM responses during development to save API costs on repeated test runs.
- Provider support: OpenAI, Anthropic, Cohere, Mistral, Hugging Face, AWS Bedrock, Vertex AI, Ollama, and more.
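As a sketch of the YAML test-definition format described above (the prompt, provider IDs, and assertion values here are illustrative; check the promptfoo docs for the exact model names your account supports):

```yaml
# promptfooconfig.yaml, a minimal sketch; model IDs and values are illustrative
prompts:
  - "Summarize the following text in one sentence: {{text}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "Promptfoo is an open-source CLI for evaluating LLM outputs."
    assert:
      - type: icontains          # simple substring match
        value: promptfoo
      - type: javascript         # custom JavaScript grader
        value: output.length < 300
      - type: llm-rubric         # LLM-as-judge scoring
        value: Is a single, accurate sentence
```

Running `promptfoo eval` against a config like this executes every test case against each listed provider and renders the side-by-side comparison in the local web UI.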
Technical notes
- License: MIT (open source)
- GitHub: github.com/promptfoo/promptfoo
- Install: `npm install -g promptfoo` or `npx promptfoo@latest`
- Languages: TypeScript/JavaScript (core); YAML (test definitions); Python via API
- Providers: OpenAI, Anthropic, Cohere, Mistral, HuggingFace, Bedrock, Vertex, Ollama, local models
- CI: GitHub Actions, GitLab CI, Jenkins integration
- Pricing: Free (self-hosted); Cloud tier for teams (paid)
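The GitHub Actions integration noted above can be sketched as a minimal workflow (the workflow layout and secret name are placeholders; promptfoo also publishes an official GitHub Action):

```yaml
# .github/workflows/eval.yml, a minimal sketch; the secret name is a placeholder
name: prompt-eval
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@latest eval -o results.json
```

The `-o results.json` flag writes machine-readable results, so a later step can fail the build or post a PR comment when a test regresses.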
Ideal for
- AI engineers who want CI/CD integration for LLM quality — automated eval on every PR that touches prompts.
- Security teams who need to red-team LLM applications for safety issues before deployment.
- Organizations evaluating which LLM provider gives the best quality/cost ratio for a specific use case.
Not ideal for
- Teams who need a polished cloud UI with team collaboration and production monitoring — Braintrust is better.
- Non-technical stakeholders who need to run evaluations without a CLI — Braintrust or LangSmith have better UIs.
- RAG-specific evaluation metrics (faithfulness, recall, precision) — RAGAS has specialized metrics for retrieval.
See also
- Braintrust — Cloud-based enterprise eval platform with stronger team collaboration.
- RAGAS — RAG evaluation focused on retrieval quality metrics (faithfulness, recall).
- TruLens — Alternative open-source eval library; stronger for LangChain/LlamaIndex pipelines.