Why it matters
- Broad model availability — 10,000+ models covering every major ML modality (text, image, audio, video, code), all accessible through one API account.
- Zero infrastructure overhead — no GPU servers, no Docker setup, no model serving code; run frontier models in 3 lines of Python.
- Pay-per-prediction with no minimums — ideal for low-volume or experimental use; no monthly subscription fee when you only need occasional model access.
- Cog framework for model publishing enables a community marketplace where new models appear within days of research publication.
Key capabilities
- 10,000+ models: Llama, SDXL, Flux, Whisper, ControlNet, and thousands more.
- REST API: Simple HTTP predictions; Python, Node.js, and HTTP clients.
- Custom models: Push your own models with Cog; public or private.
- Streaming: Real-time token streaming for LLMs.
- Webhooks: Get notified when predictions complete.
- Versions: Pin to specific model versions for reproducibility.
- Private models: Host models for your application without making them public.
- Fine-tuned models: Run SDXL DreamBooth and LoRA fine-tunes via API.
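The webhook capability above can be made concrete. Replicate signs webhook deliveries with an HMAC-SHA256 over `{webhook-id}.{webhook-timestamp}.{body}` using a `whsec_`-prefixed base64 signing secret (svix-style). A minimal verifier sketch, assuming that scheme; the function name and parameter names are illustrative, not part of the replicate client:

```python
import base64
import hashlib
import hmac

def verify_webhook(secret: str, webhook_id: str, timestamp: str,
                   body: str, signature_header: str) -> bool:
    """Check a webhook signature (svix-style HMAC-SHA256 scheme)."""
    # The signing secret is "whsec_" followed by the base64-encoded key
    key = base64.b64decode(secret.split("_", 1)[1])
    signed_content = f"{webhook_id}.{timestamp}.{body}".encode()
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    ).decode()
    # The signature header may list several space-separated "v1,<sig>" entries
    candidates = [part.split(",", 1)[-1] for part in signature_header.split()]
    return any(hmac.compare_digest(expected, c) for c in candidates)
```

Using `hmac.compare_digest` rather than `==` avoids timing side channels when comparing signatures.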
Technical notes
- API: REST; base URL https://api.replicate.com/v1; authenticate with a REPLICATE_API_TOKEN bearer token
- Python: pip install replicate
- Pricing: Per-second compute; no minimums; ~$0.003-0.008/SDXL image
- Hardware: T4, A40, A100 (model-dependent)
- Stars: 32K (Cog repo)
- Custom models: Cog framework (github.com/replicate/cog)
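A rough illustration of how per-second billing translates into per-prediction cost; the rate below is a hypothetical placeholder, not a quoted Replicate price:

```python
def prediction_cost(rate_per_second: float, runtime_seconds: float) -> float:
    """Per-second billing: you pay only for the seconds the model runs."""
    return rate_per_second * runtime_seconds

# Hypothetical example: an SDXL render taking ~8 s on hardware billed at $0.000725/s
cost = prediction_cost(0.000725, 8)
print(f"${cost:.4f}")  # falls inside the ~$0.003-0.008 per-image range above
```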
Usage example
import replicate

# Run SDXL image generation (requires REPLICATE_API_TOKEN in the environment)
output = replicate.run(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    input={"prompt": "a photorealistic fox in a forest, golden hour lighting"}
)
print(output)  # list of URL(s) to the generated image(s)

# Stream tokens from Llama 3.1
for event in replicate.stream(
    "meta/meta-llama-3.1-405b-instruct",
    input={"prompt": "Explain transformer attention in one paragraph"}
):
    print(str(event), end="")
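For long-running predictions, the API reports a status that moves through `starting` and `processing` to a terminal state (`succeeded`, `failed`, or `canceled`). A client-agnostic polling sketch, where `get_prediction` is any callable returning the latest prediction dict (a hypothetical helper, not part of the replicate client, which offers webhooks for this instead):

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}

def wait_for_prediction(get_prediction, poll_interval: float = 1.0,
                        timeout: float = 300.0) -> dict:
    """Poll until the prediction reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        prediction = get_prediction()
        if prediction["status"] in TERMINAL_STATUSES:
            return prediction
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError("prediction did not finish in time")
        time.sleep(poll_interval)
```

In production, webhooks (above) are preferable to polling since they avoid idle request loops.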
Ideal for
- Developers who need access to many different ML modalities (text, image, audio, video) through one API account.
- Researchers and hobbyists experimenting with new models without GPU infrastructure investment.
- Applications with moderate prediction volume where per-prediction pricing is more cost-effective than dedicated GPU.
Not ideal for
- High-volume single-model production workloads — at scale, dedicated GPU (Modal, Baseten) is significantly cheaper per inference.
- Applications requiring sub-200ms latency — Replicate's cold starts can be 30+ seconds for larger models.
- Fully managed inference with enterprise SLA — use Modal, Baseten, or Together AI for production reliability.
See also
- Modal — Python-native serverless GPU; better for custom model code at scale.
- Fireworks AI — Fast open-source LLM inference API; lower latency for text generation.
- HuggingFace — Model hub with hosted inference; similar model variety with Inference API.