Why it matters
- 500-800+ tokens/second makes conversations and agent responses feel instantaneous — the most noticeable UX improvement available for open-model applications.
- OpenAI-compatible API means Groq is a drop-in replacement for many applications — change the base URL and model name, keep existing code.
- Free tier with no credit card makes it the default choice for developers prototyping open-model applications.
- Function calling support via specialized Llama models enables building production-ready agents on Groq's fast inference.
Key capabilities
- Ultra-fast inference: 500-800+ tokens/second for supported models.
- Open models: Llama 3.1, Mixtral, Gemma, Mistral, and more.
- OpenAI-compatible API: Drop-in replacement for most OpenAI SDK code.
- Free tier: Daily rate-limited access to all models; no credit card required.
- Function calling: Groq-fine-tuned Llama models with tool use capability.
- Whisper: High-speed Whisper speech-to-text via the same API.
- Streaming: Real-time token streaming for chat applications.
- Low latency: First token in under 200ms for most requests.
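Function calling follows the OpenAI tools format: declare each tool as a JSON schema, pass the list via the tools parameter, then execute whatever the model requests. A minimal dispatch sketch, assuming a hypothetical get_weather tool (the tool, its schema, and the dispatch helper are illustrative, not part of Groq's API):

```python
import json

# Illustrative local tool the model may request.
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})

# Tool declaration in the OpenAI-compatible JSON-schema format;
# pass this list as tools=tools to chat.completions.create.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(name: str, arguments_json: str) -> str:
    """Route a model-requested tool call to the matching local function."""
    args = json.loads(arguments_json)
    if name == "get_weather":
        return get_weather(**args)
    raise ValueError(f"unknown tool: {name}")
```

In a real loop you would read name and arguments from each entry in response.choices[0].message.tool_calls and append dispatch()'s result back as a "tool" role message before calling the model again.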
Technical notes
- Hardware: Custom LPU (Language Processing Unit)
- Speed: 500-800+ tokens/second (varies by model and load)
- API: OpenAI-compatible REST; groq.com/docs
- Base URL: https://api.groq.com/openai/v1
- Free tier: Rate-limited; no credit card
- Python: pip install groq, or use the OpenAI SDK with a base_url override
- Models: Llama 3.1, Mixtral 8x7B, Gemma, Whisper
Usage example
from groq import Groq

client = Groq(api_key="YOUR_GROQ_API_KEY")

response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Explain the difference between TCP and UDP"}],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Ideal for
- Developers building real-time chat interfaces and AI agents where response latency directly impacts UX.
- Prototyping with open models without managing GPU infrastructure or paying for premium model APIs.
- Applications that can tolerate open-model quality (Llama 70B) but need speed comparable to or faster than GPT-4o.
Not ideal for
- Applications needing GPT-4-level reasoning — Llama 70B is capable but below frontier for complex tasks.
- High-volume production workloads — free-tier rate limits are restrictive, so evaluate paid plan limits carefully before committing.
- Custom or fine-tuned model hosting — Groq runs their supported model list only.
See also
- Cerebras Chat — Comparable ultra-fast open model inference on WSE hardware; alternative to Groq.
- Fireworks AI — Fast open-source LLM inference; more model variety, slightly lower speeds.
- Together AI — Another open-source model hosting provider; different speed/cost tradeoffs.