Why it matters
- Fine-tuning + inference in one platform lets teams go from raw open model to domain-specific production model without managing multiple vendors.
- 100+ model selection covers cutting-edge open models as they're released — Together AI adds new models quickly, often within days of a model's public release.
- OpenAI-compatible API means minimal code changes to switch from OpenAI to open models — change base URL and model name, keep existing logic.
- Competitive pricing makes large-scale open model deployment economical — e.g., Llama 3.1 70B at $0.88/M tokens vs. GPT-4o at $15/M output tokens for comparable tasks.
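The pricing gap above compounds quickly at scale. A back-of-envelope sketch using the per-million-token prices quoted above (illustrative only; verify against current price pages):

```python
# Rough monthly cost at a given token volume, per the prices quoted above.
PRICE_PER_MILLION = {
    "llama-3.1-70b (Together)": 0.88,
    "gpt-4o (output tokens)": 15.00,
}

def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars for `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million

tokens_per_month = 1_000_000_000  # 1B tokens/month
for name, price in PRICE_PER_MILLION.items():
    print(f"{name}: ${monthly_cost(tokens_per_month, price):,.2f}")
# llama-3.1-70b (Together): $880.00
# gpt-4o (output tokens): $15,000.00
```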
Key capabilities
- 100+ models: Llama 3.1 (8B, 70B, 405B), Mistral, Mixtral, Code Llama, Qwen, DBRX, Gemma, and more.
- OpenAI-compatible API: Drop-in replacement for most OpenAI SDK integrations.
- Fine-tuning: LoRA/QLoRA fine-tuning on custom datasets; deploy fine-tuned models.
- Dedicated deployments: Reserved GPU instances for consistent performance and privacy.
- Serverless inference: Pay-per-token with no idle costs.
- Embeddings: Vector embeddings via BAAI/bge and other embedding models.
- Image generation: SDXL and other image models alongside text models.
- Streaming: Real-time token streaming for chat applications.
Technical notes
- API: OpenAI-compatible REST at api.together.xyz/v1
- Python: pip install together, or use the OpenAI SDK with a base_url override
- Models: 100+ open-source text, code, image, embedding models
- Pricing: From $0.18/M tokens (Llama 8B) to $5/M tokens (Llama 405B)
- Fine-tuning: LoRA/QLoRA; deploy fine-tuned models via API
- Stars: 12K (together-python)
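The REST endpoint can also be called without any SDK. A standard-library sketch of the request shape, assuming the payload fields follow the OpenAI chat-completions format described above:

```python
import json
import urllib.request

API_URL = "https://api.together.xyz/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending with urllib.request.urlopen(build_request(...)) returns JSON with
# the same choices[0].message.content shape as the OpenAI SDK response.
```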
Usage example
from openai import OpenAI

# Together AI with the OpenAI SDK: only the API key and base URL change
client = OpenAI(
    api_key="YOUR_TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain the CAP theorem simply"}],
)
print(response.choices[0].message.content)
Ideal for
- Teams wanting to fine-tune open models on custom data and deploy them in production without managing GPU infrastructure.
- Cost-sensitive applications where GPT-4 pricing is unsustainable and open model quality is sufficient.
- Developers prototyping with many different open models to find the best fit for their use case.
Not ideal for
- Latency-sensitive real-time applications — use Groq for maximum speed.
- Teams needing GPT-4-class frontier reasoning — open models are capable but still below GPT-4o/Claude 3.5 Sonnet on complex tasks.
- Fully offline or air-gapped requirements — Together AI is cloud-only.
See also
- Groq — Ultra-fast open model inference; better for latency-critical applications.
- Fireworks AI — Another fast open-source LLM inference provider; competitive with Together AI.
- Replicate — Broader model variety including image/audio/video; different pricing model.