Why it matters
- Fine-tuning + inference in one platform lets teams go from raw open model to domain-specific production model without managing multiple vendors.
- 100+ model selection covers cutting-edge open models as they're released — Together AI adds new models quickly, often within days of a model's public release.
- OpenAI-compatible API means minimal code changes to switch from OpenAI to open models — change base URL and model name, keep existing logic.
- Competitive pricing makes large-scale open model deployment economical — e.g., Llama 3.1 70B at $0.88/M tokens vs. GPT-4o at $15/M output tokens for comparable tasks.
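The pricing gap above compounds quickly at scale. A back-of-envelope sketch using the per-million-token prices quoted above (illustrative only; verify against current price pages):

```python
# Rough monthly cost at a given token volume, per the prices quoted above.
PRICE_PER_MILLION = {
    "llama-3.1-70b (Together)": 0.88,
    "gpt-4o (output tokens)": 15.00,
}

def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars for `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million

tokens_per_month = 1_000_000_000  # 1B tokens/month
for name, price in PRICE_PER_MILLION.items():
    print(f"{name}: ${monthly_cost(tokens_per_month, price):,.2f}")
# llama-3.1-70b (Together): $880.00
# gpt-4o (output tokens): $15,000.00
```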
Key capabilities
- 100+ models: Llama 3.1 (8B, 70B, 405B), Mistral, Mixtral, Code Llama, Qwen, DBRX, Gemma, and more.
- OpenAI-compatible API: Drop-in replacement for most OpenAI SDK integrations.
- Fine-tuning: LoRA/QLoRA fine-tuning on custom datasets; deploy fine-tuned models.
- Dedicated deployments: Reserved GPU instances for consistent performance and privacy.
- Serverless inference: Pay-per-token with no idle costs.
- Embeddings: Vector embeddings via BAAI/bge and other embedding models.
- Image generation: SDXL and other image models alongside text models.
- Streaming: Real-time token streaming for chat applications.
Technical notes
- API: OpenAI-compatible REST at api.together.xyz/v1
- Python: pip install together, or use the OpenAI SDK with a base_url override
- Models: 100+ open-source text, code, image, embedding models
- Pricing: From $0.18/M tokens (Llama 8B) to $5/M tokens (Llama 405B)
- Fine-tuning: LoRA/QLoRA; deploy fine-tuned models via API
- Stars: 12K (together-python)
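The REST endpoint can also be called without any SDK. A standard-library sketch of the request shape, assuming the payload fields follow the OpenAI chat-completions format described above:

```python
import json
import urllib.request

API_URL = "https://api.together.xyz/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending with urllib.request.urlopen(build_request(...)) returns JSON with
# the same choices[0].message.content shape as the OpenAI SDK response.
```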
Usage example
from openai import OpenAI

# Together AI with the OpenAI SDK: only the API key and base URL change
client = OpenAI(
    api_key="YOUR_TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain the CAP theorem simply"}],
)
print(response.choices[0].message.content)
Ideal for
- Teams wanting to fine-tune open models on custom data and deploy them in production without managing GPU infrastructure.
- Cost-sensitive applications where GPT-4 pricing is unsustainable and open model quality is sufficient.
- Developers prototyping with many different open models to find the best fit for their use case.
Not ideal for
- Latency-sensitive real-time applications — use Groq for maximum speed.
- Teams needing GPT-4-class frontier reasoning — open models are capable but still below GPT-4o/Claude 3.5 Sonnet on complex tasks.
- Fully offline or air-gapped requirements — Together AI is cloud-only.
See also
- Groq — Ultra-fast open model inference; better for latency-critical applications.
- Fireworks AI — Another fast open-source LLM inference provider; competitive with Together AI.
- Replicate — Broader model variety including image/audio/video; different pricing model.