Overview
Llama 3.1 405B is Meta's largest openly released model, and its launch in July 2024 was a landmark moment for the AI industry. It was the first open-weight model to genuinely compete with GPT-4 and Claude 3.5 Sonnet on major benchmarks, scoring 88.6 on MMLU and 73.5 on MATH. Making this level of capability available under an open license, free to download and self-host, fundamentally changed what developers and organisations could build without vendor dependency.
GPT-4-Class, Fully Open
Before Llama 3.1 405B, reaching GPT-4-level intelligence meant accepting a closed API with usage restrictions, rate limits, and data policies you couldn't control. Llama 3.1 405B changed that calculation:
- No API fees: Run it yourself and the only cost is compute.
- No data sharing: Your prompts and outputs stay on your infrastructure.
- No rate limits: Scale to whatever your hardware supports.
- Full customisation: Fine-tune, quantise, and modify the weights for your specific use case.
- Llama 3.1 Community License: Commercial use permitted for most organisations; companies whose products exceed 700 million monthly active users must obtain a separate license from Meta.
128K Context Window
With a 131,072 token context window, Llama 3.1 405B handles lengthy documents, codebases, and extended conversations that smaller models would have to chunk and retrieve. This makes it practical for document-heavy workflows even when self-hosted.
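As a rough illustration of what that window buys, the sketch below estimates whether a document fits without chunking. The 4-characters-per-token ratio is a common English-text heuristic, not the model's actual tokenizer, and the output budget is an arbitrary choice; use the real tokenizer when precision matters.

```python
# Rough check that a document fits in Llama 3.1's 131,072-token window.
# CHARS_PER_TOKEN is a heuristic for English prose, not the real tokenizer.

CONTEXT_WINDOW = 131_072
CHARS_PER_TOKEN = 4

def fits_in_context(document: str, reserved_for_output: int = 4_096) -> bool:
    """Estimate whether `document` plus a response budget fits the window."""
    estimated_tokens = len(document) / CHARS_PER_TOKEN
    return estimated_tokens + reserved_for_output <= CONTEXT_WINDOW

# A ~200-page report (~400,000 characters) is roughly 100,000 tokens:
print(fits_in_context("x" * 400_000))  # True: ~100k tokens plus budget fits
print(fits_in_context("x" * 600_000))  # False: ~150k tokens exceeds the window
```

For production use, tokenize with the model's own tokenizer rather than estimating, since code and non-English text can deviate sharply from the 4-chars-per-token rule.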
Strong Multilingual Support
Llama 3.1 405B was trained on a multilingual dataset covering English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, with meaningful performance across all of these. This makes it a compelling option for internationally focused products that need strong non-English capability without sending data to external APIs.
Function Calling
Native function calling support enables the model to interact with external tools and APIs, making it suitable for agentic workflows — search, code execution, database queries, and multi-step task completion — all on your own infrastructure.
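A minimal sketch of how a tool might be exposed to the model, assuming your serving layer (vLLM or a managed provider) accepts the widely used OpenAI-compatible `tools` schema; the `search_documents` function, its parameters, and the model id are hypothetical placeholders:

```python
# Hypothetical tool definition in the OpenAI-compatible `tools` schema.
# The function name, parameters, and model id are illustrative, not real.

search_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search the internal document store and return top matches.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text"},
                "top_k": {"type": "integer", "description": "Number of results"},
            },
            "required": ["query"],
        },
    },
}

# Request body you would POST to a chat-completions endpoint:
request = {
    "model": "meta-llama/Llama-3.1-405B-Instruct",  # assumed model id
    "messages": [{"role": "user", "content": "Find our 2023 audit summary."}],
    "tools": [search_tool],
    "tool_choice": "auto",
}
```

When the model decides a tool is needed, the response contains a structured tool call (function name plus JSON arguments) for your code to execute; you then append the result as a tool message and call the model again.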
Self-Hosting Options
Running a 405B parameter model requires substantial hardware, but the ecosystem has matured significantly:
- vLLM: High-throughput serving framework, well-optimised for Llama models. Requires a multi-GPU node (e.g., 8× 80GB GPUs for the FP8 build; full-precision BF16 weights need roughly twice that memory).
- Together AI: Managed hosting that treats the model as an API with no infrastructure management.
- Fireworks AI: Another managed provider offering fast inference with pay-per-token pricing.
- Groq: Managed API built on Groq's ultra-low-latency LPU inference hardware, with Llama models among its supported options.
- Ollama: Local deployment of quantised Llama variants on a single machine with 64–128GB RAM; realistic for the 8B and 70B models, while even heavily quantised 405B builds need well over 200GB of memory.
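The hardware requirements above follow from simple arithmetic on the weight memory alone. The sketch below ignores KV cache, activations, and framework overhead, all of which add to the total:

```python
# Back-of-the-envelope weight-memory estimate for a 405B-parameter model.
# Ignores KV cache, activations, and serving overhead, which add more.

PARAMS = 405e9  # parameter count

def weights_gb(bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return PARAMS * bytes_per_param / 1e9

print(f"BF16: {weights_gb(2):.0f} GB")    # 810 GB: exceeds an 8x80GB node (640 GB)
print(f"FP8:  {weights_gb(1):.0f} GB")    # 405 GB: fits on one 8x80GB node
print(f"INT4: {weights_gb(0.5):.0f} GB")  # 203 GB: still far beyond a single GPU
```

This is why single-node deployments of the 405B typically use the FP8 build, while BF16 serving is usually spread across two nodes.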
When to Use 405B vs Smaller Models
The 405B is the right choice when:
- Task quality is paramount and compute cost is secondary.
- You need the most capable open model for fine-tuning (distillation target).
- You're building a product that cannot share data with external providers.
- You need GPT-4-class reasoning on a self-hosted stack.
For most production use cases, Llama 3.3 70B offers near-identical performance at a fraction of the cost.
Access
Download the weights from Hugging Face (requires accepting Meta's community license). Self-host with vLLM or llama.cpp (quantised), or use a managed provider such as Together AI or Fireworks AI for immediate API access.
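For the managed-provider route, a minimal stdlib-only sketch of calling an OpenAI-compatible chat endpoint follows. The base URL, model id, and environment-variable name are assumptions to verify against your provider's documentation; the request is only sent when an API key is configured:

```python
# Sketch of calling a managed provider's OpenAI-compatible chat endpoint.
# BASE_URL, the model id, and TOGETHER_API_KEY are assumed; check your
# provider's docs for the exact values.
import json
import os
import urllib.request

BASE_URL = "https://api.together.xyz/v1/chat/completions"  # assumed endpoint
payload = {
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",  # assumed id
    "messages": [
        {"role": "user", "content": "Summarise Llama 3.1 405B in one line."}
    ],
}

api_key = os.environ.get("TOGETHER_API_KEY")
if api_key:  # only send when a key is actually configured
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request shape works against a self-hosted vLLM server by pointing the base URL at your own endpoint, which is one practical benefit of providers converging on the OpenAI-compatible API.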