Overview

Grok 3 is xAI's frontier reasoning model, released in February 2025 after a rapid scaling push that leveraged xAI's Colossus supercluster — one of the largest AI training clusters ever assembled. The results were immediately apparent: Grok 3 achieved 93.3 on MMLU, 97.7 on MATH, and 84.6 on GPQA at launch, placing it at the very top of public benchmark rankings alongside the best models from OpenAI, Anthropic, and Google.

Colossus: The Training Infrastructure

The story behind Grok 3 begins with infrastructure. xAI built the Colossus supercluster in Memphis, Tennessee — a facility housing 100,000 NVIDIA H100 GPUs that came online in record time (reportedly assembled in 122 days from ground-up construction). This scale of compute enabled training runs that simply weren't possible on smaller clusters, which contributed directly to Grok 3's benchmark-leading performance.

The scale of Colossus demonstrates xAI's ambition and provides the infrastructure foundation for continued model improvement beyond Grok 3.

Benchmark Performance at Launch

Grok 3's benchmark scores at launch were among the highest reported for any publicly available model:

| Benchmark | Score | Context | |-----------|-------|---------| | MMLU | 93.3 | Top-tier broad knowledge | | MATH | 97.7 | Near-perfect mathematical reasoning | | GPQA | 84.6 | Graduate-level science, competitive with best models |

On AIME (American Invitational Mathematics Examination — a competition mathematics benchmark that separates truly strong reasoning models from the rest), Grok 3 posted competitive scores against o1 and DeepSeek R1, demonstrating that the reasoning capability is real and not just benchmark optimisation.

Think Mode: Extended Reasoning

Grok 3 includes a "Think" mode that enables extended chain-of-thought reasoning before producing a final answer. Like o1 and DeepSeek R1, this mode allows the model to:

Break down complex problems into steps before attempting to solve them.
Reconsider and backtrack when an approach isn't working.
Verify intermediate results before proceeding.
Produce more reliable answers on problems where quick intuition fails.

Think mode is particularly valuable for mathematical proofs, complex coding tasks, multi-step logical reasoning, and scientific problem-solving. Users can toggle between standard and Think mode depending on the task.

Integrated into Grok.com and X

Grok 3 powers the flagship Grok assistant available at Grok.com and within the X platform for Premium subscribers. This gives a large existing user base immediate access to frontier-model capability within a familiar interface. X Premium+ subscribers get access to Think mode for extended reasoning.

Real-Time X Data Access

Like Grok 2, Grok 3 retains access to real-time X/Twitter data, allowing it to answer questions about current events, trending topics, and live information — a capability that static-knowledge models lack regardless of their benchmark scores.

API Access

Available via the xAI API at $3 per million input tokens and $15 per million output tokens. The API is OpenAI-compatible, simplifying integration for developers already working with the OpenAI SDK.

Best Use Cases

Competitive mathematics and science: Problems at or near competition level where extended reasoning and deep knowledge matter.
Complex coding: Architecture design, algorithm optimisation, debugging subtle logical errors.
Research assistance: Graduate-level reasoning across STEM domains.
Real-time information tasks: Combining frontier intelligence with live X data access.
Extended reasoning workflows: Multi-step problems where Think mode can explore and verify before committing to an answer.

Context

131K

200K

128K

MMLU

93.3

90.1

88.7

HumanEval

—

93.5

93.7

90.2

MATH

97.7

93.7

78.3

76.6

GPQA

84.6

70.0

65.0

53.6

Pricing

Freemium

Input $/M

$3.00

$2.50

Overview

Colossus: The Training Infrastructure

The scale of Colossus demonstrates xAI's ambition and provides the infrastructure foundation for continued model improvement beyond Grok 3.

Benchmark Performance at Launch

Grok 3's benchmark scores at launch were among the highest reported for any publicly available model:

Think Mode: Extended Reasoning

Grok 3 includes a "Think" mode that enables extended chain-of-thought reasoning before producing a final answer. Like o1 and DeepSeek R1, this mode allows the model to:

Break down complex problems into steps before attempting to solve them.

Reconsider and backtrack when an approach isn't working.

Verify intermediate results before proceeding.

Produce more reliable answers on problems where quick intuition fails.

Best Use Cases

Competitive mathematics and science: Problems at or near competition level where extended reasoning and deep knowledge matter.

Complex coding: Architecture design, algorithm optimisation, debugging subtle logical errors.

Research assistance: Graduate-level reasoning across STEM domains.

Real-time information tasks: Combining frontier intelligence with live X data access.

Extended reasoning workflows: Multi-step problems where Think mode can explore and verify before committing to an answer.

Provider	xAI
Released	2025-02-17
Status	Current
Context window	131K tokens
Pricing	Freemium
Input price	$3.00/M
Output price	$15.00/M
Capabilities	textvisioncodefunction-callingreasoning
API docs	Docs ↗

Grok 3

Benchmarks

Overview

Colossus: The Training Infrastructure

Benchmark Performance at Launch

Think Mode: Extended Reasoning

Integrated into Grok.com and X

Real-Time X Data Access

API Access

Best Use Cases

Compare with similar models

Overview

Colossus: The Training Infrastructure

Benchmark Performance at Launch

Think Mode: Extended Reasoning

Integrated into Grok.com and X

Real-Time X Data Access

API Access

Best Use Cases