Overview
DeepSeek V3 is a 671-billion-parameter Mixture-of-Experts (MoE) language model released in December 2024 under the MIT license. Its benchmark performance (88.5 on MMLU, 91.6 on HumanEval, and 90.2 on MATH) places it firmly in GPT-4-class territory. What makes DeepSeek V3 remarkable is not just its performance but what it reveals about the economics of AI training: the model was reportedly trained for approximately $6 million, roughly one-tenth to one-fiftieth of what comparable models from US labs cost to train around the same period.
Mixture-of-Experts Architecture
DeepSeek V3 uses a Mixture-of-Experts (MoE) architecture, which is the key to both its efficiency and its capability. In a dense model, every parameter is activated for every token processed. In an MoE model, the parameters are divided into "expert" groups, and only a subset are activated for each token.
Specifically, DeepSeek V3 has:
- 671B total parameters: Comparable in scale to other large models.
- 37B activated parameters per token: Only ~5.5% of the total parameters are used for any given inference step.
This means inference cost is determined by the 37B activated parameters rather than the full 671B — giving MoE models a significant cost advantage over dense models of equivalent capacity. The trade-off is more complex training dynamics and larger storage requirements for the full model.
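The routing idea can be sketched in a few lines. The following is a toy top-k router, not DeepSeek V3's actual gating (the real model uses far more experts per layer, shared experts, and load-balancing machinery documented in its technical report); all sizes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Toy expert weights: each "expert" is a single linear layer.
experts = rng.standard_normal((n_experts, d_model, d_model))
# Router: scores each token vector against every expert.
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route one token vector through its top-k experts only."""
    logits = x @ router                      # one score per expert
    k_idx = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    gate = np.exp(logits[k_idx])
    gate /= gate.sum()                       # softmax over the selected experts
    # Only top_k of the n_experts networks actually run for this token,
    # which is why activated parameters << total parameters.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, k_idx))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (16,)
```

Scaling this picture up, a token in DeepSeek V3 touches only the experts its router selects, so the 37B activated parameters (not the 671B total) set the per-token compute bill.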
$6 Million Training Cost
DeepSeek published detailed training information in their technical report, including an estimated compute cost of approximately $6 million USD for the full training run (2.788 million H800 GPU-hours, priced at an assumed $2 per GPU-hour). For context:
- GPT-4 was estimated to cost over $100 million to train.
- Llama 3 405B, a dense model, reportedly consumed roughly 30 million H100 GPU-hours, an order of magnitude more compute than DeepSeek V3's run.
- Gemini Ultra training costs have not been disclosed but are assumed to be in the hundreds of millions.
Whether the $6M figure captures all costs (data preparation, research compute, failed runs) is a matter of debate, but the efficiency claims are supported by the model's architectural choices: MoE activations, efficient attention mechanisms, and training infrastructure optimisations documented in the technical report.
The implication is significant: frontier AI capabilities may be achievable at costs far below what Western labs have been spending, which changes the competitive landscape fundamentally.
MIT License
Like DeepSeek R1, DeepSeek V3 is released under the MIT license — one of the most permissive open-source licenses available. Full model weights are available for download, commercial use is explicitly permitted, and there are no restrictions on modification or redistribution beyond standard MIT terms.
Benchmark Performance
| Benchmark | DeepSeek V3 | Notes |
|-----------|-------------|-------|
| MMLU | 88.5 | GPT-4-class broad knowledge |
| HumanEval | 91.6 | Competitive with leading code models |
| MATH | 90.2 | Strong mathematical reasoning |
The HumanEval score of 91.6 is particularly notable — placing DeepSeek V3 among the best code-generation models available, competitive with Codestral and Claude 3.5 Sonnet on coding tasks.
Coding Capability
DeepSeek V3's strong coding benchmark performance reflects genuine capability in practice. The model handles:
- Complex algorithm implementation in Python, C++, JavaScript, and other languages.
- Multi-file code generation and refactoring.
- Debugging — identifying logic errors, runtime errors, and performance issues.
- Translating specifications into working implementations.
For teams that want a powerful open-weight code model that does not specialise at the expense of general capability, DeepSeek V3 is a strong choice.
Self-Hosting Considerations
At 671B total parameters, hosting the full model in BF16 requires approximately 1.3TB of GPU VRAM — practical only on dedicated multi-node GPU clusters. However:
- 4-bit quantised: ~350GB VRAM, which fits across 5× A100 80GB GPUs (400GB total), i.e. a single 8-GPU A100 node with headroom to spare.
- Via managed APIs: DeepSeek's own API, Together AI, and Fireworks provide managed access without infrastructure overhead.
- Community quantisations: The open-source community has produced GGUF and other formats for various quantisation levels via llama.cpp.
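The memory figures above follow directly from parameter count and precision. A quick back-of-envelope check, counting weights only (KV cache and activations add more):

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

full_bf16 = weight_gb(671, 16)  # BF16 = 2 bytes per parameter
quant_4bit = weight_gb(671, 4)  # 4-bit quantisation, before format overhead

# Quantised formats add metadata (per-block scales, zero-points), which is
# why practical 4-bit footprints land nearer the ~350GB quoted above.
print(f"BF16: {full_bf16:.0f} GB, 4-bit: {quant_4bit:.1f} GB")
```

This reproduces the ~1.3TB BF16 figure (1342 GB) and shows why 4-bit weights (~336 GB plus quantisation overhead) bring the model within reach of a single 8-GPU A100 node.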
For most teams, the managed API at $0.27 per million input tokens offers the most practical path — GPT-4-class performance at roughly 10–20× lower cost than GPT-4 API pricing.
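As a rough illustration of that pricing gap, the arithmetic below uses the $0.27 per million input tokens quoted above; the GPT-4-class comparison price and the monthly volume are assumed examples, since actual prices vary by model and date:

```python
deepseek_in = 0.27  # USD per million input tokens (figure quoted above)
gpt4_in = 5.00      # assumed GPT-4-class input price, for illustration only
tokens_m = 500      # example workload: 500M input tokens per month

deepseek_cost = deepseek_in * tokens_m
gpt4_cost = gpt4_in * tokens_m

print(f"DeepSeek:    ${deepseek_cost:,.2f}/month")
print(f"GPT-4-class: ${gpt4_cost:,.2f}/month ({gpt4_in / deepseek_in:.1f}x)")
```

At these assumed prices the ratio lands around 18×, consistent with the 10–20× range cited above; the exact multiple depends on which GPT-4-tier model and output-token rates you compare against.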
Best Use Cases
- High-quality code generation: Complex software development tasks where a strong open model is preferred over closed APIs.
- Cost-sensitive production workloads: Where GPT-4-class quality is needed at GPT-3.5-level API pricing.
- Self-hosted AI deployments: Organisations with the infrastructure to run large MoE models, wanting GPT-4 capability without vendor dependency.
- Research into efficient AI: Studying how MoE architectures achieve frontier performance at lower training cost.
- General text generation: Strong writing, summarisation, and analysis tasks across domains.