Overview
Gemini 2.0 Flash represents the next generation of Google's fast model family, building on the success of Gemini 1.5 Flash while adding significant new capabilities: the Multimodal Live API for real-time streaming interactions, native image and audio generation, and improved agentic tool use. It maintains the full 1,048,576 token context window while delivering better performance than its predecessor at a still-competitive price.
Faster Than 1.5 Flash
Gemini 2.0 Flash is faster than Gemini 1.5 Flash across the board, with lower latency and higher throughput. It also outperforms its predecessor on benchmark tasks while staying in the same fast, lightweight model class — Google's second-generation training improvements translate into meaningful gains in reasoning, instruction following, and multilingual performance.
The MMLU score of 82.0 represents a meaningful step up from 1.5 Flash's 79.9, achieved while maintaining the speed characteristics that make Flash models suitable for production workloads.
Multimodal Live API
The standout new capability in Gemini 2.0 Flash is the Multimodal Live API — a streaming interface that enables real-time, low-latency interactions with audio and video input. This unlocks a new class of applications:
- Real-time voice assistants: Bidirectional audio streaming with sub-second response times, enabling natural spoken conversation.
- Live video analysis: Stream a webcam or screen feed and have the model respond to what it sees in real time.
- Interactive coding assistants: Get immediate feedback as you type or speak, without waiting for a full round-trip.
- Accessibility tools: Real-time description of visual content for users who need it.
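The streaming pattern behind these use cases can be sketched with the `google-genai` Python SDK. Everything here is a sketch under assumptions: the package name, the `client.aio.live.connect` / `session.send` / `session.receive` call shapes, and the `gemini-2.0-flash-exp` model name follow the SDK's documented Live API surface at the time of writing, but may differ in your SDK version.

```python
import asyncio
import os

def build_live_config(modalities):
    """Build a Live API session config requesting the given response modalities."""
    allowed = {"TEXT", "AUDIO"}
    chosen = [m.upper() for m in modalities]
    if not set(chosen) <= allowed:
        raise ValueError(f"unsupported modality in {chosen}")
    return {"response_modalities": chosen}

async def run_text_session(prompt: str) -> str:
    """Open a Live session, send one text turn, and collect the streamed reply.

    Requires the google-genai SDK and a GEMINI_API_KEY environment variable.
    """
    from google import genai  # pip install google-genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    chunks = []
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp",  # illustrative model name
        config=build_live_config(["text"]),
    ) as session:
        await session.send(input=prompt, end_of_turn=True)
        async for response in session.receive():
            if response.text:
                chunks.append(response.text)
    return "".join(chunks)

if __name__ == "__main__":
    print(asyncio.run(run_text_session("Describe what you can do in one sentence.")))
```

Voice and video variants follow the same session shape, streaming audio or frame chunks into the session instead of a single text turn.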
Native Image and Audio Generation
Unlike earlier Gemini models, which delegated image and audio output to separate specialist models, Gemini 2.0 Flash can generate images and audio natively. This simplifies multimodal pipelines by reducing the number of API calls and model handoffs required.
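As a sketch of what "one model, one call" looks like in practice: a single `generate_content` request can ask for mixed text-and-image output by listing multiple response modalities in the generation config. The SDK call shape and the `gemini-2.0-flash-exp` model name are assumptions based on the public API surface and may need updating.

```python
import os

def image_request_config():
    """Generation config asking for interleaved text and image parts in one reply."""
    return {"response_modalities": ["TEXT", "IMAGE"]}

def generate_caption_and_image(prompt: str):
    """One call returns both text and inline image bytes -- no second model hop.

    Assumes the google-genai SDK and a GEMINI_API_KEY environment variable.
    """
    from google import genai  # pip install google-genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",  # illustrative model name
        contents=prompt,
        config=image_request_config(),
    )
    # Each returned part carries either .text or .inline_data (raw image bytes).
    return response.candidates[0].content.parts

if __name__ == "__main__":
    for part in generate_caption_and_image("A lighthouse at dusk, with a one-line caption."):
        print("text part" if part.text else "image part")
```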
Agentic Tool Use
Gemini 2.0 Flash has been specifically improved for agentic workflows — tasks where the model needs to plan, call tools, interpret results, and iterate. Native function calling is tighter and more reliable, making it well-suited for:
- Automated research agents: Browse, extract, and synthesise information from multiple sources.
- Code execution loops: Write code, run it, interpret errors, and fix them autonomously.
- Multi-step workflows: Chain together tool calls to complete complex tasks without human intervention at each step.
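The plan, call, interpret, iterate loop behind these workflows can be sketched with a toy driver. Note that `fake_model` and the `TOOLS` registry below are stand-ins invented for illustration, not part of the Gemini API; a real agent would declare tools via function-calling schemas and let the model choose among them.

```python
import json

# Hypothetical tool registry -- in a real agent these would be your own functions
# exposed to the model through function-calling declarations.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def fake_model(history):
    """Stand-in for the model: emits one tool call, then a final answer."""
    if not any(msg["role"] == "tool" for msg in history):
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    last = [m for m in history if m["role"] == "tool"][-1]
    return {"final": f"The result is {last['content']}."}

def run_agent(task, model, tools, max_steps=5):
    """Generic agentic loop: ask the model, execute tool calls, feed results back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(history)
        if "final" in reply:
            return reply["final"]
        call = reply["tool_call"]
        result = tools[call["name"]](**call["args"])
        history.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_steps")

# run_agent("What is 2 + 3?", fake_model, TOOLS) -> "The result is 5."
```

The loop itself is model-agnostic; swapping `fake_model` for a real function-calling client is the only change needed to drive it with Gemini 2.0 Flash.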
Best Use Cases
- Real-time voice and video applications: Any product requiring low-latency multimodal interaction.
- Agentic systems: Autonomous task completion with tool use and iterative reasoning.
- Production pipelines: High-volume, fast-turnaround processing with improved accuracy over 1.5 Flash.
- Interactive applications: Chat interfaces, live assistants, and co-pilots that need to feel responsive.
- Multimodal content generation: Combined text, image, and audio outputs from a single model.
Access
Gemini 2.0 Flash is available via the Google Gemini API, Google AI Studio, and Google Cloud Vertex AI. The Multimodal Live API uses a persistent streaming connection rather than one-shot requests, so it requires separate session setup — refer to the API documentation for implementation details.