Why it matters
- The gold standard for self-hosted LLM serving — widely used in research and production by companies serving millions of requests.
- PagedAttention achieves up to 24× higher throughput than HuggingFace Transformers on the same hardware.
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`) means any OpenAI SDK works out of the box — swap the base URL.
- Supports virtually every open-source LLM model family — if it's on HuggingFace, vLLM likely supports it.
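Because the server speaks the OpenAI wire format, even a plain stdlib HTTP request works. A minimal sketch, assuming a vLLM server is running locally on port 8000 (vLLM's default) and was launched with the model named below (an illustrative choice, not a requirement):

```python
import json
from urllib import request

# Payload follows the OpenAI /v1/chat/completions schema.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server was launched with
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # vLLM's default local address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With a server running, send it and read the reply:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any official OpenAI client library produces the same request; pointing its `base_url` at the server is all that changes.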
Key capabilities
- PagedAttention: Novel KV cache management that drastically increases throughput and reduces memory waste.
- Continuous batching: Dynamically batches incoming requests without blocking — processes new requests while others are mid-generation.
- OpenAI-compatible API: `/v1/completions` and `/v1/chat/completions` endpoints; swap `base_url` in any OpenAI client.
- Model support: Llama, Mistral, Mixtral, Gemma, Qwen, DeepSeek, Phi, Falcon, Command-R, BLOOM, and 200+ models.
- Quantization: GPTQ, AWQ, GGUF, FP8, INT8 quantization for fitting larger models in available VRAM.
- Multi-GPU tensor parallelism: Shard models across multiple GPUs for models too large for a single card.
- LoRA serving: Load and serve multiple LoRA adapters on top of a base model simultaneously.
- Speculative decoding: Speed up token generation by letting a smaller draft model propose tokens that the target model verifies in parallel.
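Several of the knobs above — tensor parallelism, quantization, batched generation — are exposed as constructor arguments in vLLM's offline Python API. A minimal sketch (the model name and sampling values are illustrative, and running it requires vLLM installed on a CUDA GPU):

```python
def generate_batch(prompts, model="mistralai/Mistral-7B-Instruct-v0.3", tp_size=1):
    """Offline batched generation with vLLM; needs vllm installed and a CUDA GPU."""
    # Deferred import: constructing LLM allocates GPU memory immediately.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=model,
        tensor_parallel_size=tp_size,  # shard weights across tp_size GPUs
        # quantization="awq",          # uncomment for an AWQ-quantized checkpoint
    )
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(prompts, params)  # vLLM batches the whole list internally
    return [out.outputs[0].text for out in outputs]
```

The same engine powers the HTTP server; the offline API is the direct route for batch jobs and evaluation harnesses.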
Technical notes
- Language: Python; requires CUDA (NVIDIA GPUs); AMD ROCm support experimental
- Installation: `pip install vllm` (requires CUDA toolkit and Python 3.9+)
- Memory: Minimum 16GB VRAM for 7B models; 80GB+ for 70B models without quantization
- API: OpenAI-compatible REST; also a Python API for direct integration
- License: Apache 2.0
- Hardware: NVIDIA A100, H100, A10G, L40S, and consumer-grade RTX cards (with reduced throughput)
- Founded/maintained: Created at UC Berkeley's Sky Computing Lab in 2023; now maintained by the vLLM project (independent OSS)
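The memory figures above follow from simple arithmetic: at FP16/BF16, each parameter takes 2 bytes, so the weights of a 7B model alone need roughly 13 GiB before any KV cache or activations. A back-of-the-envelope sketch:

```python
def weight_vram_gib(n_params_billion, bytes_per_param=2):
    """GiB needed just for model weights (2 bytes/param at FP16/BF16)."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

print(round(weight_vram_gib(7), 1))        # 7B in FP16: ~13 GiB of weights
print(round(weight_vram_gib(70), 1))       # 70B in FP16: ~130 GiB -> multiple 80GB GPUs
print(round(weight_vram_gib(70, 0.5), 1))  # 4-bit quantized 70B: ~33 GiB
```

This is why 16GB is the practical floor for a 7B model: the few GiB above the weights are exactly the headroom PagedAttention manages for the KV cache.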
Ideal for
- ML teams and companies self-hosting open-source LLMs who need maximum throughput from their GPU budget.
- Organizations building OpenAI-compatible APIs on open-source models (Llama, Mistral) for cost control or privacy.
- Researchers benchmarking or evaluating LLMs at scale who need a fast, configurable serving stack.
Not ideal for
- Local single-user inference on consumer laptops — llama.cpp/Ollama is more appropriate for CPU/M-series.
- Teams without GPU infrastructure — vLLM is built for NVIDIA CUDA GPUs (AMD ROCm support is experimental).
- Beginners seeking a simple UI to chat with models — use Open WebUI or LM Studio instead.
See also
- SGLang — Alternative LLM serving framework with structured generation support.
- Text Generation WebUI — Local LLM UI (oobabooga) better suited for consumer GPU setups.
- Ollama — Simple local LLM running on Apple Silicon and consumer hardware.