Why it matters
- The gold standard for self-hosted LLM serving — widely used in research and production by companies serving millions of requests.
- PagedAttention achieves up to 24× higher throughput than HuggingFace Transformers on the same hardware.
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`) means any OpenAI SDK works out of the box — swap the base URL.
- Supports virtually every open-source LLM model family — if it's on HuggingFace, vLLM likely supports it.
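Because the server speaks the OpenAI wire format, even a plain stdlib HTTP request works. A minimal sketch, assuming a vLLM server is running locally on port 8000 (vLLM's default) and was launched with the model named below (an illustrative choice, not a requirement):

```python
import json
from urllib import request

# Payload follows the OpenAI /v1/chat/completions schema.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server was launched with
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # vLLM's default local address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With a server running, send it and read the reply:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any official OpenAI client library produces the same request; pointing its `base_url` at the server is all that changes.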
Key capabilities
- PagedAttention: Novel KV cache management that drastically increases throughput and reduces memory waste.
- Continuous batching: Dynamically batches incoming requests without blocking — processes new requests while others are mid-generation.
- OpenAI-compatible API: `/v1/completions` and `/v1/chat/completions` endpoints; swap `base_url` in any OpenAI client.
- Model support: Llama, Mistral, Mixtral, Gemma, Qwen, DeepSeek, Phi, Falcon, Command-R, BLOOM, and 200+ models.
- Quantization: GPTQ, AWQ, GGUF, FP8, INT8 quantization for fitting larger models in available VRAM.
- Multi-GPU tensor parallelism: Shard models across multiple GPUs for models too large for a single card.
- LoRA serving: Load and serve multiple LoRA adapters on top of a base model simultaneously.
- Speculative decoding: Speed up token generation by letting a smaller draft model propose tokens that the target model verifies in parallel.
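Several of the knobs above — tensor parallelism, quantization, batched generation — are exposed as constructor arguments in vLLM's offline Python API. A minimal sketch (the model name and sampling values are illustrative, and running it requires vLLM installed on a CUDA GPU):

```python
def generate_batch(prompts, model="mistralai/Mistral-7B-Instruct-v0.3", tp_size=1):
    """Offline batched generation with vLLM; needs vllm installed and a CUDA GPU."""
    # Deferred import: constructing LLM allocates GPU memory immediately.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=model,
        tensor_parallel_size=tp_size,  # shard weights across tp_size GPUs
        # quantization="awq",          # uncomment for an AWQ-quantized checkpoint
    )
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(prompts, params)  # vLLM batches the whole list internally
    return [out.outputs[0].text for out in outputs]
```

The same engine powers the HTTP server; the offline API is the direct route for batch jobs and evaluation harnesses.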
Technical notes
- Language: Python; requires CUDA (NVIDIA GPUs); AMD ROCm support experimental
- Installation: `pip install vllm` (requires CUDA toolkit and Python 3.9+)
- Memory: Minimum 16GB VRAM for 7B models; 80GB+ for 70B models without quantization
- API: OpenAI-compatible REST; also a Python API for direct integration
- License: Apache 2.0
- Hardware: NVIDIA A100, H100, A10G, L40S, and consumer-grade RTX cards (with reduced throughput)
- Founded/maintained: Created at UC Berkeley's Sky Computing Lab in 2023; now maintained by the vLLM project (independent OSS)
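The memory figures above follow from simple arithmetic: at FP16/BF16, each parameter takes 2 bytes, so the weights of a 7B model alone need roughly 13 GiB before any KV cache or activations. A back-of-the-envelope sketch:

```python
def weight_vram_gib(n_params_billion, bytes_per_param=2):
    """GiB needed just for model weights (2 bytes/param at FP16/BF16)."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

print(round(weight_vram_gib(7), 1))        # 7B in FP16: ~13 GiB of weights
print(round(weight_vram_gib(70), 1))       # 70B in FP16: ~130 GiB -> multiple 80GB GPUs
print(round(weight_vram_gib(70, 0.5), 1))  # 4-bit quantized 70B: ~33 GiB
```

This is why 16GB is the practical floor for a 7B model: the few GiB above the weights are exactly the headroom PagedAttention manages for the KV cache.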
Ideal for
- ML teams and companies self-hosting open-source LLMs who need maximum throughput from their GPU budget.
- Organizations building OpenAI-compatible APIs on open-source models (Llama, Mistral) for cost control or privacy.
- Researchers benchmarking or evaluating LLMs at scale who need a fast, configurable serving stack.
Not ideal for
- Local single-user inference on consumer laptops — llama.cpp/Ollama is more appropriate for CPU/M-series.
- Teams without GPU infrastructure — vLLM is built for NVIDIA CUDA GPUs (AMD ROCm support is experimental).
- Beginners seeking a simple UI to chat with models — use Open WebUI or LM Studio instead.
See also
- SGLang — Alternative LLM serving framework with structured generation support.
- Text Generation WebUI — Local LLM UI (oobabooga) better suited for consumer GPU setups.
- Ollama — Simple local LLM running on Apple Silicon and consumer hardware.