vLLM

High-throughput LLM inference server with continuous batching and PagedAttention

LLM Frameworks · Free · Open source

vLLM is an open-source, high-throughput inference engine for large language models. Its PagedAttention algorithm partitions the KV cache into fixed-size blocks, much like virtual-memory paging, which cuts GPU memory fragmentation and enables 2–24× higher throughput than HuggingFace Transformers. It also exposes an OpenAI-compatible API, so it can stand in for the OpenAI endpoints in production serving.
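As a minimal sketch of how the OpenAI-compatible API is used in practice: launch the server with `vllm serve <model>`, then query it with the standard `openai` Python client pointed at the local endpoint. The model name below is illustrative (substitute whatever model you deploy), and the URL assumes vLLM's default port of 8000.

```python
# Start the server first (shell):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# The model name is illustrative; use the model you actually deploy.

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint.
# vLLM listens on port 8000 by default; no real API key is required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Because the wire format matches OpenAI's, existing client code typically needs only the `base_url` changed to switch to a self-hosted vLLM deployment.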

Key specs
42,000 GitHub stars (as of 2026-03-27)

Integrations

None listed.

Built on

None listed.