Why it matters
- Shared-base-model architecture makes multi-adapter serving 10-100× cheaper than running separate model instances.
- Solves the enterprise problem of deploying many department-specific or use-case-specific fine-tuned models at reasonable cost.
- Built by Predibase, whose team runs LoRA serving in production — not a research prototype but battle-tested infrastructure.
- Open source (Apache 2.0) means no vendor lock-in — self-host on any GPU cloud (AWS, GCP, Azure, RunPod, Lambda).
Key capabilities
- Multi-adapter serving: Load and serve 100+ LoRA adapters on a single GPU instance simultaneously.
- Dynamic adapter loading: Load adapters on-demand from Hugging Face Hub, local filesystem, or S3 — no restart required.
- Shared base model: All adapters share the same base model weights loaded once in GPU memory.
- Continuous batching: Batch requests across different adapters for maximum GPU utilization.
- PagedAttention: Memory-efficient KV cache management (from vLLM) for longer context and higher throughput.
- Streaming: Server-sent events (SSE) for token-by-token LLM output.
- OpenAI-compatible API: Drop-in replacement for OpenAI's /v1/chat/completions endpoint.
- Multi-GPU: Tensor parallelism across multiple GPUs for larger base models (13B, 70B).
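To illustrate the multi-adapter capability, here is a minimal sketch of how a client might build a chat-completions request for a specific adapter. It assumes the common LoRAX convention of selecting the adapter per request via the "model" field of the OpenAI-style payload; the adapter ID and message content below are hypothetical.

```python
import json

def chat_request(adapter_id: str, user_message: str, stream: bool = False) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload.

    adapter_id: a hypothetical LoRA adapter ID (e.g. a Hugging Face Hub
    repo); the server applies that adapter on top of the shared base model.
    """
    return {
        "model": adapter_id,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }

# Two requests naming different adapters can hit the same server and be
# continuously batched against the one copy of the base model weights.
req_support = chat_request("acme/customer-support-lora", "Where is my order?")
req_legal = chat_request("acme/legal-lora", "Summarize this clause.")
print(json.dumps(req_support))
```

Because only the "model" field differs between clients, switching an application from OpenAI to a self-hosted multi-adapter deployment is largely a base-URL change.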
Technical notes
- License: Apache 2.0 (open source)
- GitHub: github.com/predibase/lorax
- Base models: Llama 2/3, Mistral, Mixtral, Gemma, Phi, Falcon, and all PEFT-compatible models
- Adapter source: Hugging Face Hub, local path, S3
- API: OpenAI-compatible REST; Docker image available
- Hardware: NVIDIA GPU (A10G, A100, H100 recommended); supports multiple GPUs
- Deployment: Docker (ghcr.io/predibase/lorax); Helm chart for Kubernetes
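Since the API is OpenAI-compatible, the SSE stream can be consumed with generic code. Below is a minimal sketch of parsing such a stream, assuming the standard OpenAI chunk shape (each event is a `data: {...}` line carrying `choices[0].delta.content`, terminated by `data: [DONE]`); the canned lines are illustrative, not real server output.

```python
import json

def tokens_from_sse(lines):
    """Yield content tokens from OpenAI-style SSE event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Canned example stream (hypothetical chunks):
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(tokens_from_sse(stream)))  # → Hello
```

In a real client, the lines would come from the HTTP response body of a `stream: true` request rather than a list.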
Ideal for
- Organizations deploying multiple department-specific or use-case-specific fine-tuned models who need cost-effective serving.
- Teams with dozens of LoRA adapters trained for different tasks (customer support, legal, technical) who can't afford separate GPU instances per adapter.
- Infrastructure engineers who want open-source, self-hosted multi-LoRA serving without a managed platform.
Not ideal for
- Teams who just need a single fine-tuned model served — vLLM or TGI are simpler for single-adapter serving.
- Base model serving (no fine-tuning) — vLLM has better throughput optimization for pure base model inference.
- Teams who want managed infrastructure — Predibase (the company) offers LoRAX as a managed service.