Why it matters
- Shared-base-model architecture makes multi-adapter serving 10-100× cheaper than running separate model instances.
- Solves the enterprise problem of deploying many department-specific or use-case-specific fine-tuned models at reasonable cost.
- Built by Predibase, whose team runs LoRA serving in production — not a research prototype but battle-tested infrastructure.
- Open source (Apache 2.0) means no vendor lock-in — self-host on any GPU cloud (AWS, GCP, Azure, RunPod, Lambda).
Key capabilities
- Multi-adapter serving: Load and serve 100+ LoRA adapters on a single GPU instance simultaneously.
- Dynamic adapter loading: Load adapters on-demand from Hugging Face Hub, local filesystem, or S3 — no restart required.
- Shared base model: All adapters share the same base model weights loaded once in GPU memory.
- Continuous batching: Batch requests across different adapters for maximum GPU utilization.
- PagedAttention: Memory-efficient KV cache management (from vLLM) for longer context and higher throughput.
- Streaming: Server-sent events (SSE) for token-by-token LLM output.
- OpenAI-compatible API: Drop-in replacement for OpenAI's /v1/chat/completions endpoint.
- Multi-GPU: Tensor parallelism across multiple GPUs for larger base models (13B, 70B).
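To illustrate the multi-adapter capability, here is a minimal sketch of how a client might build a chat-completions request for a specific adapter. It assumes the common LoRAX convention of selecting the adapter per request via the "model" field of the OpenAI-style payload; the adapter ID and message content below are hypothetical.

```python
import json

def chat_request(adapter_id: str, user_message: str, stream: bool = False) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload.

    adapter_id: a hypothetical LoRA adapter ID (e.g. a Hugging Face Hub
    repo); the server applies that adapter on top of the shared base model.
    """
    return {
        "model": adapter_id,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }

# Two requests naming different adapters can hit the same server and be
# continuously batched against the one copy of the base model weights.
req_support = chat_request("acme/customer-support-lora", "Where is my order?")
req_legal = chat_request("acme/legal-lora", "Summarize this clause.")
print(json.dumps(req_support))
```

Because only the "model" field differs between clients, switching an application from OpenAI to a self-hosted multi-adapter deployment is largely a base-URL change.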
Technical notes
- License: Apache 2.0 (open source)
- GitHub: github.com/predibase/lorax
- Base models: Llama 2/3, Mistral, Mixtral, Gemma, Phi, Falcon, and all PEFT-compatible models
- Adapter source: Hugging Face Hub, local path, S3
- API: OpenAI-compatible REST; Docker image available
- Hardware: NVIDIA GPU (A10G, A100, H100 recommended); supports multiple GPUs
- Deployment: Docker (ghcr.io/predibase/lorax); Helm chart for Kubernetes
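Since the API is OpenAI-compatible, the SSE stream can be consumed with generic code. Below is a minimal sketch of parsing such a stream, assuming the standard OpenAI chunk shape (each event is a `data: {...}` line carrying `choices[0].delta.content`, terminated by `data: [DONE]`); the canned lines are illustrative, not real server output.

```python
import json

def tokens_from_sse(lines):
    """Yield content tokens from OpenAI-style SSE event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Canned example stream (hypothetical chunks):
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(tokens_from_sse(stream)))  # → Hello
```

In a real client, the lines would come from the HTTP response body of a `stream: true` request rather than a list.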
Ideal for
- Organizations deploying multiple department-specific or use-case-specific fine-tuned models who need cost-effective serving.
- Teams with dozens of LoRA adapters trained for different tasks (customer support, legal, technical) who can't afford separate GPU instances per adapter.
- Infrastructure engineers who want open-source, self-hosted multi-LoRA serving without a managed platform.
Not ideal for
- Teams who just need a single fine-tuned model served — vLLM or TGI are simpler for single-adapter serving.
- Base model serving (no fine-tuning) — vLLM has better throughput optimization for pure base model inference.
- Teams who want managed infrastructure — Predibase (the company) offers LoRAX as a managed service.