Why it matters
- Supports an unusually wide range of model formats (GGUF, GPTQ, AWQ, EXL2) — run quantized models regardless of the format they were released in.
- Granular control over generation: temperature, repetition penalty, top-p, min-p, top-k, CFG, and 20+ other parameters.
- Extensions ecosystem adds RAG, TTS, voice input, and image generation — making it a complete AI toolkit.
- A common choice for researchers comparing model behavior under specific generation settings.
Key capabilities
- Multi-format model support: GGUF (llama.cpp), GPTQ, AWQ, EXL2, GGML (legacy), and Transformers float16.
- Multiple backends: llama.cpp, ExLlamaV2, AutoGPTQ, AutoAWQ, and HuggingFace Transformers.
- Chat modes: Instruct mode, chat mode, and character roleplay with custom persona cards.
- Generation parameter control: 20+ sampling parameters — temperature, repetition penalty, top-p, DRY, Mirostat, CFG, etc.
- Extensions: Community extensions for TTS, voice input, image generation (SD), RAG, and more.
- LoRA loading: Apply LoRA adapters on top of base models for fine-tuned behavior.
- API server: OpenAI-compatible REST API + additional endpoints for Ooba-specific features.
- Training tab: Basic LoRA fine-tuning on custom datasets from the UI.
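The OpenAI-compatible API makes it easy to script against a running instance. A minimal sketch of building a chat-completion request, assuming the server is listening on its default `localhost:5000` (the endpoint path follows the OpenAI convention; the default parameters here are illustrative):

```python
import json
import urllib.request

API_URL = "http://localhost:5000/v1/chat/completions"  # default port; adjust if changed

def build_chat_request(prompt, temperature=0.7, max_tokens=256):
    """Build an OpenAI-style chat completion request (not yet sent)."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a server actually running, send it like this:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Because the payload shape matches OpenAI's, existing OpenAI client libraries can usually be pointed at the local base URL instead of hand-rolling requests like this.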
Technical notes
- Install: Python 3.11+; GPU drivers; install script handles environment setup
- Hardware: NVIDIA GPU (CUDA), AMD GPU (ROCm), Apple Silicon (MPS), or CPU
- License: AGPL-3.0 — open source; commercial use is permitted, but the AGPL requires sharing source when the software is offered over a network, which some commercial deployments avoid
- Backends: ExLlamaV2 (best for GPTQ/EXL2), llama.cpp (best for GGUF), Transformers (most compatible)
- API: OpenAI-compatible at localhost:5000; also SSE streaming
- Model download: Manual download from HuggingFace and copy to models/ folder (more complex than LM Studio)
- Maintained by: oobabooga (GitHub username); actively maintained community project
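The SSE streaming mode noted above delivers tokens as `data: {...}` lines in the OpenAI delta format. A rough parsing sketch using fabricated sample lines rather than a live connection (the exact delta schema should be verified against a running server):

```python
import json

def parse_sse_line(line):
    """Extract the text delta from one SSE 'data:' line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":  # OpenAI-style end-of-stream sentinel
        return None
    delta = json.loads(payload)["choices"][0].get("delta", {})
    return delta.get("content")

# Illustrative stream fragment (not captured from a real server):
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(t for t in (parse_sse_line(l) for l in sample) if t)
print(text)  # -> Hello
```

In a real client the lines would come from iterating over a streaming HTTP response instead of a list.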
Ideal for
- Power users and ML researchers who need precise control over model loading, quantization formats, and generation parameters.
- Developers building around local LLMs who want one of the most flexible and extensible local inference UIs available.
- Enthusiasts who want to experiment with LoRA loading, custom sampling strategies, and model comparison.
Not ideal for
- Beginners — complex installation process; LM Studio or GPT4All are far more approachable.
- CPU-only machines — performance is acceptable, but much of the backend and quantization flexibility goes unused without a GPU.
- Production API serving with many concurrent users — vLLM handles high concurrency better.