# Overview
Olla is a high-performance, low-overhead, low-latency proxy, model unifier and load balancer for managing LLM infrastructure.
It intelligently routes LLM requests across local and remote inference nodes - including Ollama, LM Studio, LiteLLM (100+ cloud providers), and OpenAI-compatible endpoints like vLLM. Olla provides model discovery and unified model catalogues across all providers, enabling seamless routing to available models on compatible endpoints.
With native LiteLLM support, Olla bridges local and cloud infrastructure: use local models when they are available and automatically fail over to cloud APIs when needed. Unlike orchestration platforms such as GPUStack, Olla focuses on making your existing LLM infrastructure reliable through intelligent routing and failover.
## Key Features
- Unified Model Registry: Unifies models registered across instances of the same type (e.g. Ollama or LM Studio)
- Dual Proxy Engines: Choose between Sherpa (simple, maintainable) and Olla (high-performance with advanced features)
- Intelligent Load Balancing: Priority-based, round-robin, and least-connections strategies
- Health Monitoring: Circuit breakers and automatic failover
- High Performance: Connection pooling, object pooling, and lock-free statistics
- Security: Built-in rate limiting and request validation
- Observability: Comprehensive metrics and request tracing
## Core Concepts
Understand these key concepts to get the most from Olla:
- Proxy Engines - Choose between Sherpa (simple) or Olla (high-performance) engines
- Proxy Profiles - Learn about different proxy behaviours for streaming or buffering
- Load Balancing - Distribute requests across multiple endpoints
- Model Routing - Different ways Olla routes traffic based on model availability & health
- Model Unification - Single catalogue of models across all your backends
- Health Checking - Automatic endpoint monitoring and intelligent failover
- Profile System - Customise backend behaviour without writing code
## Quick Start
Get up and running with Olla in minutes: download the latest release for your platform from the GitHub Releases page. Once Olla is running, you can verify it end to end with a request like the one sketched below.
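As a quick smoke test, send an OpenAI-style chat completion through the proxy. The sketch below is illustrative only: it assumes Olla is listening on localhost:40114, that the OpenAI-compatible route is exposed under `/olla/openai/v1/chat/completions`, and that a model named `llama3.2` exists in your unified catalogue - adjust these to match your configuration and the routing documentation.

```go
// Minimal sketch: send one chat completion through a running Olla instance.
// The host, port, route prefix and model name are assumptions - swap in the
// values from your own Olla configuration.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	payload := []byte(`{
	  "model": "llama3.2",
	  "messages": [{"role": "user", "content": "Say hello"}]
	}`)

	resp, err := http.Post(
		"http://localhost:40114/olla/openai/v1/chat/completions", // assumed route
		"application/json",
		bytes.NewReader(payload),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```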
## Response Headers
Olla provides detailed response headers for observability:
| Header | Description |
|---|---|
| `X-Olla-Endpoint` | Backend endpoint name |
| `X-Olla-Model` | Model used for the request |
| `X-Olla-Backend-Type` | Backend type (ollama/openai/lmstudio/vllm/litellm) |
| `X-Olla-Request-ID` | Unique request identifier |
| `X-Olla-Response-Time` | Total processing time |
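A client can read these headers on any proxied response to see which backend served it and how long it took. This sketch reuses the assumed host, port and route prefix from the quick-start example above; some headers (such as `X-Olla-Model`) may be empty on routes that do not target a specific model.

```go
// Sketch: print Olla's observability headers from a proxied response.
// Host, port and route are assumptions carried over from the quick-start example.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:40114/olla/openai/v1/models") // assumed route
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	for _, h := range []string{
		"X-Olla-Endpoint",
		"X-Olla-Model",
		"X-Olla-Backend-Type",
		"X-Olla-Request-ID",
		"X-Olla-Response-Time",
	} {
		fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
	}
}
```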
## Why Olla?
- Production Ready: Built for high-throughput production environments
- Flexible: Works with any OpenAI-compatible endpoint
- Observable: Rich metrics and tracing out of the box
- Reliable: Circuit breakers and automatic failover
- Fast: Optimised for minimal latency and maximum throughput
See how Olla compares to LiteLLM, GPUStack and LocalAI in our comparison guide.
## Next Steps
- Installation Guide - Get Olla installed
- Quick Start - Basic setup and configuration
- Architecture Overview - Understand how Olla works
- Configuration Reference - Complete configuration options