# Overview
Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, natively supports a wide variety of endpoints, and is extensible enough to support others. Olla provides model discovery and a unified model catalogue within each provider, enabling seamless routing to available models on compatible endpoints.
Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability or Olla for maximum performance with advanced features like circuit breakers and connection pooling.
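Because Olla exposes OpenAI-compatible routes, pointing an existing client at it is typically just a base-URL change; Olla then selects a healthy backend that serves the requested model. The sketch below is illustrative only: the port (40114), the `/olla/openai/v1` path prefix and the model name are assumptions about a default local setup, so adjust them to match your configuration.

```python
# Minimal sketch: point an OpenAI-compatible client at Olla instead of a
# single backend; Olla routes the request to a healthy endpoint.
# The base URL (port and /olla/openai/v1 prefix) and model name are
# assumptions -- check your Olla configuration for the actual values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/openai/v1",  # assumed Olla address
    api_key="not-needed-for-local-backends",           # placeholder key
)

response = client.chat.completions.create(
    model="llama3.2",  # any model available on a healthy backend
    messages=[{"role": "user", "content": "Hello from behind Olla!"}],
)
print(response.choices[0].message.content)
```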
## Key Features
- Unified Model Registry: Unifies models registered across instances of the same type (e.g. Ollama or LM Studio) — see the model-listing sketch after this list
- Dual Proxy Engines: Choose between Sherpa (simple, maintainable) and Olla (high-performance with advanced features)
- Intelligent Load Balancing: Priority-based, round-robin, and least-connections strategies
- Health Monitoring: Circuit breakers and automatic failover
- High Performance: Connection pooling, object pooling, and lock-free statistics
- Security: Built-in rate limiting and request validation
- Observability: Comprehensive metrics and request tracing
- API Translation: Anthropic Messages API support for Claude-compatible clients
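For example, the unified model registry can be inspected through an OpenAI-compatible model listing. This is a hedged sketch, not a definitive reference: the base URL and path prefix are assumptions about a default local setup.

```python
# Sketch: list the unified model catalogue that Olla aggregates across
# backends. The base URL and path prefix are assumptions; adjust them
# to match your deployment.
import requests

OLLA_BASE = "http://localhost:40114"  # assumed default address

resp = requests.get(f"{OLLA_BASE}/olla/openai/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model["id"])
```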
## Core Concepts
Understand these key concepts to get the most from Olla:
- Proxy Engines - Choose between Sherpa (simple) or Olla (high-performance) engines
- Proxy Profiles - Learn about different proxy behaviours for streaming or buffering
- Load Balancing - Distribute requests across multiple endpoints
- Model Routing - Different ways Olla routes traffic based on model availability & health
- Model Unification - Single catalogue of models across all your backends
- Health Checking - Automatic endpoint monitoring and intelligent failover
- Profile System - Customise backend behaviour without writing code
## Quick Start
Get up and running with Olla in minutes:
Visit GitHub Releases to download the latest release for your platform.
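Once Olla is running, a quick smoke test confirms it is responding before you wire up clients. The snippet below assumes the default address and an `/internal/health` path; both are assumptions, so consult the Installation Guide for the exact values in your version.

```python
# Quick smoke test: confirm a freshly started Olla instance is responding.
# The address and health path are assumptions about a default setup.
import requests

resp = requests.get("http://localhost:40114/internal/health", timeout=5)
print(resp.status_code, resp.text)
```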
## Response Headers
Olla provides detailed response headers for observability:
| Header | Description |
|---|---|
| X-Olla-Endpoint | Backend endpoint name |
| X-Olla-Model | Model used for the request |
| X-Olla-Backend-Type | Backend type (ollama/openai/lmstudio/llamacpp/vllm/sglang/lemonade/litellm) |
| X-Olla-Request-ID | Unique request identifier |
| X-Olla-Response-Time | Total processing time |
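These headers are set on responses that pass through the proxy, so they can be read with any HTTP client. A hedged sketch follows, using the same assumed base URL and path prefix as earlier; only the header names themselves come from the table above.

```python
# Sketch: inspect Olla's observability headers on a proxied request.
# The base URL, path and model name are assumptions about a default setup.
import requests

resp = requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)

for header in ("X-Olla-Endpoint", "X-Olla-Model", "X-Olla-Backend-Type",
               "X-Olla-Request-ID", "X-Olla-Response-Time"):
    print(f"{header}: {resp.headers.get(header)}")
```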
## Why Olla?
- Production Ready: Built for high-throughput production environments
- Flexible: Works with any OpenAI-compatible endpoint
- Observable: Rich metrics and tracing out of the box
- Reliable: Circuit breakers and automatic failover
- Fast: Optimised for minimal latency and maximum throughput
See how Olla compares to LiteLLM, GPUStack and LocalAI in our comparison guide.
## Next Steps
- Installation Guide - Get Olla installed
- Quick Start - Basic setup and configuration
- Architecture Overview - Understand how Olla works
- Configuration Reference - Complete configuration options