Overview
Olla is a high-performance, low-overhead, low-latency proxy, model unifier and load balancer for managing LLM infrastructure.
It intelligently routes LLM requests across local and remote inference nodes - including Ollama, LM Studio and OpenAI-compatible endpoints like vLLM. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
Key Features¶
- Unified Model Registry: Unifies models registered across instances of the same type (e.g. Ollama or LM Studio)
- Dual Proxy Engines: Choose between Sherpa (simple, maintainable) and Olla (high-performance with advanced features)
- Intelligent Load Balancing: Priority-based, round-robin, and least-connections strategies
- Health Monitoring: Circuit breakers and automatic failover
- High Performance: Connection pooling, object pooling, and lock-free statistics
- Security: Built-in rate limiting and request validation
- Observability: Comprehensive metrics and request tracing
Core Concepts¶
Understand these key concepts to get the most from Olla:
- Proxy Engines - Choose between Sherpa (simple) or Olla (high-performance) engines
- Load Balancing - Distribute requests across multiple endpoints with priority, round-robin, or least-connections
- Model Unification - Single catalogue of models across all your backends
- Health Checking - Automatic endpoint monitoring and intelligent failover
- Profile System - Customise backend behaviour without writing code
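Concepts such as engine choice and load balancing come together in Olla's YAML configuration. The fragment below is an illustrative sketch only: the key names and structure shown are assumptions, so consult the Configuration Reference for the authoritative schema.

```yaml
# Hypothetical configuration sketch - verify key names against the
# Configuration Reference before use.
proxy:
  engine: sherpa          # or "olla" for the high-performance engine
  load_balancer: priority # or "round-robin" / "least-connections"
discovery:
  static:
    endpoints:
      - url: http://localhost:11434
        type: ollama
        priority: 100
```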
Quick Start¶
Get up and running with Olla in minutes.
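As a minimal sketch, once Olla is running you can send a request through its OpenAI-compatible route with curl. The host, port, route prefix and model name below are assumptions for illustration; substitute the values from your own deployment and the Configuration Reference.

```shell
# Send a chat completion through Olla's OpenAI-compatible route.
# Host, port, route and model name are assumptions - adjust for your setup.
curl http://localhost:40114/olla/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Olla selects a healthy backend that serves the requested model and proxies the response back, so clients need no knowledge of individual endpoints.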
Response Headers¶
Olla provides detailed response headers for observability:
| Header | Description |
|---|---|
| X-Olla-Endpoint | Backend endpoint name |
| X-Olla-Model | Model used for the request |
| X-Olla-Backend-Type | Backend type (ollama/openai/lmstudio) |
| X-Olla-Request-ID | Unique request identifier |
| X-Olla-Response-Time | Total processing time |
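You can inspect these headers on any proxied response with curl's `-i` flag. The host, port and route below are assumptions; point the request at your own Olla instance.

```shell
# Include response headers in the output and filter for the X-Olla-* fields.
# URL is an assumption - substitute your Olla host, port and route.
curl -si http://localhost:40114/olla/openai/v1/models | grep -i '^x-olla-'
```

The X-Olla-Request-ID value is useful for correlating a client request with Olla's logs and traces.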
Why Olla?¶
- Production Ready: Built for high-throughput production environments
- Flexible: Works with any OpenAI-compatible endpoint
- Observable: Rich metrics and tracing out of the box
- Reliable: Circuit breakers and automatic failover
- Fast: Optimised for minimal latency and maximum throughput
Next Steps¶
- Installation Guide - Get Olla installed
- Quick Start - Basic setup and configuration
- Architecture Overview - Understand how Olla works
- Configuration Reference - Complete configuration options