# Overview
Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, natively supports a wide variety of endpoints, and is extensible enough to support others. Olla provides model discovery and a unified model catalogue within each provider, enabling seamless routing to available models on compatible endpoints.
Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability or Olla for maximum performance with advanced features like circuit breakers and connection pooling.
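Because Olla exposes OpenAI-compatible routes, pointing an existing client at it is typically just a base-URL change; Olla then selects a healthy backend that serves the requested model. The sketch below is illustrative only: the port (40114), the `/olla/openai/v1` path prefix and the model name are assumptions about a default local setup, so adjust them to match your configuration.

```python
# Minimal sketch: point an OpenAI-compatible client at Olla instead of a
# single backend; Olla routes the request to a healthy endpoint.
# The base URL (port and /olla/openai/v1 prefix) and model name are
# assumptions -- check your Olla configuration for the actual values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/openai/v1",  # assumed Olla address
    api_key="not-needed-for-local-backends",           # placeholder key
)

response = client.chat.completions.create(
    model="llama3.2",  # any model available on a healthy backend
    messages=[{"role": "user", "content": "Hello from behind Olla!"}],
)
print(response.choices[0].message.content)
```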
## Key Features
- Unified Model Registry: Unifies models registered across instances of the same type (e.g. Ollama or LM Studio) — see the model-listing sketch after this list
- Dual Proxy Engines: Choose between Sherpa (simple, maintainable) and Olla (high-performance with advanced features)
- Intelligent Load Balancing: Priority-based, round-robin, and least-connections strategies
- Health Monitoring: Circuit breakers and automatic failover
- High Performance: Connection pooling, object pooling, and lock-free statistics
- Security: Built-in rate limiting and request validation
- Observability: Comprehensive metrics and request tracing
- API Translation: Anthropic Messages API support for Claude-compatible clients
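For example, the unified model registry can be inspected through an OpenAI-compatible model listing. This is a hedged sketch, not a definitive reference: the base URL and path prefix are assumptions about a default local setup.

```python
# Sketch: list the unified model catalogue that Olla aggregates across
# backends. The base URL and path prefix are assumptions; adjust them
# to match your deployment.
import requests

OLLA_BASE = "http://localhost:40114"  # assumed default address

resp = requests.get(f"{OLLA_BASE}/olla/openai/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model["id"])
```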
## Core Concepts
Understand these key concepts to get the most from Olla:
- Proxy Engines - Choose between Sherpa (simple) or Olla (high-performance) engines
- Proxy Profiles - Learn about different proxy behaviours for streaming or buffering
- Load Balancing - Distribute requests across multiple endpoints
- Model Routing - Different ways Olla routes traffic based on model availability & health
- Model Unification - Single catalogue of models across all your backends
- Health Checking - Automatic endpoint monitoring and intelligent failover
- Profile System - Customise backend behaviour without writing code
## Quick Start
Get up and running with Olla in minutes:
Visit GitHub Releases to download the latest release for your platform.
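Once Olla is running, a quick smoke test confirms it is responding before you wire up clients. The snippet below assumes the default address and an `/internal/health` path; both are assumptions, so consult the Installation Guide for the exact values in your version.

```python
# Quick smoke test: confirm a freshly started Olla instance is responding.
# The address and health path are assumptions about a default setup.
import requests

resp = requests.get("http://localhost:40114/internal/health", timeout=5)
print(resp.status_code, resp.text)
```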
## Response Headers
Olla provides detailed response headers for observability:
| Header | Description |
|---|---|
| X-Olla-Endpoint | Backend endpoint name |
| X-Olla-Model | Model used for the request |
| X-Olla-Backend-Type | Backend type (ollama/openai/lmstudio/llamacpp/vllm/sglang/lemonade/litellm) |
| X-Olla-Request-ID | Unique request identifier |
| X-Olla-Response-Time | Total processing time |
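These headers are set on responses that pass through the proxy, so they can be read with any HTTP client. A hedged sketch follows, using the same assumed base URL and path prefix as earlier; only the header names themselves come from the table above.

```python
# Sketch: inspect Olla's observability headers on a proxied request.
# The base URL, path and model name are assumptions about a default setup.
import requests

resp = requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)

for header in ("X-Olla-Endpoint", "X-Olla-Model", "X-Olla-Backend-Type",
               "X-Olla-Request-ID", "X-Olla-Response-Time"):
    print(f"{header}: {resp.headers.get(header)}")
```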
## Why Olla?
- Production Ready: Built for high-throughput production environments
- Flexible: Works with any OpenAI-compatible endpoint
- Observable: Rich metrics and tracing out of the box
- Reliable: Circuit breakers and automatic failover
- Fast: Optimised for minimal latency and maximum throughput
See how Olla compares to LiteLLM, GPUStack and LocalAI in our comparison guide.
## Next Steps
- Installation Guide - Get Olla installed
- Quick Start - Basic setup and configuration
- Architecture Overview - Understand how Olla works
- Configuration Reference - Complete configuration options