Olla - LLM Proxy & Load Balancer

Olla is a high-performance, low-overhead, low-latency proxy, model unifier and load balancer for managing LLM infrastructure.

It intelligently routes LLM requests across local and remote inference nodes - including Ollama, LM Studio, LiteLLM (100+ cloud providers), and OpenAI-compatible endpoints like vLLM. Olla provides model discovery and unified model catalogues across all providers, enabling seamless routing to available models on compatible endpoints.

With native LiteLLM support, Olla bridges local and cloud infrastructure: use local models when they're available and automatically fail over to cloud APIs when they're not. Unlike orchestration platforms such as GPUStack, Olla focuses on making your existing LLM infrastructure reliable through intelligent routing and failover.
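
In practice, a client points at Olla rather than at an individual backend, and Olla forwards the request to a healthy endpoint that hosts the requested model. The sketch below is illustrative only: it assumes the default port (40114) and an OpenAI-compatible route prefix of /olla/openai/, and uses a placeholder model name; see the model routing guide for the exact route layout.

# Illustrative sketch: an OpenAI-compatible chat request sent via Olla.
# The /olla/openai/ prefix and the model name are assumptions, not verified routes.
curl -s http://localhost:40114/olla/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello"}]}'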

Key Features

  • Unified Model Registry: Unifies models registered across instances of the same backend type (e.g. Ollama or LM Studio)
  • Dual Proxy Engines: Choose between Sherpa (simple, maintainable) and Olla (high-performance with advanced features)
  • Intelligent Load Balancing: Priority-based, round-robin, and least-connections strategies
  • Health Monitoring: Circuit breakers and automatic failover
  • High Performance: Connection pooling, object pooling, and lock-free statistics
  • Security: Built-in rate limiting and request validation
  • Observability: Comprehensive metrics and request tracing

Core Concepts

Understand these key concepts to get the most from Olla:

  • Proxy Engines - Choose between Sherpa (simple) or Olla (high-performance) engines
  • Proxy Profiles - Learn about different proxy behaviours for streaming or buffering
  • Load Balancing - Distribute requests across multiple endpoints
  • Model Routing - Different ways Olla routes traffic based on model availability & health
  • Model Unification - Single catalogue of models across all your backends
  • Health Checking - Automatic endpoint monitoring and intelligent failover
  • Profile System - Customise backend behaviour without writing code

Quick Start

Get up and running with Olla in minutes:

# Option 1: Install script (Linux/macOS)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)

# Option 2: Docker (if you have Ollama or LM Studio running locally)
docker run -t -p 40114:40114 ghcr.io/thushan/olla:latest

# Option 3: Install with Go
go install github.com/thushan/olla@latest
olla

# Option 4: Build from source
git clone https://github.com/thushan/olla.git
cd olla
make build-release
./olla
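
Once Olla is running (it listens on port 40114 by default), a quick smoke test from the shell confirms that requests reach a backend. The path below is an assumption for illustration; Olla exposes provider-prefixed routes, so check the routing documentation for the exact prefixes in your version.

# Hypothetical smoke test: list the models an Ollama backend exposes through the proxy.
# The /olla/ollama/ prefix is assumed; /api/tags is Ollama's native model-listing endpoint.
curl -s http://localhost:40114/olla/ollama/api/tags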

Response Headers

Olla provides detailed response headers for observability:

Header                  Description
X-Olla-Endpoint         Backend endpoint name
X-Olla-Model            Model used for the request
X-Olla-Backend-Type     Backend type (ollama/openai/lmstudio/vllm/litellm)
X-Olla-Request-ID       Unique request identifier
X-Olla-Response-Time    Total processing time
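
To see these headers on a live request, inspect a response with curl -i. This is a sketch: the route and the model name are placeholders for whichever backend and model you actually have behind Olla.

# Show response headers (-i) for a request routed through Olla.
# The /olla/ollama/api/generate route and the model name are illustrative placeholders.
curl -i http://localhost:40114/olla/ollama/api/generate \
  -d '{"model": "llama3.1", "prompt": "Hello", "stream": false}'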

Why Olla?

  • Production Ready: Built for high-throughput production environments
  • Flexible: Works with any OpenAI-compatible endpoint
  • Observable: Rich metrics and tracing out of the box
  • Reliable: Circuit breakers and automatic failover
  • Fast: Optimised for minimal latency and maximum throughput

See how Olla compares to LiteLLM, GPUStack and LocalAI in our comparison guide.

Next Steps

Community