
Olla - LLM Proxy & Load Balancer

Supported backends: Ollama (native), LM Studio (native), vLLM (native), Lemonade AI (OpenAI-compatible)

Olla is a high-performance, low-overhead, low-latency proxy, model unifier and load balancer for managing LLM infrastructure.

It intelligently routes LLM requests across local and remote inference nodes - including Ollama, LM Studio and OpenAI-compatible endpoints like vLLM. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
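
For example, once Olla is running (default port 40114, as shown in the Quick Start below), a chat request can be sent through one of its provider-prefixed routes. This is a minimal sketch: the /olla/openai path and the model name are assumptions, so substitute a route and model from your own setup.

# Hypothetical request through an OpenAI-compatible route;
# the path prefix and model name are assumptions for illustration.
curl http://localhost:40114/olla/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello"}]}'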

Key Features

  • Unified Model Registry: Unifies models registered across instances of the same backend type (e.g. Ollama or LM Studio)
  • Dual Proxy Engines: Choose between Sherpa (simple, maintainable) and Olla (high-performance with advanced features)
  • Intelligent Load Balancing: Priority-based, round-robin, and least-connections strategies
  • Health Monitoring: Circuit breakers and automatic failover
  • High Performance: Connection pooling, object pooling, and lock-free statistics
  • Security: Built-in rate limiting and request validation
  • Observability: Comprehensive metrics and request tracing

Core Concepts

Understand these key concepts to get the most from Olla:

  • Proxy Engines - Choose between Sherpa (simple) or Olla (high-performance) engines
  • Load Balancing - Distribute requests across multiple endpoints with priority, round-robin, or least-connections
  • Model Unification - Single catalogue of models across all your backends
  • Health Checking - Automatic endpoint monitoring and intelligent failover
  • Profile System - Customise backend behaviour without writing code

Quick Start

Get up and running with Olla in minutes:

# Option 1: Docker (if you have Ollama or LM Studio running locally)
docker run -t -p 40114:40114 ghcr.io/thushan/olla:latest

# Option 2: install with Go
go install github.com/thushan/olla@latest
olla

# Option 3: build from source
git clone https://github.com/thushan/olla.git
cd olla
make build-release
./olla
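
Once Olla is running, confirm it is reachable and serving models. The route below is an assumption following the provider-prefixed pattern described earlier; adjust it to the backend prefix you use.

# List the models visible through the proxy (path is an assumption)
curl http://localhost:40114/olla/openai/v1/models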

Response Headers

Olla provides detailed response headers for observability:

Header                  Description
X-Olla-Endpoint         Backend endpoint name
X-Olla-Model            Model used for the request
X-Olla-Backend-Type     Backend type (ollama/openai/lmstudio)
X-Olla-Request-ID       Unique request identifier
X-Olla-Response-Time    Total processing time
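
To see these headers on a live request, dump only the response headers with curl. The endpoint path is illustrative; any proxied route should carry the X-Olla-* headers.

# Print response headers for a proxied request and filter for Olla's
curl -s -D - -o /dev/null http://localhost:40114/olla/openai/v1/models | grep -i '^x-olla'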

Why Olla?

  • Production Ready: Built for high-throughput production environments
  • Flexible: Works with any OpenAI-compatible endpoint
  • Observable: Rich metrics and tracing out of the box
  • Reliable: Circuit breakers and automatic failover
  • Fast: Optimised for minimal latency and maximum throughput

Community