Load Balancing

📝 Default Configuration

proxy:
  load_balancer: "priority"  # priority, round-robin, or least-connections
Supported:

  • priority (default) - Routes to highest priority endpoint
  • round-robin - Cycles through endpoints evenly
  • least-connections - Routes to least busy endpoint

Environment Variable: OLLA_PROXY_LOAD_BALANCER

Olla provides multiple load balancing strategies to distribute requests across backend endpoints efficiently. Each strategy has specific use cases and characteristics that suit different deployment scenarios.

Overview

Load balancing determines which backend endpoint receives each incoming request. The strategy you choose affects:

  • Request distribution - How evenly load spreads across endpoints
  • Failover behaviour - How quickly traffic shifts from unhealthy endpoints
  • Resource utilisation - How efficiently backend resources are used
  • Response times - Latency based on endpoint selection

Available Strategies

Priority

Priority-based selection routes requests to the highest-priority healthy endpoint first.

How it works:

  1. Endpoints are assigned priority values (0-100; higher is preferred)
  2. Selects only from the highest-priority healthy tier
  3. Within the same tier, uses weighted random selection
  4. Automatically falls back to lower priorities if higher ones fail, as the sketch below illustrates
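
As a hedged illustration of steps 2-4, the selection could look like the Go sketch below. The Endpoint type, its fields, and pickPriority are assumptions made for this example, not Olla's internal API.

package balancer

import "math/rand"

// Endpoint is a hypothetical backend representation for these sketches;
// Olla's internal types may differ.
type Endpoint struct {
    Name     string
    Priority int     // 0-100, higher = preferred
    Weight   float64 // health-derived weight, e.g. 1.0 healthy, 0.7 degraded
    Healthy  bool
}

// pickPriority selects from the highest-priority healthy tier, using
// weighted random selection within that tier. Returning nil means no
// healthy endpoint exists at any priority.
func pickPriority(endpoints []Endpoint) *Endpoint {
    // Find the highest priority among healthy endpoints; unhealthy
    // tiers are skipped entirely, which gives the automatic fallback.
    best := -1
    for _, e := range endpoints {
        if e.Healthy && e.Priority > best {
            best = e.Priority
        }
    }
    if best < 0 {
        return nil
    }

    // Collect that tier and sum its weights.
    var tier []Endpoint
    var total float64
    for _, e := range endpoints {
        if e.Healthy && e.Priority == best {
            tier = append(tier, e)
            total += e.Weight
        }
    }

    // Weighted random pick within the tier.
    r := rand.Float64() * total
    for i := range tier {
        r -= tier[i].Weight
        if r <= 0 {
            return &tier[i]
        }
    }
    return &tier[len(tier)-1]
}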

Best for:

  • Production deployments with primary/backup servers
  • GPU vs CPU server hierarchies
  • Geographic or network-based preferences
  • Cost-optimised deployments (prefer cheaper endpoints)

Configuration:

proxy:
  load_balancer: "priority"

discovery:
  static:
    endpoints:
      - url: "http://gpu-server:11434"
        name: "primary-gpu"
        type: "ollama"
        priority: 100  # Highest priority

      - url: "http://backup-gpu:11434"
        name: "backup-gpu"
        type: "ollama"
        priority: 90   # Second choice

      - url: "http://cpu-server:11434"
        name: "cpu-fallback"
        type: "ollama"
        priority: 50   # Last resort

Round Robin

Distributes requests evenly across all healthy endpoints in sequence.

How it works:

  1. Maintains internal counter
  2. Selects next endpoint in rotation
  3. Skips unhealthy endpoints
  4. Wraps around at the end of the list (see the sketch below)
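
A minimal Go sketch of this rotation, reusing the hypothetical Endpoint type from the priority sketch; the RoundRobin type and Next method are illustrative, not Olla's internals.

package balancer

import "sync/atomic"

// RoundRobin cycles through endpoints, skipping unhealthy ones.
type RoundRobin struct {
    counter   atomic.Uint64
    endpoints []Endpoint // hypothetical Endpoint type from the earlier sketch
}

// Next returns the next healthy endpoint in rotation, or nil if none.
func (rr *RoundRobin) Next() *Endpoint {
    n := len(rr.endpoints)
    if n == 0 {
        return nil
    }
    for i := 0; i < n; i++ {
        // Advance the shared counter and wrap around the list.
        idx := int(rr.counter.Add(1) % uint64(n))
        if rr.endpoints[idx].Healthy {
            return &rr.endpoints[idx]
        }
        // Unhealthy endpoints are skipped; try the next position.
    }
    return nil
}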

Best for:

  • Homogeneous server deployments
  • Equal capacity endpoints
  • Development and testing
  • Simple load distribution

Configuration:

proxy:
  load_balancer: "round-robin"

discovery:
  static:
    endpoints:
      - url: "http://server1:11434"
        name: "ollama-1"
        type: "ollama"

      - url: "http://server2:11434"
        name: "ollama-2"
        type: "ollama"

      - url: "http://server3:11434"
        name: "ollama-3"
        type: "ollama"

Least Connections

Routes each request to the endpoint with the fewest active connections, which makes it well suited to long-running requests.

How it works:

  1. Tracks active connections per endpoint
  2. Selects the endpoint with the fewest connections
  3. Updates the count when a connection starts or ends (see the sketch below)
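
A hedged sketch of that connection tracking, again reusing the hypothetical Endpoint type; Olla's actual bookkeeping is internal and may differ.

package balancer

import "sync"

// LeastConnections tracks in-flight requests per endpoint and routes
// to the least busy one.
type LeastConnections struct {
    mu     sync.Mutex
    active map[string]int // endpoint name -> active connection count
}

// Acquire picks the healthy endpoint with the fewest active connections
// and increments its count; pair every call with Release.
func (lc *LeastConnections) Acquire(endpoints []Endpoint) *Endpoint {
    lc.mu.Lock()
    defer lc.mu.Unlock()
    if lc.active == nil {
        lc.active = make(map[string]int)
    }
    var chosen *Endpoint
    for i := range endpoints {
        e := &endpoints[i]
        if !e.Healthy {
            continue
        }
        if chosen == nil || lc.active[e.Name] < lc.active[chosen.Name] {
            chosen = e
        }
    }
    if chosen != nil {
        lc.active[chosen.Name]++
    }
    return chosen
}

// Release decrements the count when the connection ends.
func (lc *LeastConnections) Release(e *Endpoint) {
    lc.mu.Lock()
    defer lc.mu.Unlock()
    lc.active[e.Name]--
}

Because counts only fall when a request completes, slow or streaming responses naturally steer new traffic elsewhere.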

Best for:

  • Variable request durations
  • Streaming responses
  • Mixed model sizes
  • Preventing endpoint overload

Configuration:

proxy:
  load_balancer: "least-connections"

discovery:
  static:
    endpoints:
      - url: "http://fast-server:8000"
        name: "vllm-fast"
        type: "vllm"

      - url: "http://slow-server:8000"
        name: "vllm-slow"
        type: "vllm"

Strategy Comparison

Strategy            Distribution       Failover Speed   Best Use Case                    Complexity
Priority            Weighted by tier   Fast             Production with primary/backup   Low
Round Robin         Equal              Moderate         Homogeneous servers              Low
Least Connections   Dynamic            Fast             Variable workloads               Medium

Health-Aware Routing

All strategies respect endpoint health status:

Health States

Status       Routable   Weight   Description
Healthy      ✅ Yes     1.0      Fully operational
Degraded     ✅ Yes     0.7      Slow but working
Recovering   ✅ Yes     0.3      Coming back online
Unhealthy    ❌ No      0.0      Failed health checks
Unknown      ❌ No      0.0      Not yet checked
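
Expressed as code, the weight column could map onto selection like this hypothetical helper; the state strings and function name are illustrative, not Olla's API.

// healthWeight converts a health state into the routing weight shown in
// the table above; weighted strategies then factor it into selection.
func healthWeight(state string) float64 {
    switch state {
    case "healthy":
        return 1.0
    case "degraded":
        return 0.7
    case "recovering":
        return 0.3
    default: // unhealthy or unknown: never routed
        return 0.0
    }
}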

Circuit Breaker Integration

Olla includes built-in circuit breaker functionality to prevent cascading failures. The circuit breaker automatically manages endpoint health without requiring configuration.

Circuit states:

  • Closed - Normal operation, requests pass through
  • Open - Endpoint marked unhealthy after consecutive failures
  • Half-Open - Testing recovery with limited requests

Circuit Breaker Behaviour

The circuit breaker thresholds are currently hardcoded in the implementation. Configuration support may be added in future versions.
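
For illustration only, a minimal version of that state machine might look like the Go below; the fields and thresholds are placeholders (e.g. CircuitBreaker{maxFailures: 3, cooldown: 30 * time.Second}), since Olla's real values are hardcoded internally.

package balancer

import (
    "sync"
    "time"
)

// CircuitBreaker is a minimal closed/open/half-open state machine used
// only to illustrate the states above.
type CircuitBreaker struct {
    mu          sync.Mutex
    failures    int
    maxFailures int           // consecutive failures before opening
    openedAt    time.Time
    cooldown    time.Duration // how long to stay open before probing
    halfOpen    bool
}

// Allow reports whether a request may pass through.
func (cb *CircuitBreaker) Allow() bool {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if cb.failures < cb.maxFailures {
        return true // closed: normal operation
    }
    if time.Since(cb.openedAt) > cb.cooldown && !cb.halfOpen {
        cb.halfOpen = true
        return true // half-open: allow a single probe request
    }
    return false // open: endpoint treated as unhealthy
}

// Record updates the breaker with the outcome of a request.
func (cb *CircuitBreaker) Record(err error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err == nil {
        cb.failures = 0 // success closes the circuit
        cb.halfOpen = false
        return
    }
    cb.failures++
    if cb.failures >= cb.maxFailures {
        cb.openedAt = time.Now() // open, or re-open after a failed probe
        cb.halfOpen = false
    }
}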

Advanced Configurations

Multi-Tier Priority

Create sophisticated failover hierarchies:

discovery:
  static:
    endpoints:
      # Tier 1: Local GPU servers
      - url: "http://localhost:11434"
        name: "local-gpu"
        priority: 100

      - url: "http://192.168.1.10:11434"
        name: "lan-gpu"
        priority: 95

      # Tier 2: Cloud GPU servers
      - url: "https://gpu1.cloud.example.com"
        name: "cloud-gpu-1"
        priority: 80

      - url: "https://gpu2.cloud.example.com"
        name: "cloud-gpu-2"
        priority: 80

      # Tier 3: CPU fallbacks
      - url: "http://cpu-server:11434"
        name: "cpu-fallback"
        priority: 50

Model-Based Load Distribution

The load balancer works with model unification to distribute requests across endpoints that have the requested model available. When multiple endpoints have the same model, the configured load balancing strategy (priority, round-robin, or least-connections) determines which endpoint receives the request.
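
In outline, routing filters by model first and balances second. A hedged sketch, assuming the hypothetical Endpoint type gains a Models field; none of these names come from Olla's codebase.

// routeForModel narrows candidates to endpoints serving the requested
// model, then defers to the configured strategy (pick could be
// pickPriority from the earlier sketch).
func routeForModel(endpoints []Endpoint, model string,
    pick func([]Endpoint) *Endpoint) *Endpoint {
    var candidates []Endpoint
    for _, e := range endpoints {
        for _, m := range e.Models { // assumes a Models []string field
            if m == model {
                candidates = append(candidates, e)
                break
            }
        }
    }
    return pick(candidates)
}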

Geographic Distribution

Use priority for region-based routing:

endpoints:
  # US East - Primary
  - url: "https://us-east.example.com"
    name: "us-east-primary"
    priority: 100

  # US West - Secondary
  - url: "https://us-west.example.com"
    name: "us-west-secondary"
    priority: 90

  # EU - Tertiary
  - url: "https://eu.example.com"
    name: "eu-tertiary"
    priority: 70

Monitoring Load Distribution

Metrics Endpoints

Monitor load balancer effectiveness:

# Endpoint connection counts
curl http://localhost:40114/internal/status/endpoints

# Model routing statistics
curl http://localhost:40114/internal/stats/models

# Health status overview
curl http://localhost:40114/internal/health

Response Headers

Track routing decisions via headers:

X-Olla-Endpoint: gpu-server-1
X-Olla-Backend-Type: ollama
X-Olla-Request-ID: abc-123
X-Olla-Response-Time: 1.234s
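
To observe these from a Go client, something like the following works; the request path is a placeholder for whatever call Olla proxies in your setup.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Placeholder URL: substitute any request that Olla proxies.
    resp, err := http.Get("http://localhost:40114/")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Routing decision headers set by Olla on proxied responses.
    fmt.Println("endpoint:", resp.Header.Get("X-Olla-Endpoint"))
    fmt.Println("backend:", resp.Header.Get("X-Olla-Backend-Type"))
    fmt.Println("request:", resp.Header.Get("X-Olla-Request-ID"))
    fmt.Println("elapsed:", resp.Header.Get("X-Olla-Response-Time"))
}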

Logging

Enable debug logging for routing decisions:

logging:
  level: debug

# Logs show:
# - Endpoint selection reasoning
# - Health state changes
# - Circuit breaker transitions
# - Connection tracking

Performance Tuning

Connection Pooling

Optimise connection reuse:

proxy:
  connection_pool:
    max_idle_conns: 100
    max_idle_conns_per_host: 10
    idle_conn_timeout: 90s

Timeout Configuration

Balance reliability and responsiveness:

proxy:
  # Initial connection timeout
  connection_timeout: 10s

  # Request/response timeout
  response_timeout: 300s

  # Idle connection timeout
  idle_timeout: 90s

  # Health check specific
  health_check_timeout: 5s

Buffering Strategies

Choose appropriate proxy profile:

proxy:
  profile: "auto"  # auto, streaming, or standard

  # Auto profile adapts based on:
  # - Response size
  # - Streaming detection
  # - Content type

Troubleshooting

Uneven Distribution

Issue: Requests not spreading evenly

Diagnosis:

# Check endpoint stats
curl http://localhost:40114/internal/status/endpoints | jq '.endpoints[] | {name, requests}'

Solutions:

  • Verify all endpoints are healthy
  • Check priority values: with the priority strategy, only the highest healthy tier receives traffic
  • Review connection tracking for least-connections
  • Ensure round-robin counter isn't stuck

Slow Failover

Issue: Takes too long to stop using failed endpoints

Solutions:

health:
  check_interval: 5s      # Reduce for faster detection
  check_timeout: 2s       # Fail fast on timeouts

Circuit breaker thresholds are currently hardcoded (see above), so faster health checks are the main lever for quicker failover.

Connection Exhaustion

Issue: "Too many connections" errors

Solutions:

# Increase connection limits
proxy:
  connection_pool:
    max_idle_conns_per_host: 20  # Increase per-host limit

# Use least-connections to spread load
proxy:
  load_balancer: "least-connections"

Best Practices

1. Match Strategy to Deployment

  • Priority: Production with clear server tiers
  • Round Robin: Development or identical servers
  • Least Connections: Mixed workloads or streaming

2. Configure Health Checks

health:
  check_interval: 10s    # Balance between speed and load
  check_timeout: 5s      # Allow for model loading
  unhealthy_threshold: 3 # Avoid flapping

3. Set Appropriate Priorities

# Clear gaps between tiers
priority: 100  # Primary
priority: 75   # Secondary (clear gap)
priority: 50   # Tertiary (clear gap)
priority: 25   # Last resort

4. Monitor and Adjust

Regularly review:

  • Request distribution patterns
  • Endpoint utilisation
  • Response time percentiles
  • Error rates by endpoint

5. Plan for Failure

Always have:

  • At least one backup endpoint
  • Clear priority tiers
  • Appropriate circuit breaker settings
  • Monitoring and alerting

Integration Examples

Kubernetes Service

apiVersion: v1
kind: Service
metadata:
  name: ollama-backends
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
  type: ClusterIP
---
# Olla configuration
discovery:
  static:
    endpoints:
      - url: "http://ollama-backends:11434"
        name: "k8s-ollama"
        type: "ollama"

Note that a single ClusterIP Service appears to Olla as one endpoint: kube-proxy spreads connections across the pods behind it, so load balancing happens in Kubernetes rather than in Olla.

Docker Swarm

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    deploy:
      replicas: 3

  olla:
    image: thushan/olla
    environment:
      OLLA_PROXY_LOAD_BALANCER: "round-robin"

Consul Discovery

discovery:
  consul:
    enabled: true
    service_name: "ollama"
    health_check: true

proxy:
  load_balancer: "least-connections"

Next Steps