Integration Patterns

This guide shows how to combine Olla with other tools to build robust LLM infrastructure for different use cases.

Common Architectures

1. The Reliability Layer Pattern

Add Olla as a reliability layer on top of existing infrastructure:

Before:  Apps → LLM Endpoints (single point of failure)

After:   Apps → Olla → LLM Endpoints (automatic failover)
                   ├── Primary endpoint
                   ├── Secondary endpoint
                   └── Tertiary endpoint

Benefits:

  • Zero changes to existing endpoints
  • Instant failover capability
  • Gradual migration path

Example Config:

proxy:
  engine: sherpa  # Start simple
  load_balancer: priority

endpoints:
  - name: existing-setup
    url: http://current-llm:8080
    priority: 1
  - name: new-backup
    url: http://backup-llm:8080
    priority: 2

2. The Hybrid Cloud Pattern

Combine local and cloud resources intelligently:

         Olla
           ├── Local GPU (priority 1)
           ├── [LiteLLM](./litellm.md) → Cloud APIs (priority 10)
           └── [LocalAI](./localai.md)/[Ollama](https://github.com/ollama/ollama) (priority 2)

Use Cases:

  • Prefer local for privacy/cost
  • Overflow to cloud for capacity
  • Different models on different platforms

Example Config:

endpoints:
  - name: local-gpu
    url: http://localhost:11434
    priority: 1
    type: ollama

  - name: edge-server
    url: http://edge:11434
    priority: 2
    type: ollama

  - name: cloud-litellm
    url: http://litellm:8000
    priority: 10  # Only when local unavailable
    type: openai

3. The Multi-Tenant Pattern

Different teams/projects get different routing:

Team A Apps → Olla Config A → Team A Resources
Team B Apps → Olla Config B → Shared Resources + Team B
Production  → Olla Config C → Production Pool

Implementation:

# Run multiple Olla instances with different configs
olla -c team-a-config.yaml -p 8080
olla -c team-b-config.yaml -p 8081
olla -c production-config.yaml -p 8082
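
A per-team configuration file can stay minimal. The sketch below is illustrative only: the endpoint names and URLs are assumptions, and it uses only the options shown elsewhere in this guide.

# team-a-config.yaml (illustrative sketch)
proxy:
  engine: sherpa
  load_balancer: priority

endpoints:
  - name: team-a-gpu
    url: http://team-a-gpu:11434
    priority: 1
    type: ollama

  - name: shared-pool
    url: http://shared-llm:8080
    priority: 5

Team B and production follow the same structure with their own endpoint pools; each instance is then started with its own config file and port as shown above.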

4. The Geographic Distribution Pattern

Route to nearest endpoints with fallback:

     Global Olla
         ├── Sydney [GPUStack](./gpustack.md) (for ANZ users)
         ├── Singapore [LocalAI](./localai.md) (for APAC users)
         └── US [vLLM](https://github.com/vllm-project/vllm) (for Americas users)

Config with Regional Preferences:

endpoints:
  - name: syd-primary
    url: http://syd.internal:8080
    priority: 1  # Highest for local users

  - name: sing-secondary
    url: http://sing.internal:8080
    priority: 5  # Regional fallback

  - name: us-tertiary
    url: http://us.internal:8080
    priority: 10  # Last resort

Tool-Specific Integrations

Olla + LiteLLM

Option 1: LiteLLM for Cloud, Olla for Everything

# Olla manages routing, LiteLLM handles cloud APIs
endpoints:
  - name: local-models
    url: http://ollama:11434
    priority: 1

  - name: litellm-cloud
    url: http://litellm:8000
    priority: 5  # When local unavailable

Option 2: Redundant LiteLLM Instances

# Olla provides HA for LiteLLM
endpoints:
  - name: litellm-primary
    url: http://litellm1:8000
    priority: 1

  - name: litellm-backup
    url: http://litellm2:8000
    priority: 1  # Round-robin

Olla + GPUStack

Production GPU Cluster:

endpoints:
  # GPUStack managed cluster
  - name: gpustack-pool-a
    url: http://gpustack-a:8080
    priority: 1

  - name: gpustack-pool-b
    url: http://gpustack-b:8080
    priority: 1

  # Manual fallback
  - name: static-ollama
    url: http://backup:11434
    priority: 10

Olla + Ollama

Multi-Instance Ollama:

endpoints:
  - name: ollama-3090
    url: http://desktop:11434
    priority: 1  # Fastest GPU

  - name: ollama-m1
    url: http://macbook:11434
    priority: 2  # Fallback

  - name: ollama-cpu
    url: http://server:11434
    priority: 10  # Emergency only

Olla + LocalAI

OpenAI Compatibility Layer:

endpoints:
  - name: localai-primary
    url: http://localai:8080
    priority: 1
    type: openai

  - name: localai-backup
    url: http://localai2:8080
    priority: 2
    type: openai

Advanced Patterns

Circuit Breaker Pattern

The olla proxy engine implements circuit breakers automatically; you can tune the health checks that drive them:

# Olla engine provides circuit breakers
proxy:
  engine: olla  # Required for circuit breakers

health:
  interval: 10s
  timeout: 5s

# Circuit breaker opens after 5 failures
# Attempts recovery after 30s

Canary Deployment Pattern

Test new models/endpoints gradually:

endpoints:
  - name: stable-model
    url: http://stable:8080
    priority: 1

  - name: canary-model
    url: http://canary:8080
    priority: 10  # Low priority = less traffic

As confidence grows, gradually raise the canary's priority (lower its priority number) so it receives more traffic.
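
For example, once the canary has proven itself you might give it equal priority so it shares traffic with the stable endpoint (a sketch using the endpoints above):

endpoints:
  - name: stable-model
    url: http://stable:8080
    priority: 1

  - name: canary-model
    url: http://canary:8080
    priority: 1  # Promoted: now shares traffic with stable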

Model Specialisation Pattern

Route different model types to optimised endpoints:

# Embeddings to CPU-optimised endpoint
# LLMs to GPU endpoint
# Using path-based routing with different Olla configs
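
One way to realise this is to run separate Olla instances with specialised configs, as in the multi-tenant pattern above. The sketch below is illustrative: the file names, endpoint URLs and ports are assumptions, and only options covered earlier in this guide are used.

# embeddings-config.yaml - CPU-optimised endpoints
endpoints:
  - name: embeddings-cpu
    url: http://cpu-node:8080
    priority: 1
    type: openai

# llm-config.yaml - GPU endpoints
endpoints:
  - name: llm-gpu
    url: http://gpu-node:11434
    priority: 1
    type: ollama

Applications then send embedding requests to one instance and chat/completion requests to the other, for example olla -c embeddings-config.yaml -p 8083 and olla -c llm-config.yaml -p 8084.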

Chaos Engineering Pattern

Test resilience by randomly failing endpoints:

endpoints:
  - name: primary
    url: http://primary:8080
    priority: 1

  - name: chaos-endpoint
    url: http://chaos:8080  # Fails 10% of requests
    priority: 1  # Equal priority to test failover

  - name: backup
    url: http://backup:8080
    priority: 2

Production Best Practices

1. Start Simple, Evolve Gradually

# Phase 1: Basic failover
proxy:
  engine: sherpa
  load_balancer: priority

# Phase 2: Add circuit breakers
proxy:
  engine: olla
  load_balancer: priority

# Phase 3: Sophisticated routing
proxy:
  engine: olla
  load_balancer: least_connections

2. Monitor Everything

  • Use /internal/status for metrics (see the example after this list)
  • Set up alerts on circuit breaker trips
  • Monitor endpoint health scores
  • Track failover events
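
A simple probe of the status and health endpoints can feed dashboards or alerts. This is a minimal sketch: the endpoints are those referenced elsewhere in this guide, but the exact response fields depend on your Olla version, so adapt any parsing to what your instance actually returns.

# Poll Olla's health and status endpoints
curl -s http://localhost:8080/internal/health
curl -s http://localhost:8080/internal/status | jq .   # pretty-print if jq is installed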

3. Test Failure Scenarios

  • Kill endpoints during load
  • Simulate network issues
  • Test circuit breaker recovery
  • Verify model routing

4. Capacity Planning

# Reserve capacity with priorities
endpoints:
  - name: primary-pool
    priority: 1  # Handles most traffic

  - name: overflow-pool
    priority: 5  # Takes overflow when the primary pool is unavailable

  - name: emergency-pool
    priority: 10  # Only when needed

Docker Compose Examples

Development Setup

version: '3.8'
services:
  olla:
    image: thushan/olla:latest
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/config.yaml

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"

  litellm:
    image: ghcr.io/berriai/litellm:latest
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    ports:
      - "8000:8000"

Production Setup

version: '3.8'
services:
  olla:
    image: thushan/olla:latest
    deploy:
      replicas: 2  # HA Olla
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/internal/health"]
      interval: 10s

  # Multiple backend services...

Troubleshooting Integration Issues

Issue: Endpoints Not Being Discovered

# Ensure discovery is enabled
discovery:
  enabled: true
  interval: 30s

Issue: Circuit Breaker Too Aggressive

# Tune circuit breaker settings
health:
  interval: 10s  # Check more frequently
  timeout: 10s   # Allow more time

Issue: Load Not Distributing

# Check load balancer setting
proxy:
  load_balancer: round_robin  # For even distribution

Conclusion

Olla's strength lies in its ability to integrate with existing tools rather than replace them. Use these patterns as starting points and adapt them to your specific needs. Remember: the best architecture is one that solves your actual problems, not theoretical ones.