# Integration Patterns
This guide shows how to combine Olla with other tools to build robust LLM infrastructure for different use cases.
## Common Architectures

### 1. The Reliability Layer Pattern

Add Olla as a reliability layer on top of existing infrastructure:

```
Before: Apps → LLM Endpoints (single point of failure)

After:  Apps → Olla → LLM Endpoints (automatic failover)
                      ├── Primary endpoint
                      ├── Secondary endpoint
                      └── Tertiary endpoint
```
**Benefits:**

- Zero changes to existing endpoints
- Instant failover capability
- Gradual migration path

**Example Config:**

```yaml
proxy:
  engine: sherpa           # Start simple
  load_balancer: priority

endpoints:
  - name: existing-setup
    url: http://current-llm:8080
    priority: 1
  - name: new-backup
    url: http://backup-llm:8080
    priority: 2
```
### 2. The Hybrid Cloud Pattern

Combine local and cloud resources intelligently. Olla routes across:

- Local GPU (priority 1)
- [LiteLLM](./litellm.md) → Cloud APIs (priority 10)
- [LocalAI](./localai.md)/[Ollama](https://github.com/ollama/ollama) (priority 2)

**Use Cases:**

- Prefer local for privacy and cost
- Overflow to cloud for capacity
- Run different models on different platforms
**Example Config:**

```yaml
endpoints:
  - name: local-gpu
    url: http://localhost:11434
    priority: 1
    type: ollama
  - name: edge-server
    url: http://edge:11434
    priority: 2
    type: ollama
  - name: cloud-litellm
    url: http://litellm:8000
    priority: 10   # Only when local unavailable
    type: openai
```
### 3. The Multi-Tenant Pattern

Different teams and projects get different routing:

```
Team A Apps → Olla Config A → Team A Resources
Team B Apps → Olla Config B → Shared Resources + Team B
Production  → Olla Config C → Production Pool
```

**Implementation:**

```bash
# Run multiple Olla instances with different configs
olla -c team-a-config.yaml -p 8080
olla -c team-b-config.yaml -p 8081
olla -c production-config.yaml -p 8082
```
### 4. The Geographic Distribution Pattern

Route to the nearest endpoints, with fallback. Global Olla routes to:

- Sydney [GPUStack](./gpustack.md) (for ANZ users)
- Singapore [LocalAI](./localai.md) (for APAC users)
- US [vLLM](https://github.com/vllm-project/vllm) (for Americas users)

**Config with Regional Preferences:**

```yaml
endpoints:
  - name: syd-primary
    url: http://syd.internal:8080
    priority: 1    # Highest for local users
  - name: sing-secondary
    url: http://sing.internal:8080
    priority: 5    # Regional fallback
  - name: us-tertiary
    url: http://us.internal:8080
    priority: 10   # Last resort
```
## Tool-Specific Integrations

### Olla + LiteLLM

**Option 1: LiteLLM for Cloud, Olla for Everything**

```yaml
# Olla manages routing, LiteLLM handles cloud APIs
endpoints:
  - name: local-models
    url: http://ollama:11434
    priority: 1
  - name: litellm-cloud
    url: http://litellm:8000
    priority: 5   # When local unavailable
```

**Option 2: Redundant LiteLLM Instances**

```yaml
# Olla provides HA for LiteLLM
endpoints:
  - name: litellm-primary
    url: http://litellm1:8000
    priority: 1
  - name: litellm-backup
    url: http://litellm2:8000
    priority: 1   # Equal priority: round-robin
```
### Olla + GPUStack

**Production GPU Cluster:**

```yaml
endpoints:
  # GPUStack managed cluster
  - name: gpustack-pool-a
    url: http://gpustack-a:8080
    priority: 1
  - name: gpustack-pool-b
    url: http://gpustack-b:8080
    priority: 1
  # Manual fallback
  - name: static-ollama
    url: http://backup:11434
    priority: 10
```
### Olla + Ollama

**Multi-Instance Ollama:**

```yaml
endpoints:
  - name: ollama-3090
    url: http://desktop:11434
    priority: 1    # Fastest GPU
  - name: ollama-m1
    url: http://macbook:11434
    priority: 2    # Fallback
  - name: ollama-cpu
    url: http://server:11434
    priority: 10   # Emergency only
```
### Olla + LocalAI

**OpenAI Compatibility Layer:**

```yaml
endpoints:
  - name: localai-primary
    url: http://localai:8080
    priority: 1
    type: openai
  - name: localai-backup
    url: http://localai2:8080
    priority: 2
    type: openai
```
## Advanced Patterns

### Circuit Breaker Pattern

Olla implements circuit breakers automatically, but you can tune them:

```yaml
# The olla engine provides circuit breakers
proxy:
  engine: olla   # Required for circuit breakers

health:
  interval: 10s
  timeout: 5s
  # Circuit breaker opens after 5 failures
  # and attempts recovery after 30s
```
### Canary Deployment Pattern

Test new models and endpoints gradually:

```yaml
endpoints:
  - name: stable-model
    url: http://stable:8080
    priority: 1
  - name: canary-model
    url: http://canary:8080
    priority: 10   # Low priority = less traffic
```

Gradually increase the canary's priority as confidence grows.
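For instance, a fully promoted canary might end up alongside the stable endpoint (a sketch; the values are illustrative):

```yaml
endpoints:
  - name: stable-model
    url: http://stable:8080
    priority: 1
  - name: canary-model
    url: http://canary:8080
    priority: 1   # Promoted from 10: now takes a full share of traffic
```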
### Model Specialisation Pattern

Route different model types to endpoints optimised for them, as sketched below:

- Embeddings → a CPU-optimised endpoint
- LLMs → a GPU endpoint
- Implemented with path-based routing and different Olla configs
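A minimal sketch of the separate-config approach, reusing the CLI flags shown in the multi-tenant pattern above; the host names and config file names are hypothetical:

```yaml
# embeddings-config.yaml — CPU-optimised pool (hypothetical hosts)
endpoints:
  - name: embeddings-cpu
    url: http://cpu-box:8080
    priority: 1
    type: openai
```

```yaml
# llm-config.yaml — GPU pool (hypothetical hosts)
endpoints:
  - name: llm-gpu
    url: http://gpu-box:11434
    priority: 1
    type: ollama
```

```bash
# One Olla instance per workload; apps choose the port by model type
olla -c embeddings-config.yaml -p 8080
olla -c llm-config.yaml -p 8081
```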
### Chaos Engineering Pattern

Test resilience by randomly failing endpoints:

```yaml
endpoints:
  - name: primary
    url: http://primary:8080
    priority: 1
  - name: chaos-endpoint
    url: http://chaos:8080   # Fails 10% of requests
    priority: 1              # Equal priority to test failover
  - name: backup
    url: http://backup:8080
    priority: 2
```
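Olla doesn't inject failures itself; the chaos endpoint here would be a deliberately unreliable backend, for example a healthy service fronted by a fault-injecting proxy such as toxiproxy.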
## Production Best Practices

### 1. Start Simple, Evolve Gradually

```yaml
# Phase 1: Basic failover
proxy:
  engine: sherpa
  load_balancer: priority
```

```yaml
# Phase 2: Add circuit breakers
proxy:
  engine: olla
  load_balancer: priority
```

```yaml
# Phase 3: Sophisticated routing
proxy:
  engine: olla
  load_balancer: least_connections
```
### 2. Monitor Everything

- Use `/internal/status` for metrics
- Set up alerts on circuit breaker trips
- Monitor endpoint health scores
- Track failover events
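A minimal way to watch these by hand, assuming Olla is on its default port and the status endpoint returns JSON (`jq` used purely for readability):

```bash
# Snapshot Olla's view of endpoint health and traffic
curl -s http://localhost:8080/internal/status | jq .
```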
### 3. Test Failure Scenarios
- Kill endpoints during load
- Simulate network issues
- Test circuit breaker recovery
- Verify model routing
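A simple drill, assuming the Docker Compose development setup shown below (the `ollama` service name comes from that example):

```bash
# Drive load against Olla in one terminal, then in another:
docker compose stop ollama    # Olla should fail over to the next endpoint
sleep 30
docker compose start ollama   # Endpoint should return to the healthy pool
```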
### 4. Capacity Planning

```yaml
# Reserve capacity with priorities
endpoints:
  - name: primary-pool
    priority: 1    # 80% traffic
  - name: overflow-pool
    priority: 5    # 20% traffic
  - name: emergency-pool
    priority: 10   # Only when needed
```
## Docker Compose Examples

### Development Setup

```yaml
version: '3.8'

services:
  olla:
    image: thushan/olla:latest
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/config.yaml

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"

  litellm:
    image: ghcr.io/berriai/litellm:latest
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    ports:
      - "8000:8000"
```
### Production Setup

```yaml
version: '3.8'

services:
  olla:
    image: thushan/olla:latest
    deploy:
      replicas: 2   # HA Olla
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/internal/health"]
      interval: 10s

  # Multiple backend services...
```
## Troubleshooting Integration Issues

### Issue: Endpoints Not Being Discovered
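A couple of quick checks, assuming an Ollama backend and Olla on its default port (`/api/tags` is Ollama's model-listing endpoint):

```bash
# Is the backend reachable from where Olla runs?
curl -s http://ollama:11434/api/tags

# Which endpoints does Olla currently know about?
curl -s http://localhost:8080/internal/status
```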
### Issue: Circuit Breaker Too Aggressive

```yaml
# Tune circuit breaker settings
health:
  interval: 10s   # Check more frequently
  timeout: 10s    # Allow more time
```
### Issue: Load Not Distributing
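With the priority balancer, lower-priority endpoints may see little or no traffic while higher-priority ones are healthy. Giving endpoints equal priority, or switching balancers, spreads the load. A sketch, with illustrative names:

```yaml
proxy:
  load_balancer: least_connections   # or keep priority with equal values

endpoints:
  - name: pool-a
    url: http://a:8080
    priority: 1
  - name: pool-b
    url: http://b:8080
    priority: 1   # Equal priority so both receive traffic
```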
## Conclusion
Olla's strength lies in its ability to integrate with existing tools rather than replace them. Use these patterns as starting points and adapt them to your specific needs. Remember: the best architecture is one that solves your actual problems, not theoretical ones.