# Best Practices Overview
This guide provides recommended practices for deploying and operating Olla in production environments.
## Quick Checklist
Essential items for production deployment:
- Use the Olla engine for high-performance requirements
- Configure appropriate rate limits
- Set request size limits
- Enable health checking with reasonable intervals
- Use priority load balancing for cost optimisation
- Configure proper timeouts for your use case
- Set logging to appropriate level (info or warn)
- Monitor circuit breaker trips
- Use model unification for same-type endpoints
- Implement graceful shutdown handling
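Taken together, the checklist items above map to a configuration along these lines. This is a minimal sketch: the option names are drawn from the examples later in this guide, and the values are illustrative.

```yaml
server:
  read_timeout: 30s
  write_timeout: 0s      # required for streaming responses
  shutdown_timeout: 30s  # graceful shutdown window
  rate_limits:
    global_requests_per_minute: 10000
    per_ip_requests_per_minute: 100

proxy:
  engine: "olla"             # high-performance engine
  load_balancer: "priority"  # prefer local endpoints
  connection_timeout: 30s

logging:
  level: "info"
```

Each of these settings is covered in detail in the sections below.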
## Configuration Recommendations

### Engine Selection
Choose the right proxy engine for your needs:
```yaml
# Development or moderate load
proxy:
  engine: "sherpa"

# Production high-throughput
proxy:
  engine: "olla"
```
Sherpa is recommended for:
- Development environments
- Small to medium deployments
- When simplicity is preferred
- Load under 100 requests/second
Olla is recommended for:
- Production environments
- High-throughput requirements
- When performance is critical
- Load over 100 requests/second
### Load Balancer Strategy
Select based on your requirements:
| Strategy | Use When |
|---|---|
| priority | You have preferred endpoints (local vs cloud) |
| round-robin | All endpoints are equal and you want even distribution |
| least-connections | Optimising for response time |
```yaml
# Cost optimisation - prefer local
proxy:
  load_balancer: "priority"

# Even distribution
proxy:
  load_balancer: "round-robin"

# Performance optimisation
proxy:
  load_balancer: "least-connections"
```
### Timeout Configuration
Set appropriate timeouts for your use case:
```yaml
server:
  read_timeout: 30s        # Time to read request
  write_timeout: 0s        # Must be 0 for streaming
  shutdown_timeout: 30s    # Graceful shutdown time

proxy:
  connection_timeout: 30s  # Backend connection timeout
```

Important: Always set `write_timeout: 0s` to support streaming LLM responses.
### Health Check Intervals
Balance between detection speed and overhead:
```yaml
endpoints:
  - url: "http://localhost:11434"
    check_interval: 5s   # Critical endpoints
    check_timeout: 2s

  - url: "http://backup:11434"
    check_interval: 30s  # Less critical endpoints
    check_timeout: 5s
```
Guidelines:
- Critical endpoints: 5-10 second intervals
- Secondary endpoints: 15-30 second intervals
- Backup endpoints: 30-60 second intervals
### Rate Limiting
Protect your infrastructure:
```yaml
server:
  rate_limits:
    # Overall system capacity
    global_requests_per_minute: 10000

    # Per-client limits
    per_ip_requests_per_minute: 100

    # Monitoring endpoints
    health_requests_per_minute: 5000

    # Burst allowance
    burst_size: 50
```
Sizing guidelines:
- Set global limit to 80% of tested capacity
- Per-IP limits prevent single client abuse
- Health endpoints need higher limits for monitoring
- Burst size handles temporary spikes
### Request Size Limits

Prevent resource exhaustion by capping request body sizes.
Recommendations by use case:
- Chat applications: 10-50MB
- Code generation: 50-100MB
- Document processing: 100MB+
- API gateway: 5-10MB
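As a sketch, a size limit for a chat-focused deployment might look like the following. The field names here are assumptions for illustration; check the configuration reference for the exact options your Olla version supports.

```yaml
# Assumed field names - verify against the configuration reference
server:
  request_limits:
    max_body_size: 50MB  # upper end of the chat-application range above
```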
## Deployment Patterns

### Single Instance
For simple deployments:
```yaml
discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "single-instance"
        type: "ollama"
        priority: 100
```
### Active-Passive Failover
Primary with automatic failover:
```yaml
proxy:
  load_balancer: "priority"

discovery:
  static:
    endpoints:
      - url: "http://primary:11434"
        name: "primary"
        priority: 100
        check_interval: 5s

      - url: "http://backup:11434"
        name: "backup"
        priority: 10
        check_interval: 10s
```
### Active-Active Load Balancing
Distribute load across multiple endpoints:
```yaml
proxy:
  load_balancer: "least-connections"

discovery:
  static:
    endpoints:
      - url: "http://node1:11434"
        name: "node-1"
        priority: 100

      - url: "http://node2:11434"
        name: "node-2"
        priority: 100

      - url: "http://node3:11434"
        name: "node-3"
        priority: 100
```
### Geographic Distribution
Multi-region deployment:
```yaml
proxy:
  load_balancer: "priority"
  max_retries: 3

discovery:
  static:
    endpoints:
      # Primary region
      - url: "http://us-east-1:11434"
        name: "us-east-primary"
        priority: 100
        check_interval: 5s

      # Secondary region
      - url: "http://us-west-1:11434"
        name: "us-west-backup"
        priority: 50
        check_interval: 10s

      # Disaster recovery
      - url: "http://eu-west-1:11434"
        name: "eu-west-dr"
        priority: 10
        check_interval: 30s
```
## Operational Guidelines

### Monitoring
Key metrics to track:
- Request Rate: Monitor throughput trends
- Response Times: Track P50, P95, P99 latencies
- Error Rates: Watch for increases
- Circuit Breaker: Track trip frequency
- Endpoint Health: Monitor availability
- Model Discovery: Track model availability
### Logging
Configure appropriate log levels:
# Development
logging:
level: "debug"
format: "text"
# Production
logging:
level: "info" # or "warn" for less verbosity
format: "json"
### Graceful Shutdown

Handle shutdown properly by ensuring your deployment:
- Sends SIGTERM for shutdown
- Waits for shutdown_timeout
- Only sends SIGKILL if necessary
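With Docker Compose, for example, the stop grace period should match `shutdown_timeout`. The service and image names below are placeholders for illustration.

```yaml
services:
  olla:
    image: olla:latest      # placeholder image reference
    stop_grace_period: 30s  # Compose sends SIGTERM, waits this long, then SIGKILL
```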
### Resource Allocation
Memory considerations:
- Sherpa engine: ~50-100MB base + request overhead
- Olla engine: ~100-200MB base + connection pools
- Model registry: ~10MB per 1000 models
CPU considerations:
- Primarily I/O bound
- 1-2 cores sufficient for most deployments
- Scale horizontally for higher throughput
## Security Considerations

### Network Security

- Bind Address: Use `localhost` unless network access is needed
- TLS Termination: Use a reverse proxy for HTTPS
- Firewall Rules: Restrict access to the Olla port
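For example, binding only to the loopback interface might look like this. The `host` field name is an assumption for illustration; consult the configuration reference for the exact option.

```yaml
# Assumed field name - verify against the configuration reference
server:
  host: "127.0.0.1"  # reachable only from the local machine
```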
### Rate Limiting

Always configure rate limits in production; see the Rate Limiting section above for a full example.
### Request Validation

Set appropriate size limits, following the Request Size Limits guidance above.
## Common Pitfalls

### 1. Forgetting write_timeout

Problem: Streaming responses time out.

Solution: Set `write_timeout: 0s` in the server configuration.
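The fix is the same setting shown in Timeout Configuration:

```yaml
server:
  write_timeout: 0s  # disables the write deadline so streams can run indefinitely
```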
### 2. Aggressive Health Checks

Problem: Health checks overload backend endpoints.

Solution: Use appropriate intervals (5-60 seconds, depending on endpoint criticality).
### 3. No Rate Limiting

Problem: A single client can overwhelm the system.

Solution: Always configure rate limits.
### 4. Wrong Engine Choice

Problem: Poor performance with Sherpa under high load.

Solution: Use the Olla engine in production.
### 5. Missing Model Unification

Problem: Models appear and disappear intermittently.

Solution: Enable model unification for same-type endpoints.
## Performance Tuning

### Connection Pooling

The Olla engine maintains connection pools to backend endpoints, avoiding per-request connection setup.

### Retry Strategy

Configure retries appropriately for your failover requirements.
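A sketch, reusing the `max_retries` option from the Geographic Distribution example (the value is illustrative):

```yaml
proxy:
  max_retries: 3  # try up to three endpoints before failing the request
```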
### Model Discovery

Optimise discovery frequency:
```yaml
discovery:
  model_discovery:
    enabled: true
    interval: 5m           # Not too frequent
    concurrent_workers: 5  # Parallel discovery
```
## Next Steps
- Security Best Practices - Secure your deployment
- Performance Tuning - Optimise for your workload
- Monitoring Guide - Track system health
- Configuration Reference - Complete configuration options