# Concepts Overview
Olla is built on several core concepts that work together to provide intelligent LLM request routing and management. This section introduces the key components and how they interact.
## Core Components
### Proxy Engines
At the heart of Olla are two distinct proxy implementations, each optimised for a different scenario:
- Sherpa: Simple, maintainable proxy with shared HTTP transport
- Olla: High-performance proxy with per-endpoint connection pools
Understanding the trade-offs between these engines helps you choose the right one for your workload.
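The key structural difference can be sketched in Go. This is a conceptual illustration, not Olla's actual code: `transportFor` and the variable names are invented for the example, and `10` is an arbitrary pool size.

```go
// Conceptual sketch of the two pooling strategies; not Olla's implementation.
package main

import (
	"fmt"
	"net/http"
)

// Sherpa-style: one http.Transport (one connection pool) shared by
// every endpoint. Simple, but a slow backend can tie up shared capacity.
var sharedTransport = &http.Transport{MaxIdleConnsPerHost: 10}

// Olla-style: a dedicated transport per endpoint, so each backend gets
// its own connection pool and cannot starve the others.
var perEndpoint = map[string]*http.Transport{}

// transportFor returns (creating if needed) the pool for one endpoint.
func transportFor(endpoint string) *http.Transport {
	if t, ok := perEndpoint[endpoint]; ok {
		return t
	}
	t := &http.Transport{MaxIdleConnsPerHost: 10}
	perEndpoint[endpoint] = t
	return t
}

func main() {
	a := transportFor("http://gpu-1:11434")
	b := transportFor("http://gpu-2:11434")
	fmt.Println(a != b)                                  // distinct pools
	fmt.Println(transportFor("http://gpu-1:11434") == a) // pool is reused
}
```

The per-endpoint map is what buys isolation at the cost of more state to manage, which is the essence of the Sherpa/Olla trade-off.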
### Load Balancing
Intelligent request distribution across multiple endpoints:
- Priority-based: Routes to preferred endpoints first
- Round-robin: Even distribution across endpoints
- Least-connections: Routes to least busy endpoints
Load balancing ensures optimal resource utilisation and failover capability.
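As an illustration of the third strategy, a least-connections pick reduces to choosing the endpoint with the fewest in-flight requests. The types and function below are invented for the example, not Olla's API:

```go
// Illustrative least-connections selection; not Olla's internals.
package main

import "fmt"

type endpoint struct {
	URL    string
	Active int // in-flight requests currently routed here
}

// leastConnections returns the endpoint with the fewest active requests.
func leastConnections(eps []endpoint) endpoint {
	best := eps[0]
	for _, e := range eps[1:] {
		if e.Active < best.Active {
			best = e
		}
	}
	return best
}

func main() {
	eps := []endpoint{
		{"http://gpu-1:11434", 4},
		{"http://gpu-2:11434", 1},
		{"http://gpu-3:11434", 2},
	}
	fmt.Println(leastConnections(eps).URL) // the least busy endpoint
}
```

Priority and round-robin differ only in the selection rule: a fixed preference order, or a rotating index over the same endpoint slice.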
### Health Checking
Automatic endpoint monitoring and failure detection:
- Periodic health checks with configurable intervals
- Automatic endpoint recovery with exponential backoff
- Circuit breaker pattern to prevent cascade failures
Health checking maintains service reliability by routing around failures.
### Model Unification
Standardised model discovery and management:
- Automatic model discovery from connected endpoints
- Per-provider model format unification
- Consistent model naming across different backends
Model unification simplifies working with heterogeneous LLM infrastructure.
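The idea behind per-provider unification is a normalisation step per backend format. The rules below are invented purely to illustrate the shape of the problem; they are not Olla's actual naming scheme:

```go
// Hypothetical per-provider model-name normalisation; the rules are
// invented for illustration and are not Olla's unification logic.
package main

import (
	"fmt"
	"strings"
)

// unify maps a provider-specific model ID onto one naming convention.
func unify(provider, model string) string {
	switch provider {
	case "ollama":
		// e.g. "llama3.1:8b" -> "llama3.1-8b"
		return strings.ReplaceAll(model, ":", "-")
	case "openai":
		return strings.ToLower(model)
	default:
		return model
	}
}

func main() {
	fmt.Println(unify("ollama", "llama3.1:8b"))
}
```

With a step like this in front of discovery, clients can address the same model by one name regardless of which backend serves it.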
### Proxy Profiles
Response handling strategies for different use cases:
- Auto: Intelligent detection based on content
- Streaming: Immediate token streaming for chat
- Standard: Buffered responses for APIs
Profiles optimise response handling for specific workload patterns.
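A minimal sketch of what "intelligent detection based on content" can mean in the Auto case: stream when the backend responds with a streaming content type, buffer otherwise. The function is illustrative, not Olla's implementation; SSE and NDJSON are the common LLM streaming formats assumed here:

```go
// Sketch of Auto-profile detection: stream server-sent events and
// NDJSON, buffer everything else. Not Olla's actual detection logic.
package main

import (
	"fmt"
	"strings"
)

// shouldStream reports whether a response content type indicates a
// token stream rather than a single buffered JSON body.
func shouldStream(contentType string) bool {
	ct := strings.ToLower(contentType)
	return strings.HasPrefix(ct, "text/event-stream") ||
		strings.HasPrefix(ct, "application/x-ndjson")
}

func main() {
	fmt.Println(shouldStream("text/event-stream")) // stream tokens
	fmt.Println(shouldStream("application/json"))  // buffer the body
}
```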
### Profile System
Provider-specific configuration templates:
- Pre-configured profiles for Ollama, LM Studio, OpenAI, vLLM
- Custom header mappings and endpoint patterns
- Model format converters for each provider
The profile system ensures compatibility with various LLM providers.
## How Components Work Together
```mermaid
graph TD
    Request[Client Request] --> Proxy[Proxy Engine]
    Proxy --> Health[Health Checker]
    Health --> LB[Load Balancer]
    LB --> EP1[Endpoint 1]
    LB --> EP2[Endpoint 2]
    LB --> EP3[Endpoint 3]
    Models[Model Unification] --> EP1
    Models --> EP2
    Models --> EP3
    Profile[Proxy Profile] --> Proxy
    System[Profile System] --> EP1
    System --> EP2
    System --> EP3

    style Request fill:#e1f5fe
    style Proxy fill:#fff3e0
    style Health fill:#f3e5f5
    style LB fill:#e8f5e9
    style Models fill:#fce4ec
```
1. Requests arrive at the proxy engine (Sherpa or Olla)
2. Health checking ensures only healthy endpoints are used
3. The load balancer selects the optimal endpoint
4. The profile system applies provider-specific configuration
5. Model unification ensures consistent model access
6. The proxy profile determines the response handling strategy
## Choosing Components
### For Development
- Engine: Sherpa (simpler, easier to debug)
- Balancer: Priority (predictable routing)
- Profile: Auto (handles most cases)
### For Production
- Engine: Olla (optimised for performance)
- Balancer: Priority or Least-connections
- Profile: Auto or Streaming (for chat applications)
### For High Availability
- Multiple endpoints with health checking
- Circuit breakers to prevent cascade failures
- Appropriate timeout configurations
## Configuration Philosophy
Olla follows a convention over configuration approach:
- Sensible defaults that work for most cases
- Progressive disclosure of advanced options
- Provider-specific profiles for quick setup
- Fine-grained control when needed
## Next Steps
- Start with Proxy Engines to understand request handling
- Configure Load Balancing for your infrastructure
- Set up Health Checking for reliability
- Choose appropriate Proxy Profiles for your use case