Configuration Reference¶
Complete reference for all Olla configuration options.
Default Configuration
Minimal Setup: Olla starts with sensible defaults - just runserver: host: "localhost" port: 40114 proxy: engine: "sherpa" load_balancer: "priority" discovery: model_discovery: enabled: true interval: 5m logging: level: "info" format: "json"ollaand it works!Environment Variables: All settings support
OLLA_prefix (e.g.,OLLA_SERVER_PORT=8080)
Configuration Structure¶
server: # HTTP server configuration
proxy: # Proxy engine settings
discovery: # Endpoint discovery
model_registry: # Model management
logging: # Logging configuration
engineering: # Debug features
Server Configuration¶
HTTP server and security settings.
Basic Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
host | string | "localhost" | Network interface to bind |
port | int | 40114 | TCP port to listen on |
request_logging | bool | false | Enable request logging |
Example:
Timeouts¶
| Field | Type | Default | Description |
|---|---|---|---|
read_timeout | duration | 20s | Time to read request |
write_timeout | duration | 0s | Response write timeout (must be 0 for streaming) |
idle_timeout | duration | 120s | Keep-alive timeout |
shutdown_timeout | duration | 10s | Graceful shutdown timeout |
Example:
server:
read_timeout: 30s
write_timeout: 0s # Required for streaming
idle_timeout: 120s
shutdown_timeout: 30s
Request Limits¶
| Field | Type | Default | Description |
|---|---|---|---|
request_limits.max_body_size | int64 | 52428800 | Max request body (bytes) |
request_limits.max_header_size | int64 | 524288 | Max header size (bytes) |
Example:
Rate Limits¶
| Field | Type | Default | Description |
|---|---|---|---|
rate_limits.global_requests_per_minute | int | 0 | Global rate limit (0=disabled) |
rate_limits.per_ip_requests_per_minute | int | 0 | Per-IP rate limit (0=disabled) |
rate_limits.health_requests_per_minute | int | 0 | Health endpoint limit |
rate_limits.burst_size | int | 50 | Token bucket burst size |
rate_limits.cleanup_interval | duration | 1m | Rate limiter cleanup |
rate_limits.trust_proxy_headers | bool | false | Trust X-Forwarded-For |
rate_limits.trusted_proxy_cidrs | []string | [] | Trusted proxy CIDRs |
Example:
server:
rate_limits:
global_requests_per_minute: 10000
per_ip_requests_per_minute: 100
health_requests_per_minute: 5000
burst_size: 50
cleanup_interval: 1m
trust_proxy_headers: true
trusted_proxy_cidrs:
- "10.0.0.0/8"
- "172.16.0.0/12"
Proxy Configuration¶
Proxy engine and request handling settings.
Basic Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
engine | string | "sherpa" | Proxy engine (sherpa or olla) |
profile | string | "auto" | Proxy profile (auto, streaming, standard) |
load_balancer | string | "priority" | Load balancer strategy |
Example:
Connection Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
connection_timeout | duration | 30s | Backend connection timeout |
response_timeout | duration | 0s | Response timeout (0=disabled) |
read_timeout | duration | 0s | Read timeout (0=disabled) |
Example:
Retry Behaviour¶
As of v0.0.16, the retry mechanism is automatic and built-in for connection failures. When a connection error occurs (e.g., connection refused, network unreachable, timeout), Olla will automatically:
- Mark the failed endpoint as unhealthy
- Try the next available healthy endpoint
- Continue until a successful connection is made or all endpoints have been tried
- Use exponential backoff for unhealthy endpoints to prevent overwhelming them
Note: The fields max_retries and retry_backoff that may still appear in the configuration are deprecated and ignored. The retry behaviour is now automatic and cannot be configured.
Streaming Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
stream_buffer_size | int | 4096 | Stream buffer size (bytes) |
Example:
Profile Filtering¶
Control which inference profiles are loaded at startup. See Filter Concepts for pattern details.
| Field | Type | Default | Description |
|---|---|---|---|
profile_filter.include | []string | [] | Profiles to include (glob patterns) |
profile_filter.exclude | []string | [] | Profiles to exclude (glob patterns) |
Example:
proxy:
profile_filter:
include:
- "ollama" # Include Ollama
- "openai*" # Include all OpenAI variants
exclude:
- "*test*" # Exclude test profiles
- "*debug*" # Exclude debug profiles
Discovery Configuration¶
Endpoint discovery and health checking.
Discovery Type¶
| Field | Type | Default | Description |
|---|---|---|---|
type | string | "static" | Discovery type (only static supported) |
refresh_interval | duration | 5m | Discovery refresh interval |
Example:
Static Endpoints¶
| Field | Type | Required | Description |
|---|---|---|---|
static.endpoints[].url | string | Yes | Endpoint base URL |
static.endpoints[].name | string | Yes | Unique endpoint name |
static.endpoints[].type | string | Yes | Backend type (ollama, lm-studio, llamacpp, vllm, sglang, lemonade, litellm, openai) |
static.endpoints[].priority | int | No | Selection priority (higher=preferred) |
static.endpoints[].preserve_path | bool | No | Preserve base path in URL when proxying (default: false) |
static.endpoints[].health_check_url | string | No | Health check path |
static.endpoints[].model_url | string | No | Model discovery path |
static.endpoints[].check_interval | duration | No | Health check interval |
static.endpoints[].check_timeout | duration | No | Health check timeout |
static.endpoints[].model_filter | object | No | Model filtering for this endpoint |
Endpoint Model Filtering¶
Filter models at the endpoint level during discovery. See Filter Concepts for pattern syntax.
| Field | Type | Description |
|---|---|---|
model_filter.include | []string | Models to include (glob patterns) |
model_filter.exclude | []string | Models to exclude (glob patterns) |
Path Preservation¶
The preserve_path field controls how Olla handles base paths in endpoint URLs during proxying. This is particularly important for endpoints that serve multiple services or use path-based routing.
Default Behaviour (preserve_path: false) When preserve_path is false (default), Olla strips the base path from the endpoint URL before proxying:
- Endpoint URL:
http://localhost:8080/api/v1 - Request to Olla:
/v1/chat/completions - Proxied to:
http://localhost:8080/v1/chat/completions(base path/api/v1is replaced)
Path Preservation (preserve_path: true) When preserve_path is true, Olla preserves the base path:
- Endpoint URL:
http://localhost:8080/api/v1 - Request to Olla:
/v1/chat/completions - Proxied to:
http://localhost:8080/api/v1/v1/chat/completions(base path is preserved)
When to Use Path Preservation:
- Docker Model Runner endpoints with base paths
- APIs deployed behind path-based routers
- Services that require specific URL structures
- Multi-service endpoints using path differentiation
Example:
discovery:
static:
endpoints:
# Docker Model Runner with base path
- url: "http://localhost:8080/api/models/llama"
name: "docker-llama"
type: "openai"
preserve_path: true # Keep /api/models/llama in requests
# Standard endpoint without base path
- url: "http://localhost:11434"
name: "local-ollama"
type: "ollama"
preserve_path: false # Default behaviour
priority: 100
health_check_url: "/"
model_url: "/api/tags"
check_interval: 30s
check_timeout: 5s
model_filter:
exclude:
- "*embed*" # No embedding models
- "*uncensored*" # No uncensored models
- "nomic-*" # No Nomic models
- url: "http://remote:11434"
name: "remote-ollama"
type: "ollama"
priority: 50
check_interval: 60s
model_filter:
include:
- "llama*" # Only Llama models
- "mistral*" # And Mistral models
Model Discovery¶
| Field | Type | Default | Description |
|---|---|---|---|
model_discovery.enabled | bool | true | Enable model discovery |
model_discovery.interval | duration | 5m | Discovery interval |
model_discovery.timeout | duration | 30s | Discovery timeout |
model_discovery.concurrent_workers | int | 5 | Parallel workers |
model_discovery.retry_attempts | int | 3 | Retry attempts |
model_discovery.retry_backoff | duration | 5s | Retry backoff |
Example:
discovery:
model_discovery:
enabled: true
interval: 10m
timeout: 30s
concurrent_workers: 10
retry_attempts: 3
retry_backoff: 5s
Model Registry Configuration¶
Model management and unification settings.
Registry Type¶
| Field | Type | Default | Description |
|---|---|---|---|
type | string | "memory" | Registry type (only memory supported) |
enable_unifier | bool | true | Enable model unification |
routing_strategy.type | string | "strict" | Model routing strategy (strict/optimistic/discovery) |
Example:
model_registry:
type: "memory"
enable_unifier: true
routing_strategy:
type: strict # Default: only route to endpoints with the model
Model Routing Strategy¶
Controls how requests are routed when models aren't available on all endpoints:
| Field | Type | Default | Description |
|---|---|---|---|
routing_strategy.type | string | "strict" | Strategy: strict, optimistic, or discovery |
routing_strategy.options.fallback_behavior | string | "compatible_only" | Fallback: compatible_only, all, or none |
routing_strategy.options.discovery_timeout | duration | 2s | Timeout for discovery refresh |
routing_strategy.options.discovery_refresh_on_miss | bool | false | Refresh discovery when model not found |
Example configurations:
# Production - strict routing
model_registry:
routing_strategy:
type: strict
# Development - optimistic with fallback
model_registry:
routing_strategy:
type: optimistic
options:
fallback_behavior: compatible_only
# Dynamic environments - discovery mode
model_registry:
routing_strategy:
type: discovery
options:
discovery_refresh_on_miss: true
discovery_timeout: 2s
Unification Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
unification.enabled | bool | true | Enable unification |
unification.stale_threshold | duration | 24h | Model retention time |
unification.cleanup_interval | duration | 10m | Cleanup frequency |
unification.cache_ttl | duration | 5m | Cache TTL |
Example:
model_registry:
unification:
enabled: true
stale_threshold: 12h
cleanup_interval: 15m
cache_ttl: 10m
Custom Unification Rules¶
| Field | Type | Description |
|---|---|---|
unification.custom_rules[].platform | string | Platform to apply rules |
unification.custom_rules[].name_patterns | map | Name pattern mappings |
unification.custom_rules[].family_overrides | map | Family overrides |
Example:
model_registry:
unification:
custom_rules:
- platform: "ollama"
name_patterns:
"llama3.*": "llama3"
"mistral.*": "mistral"
family_overrides:
"llama3": "meta-llama"
Routing Configuration¶
Model routing strategy settings for handling requests when models aren't available on all endpoints.
Model Routing Strategy¶
| Field | Type | Default | Description |
|---|---|---|---|
routing.model_routing.type | string | "strict" | Routing strategy (strict, optimistic, discovery) |
routing.model_routing.options.fallback_behavior | string | "compatible_only" | Fallback behavior (compatible_only, all, none) |
routing.model_routing.options.discovery_refresh_on_miss | bool | false | Refresh discovery when model not found |
routing.model_routing.options.discovery_timeout | duration | 2s | Discovery refresh timeout |
Strategy Types¶
strict: Only routes to endpoints known to have the modeloptimistic: Falls back to healthy endpoints when model not founddiscovery: Refreshes model discovery before routing decisions
Example:
routing:
model_routing:
type: strict
options:
fallback_behavior: compatible_only
discovery_refresh_on_miss: false
discovery_timeout: 2s
Response Headers¶
Routing decisions are exposed via response headers:
| Header | Description |
|---|---|
X-Olla-Routing-Strategy | Strategy used (strict/optimistic/discovery) |
X-Olla-Routing-Decision | Action taken (routed/fallback/rejected) |
X-Olla-Routing-Reason | Human-readable reason for decision |
Logging Configuration¶
Application logging settings.
| Field | Type | Default | Description |
|---|---|---|---|
level | string | "info" | Log level (debug, info, warn, error) |
format | string | "json" | Log format (json or text) |
output | string | "stdout" | Output destination |
Example:
Log levels:
debug: Detailed debugging informationinfo: Normal operational messageswarn: Warning conditionserror: Error conditions only
Engineering Configuration¶
Debug and development features.
| Field | Type | Default | Description |
|---|---|---|---|
show_nerdstats | bool | false | Show memory stats on shutdown |
Example:
When enabled, displays:
- Memory allocation statistics
- Garbage collection metrics
- Goroutine counts
- Runtime information
Environment Variables¶
All configuration can be overridden via environment variables.
Pattern: OLLA_<SECTION>_<KEY> in uppercase with underscores.
Examples:
# Server settings
OLLA_SERVER_HOST=0.0.0.0
OLLA_SERVER_PORT=8080
OLLA_SERVER_REQUEST_LOGGING=true
# Proxy settings
OLLA_PROXY_ENGINE=olla
OLLA_PROXY_LOAD_BALANCER=round-robin
OLLA_PROXY_PROFILE=auto
# Logging
OLLA_LOGGING_LEVEL=debug
OLLA_LOGGING_FORMAT=text
# Rate limits
OLLA_SERVER_RATE_LIMITS_GLOBAL_REQUESTS_PER_MINUTE=1000
Duration Format¶
Duration values use Go duration syntax:
s- seconds (e.g.,30s)m- minutes (e.g.,5m)h- hours (e.g.,2h)ms- milliseconds (e.g.,500ms)us- microseconds (e.g.,100us)
Examples:
30s- 30 seconds5m- 5 minutes1h30m- 1 hour 30 minutes500ms- 500 milliseconds
Default Configuration¶
Complete default configuration:
server:
host: "localhost"
port: 40114
read_timeout: 20s
write_timeout: 0s
idle_timeout: 120s
shutdown_timeout: 10s
request_logging: false
request_limits:
max_body_size: 52428800 # 50MB
max_header_size: 524288 # 512KB
rate_limits:
global_requests_per_minute: 0
per_ip_requests_per_minute: 0
health_requests_per_minute: 0
burst_size: 50
cleanup_interval: 1m
trust_proxy_headers: false
trusted_proxy_cidrs: []
proxy:
engine: "sherpa"
profile: "auto"
load_balancer: "priority"
connection_timeout: 30s
response_timeout: 0s
read_timeout: 0s
# DEPRECATED as of v0.0.16 - retry is now automatic
# max_retries: 3
# retry_backoff: 1s
stream_buffer_size: 4096
discovery:
type: "static"
refresh_interval: 5m
model_discovery:
enabled: true
interval: 5m
timeout: 30s
concurrent_workers: 5
retry_attempts: 3
retry_backoff: 5s
static:
endpoints: []
model_registry:
type: "memory"
enable_unifier: true
routing_strategy:
type: "strict"
options:
fallback_behavior: "compatible_only"
discovery_timeout: 2s
discovery_refresh_on_miss: false
unification:
enabled: true
stale_threshold: 24h
cleanup_interval: 10m
cache_ttl: 5m
custom_rules: []
logging:
level: "info"
format: "json"
output: "stdout"
engineering:
show_nerdstats: false
Validation¶
Olla validates configuration on startup:
- Required fields are checked
- URLs must be valid
- Durations must parse correctly
- Endpoints must have unique names
- Ports must be in valid range (1-65535)
- CIDR blocks must be valid
Next Steps¶
- Configuration Examples - Common configurations
- Best Practices - Production recommendations
- Environment Variables - Override configuration