Configuration Reference¶
Complete reference for all Olla configuration options.
Default Configuration

Minimal Setup: Olla starts with sensible defaults - just run olla and it works! The defaults are equivalent to:

server:
  host: "localhost"
  port: 40114
proxy:
  engine: "sherpa"
  load_balancer: "priority"
discovery:
  model_discovery:
    enabled: true
    interval: 5m
logging:
  level: "info"
  format: "json"

Environment Variables: All settings support the OLLA_ prefix (e.g., OLLA_SERVER_PORT=8080).
Configuration Structure¶
server: # HTTP server configuration
proxy: # Proxy engine settings
discovery: # Endpoint discovery
model_registry: # Model management
translators: # API translation (e.g., Anthropic ↔ OpenAI)
logging: # Logging configuration
engineering: # Debug features
Server Configuration¶
HTTP server and security settings.
Basic Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| host | string | "localhost" | Network interface to bind |
| port | int | 40114 | TCP port to listen on |
| request_logging | bool | true | Enable request logging |
Example:
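server:
  host: "0.0.0.0"   # expose on all interfaces (default: "localhost")
  port: 40114
  request_logging: true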
Timeouts¶
| Field | Type | Default | Description |
|---|---|---|---|
| read_timeout | duration | 30s | Time to read request |
| write_timeout | duration | 0s | Response write timeout (must be 0 for streaming) |
| idle_timeout | duration | 0s | Keep-alive timeout (0 = use read_timeout) |
| shutdown_timeout | duration | 10s | Graceful shutdown timeout |
Example:
server:
  read_timeout: 30s
  write_timeout: 0s   # Required for streaming
  idle_timeout: 120s
  shutdown_timeout: 30s
Request Limits¶
| Field | Type | Default | Description |
|---|---|---|---|
| request_limits.max_body_size | int64 | 104857600 | Max request body (100MB) |
| request_limits.max_header_size | int64 | 1048576 | Max header size (1MB) |
Example:
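server:
  request_limits:
    max_body_size: 104857600   # 100MB
    max_header_size: 1048576   # 1MB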
Rate Limits¶
| Field | Type | Default | Description |
|---|---|---|---|
| rate_limits.global_requests_per_minute | int | 1000 | Global rate limit (0=disabled) |
| rate_limits.per_ip_requests_per_minute | int | 100 | Per-IP rate limit (0=disabled) |
| rate_limits.health_requests_per_minute | int | 1000 | Health endpoint limit |
| rate_limits.burst_size | int | 50 | Token bucket burst size |
| rate_limits.cleanup_interval | duration | 5m | Rate limiter cleanup |
| rate_limits.trust_proxy_headers | bool | false | Trust X-Forwarded-For |
| rate_limits.trusted_proxy_cidrs | []string | ["127.0.0.0/8","10.0.0.0/8","172.16.0.0/12","192.168.0.0/16"] | Trusted proxy CIDRs |
Example:
server:
  rate_limits:
    global_requests_per_minute: 10000
    per_ip_requests_per_minute: 100
    health_requests_per_minute: 5000
    burst_size: 50
    cleanup_interval: 5m
    trust_proxy_headers: true
    trusted_proxy_cidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
Proxy Configuration¶
Proxy engine and request handling settings.
Basic Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| engine | string | "sherpa" | Proxy engine (sherpa or olla) |
| profile | string | "auto" | Proxy profile (auto, streaming, standard) |
| load_balancer | string | "priority" | Load balancer strategy |
Example:
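proxy:
  engine: "sherpa"   # or "olla"
  profile: "auto"
  load_balancer: "priority"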
Connection Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| connection_timeout | duration | 30s | Backend connection timeout |
| response_timeout | duration | 10m | Response timeout |
| read_timeout | duration | 120s | Read timeout |
Example:
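proxy:
  connection_timeout: 30s
  response_timeout: 10m
  read_timeout: 120s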
Retry Behaviour¶
As of v0.0.16, the retry mechanism is automatic and built-in for connection failures. When a connection error occurs (e.g., connection refused, network unreachable, timeout), Olla will automatically:
- Mark the failed endpoint as unhealthy
- Try the next available healthy endpoint
- Continue until a successful connection is made or all endpoints have been tried
- Use exponential backoff for unhealthy endpoints to prevent overwhelming them
Note: The fields max_retries and retry_backoff that may still appear in the configuration are deprecated and ignored. The retry behaviour is now automatic and cannot be configured.
Streaming Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| stream_buffer_size | int | 8192 | Stream buffer size (bytes) |
Example:
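proxy:
  stream_buffer_size: 8192   # bytes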
Profile Filtering¶
Control which inference profiles are loaded at startup. See Filter Concepts for pattern details.
| Field | Type | Default | Description |
|---|---|---|---|
| profile_filter.include | []string | [] | Profiles to include (glob patterns) |
| profile_filter.exclude | []string | [] | Profiles to exclude (glob patterns) |
Example:
proxy:
  profile_filter:
    include:
      - "ollama"    # Include Ollama
      - "openai*"   # Include all OpenAI variants
    exclude:
      - "*test*"    # Exclude test profiles
      - "*debug*"   # Exclude debug profiles
Discovery Configuration¶
Endpoint discovery and health checking.
Discovery Type¶
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | "static" | Discovery type (only static supported) |
| refresh_interval | duration | 30s | Discovery refresh interval |
Example:
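discovery:
  type: "static"
  refresh_interval: 30s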
Static Endpoints¶
| Field | Type | Required | Description |
|---|---|---|---|
| static.endpoints[].url | string | Yes | Endpoint base URL |
| static.endpoints[].name | string | Yes | Unique endpoint name |
| static.endpoints[].type | string | Yes | Backend type (ollama, lm-studio, llamacpp, vllm, sglang, lemonade, litellm, openai) |
| static.endpoints[].priority | int | No | Selection priority (higher=preferred) |
| static.endpoints[].preserve_path | bool | No | Preserve base path in URL when proxying (default: false) |
| static.endpoints[].health_check_url | string | No | Health check path (optional, uses profile default if not specified) |
| static.endpoints[].model_url | string | No | Model discovery path (optional, uses profile default if not specified) |
| static.endpoints[].check_interval | duration | No | Health check interval |
| static.endpoints[].check_timeout | duration | No | Health check timeout |
| static.endpoints[].model_filter | object | No | Model filtering for this endpoint |
URL Configuration¶
The health_check_url and model_url fields are optional. When not specified, Olla uses profile-specific defaults based on the endpoint type:
Profile Defaults:
| Endpoint Type | Default health_check_url | Default model_url |
|---|---|---|
| ollama | / | /api/tags |
| llamacpp | /health | /v1/models |
| lm-studio | /v1/models | /api/v0/models |
| vllm | /health | /v1/models |
| sglang | /health | /v1/models |
| openai | /v1/models | /v1/models |
| auto (or unknown) | / | /v1/models |
Both fields support:
- Relative paths (recommended) - joined with the endpoint base URL (e.g., health_check_url: "/health")
- Absolute URLs - used as-is for external services (e.g., health_check_url: "http://monitoring.local:9090/health/ollama")

When using relative paths, any base path prefix in the endpoint URL is automatically preserved (e.g., http://localhost:8080/api/ + /v1/models = http://localhost:8080/api/v1/models), as in the sketch below.
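A minimal sketch showing both forms (the endpoint names here are illustrative):

discovery:
  static:
    endpoints:
      - url: "http://localhost:8080/api/"
        name: "gateway"
        type: "vllm"
        health_check_url: "/health"   # relative: resolves to http://localhost:8080/api/health
      - url: "http://localhost:11434"
        name: "monitored"
        type: "ollama"
        health_check_url: "http://monitoring.local:9090/health/ollama"   # absolute: used as-is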
Endpoint Model Filtering¶
Filter models at the endpoint level during discovery. See Filter Concepts for pattern syntax.
| Field | Type | Description |
|---|---|---|
| model_filter.include | []string | Models to include (glob patterns) |
| model_filter.exclude | []string | Models to exclude (glob patterns) |
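A sketch of per-endpoint filtering (the patterns shown are illustrative):

discovery:
  static:
    endpoints:
      - url: "http://remote:11434"
        name: "remote-ollama"
        type: "ollama"
        model_filter:
          include:
            - "llama*"     # only Llama models
          exclude:
            - "*-test"     # illustrative: skip test-tagged models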
Path Preservation¶
The preserve_path field controls how Olla handles base paths in endpoint URLs during proxying. This is particularly important for endpoints that serve multiple services or use path-based routing.
Default Behaviour (preserve_path: false)

When preserve_path is false (the default), Olla strips the base path from the endpoint URL before proxying:

- Endpoint URL: http://localhost:8080/api/v1
- Request to Olla: /v1/chat/completions
- Proxied to: http://localhost:8080/v1/chat/completions (base path /api/v1 is replaced)

Path Preservation (preserve_path: true)

When preserve_path is true, Olla preserves the base path:

- Endpoint URL: http://localhost:8080/api/v1
- Request to Olla: /v1/chat/completions
- Proxied to: http://localhost:8080/api/v1/v1/chat/completions (base path is preserved)
When to Use Path Preservation:
- Docker Model Runner endpoints with base paths
- APIs deployed behind path-based routers
- Services that require specific URL structures
- Multi-service endpoints using path differentiation
Example:
discovery:
  static:
    endpoints:
      # Minimal configuration - uses profile defaults
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        # health_check_url: "/" (default for ollama)
        # model_url: "/api/tags" (default for ollama)

      # Custom health check URL
      - url: "http://localhost:8080"
        name: "llamacpp-server"
        type: "llamacpp"
        priority: 90
        health_check_url: "/health"
        # model_url: "/v1/models" (default for llamacpp)

      # Endpoint with base path - URLs are preserved
      - url: "http://localhost:8080/api/"
        name: "vllm-gateway"
        type: "vllm"
        priority: 80
        # health_check_url: "/health" -> http://localhost:8080/api/health
        # model_url: "/v1/models" -> http://localhost:8080/api/v1/models

      # External health check on different host
      - url: "http://localhost:11434"
        name: "monitored-ollama"
        type: "ollama"
        health_check_url: "http://monitoring.local:9090/health/ollama"
        # Absolute URL used as-is

      # Docker Model Runner with base path
      - url: "http://localhost:8080/api/models/llama"
        name: "docker-llama"
        type: "openai"
        preserve_path: true   # Keep /api/models/llama in requests

      # Endpoint with model filtering
      - url: "http://remote:11434"
        name: "remote-ollama"
        type: "ollama"
        priority: 50
        check_interval: 60s
        model_filter:
          include:
            - "llama*"     # Only Llama models
            - "mistral*"   # And Mistral models
Model Discovery¶
| Field | Type | Default | Description |
|---|---|---|---|
| model_discovery.enabled | bool | true | Enable model discovery |
| model_discovery.interval | duration | 5m | Discovery interval |
| model_discovery.timeout | duration | 30s | Discovery timeout |
| model_discovery.concurrent_workers | int | 5 | Parallel workers |
| model_discovery.retry_attempts | int | 3 | Retry attempts |
| model_discovery.retry_backoff | duration | 5s | Retry backoff |
Example:
discovery:
  model_discovery:
    enabled: true
    interval: 10m
    timeout: 30s
    concurrent_workers: 10
    retry_attempts: 3
    retry_backoff: 5s
Model Registry Configuration¶
Model management and unification settings.
Registry Type¶
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | "memory" | Registry type (only memory supported) |
| enable_unifier | bool | true | Enable model unification |
| routing_strategy.type | string | "strict" | Model routing strategy (strict/optimistic/discovery) |
Example:
model_registry:
  type: "memory"
  enable_unifier: true
  routing_strategy:
    type: strict   # Default: only route to endpoints with the model
Model Routing Strategy¶
Controls how requests are routed when models aren't available on all endpoints:
| Field | Type | Default | Description |
|---|---|---|---|
| routing_strategy.type | string | "strict" | Strategy: strict, optimistic, or discovery |
| routing_strategy.options.fallback_behavior | string | "compatible_only" | Fallback: compatible_only, all, or none |
| routing_strategy.options.discovery_timeout | duration | 2s | Timeout for discovery refresh |
| routing_strategy.options.discovery_refresh_on_miss | bool | false | Refresh discovery when model not found |
Example configurations:
# Production - strict routing
model_registry:
  routing_strategy:
    type: strict

# Development - optimistic with fallback
model_registry:
  routing_strategy:
    type: optimistic
    options:
      fallback_behavior: compatible_only

# Dynamic environments - discovery mode
model_registry:
  routing_strategy:
    type: discovery
    options:
      discovery_refresh_on_miss: true
      discovery_timeout: 2s
Unification Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| unification.enabled | bool | true | Enable unification |
| unification.stale_threshold | duration | 24h | Model retention time |
| unification.cleanup_interval | duration | 10m | Cleanup frequency |
| unification.cache_ttl | duration | 10m | Cache TTL |
Example:
model_registry:
  unification:
    enabled: true
    stale_threshold: 12h
    cleanup_interval: 15m
    cache_ttl: 10m
Custom Unification Rules¶
| Field | Type | Description |
|---|---|---|
| unification.custom_rules[].platform | string | Platform to apply rules |
| unification.custom_rules[].name_patterns | map | Name pattern mappings |
| unification.custom_rules[].family_overrides | map | Family overrides |
Example:
model_registry:
  unification:
    custom_rules:
      - platform: "ollama"
        name_patterns:
          "llama3.*": "llama3"
          "mistral.*": "mistral"
        family_overrides:
          "llama3": "meta-llama"
Routing Configuration¶
Model routing strategy settings for handling requests when models aren't available on all endpoints.
Model Routing Strategy¶
| Field | Type | Default | Description |
|---|---|---|---|
| routing.model_routing.type | string | "strict" | Routing strategy (strict, optimistic, discovery) |
| routing.model_routing.options.fallback_behavior | string | "compatible_only" | Fallback behavior (compatible_only, all, none) |
| routing.model_routing.options.discovery_refresh_on_miss | bool | false | Refresh discovery when model not found |
| routing.model_routing.options.discovery_timeout | duration | 2s | Discovery refresh timeout |
Strategy Types¶
- strict: Only routes to endpoints known to have the model
- optimistic: Falls back to healthy endpoints when the model is not found
- discovery: Refreshes model discovery before routing decisions
Example:
routing:
  model_routing:
    type: strict
    options:
      fallback_behavior: compatible_only
      discovery_refresh_on_miss: false
      discovery_timeout: 2s
Response Headers¶
Routing decisions are exposed via response headers:
| Header | Description |
|---|---|
| X-Olla-Routing-Strategy | Strategy used (strict/optimistic/discovery) |
| X-Olla-Routing-Decision | Action taken (routed/fallback/rejected) |
| X-Olla-Routing-Reason | Human-readable reason for decision |
Translators Configuration¶
API translation settings. Translators enable clients designed for one API format to work with backends that use a different format.
Anthropic Translation (v0.0.20+): Enabled by default. Still actively being improved -- please report any issues or feedback.
Anthropic Translator¶
The Anthropic translator enables Claude-compatible clients (Claude Code, OpenCode, Crush CLI) to work with OpenAI-compatible backends.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Master switch for the Anthropic translator. When false, the /olla/anthropic/v1/* endpoints do not exist. |
| passthrough_enabled | bool | true | Optimisation mode (only applies when enabled: true). When true, requests are forwarded directly to backends with native Anthropic support for zero translation overhead. When false, all requests go through the Anthropic-to-OpenAI translation pipeline regardless of backend capabilities. |
| max_message_size | int | 10485760 | Maximum request body size in bytes (10MB default). |
Two-Level Control: enabled + passthrough_enabled¶
The Anthropic translator uses a two-level configuration model:
- enabled is the master switch. When false, the translator is completely disabled and the passthrough_enabled setting has no effect. It is true by default.
- passthrough_enabled is the optimisation flag. It only takes effect when enabled: true.
Passthrough mode also requires that the backend profile declares native Anthropic support. Both conditions must be true for passthrough to activate:

- translators.anthropic.passthrough_enabled: true (global configuration)
- The backend profile has api.anthropic_support.enabled: true (per-backend profile)
If either condition is false, Olla falls back to translation mode automatically.
Examples¶
Enable translator with passthrough (recommended for production):
translators:
  anthropic:
    enabled: true
    passthrough_enabled: true    # Forward directly to backends with native Anthropic support
    max_message_size: 10485760   # 10MB
Enable translator with translation only (useful for debugging/testing):
translators:
  anthropic:
    enabled: true
    passthrough_enabled: false   # Always translate Anthropic ↔ OpenAI format
    max_message_size: 10485760
Disable translator entirely:
translators:
  anthropic:
    enabled: false
    # passthrough_enabled has no effect when enabled=false
    passthrough_enabled: true
Performance Implications¶
| Mode | Overhead | When Used |
|---|---|---|
| Passthrough | Near-zero (~0ms) | passthrough_enabled: true and backend has native Anthropic support |
| Translation | ~1-5ms per request | passthrough_enabled: false, or backend lacks native Anthropic support |
| Disabled | N/A | enabled: false -- endpoints return 404 |
Detecting the Active Mode¶
Check the X-Olla-Mode response header:
- X-Olla-Mode: passthrough -- passthrough mode was used
- Header absent -- translation mode was used
Inspector (Development Only)¶
Do not enable in production -- logs full request/response bodies including potentially sensitive user data.
| Field | Type | Default | Description |
|---|---|---|---|
| inspector.enabled | bool | false | Enable request/response logging |
| inspector.output_dir | string | "logs/inspector/anthropic" | Directory for log output |
| inspector.session_header | string | "X-Session-ID" | Header for session grouping |
See Anthropic Inspector for details.
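Example (development only; the directory and header shown are the defaults):

translators:
  anthropic:
    inspector:
      enabled: true   # never enable in production
      output_dir: "logs/inspector/anthropic"
      session_header: "X-Session-ID"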
Logging Configuration¶
Application logging settings.
| Field | Type | Default | Description |
|---|---|---|---|
| level | string | "info" | Log level (debug, info, warn, error) |
| format | string | "json" | Log format (json or text) |
| output | string | "stdout" | Output destination |
Example:
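logging:
  level: "debug"   # default: "info"
  format: "text"   # default: "json"
  output: "stdout"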
Log levels:
- debug: Detailed debugging information
- info: Normal operational messages
- warn: Warning conditions
- error: Error conditions only
Engineering Configuration¶
Debug and development features.
| Field | Type | Default | Description |
|---|---|---|---|
| show_nerdstats | bool | false | Show memory stats on shutdown |
Example:
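engineering:
  show_nerdstats: true   # print memory and runtime stats on shutdown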
When enabled, displays:
- Memory allocation statistics
- Garbage collection metrics
- Goroutine counts
- Runtime information
Environment Variables¶
All configuration can be overridden via environment variables.
Pattern: OLLA_<SECTION>_<KEY> in uppercase with underscores.
Examples:
# Server settings
OLLA_SERVER_HOST=0.0.0.0
OLLA_SERVER_PORT=8080
OLLA_SERVER_REQUEST_LOGGING=true
# Proxy settings
OLLA_PROXY_ENGINE=olla
OLLA_PROXY_LOAD_BALANCER=round-robin
OLLA_PROXY_PROFILE=auto
# Logging
OLLA_LOGGING_LEVEL=debug
OLLA_LOGGING_FORMAT=text
# Rate limits
OLLA_SERVER_RATE_LIMITS_GLOBAL_REQUESTS_PER_MINUTE=1000
Duration Format¶
Duration values use Go duration syntax:
- s - seconds (e.g., 30s)
- m - minutes (e.g., 5m)
- h - hours (e.g., 2h)
- ms - milliseconds (e.g., 500ms)
- us - microseconds (e.g., 100us)
Examples:
- 30s - 30 seconds
- 5m - 5 minutes
- 1h30m - 1 hour 30 minutes
- 500ms - 500 milliseconds
Default Configuration¶
Complete default configuration:
server:
  host: "localhost"
  port: 40114
  read_timeout: 30s
  write_timeout: 0s
  # idle_timeout: 0s   # Optional (0 = use read_timeout)
  shutdown_timeout: 10s
  request_logging: true
  request_limits:
    max_body_size: 104857600   # 100MB
    max_header_size: 1048576   # 1MB
  rate_limits:
    global_requests_per_minute: 1000
    per_ip_requests_per_minute: 100
    health_requests_per_minute: 1000
    burst_size: 50
    cleanup_interval: 5m
    trust_proxy_headers: false
    trusted_proxy_cidrs:
      - "127.0.0.0/8"
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"

proxy:
  engine: "sherpa"
  profile: "auto"
  load_balancer: "priority"
  connection_timeout: 30s
  response_timeout: 10m
  read_timeout: 120s
  # DEPRECATED as of v0.0.16 - retry is now automatic
  # max_retries: 3
  # retry_backoff: 1s
  stream_buffer_size: 8192

discovery:
  type: "static"
  refresh_interval: 30s
  model_discovery:
    enabled: true
    interval: 5m
    timeout: 30s
    concurrent_workers: 5
    retry_attempts: 3
    retry_backoff: 5s
  static:
    endpoints: []

model_registry:
  type: "memory"
  enable_unifier: true
  routing_strategy:
    type: "strict"
    options:
      fallback_behavior: "compatible_only"
      discovery_timeout: 2s
      discovery_refresh_on_miss: false
  unification:
    enabled: true
    stale_threshold: 24h
    cleanup_interval: 10m
    cache_ttl: 10m
    custom_rules: []

translators:
  anthropic:
    enabled: true
    passthrough_enabled: true
    max_message_size: 10485760   # 10MB
    inspector:
      enabled: false
      output_dir: "logs/inspector/anthropic"
      session_header: "X-Session-ID"

logging:
  level: "info"
  format: "json"
  output: "stdout"

engineering:
  show_nerdstats: false
Validation¶
Olla validates configuration on startup:
- Required fields are checked
- URLs must be valid
- Durations must parse correctly
- Endpoints must have unique names
- Ports must be in valid range (1-65535)
- CIDR blocks must be valid
Next Steps¶
- Configuration Examples - Common configurations
- Best Practices - Production recommendations
- Environment Variables - Override configuration