Configuration Reference¶
Complete reference for all Olla configuration options.
Default Configuration
Minimal Setup: Olla starts with sensible defaults - just run `olla` and it works! The effective defaults are:

    server:
      host: "localhost"
      port: 40114
    proxy:
      engine: "sherpa"
      load_balancer: "priority"
    discovery:
      model_discovery:
        enabled: true
        interval: 5m
    logging:
      level: "info"
      format: "json"

Environment Variables: All settings support the `OLLA_` prefix (e.g., `OLLA_SERVER_PORT=8080`).
Configuration Structure¶
server: # HTTP server configuration
proxy: # Proxy engine settings
discovery: # Endpoint discovery
model_registry: # Model management
translators: # API translation (e.g., Anthropic ↔ OpenAI)
logging: # Logging configuration
engineering: # Debug features
Server Configuration¶
HTTP server and security settings.
Basic Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| host | string | "localhost" | Network interface to bind |
| port | int | 40114 | TCP port to listen on |
| request_logging | bool | true | Enable request logging |
Example:
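A minimal sketch of these settings, consistent with the fields and defaults in the table above (the non-default host is illustrative):

```yaml
server:
  host: "0.0.0.0"        # bind all interfaces; default is "localhost"
  port: 40114
  request_logging: true
```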
Timeouts¶
| Field | Type | Default | Description |
|---|---|---|---|
| read_timeout | duration | 30s | Time to read request |
| write_timeout | duration | 0s | Response write timeout (must be 0 for streaming) |
| idle_timeout | duration | 0s | Keep-alive timeout (0 = use read_timeout) |
| shutdown_timeout | duration | 10s | Graceful shutdown timeout |
Example:
server:
read_timeout: 30s
write_timeout: 0s # Required for streaming
idle_timeout: 120s
shutdown_timeout: 30s
Request Limits¶
| Field | Type | Default | Description |
|---|---|---|---|
| request_limits.max_body_size | int64 | 104857600 | Max request body (100MB) |
| request_limits.max_header_size | int64 | 1048576 | Max header size (1MB) |
Example:
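A sketch tightening the body limit below the default (the 50MB value is illustrative):

```yaml
server:
  request_limits:
    max_body_size: 52428800   # 50MB; default is 104857600 (100MB)
    max_header_size: 1048576  # 1MB (default)
```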
Rate Limits¶
| Field | Type | Default | Description |
|---|---|---|---|
| rate_limits.global_requests_per_minute | int | 1000 | Global rate limit (0 = disabled) |
| rate_limits.per_ip_requests_per_minute | int | 100 | Per-IP rate limit (0 = disabled) |
| rate_limits.health_requests_per_minute | int | 1000 | Health endpoint limit |
| rate_limits.burst_size | int | 50 | Token bucket burst size |
| rate_limits.cleanup_interval | duration | 5m | Rate limiter cleanup |
| rate_limits.trust_proxy_headers | bool | false | Trust X-Forwarded-For |
| rate_limits.trusted_proxy_cidrs | []string | ["127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"] | Trusted proxy CIDRs |
Example:
server:
rate_limits:
global_requests_per_minute: 10000
per_ip_requests_per_minute: 100
health_requests_per_minute: 5000
burst_size: 50
cleanup_interval: 5m
trust_proxy_headers: true
trusted_proxy_cidrs:
- "10.0.0.0/8"
- "172.16.0.0/12"
Proxy Configuration¶
Proxy engine and request handling settings.
Basic Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| engine | string | "sherpa" | Proxy engine (sherpa or olla) |
| profile | string | "auto" | Proxy profile (auto, streaming, standard) |
| load_balancer | string | "priority" | Load balancer strategy |
Example:
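A sketch selecting the alternative engine (both engine names come from the table above):

```yaml
proxy:
  engine: "olla"            # or "sherpa" (the default)
  profile: "auto"
  load_balancer: "priority"
```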
Connection Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| connection_timeout | duration | 30s | Backend connection timeout |
| response_timeout | duration | 10m | Response timeout |
| read_timeout | duration | 120s | Read timeout |
Example:
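A sketch raising the response timeout for long-running generations (the 30m value is illustrative):

```yaml
proxy:
  connection_timeout: 30s
  response_timeout: 30m     # raised from the 10m default
  read_timeout: 120s
```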
Retry Behaviour¶
As of v0.0.16, the retry mechanism is automatic and built-in for connection failures. When a connection error occurs (e.g., connection refused, network unreachable, timeout), Olla will automatically:
- Mark the failed endpoint as unhealthy
- Try the next available healthy endpoint
- Continue until a successful connection is made or all endpoints have been tried
- Use exponential backoff for unhealthy endpoints to prevent overwhelming them
Note: The max_retries and retry_backoff fields may still appear in older configurations, but they are deprecated and ignored. Retry behaviour is now automatic and cannot be configured.
Streaming Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| stream_buffer_size | int | 8192 | Stream buffer size (bytes) |
Example:
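A sketch doubling the buffer (the 16KB value is illustrative):

```yaml
proxy:
  stream_buffer_size: 16384  # 16KB; default is 8192
```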
Profile Filtering¶
Control which inference profiles are loaded at startup. See Filter Concepts for pattern details.
| Field | Type | Default | Description |
|---|---|---|---|
| profile_filter.include | []string | [] | Profiles to include (glob patterns) |
| profile_filter.exclude | []string | [] | Profiles to exclude (glob patterns) |
Example:
proxy:
profile_filter:
include:
- "ollama" # Include Ollama
- "openai*" # Include all OpenAI variants
exclude:
- "*test*" # Exclude test profiles
- "*debug*" # Exclude debug profiles
Discovery Configuration¶
Endpoint discovery and health checking.
Discovery Type¶
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | "static" | Discovery type (only static supported) |
| refresh_interval | duration | 30s | Discovery refresh interval |
Example:
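A sketch using the only supported discovery type, with the default refresh interval:

```yaml
discovery:
  type: "static"            # only "static" is currently supported
  refresh_interval: 30s
```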
Static Endpoints¶
| Field | Type | Required | Description |
|---|---|---|---|
| static.endpoints[].url | string | Yes | Endpoint base URL |
| static.endpoints[].name | string | Yes | Unique endpoint name |
| static.endpoints[].type | string | Yes | Backend type (ollama, lm-studio, llamacpp, vllm, sglang, lemonade, litellm, openai) |
| static.endpoints[].priority | int | No | Selection priority (higher = preferred, default: 100) |
| static.endpoints[].preserve_path | bool | No | Preserve base path in URL when proxying (default: false) |
| static.endpoints[].health_check_url | string | No | Health check path (optional, uses profile default if not specified) |
| static.endpoints[].model_url | string | No | Model discovery path (optional, uses profile default if not specified) |
| static.endpoints[].check_interval | duration | No | Health check interval (default: 5s) |
| static.endpoints[].check_timeout | duration | No | Health check timeout (default: 2s) |
| static.endpoints[].model_filter | object | No | Model filtering for this endpoint |
URL Configuration¶
The health_check_url and model_url fields are optional. When not specified, Olla uses profile-specific defaults based on the endpoint type:
Profile Defaults:
| Endpoint Type | Default health_check_url | Default model_url |
|---|---|---|
| ollama | / | /api/tags |
| llamacpp | /health | /v1/models |
| lm-studio | /v1/models | /api/v0/models |
| vllm | /health | /v1/models |
| sglang | /health | /v1/models |
| openai | /v1/models | /v1/models |
| auto (or unknown) | / | /v1/models |
Both fields support:
- Relative paths (recommended) - joined with the endpoint base URL
- Absolute URLs - used as-is for external services

When using relative paths, any base path prefix in the endpoint URL is automatically preserved (e.g., `http://localhost:8080/api/` + `/v1/models` = `http://localhost:8080/api/v1/models`).
Endpoint Model Filtering¶
Filter models at the endpoint level during discovery. See Filter Concepts for pattern syntax.
| Field | Type | Description |
|---|---|---|
| model_filter.include | []string | Models to include (glob patterns) |
| model_filter.exclude | []string | Models to exclude (glob patterns) |
Path Preservation¶
The preserve_path field controls how Olla handles base paths in endpoint URLs during proxying. This is particularly important for endpoints that serve multiple services or use path-based routing.
Default Behaviour (preserve_path: false)
When preserve_path is false (default), Olla strips the base path from the endpoint URL before proxying:
- Endpoint URL: `http://localhost:8080/api/v1`
- Request to Olla: `/v1/chat/completions`
- Proxied to: `http://localhost:8080/v1/chat/completions` (base path `/api/v1` is replaced)
Path Preservation (preserve_path: true)
When preserve_path is true, Olla preserves the base path:
- Endpoint URL: `http://localhost:8080/api/v1`
- Request to Olla: `/v1/chat/completions`
- Proxied to: `http://localhost:8080/api/v1/v1/chat/completions` (base path is preserved)
When to Use Path Preservation:
- Docker Model Runner endpoints with base paths
- APIs deployed behind path-based routers
- Services that require specific URL structures
- Multi-service endpoints using path differentiation
Example:
discovery:
static:
endpoints:
# Minimal configuration - uses profile defaults
- url: "http://localhost:11434"
name: "local-ollama"
type: "ollama"
priority: 100
# health_check_url: "/" (default for ollama)
# model_url: "/api/tags" (default for ollama)
# Custom health check URL
- url: "http://localhost:8080"
name: "llamacpp-server"
type: "llamacpp"
priority: 90
health_check_url: "/health"
# model_url: "/v1/models" (default for llamacpp)
# Endpoint with base path - URLs are preserved
- url: "http://localhost:8080/api/"
name: "vllm-gateway"
type: "vllm"
priority: 80
# health_check_url: "/health" -> http://localhost:8080/api/health
# model_url: "/v1/models" -> http://localhost:8080/api/v1/models
# External health check on different host
- url: "http://localhost:11434"
name: "monitored-ollama"
type: "ollama"
health_check_url: "http://monitoring.local:9090/health/ollama"
# Absolute URL used as-is
# Docker Model Runner with base path
- url: "http://localhost:8080/api/models/llama"
name: "docker-llama"
type: "openai"
preserve_path: true # Keep /api/models/llama in requests
# Endpoint with model filtering
- url: "http://remote:11434"
name: "remote-ollama"
type: "ollama"
priority: 50
check_interval: 60s
model_filter:
include:
- "llama*" # Only Llama models
- "mistral*" # And Mistral models
Model Discovery¶
| Field | Type | Default | Description |
|---|---|---|---|
| model_discovery.enabled | bool | true | Enable model discovery |
| model_discovery.interval | duration | 5m | Discovery interval |
| model_discovery.timeout | duration | 30s | Discovery timeout |
| model_discovery.concurrent_workers | int | 5 | Parallel workers |
| model_discovery.retry_attempts | int | 3 | Retry attempts |
| model_discovery.retry_backoff | duration | 1s | Retry backoff |
Example:
discovery:
model_discovery:
enabled: true
interval: 10m
timeout: 30s
concurrent_workers: 10
retry_attempts: 3
retry_backoff: 1s
Model Registry Configuration¶
Model management and unification settings.
Registry Type¶
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | "memory" | Registry type (only memory supported) |
| enable_unifier | bool | true | Enable model unification |
| routing_strategy.type | string | "strict" | Model routing strategy (strict/optimistic/discovery) |
Example:
model_registry:
type: "memory"
enable_unifier: true
routing_strategy:
type: strict # Default: only route to endpoints with the model
Model Routing Strategy¶
Controls how requests are routed when models aren't available on all endpoints:
| Field | Type | Default | Description |
|---|---|---|---|
| routing_strategy.type | string | "strict" | Strategy: strict, optimistic, or discovery |
| routing_strategy.options.fallback_behavior | string | "compatible_only" | Fallback: compatible_only, all, or none |
| routing_strategy.options.discovery_timeout | duration | 2s | Timeout for discovery refresh |
| routing_strategy.options.discovery_refresh_on_miss | bool | false | Refresh discovery when model not found |
Example configurations:
# Production - strict routing
model_registry:
routing_strategy:
type: strict
# Development - optimistic with fallback
model_registry:
routing_strategy:
type: optimistic
options:
fallback_behavior: compatible_only
# Dynamic environments - discovery mode
model_registry:
routing_strategy:
type: discovery
options:
discovery_refresh_on_miss: true
discovery_timeout: 2s
Unification Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| unification.enabled | bool | true | Enable unification |
| unification.stale_threshold | duration | 24h | Model retention time |
| unification.cleanup_interval | duration | 5m | Cleanup frequency |
| unification.cache_ttl | duration | 10m | Cache TTL |
Example:
model_registry:
unification:
enabled: true
stale_threshold: 12h
cleanup_interval: 15m
cache_ttl: 10m
Custom Unification Rules¶
| Field | Type | Description |
|---|---|---|
| unification.custom_rules[].platform | string | Platform to apply rules |
| unification.custom_rules[].name_patterns | map | Name pattern mappings |
| unification.custom_rules[].family_overrides | map | Family overrides |
Example:
model_registry:
unification:
custom_rules:
- platform: "ollama"
name_patterns:
"llama3.*": "llama3"
"mistral.*": "mistral"
family_overrides:
"llama3": "meta-llama"
Model Aliases Configuration¶
Define virtual model names that map to platform-specific model names across different backends.
Model Alias Mapping¶
| Field | Type | Default | Description |
|---|---|---|---|
| model_aliases | map[string][]string | nil | Map of alias name → list of actual model names |
Each key is the virtual model name clients will use. Each value is a list of the actual model names under which backends may serve that model. When a request matches an alias, Olla resolves endpoints for all listed model names and rewrites the request body to the correct name for the selected backend.
Example:
model_aliases:
my-llama:
- "llama3.1:8b" # Ollama
- llama-3.1-8b-instruct # LM Studio
- Meta-Llama-3.1-8B-Instruct.gguf # llamacpp
my-codegen:
- "qwen2.5-coder:7b" # Ollama
- qwen2.5-coder-7b-instruct # LM Studio
Note
Alias names take priority over standard model routing. If no endpoints are found for the alias, Olla falls back to standard routing using the alias name as a regular model name. See Model Aliases for details.
Routing Configuration¶
Model routing strategy settings for handling requests when models aren't available on all endpoints.
Model Routing Strategy¶
| Field | Type | Default | Description |
|---|---|---|---|
| routing.model_routing.type | string | "strict" | Routing strategy (strict, optimistic, discovery) |
| routing.model_routing.options.fallback_behavior | string | "compatible_only" | Fallback behavior (compatible_only, all, none) |
| routing.model_routing.options.discovery_refresh_on_miss | bool | false | Refresh discovery when model not found |
| routing.model_routing.options.discovery_timeout | duration | 2s | Discovery refresh timeout |
Strategy Types¶
- `strict`: Only routes to endpoints known to have the model
- `optimistic`: Falls back to healthy endpoints when the model is not found
- `discovery`: Refreshes model discovery before routing decisions
Example:
routing:
model_routing:
type: strict
options:
fallback_behavior: compatible_only
discovery_refresh_on_miss: false
discovery_timeout: 2s
Response Headers¶
Routing decisions are exposed via response headers:
| Header | Description |
|---|---|
| X-Olla-Routing-Strategy | Strategy used (strict/optimistic/discovery) |
| X-Olla-Routing-Decision | Action taken (routed/fallback/rejected) |
| X-Olla-Routing-Reason | Human-readable reason for decision |
Translators Configuration¶
API translation settings. Translators enable clients designed for one API format to work with backends that use a different format.
Anthropic Translation (v0.0.20+): Enabled by default. Still actively being improved -- please report any issues or feedback.
Anthropic Translator¶
The Anthropic translator enables Claude-compatible clients (Claude Code, OpenCode, Crush CLI) to work with OpenAI-compatible backends.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Master switch for the Anthropic translator. When false, the /olla/anthropic/v1/* endpoints do not exist. |
| passthrough_enabled | bool | true | Optimisation mode (only applies when enabled: true). When true, requests are forwarded directly to backends with native Anthropic support for zero translation overhead. When false, all requests go through the Anthropic-to-OpenAI translation pipeline regardless of backend capabilities. |
| max_message_size | int | 10485760 | Maximum request body size in bytes (10MB default). |
Two-Level Control: enabled + passthrough_enabled¶
The Anthropic translator uses a two-level configuration model:
- `enabled` is the master switch. When `false`, the translator is completely disabled and the `passthrough_enabled` setting has no effect. It is `true` by default.
- `passthrough_enabled` is the optimisation flag. It only takes effect when `enabled: true`.
When both are active, passthrough mode also requires that the backend profile declares native Anthropic support via api.anthropic_support.enabled: true. Both conditions must be true for passthrough to activate:
- `translators.anthropic.passthrough_enabled: true` (global configuration)
- Backend profile has `api.anthropic_support.enabled: true` (per-backend profile)
If either condition is false, Olla falls back to translation mode automatically.
Examples¶
Enable translator with passthrough (recommended for production):
translators:
anthropic:
enabled: true
passthrough_enabled: true # Forward directly to backends with native Anthropic support
max_message_size: 10485760 # 10MB
Enable translator with translation only (useful for debugging/testing):
translators:
anthropic:
enabled: true
passthrough_enabled: false # Always translate Anthropic ↔ OpenAI format
max_message_size: 10485760
Disable translator entirely:
translators:
anthropic:
enabled: false
# passthrough_enabled has no effect when enabled=false
passthrough_enabled: true
Performance Implications¶
| Mode | Overhead | When Used |
|---|---|---|
| Passthrough | Near-zero (~0ms) | passthrough_enabled: true and backend has native Anthropic support |
| Translation | ~1-5ms per request | passthrough_enabled: false, or backend lacks native Anthropic support |
| Disabled | N/A | enabled: false -- endpoints return 404 |
Detecting the Active Mode¶
Check the X-Olla-Mode response header:
- `X-Olla-Mode: passthrough` -- passthrough mode was used
- Header absent -- translation mode was used
Inspector (Development Only)¶
Do not enable in production -- logs full request/response bodies including potentially sensitive user data.
| Field | Type | Default | Description |
|---|---|---|---|
| inspector.enabled | bool | false | Enable request/response logging |
| inspector.output_dir | string | "logs/inspector/anthropic" | Directory for log output |
| inspector.session_header | string | "X-Session-ID" | Header for session grouping |
See Anthropic Inspector for details.
Logging Configuration¶
Application logging settings.
| Field | Type | Default | Description |
|---|---|---|---|
| level | string | "info" | Log level (debug, info, warn, error) |
| format | string | "json" | Log format (json or text) |
| output | string | "stdout" | Output destination |
Example:
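A sketch of a verbose, human-readable setup for local troubleshooting (values chosen from the options in the table above):

```yaml
logging:
  level: "debug"   # default is "info"
  format: "text"   # default is "json"
  output: "stdout"
```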
Log levels:
- `debug`: Detailed debugging information
- `info`: Normal operational messages
- `warn`: Warning conditions
- `error`: Error conditions only
Engineering Configuration¶
Debug and development features.
| Field | Type | Default | Description |
|---|---|---|---|
| show_nerdstats | bool | false | Show memory stats on shutdown |
Example:
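A sketch enabling the debug statistics described below:

```yaml
engineering:
  show_nerdstats: true   # default is false; prints runtime stats on shutdown
```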
When enabled, displays:
- Memory allocation statistics
- Garbage collection metrics
- Goroutine counts
- Runtime information
Environment Variables¶
All configuration can be overridden via environment variables.
Pattern: OLLA_<SECTION>_<KEY> in uppercase with underscores.
Examples:
# Server settings
OLLA_SERVER_HOST=0.0.0.0
OLLA_SERVER_PORT=8080
OLLA_SERVER_REQUEST_LOGGING=true
# Proxy settings
OLLA_PROXY_ENGINE=olla
OLLA_PROXY_LOAD_BALANCER=round-robin
OLLA_PROXY_PROFILE=auto
# Logging
OLLA_LOGGING_LEVEL=debug
OLLA_LOGGING_FORMAT=text
# Rate limits
OLLA_SERVER_RATE_LIMITS_GLOBAL_REQUESTS_PER_MINUTE=1000
Duration Format¶
Duration values use Go duration syntax:
- `s` - seconds (e.g., `30s`)
- `m` - minutes (e.g., `5m`)
- `h` - hours (e.g., `2h`)
- `ms` - milliseconds (e.g., `500ms`)
- `us` - microseconds (e.g., `100us`)
Examples:
- `30s` - 30 seconds
- `5m` - 5 minutes
- `1h30m` - 1 hour 30 minutes
- `500ms` - 500 milliseconds
Default Configuration¶
Complete default configuration:
server:
host: "localhost"
port: 40114
read_timeout: 30s
write_timeout: 0s
# idle_timeout: 0s # Optional (0 = use read_timeout)
shutdown_timeout: 10s
request_logging: true
request_limits:
max_body_size: 104857600 # 100MB
max_header_size: 1048576 # 1MB
rate_limits:
global_requests_per_minute: 1000
per_ip_requests_per_minute: 100
health_requests_per_minute: 1000
burst_size: 50
cleanup_interval: 5m
trust_proxy_headers: false
trusted_proxy_cidrs:
- "127.0.0.0/8"
- "10.0.0.0/8"
- "172.16.0.0/12"
- "192.168.0.0/16"
proxy:
engine: "sherpa"
profile: "auto"
load_balancer: "priority"
connection_timeout: 30s
response_timeout: 10m
read_timeout: 120s
# DEPRECATED as of v0.0.16 - retry is now automatic
# max_retries: 3
# retry_backoff: 1s
stream_buffer_size: 8192
discovery:
type: "static"
refresh_interval: 30s
model_discovery:
enabled: true
interval: 5m
timeout: 30s
concurrent_workers: 5
retry_attempts: 3
retry_backoff: 1s
static:
endpoints: []
model_registry:
type: "memory"
enable_unifier: true
routing_strategy:
type: "strict"
options:
fallback_behavior: "compatible_only"
discovery_timeout: 2s
discovery_refresh_on_miss: false
unification:
enabled: true
stale_threshold: 24h
cleanup_interval: 5m
cache_ttl: 10m
custom_rules: []
translators:
anthropic:
enabled: true
passthrough_enabled: true
max_message_size: 10485760 # 10MB
inspector:
enabled: false
output_dir: "logs/inspector/anthropic"
session_header: "X-Session-ID"
logging:
level: "info"
format: "json"
output: "stdout"
engineering:
show_nerdstats: false
Validation¶
Olla validates configuration on startup:
- Required fields are checked
- URLs must be valid
- Durations must parse correctly
- Endpoints must have unique names
- Ports must be in valid range (1-65535)
- CIDR blocks must be valid
Additionally, Olla's Validate() method catches dangerous zero or empty configuration values that would cause panics or silent failures at runtime. It runs after all config sources (file, environment overrides) have been merged, so the final state is what gets checked. The following conditions produce clear error messages at startup:
- `proxy.engine` is empty
- `proxy.load_balancer` is empty
- `discovery.type` is empty
- `server.port` is zero or negative
- When `model_discovery.enabled` is `true`: `interval`, `concurrent_workers`, or `timeout` is zero
Next Steps¶
- Configuration Examples - Common configurations
- Best Practices - Production recommendations
- Environment Variables - Override configuration