Skip to content

Configuration Reference

Complete reference for all Olla configuration options.

📝 Default Configuration

server:
  host: "localhost"
  port: 40114

proxy:
  engine: "sherpa"
  load_balancer: "priority"

discovery:
  model_discovery:
    enabled: true
    interval: 5m

logging:
  level: "info"
  format: "json"
Minimal Setup: Olla starts with sensible defaults - just run olla and it works!

Environment Variables: All settings support OLLA_ prefix (e.g., OLLA_SERVER_PORT=8080)

Configuration Structure

server:         # HTTP server configuration
proxy:          # Proxy engine settings
discovery:      # Endpoint discovery
model_registry: # Model management
logging:        # Logging configuration
engineering:    # Debug features

Server Configuration

HTTP server and security settings.

Basic Settings

Field Type Default Description
host string "localhost" Network interface to bind
port int 40114 TCP port to listen on
request_logging bool false Enable request logging

Example:

server:
  host: "0.0.0.0"
  port: 40114
  request_logging: true

Timeouts

Field Type Default Description
read_timeout duration 20s Time to read request
write_timeout duration 0s Response write timeout (must be 0 for streaming)
idle_timeout duration 120s Keep-alive timeout
shutdown_timeout duration 10s Graceful shutdown timeout

Example:

server:
  read_timeout: 30s
  write_timeout: 0s      # Required for streaming
  idle_timeout: 120s
  shutdown_timeout: 30s

Request Limits

Field Type Default Description
request_limits.max_body_size int64 52428800 Max request body (bytes)
request_limits.max_header_size int64 524288 Max header size (bytes)

Example:

server:
  request_limits:
    max_body_size: 104857600    # 100MB
    max_header_size: 1048576     # 1MB

Rate Limits

Field Type Default Description
rate_limits.global_requests_per_minute int 0 Global rate limit (0=disabled)
rate_limits.per_ip_requests_per_minute int 0 Per-IP rate limit (0=disabled)
rate_limits.health_requests_per_minute int 0 Health endpoint limit
rate_limits.burst_size int 50 Token bucket burst size
rate_limits.cleanup_interval duration 1m Rate limiter cleanup
rate_limits.trust_proxy_headers bool false Trust X-Forwarded-For
rate_limits.trusted_proxy_cidrs []string [] Trusted proxy CIDRs

Example:

server:
  rate_limits:
    global_requests_per_minute: 10000
    per_ip_requests_per_minute: 100
    health_requests_per_minute: 5000
    burst_size: 50
    cleanup_interval: 1m
    trust_proxy_headers: true
    trusted_proxy_cidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"

Proxy Configuration

Proxy engine and request handling settings.

Basic Settings

Field Type Default Description
engine string "sherpa" Proxy engine (sherpa or olla)
profile string "auto" Proxy profile (auto, streaming, standard)
load_balancer string "priority" Load balancer strategy

Example:

proxy:
  engine: "olla"
  profile: "auto"
  load_balancer: "least-connections"

Connection Settings

Field Type Default Description
connection_timeout duration 30s Backend connection timeout
response_timeout duration 0s Response timeout (0=disabled)
read_timeout duration 0s Read timeout (0=disabled)

Example:

proxy:
  connection_timeout: 45s
  response_timeout: 0s    # Disable for streaming
  read_timeout: 0s

Retry Behaviour

As of v0.0.16, the retry mechanism is automatic and built-in for connection failures. When a connection error occurs (e.g., connection refused, network unreachable, timeout), Olla will automatically:

  1. Mark the failed endpoint as unhealthy
  2. Try the next available healthy endpoint
  3. Continue until a successful connection is made or all endpoints have been tried
  4. Use exponential backoff for unhealthy endpoints to prevent overwhelming them

Note: The fields max_retries and retry_backoff that may still appear in the configuration are deprecated and ignored. The retry behaviour is now automatic and cannot be configured.

Streaming Settings

Field Type Default Description
stream_buffer_size int 4096 Stream buffer size (bytes)

Example:

proxy:
  stream_buffer_size: 8192

Profile Filtering

Control which inference profiles are loaded at startup. See Filter Concepts for pattern details.

Field Type Default Description
profile_filter.include []string [] Profiles to include (glob patterns)
profile_filter.exclude []string [] Profiles to exclude (glob patterns)

Example:

proxy:
  profile_filter:
    include:
      - "ollama"        # Include Ollama
      - "openai*"       # Include all OpenAI variants
    exclude:
      - "*test*"        # Exclude test profiles
      - "*debug*"       # Exclude debug profiles

Discovery Configuration

Endpoint discovery and health checking.

Discovery Type

Field Type Default Description
type string "static" Discovery type (only static supported)
refresh_interval duration 5m Discovery refresh interval

Example:

discovery:
  type: "static"
  refresh_interval: 10m

Static Endpoints

Field Type Required Description
static.endpoints[].url string Yes Endpoint base URL
static.endpoints[].name string Yes Unique endpoint name
static.endpoints[].type string Yes Backend type (ollama, lm-studio, vllm, openai)
static.endpoints[].priority int No Selection priority (higher=preferred)
static.endpoints[].health_check_url string No Health check path
static.endpoints[].model_url string No Model discovery path
static.endpoints[].check_interval duration No Health check interval
static.endpoints[].check_timeout duration No Health check timeout
static.endpoints[].model_filter object No Model filtering for this endpoint

Endpoint Model Filtering

Filter models at the endpoint level during discovery. See Filter Concepts for pattern syntax.

Field Type Description
model_filter.include []string Models to include (glob patterns)
model_filter.exclude []string Models to exclude (glob patterns)

Example:

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        health_check_url: "/"
        model_url: "/api/tags"
        check_interval: 30s
        check_timeout: 5s
        model_filter:
          exclude:
            - "*embed*"         # No embedding models
            - "*uncensored*"    # No uncensored models
            - "nomic-*"         # No Nomic models

      - url: "http://remote:11434"
        name: "remote-ollama"
        type: "ollama"
        priority: 50
        check_interval: 60s
        model_filter:
          include:
            - "llama*"          # Only Llama models
            - "mistral*"        # And Mistral models

Model Discovery

Field Type Default Description
model_discovery.enabled bool true Enable model discovery
model_discovery.interval duration 5m Discovery interval
model_discovery.timeout duration 30s Discovery timeout
model_discovery.concurrent_workers int 5 Parallel workers
model_discovery.retry_attempts int 3 Retry attempts
model_discovery.retry_backoff duration 5s Retry backoff

Example:

discovery:
  model_discovery:
    enabled: true
    interval: 10m
    timeout: 30s
    concurrent_workers: 10
    retry_attempts: 3
    retry_backoff: 5s

Model Registry Configuration

Model management and unification settings.

Registry Type

Field Type Default Description
type string "memory" Registry type (only memory supported)
enable_unifier bool true Enable model unification
routing_strategy.type string "strict" Model routing strategy (strict/optimistic/discovery)

Example:

model_registry:
  type: "memory"
  enable_unifier: true
  routing_strategy:
    type: strict  # Default: only route to endpoints with the model

Model Routing Strategy

Controls how requests are routed when models aren't available on all endpoints:

Field Type Default Description
routing_strategy.type string "strict" Strategy: strict, optimistic, or discovery
routing_strategy.options.fallback_behavior string "compatible_only" Fallback: compatible_only, all, or none
routing_strategy.options.discovery_timeout duration 2s Timeout for discovery refresh
routing_strategy.options.discovery_refresh_on_miss bool false Refresh discovery when model not found

Example configurations:

# Production - strict routing
model_registry:
  routing_strategy:
    type: strict

# Development - optimistic with fallback
model_registry:
  routing_strategy:
    type: optimistic
    options:
      fallback_behavior: compatible_only

# Dynamic environments - discovery mode
model_registry:
  routing_strategy:
    type: discovery
    options:
      discovery_refresh_on_miss: true
      discovery_timeout: 2s

Unification Settings

Field Type Default Description
unification.enabled bool true Enable unification
unification.stale_threshold duration 24h Model retention time
unification.cleanup_interval duration 10m Cleanup frequency
unification.cache_ttl duration 5m Cache TTL

Example:

model_registry:
  unification:
    enabled: true
    stale_threshold: 12h
    cleanup_interval: 15m
    cache_ttl: 10m

Custom Unification Rules

Field Type Description
unification.custom_rules[].platform string Platform to apply rules
unification.custom_rules[].name_patterns map Name pattern mappings
unification.custom_rules[].family_overrides map Family overrides

Example:

model_registry:
  unification:
    custom_rules:
      - platform: "ollama"
        name_patterns:
          "llama3.*": "llama3"
          "mistral.*": "mistral"
        family_overrides:
          "llama3": "meta-llama"

Routing Configuration

Model routing strategy settings for handling requests when models aren't available on all endpoints.

Model Routing Strategy

Field Type Default Description
routing.model_routing.type string "strict" Routing strategy (strict, optimistic, discovery)
routing.model_routing.options.fallback_behavior string "compatible_only" Fallback behavior (compatible_only, all, none)
routing.model_routing.options.discovery_refresh_on_miss bool false Refresh discovery when model not found
routing.model_routing.options.discovery_timeout duration 2s Discovery refresh timeout

Strategy Types

  • strict: Only routes to endpoints known to have the model
  • optimistic: Falls back to healthy endpoints when model not found
  • discovery: Refreshes model discovery before routing decisions

Example:

routing:
  model_routing:
    type: strict
    options:
      fallback_behavior: compatible_only
      discovery_refresh_on_miss: false
      discovery_timeout: 2s

Response Headers

Routing decisions are exposed via response headers:

Header Description
X-Olla-Routing-Strategy Strategy used (strict/optimistic/discovery)
X-Olla-Routing-Decision Action taken (routed/fallback/rejected)
X-Olla-Routing-Reason Human-readable reason for decision

Logging Configuration

Application logging settings.

Field Type Default Description
level string "info" Log level (debug, info, warn, error)
format string "json" Log format (json or text)
output string "stdout" Output destination

Example:

logging:
  level: "info"
  format: "json"
  output: "stdout"

Log levels:

  • debug: Detailed debugging information
  • info: Normal operational messages
  • warn: Warning conditions
  • error: Error conditions only

Engineering Configuration

Debug and development features.

Field Type Default Description
show_nerdstats bool false Show memory stats on shutdown

Example:

engineering:
  show_nerdstats: true

When enabled, displays:

  • Memory allocation statistics
  • Garbage collection metrics
  • Goroutine counts
  • Runtime information

Environment Variables

All configuration can be overridden via environment variables.

Pattern: OLLA_<SECTION>_<KEY> in uppercase with underscores.

Examples:

# Server settings
OLLA_SERVER_HOST=0.0.0.0
OLLA_SERVER_PORT=8080
OLLA_SERVER_REQUEST_LOGGING=true

# Proxy settings
OLLA_PROXY_ENGINE=olla
OLLA_PROXY_LOAD_BALANCER=round-robin
OLLA_PROXY_PROFILE=auto

# Logging
OLLA_LOGGING_LEVEL=debug
OLLA_LOGGING_FORMAT=text

# Rate limits
OLLA_SERVER_RATE_LIMITS_GLOBAL_REQUESTS_PER_MINUTE=1000

Duration Format

Duration values use Go duration syntax:

  • s - seconds (e.g., 30s)
  • m - minutes (e.g., 5m)
  • h - hours (e.g., 2h)
  • ms - milliseconds (e.g., 500ms)
  • us - microseconds (e.g., 100us)

Examples:

  • 30s - 30 seconds
  • 5m - 5 minutes
  • 1h30m - 1 hour 30 minutes
  • 500ms - 500 milliseconds

Default Configuration

Complete default configuration:

server:
  host: "localhost"
  port: 40114
  read_timeout: 20s
  write_timeout: 0s
  idle_timeout: 120s
  shutdown_timeout: 10s
  request_logging: false
  request_limits:
    max_body_size: 52428800    # 50MB
    max_header_size: 524288     # 512KB
  rate_limits:
    global_requests_per_minute: 0
    per_ip_requests_per_minute: 0
    health_requests_per_minute: 0
    burst_size: 50
    cleanup_interval: 1m
    trust_proxy_headers: false
    trusted_proxy_cidrs: []

proxy:
  engine: "sherpa"
  profile: "auto"
  load_balancer: "priority"
  connection_timeout: 30s
  response_timeout: 0s
  read_timeout: 0s
  # DEPRECATED as of v0.0.16 - retry is now automatic
  # max_retries: 3
  # retry_backoff: 1s
  stream_buffer_size: 4096

discovery:
  type: "static"
  refresh_interval: 5m
  model_discovery:
    enabled: true
    interval: 5m
    timeout: 30s
    concurrent_workers: 5
    retry_attempts: 3
    retry_backoff: 5s
  static:
    endpoints: []

model_registry:
  type: "memory"
  enable_unifier: true
  routing_strategy:
    type: "strict"
    options:
      fallback_behavior: "compatible_only"
      discovery_timeout: 2s
      discovery_refresh_on_miss: false
  unification:
    enabled: true
    stale_threshold: 24h
    cleanup_interval: 10m
    cache_ttl: 5m
    custom_rules: []

logging:
  level: "info"
  format: "json"
  output: "stdout"

engineering:
  show_nerdstats: false

Validation

Olla validates configuration on startup:

  • Required fields are checked
  • URLs must be valid
  • Durations must parse correctly
  • Endpoints must have unique names
  • Ports must be in valid range (1-65535)
  • CIDR blocks must be valid

Next Steps