# Performance Best Practices
This guide covers performance optimisation techniques for Olla deployments.
## Default Configuration

Performance defaults:

```yaml
proxy:
  engine: "sherpa"            # Simple, maintainable engine
  load_balancer: "priority"   # Fastest routing decisions
  stream_buffer_size: 8192    # 8KB buffer (balanced)
  connection_timeout: 30s     # Connection reuse duration

server:
  write_timeout: 0s           # Required for streaming
  read_timeout: 30s           # Request timeout
```

- Sherpa engine for simplicity (use "olla" for production)
- Priority load balancer for lowest CPU usage
- 8KB buffer size for balanced performance

Environment variables: `OLLA_PROXY_ENGINE`, `OLLA_PROXY_LOAD_BALANCER`
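These can be overridden at startup without editing the config file; for example (a shell sketch using the variables above):

```bash
# Override the proxy engine and load balancer for this run
export OLLA_PROXY_ENGINE=olla
export OLLA_PROXY_LOAD_BALANCER=least-connections
```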
## Performance Overview
Olla is designed for high-performance LLM request routing with:
- Sub-millisecond routing decisions
- Minimal memory overhead
- Efficient connection pooling
- Lock-free statistics collection
- Zero-copy streaming where possible
## Engine Selection

### Sherpa vs Olla Engine
Choose the right engine for your performance needs:
| Metric | Sherpa | Olla |
|---|---|---|
| Throughput | Moderate | High |
| Latency Overhead | Low (1-5ms) | Very Low (<1ms) |
| Memory Usage | 50-100MB | 100-200MB |
| Connection Pooling | Basic | Advanced |
| Best For | Development | Production |
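For production, switching to the high-performance engine is a one-line change (a minimal sketch using the `proxy.engine` key shown above):

```yaml
proxy:
  engine: "olla"   # high-throughput engine with advanced connection pooling
```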
## Load Balancer Optimisation

### Strategy Performance Impact
Different strategies have different performance characteristics:
| Strategy | CPU Usage | Memory | Latency | Best For |
|---|---|---|---|---|
| priority | Lowest | Lowest | Fastest | Static preferences |
| round-robin | Low | Low | Fast | Even distribution |
| least-connections | Medium | Medium | Medium | Dynamic load |
```yaml
# Fastest routing decisions
proxy:
  load_balancer: "priority"
```

```yaml
# Best response times under load
proxy:
  load_balancer: "least-connections"
```
## Connection Management

### Connection Pooling

The Olla engine maintains a pool of persistent connections to each backend; reuse duration is governed by `connection_timeout`, as sketched after the list below.
Benefits:
- Eliminates TCP handshake overhead
- Reduces latency by 10-50ms per request
- Lower CPU usage
- Better throughput
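A minimal sketch, assuming reuse duration is governed by the `connection_timeout` key covered under Timeout Tuning below:

```yaml
proxy:
  engine: "olla"
  connection_timeout: 60s   # keep pooled backend connections available for reuse
```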
## Timeout Tuning

### Critical Timeout Settings
```yaml
server:
  read_timeout: 30s         # Time to read request
  write_timeout: 0s         # MUST be 0 for streaming
  idle_timeout: 120s        # Keep-alive duration
  shutdown_timeout: 30s     # Graceful shutdown

proxy:
  connection_timeout: 30s   # Backend connection timeout
```
**Important:** `write_timeout: 0s` is required for LLM streaming. Any non-zero write deadline would cut long-running token streams off mid-response.
## Health Check Optimisation
Balance detection speed vs overhead:
```yaml
endpoints:
  - url: "http://localhost:11434"
    check_interval: 30s   # Not too frequent
    check_timeout: 2s     # Fast failure detection
```
Too frequent checks waste resources:
- 5s interval = 12 checks/minute/endpoint
- 30s interval = 2 checks/minute/endpoint
- With 10 endpoints, that's 120 vs 20 checks/minute
## Memory Optimisation

### Memory Usage Patterns
Typical memory usage:
| Component | Sherpa | Olla |
|---|---|---|
| Base | 30MB | 50MB |
| Per Connection | 4KB | 8KB |
| Per Model | 1KB | 1KB |
| Request Buffer | 64KB | 128KB |
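As a worked example from this table, an Olla-engine instance serving 100 concurrent requests needs roughly 50MB + 100 × 8KB + 100 × 128KB ≈ 64MB, comfortably inside the 100-200MB envelope quoted earlier.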
### Reducing Memory Usage

```yaml
# Memory-conscious configuration
server:
  request_limits:
    max_body_size: 5242880    # 5MB instead of 50MB
    max_header_size: 65536    # 64KB instead of 512KB

model_registry:
  unification:
    stale_threshold: 1h       # Aggressive cleanup
    cleanup_interval: 5m      # Frequent cleanup
```
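To verify the effect, check the proxy's resident memory (a sketch; the process name `olla` is an assumption for your deployment):

```bash
# Resident set size of the olla process in MB (process name assumed)
ps -o rss= -C olla | awk '{printf "%.1f MB\n", $1/1024}'
```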
## Buffer Tuning

The `stream_buffer_size` setting is crucial for balancing latency and throughput:
```yaml
# Low-latency configuration (chat applications)
proxy:
  engine: "sherpa"
  profile: "streaming"
  stream_buffer_size: 2048    # 2KB - Fastest first token (~5ms)
```

```yaml
# Balanced configuration (general purpose)
proxy:
  engine: "sherpa"
  profile: "auto"
  stream_buffer_size: 8192    # 8KB - Good balance (~20ms)
```

```yaml
# High-throughput configuration (batch processing)
proxy:
  engine: "olla"
  profile: "standard"
  stream_buffer_size: 65536   # 64KB - Maximum throughput (~150ms)
```
### Buffer Size Trade-offs
| Buffer Size | First Token | Throughput | Memory/Req | Best For |
|---|---|---|---|---|
| 2KB | ~5ms | Lower | 2KB | Interactive chat |
| 4KB | ~10ms | Moderate | 4KB | Responsive UIs |
| 8KB | ~20ms | Good | 8KB | General purpose |
| 16KB | ~40ms | Better | 16KB | API serving |
| 64KB | ~150ms | Best | 64KB | Batch operations |
See Stream Buffer Size for detailed analysis.
## CPU Optimisation

### Reducing CPU Usage
- Use Priority Load Balancing: Least CPU intensive
- Increase Health Check Intervals: Reduce background work
- Disable Request Logging: Significant CPU savings
- Use JSON Logging: More efficient than text
```yaml
# CPU-optimised configuration
proxy:
  load_balancer: "priority"   # Lowest CPU usage

server:
  request_logging: false      # Disable for performance

logging:
  level: "warn"               # Reduce logging
  format: "json"              # More efficient

discovery:
  model_discovery:
    enabled: false            # Disable if not needed
```
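To confirm the savings, sample the proxy's CPU usage (a sketch; `pidstat` ships with the sysstat package, and the process name `olla` is an assumption):

```bash
# One 5-second CPU sample for the olla process (name assumed)
pidstat -p "$(pgrep -x olla)" 5 1
```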
## Network Optimisation

### Reduce Network Latency
- Colocate Olla with Backends: Same machine or network
- Use Internal Networks: Avoid internet routing
- Enable Keep-Alive: Reuse connections
```yaml
discovery:
  static:
    endpoints:
      # Use localhost when possible
      - url: "http://localhost:11434"
        name: "local"
        priority: 100
      # Use internal IPs over public
      - url: "http://10.0.1.10:11434"
        name: "internal"
        priority: 90
```
### TCP Tuning
System-level optimisations:
```bash
# Increase socket buffer sizes
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728

# TCP optimisation: avoid re-entering slow start on idle, pooled connections
# (TCP_NODELAY and quick ACKs are per-socket options set by the application,
# not system-wide sysctls)
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
```
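`sysctl -w` changes are lost on reboot; persist them with a drop-in configuration file:

```bash
# Persist the tuning across reboots (run as root)
cat <<'EOF' > /etc/sysctl.d/99-olla-tuning.conf
net.core.rmem_max=134217728
net.core.wmem_max=134217728
net.ipv4.tcp_slow_start_after_idle=0
EOF
sysctl --system   # reload all sysctl configuration files
```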
## Streaming Performance

### Optimise for Streaming
```yaml
proxy:
  engine: "olla"
  profile: "streaming"   # Optimised for token streaming

server:
  write_timeout: 0s      # Required for streaming
```
### Chunk Size Tuning
The streaming profile uses optimal chunk sizes:
- Small chunks (4KB): Lower latency
- Large chunks (64KB): Better throughput
## Caching Strategies

### Model Registry Caching
```yaml
model_registry:
  type: "memory"
  enable_unifier: true
  unification:
    enabled: true
    stale_threshold: 24h    # Cache models for 24h
    cleanup_interval: 30m   # Infrequent cleanup
```
### Health Check Caching

Health check results are cached between intervals, so routing decisions reuse the most recent result rather than probing a backend on every request.
## Benchmarking

### Load Testing

Use tools like `hey` or `ab`:
```bash
# Test throughput
hey -n 10000 -c 100 http://localhost:40114/internal/health

# Test streaming
hey -n 100 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","prompt":"test"}' \
  http://localhost:40114/olla/ollama/api/generate
```
### Key Metrics to Monitor
| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| P50 Latency | <10ms | <50ms | >100ms |
| P99 Latency | <100ms | <500ms | >1s |
| CPU Usage | <50% | <80% | >90% |
| Memory | <200MB | <500MB | >1GB |
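`hey` prints a percentile breakdown you can compare directly against this table; for example (reusing the health endpoint from Load Testing above):

```bash
# Extract the P50 and P99 lines from hey's latency report
hey -n 5000 -c 50 http://localhost:40114/internal/health | grep -E '(50|99)%'
```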
## Production Optimisations

### High-Throughput Configuration
```yaml
server:
  host: "0.0.0.0"
  port: 40114
  request_logging: false      # Disable logging
  request_limits:
    max_body_size: 52428800   # 50MB
    max_header_size: 524288   # 512KB

proxy:
  engine: "olla"              # High-performance engine
  profile: "auto"             # Dynamic selection
  load_balancer: "least-connections"
  connection_timeout: 60s     # Long connection reuse
  max_retries: 2              # Limit retries

discovery:
  model_discovery:
    enabled: true
    interval: 15m             # Infrequent updates
    concurrent_workers: 10    # Parallel discovery

logging:
  level: "error"              # Minimal logging
  format: "json"
```
### Low-Latency Configuration
```yaml
server:
  host: "localhost"
  port: 40114
  read_timeout: 10s           # Fast failure

proxy:
  engine: "olla"
  profile: "streaming"        # Optimise for streaming
  load_balancer: "priority"   # Fastest decisions
  connection_timeout: 120s    # Reuse connections
  max_retries: 1              # Fast failure

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local"
        priority: 100
        check_interval: 60s   # Reduce overhead
        check_timeout: 1s     # Fast detection

logging:
  level: "error"
  format: "json"
```
## Scaling Strategies

### Vertical Scaling
When to scale up:
- CPU consistently >80%
- Memory pressure
- Network saturation
### Horizontal Scaling

Deploy multiple Olla instances behind a load balancer such as nginx:
```nginx
upstream olla_cluster {
    least_conn;
    server olla1:40114;
    server olla2:40114;
    server olla3:40114;
}

server {
    location / {
        proxy_pass http://olla_cluster;
    }
}
```
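When nginx fronts streamed LLM responses, also disable response buffering so tokens reach clients as they are generated (a sketch; the timeout value is an assumption to tune for your workloads):

```nginx
server {
    location / {
        proxy_pass http://olla_cluster;
        proxy_buffering off;       # flush streamed tokens immediately
        proxy_read_timeout 300s;   # allow long-running generations (assumed value)
    }
}
```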
### Load Distribution
Distribute by:
- Geographic Region: Deploy per region
- Model Type: Dedicated instances per model
- Priority: Separate production/development
## Performance Troubleshooting

### High CPU Usage
Causes and solutions:
- Frequent Health Checks: Increase intervals
- Debug Logging: Switch to warn/error level
- Request Logging: Disable if not needed
- Wrong Engine: Switch to Olla engine
### High Memory Usage
Causes and solutions:
- Large Request Limits: Reduce max_body_size
- Connection Leaks: Check timeout settings
- Model Registry: Reduce stale_threshold
- Memory Leaks: Update to latest version
### High Latency
Causes and solutions:
- Network Distance: Colocate services
- Cold Connections: Increase connection_timeout
- Overloaded Backends: Add more endpoints
- Wrong Load Balancer: Try least-connections
## Monitoring Performance

### Key Metrics
Monitor these for performance:
```bash
# Average latency
curl http://localhost:40114/internal/status | jq '.system.avg_latency'

# Total requests
curl http://localhost:40114/internal/status | jq '.system.total_requests'

# Active connections
curl http://localhost:40114/internal/status | jq '.system.active_connections'
```
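To watch these values over time from a terminal (a sketch, assuming `watch` and `jq` are installed):

```bash
# Refresh the system stats block every 10 seconds
watch -n 10 "curl -s http://localhost:40114/internal/status | jq '.system'"
```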
### Performance Dashboards
Track over time:
- Request rate (req/s)
- Response time (P50, P95, P99)
- Error rate (%)
- CPU usage (%)
- Memory usage (MB)
- Network I/O (MB/s)
## Next Steps
- Monitoring Guide - Track performance metrics
- Configuration Reference - All performance settings
- Proxy Engines - Detailed engine comparison