Quick Start

Get Olla up and running with this quick start guide.

Prerequisites

  • The olla binary installed and available on your PATH
  • At least one LLM backend to proxy, such as Ollama running on its default port (http://localhost:11434)
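
Before moving on, it helps to confirm the backend you plan to proxy is actually reachable. A minimal check, assuming a default Ollama install on port 11434:

# Ollama answers on /api/version when it is running
curl http://localhost:11434/api/version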

Basic Setup

1. Create Configuration

Create a config.yaml file:

server:
  host: "0.0.0.0"
  port: 40114
  request_logging: true

proxy:
  engine: "sherpa"  # or "olla" for high-performance
  load_balancer: "priority"

discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        # health_check_url: "/"  # Optional, defaults to provider-specific path

logging:
  level: "info"
  format: "json"

2. Start Olla

olla --config config.yaml

You should see output similar to:

{"level":"info","msg":"Starting Olla proxy server","port":40114}
{"level":"info","msg":"Health check passed","endpoint":"local-ollama"}
{"level":"info","msg":"Server ready","endpoints":1}

3. Test the Proxy

Check that Olla is running:

curl http://localhost:40114/internal/health
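
For scripting, the HTTP status code is easier to check than the body. A small sketch, assuming a healthy instance answers with a 2xx status:

# --fail makes curl exit non-zero on HTTP errors, so the message only prints on success
curl --fail --silent http://localhost:40114/internal/health > /dev/null && echo "Olla is healthy"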

List available models through the proxy:

# For Ollama endpoints
curl http://localhost:40114/olla/ollama/api/tags

# OpenAI-compatible model listing exposed by the same Ollama endpoint
curl http://localhost:40114/olla/ollama/v1/models

Example Requests

Chat Completion (OpenAI-compatible)

curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
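
To pull out just the assistant's reply on the command line, pipe the response through jq. This assumes the standard OpenAI-style response shape (choices[0].message.content), which is what an OpenAI-compatible route returns:

# Requires jq; prints only the generated text
curl -s -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello, how are you?"}]}' \
  | jq -r '.choices[0].message.content'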

Ollama Generate

curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?"
  }'
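
Ollama's generate API streams newline-delimited JSON by default; that is upstream Ollama behaviour, not something Olla adds. To receive a single JSON object instead, set stream to false:

curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'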

Streaming Response

curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
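
If streamed tokens seem to arrive in bursts rather than continuously, the buffering is often on the client side; curl's -N (--no-buffer) flag prints each chunk as it arrives:

# -N disables curl's output buffering so tokens appear as they stream
curl -N -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Tell me a story"}], "stream": true}'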

Multiple Endpoints Configuration

Configure multiple LLM endpoints with load balancing:

discovery:
  type: "static"
  static:
    endpoints:
      # High priority local Ollama
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

      # Medium priority LM Studio
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 50

      # Low priority remote endpoint
      - url: "https://api.example.com"
        name: "remote-api"
        type: "openai"
        priority: 10
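
With the priority load balancer, requests should normally go to the highest-priority healthy endpoint, with lower-priority endpoints acting as fallbacks. You can confirm which backend actually served a request from the X-Olla-Endpoint response header (described under Monitoring below):

# Print only the Olla tracing headers for a proxied request
curl -s -o /dev/null -D - http://localhost:40114/olla/ollama/api/tags | grep -i '^x-olla-'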

Monitoring

Monitor Olla's performance:

# Health status
curl http://localhost:40114/internal/health

# System status and statistics
curl http://localhost:40114/internal/status
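
For a rough live view while testing, you can simply poll the status endpoint; this just re-runs the same request every few seconds:

# Re-fetch Olla's status every 5 seconds (Ctrl-C to stop)
watch -n 5 'curl -s http://localhost:40114/internal/status'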

Response headers provide request tracing:

curl -I http://localhost:40114/olla/ollama/v1/models

Look for these headers:

  • X-Olla-Endpoint: Which backend handled the request
  • X-Olla-Backend-Type: Type of backend (ollama/openai/lmstudio)
  • X-Olla-Request-ID: Unique request identifier
  • X-Olla-Response-Time: Total processing time

Common Configuration Options

High-Performance Setup

For production environments, use the Olla engine:

proxy:
  engine: "olla"  # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: 30s
  max_retries: 3

Rate Limiting

Protect your endpoints with rate limiting:

server:
  rate_limits:
    global_requests_per_minute: 1000
    per_ip_requests_per_minute: 100
    burst_size: 50
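
A quick way to sanity-check the limits is to fire a burst of requests and tally the status codes. This sketch assumes over-limit requests are rejected with a 4xx status (commonly 429); the exact code and which routes are limited depend on your configuration:

# Send 150 quick requests and count the status codes returned
for i in $(seq 1 150); do
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:40114/olla/ollama/v1/models
done | sort | uniq -c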

Request Size Limits

Set appropriate request limits:

server:
  request_limits:
    max_body_size: 52428800   # 50MB
    max_header_size: 524288   # 512KB
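
The values are in bytes (52428800 bytes = 50MB). To confirm the body limit is enforced, you can POST something larger than the cap and check the status code; the exact rejection code depends on Olla, but 413 Payload Too Large is typical for this kind of limit:

# Build a ~60MB body (over the 50MB cap) and expect a 4xx status back
head -c 60000000 /dev/zero | tr '\0' 'a' > /tmp/oversized-body
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  --data-binary @/tmp/oversized-body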

Learn More

  • Core Concepts
  • Configuration
  • Next Steps

Troubleshooting

Endpoint Not Responding

Check your endpoint URLs and ensure the services are running:

# Test direct access to your LLM endpoint
curl http://localhost:11434/api/tags
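
The same kind of direct check for the other endpoint types used in this guide:

# LM Studio (default port 1234)
curl http://localhost:1234/v1/models

# OpenAI-compatible remote endpoint
curl https://api.example.com/v1/models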

Health Checks Failing

Verify health check URLs are correct for your endpoint type (a quick way to test a path is shown after this list):

  • Ollama: Use / or /api/version
  • LM Studio: Use / or /v1/models
  • OpenAI-compatible: Use /v1/models
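
To test a candidate path before putting it in your configuration, check that it returns a successful status directly against the backend:

# A valid health check path should return 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:11434/api/version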

High Latency

Consider switching to the high-performance Olla engine:

proxy:
  engine: "olla"
  load_balancer: "least-connections"
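
Before and after switching engines, you can compare the X-Olla-Response-Time header on a representative request to see whether latency actually improves:

# X-Olla-Response-Time reports Olla's total processing time for the request
curl -s -o /dev/null -D - http://localhost:40114/olla/ollama/v1/models | grep -i 'x-olla-response-time'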

For more detailed troubleshooting, check the logs and open an issue if needed.