Quick Start

Get Olla up and running with this quick start guide.

Prerequisites

  • The olla binary installed and available on your PATH
  • At least one LLM backend to proxy, such as Ollama running on its default port (http://localhost:11434)
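
Before moving on, it helps to confirm the backend you plan to proxy is actually reachable. A minimal check, assuming a default Ollama install on port 11434:

# Ollama answers on /api/version when it is running
curl http://localhost:11434/api/version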

Basic Setup

1. Create Configuration

Create a config.yaml file:

server:
  host: "0.0.0.0"
  port: 40114
  request_logging: true

proxy:
  engine: "sherpa"  # or "olla" for high-performance
  load_balancer: "priority"

discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        # health_check_url: "/"  # Optional, defaults to provider-specific path

logging:
  level: "info"
  format: "json"

2. Start Olla

olla --config config.yaml

You should see output similar to:

{"level":"info","msg":"Starting Olla proxy server","port":40114}
{"level":"info","msg":"Health check passed","endpoint":"local-ollama"}
{"level":"info","msg":"Server ready","endpoints":1}

3. Test the Proxy

Check that Olla is running:

curl http://localhost:40114/internal/health
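
For scripting, the HTTP status code is easier to check than the body. A small sketch, assuming a healthy instance answers with a 2xx status:

# --fail makes curl exit non-zero on HTTP errors, so the message only prints on success
curl --fail --silent http://localhost:40114/internal/health > /dev/null && echo "Olla is healthy"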

List available models through the proxy:

# For Ollama endpoints
curl http://localhost:40114/olla/ollama/api/tags

# OpenAI-compatible model listing exposed by the same Ollama endpoint
curl http://localhost:40114/olla/ollama/v1/models

Example Requests

Chat Completion (OpenAI-compatible)

curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
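
To pull out just the assistant's reply on the command line, pipe the response through jq. This assumes the standard OpenAI-style response shape (choices[0].message.content), which is what an OpenAI-compatible route returns:

# Requires jq; prints only the generated text
curl -s -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello, how are you?"}]}' \
  | jq -r '.choices[0].message.content'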

Ollama Generate

curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?"
  }'
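
Ollama's generate API streams newline-delimited JSON by default; that is upstream Ollama behaviour, not something Olla adds. To receive a single JSON object instead, set stream to false:

curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'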

Streaming Response

curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
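
If streamed tokens seem to arrive in bursts rather than continuously, the buffering is often on the client side; curl's -N (--no-buffer) flag prints each chunk as it arrives:

# -N disables curl's output buffering so tokens appear as they stream
curl -N -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Tell me a story"}], "stream": true}'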

Multiple Endpoints Configuration

Configure multiple LLM endpoints with load balancing:

discovery:
  type: "static"
  static:
    endpoints:
      # High priority local Ollama
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

      # Medium priority LM Studio
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 50

      # Low priority remote endpoint
      - url: "https://api.example.com"
        name: "remote-api"
        type: "openai"
        priority: 10
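
With the priority load balancer, requests should normally go to the highest-priority healthy endpoint, with lower-priority endpoints acting as fallbacks. You can confirm which backend actually served a request from the X-Olla-Endpoint response header (described under Monitoring below):

# Print only the Olla tracing headers for a proxied request
curl -s -o /dev/null -D - http://localhost:40114/olla/ollama/api/tags | grep -i '^x-olla-'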

Monitoring

Monitor Olla's performance:

# Health status
curl http://localhost:40114/internal/health

# System status and statistics
curl http://localhost:40114/internal/status
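
For a rough live view while testing, you can simply poll the status endpoint; this just re-runs the same request every few seconds:

# Re-fetch Olla's status every 5 seconds (Ctrl-C to stop)
watch -n 5 'curl -s http://localhost:40114/internal/status'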

Response headers provide request tracing:

curl -I http://localhost:40114/olla/ollama/v1/models

Look for these headers:

  • X-Olla-Endpoint: Which backend handled the request
  • X-Olla-Backend-Type: Type of backend (ollama/openai/lmstudio)
  • X-Olla-Request-ID: Unique request identifier
  • X-Olla-Response-Time: Total processing time

Common Configuration Options

High-Performance Setup

For production environments, use the Olla engine:

proxy:
  engine: "olla"  # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: 30s
  max_retries: 3

Rate Limiting

Protect your endpoints with rate limiting:

server:
  rate_limits:
    global_requests_per_minute: 1000
    per_ip_requests_per_minute: 100
    burst_size: 50
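
A quick way to sanity-check the limits is to fire a burst of requests and tally the status codes. This sketch assumes over-limit requests are rejected with a 4xx status (commonly 429); the exact code and which routes are limited depend on your configuration:

# Send 150 quick requests and count the status codes returned
for i in $(seq 1 150); do
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:40114/olla/ollama/v1/models
done | sort | uniq -c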

Request Size Limits

Set appropriate request limits:

server:
  request_limits:
    max_body_size: 52428800   # 50MB
    max_header_size: 524288   # 512KB
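
The values are in bytes (52428800 bytes = 50MB). To confirm the body limit is enforced, you can POST something larger than the cap and check the status code; the exact rejection code depends on Olla, but 413 Payload Too Large is typical for this kind of limit:

# Build a ~60MB body (over the 50MB cap) and expect a 4xx status back
head -c 60000000 /dev/zero | tr '\0' 'a' > /tmp/oversized-body
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  --data-binary @/tmp/oversized-body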

Learn More

  • Core Concepts
  • Configuration
  • Next Steps

Troubleshooting

Endpoint Not Responding

Check your endpoint URLs and ensure the services are running:

# Test direct access to your LLM endpoint
curl http://localhost:11434/api/tags
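
The same kind of direct check for the other endpoint types used in this guide:

# LM Studio (default port 1234)
curl http://localhost:1234/v1/models

# OpenAI-compatible remote endpoint
curl https://api.example.com/v1/models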

Health Checks Failing

Verify health check URLs are correct for your endpoint type (a quick way to test a path is shown after this list):

  • Ollama: Use / or /api/version
  • LM Studio: Use / or /v1/models
  • OpenAI-compatible: Use /v1/models
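
To test a candidate path before putting it in your configuration, check that it returns a successful status directly against the backend:

# A valid health check path should return 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:11434/api/version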

High Latency

Consider switching to the high-performance Olla engine:

proxy:
  engine: "olla"
  load_balancer: "least-connections"
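
Before and after switching engines, you can compare the X-Olla-Response-Time header on a representative request to see whether latency actually improves:

# X-Olla-Response-Time reports Olla's total processing time for the request
curl -s -o /dev/null -D - http://localhost:40114/olla/ollama/v1/models | grep -i 'x-olla-response-time'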

For more detailed troubleshooting, check the logs and open an issue if needed.