# Quick Start

Get Olla up and running with this quick start guide.
## Prerequisites

- Olla installed on your system
- At least one compatible LLM endpoint running
## Basic Setup

### 1. Create Configuration

Create a `config.yaml` file:
```yaml
server:
  host: "0.0.0.0"
  port: 40114
  request_logging: true

proxy:
  engine: "sherpa"  # or "olla" for high-performance
  load_balancer: "priority"

discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        # health_check_url: "/"  # Optional, defaults to provider-specific path

logging:
  level: "info"
  format: "json"
```
### 2. Start Olla
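How you launch Olla depends on your installation; a minimal sketch, assuming the `olla` binary is in the current directory and reads `config.yaml` from the working directory:

```bash
# Assumes Olla picks up ./config.yaml by default; pass your config
# explicitly if your build expects a flag or environment variable.
./olla
```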
You should see output similar to:
{"level":"info","msg":"Starting Olla proxy server","port":40114}
{"level":"info","msg":"Health check passed","endpoint":"local-ollama"}
{"level":"info","msg":"Server ready","endpoints":1}
### 3. Test the Proxy

Check that Olla is running:
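A simple check is the built-in health endpoint (the same one used in the Monitoring section below):

```bash
# Returns Olla's health status
curl http://localhost:40114/internal/health
```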
List available models through the proxy:

```bash
# For Ollama endpoints
curl http://localhost:40114/olla/ollama/api/tags

# OpenAI-compatible model listing via the same Ollama endpoint
curl http://localhost:40114/olla/ollama/v1/models
```
## Example Requests

### Chat Completion (OpenAI-compatible)

```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
### Ollama Generate

```bash
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?"
  }'
```
### Streaming Response

```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```
## Multiple Endpoints Configuration

Configure multiple LLM endpoints with load balancing:

```yaml
discovery:
  type: "static"
  static:
    endpoints:
      # High priority local Ollama
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

      # Medium priority LM Studio
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 50

      # Low priority remote endpoint
      - url: "https://api.example.com"
        name: "remote-api"
        type: "openai"
        priority: 10
```
## Monitoring

Monitor Olla's performance:

```bash
# Health status
curl http://localhost:40114/internal/health

# System status and statistics
curl http://localhost:40114/internal/status
```

Response headers provide request tracing. Look for these headers:

- `X-Olla-Endpoint`: Which backend handled the request
- `X-Olla-Backend-Type`: Type of backend (ollama/openai/lmstudio)
- `X-Olla-Request-ID`: Unique request identifier
- `X-Olla-Response-Time`: Total processing time
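To inspect these headers on a proxied request, include response headers in the curl output:

```bash
# -i prints response headers alongside the body
curl -i http://localhost:40114/olla/ollama/api/tags
```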
## Common Configuration Options

### High-Performance Setup

For production environments, use the Olla engine:

```yaml
proxy:
  engine: "olla"  # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: 30s
  max_retries: 3
```
### Rate Limiting

Protect your endpoints with rate limiting:

```yaml
server:
  rate_limits:
    global_requests_per_minute: 1000
    per_ip_requests_per_minute: 100
    burst_size: 50
```
### Request Size Limits

Set appropriate request limits:
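A minimal sketch under the `server` section; the key names and values below are assumptions for illustration, so confirm the exact option names in the Configuration Overview:

```yaml
server:
  request_limits:
    max_body_size: 50MB     # assumed key name -- verify against the configuration reference
    max_header_size: 512KB  # assumed key name -- verify against the configuration reference
```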
## Learn More

### Core Concepts
- Proxy Engines - Compare Sherpa vs Olla engines
- Load Balancing - Priority, round-robin, and least-connections strategies
- Model Unification - How models are aggregated across endpoints
- Health Checking - Automatic endpoint monitoring
- Profile System - Customise backend behaviour

### Configuration
- Configuration Overview - Complete configuration guide
- Proxy Profiles - Auto, streaming, and standard profiles
- Best Practices - Production recommendations

## Next Steps
- Backend Integrations - Connect Ollama, LM Studio, vLLM
- Architecture Overview - Deep dive into Olla's design
- Development Guide - Contribute to Olla

## Troubleshooting

### Endpoint Not Responding

Check your endpoint URLs and ensure the services are running:
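For example, query each backend directly (bypassing Olla), using the ports from the configuration above; the paths match the health-check URLs listed in the next subsection:

```bash
# Ollama (default port 11434)
curl http://localhost:11434/api/version

# LM Studio (default port 1234, OpenAI-compatible model listing)
curl http://localhost:1234/v1/models
```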
### Health Checks Failing

Verify that the health check URL is correct for your endpoint type:

- Ollama: use `/` or `/api/version`
- LM Studio: use `/` or `/v1/models`
- OpenAI-compatible: use `/v1/models`
### High Latency
Consider switching to the high-performance Olla engine:
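The change mirrors the High-Performance Setup shown above:

```yaml
proxy:
  engine: "olla"  # High-performance engine
  load_balancer: "least-connections"
```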
For more detailed troubleshooting, check the logs and open an issue if needed.