Frequently Asked Questions¶
General Questions¶
What is Olla?¶
Olla is a high-performance proxy and load balancer specifically designed for LLM infrastructure. It intelligently routes requests across multiple LLM backends (Ollama, LM Studio, OpenAI-compatible endpoints) while providing load balancing, health checking, and unified model management.
Why use Olla instead of connecting directly to backends?¶
Olla provides several benefits:
- High availability: Automatic failover between multiple backends
- Load balancing: Distribute requests across multiple GPUs/nodes
- Unified interface: Single endpoint for all your LLM services
- Health monitoring: Automatic detection and recovery from failures
- Performance optimisation: Connection pooling and streaming optimisation
Which proxy engine should I use?¶
- Sherpa (default): Use for development, testing, or moderate traffic (< 100 concurrent users)
- Olla: Use for production, high traffic, or when you need optimal streaming performance
See Proxy Engines for a detailed comparison.
Configuration¶
How do I configure multiple backends?¶
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
      - url: "http://192.168.1.50:11434"
        name: "remote-ollama"
        type: "ollama"
        priority: 80
      - url: "http://lmstudio.local:1234"
        name: "lmstudio"
        type: "lm-studio"
        priority: 60
```
Higher priority endpoints are preferred when available.
What is stream_buffer_size and how should I tune it?¶
stream_buffer_size controls how data is chunked during streaming. It's a crucial performance parameter:
- Small buffers (2-4KB): Lower latency, faster first token for chat
- Medium buffers (8KB): Balanced performance for general use
- Large buffers (16-64KB): Higher throughput for batch processing
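For example, a balanced 8KB buffer is set under the proxy section (the same key used in the other examples in this FAQ):

```yaml
proxy:
  stream_buffer_size: 8192  # 8KB balanced buffer for general use
```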
See Stream Buffer Size for a detailed tuning guide.
Can I use environment variables for configuration?¶
Yes, most settings can be overridden with environment variables. However, some settings, such as proxy.profile, must be set in the YAML configuration file.
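A minimal sketch of the override style; the variable names below are hypothetical, so check the Configuration Reference for the actual mapping:

```bash
# Hypothetical variable names -- verify against the Configuration Reference
OLLA_SERVER_HOST=0.0.0.0 OLLA_LOGGING_LEVEL=debug ./olla
```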
Troubleshooting¶
Streaming responses arrive all at once¶
This usually means write_timeout is not set to 0:
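```yaml
server:
  write_timeout: 0s  # 0 = no write timeout, required for streaming responses
```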
Also ensure your client supports streaming. For curl, use the -N flag.
Circuit breaker keeps opening¶
The circuit breaker opens after 3 consecutive failures. Common causes:
- Backend is actually down: Check if the backend is running
- Network issues: Verify connectivity to the backend
- Timeout too short: Increase check_timeout in the endpoint configuration (see the sketch below)
- Backend overloaded: The backend might be too slow to respond
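A minimal sketch of raising check_timeout on a single endpoint; the 10s value is illustrative, so adjust it to how slowly your backend responds under load:

```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        check_timeout: 10s  # illustrative value; give slow backends more time to answer health checks
```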
High memory usage¶
Try these optimisations:
- Use the Sherpa engine instead of Olla (lower memory footprint)
- Reduce stream_buffer_size
- Lower request size limits
- Reduce model registry cache time
```yaml
proxy:
  engine: "sherpa"
  stream_buffer_size: 4096  # Smaller buffer

server:
  request_limits:
    max_body_size: 5242880  # 5MB instead of the default 50MB
```
Models not appearing¶
If models aren't being discovered:
- Check model discovery is enabled
- Verify endpoints are healthy
- Check the backend APIs directly (see the sketches below)
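A rough walk-through of the three checks; the backend URLs are the example endpoints from earlier in this FAQ, so substitute your own:

```yaml
# 1. Make sure model discovery is enabled
discovery:
  model_discovery:
    enabled: true
```

```bash
# 2. Ask Olla which endpoints are healthy and which models it has registered
curl http://localhost:40114/internal/status
curl http://localhost:40114/internal/status/models

# 3. Query the backends directly to confirm they expose models
curl http://localhost:11434/api/tags        # Ollama
curl http://lmstudio.local:1234/v1/models   # LM Studio (OpenAI-compatible)
```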
Performance¶
How many requests can Olla handle?¶
Performance depends on your configuration:
- Sherpa engine: ~1,000 req/s for simple requests
- Olla engine: ~10,000 req/s with connection pooling
- Actual LLM inference will be the bottleneck, not Olla
How do I optimise for low latency?¶
For minimal latency to first token:
```yaml
proxy:
  engine: "sherpa"
  profile: "streaming"
  stream_buffer_size: 2048  # 2KB for fastest response

server:
  write_timeout: 0s
```
How do I optimise for high throughput?¶
For maximum throughput:
```yaml
proxy:
  engine: "olla"
  profile: "auto"
  stream_buffer_size: 65536  # 64KB for batch processing

discovery:
  model_discovery:
    enabled: false  # Disable if not needed

server:
  request_logging: false  # Reduce overhead
```
Integration¶
Does Olla work with OpenAI SDK?¶
Yes, Olla provides OpenAI-compatible endpoints:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}]
)
```
Can I use Olla with LangChain?¶
Yes, configure LangChain to use Olla's endpoint:
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed",
    model="llama3.2"
)
```
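You can then call it like any other LangChain chat model, for example:

```python
response = llm.invoke("Hello")  # standard LangChain chat model invocation
print(response.content)
```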
Does Olla support embeddings?¶
Yes, Olla proxies embedding requests:
```bash
curl http://localhost:40114/olla/ollama/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":"Hello world"}'
```
Deployment¶
Can I run multiple Olla instances?¶
Yes, you can run multiple instances for high availability, typically behind an external load balancer or DNS round-robin.
How do I monitor Olla?¶
Olla provides several monitoring endpoints:
- /internal/health - Basic health check
- /internal/status - Detailed status and statistics
- /internal/status/models - Model registry information
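For example, a quick check against the default port used throughout this FAQ:

```bash
curl http://localhost:40114/internal/health   # basic liveness check
curl http://localhost:40114/internal/status   # detailed status and statistics
```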
What's the recommended production configuration?¶
```yaml
server:
  request_logging: false  # Reduce overhead

proxy:
  engine: "olla"  # High-performance engine
  profile: "auto"
  load_balancer: "least-connections"

logging:
  level: "warn"
  format: "json"

discovery:
  model_discovery:
    interval: 15m  # Less frequent discovery
```
Common Issues¶
"No healthy endpoints available"¶
This means all backends are failing health checks. Check:
- Backends are running
- URLs are correct in configuration
- Network connectivity
- Firewall rules
"Circuit breaker open"¶
The circuit breaker has tripped after multiple failures. It will automatically retry after 30 seconds. To manually reset, restart Olla.
Response headers missing¶
Olla adds several headers to responses:
- X-Olla-Endpoint: Which backend served the request
- X-Olla-Model: Model used
- X-Olla-Response-Time: Total processing time
If missing, check you're using the /olla/ prefix in your requests.
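To inspect the headers, send any request with curl -i; this example reuses the Ollama chat route shown earlier:

```bash
curl -i http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello"}]}'
```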
Connection refused errors¶
Common causes:
- Olla isn't running on the expected port
- Firewall blocking the port
- Binding to localhost vs 0.0.0.0
- Another service using the port
Check with standard tools, for example (assuming the default port 40114 used in the examples above):
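```bash
# Is anything listening on Olla's port?
ss -tlnp | grep 40114

# Can you reach the health endpoint locally?
curl http://localhost:40114/internal/health
```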
Best Practices¶
Should I use auto proxy profile?¶
Yes, the auto profile intelligently detects whether to stream or buffer based on content type. It's the recommended default for most workloads.
How often should health checks run?¶
Balance detection speed vs overhead:
- Production: 30-60 seconds
- Development: 10-30 seconds
- Critical systems: 5-10 seconds
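To apply these intervals in configuration, note that the exact key name isn't shown in this FAQ; the sketch below assumes a per-endpoint check_interval field alongside the check_timeout mentioned earlier, so treat the field names as hypothetical and confirm them against the configuration reference:

```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        check_interval: 30s  # hypothetical field name; 30-60s is the production guidance above
        check_timeout: 5s
```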
Should I enable request logging?¶
Only in development or when debugging. Request logging significantly impacts performance in production.
How many endpoints should I configure?¶
- Minimum: 2 for redundancy
- Typical: 3-5 endpoints
- Maximum: No hard limit, but more endpoints increase health check overhead
Getting Help¶
Where can I get support?¶
- Check this FAQ first
- Review the documentation
- Search GitHub Issues
- Create a new issue with details
How do I report a bug?¶
Create a GitHub issue with:
- Olla version (olla version)
- Configuration (sanitised)
- Steps to reproduce
- Expected vs actual behaviour
- Relevant logs
Can I contribute?¶
Yes! See the Contributing Guide for details on:
- Code standards
- Testing requirements
- Pull request process
- Development setup