Frequently Asked Questions¶
General Questions¶
What is Olla?¶
Olla is a high-performance proxy and load balancer specifically designed for LLM infrastructure. It intelligently routes requests across multiple LLM backends (Ollama, LM Studio, OpenAI-compatible endpoints) while providing load balancing, health checking, and unified model management.
Why use Olla instead of connecting directly to backends?¶
Olla provides several benefits:
- High availability: Automatic failover between multiple backends
- Load balancing: Distribute requests across multiple GPUs/nodes
- Unified interface: Single endpoint for all your LLM services
- Health monitoring: Automatic detection and recovery from failures
- Performance optimisation: Connection pooling and streaming optimisation
Which proxy engine should I use?¶
- Sherpa (default): Use for development, testing, or moderate traffic (< 100 concurrent users)
- Olla: Use for production, high traffic, or when you need optimal streaming performance
See Proxy Engines for a detailed comparison.
Configuration¶
How do I configure multiple backends?¶
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
      - url: "http://192.168.1.50:11434"
        name: "remote-ollama"
        type: "ollama"
        priority: 80
      - url: "http://lmstudio.local:1234"
        name: "lmstudio"
        type: "lm-studio"
        priority: 60
```
Higher priority endpoints are preferred when available.
What is stream_buffer_size and how should I tune it?¶
stream_buffer_size controls how data is chunked during streaming. It's a crucial performance parameter:
- Small buffers (2-4KB): Lower latency, faster first token for chat
- Medium buffers (8KB): Balanced performance for general use
- Large buffers (16-64KB): Higher throughput for batch processing
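For example, a balanced 8KB buffer is set under the proxy section (the same key used in the other examples in this FAQ):

```yaml
proxy:
  stream_buffer_size: 8192  # 8KB balanced buffer for general use
```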
See Stream Buffer Size for a detailed tuning guide.
Can I use environment variables for configuration?¶
Yes, most settings can be overridden with environment variables. However, some settings, such as proxy.profile, must be set in the YAML configuration file.
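A minimal sketch of the override style; the variable names below are hypothetical, so check the Configuration Reference for the actual mapping:

```bash
# Hypothetical variable names -- verify against the Configuration Reference
OLLA_SERVER_HOST=0.0.0.0 OLLA_LOGGING_LEVEL=debug ./olla
```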
Troubleshooting¶
Streaming responses arrive all at once¶
This usually means write_timeout is not set to 0:
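```yaml
server:
  write_timeout: 0s  # 0 = no write timeout, required for streaming responses
```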
Also ensure your client supports streaming. For curl, use the -N flag.
Circuit breaker keeps opening¶
The circuit breaker opens after 3 consecutive failures. Common causes:
- Backend is actually down: Check if the backend is running
- Network issues: Verify connectivity to the backend
- Timeout too short: Increase check_timeout in the endpoint configuration (see the sketch below)
- Backend overloaded: The backend might be too slow to respond
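A minimal sketch of raising check_timeout on a single endpoint; the 10s value is illustrative, so adjust it to how slowly your backend responds under load:

```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        check_timeout: 10s  # illustrative value; give slow backends more time to answer health checks
```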
High memory usage¶
Try these optimisations:
- Use the Sherpa engine instead of Olla (lower memory footprint)
- Reduce stream_buffer_size
- Lower request size limits
- Reduce model registry cache time
```yaml
proxy:
  engine: "sherpa"
  stream_buffer_size: 4096  # Smaller buffer

server:
  request_limits:
    max_body_size: 5242880  # 5MB instead of the default 50MB
```
Models not appearing¶
If models aren't being discovered:
- Check model discovery is enabled
- Verify endpoints are healthy
- Check the backend APIs directly (see the sketches below)
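A rough walk-through of the three checks; the backend URLs are the example endpoints from earlier in this FAQ, so substitute your own:

```yaml
# 1. Make sure model discovery is enabled
discovery:
  model_discovery:
    enabled: true
```

```bash
# 2. Ask Olla which endpoints are healthy and which models it has registered
curl http://localhost:40114/internal/status
curl http://localhost:40114/internal/status/models

# 3. Query the backends directly to confirm they expose models
curl http://localhost:11434/api/tags        # Ollama
curl http://lmstudio.local:1234/v1/models   # LM Studio (OpenAI-compatible)
```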
Performance¶
How many requests can Olla handle?¶
Performance depends on your configuration:
- Sherpa engine: ~1,000 req/s for simple requests
- Olla engine: ~10,000 req/s with connection pooling
- Actual LLM inference will be the bottleneck, not Olla
How do I optimise for low latency?¶
For minimal latency to first token:
```yaml
proxy:
  engine: "sherpa"
  profile: "streaming"
  stream_buffer_size: 2048  # 2KB for fastest response

server:
  write_timeout: 0s
```
How do I optimise for high throughput?¶
For maximum throughput:
```yaml
proxy:
  engine: "olla"
  profile: "auto"
  stream_buffer_size: 65536  # 64KB for batch processing

discovery:
  model_discovery:
    enabled: false  # Disable if not needed

server:
  request_logging: false  # Reduce overhead
```
Integration¶
Does Olla work with OpenAI SDK?¶
Yes, Olla provides OpenAI-compatible endpoints:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}]
)
```
Can I use Olla with LangChain?¶
Yes, configure LangChain to use Olla's endpoint:
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed",
    model="llama3.2"
)
```
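You can then call it like any other LangChain chat model, for example:

```python
response = llm.invoke("Hello")  # standard LangChain chat model invocation
print(response.content)
```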
Does Olla support embeddings?¶
Yes, Olla proxies embedding requests:
```bash
curl http://localhost:40114/olla/ollama/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":"Hello world"}'
```
Deployment¶
Can I run multiple Olla instances?¶
Yes, you can run multiple instances for high availability, typically behind an external load balancer or DNS round-robin.
How do I monitor Olla?¶
Olla provides several monitoring endpoints:
- /internal/health - Basic health check
- /internal/status - Detailed status and statistics
- /internal/status/models - Model registry information
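For example, a quick check against the default port used throughout this FAQ:

```bash
curl http://localhost:40114/internal/health   # basic liveness check
curl http://localhost:40114/internal/status   # detailed status and statistics
```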
What's the recommended production configuration?¶
```yaml
server:
  request_logging: false  # Reduce overhead

proxy:
  engine: "olla"  # High-performance engine
  profile: "auto"
  load_balancer: "least-connections"

logging:
  level: "warn"
  format: "json"

discovery:
  model_discovery:
    interval: 15m  # Less frequent discovery
```
Common Issues¶
"No healthy endpoints available"¶
This means all backends are failing health checks. Check:
- Backends are running
- URLs are correct in configuration
- Network connectivity
- Firewall rules
"Circuit breaker open"¶
The circuit breaker has tripped after multiple failures. It will automatically retry after 30 seconds. To manually reset, restart Olla.
Response headers missing¶
Olla adds several headers to responses:
- X-Olla-Endpoint: Which backend served the request
- X-Olla-Model: Model used
- X-Olla-Response-Time: Total processing time
If missing, check you're using the /olla/ prefix in your requests.
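To inspect the headers, send any request with curl -i; this example reuses the Ollama chat route shown earlier:

```bash
curl -i http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello"}]}'
```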
Connection refused errors¶
Common causes:
- Olla isn't running on the expected port
- Firewall blocking the port
- Binding to localhost vs 0.0.0.0
- Another service using the port
Check with standard tools, for example (assuming the default port 40114 used in the examples above):
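```bash
# Is anything listening on Olla's port?
ss -tlnp | grep 40114

# Can you reach the health endpoint locally?
curl http://localhost:40114/internal/health
```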
Best Practices¶
Should I use auto proxy profile?¶
Yes, the auto profile intelligently detects whether to stream or buffer based on content type. It's the recommended default for most workloads.
How often should health checks run?¶
Balance detection speed vs overhead:
- Production: 30-60 seconds
- Development: 10-30 seconds
- Critical systems: 5-10 seconds
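To apply these intervals in configuration, note that the exact key name isn't shown in this FAQ; the sketch below assumes a per-endpoint check_interval field alongside the check_timeout mentioned earlier, so treat the field names as hypothetical and confirm them against the configuration reference:

```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        check_interval: 30s  # hypothetical field name; 30-60s is the production guidance above
        check_timeout: 5s
```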
Should I enable request logging?¶
Only in development or when debugging. Request logging significantly impacts performance in production.
How many endpoints should I configure?¶
- Minimum: 2 for redundancy
- Typical: 3-5 endpoints
- Maximum: No hard limit, but more endpoints increase health check overhead
Getting Help¶
Where can I get support?¶
- Check this FAQ first
- Review the documentation
- Search GitHub Issues
- Create a new issue with details
How do I report a bug?¶
Create a GitHub issue with:
- Olla version (olla version)
- Configuration (sanitised)
- Steps to reproduce
- Expected vs actual behaviour
- Relevant logs
Can I contribute?¶
Yes! See the Contributing Guide for details on:
- Code standards
- Testing requirements
- Pull request process
- Development setup