OpenWebUI Integration with OpenAI¶
OpenWebUI can speak to any OpenAI‑compatible endpoint. Olla sits in front as a smart proxy, exposing a single OpenAI API base that merges multiple backends (e.g. vLLM, SGLang) and handles load‑balancing + failover.
Set in OpenWebUI:
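For example, where `<olla-host>` is wherever Olla is reachable from OpenWebUI (it is the `olla` compose service name in the example below):

```
OPENAI_API_BASE_URL=http://<olla-host>:40114/olla/openai/v1
```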
What you get via Olla
- One stable OpenAI base URL for all backends
- Priority/least‑connections load‑balancing and health checks
- Streaming passthrough
- Unified `/v1/models` across providers
Overview¶
| Project | [github.com/open-webui/open-webui](https://github.com/open-webui/open-webui) |
|---|---|
| Integration Type | Frontend UI |
| Connection Method | OpenAI API compatibility |
| Features Supported (via Olla) | Load balancing, failover, health checks, streaming, unified model list |
| Configuration | Set `OPENAI_API_BASE_URL` to Olla's OpenAI endpoint |
| Example | See `examples/ollama-openwebui` for a full example (built around Ollama); remember to switch it to `OPENAI_API_BASE_URL`. |
Architecture¶
┌─────────────┐      ┌───────── Olla (40114) ──────┐      ┌─────────────────────┐
│  OpenWebUI  │ ───▶ │  /olla/openai/v1 (proxy)    │ ───▶ │ vLLM :8000 (/v1/*)  │
│   (3000)    │      │  • LB + failover            │      └─────────────────────┘
└─────────────┘      │  • health checks            │      ┌─────────────────────┐
                     │  • model unification (/v1)  │ ───▶ │ SGLang :30000 (/v1) │
                     └─────────────────────────────┘      └─────────────────────┘
Quick Start (Docker Compose)¶
Create `compose.yaml`:
services:
olla:
image: ghcr.io/thushan/olla:latest
container_name: olla
restart: unless-stopped
ports:
- "40114:40114"
volumes:
- ./olla.yaml:/app/config.yaml:ro
- ./logs:/app/logs
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:40114/internal/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
openwebui:
image: ghcr.io/open-webui/open-webui:main
container_name: openwebui
restart: unless-stopped
ports:
- "3000:8080"
volumes:
- openwebui_data:/app/backend/data
environment:
- OPENAI_API_BASE_URL=http://olla:40114/olla/openai/v1
- WEBUI_NAME=Olla + OpenWebUI
- WEBUI_URL=http://localhost:3000
depends_on:
olla:
condition: service_healthy
volumes:
openwebui_data:
driver: local
Create `olla.yaml` (static discovery example):
server:
host: 0.0.0.0
port: 40114
proxy:
engine: sherpa # or: olla (lower overhead), test both
load_balancer: priority # or: least-connections
# Service discovery of OpenAI-compatible backends
# (Each backend must expose /v1/*; Olla will translate as needed.)
discovery:
type: static
static:
endpoints:
- url: http://192.168.1.100:8000
name: gpu-vllm
type: vllm
priority: 100
- url: http://192.168.1.101:30000
name: gpu-sglang
type: sglang
priority: 50
# Optional timeouts & streaming profile
# proxy:
# response_timeout: 1800s
# read_timeout: 600s
# profile: streaming
Bring it up:
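From the directory containing `compose.yaml` and `olla.yaml` (assuming Docker Compose v2):

```bash
docker compose up -d
```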
OpenWebUI will be available at http://localhost:3000
Verifying via cURL¶
List unified models:
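For example, assuming Olla is listening on localhost:40114 as configured above:

```bash
curl -s http://localhost:40114/olla/openai/v1/models | jq
```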
Simple completion (non‑streaming):
curl -s http://localhost:40114/olla/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-oss-120b",
"messages": [{"role":"user","content":"Hello from Olla"}]
}' | jq
Streaming (SSE):
curl -N http://localhost:40114/olla/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-oss-120b",
"stream": true,
"messages": [{"role":"user","content":"Stream test"}]
}'
Inspect Olla headers (which backend served the call):
curl -s -D - -o /dev/null http://localhost:40114/olla/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-120b","messages":[{"role":"user","content":"ping"}]}'
# Look for: X-Olla-Endpoint, X-Olla-Backend-Type, X-Olla-Response-Time
OpenWebUI Configuration Notes¶
- Env var: `OPENAI_API_BASE_URL` must point to Olla's `/olla/openai/v1`.
- Model picker: OpenWebUI's model list is sourced from `/v1/models` (via Olla). If it's empty, see Troubleshooting.
- API keys: if OpenWebUI prompts for an OpenAI key but your backends don't require one, leave it blank.
Multiple Backends (vLLM, SGLang, LM Studio)¶
Add as many OpenAI‑compatible servers as you like. Priorities control routing.
discovery:
static:
endpoints:
- url: http://vllm-a:8000
name: vllm-a
type: vllm
priority: 100
- url: http://sglang-b:30000
name: sglang-b
type: sglang
priority: 80
- url: http://lmstudio-c:1234
name: lmstudio-c
type: openai-compatible # generic OpenAI-compatible server
priority: 60
Tip: Use `least-connections` when all nodes are similar; use `priority` to prefer local/cheaper nodes.
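To confirm the merged catalogue once several backends are registered, you can list the model IDs Olla exposes (a quick check assuming Olla on localhost:40114 and the standard OpenAI response shape):

```bash
curl -s http://localhost:40114/olla/openai/v1/models | jq -r '.data[].id'
```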
Authentication (Front‑door keying via Nginx)¶
Olla doesn’t issue/validate API keys (yet). To expose Olla publicly, front it with Nginx to enforce simple static keys.
`/etc/nginx/conf.d/olla.conf`:
map $http_authorization $api_key_valid {
    default 0;
    "~*Bearer (sk-thushan-XXXXXXXX|sk-yolo-YYYYYYYY)" 1;
}
server {
    listen 80;
    server_name ai.example.com;

    location /api/ {
        if ($api_key_valid = 0) { return 401; }

        # Trailing slash strips the /api/ prefix, so Olla receives /olla/... paths
        proxy_pass http://localhost:40114/;
        proxy_set_header Host $host;
        proxy_http_version 1.1;
        proxy_buffering off;  # keep SSE streaming responsive
    }
}
Then point external users to `http://ai.example.com/api/olla/openai/v1` and give them a matching `Authorization: Bearer ...` header.
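An external client call through this front door might then look like the following (the key is one of the placeholder values from the `map` above):

```bash
curl -s http://ai.example.com/api/olla/openai/v1/chat/completions \
  -H 'Authorization: Bearer sk-thushan-XXXXXXXX' \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-120b","messages":[{"role":"user","content":"Hello via the front door"}]}'
```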
For more robust auth (rate limits, per‑key quotas, logs), put an API gateway (Traefik/Envoy/Kong) ahead of Olla.
Monitoring & Health¶
Keep an eye on Olla's health endpoint, per-endpoint status, the unified model list, and the container logs — see the commands below.
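A sketch of these checks, assuming the Docker Compose setup from this guide (adjust hosts and container names if yours differ):

```bash
# Olla health
curl -s http://localhost:40114/internal/health

# Endpoint status (which backends are up)
curl -s http://localhost:40114/internal/status/endpoints | jq

# Unified models across all backends
curl -s http://localhost:40114/olla/openai/v1/models | jq

# Logs
docker logs -f olla
```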
Troubleshooting¶
Models not appearing in OpenWebUI
- Is Olla up?
- Are the backends discovered and healthy?
- Are models resolvable through Olla?
- Does OpenWebUI point at the correct base URL?

The Monitoring & Health commands above cover the first three; a container-side check for the last is sketched below.
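Assuming the compose setup above (OpenWebUI container named `openwebui`, Olla reachable as `olla` on the compose network):

```bash
# Confirm the base URL OpenWebUI was started with, and that Olla is reachable from inside the container
docker exec openwebui printenv OPENAI_API_BASE_URL
docker exec openwebui curl -sS http://olla:40114/olla/openai/v1/models
```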
Connection refused from OpenWebUI → Olla
- Verify compose service names and ports
- From the container: `docker exec openwebui curl -sS http://olla:40114/internal/health`
Slow responses
- Switch to `proxy.engine: olla` or `profile: streaming`
- Use `least-connections` for fairer distribution
- Increase `proxy.response_timeout` for long generations
Docker networking (Linux)
- To hit host services: `http://172.17.0.1:<port>`
- Remote nodes: use actual LAN IPs
Standalone (no compose)¶
Run Olla locally:
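For example, with Docker and the `olla.yaml` from above in the current directory:

```bash
docker run -d --name olla \
  -p 40114:40114 \
  -v "$(pwd)/olla.yaml:/app/config.yaml:ro" \
  ghcr.io/thushan/olla:latest
```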
Run OpenWebUI:
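And OpenWebUI pointed at it — replace `<olla-host>` with an address the container can reach (on Linux, the Docker bridge address `172.17.0.1` noted above works for services running on the host):

```bash
docker run -d --name openwebui \
  -p 3000:8080 \
  -v openwebui_data:/app/backend/data \
  -e OPENAI_API_BASE_URL=http://<olla-host>:40114/olla/openai/v1 \
  ghcr.io/open-webui/open-webui:main
```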