# API Reference
Olla exposes several API endpoints for proxy operations, health monitoring, and system status. All endpoints follow RESTful conventions and return JSON responses unless otherwise specified.
## Base URL

All endpoints are served relative to your Olla instance's base URL, `http://localhost:40114` by default. If you ever need to remember the port, think: what's the port? 4 OLLA! (4-O-L-L-A reads as 40114.)
## API Sections
### System Endpoints

Internal endpoints for health monitoring and system status.

- `/internal/health` - Health check endpoint
- `/internal/status` - System status and statistics
- `/internal/process` - Process information
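A quick probe from Python, as a minimal sketch using the `requests` library (the status payload's field names vary by Olla version, so it is printed as-is):

```python
import requests

OLLA = "http://localhost:40114"  # default base URL; adjust for your deployment

# Liveness probe: a 200 response means the proxy itself is up.
health = requests.get(f"{OLLA}/internal/health", timeout=5)
print("health:", health.status_code)

# System status and statistics, returned as JSON.
status = requests.get(f"{OLLA}/internal/status", timeout=5)
print(status.json())
```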
### Unified Models API

Cross-provider model discovery and information.

- `/olla/models` - List all available models across providers
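To pull the unified catalogue from Python, a minimal sketch (the response schema depends on your Olla version, so the JSON is printed unmodified):

```python
import requests

OLLA = "http://localhost:40114"

# List every model Olla has discovered across all configured providers.
resp = requests.get(f"{OLLA}/olla/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```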
### Ollama API

Proxy endpoints for Ollama instances.

- `/olla/ollama/*` - All Ollama API endpoints
- OpenAI-compatible endpoints included
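Native Ollama calls pass through the prefix unchanged. A minimal sketch of a non-streaming generate request, assuming one of your Ollama backends serves a model named `llama3.1` (a placeholder):

```python
import requests

OLLA = "http://localhost:40114"

# Ollama's native /api/generate, reached through the proxy prefix.
resp = requests.post(
    f"{OLLA}/olla/ollama/api/generate",
    json={
        "model": "llama3.1",  # placeholder; use a model your backends serve
        "prompt": "Why is the sky blue?",
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # Ollama returns the text in "response"
```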
### LM Studio API

Proxy endpoints for LM Studio servers.

- `/olla/lmstudio/*` - All LM Studio API endpoints
- `/olla/lm-studio/*` - Alternative prefix
- `/olla/lm_studio/*` - Alternative prefix
### OpenAI API

Proxy endpoints for OpenAI-compatible services.

- `/olla/openai/*` - OpenAI API endpoints
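Because the prefix fronts an OpenAI-compatible surface, the official `openai` Python SDK can be pointed straight at Olla. A minimal sketch, assuming the standard `/v1` path layout behind the prefix; the model name is a placeholder and the API key is not checked by Olla itself:

```python
from openai import OpenAI

# Point the official OpenAI SDK at Olla's OpenAI-compatible prefix.
client = OpenAI(
    base_url="http://localhost:40114/olla/openai/v1",
    api_key="not-needed",  # Olla does not validate this; a backend might
)

completion = client.chat.completions.create(
    model="llama3.1",  # placeholder; use a model your backends serve
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```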
### vLLM API

Proxy endpoints for vLLM servers.

- `/olla/vllm/*` - vLLM API endpoints
### SGLang API

Proxy endpoints for SGLang servers with RadixAttention and Frontend Language support.

- `/olla/sglang/*` - SGLang API endpoints
- Includes vision model support and speculative decoding
### LiteLLM API

Proxy endpoints for LiteLLM gateway (100+ providers).

- `/olla/litellm/*` - LiteLLM API endpoints
### llama.cpp API

Proxy endpoints for llama.cpp servers.

- `/olla/llamacpp/*` - llama.cpp API endpoints
- OpenAI-compatible endpoints plus native llama.cpp features
- Includes slot monitoring, code infill, and tokenisation
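A minimal sketch of one native route, assuming llama.cpp's standard `/tokenize` endpoint is reachable behind the prefix:

```python
import requests

OLLA = "http://localhost:40114"

# llama.cpp's native /tokenize returns {"tokens": [...]} for the text.
resp = requests.post(
    f"{OLLA}/olla/llamacpp/tokenize",
    json={"content": "Hello, world!"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["tokens"])
```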
### Lemonade SDK API

Proxy endpoints for Lemonade SDK servers with AMD Ryzen AI support.

- `/olla/lemonade/*` - Lemonade SDK API endpoints
- Includes ONNX and GGUF model support with hardware acceleration
## Translated APIs

APIs that translate between different formats in real time.
### Anthropic Messages API

Anthropic-compatible API endpoints for Claude clients.

Endpoints:

- `POST /olla/anthropic/v1/messages` - Create a message (chat)
- `GET /olla/anthropic/v1/models` - List available models

Features:

- Full Anthropic Messages API v1 support
- Automatic translation to OpenAI format
- Streaming with Server-Sent Events
- Tool use (function calling)
- Vision support (multi-modal)

Use with:

- Claude Code
- OpenCode
- Crush CLI
- Any Anthropic API client
See API Translation for how translation works.
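A minimal sketch using the official `anthropic` Python SDK pointed at the translation prefix (the SDK appends `/v1/messages` itself); the model name is a placeholder and the API key is not checked by Olla:

```python
from anthropic import Anthropic

# The Anthropic SDK talks to Olla, which translates the request to
# OpenAI format for the backend behind the scenes.
client = Anthropic(
    base_url="http://localhost:40114/olla/anthropic",
    api_key="not-needed",  # Olla does not validate this
)

message = client.messages.create(
    model="llama3.1",  # placeholder; use a model your backends serve
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello through the proxy!"}],
)
print(message.content[0].text)
```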
## Authentication
Currently, Olla does not implement authentication at the proxy level. Authentication should be handled by:
- Backend services (Ollama, LM Studio, etc.)
- Network-level security (firewalls, VPNs)
- Reverse proxy authentication (nginx, Traefik)
## Rate Limiting
Global and per-IP rate limits are enforced:
| Limit Type | Default Value |
|---|---|
| Global requests/minute | 1000 |
| Per-IP requests/minute | 100 |
| Health endpoint requests/minute | 1000 |
| Burst size | 50 |
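Once a limit is hit, requests receive `429 Rate Limit Exceeded`. A minimal client-side sketch that backs off and retries; whether Olla sends a `Retry-After` header is an assumption here, so the code falls back to exponential delays:

```python
import time
import requests

OLLA = "http://localhost:40114"

def get_with_backoff(path: str, retries: int = 3) -> requests.Response:
    """Retry on 429, honouring Retry-After when present."""
    for attempt in range(retries):
        resp = requests.get(f"{OLLA}{path}", timeout=10)
        if resp.status_code != 429:
            return resp
        # Assumed: Retry-After may be absent; fall back to 1s, 2s, 4s...
        time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt)))
    return resp

print(get_with_backoff("/olla/models").status_code)
```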
## Request Headers

### Required Headers

- `Content-Type: application/json` for POST requests

### Optional Headers

- `X-Request-ID` - Custom request ID for tracing
## Response Headers

All responses include:

| Header | Description |
|---|---|
| `X-Olla-Request-ID` | Unique request identifier |
| `X-Olla-Endpoint` | Backend endpoint name |
| `X-Olla-Model` | Model used (if applicable) |
| `X-Olla-Backend-Type` | Provider type, e.g. `ollama`, `lmstudio`, `llamacpp`, `openai`, `vllm`, `sglang`, `lemonade`, `litellm` |
| `X-Olla-Response-Time` | Total processing time |
| `X-Olla-Routing-Strategy` | Routing strategy used (when model routing is active) |
| `X-Olla-Routing-Decision` | Routing decision made (`routed`/`fallback`/`rejected`) |
| `X-Olla-Routing-Reason` | Human-readable reason for routing decision |
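A minimal sketch that sets the optional `X-Request-ID` request header and reads the tracing headers back:

```python
import requests

OLLA = "http://localhost:40114"

resp = requests.get(
    f"{OLLA}/olla/models",
    headers={"X-Request-ID": "doc-example-001"},  # arbitrary trace ID
    timeout=10,
)
for name in ("X-Olla-Request-ID", "X-Olla-Endpoint",
             "X-Olla-Backend-Type", "X-Olla-Response-Time"):
    print(f"{name}: {resp.headers.get(name)}")
```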
## Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

| Metric | Description | Providers |
|---|---|---|
| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
| `provider_prompt_tokens` | Tokens in prompt (count) | All |
| `provider_completion_tokens` | Tokens generated (count) | All |
| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
| `provider_model` | Actual model used | All |
See Provider Metrics for detailed information.
## Error Responses
Standard HTTP status codes are used:
| Status Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad Request |
| 404 | Not Found |
| 429 | Rate Limit Exceeded |
| 500 | Internal Server Error |
| 502 | Bad Gateway |
| 503 | Service Unavailable |
### Error Response Format
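Error bodies are returned as JSON, but the exact field names vary by endpoint and backend, so clients should handle non-JSON bodies defensively. A minimal sketch:

```python
import requests

OLLA = "http://localhost:40114"

resp = requests.get(f"{OLLA}/olla/ollama/api/tags", timeout=10)
if not resp.ok:
    # Assumed: error bodies are JSON with a human-readable message;
    # fall back to the raw text if the body fails to parse.
    try:
        print(resp.status_code, resp.json())
    except ValueError:
        print(resp.status_code, resp.text)
```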
## Streaming Responses

For streaming endpoints (chat completions, text generation), responses use:

- `Content-Type: text/event-stream` for SSE streams
- `Transfer-Encoding: chunked` for HTTP streaming
- Line-delimited JSON for data chunks
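A minimal sketch of consuming such a stream through the OpenAI-compatible prefix; the `data:` framing and `[DONE]` sentinel follow the OpenAI SSE convention, which the backend is assumed to emit, and the model name is a placeholder:

```python
import requests

OLLA = "http://localhost:40114"

with requests.post(
    f"{OLLA}/olla/openai/v1/chat/completions",
    json={
        "model": "llama3.1",  # placeholder; use a model your backends serve
        "messages": [{"role": "user", "content": "Count to five."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    for line in resp.iter_lines():
        # Each SSE event arrives as a "data: {...}" line.
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            print(line.decode()[len("data: "):])
```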
## CORS Support

CORS headers are included for browser-based clients:

- `Access-Control-Allow-Origin: *`
- `Access-Control-Allow-Methods: GET, POST, OPTIONS`
- `Access-Control-Allow-Headers: Content-Type, Authorization`