API Reference

Olla exposes several API endpoints for proxy operations, health monitoring, and system status. All endpoints follow RESTful conventions and return JSON responses unless otherwise specified.

Base URL

http://localhost:40114

If you ever need to remember the port, think: what's the port? 4 OLLA!

API Sections

System Endpoints

Internal endpoints for health monitoring and system status.

  • /internal/health - Health check endpoint
  • /internal/status - System status and statistics
  • /internal/process - Process information
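
A quick way to verify the proxy is running is to poll the health endpoint. The sketch below assumes Python with the requests library; the exact shape of the health payload is not documented here, so treat the parsed body as illustrative.

import requests

# Probe Olla's health endpoint; a 200 status indicates the proxy is up.
resp = requests.get("http://localhost:40114/internal/health", timeout=5)
print(resp.status_code)   # expect 200 when healthy
print(resp.json())        # health payload (shape may vary)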

Unified Models API

Cross-provider model discovery and information.

  • /olla/models - List all available models across providers
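
For example, a client can enumerate every model Olla knows about in a single call. This is a minimal sketch; the exact JSON envelope (whether models sit under a top-level "data" key, for instance) is an assumption borrowed from OpenAI-style listings.

import requests

# List models aggregated across all configured providers.
resp = requests.get("http://localhost:40114/olla/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):   # "data" envelope is an assumption
    print(model)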

Ollama API

Proxy endpoints for Ollama instances.

  • /olla/ollama/* - All Ollama API endpoints
  • OpenAI-compatible endpoints included
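
Because Olla forwards the native Ollama surface, the familiar paths work unchanged behind the /olla/ollama prefix. A hedged sketch of a non-streaming generate call (the model name is illustrative and must exist on a backend):

import requests

# Ollama's native /api/generate, proxied through Olla.
resp = requests.post(
    "http://localhost:40114/olla/ollama/api/generate",
    json={
        "model": "llama3",   # illustrative; use a model your backend serves
        "prompt": "Hello!",
        "stream": False,     # request a single JSON response, not a stream
    },
    timeout=120,
)
print(resp.json()["response"])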

LM Studio API

Proxy endpoints for LM Studio servers.

  • /olla/lmstudio/* - All LM Studio API endpoints
  • /olla/lm-studio/* - Alternative prefix
  • /olla/lm_studio/* - Alternative prefix

OpenAI API

Proxy endpoints for OpenAI-compatible services.

  • /olla/openai/* - OpenAI API endpoints
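
Any OpenAI SDK or plain HTTP client can target these paths by pointing its base URL at Olla. A minimal sketch with plain requests (the model name is illustrative):

import requests

# Standard OpenAI chat-completions request via Olla's /olla/openai prefix.
resp = requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={
        "model": "llama3",   # illustrative backend model name
        "messages": [{"role": "user", "content": "Say hi"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])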

vLLM API

Proxy endpoints for vLLM servers.

  • /olla/vllm/* - vLLM API endpoints

SGLang API

Proxy endpoints for SGLang servers with RadixAttention and Frontend Language support.

  • /olla/sglang/* - SGLang API endpoints
  • Includes vision model support and speculative decoding

LiteLLM API

Proxy endpoints for LiteLLM gateway (100+ providers).

  • /olla/litellm/* - LiteLLM API endpoints

llama.cpp API

Proxy endpoints for llama.cpp servers.

  • /olla/llamacpp/* - llama.cpp API endpoints
  • OpenAI-compatible endpoints plus native llama.cpp features
  • Includes slot monitoring, code infill, and tokenisation
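
As a sketch of that native surface, the call below exercises llama.cpp's tokenize endpoint through the proxy; the request and response shapes are assumed to match the upstream llama.cpp server API.

import requests

# llama.cpp's native tokenize endpoint, proxied behind /olla/llamacpp.
resp = requests.post(
    "http://localhost:40114/olla/llamacpp/tokenize",
    json={"content": "Hello, world!"},
    timeout=30,
)
print(resp.json().get("tokens"))   # token IDs for the input text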

Lemonade SDK API

Proxy endpoints for Lemonade SDK servers with AMD Ryzen AI support.

  • /olla/lemonade/* - Lemonade SDK API endpoints
  • Includes ONNX and GGUF model support with hardware acceleration

Translated APIs

APIs that translate between different formats in real time.

Anthropic Messages API

Anthropic-compatible API endpoints for Claude clients.

Endpoints:

  • POST /olla/anthropic/v1/messages - Create a message (chat)
  • GET /olla/anthropic/v1/models - List available models

Features:

  • Full Anthropic Messages API v1 support
  • Automatic translation to OpenAI format
  • Streaming with Server-Sent Events
  • Tool use (function calling)
  • Vision support (multi-modal)

Use With:

  • Claude Code
  • OpenCode
  • Crush CLI
  • Any Anthropic API client
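
A minimal sketch of a Messages call, assuming Python with requests (the model name is illustrative; Olla translates the request to OpenAI format for the backend):

import requests

# Anthropic Messages API shape: model, max_tokens and messages are required.
resp = requests.post(
    "http://localhost:40114/olla/anthropic/v1/messages",
    json={
        "model": "llama3",   # illustrative; any model exposed by a backend
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120,
)
print(resp.json())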

See API Translation for how translation works.

Authentication

Currently, Olla does not implement authentication at the proxy level. Authentication should be handled by:

  • Backend services (Ollama, LM Studio, etc.)
  • Network-level security (firewalls, VPNs)
  • Reverse proxy authentication (nginx, Traefik)

Rate Limiting

Global and per-IP rate limits are enforced:

Limit Type                        Default Value
Global requests/minute            1000
Per-IP requests/minute            100
Health endpoint requests/minute   1000
Burst size                        50
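
Clients should expect 429 responses once a limit is exhausted. The retry sketch below is illustrative; whether Olla sets a Retry-After header on 429s is an assumption, so the code falls back to exponential backoff.

import time
import requests

def get_with_backoff(url: str, attempts: int = 5) -> requests.Response:
    """GET with retries on 429, honouring Retry-After when present."""
    for attempt in range(attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Retry-After is an assumption here; default to exponential backoff.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    resp.raise_for_status()
    return resp

resp = get_with_backoff("http://localhost:40114/olla/models")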

Request Headers

Required Headers

  • Content-Type: application/json for POST requests

Optional Headers

  • X-Request-ID - Custom request ID for tracing

Response Headers

All responses include:

Header                    Description
X-Olla-Request-ID         Unique request identifier
X-Olla-Endpoint           Backend endpoint name
X-Olla-Model              Model used (if applicable)
X-Olla-Backend-Type       Provider type (ollama, lmstudio, llamacpp, openai, vllm, sglang, lemonade, litellm)
X-Olla-Response-Time      Total processing time
X-Olla-Routing-Strategy   Routing strategy used (when model routing is active)
X-Olla-Routing-Decision   Routing decision made (routed/fallback/rejected)
X-Olla-Routing-Reason     Human-readable reason for routing decision
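
Together with the optional X-Request-ID request header, these make request tracing straightforward:

import requests

# Send a custom trace ID and inspect Olla's routing headers on the way back.
resp = requests.get(
    "http://localhost:40114/olla/models",
    headers={"X-Request-ID": "trace-1234"},
    timeout=10,
)
print(resp.headers.get("X-Olla-Request-ID"))      # unique request identifier
print(resp.headers.get("X-Olla-Endpoint"))        # which backend served it
print(resp.headers.get("X-Olla-Response-Time"))   # total processing time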

Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

Metric                       Description                   Providers
provider_total_ms            Total processing time (ms)    Ollama, LM Studio
provider_prompt_tokens       Tokens in prompt (count)      All
provider_completion_tokens   Tokens generated (count)      All
provider_tokens_per_second   Generation speed (tokens/s)   Ollama, LM Studio
provider_model               Actual model used             All

See Provider Metrics for detailed information.

Error Responses

Standard HTTP status codes are used:

Status Code   Description
200           Success
400           Bad Request
404           Not Found
429           Rate Limit Exceeded
500           Internal Server Error
502           Bad Gateway
503           Service Unavailable

Error Response Format

{
  "error": {
    "message": "Error description",
    "type": "error_type",
    "code": "ERROR_CODE"
  }
}
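
Client-side handling can key off this envelope. A minimal sketch (the unknown model name is deliberate, to provoke an error):

import requests

resp = requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={"model": "does-not-exist", "messages": [{"role": "user", "content": "hi"}]},
    timeout=30,
)
if not resp.ok:
    # Error bodies follow the {"error": {...}} envelope shown above.
    err = resp.json().get("error", {})
    print(resp.status_code, err.get("type"), err.get("message"))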

Streaming Responses

For streaming endpoints (chat completions, text generation), responses use:

  • Content-Type: text/event-stream for SSE streams
  • Transfer-Encoding: chunked for HTTP streaming
  • Line-delimited JSON for data chunks
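
A client consumes such a stream by reading lines and decoding each data: chunk. The sketch below targets the OpenAI-compatible path and assumes the standard "data: ..." / "data: [DONE]" SSE framing:

import json
import requests

# Stream a chat completion as Server-Sent Events, printing deltas as they arrive.
with requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={
        "model": "llama3",   # illustrative backend model name
        "messages": [{"role": "user", "content": "Write a haiku"}],
        "stream": True,
    },
    stream=True,
    timeout=300,
) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)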

CORS Support

CORS headers are included for browser-based clients:

  • Access-Control-Allow-Origin: *
  • Access-Control-Allow-Methods: GET, POST, OPTIONS
  • Access-Control-Allow-Headers: Content-Type, Authorization