API Reference

Olla exposes several API endpoints for proxy operations, health monitoring, and system status. All endpoints follow RESTful conventions and return JSON responses unless otherwise specified.

Base URL

http://localhost:40114

If you ever need to remember the port, think: what's the port? 4 OLLA!

API Sections

System Endpoints

Internal endpoints for health monitoring and system status.

  • /internal/health - Health check endpoint
  • /internal/status - System status and statistics
  • /internal/process - Process information
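
A quick way to verify the proxy is running is to poll the health endpoint. The sketch below assumes Python with the requests library; the exact shape of the health payload is not documented here, so treat the parsed body as illustrative.

import requests

# Probe Olla's health endpoint; a 200 status indicates the proxy is up.
resp = requests.get("http://localhost:40114/internal/health", timeout=5)
print(resp.status_code)   # expect 200 when healthy
print(resp.json())        # health payload (shape may vary)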

Unified Models API

Cross-provider model discovery and information.

  • /olla/models - List all available models across providers
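
For example, a client can enumerate every model Olla knows about in a single call. This is a minimal sketch; the exact JSON envelope (whether models sit under a top-level "data" key, for instance) is an assumption borrowed from OpenAI-style listings.

import requests

# List models aggregated across all configured providers.
resp = requests.get("http://localhost:40114/olla/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):   # "data" envelope is an assumption
    print(model)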

Ollama API

Proxy endpoints for Ollama instances.

  • /olla/ollama/* - All Ollama API endpoints
  • OpenAI-compatible endpoints included
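
Because Olla forwards the native Ollama surface, the familiar paths work unchanged behind the /olla/ollama prefix. A hedged sketch of a non-streaming generate call (the model name is illustrative and must exist on a backend):

import requests

# Ollama's native /api/generate, proxied through Olla.
resp = requests.post(
    "http://localhost:40114/olla/ollama/api/generate",
    json={
        "model": "llama3",   # illustrative; use a model your backend serves
        "prompt": "Hello!",
        "stream": False,     # request a single JSON response, not a stream
    },
    timeout=120,
)
print(resp.json()["response"])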

LM Studio API

Proxy endpoints for LM Studio servers.

  • /olla/lmstudio/* - All LM Studio API endpoints
  • /olla/lm-studio/* - Alternative prefix
  • /olla/lm_studio/* - Alternative prefix

OpenAI API

Proxy endpoints for OpenAI-compatible services.

  • /olla/openai/* - OpenAI API endpoints
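
Any OpenAI SDK or plain HTTP client can target these paths by pointing its base URL at Olla. A minimal sketch with plain requests (the model name is illustrative):

import requests

# Standard OpenAI chat-completions request via Olla's /olla/openai prefix.
resp = requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={
        "model": "llama3",   # illustrative backend model name
        "messages": [{"role": "user", "content": "Say hi"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])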

vLLM API

Proxy endpoints for vLLM servers.

  • /olla/vllm/* - vLLM API endpoints

SGLang API

Proxy endpoints for SGLang servers with RadixAttention and Frontend Language support.

  • /olla/sglang/* - SGLang API endpoints
  • Includes vision model support and speculative decoding

LiteLLM API

Proxy endpoints for LiteLLM gateway (100+ providers).

  • /olla/litellm/* - LiteLLM API endpoints

llama.cpp API

Proxy endpoints for llama.cpp servers.

  • /olla/llamacpp/* - llama.cpp API endpoints
  • OpenAI-compatible endpoints plus native llama.cpp features
  • Includes slot monitoring, code infill, and tokenisation
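
As a sketch of that native surface, the call below exercises llama.cpp's tokenize endpoint through the proxy; the request and response shapes are assumed to match the upstream llama.cpp server API.

import requests

# llama.cpp's native tokenize endpoint, proxied behind /olla/llamacpp.
resp = requests.post(
    "http://localhost:40114/olla/llamacpp/tokenize",
    json={"content": "Hello, world!"},
    timeout=30,
)
print(resp.json().get("tokens"))   # token IDs for the input text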

Lemonade SDK API

Proxy endpoints for Lemonade SDK servers with AMD Ryzen AI support.

  • /olla/lemonade/* - Lemonade SDK API endpoints
  • Includes ONNX and GGUF model support with hardware acceleration

Translated APIs

APIs that translate between different formats in real time.

Anthropic Messages API

Anthropic-compatible API endpoints for Claude clients.

Endpoints:

  • POST /olla/anthropic/v1/messages - Create a message (chat)
  • GET /olla/anthropic/v1/models - List available models

Features:

  • Full Anthropic Messages API v1 support
  • Automatic translation to OpenAI format
  • Streaming with Server-Sent Events
  • Tool use (function calling)
  • Vision support (multi-modal)

Use With:

  • Claude Code
  • OpenCode
  • Crush CLI
  • Any Anthropic API client
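
A minimal sketch of a Messages call, assuming Python with requests (the model name is illustrative; Olla translates the request to OpenAI format for the backend):

import requests

# Anthropic Messages API shape: model, max_tokens and messages are required.
resp = requests.post(
    "http://localhost:40114/olla/anthropic/v1/messages",
    json={
        "model": "llama3",   # illustrative; any model exposed by a backend
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120,
)
print(resp.json())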

See API Translation for how translation works.

Authentication

Currently, Olla does not implement authentication at the proxy level. Authentication should be handled by:

  • Backend services (Ollama, LM Studio, etc.)
  • Network-level security (firewalls, VPNs)
  • Reverse proxy authentication (nginx, Traefik)

Rate Limiting

Global and per-IP rate limits are enforced:

Limit Type                        Default Value
Global requests/minute            1000
Per-IP requests/minute            100
Health endpoint requests/minute   1000
Burst size                        50
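
Clients should expect 429 responses once a limit is exhausted. The retry sketch below is illustrative; whether Olla sets a Retry-After header on 429s is an assumption, so the code falls back to exponential backoff.

import time
import requests

def get_with_backoff(url: str, attempts: int = 5) -> requests.Response:
    """GET with retries on 429, honouring Retry-After when present."""
    for attempt in range(attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Retry-After is an assumption here; default to exponential backoff.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    resp.raise_for_status()
    return resp

resp = get_with_backoff("http://localhost:40114/olla/models")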

Request Headers

Required Headers

  • Content-Type: application/json for POST requests

Optional Headers

  • X-Request-ID - Custom request ID for tracing

Response Headers

All responses include:

Header                    Description
X-Olla-Request-ID         Unique request identifier
X-Olla-Endpoint           Backend endpoint name
X-Olla-Model              Model used (if applicable)
X-Olla-Backend-Type       Provider type (ollama, lmstudio, llamacpp, openai, vllm, sglang, lemonade, litellm)
X-Olla-Response-Time      Total processing time
X-Olla-Routing-Strategy   Routing strategy used (when model routing is active)
X-Olla-Routing-Decision   Routing decision made (routed/fallback/rejected)
X-Olla-Routing-Reason     Human-readable reason for routing decision
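
Together with the optional X-Request-ID request header, these make request tracing straightforward:

import requests

# Send a custom trace ID and inspect Olla's routing headers on the way back.
resp = requests.get(
    "http://localhost:40114/olla/models",
    headers={"X-Request-ID": "trace-1234"},
    timeout=10,
)
print(resp.headers.get("X-Olla-Request-ID"))      # unique request identifier
print(resp.headers.get("X-Olla-Endpoint"))        # which backend served it
print(resp.headers.get("X-Olla-Response-Time"))   # total processing time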

Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

Metric                       Description                   Providers
provider_total_ms            Total processing time (ms)    Ollama, LM Studio
provider_prompt_tokens       Tokens in prompt (count)      All
provider_completion_tokens   Tokens generated (count)      All
provider_tokens_per_second   Generation speed (tokens/s)   Ollama, LM Studio
provider_model               Actual model used             All

See Provider Metrics for detailed information.

Error Responses

Standard HTTP status codes are used:

Status Code   Description
200           Success
400           Bad Request
404           Not Found
429           Rate Limit Exceeded
500           Internal Server Error
502           Bad Gateway
503           Service Unavailable

Error Response Format

{
  "error": {
    "message": "Error description",
    "type": "error_type",
    "code": "ERROR_CODE"
  }
}
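
Client-side handling can key off this envelope. A minimal sketch (the unknown model name is deliberate, to provoke an error):

import requests

resp = requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={"model": "does-not-exist", "messages": [{"role": "user", "content": "hi"}]},
    timeout=30,
)
if not resp.ok:
    # Error bodies follow the {"error": {...}} envelope shown above.
    err = resp.json().get("error", {})
    print(resp.status_code, err.get("type"), err.get("message"))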

Streaming Responses

For streaming endpoints (chat completions, text generation), responses use:

  • Content-Type: text/event-stream for SSE streams
  • Transfer-Encoding: chunked for HTTP streaming
  • Line-delimited JSON for data chunks
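
A client consumes such a stream by reading lines and decoding each data: chunk. The sketch below targets the OpenAI-compatible path and assumes the standard "data: ..." / "data: [DONE]" SSE framing:

import json
import requests

# Stream a chat completion as Server-Sent Events, printing deltas as they arrive.
with requests.post(
    "http://localhost:40114/olla/openai/v1/chat/completions",
    json={
        "model": "llama3",   # illustrative backend model name
        "messages": [{"role": "user", "content": "Write a haiku"}],
        "stream": True,
    },
    stream=True,
    timeout=300,
) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)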

CORS Support

CORS headers are included for browser-based clients:

  • Access-Control-Allow-Origin: *
  • Access-Control-Allow-Methods: GET, POST, OPTIONS
  • Access-Control-Allow-Headers: Content-Type, Authorization