Anthropic Messages API¶
Olla provides a complete Anthropic Messages API translator that enables Claude-compatible clients to work with local LLM infrastructure. This translation layer converts requests between Anthropic's API format and OpenAI's Chat Completions API, allowing tools like Claude Code to leverage your existing Ollama, LM Studio, vLLM, SGLang or other OpenAI-compatible backends supported by Olla.
Overview¶
The Anthropic translator accepts requests in Anthropic Messages API format at /olla/anthropic/v1/* endpoints, translates them to OpenAI format, routes to available backends, and translates responses back to Anthropic format.
Key Features:
- ✅ Full Anthropic Messages API compatibility
- ✅ Passthrough mode for backends with native Anthropic support (vLLM, llama.cpp, LM Studio, Ollama)
- ✅ Translation mode for OpenAI-compatible backends without native support
- ✅ Automatic fallback from passthrough to translation when needed
- ✅ Streaming via Server-Sent Events (SSE)
- ✅ Tool use (function calling)
- ✅ Works with all OpenAI-compatible backends
- ✅ Zero backend changes required
- ✅ Translator metrics for observability (passthrough/translation rates, latency, fallback tracking)
- ⚠️ Vision Support: Image content blocks accepted but not yet processed
- ⛔ Async Support: Asynchronous workflows are not supported
Supported Clients:
- Claude Code - Anthropic's official CLI coding assistant
- OpenCode - Open-source AI coding assistant
- Crush CLI - Charmbracelet's AI CLI tool
- Any client using Anthropic Messages API format
How it Works¶
Olla supports two modes for handling Anthropic API requests:
Passthrough Mode (Preferred)¶
When a backend natively supports the Anthropic Messages API, requests are forwarded directly without any translation overhead.
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla (Passthrough)
participant Backend as Anthropic-Compatible Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Detect native Anthropic support
Note over Olla: 2. Forward request as-is
Olla->>Backend: POST /v1/messages<br/>(Anthropic format - unchanged)
Backend->>Olla: Response (Anthropic format)
Olla->>Client: Response (Anthropic format - unchanged)
Compatible backends: vLLM (v0.11.1+), llama.cpp (b4847+), LM Studio (v0.4.1+), Ollama (v0.14.0+)
Observability: Responses include X-Olla-Mode: passthrough header.
Translation Mode (Fallback)¶
When no backend supports native Anthropic format, requests are translated to OpenAI format and responses are translated back.
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla Translator
participant Backend as OpenAI Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Validate request
Note over Olla: 2. Translate to OpenAI format
Note over Olla: 3. Route to backend
Olla->>Backend: POST /v1/chat/completions<br/>(OpenAI format)
Backend->>Olla: Response (OpenAI format)
Note over Olla: 4. Translate response back
Olla->>Client: Response (Anthropic format)
Mode Selection: Olla automatically selects the best mode based on available backend capabilities. No client-side configuration is required.
For detailed explanation of both modes, see API Translation Concept.
Endpoints Overview¶
| Method | Endpoint | Description |
|---|---|---|
| GET | /olla/anthropic/v1/models | List available models in Anthropic format |
| POST | /olla/anthropic/v1/messages | Create a message (chat completion) |
| POST | /olla/anthropic/v1/messages/count_tokens | Estimate token count for a message |
POST /olla/anthropic/v1/messages¶
Create a message with support for streaming, tool use, and vision.
Request¶
Headers:
- Content-Type: application/json (required)
- anthropic-version: 2023-06-01 (optional; accepted but not enforced)
Body Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (must exist on backend) |
| messages | array | Yes | Array of message objects |
| max_tokens | integer | Yes | Maximum tokens to generate |
| system | string | No | System prompt (translated to first message) |
| temperature | number | No | Sampling temperature (0.0-1.0) |
| top_p | number | No | Nucleus sampling parameter |
| top_k | integer | No | Top-k sampling parameter |
| stop_sequences | array | No | Stop sequences |
| stream | boolean | No | Enable streaming (default: false) |
| tools | array | No | Tool definitions for function calling |
Message Object: each message has a role ("user" or "assistant") and content, which is either a plain string or an array of content blocks.
Content Blocks (for multi-modal):
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "base64-encoded-image-data"
}
}
]
}
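For scripting, a block like the one above can be assembled with a small helper (a hypothetical sketch; note that Olla currently accepts image blocks but does not yet process them):

```python
import base64

def image_block(raw: bytes, media_type: str = "image/jpeg") -> dict:
    # Encode raw image bytes into an Anthropic base64 image content block.
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(raw).decode("ascii"),
        },
    }

# A JPEG file starts with the bytes FF D8 FF.
block = image_block(b"\xff\xd8\xff")
```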
Example: Basic Chat¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "Explain quantum computing in one sentence."
}
]
}'
Response:
{
"id": "msg_01ABC123def456GHI789jkl",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform calculations that would be infeasible for classical computers."
}
],
"model": "llama4:latest",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 15,
"output_tokens": 28
}
}
Example: Streaming¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-N \
-d '{
"model": "llama4:latest",
"max_tokens": 100,
"stream": true,
"messages": [
{
"role": "user",
"content": "Count from 1 to 5."
}
]
}'
Response (Server-Sent Events):
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","content":[],"model":"llama4:latest","usage":{"input_tokens":12,"output_tokens":0}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"1"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 2"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 3"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 4"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 5"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":9}}
event: message_stop
data: {"type":"message_stop"}
Example: System Messages¶
The system parameter is translated to the first message with role: "system":
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"system": "You are a helpful coding assistant specialised in Python.",
"messages": [
{
"role": "user",
"content": "Write a function to calculate fibonacci numbers."
}
]
}'
How it's Translated (internally to OpenAI format):
{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{
"role": "system",
"content": "You are a helpful coding assistant specialised in Python."
},
{
"role": "user",
"content": "Write a function to calculate fibonacci numbers."
}
]
}
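The mapping above can be sketched in Python (an illustration of the observable behaviour, not Olla's actual implementation):

```python
def anthropic_to_openai(body: dict) -> dict:
    # Move the Anthropic `system` parameter into the first OpenAI message.
    messages = []
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])

    translated = {
        "model": body["model"],
        "max_tokens": body["max_tokens"],
        "messages": messages,
    }
    # Optional parameters carry over; Anthropic's stop_sequences
    # corresponds to OpenAI's `stop`.
    for key in ("temperature", "top_p", "stream"):
        if key in body:
            translated[key] = body[key]
    if "stop_sequences" in body:
        translated["stop"] = body["stop_sequences"]
    return translated

request = {
    "model": "llama4:latest",
    "max_tokens": 1024,
    "system": "You are a helpful coding assistant specialised in Python.",
    "messages": [
        {"role": "user", "content": "Write a function to calculate fibonacci numbers."}
    ],
}
translated = anthropic_to_openai(request)
```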
Example: Tool Use (Function Calling)¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"tools": [
{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
],
"messages": [
{
"role": "user",
"content": "What'\''s the weather in San Francisco?"
}
]
}'
Response (if model decides to use tool):
{
"id": "msg_tool123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "tool_use",
"id": "toolu_01ABC",
"name": "get_weather",
"input": {
"location": "San Francisco",
"unit": "fahrenheit"
}
}
],
"model": "llama4:latest",
"stop_reason": "tool_use",
"usage": {
"input_tokens": 120,
"output_tokens": 45
}
}
Note: Tool use requires a model that supports function calling. Not all local models support this feature.
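Client-side, the tool_use blocks can be pulled out of a response like the one above with a small helper (an illustrative sketch):

```python
def extract_tool_calls(response: dict) -> list:
    # Collect every tool_use content block from an Anthropic response.
    return [
        {"id": block["id"], "name": block["name"], "input": block["input"]}
        for block in response.get("content", [])
        if block.get("type") == "tool_use"
    ]

response = {
    "type": "message",
    "stop_reason": "tool_use",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_01ABC",
            "name": "get_weather",
            "input": {"location": "San Francisco", "unit": "fahrenheit"},
        }
    ],
}
calls = extract_tool_calls(response)
```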
Example: Vision (Image Input)¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llava:latest",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "'"$(base64 -w 0 image.jpg)"'"
}
},
{
"type": "text",
"text": "What do you see in this image?"
}
]
}
]
}'
Response:
{
"id": "msg_vision123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "I see a sunset over a mountain range with orange and purple hues in the sky."
}
],
"model": "llava:latest",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 1250,
"output_tokens": 22
}
}
Note: Image content blocks are accepted in the request structure but not yet processed by Olla. Vision support requires a multi-modal model like LLaVA or similar.
GET /olla/anthropic/v1/models¶
List all available models across configured backends in Anthropic API format.
Request¶
curl http://localhost:40114/olla/anthropic/v1/models
Response¶
{
"data": [
{
"id": "llama4:latest",
"name": "llama4:latest",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
},
{
"id": "qwen2.5-coder:32b",
"name": "qwen2.5-coder:32b",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
},
{
"id": "mistral-nemo:latest",
"name": "mistral-nemo:latest",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
}
]
}
POST /olla/anthropic/v1/messages/count_tokens¶
Estimate token count for a message request using character-based calculation.
How It Works¶
Olla estimates token count using the formula: totalCharacters / 4
This provides a rough approximation for:
- System prompt content
- All message text content
- Tool names and inputs
- Tool result content
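A rough Python equivalent of this estimate (a sketch; Olla's exact accounting of tool names, inputs, and results may differ):

```python
def estimate_tokens(body: dict) -> int:
    # Sum characters from the system prompt and all text content,
    # then apply the chars/4 heuristic (integer division assumed).
    chars = len(body.get("system", ""))
    for message in body.get("messages", []):
        content = message.get("content", "")
        if isinstance(content, str):
            chars += len(content)
        else:
            chars += sum(len(block.get("text", "")) for block in content)
    return chars // 4

estimate = estimate_tokens({"messages": [{"role": "user", "content": "Hello world"}]})
```

"Hello world" is 11 characters, so the estimate here is 11 // 4 = 2 tokens.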
Request¶
Headers:
- Content-Type: application/json (required)
Body: Same format as a /messages request; no model execution is performed.
Example¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Hello world"}
]
}'
Response:
Note: This is a character-based estimation (chars/4), not actual tokenisation. Actual token counts may vary by model.
Response Format¶
Non-Streaming Response¶
Structure:
{
"id": "msg_01ABC123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Response content here"
}
],
"model": "llama4:latest",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 15,
"output_tokens": 28
}
}
Fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique message identifier |
| type | string | Always "message" |
| role | string | Always "assistant" |
| content | array | Array of content blocks |
| model | string | Model that generated the response |
| stop_reason | string | Why generation stopped |
| stop_sequence | string | Stop sequence that triggered the stop (if any) |
| usage | object | Token usage statistics |
Stop Reasons:
- end_turn - Natural completion
- max_tokens - Reached the max_tokens limit
- stop_sequence - Hit a stop sequence
- tool_use - Model wants to use a tool
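In translation mode these values are derived from OpenAI finish reasons. The likely mapping looks roughly like this (illustrative only; the exact logic lives inside Olla's translator):

```python
# Illustrative OpenAI finish_reason -> Anthropic stop_reason mapping.
FINISH_REASON_MAP = {
    "stop": "end_turn",
    "length": "max_tokens",
    "tool_calls": "tool_use",
}

def to_stop_reason(finish_reason: str, matched_stop_sequence: bool = False) -> str:
    # A matched stop sequence takes precedence over the generic "stop".
    if matched_stop_sequence:
        return "stop_sequence"
    return FINISH_REASON_MAP.get(finish_reason, "end_turn")
```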
Streaming Response¶
Streaming uses Server-Sent Events (SSE) with typed events.
SSE Event Types (6 total):
| Event | Description | Example Data |
|---|---|---|
| message_start | Initial message metadata | {"type":"message_start","message":{...}} |
| content_block_start | Start of a text or tool_use block | {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}} |
| content_block_delta | Text chunks (text_delta) or tool JSON chunks (input_json_delta) | {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"..."}} |
| content_block_stop | End of a content block | {"type":"content_block_stop","index":0} |
| message_delta | Stop reason and final usage statistics | {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":9}} |
| message_stop | End of stream | {"type":"message_stop"} |
Full Event Sequence:
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","role":"assistant","content":[],"model":"llama4:latest"}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}
event: message_stop
data: {"type":"message_stop"}
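A minimal client-side parser for this event stream might look like the following sketch, which reassembles the text deltas into the full reply:

```python
import json

def parse_sse(stream: str) -> list:
    # Split an SSE body into (event_name, decoded_data) pairs.
    events = []
    event_name = None
    for line in stream.splitlines():
        if line.startswith("event: "):
            event_name = line[len("event: "):]
        elif line.startswith("data: "):
            events.append((event_name, json.loads(line[len("data: "):])))
    return events

sample = (
    'event: content_block_delta\n'
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}\n\n'
    'event: content_block_delta\n'
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}\n\n'
    'event: message_stop\n'
    'data: {"type":"message_stop"}\n'
)

# Reassemble the streamed text from the text_delta events.
text = "".join(
    data["delta"]["text"]
    for name, data in parse_sse(sample)
    if name == "content_block_delta" and data["delta"].get("type") == "text_delta"
)
```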
Response Headers¶
All responses include standard Olla headers:
| Header | Description | Example |
|---|---|---|
| X-Olla-Request-ID | Unique request identifier | req_abc123 |
| X-Olla-Endpoint | Backend endpoint name | local-ollama |
| X-Olla-Model | Actual model used | llama4:latest |
| X-Olla-Backend-Type | Backend type | ollama |
| X-Olla-Response-Time | Total processing time | 1.234s |
| X-Olla-Mode | Translator mode (only present for passthrough) | passthrough |
Detecting Passthrough Mode
When passthrough mode is active, the X-Olla-Mode: passthrough header is included in the response. When translation mode is used, this header is absent. This allows monitoring and debugging to distinguish between the two modes.
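A simple client-side check for the mode (a sketch; HTTP header names are case-insensitive, so normalise before comparing):

```python
def translator_mode(headers: dict) -> str:
    # X-Olla-Mode: passthrough is only set when passthrough mode served
    # the request; its absence implies translation mode.
    normalised = {name.lower(): value for name, value in headers.items()}
    return "passthrough" if normalised.get("x-olla-mode") == "passthrough" else "translation"

mode = translator_mode({"X-Olla-Mode": "passthrough", "X-Olla-Backend-Type": "vllm"})
```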
Translator Statistics
For aggregate translator metrics including passthrough rates, success rates, fallback reasons, and latency data, query the GET /internal/stats/translators endpoint.
Authentication¶
For Local Use: No authentication required by default.
For Production: Olla does not implement authentication at the proxy level. Use:
- Reverse proxy authentication (nginx, Traefik, Caddy)
- API gateway (Kong, Tyk)
- Network-level security (VPN, firewall)
API Key Pass-through: If the backend requires an API key, include it:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Authorization: Bearer your-backend-api-key" \
-H "Content-Type: application/json" \
-d '{"model":"...", ...}'
Rate Limiting¶
Standard Olla rate limits apply:
| Limit Type | Default |
|---|---|
| Global requests/minute | 1000 |
| Per-IP requests/minute | 100 |
| Burst size | 50 |
Configure in config.yaml:
Error Responses¶
Errors follow the Anthropic API error format: a top-level object with "type": "error" and a nested error object containing a type and a message (an example appears under Troubleshooting below).
Common Error Types:
| HTTP Status | Error Type | Description |
|---|---|---|
| 400 | invalid_request_error | Invalid request format or parameters |
| 404 | not_found_error | Model or endpoint not found |
| 429 | rate_limit_error | Rate limit exceeded |
| 500 | api_error | Internal server error |
| 502 | overloaded_error | Backend service unavailable |
| 503 | overloaded_error | Service temporarily unavailable |
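Combining this table with the error shape shown under Troubleshooting, an error response can be modelled as follows (an illustrative sketch, not Olla's internal code):

```python
# HTTP status -> Anthropic error type, per the table above.
ERROR_TYPES = {
    400: "invalid_request_error",
    404: "not_found_error",
    429: "rate_limit_error",
    500: "api_error",
    502: "overloaded_error",
    503: "overloaded_error",
}

def error_body(status: int, message: str) -> dict:
    # Anthropic-format error envelope: {"type": "error", "error": {...}}.
    return {
        "type": "error",
        "error": {"type": ERROR_TYPES.get(status, "api_error"), "message": message},
    }

body = error_body(404, "model 'llama4:latest' not found")
```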
Differences from Official Anthropic API¶
✅ Fully Supported¶
- Basic message creation (text, multi-turn conversations)
- Streaming responses with all SSE event types
- System messages (string or content blocks)
- Tool use (definitions, tool_choice, tool_use, tool_result)
- Tool streaming with input_json_delta events
- Token counting via the /count_tokens endpoint
- Stop sequences
- Temperature, top_p, top_k parameters
- Content blocks (text, tool_use, tool_result)
- Passthrough mode for backends with native Anthropic support (zero translation overhead)
Tool Choice Mapping:
- "auto" → OpenAI "auto"
- "any" → OpenAI "required" (semantic equivalent)
- "none" → OpenAI "none"
- {"type": "tool", "name": "X"} → OpenAI {"type": "function", "function": {"name": "X"}}
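The mapping can be expressed as a small function (a sketch of the equivalences listed above):

```python
def map_tool_choice(choice):
    # Specific-tool selection becomes an OpenAI function reference.
    if isinstance(choice, dict) and choice.get("type") == "tool":
        return {"type": "function", "function": {"name": choice["name"]}}
    # String modes map directly; "any" becomes OpenAI's "required".
    return {"auto": "auto", "any": "required", "none": "none"}[choice]

mapped = map_tool_choice({"type": "tool", "name": "get_weather"})
```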
⚠️ Partially Supported¶
- Vision: Image content blocks accepted in request but not processed
- Token Counting: Character-based estimation (chars/4), not actual model tokenisation
- TopK: Passed to backend but may be ignored by OpenAI-compatible backends
❌ Not Supported¶
- Extended Thinking: Field accepted but not processed
- Prompt Caching: Not implemented
- Batches API: Not implemented
- Message Editing: Not supported
- Anthropic-specific headers: the anthropic-version header is accepted but not enforced
- Asynchronous Flows: Not supported, so Claude Code cannot dispatch concurrent/parallel agents to work on a task
Configuration¶
Anthropic translation is enabled by default. To customise, edit config.yaml:
translators:
anthropic:
enabled: true # Enabled by default
passthrough_enabled: true # Forward directly to backends with native Anthropic support (default)
max_message_size: 10485760 # Max request size (10MB)
# Standard Olla configuration
discovery:
type: static
static:
endpoints:
- url: "http://localhost:11434"
name: "local-ollama"
type: "ollama"
priority: 100
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable the Anthropic translator (enabled by default) |
| max_message_size | integer | 10485760 | Max request size in bytes (10MB) |
| passthrough_enabled | boolean | true | Passthrough optimisation mode. When true (default), requests are forwarded directly to backends with native Anthropic support for zero translation overhead. When false, all requests go through translation regardless of backend capabilities. Only applies when enabled: true. Individual backends must also declare anthropic_support in their profile. |
Passthrough Configuration¶
Passthrough mode requires two things to be active:
1. The passthrough_enabled field must be set to true in the translator configuration
2. Backend profiles must declare native Anthropic support via anthropic_support.enabled: true
translators:
anthropic:
enabled: true
passthrough_enabled: true # Required to enable passthrough mode
When passthrough_enabled is true (the default), Olla forwards requests directly to backends with native Anthropic support. Set passthrough_enabled to false to force all requests through the translation pipeline regardless of backend capabilities, which can be useful for debugging or testing the translation layer.
Backends with native Anthropic support:
| Backend | Profile | Min Version | Notes |
|---|---|---|---|
| vLLM | config/profiles/vllm.yaml | v0.11.1+ | No token counting |
| llama.cpp | config/profiles/llamacpp.yaml | b4847+ | Supports token counting |
| LM Studio | config/profiles/lmstudio.yaml | v0.4.1+ | No token counting |
| Ollama | config/profiles/ollama.yaml | v0.14.0+ | No token counting |
To disable passthrough for a specific backend, set anthropic_support.enabled: false in the profile:
# config/profiles/vllm.yaml (custom override)
api:
anthropic_support:
enabled: false # Force translation mode for this backend
Performance Considerations¶
Translation Overhead:
- Request translation: ~0.5-2ms
- Response translation: ~1-5ms
- Streaming: Minimal additional latency per chunk
Memory Usage:
- Minimal for basic messages (~1-5KB per request)
- Higher for vision (proportional to image size)
- Streaming uses constant memory
Recommendations:
- Use proxy.engine: "olla" for the lowest latency
- Enable profile: "streaming" for chat applications
- Configure an appropriate max_message_size for your use case
- Use connection pooling for high throughput
Troubleshooting¶
Model Not Found¶
Error:
{
"type": "error",
"error": {
"type": "not_found_error",
"message": "model 'llama4:latest' not found"
}
}
Solutions:
- Check the model is available: curl http://localhost:40114/olla/models
- Verify the backend is healthy: curl http://localhost:40114/internal/status/endpoints
- Pull the model if using Ollama: ollama pull llama4:latest
Streaming Not Working¶
Symptoms: No streaming events received
Solutions:
- Ensure stream: true is set in the request
- Use the -N flag with curl: curl -N ...
- Verify backend supports streaming
Tool Use Not Working¶
Symptoms: Model doesn't use tools or returns errors
Solutions:
- Verify model supports function calling (not all local models do)
- Check tool definition format matches Anthropic spec
- Try with a known tool-capable model (e.g., Llama 3.2 3B+)
High Latency¶
Symptoms: Slow response times
Solutions:
- Switch to proxy.engine: "olla" (the high-performance engine)
- Use profile: "streaming" for lower latency
- Enable connection pooling
- Use local backends (Ollama, LM Studio) for lowest latency
Next Steps¶
- API Translation Concept - Understand how translation works
- Claude Code Integration - Set up Claude Code
- OpenCode Integration - Set up OpenCode
- Crush CLI Integration - Set up Crush CLI
- Anthropic Translation Setup - Complete configuration guide
Related Documentation¶
- OpenAI API Reference - OpenAI endpoint documentation
- Model Routing - How models are selected
- Load Balancing - Request distribution
- Health Checking - Endpoint monitoring