Anthropic Messages API¶
Olla provides a complete Anthropic Messages API translator that enables Claude-compatible clients to work with local LLM infrastructure. This translation layer converts requests between Anthropic's API format and OpenAI's Chat Completions API, allowing tools like Claude Code to leverage your existing Ollama, LM Studio, vLLM, SGLang or other OpenAI-compatible backends supported by Olla.
Overview¶
The Anthropic translator accepts requests in Anthropic Messages API format at /olla/anthropic/v1/* endpoints, translates them to OpenAI format, routes to available backends, and translates responses back to Anthropic format.
Key Features:
- ✅ Full Anthropic Messages API compatibility
- ✅ Streaming via Server-Sent Events (SSE)
- ✅ Tool use (function calling)
- ✅ Works with all OpenAI-compatible backends
- ✅ Zero backend changes required
- ⚠️ Vision Support: Image content blocks accepted but not yet processed
- ⛔ Async Support: Asynchronous workflows are not supported
Supported Clients:
- Claude Code - Anthropic's official CLI coding assistant
- OpenCode - Open-source AI coding assistant
- Crush CLI - Charmbracelet's AI CLI tool
- Any client using Anthropic Messages API format
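In practice, pointing one of these clients at Olla usually just means overriding its Anthropic base URL. A minimal sketch for Claude Code, assuming it reads the ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN environment variables and appends /v1/messages to the base URL (see the Claude Code Integration guide under Next Steps for the full setup):

export ANTHROPIC_BASE_URL="http://localhost:40114/olla/anthropic"   # Olla's Anthropic translator prefix
export ANTHROPIC_AUTH_TOKEN="not-needed-for-local-backends"         # placeholder; local backends don't validate it
claude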
How it Works¶
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla Translator
participant Backend as OpenAI Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Validate request
Note over Olla: 2. Translate to OpenAI format
Note over Olla: 3. Route to backend
Olla->>Backend: POST /v1/chat/completions<br/>(OpenAI format)
Backend->>Olla: Response (OpenAI format)
Note over Olla: 4. Translate response back
Olla->>Client: Response (Anthropic format)
Translation Process:
- Client sends Anthropic-formatted request
- Olla translates request to OpenAI format
- Request routed through standard Olla pipeline (load balancing, health checks)
- Backend processes request (unaware of original format)
- Olla translates OpenAI response back to Anthropic format
- Client receives Anthropic-formatted response
For detailed explanation, see API Translation Concept.
Endpoints Overview¶
| Method | Endpoint | Description |
|---|---|---|
| GET | /olla/anthropic/v1/models | List available models in Anthropic format |
| POST | /olla/anthropic/v1/messages | Create a message (chat completion) |
| POST | /olla/anthropic/v1/messages/count_tokens | Estimate token count for a message |
POST /olla/anthropic/v1/messages¶
Create a message with support for streaming and tool use. Image content blocks are accepted in requests but not yet processed (see the Vision example below).
Request¶
Headers:
- Content-Type: application/json (required)
- anthropic-version: 2023-06-01 (optional; accepted but not enforced)
Body Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (must exist on backend) |
| messages | array | Yes | Array of message objects |
| max_tokens | integer | Yes | Maximum tokens to generate |
| system | string | No | System prompt (translated to first message) |
| temperature | number | No | Sampling temperature (0.0-1.0) |
| top_p | number | No | Nucleus sampling parameter |
| top_k | integer | No | Top-k sampling parameter |
| stop_sequences | array | No | Stop sequences |
| stream | boolean | No | Enable streaming (default: false) |
| tools | array | No | Tool definitions for function calling |
Message Object:
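Each message has a role of "user" or "assistant" and a content field containing either a plain string or an array of content blocks:
{
  "role": "user",
  "content": "Explain quantum computing in one sentence."
}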
Content Blocks (for multi-modal):
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "base64-encoded-image-data"
}
}
]
}
Example: Basic Chat¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "Explain quantum computing in one sentence."
}
]
}'
Response:
{
"id": "msg_01ABC123def456GHI789jkl",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform calculations that would be infeasible for classical computers."
}
],
"model": "llama4:latest",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 15,
"output_tokens": 28
}
}
Example: Streaming¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-N \
-d '{
"model": "llama4:latest",
"max_tokens": 100,
"stream": true,
"messages": [
{
"role": "user",
"content": "Count from 1 to 5."
}
]
}'
Response (Server-Sent Events):
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","content":[],"model":"llama4:latest","usage":{"input_tokens":12,"output_tokens":0}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"1"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 2"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 3"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 4"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 5"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":9}}
event: message_stop
data: {"type":"message_stop"}
Example: System Messages¶
The system parameter is translated to the first message with role: "system":
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"system": "You are a helpful coding assistant specialised in Python.",
"messages": [
{
"role": "user",
"content": "Write a function to calculate fibonacci numbers."
}
]
}'
How it's Translated (internally to OpenAI format):
{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{
"role": "system",
"content": "You are a helpful coding assistant specialised in Python."
},
{
"role": "user",
"content": "Write a function to calculate fibonacci numbers."
}
]
}
Example: Tool Use (Function Calling)¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"tools": [
{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
],
"messages": [
{
"role": "user",
"content": "What'\''s the weather in San Francisco?"
}
]
}'
Response (if model decides to use tool):
{
"id": "msg_tool123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "tool_use",
"id": "toolu_01ABC",
"name": "get_weather",
"input": {
"location": "San Francisco",
"unit": "fahrenheit"
}
}
],
"model": "llama4:latest",
"stop_reason": "tool_use",
"usage": {
"input_tokens": 120,
"output_tokens": 45
}
}
Note: Tool use requires a model that supports function calling. Not all local models support this feature.
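To complete a tool call, the tool's output goes back in a follow-up request as a tool_result content block that references the tool_use id, alongside the original conversation and the same tools definitions. A sketch of the messages array for that follow-up (the weather value is illustrative):
"messages": [
  {"role": "user", "content": "What's the weather in San Francisco?"},
  {
    "role": "assistant",
    "content": [
      {"type": "tool_use", "id": "toolu_01ABC", "name": "get_weather", "input": {"location": "San Francisco", "unit": "fahrenheit"}}
    ]
  },
  {
    "role": "user",
    "content": [
      {"type": "tool_result", "tool_use_id": "toolu_01ABC", "content": "68°F and sunny"}
    ]
  }
]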
Example: Vision (Image Input)¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llava:latest",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "'"$(base64 -w 0 image.jpg)"'"
}
},
{
"type": "text",
"text": "What do you see in this image?"
}
]
}
]
}'
Response:
{
"id": "msg_vision123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "I see a sunset over a mountain range with orange and purple hues in the sky."
}
],
"model": "llava:latest",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 1250,
"output_tokens": 22
}
}
Note: Image content blocks are accepted in the request structure but not yet processed by Olla. Vision support requires a multi-modal model like LLaVA or similar.
GET /olla/anthropic/v1/models¶
List all available models across configured backends in Anthropic API format.
Request¶
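Listing models requires no request body:
curl http://localhost:40114/olla/anthropic/v1/models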
Response¶
{
"data": [
{
"id": "llama4:latest",
"name": "llama4:latest",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
},
{
"id": "qwen2.5-coder:32b",
"name": "qwen2.5-coder:32b",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
},
{
"id": "mistral-nemo:latest",
"name": "mistral-nemo:latest",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
}
]
}
POST /olla/anthropic/v1/messages/count_tokens¶
Estimate token count for a message request using character-based calculation.
How It Works¶
Olla estimates token count using the formula: totalCharacters / 4
This provides a rough approximation for:
- System prompt content
- All message text content
- Tool names and inputs
- Tool result content
Request¶
Headers:
- Content-Type: application/json (required)
Body: Same format as /messages request (without requiring actual model execution)
Example¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Hello world"}
]
}'
Response:
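An illustrative body, assuming the response mirrors Anthropic's count_tokens shape; with the chars/4 estimate, "Hello world" (11 characters) works out to roughly 2-3 tokens:
{
  "input_tokens": 3
}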
Note: This is a character-based estimation (chars/4), not actual tokenisation. Actual token counts may vary by model.
Response Format¶
Non-Streaming Response¶
Structure:
{
"id": "msg_01ABC123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Response content here"
}
],
"model": "llama4:latest",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 15,
"output_tokens": 28
}
}
Fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique message identifier |
| type | string | Always "message" |
| role | string | Always "assistant" |
| content | array | Array of content blocks |
| model | string | Model that generated the response |
| stop_reason | string | Why generation stopped |
| stop_sequence | string | Stop sequence that triggered the stop (if any) |
| usage | object | Token usage statistics |
Stop Reasons:
- end_turn - Natural completion
- max_tokens - Reached the max_tokens limit
- stop_sequence - Hit a stop sequence
- tool_use - Model wants to use a tool
Streaming Response¶
Streaming uses Server-Sent Events (SSE) with typed events.
SSE Event Types (6 total):
| Event | Description | Example Data |
|---|---|---|
| message_start | Initial message metadata | {"type":"message_start","message":{...}} |
| content_block_start | Start of a text or tool_use block | {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}} |
| content_block_delta | Text chunks (text_delta) or tool JSON chunks (input_json_delta) | {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"..."}} |
| content_block_stop | End of content block | {"type":"content_block_stop","index":0} |
| message_delta | Stop reason and final usage statistics | {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":9}} |
| message_stop | End of stream | {"type":"message_stop"} |
Full Event Sequence:
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","role":"assistant","content":[],"model":"llama4:latest"}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}
event: message_stop
data: {"type":"message_stop"}
Response Headers¶
All responses include standard Olla headers:
| Header | Description | Example |
|---|---|---|
| X-Olla-Request-ID | Unique request identifier | req_abc123 |
| X-Olla-Endpoint | Backend endpoint name | local-ollama |
| X-Olla-Model | Actual model used | llama4:latest |
| X-Olla-Backend-Type | Backend type | ollama |
| X-Olla-Response-Time | Total processing time | 1.234s |
Authentication¶
For Local Use: No authentication required by default.
For Production: Olla does not implement authentication at the proxy level. Use:
- Reverse proxy authentication (nginx, Traefik, Caddy)
- API gateway (Kong, Tyk)
- Network-level security (VPN, firewall)
API Key Pass-through: If the backend requires an API key, include it:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Authorization: Bearer your-backend-api-key" \
-H "Content-Type: application/json" \
-d '{"model":"...", ...}'
Rate Limiting¶
Standard Olla rate limits apply:
| Limit Type | Default |
|---|---|
| Global requests/minute | 1000 |
| Per-IP requests/minute | 100 |
| Burst size | 50 |
Configure in config.yaml:
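A hedged sketch of what that section could look like; the exact key names below (server.rate_limits and its children) are assumptions, so verify them against the Olla configuration reference rather than copying verbatim:
server:
  rate_limits:
    global_requests_per_minute: 1000   # assumed key name
    per_ip_requests_per_minute: 100    # assumed key name
    burst_size: 50                     # assumed key name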
Error Responses¶
Errors follow Anthropic API format:
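The body shape matches the error example shown under Troubleshooting below (the message text here is illustrative):
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "max_tokens is required"
  }
}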
Common Error Types:
| HTTP Status | Error Type | Description |
|---|---|---|
| 400 | invalid_request_error | Invalid request format or parameters |
| 404 | not_found_error | Model or endpoint not found |
| 429 | rate_limit_error | Rate limit exceeded |
| 500 | api_error | Internal server error |
| 502 | overloaded_error | Backend service unavailable |
| 503 | overloaded_error | Service temporarily unavailable |
Differences from Official Anthropic API¶
✅ Fully Supported¶
- Basic message creation (text, multi-turn conversations)
- Streaming responses with all SSE event types
- System messages (string or content blocks)
- Tool use (definitions, tool_choice, tool_use, tool_result)
- Tool streaming with input_json_delta events
- Token counting via /count_tokens endpoint
- Stop sequences
- Temperature, top_p, top_k parameters
- Content blocks (text, tool_use, tool_result)
Tool Choice Mapping:
"auto"→ OpenAI"auto""any"→ OpenAI"required"(semantic equivalent)"none"→ OpenAI"none"{"type": "tool", "name": "X"}→ OpenAI{"type": "function", "function": {"name": "X"}}
⚠️ Partially Supported¶
- Vision: Image content blocks accepted in request but not processed
- Token Counting: Character-based estimation (chars/4), not actual model tokenisation
- TopK: Passed to backend but may be ignored by OpenAI-compatible backends
❌ Not Supported¶
- Extended Thinking: Field accepted but not processed
- Prompt Caching: Not implemented
- Batches API: Not implemented
- Message Editing: Not supported
- Anthropic-specific headers: the anthropic-version header is accepted but not enforced
- Asynchronous Flows: Not supported, so Claude Code can't dispatch concurrent/parallel agents to work on a task
Configuration¶
Enable Anthropic translation in config.yaml:
translators:
  anthropic:
    enabled: true                  # Enable Anthropic API translator
    max_message_size: 10485760     # Max request size (10MB)

# Standard Olla configuration
discovery:
  type: static
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable Anthropic translator |
| max_message_size | integer | 10485760 | Max request size in bytes (10MB) |
Performance Considerations¶
Translation Overhead:
- Request translation: ~0.5-2ms
- Response translation: ~1-5ms
- Streaming: Minimal additional latency per chunk
Memory Usage:
- Minimal for basic messages (~1-5KB per request)
- Higher for vision (proportional to image size)
- Streaming uses constant memory
Recommendations:
- Use proxy.engine: "olla" for lowest latency
- Enable profile: "streaming" for chat applications
- Configure an appropriate max_message_size for your use case
- Use connection pooling for high throughput
Troubleshooting¶
Model Not Found¶
Error:
{
"type": "error",
"error": {
"type": "not_found_error",
"message": "model 'llama4:latest' not found"
}
}
Solutions:
- Check the model is available: curl http://localhost:40114/olla/models
- Verify the backend is healthy: curl http://localhost:40114/internal/status/endpoints
- Pull the model if using Ollama: ollama pull llama4:latest
Streaming Not Working¶
Symptoms: No streaming events received
Solutions:
- Ensure stream: true is set in the request
- Use the -N flag with curl: curl -N ...
- Check proxy settings (some proxies buffer SSE)
- Verify backend supports streaming
Tool Use Not Working¶
Symptoms: Model doesn't use tools or returns errors
Solutions:
- Verify model supports function calling (not all local models do)
- Check tool definition format matches Anthropic spec
- Try with a known tool-capable model (e.g., Llama 3.2 3B+)
High Latency¶
Symptoms: Slow response times
Solutions:
- Switch to proxy.engine: "olla" (the high-performance engine)
- Use profile: "streaming" for lower latency
- Check backend performance
- Enable connection pooling
- Use local backends (Ollama, LM Studio) for lowest latency
Next Steps¶
- API Translation Concept - Understand how translation works
- Claude Code Integration - Set up Claude Code
- OpenCode Integration - Set up OpenCode
- Crush CLI Integration - Set up Crush CLI
- Anthropic Translation Setup - Complete configuration guide
Related Documentation¶
- OpenAI API Reference - OpenAI endpoint documentation
- Model Routing - How models are selected
- Load Balancing - Request distribution
- Health Checking - Endpoint monitoring