Anthropic Messages API¶
Olla provides a complete Anthropic Messages API translator that enables Claude-compatible clients to work with local LLM infrastructure. This translation layer converts requests between Anthropic's API format and OpenAI's Chat Completions API, allowing tools like Claude Code to leverage your existing Ollama, LM Studio, vLLM, SGLang or other OpenAI-compatible backends supported by Olla.
Overview¶
The Anthropic translator accepts requests in Anthropic Messages API format at /olla/anthropic/v1/* endpoints, translates them to OpenAI format, routes to available backends, and translates responses back to Anthropic format.
Key Features:
- ✅ Full Anthropic Messages API compatibility
- ✅ Passthrough mode for backends with native Anthropic support (vLLM, llama.cpp, LM Studio, Ollama)
- ✅ Translation mode for OpenAI-compatible backends without native support
- ✅ Automatic fallback from passthrough to translation when needed
- ✅ Streaming via Server-Sent Events (SSE)
- ✅ Tool use (function calling)
- ✅ Works with all OpenAI-compatible backends
- ✅ Zero backend changes required
- ✅ Translator metrics for observability (passthrough/translation rates, latency, fallback tracking)
- ⚠️ Vision Support: Image content blocks accepted but not yet processed
- ⛔ Async Support: Asynchronous workflows are not supported
Supported Clients:
- Claude Code - Anthropic's official CLI coding assistant
- OpenCode - Open-source AI coding assistant
- Crush CLI - Charmbracelet's AI CLI tool
- Any client using Anthropic Messages API format
How it Works¶
Olla supports two modes for handling Anthropic API requests:
Passthrough Mode (Preferred)¶
When a backend natively supports the Anthropic Messages API, requests are forwarded directly without any translation overhead.
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla (Passthrough)
participant Backend as Anthropic-Compatible Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Detect native Anthropic support
Note over Olla: 2. Forward request as-is
Olla->>Backend: POST /v1/messages<br/>(Anthropic format - unchanged)
Backend->>Olla: Response (Anthropic format)
Olla->>Client: Response (Anthropic format - unchanged)
Compatible backends: vLLM (v0.11.1+), llama.cpp (b4847+), LM Studio (v0.4.1+), Ollama (v0.14.0+)
Observability: Responses include X-Olla-Mode: passthrough header.
Translation Mode (Fallback)¶
When no backend supports native Anthropic format, requests are translated to OpenAI format and responses are translated back.
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla Translator
participant Backend as OpenAI Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Validate request
Note over Olla: 2. Translate to OpenAI format
Note over Olla: 3. Route to backend
Olla->>Backend: POST /v1/chat/completions<br/>(OpenAI format)
Backend->>Olla: Response (OpenAI format)
Note over Olla: 4. Translate response back
Olla->>Client: Response (Anthropic format)
Mode Selection: Olla automatically selects the best mode based on available backend capabilities. No client-side configuration is required.
For detailed explanation of both modes, see API Translation Concept.
Endpoints Overview¶
| Method | Endpoint | Description |
|---|---|---|
| GET | /olla/anthropic/v1/models | List available models in Anthropic format |
| POST | /olla/anthropic/v1/messages | Create a message (chat completion) |
| POST | /olla/anthropic/v1/messages/count_tokens | Estimate token count for a message |
POST /olla/anthropic/v1/messages¶
Create a message with support for streaming, tool use, and vision.
Request¶
Headers:
- Content-Type: application/json (required)
- anthropic-version: 2023-06-01 (optional; accepted but not enforced)
Body Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (must exist on backend) |
| messages | array | Yes | Array of message objects |
| max_tokens | integer | Yes | Maximum tokens to generate |
| system | string | No | System prompt (translated to first message) |
| temperature | number | No | Sampling temperature (0.0-1.0) |
| top_p | number | No | Nucleus sampling parameter |
| top_k | integer | No | Top-k sampling parameter |
| stop_sequences | array | No | Stop sequences |
| stream | boolean | No | Enable streaming (default: false) |
| tools | array | No | Tool definitions for function calling |
Message Object: each message has a role ("user" or "assistant") and content, which is either a plain string or an array of content blocks.
Content Blocks (for multi-modal):
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "base64-encoded-image-data"
}
}
]
}
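For scripting, a block like the one above can be assembled with a small helper (a hypothetical sketch; note that Olla currently accepts image blocks but does not yet process them):

```python
import base64

def image_block(raw: bytes, media_type: str = "image/jpeg") -> dict:
    # Encode raw image bytes into an Anthropic base64 image content block.
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(raw).decode("ascii"),
        },
    }

# A JPEG file starts with the bytes FF D8 FF.
block = image_block(b"\xff\xd8\xff")
```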
Example: Basic Chat¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "Explain quantum computing in one sentence."
}
]
}'
Response:
{
"id": "msg_01ABC123def456GHI789jkl",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform calculations that would be infeasible for classical computers."
}
],
"model": "llama4:latest",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 15,
"output_tokens": 28
}
}
Example: Streaming¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-N \
-d '{
"model": "llama4:latest",
"max_tokens": 100,
"stream": true,
"messages": [
{
"role": "user",
"content": "Count from 1 to 5."
}
]
}'
Response (Server-Sent Events):
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","content":[],"model":"llama4:latest","usage":{"input_tokens":12,"output_tokens":0}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"1"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 2"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 3"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 4"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":","}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" 5"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":9}}
event: message_stop
data: {"type":"message_stop"}
Example: System Messages¶
The system parameter is translated to the first message with role: "system":
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"system": "You are a helpful coding assistant specialised in Python.",
"messages": [
{
"role": "user",
"content": "Write a function to calculate fibonacci numbers."
}
]
}'
How it's Translated (internally to OpenAI format):
{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{
"role": "system",
"content": "You are a helpful coding assistant specialised in Python."
},
{
"role": "user",
"content": "Write a function to calculate fibonacci numbers."
}
]
}
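The mapping above can be sketched in Python (an illustration of the observable behaviour, not Olla's actual implementation):

```python
def anthropic_to_openai(body: dict) -> dict:
    # Move the Anthropic `system` parameter into the first OpenAI message.
    messages = []
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])

    translated = {
        "model": body["model"],
        "max_tokens": body["max_tokens"],
        "messages": messages,
    }
    # Optional parameters carry over; Anthropic's stop_sequences
    # corresponds to OpenAI's `stop`.
    for key in ("temperature", "top_p", "stream"):
        if key in body:
            translated[key] = body[key]
    if "stop_sequences" in body:
        translated["stop"] = body["stop_sequences"]
    return translated

request = {
    "model": "llama4:latest",
    "max_tokens": 1024,
    "system": "You are a helpful coding assistant specialised in Python.",
    "messages": [
        {"role": "user", "content": "Write a function to calculate fibonacci numbers."}
    ],
}
translated = anthropic_to_openai(request)
```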
Example: Tool Use (Function Calling)¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"tools": [
{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
],
"messages": [
{
"role": "user",
"content": "What'\''s the weather in San Francisco?"
}
]
}'
Response (if model decides to use tool):
{
"id": "msg_tool123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "tool_use",
"id": "toolu_01ABC",
"name": "get_weather",
"input": {
"location": "San Francisco",
"unit": "fahrenheit"
}
}
],
"model": "llama4:latest",
"stop_reason": "tool_use",
"usage": {
"input_tokens": 120,
"output_tokens": 45
}
}
Note: Tool use requires a model that supports function calling. Not all local models support this feature.
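Client-side, the tool_use blocks can be pulled out of a response like the one above with a small helper (an illustrative sketch):

```python
def extract_tool_calls(response: dict) -> list:
    # Collect every tool_use content block from an Anthropic response.
    return [
        {"id": block["id"], "name": block["name"], "input": block["input"]}
        for block in response.get("content", [])
        if block.get("type") == "tool_use"
    ]

response = {
    "type": "message",
    "stop_reason": "tool_use",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_01ABC",
            "name": "get_weather",
            "input": {"location": "San Francisco", "unit": "fahrenheit"},
        }
    ],
}
calls = extract_tool_calls(response)
```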
Example: Vision (Image Input)¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "llava:latest",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "'"$(base64 -w 0 image.jpg)"'"
}
},
{
"type": "text",
"text": "What do you see in this image?"
}
]
}
]
}'
Response:
{
"id": "msg_vision123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "I see a sunset over a mountain range with orange and purple hues in the sky."
}
],
"model": "llava:latest",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 1250,
"output_tokens": 22
}
}
Note: Image content blocks are accepted in the request structure but not yet processed by Olla. Vision support requires a multi-modal model like LLaVA or similar.
GET /olla/anthropic/v1/models¶
List all available models across configured backends in Anthropic API format.
Request¶
curl http://localhost:40114/olla/anthropic/v1/models
Response¶
{
"data": [
{
"id": "llama4:latest",
"name": "llama4:latest",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
},
{
"id": "qwen2.5-coder:32b",
"name": "qwen2.5-coder:32b",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
},
{
"id": "mistral-nemo:latest",
"name": "mistral-nemo:latest",
"created": 1704067200,
"description": "Chat model via Olla proxy",
"type": "chat"
}
]
}
POST /olla/anthropic/v1/messages/count_tokens¶
Estimate token count for a message request using character-based calculation.
How It Works¶
Olla estimates token count using the formula: totalCharacters / 4
This provides a rough approximation for:
- System prompt content
- All message text content
- Tool names and inputs
- Tool result content
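A rough Python equivalent of this estimate (a sketch; Olla's exact accounting of tool names, inputs, and results may differ):

```python
def estimate_tokens(body: dict) -> int:
    # Sum characters from the system prompt and all text content,
    # then apply the chars/4 heuristic (integer division assumed).
    chars = len(body.get("system", ""))
    for message in body.get("messages", []):
        content = message.get("content", "")
        if isinstance(content, str):
            chars += len(content)
        else:
            chars += sum(len(block.get("text", "")) for block in content)
    return chars // 4

estimate = estimate_tokens({"messages": [{"role": "user", "content": "Hello world"}]})
```

"Hello world" is 11 characters, so the estimate here is 11 // 4 = 2 tokens.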
Request¶
Headers:
- Content-Type: application/json (required)
Body: Same format as a /messages request; no model execution is performed.
Example¶
Request:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:latest",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Hello world"}
]
}'
Response:
Note: This is a character-based estimation (chars/4), not actual tokenisation. Actual token counts may vary by model.
Response Format¶
Non-Streaming Response¶
Structure:
{
"id": "msg_01ABC123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Response content here"
}
],
"model": "llama4:latest",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 15,
"output_tokens": 28
}
}
Fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique message identifier |
| type | string | Always "message" |
| role | string | Always "assistant" |
| content | array | Array of content blocks |
| model | string | Model that generated the response |
| stop_reason | string | Why generation stopped |
| stop_sequence | string | Stop sequence that triggered the stop (if any) |
| usage | object | Token usage statistics |
Stop Reasons:
- end_turn - Natural completion
- max_tokens - Reached the max_tokens limit
- stop_sequence - Hit a stop sequence
- tool_use - Model wants to use a tool
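In translation mode these values are derived from OpenAI finish reasons. The likely mapping looks roughly like this (illustrative only; the exact logic lives inside Olla's translator):

```python
# Illustrative OpenAI finish_reason -> Anthropic stop_reason mapping.
FINISH_REASON_MAP = {
    "stop": "end_turn",
    "length": "max_tokens",
    "tool_calls": "tool_use",
}

def to_stop_reason(finish_reason: str, matched_stop_sequence: bool = False) -> str:
    # A matched stop sequence takes precedence over the generic "stop".
    if matched_stop_sequence:
        return "stop_sequence"
    return FINISH_REASON_MAP.get(finish_reason, "end_turn")
```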
Streaming Response¶
Streaming uses Server-Sent Events (SSE) with typed events.
SSE Event Types (6 total):
| Event | Description | Example Data |
|---|---|---|
| message_start | Initial message metadata | {"type":"message_start","message":{...}} |
| content_block_start | Start of a text or tool_use block | {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}} |
| content_block_delta | Text chunks (text_delta) or tool JSON chunks (input_json_delta) | {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"..."}} |
| content_block_stop | End of a content block | {"type":"content_block_stop","index":0} |
| message_delta | Stop reason and final usage statistics | {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":9}} |
| message_stop | End of stream | {"type":"message_stop"} |
Full Event Sequence:
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","role":"assistant","content":[],"model":"llama4:latest"}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":2}}
event: message_stop
data: {"type":"message_stop"}
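A minimal client-side parser for this event stream might look like the following sketch, which reassembles the text deltas into the full reply:

```python
import json

def parse_sse(stream: str) -> list:
    # Split an SSE body into (event_name, decoded_data) pairs.
    events = []
    event_name = None
    for line in stream.splitlines():
        if line.startswith("event: "):
            event_name = line[len("event: "):]
        elif line.startswith("data: "):
            events.append((event_name, json.loads(line[len("data: "):])))
    return events

sample = (
    'event: content_block_delta\n'
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}\n\n'
    'event: content_block_delta\n'
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}\n\n'
    'event: message_stop\n'
    'data: {"type":"message_stop"}\n'
)

# Reassemble the streamed text from the text_delta events.
text = "".join(
    data["delta"]["text"]
    for name, data in parse_sse(sample)
    if name == "content_block_delta" and data["delta"].get("type") == "text_delta"
)
```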
Response Headers¶
All responses include standard Olla headers:
| Header | Description | Example |
|---|---|---|
| X-Olla-Request-ID | Unique request identifier | req_abc123 |
| X-Olla-Endpoint | Backend endpoint name | local-ollama |
| X-Olla-Model | Actual model used | llama4:latest |
| X-Olla-Backend-Type | Backend type | ollama |
| X-Olla-Response-Time | Total processing time | 1.234s |
| X-Olla-Mode | Translator mode (only present for passthrough) | passthrough |
Detecting Passthrough Mode
When passthrough mode is active, the X-Olla-Mode: passthrough header is included in the response. When translation mode is used, this header is absent. This allows monitoring and debugging to distinguish between the two modes.
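A simple client-side check for the mode (a sketch; HTTP header names are case-insensitive, so normalise before comparing):

```python
def translator_mode(headers: dict) -> str:
    # X-Olla-Mode: passthrough is only set when passthrough mode served
    # the request; its absence implies translation mode.
    normalised = {name.lower(): value for name, value in headers.items()}
    return "passthrough" if normalised.get("x-olla-mode") == "passthrough" else "translation"

mode = translator_mode({"X-Olla-Mode": "passthrough", "X-Olla-Backend-Type": "vllm"})
```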
Translator Statistics
For aggregate translator metrics including passthrough rates, success rates, fallback reasons, and latency data, query the GET /internal/stats/translators endpoint.
Authentication¶
For Local Use: No authentication required by default.
For Production: Olla does not implement authentication at the proxy level. Use:
- Reverse proxy authentication (nginx, Traefik, Caddy)
- API gateway (Kong, Tyk)
- Network-level security (VPN, firewall)
API Key Pass-through: If the backend requires an API key, include it:
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Authorization: Bearer your-backend-api-key" \
-H "Content-Type: application/json" \
-d '{"model":"...", ...}'
Rate Limiting¶
Standard Olla rate limits apply:
| Limit Type | Default |
|---|---|
| Global requests/minute | 1000 |
| Per-IP requests/minute | 100 |
| Burst size | 50 |
Configure in config.yaml:
Error Responses¶
Errors follow the Anthropic API error format: a top-level object with "type": "error" and a nested error object containing a type and a message (an example appears under Troubleshooting below).
Common Error Types:
| HTTP Status | Error Type | Description |
|---|---|---|
| 400 | invalid_request_error | Invalid request format or parameters |
| 404 | not_found_error | Model or endpoint not found |
| 429 | rate_limit_error | Rate limit exceeded |
| 500 | api_error | Internal server error |
| 502 | overloaded_error | Backend service unavailable |
| 503 | overloaded_error | Service temporarily unavailable |
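Combining this table with the error shape shown under Troubleshooting, an error response can be modelled as follows (an illustrative sketch, not Olla's internal code):

```python
# HTTP status -> Anthropic error type, per the table above.
ERROR_TYPES = {
    400: "invalid_request_error",
    404: "not_found_error",
    429: "rate_limit_error",
    500: "api_error",
    502: "overloaded_error",
    503: "overloaded_error",
}

def error_body(status: int, message: str) -> dict:
    # Anthropic-format error envelope: {"type": "error", "error": {...}}.
    return {
        "type": "error",
        "error": {"type": ERROR_TYPES.get(status, "api_error"), "message": message},
    }

body = error_body(404, "model 'llama4:latest' not found")
```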
Differences from Official Anthropic API¶
✅ Fully Supported¶
- Basic message creation (text, multi-turn conversations)
- Streaming responses with all SSE event types
- System messages (string or content blocks)
- Tool use (definitions, tool_choice, tool_use, tool_result)
- Tool streaming with input_json_delta events
- Token counting via the /count_tokens endpoint
- Stop sequences
- Temperature, top_p, top_k parameters
- Content blocks (text, tool_use, tool_result)
- Passthrough mode for backends with native Anthropic support (zero translation overhead)
Tool Choice Mapping:
- "auto" → OpenAI "auto"
- "any" → OpenAI "required" (semantic equivalent)
- "none" → OpenAI "none"
- {"type": "tool", "name": "X"} → OpenAI {"type": "function", "function": {"name": "X"}}
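The mapping can be expressed as a small function (a sketch of the equivalences listed above):

```python
def map_tool_choice(choice):
    # Specific-tool selection becomes an OpenAI function reference.
    if isinstance(choice, dict) and choice.get("type") == "tool":
        return {"type": "function", "function": {"name": choice["name"]}}
    # String modes map directly; "any" becomes OpenAI's "required".
    return {"auto": "auto", "any": "required", "none": "none"}[choice]

mapped = map_tool_choice({"type": "tool", "name": "get_weather"})
```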
⚠️ Partially Supported¶
- Vision: Image content blocks accepted in request but not processed
- Token Counting: Character-based estimation (chars/4), not actual model tokenisation
- TopK: Passed to backend but may be ignored by OpenAI-compatible backends
❌ Not Supported¶
- Extended Thinking: Field accepted but not processed
- Prompt Caching: Not implemented
- Batches API: Not implemented
- Message Editing: Not supported
- Anthropic-specific headers: the anthropic-version header is accepted but not enforced
- Asynchronous Flows: Not supported, so Claude Code cannot dispatch concurrent/parallel agents to work on a task
Configuration¶
Anthropic translation is enabled by default. To customise, edit config.yaml:
translators:
anthropic:
enabled: true # Enabled by default
passthrough_enabled: true # Forward directly to backends with native Anthropic support (default)
max_message_size: 10485760 # Max request size (10MB)
# Standard Olla configuration
discovery:
type: static
static:
endpoints:
- url: "http://localhost:11434"
name: "local-ollama"
type: "ollama"
priority: 100
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable the Anthropic translator (enabled by default) |
| max_message_size | integer | 10485760 | Max request size in bytes (10MB) |
| passthrough_enabled | boolean | true | Passthrough optimisation mode. When true (default), requests are forwarded directly to backends with native Anthropic support for zero translation overhead. When false, all requests go through translation regardless of backend capabilities. Only applies when enabled: true. Individual backends must also declare anthropic_support in their profile. |
Passthrough Configuration¶
Passthrough mode requires two things to be active:
1. The passthrough_enabled field must be set to true in the translator configuration
2. Backend profiles must declare native Anthropic support via anthropic_support.enabled: true
translators:
anthropic:
enabled: true
passthrough_enabled: true # Required to enable passthrough mode
When passthrough_enabled is true (the default), Olla forwards requests directly to backends with native Anthropic support. Set passthrough_enabled to false to force all requests through the translation pipeline regardless of backend capabilities, which can be useful for debugging or testing the translation layer.
Backends with native Anthropic support:
| Backend | Profile | Min Version | Notes |
|---|---|---|---|
| vLLM | config/profiles/vllm.yaml | v0.11.1+ | No token counting |
| llama.cpp | config/profiles/llamacpp.yaml | b4847+ | Supports token counting |
| LM Studio | config/profiles/lmstudio.yaml | v0.4.1+ | No token counting |
| Ollama | config/profiles/ollama.yaml | v0.14.0+ | No token counting |
To disable passthrough for a specific backend, set anthropic_support.enabled: false in the profile:
# config/profiles/vllm.yaml (custom override)
api:
anthropic_support:
enabled: false # Force translation mode for this backend
Performance Considerations¶
Translation Overhead:
- Request translation: ~0.5-2ms
- Response translation: ~1-5ms
- Streaming: Minimal additional latency per chunk
Memory Usage:
- Minimal for basic messages (~1-5KB per request)
- Higher for vision (proportional to image size)
- Streaming uses constant memory
Recommendations:
- Use proxy.engine: "olla" for the lowest latency
- Enable profile: "streaming" for chat applications
- Configure an appropriate max_message_size for your use case
- Use connection pooling for high throughput
Troubleshooting¶
Model Not Found¶
Error:
{
"type": "error",
"error": {
"type": "not_found_error",
"message": "model 'llama4:latest' not found"
}
}
Solutions:
- Check the model is available: curl http://localhost:40114/olla/models
- Verify the backend is healthy: curl http://localhost:40114/internal/status/endpoints
- Pull the model if using Ollama: ollama pull llama4:latest
Streaming Not Working¶
Symptoms: No streaming events received
Solutions:
- Ensure stream: true is set in the request
- Use the -N flag with curl: curl -N ...
- Verify backend supports streaming
Tool Use Not Working¶
Symptoms: Model doesn't use tools or returns errors
Solutions:
- Verify model supports function calling (not all local models do)
- Check tool definition format matches Anthropic spec
- Try with a known tool-capable model (e.g., Llama 3.2 3B+)
High Latency¶
Symptoms: Slow response times
Solutions:
- Switch to proxy.engine: "olla" (the high-performance engine)
- Use profile: "streaming" for lower latency
- Enable connection pooling
- Use local backends (Ollama, LM Studio) for lowest latency
Next Steps¶
- API Translation Concept - Understand how translation works
- Claude Code Integration - Set up Claude Code
- OpenCode Integration - Set up OpenCode
- Crush CLI Integration - Set up Crush CLI
- Anthropic Translation Setup - Complete configuration guide
Related Documentation¶
- OpenAI API Reference - OpenAI endpoint documentation
- Model Routing - How models are selected
- Load Balancing - Request distribution
- Health Checking - Endpoint monitoring