API Translation¶
What is API Translation?¶
API translation is a capability in Olla that converts requests and responses between different LLM API formats in real-time. This enables clients designed for one API format (e.g., Anthropic Messages API) to work seamlessly with backends that implement a different format (e.g., OpenAI Chat Completions API).
Key Benefit: Use any client with any backend, regardless of their native API formats.
Why Translation is Needed¶
Different LLM providers use different API formats:
| Provider | API Format | Example Client |
|---|---|---|
| Anthropic | Messages API | Claude Code, Crush CLI |
| OpenAI | Chat Completions API | OpenAI libraries, most tools |
| Ollama | Native + OpenAI-compatible | Ollama CLI, OpenWebUI |
| vLLM | OpenAI-compatible | Standard OpenAI clients |
Without translation, you would need:
- Different proxy setups for different clients
- Client-specific backend configurations
- Separate infrastructure for each API format
With translation, Olla acts as a universal adapter:
- One proxy for all clients
- One configuration for all backends
- Seamless interoperability
- Minimal overhead and latency
How Translation Works¶
Request Flow¶
For an Anthropic Messages request (say, from Claude Code) sent to Olla's Anthropic endpoint:
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla Translator
participant Backend as OpenAI Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Validate request
Note over Olla: 2. Translate to OpenAI format
Note over Olla: 3. Route to available backend
Olla->>Backend: POST /v1/chat/completions<br/>(OpenAI format)
Backend->>Olla: Response (OpenAI format)
Note over Olla: 4. Translate to Anthropic format
Olla->>Client: Response (Anthropic format)
Translation Stages¶
Stage 1: Request Translation¶
Input (Anthropic Messages API):
{
"model": "llama4.0:latest",
"max_tokens": 1024,
"system": "You are a helpful assistant.",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7
}
Transformations:
- Extract system parameter
- Convert to first message with role: "system"
- Preserve other messages
- Map parameter names directly (e.g., max_tokens → max_tokens)
- Convert content blocks (if multi-modal)
Output (OpenAI Chat Completions):
{
"model": "llama4.0:latest",
"max_tokens": 1024,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7
}
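As an illustrative sketch (Python here, though Olla itself is written in Go), the request-side transformation amounts to hoisting the system parameter into the messages array and copying directly-mapped parameters:

```python
def anthropic_to_openai_request(req: dict) -> dict:
    """Illustrative sketch of Anthropic Messages -> OpenAI Chat Completions translation."""
    messages = []
    # The Anthropic system parameter becomes the first OpenAI message
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    # Remaining messages are preserved in order
    messages.extend(req.get("messages", []))

    out = {"model": req["model"], "messages": messages}
    # Parameters with direct equivalents are copied; stop_sequences renames to stop
    for key in ("max_tokens", "temperature", "top_p", "stop_sequences"):
        if key in req:
            out["stop" if key == "stop_sequences" else key] = req[key]
    return out
```

This omits content-block conversion for multi-modal requests; it is a sketch of the shape of the transformation, not Olla's actual implementation.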
Stage 2: Backend Processing¶
The backend (Ollama, LM Studio, vLLM, etc.) processes the request without knowing it originated from an Anthropic client. From the backend's perspective, it's a standard OpenAI request.
Stage 3: Response Translation¶
Input (OpenAI Chat Completions):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "llama4.0:latest",
"choices": [{
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 10,
"total_tokens": 25
}
}
Transformations:
- Restructure to Anthropic format
- Wrap content in content blocks
- Map finish reason ("stop" → "end_turn", "tool_calls" → "tool_use", "length" → "max_tokens")
- Extract usage information (prompt_tokens → input_tokens, completion_tokens → output_tokens)
- Generate an Anthropic-compatible message ID (Base58-encoded with msg_01 prefix)
Output (Anthropic Messages API):
{
"id": "msg_01ABC123XYZ",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello! How can I help you today?"
}
],
"model": "llama4.0:latest",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 15,
"output_tokens": 10
}
}
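A minimal sketch of the response-side mapping (illustrative Python; the ID generation below is a simplified stand-in that only mimics the msg_01 prefix and Base58 alphabet, not Olla's actual scheme):

```python
import secrets

FINISH_REASON_MAP = {"stop": "end_turn", "tool_calls": "tool_use", "length": "max_tokens"}
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def new_message_id() -> str:
    # Anthropic-style ID: msg_01 prefix plus a Base58-alphabet suffix
    return "msg_01" + "".join(secrets.choice(BASE58_ALPHABET) for _ in range(22))

def openai_to_anthropic_response(resp: dict) -> dict:
    """Illustrative sketch of OpenAI Chat Completions -> Anthropic Messages translation."""
    choice = resp["choices"][0]
    usage = resp.get("usage", {})
    return {
        "id": new_message_id(),
        "type": "message",
        "role": "assistant",
        # A plain string reply is wrapped in a single text content block
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "model": resp["model"],
        "stop_reason": FINISH_REASON_MAP.get(choice.get("finish_reason"), "end_turn"),
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }
```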
Streaming Translation¶
Streaming responses require continuous translation of Server-Sent Events (SSE):
OpenAI Streaming (input from backend):
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: [DONE]
Anthropic Streaming (output to client):
event: message_start
data: {"type":"message_start","message":{"id":"msg_01ABC","role":"assistant","model":"llama4.0:latest"}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":10}}
event: message_stop
data: {"type":"message_stop"}
Streaming Implementation Details¶
The translator uses a stateful streaming processor that:
- Tracks state - Maintains current content block, tool call buffers, and message metadata
- Buffers tool arguments - OpenAI streams tool arguments as partial JSON strings, which are buffered and parsed at the end
- Sends events in order - Ensures message_start is sent before any content blocks
- Handles context switching - Properly closes content blocks when switching between text and tool use
- Synchronous processing - Uses a blocking scanner for safer error handling (async support is planned)
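The stateful processor can be sketched as a generator that wraps OpenAI deltas in the Anthropic event sequence (illustrative Python; text-only, with tool-argument buffering and usage reporting omitted):

```python
import json

def translate_stream(openai_sse_lines, model):
    """Sketch of OpenAI -> Anthropic SSE translation for a single text reply."""
    # message_start is always emitted before any content block events
    yield "message_start", {"type": "message_start",
                            "message": {"id": "msg_01EXAMPLE", "role": "assistant", "model": model}}
    yield "content_block_start", {"type": "content_block_start", "index": 0,
                                  "content_block": {"type": "text", "text": ""}}
    for line in openai_sse_lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            # Each OpenAI delta becomes an Anthropic content_block_delta event
            yield "content_block_delta", {"type": "content_block_delta", "index": 0,
                                          "delta": {"type": "text_delta", "text": delta["content"]}}
    yield "content_block_stop", {"type": "content_block_stop", "index": 0}
    yield "message_delta", {"type": "message_delta", "delta": {"stop_reason": "end_turn"}}
    yield "message_stop", {"type": "message_stop"}
```

Feeding the OpenAI stream shown earlier through this sketch yields the same event ordering as the Anthropic example above.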
Supported Translations¶
Anthropic Messages → OpenAI Chat Completions¶
Status: ✅ Fully implemented
Supported Features:
| Feature | Anthropic Format | OpenAI Format | Translation Status |
|---|---|---|---|
| Basic messages | messages array | messages array | ✅ Full |
| System prompt | system parameter | First system message | ✅ Full |
| Content blocks | content array | String or array | ✅ Full |
| Streaming | SSE events | SSE chunks | ✅ Full (synchronous) |
| Tool use | tools array | functions array | ✅ Full |
| Tool choice | tool_choice (auto/any/tool) | tool_choice (auto/required/object) | ✅ Full with semantic mapping |
| Vision | Image content blocks | Multi-part content | ⚠️ Partial (backend dependent) |
| Stop sequences | stop_sequences | stop | ✅ Direct |
| Temperature | temperature | temperature | ✅ Direct |
| Top P | top_p | top_p | ✅ Direct |
| Top K | top_k | N/A | ⚠️ Passed through if supported |
| Max tokens | max_tokens | max_tokens | ✅ Direct |
Tool Choice Mapping¶
The translator performs semantic mapping for tool choice parameters:
| Anthropic | OpenAI | Behaviour |
|---|---|---|
"auto" | "auto" | Let model decide whether to use tools |
"any" | "required" | Force model to use a tool (any tool) |
"none" | "none" | Disable tool use |
{"type":"tool","name":"X"} | {"type":"function","function":{"name":"X"}} | Force specific tool |
Passthrough Mode¶
What is Passthrough?¶
Passthrough mode is an optimisation that bypasses the translation pipeline entirely when a backend natively supports the incoming request format. For example, vLLM (v0.11.1+), llama.cpp (b4847+), LM Studio (v0.4.1+), and Ollama (v0.14.0+) all natively support the Anthropic Messages API. When Olla detects a compatible backend, it forwards the request directly without any Anthropic-to-OpenAI-and-back conversion.
Key Benefit: Zero translation overhead -- requests are forwarded as-is, preserving the original wire format.
How It Works¶
flowchart TD
A[Client sends Anthropic request] --> B{Backend supports native Anthropic?}
B -->|Yes| C[Passthrough Mode]
B -->|No| D[Translation Mode]
C --> E[Forward request directly to backend]
E --> F[Backend processes in native Anthropic format]
F --> G[Response returned as-is]
D --> H[Translate Anthropic → OpenAI]
H --> I[Route to backend]
I --> J[Translate OpenAI → Anthropic]
J --> G
Decision Flow:
- Request arrives at /olla/anthropic/v1/messages
- Olla checks whether the translator implements PassthroughCapable
- If yes, checks whether passthrough_enabled is true in the translator config
- If yes, checks available endpoints against their profile configurations
- If all endpoints' profiles have anthropic_support.enabled: true, passthrough mode is used
- If any endpoint does not support passthrough, Olla falls back to translation mode automatically
Passthrough vs Translation Comparison¶
| Aspect | Passthrough | Translation |
|---|---|---|
| Overhead | Near zero | ~1-5ms per request |
| Backend requirement | Native Anthropic support | OpenAI-compatible |
| Request modification | None (forwarded as-is) | Full format conversion |
| Response modification | None | Full format conversion |
| Streaming | Native SSE format | SSE format conversion |
| Response header | X-Olla-Mode: passthrough | No X-Olla-Mode header |
| Feature support | Backend-dependent | Translation-dependent |
Compatible Backends¶
Backends that support passthrough (native Anthropic Messages API):
| Backend | Min Version | Token Counting | Profile Config |
|---|---|---|---|
| vLLM | v0.11.1+ | No | config/profiles/vllm.yaml |
| llama.cpp | b4847+ | Yes | config/profiles/llamacpp.yaml |
| LM Studio | v0.4.1+ | No | config/profiles/lmstudio.yaml |
| Ollama | v0.14.0+ | No | config/profiles/ollama.yaml |
Backend Profile Configuration¶
Passthrough is configured in each backend's profile YAML under the api.anthropic_support section:
# Example: config/profiles/vllm.yaml
api:
anthropic_support:
enabled: true # Enable native Anthropic support
messages_path: /v1/messages # Backend path for Messages API
token_count: false # Whether token counting is supported
min_version: "0.11.1" # Minimum backend version required
limitations: # Optional known limitations
- no_token_counting
| Field | Type | Description |
|---|---|---|
| enabled | boolean | Whether the backend supports native Anthropic format |
| messages_path | string | Backend path for the Messages API (e.g., /v1/messages) |
| token_count | boolean | Whether the backend supports /v1/messages/count_tokens |
| min_version | string | Minimum backend version with Anthropic support |
| limitations | list | Known limitations (e.g., no_token_counting, token_counting_404) |
Fallback Behaviour¶
When passthrough is not possible, Olla falls back to translation mode automatically. The fallback reason is tracked in translator metrics:
| Fallback Reason | Description |
|---|---|
| no_compatible_endpoints | No healthy endpoints available |
| translator_does_not_support_passthrough | Translator lacks PassthroughCapable interface |
| cannot_passthrough | Endpoints don't declare native Anthropic support |
Observability¶
Passthrough mode is observable through:
- Response header: X-Olla-Mode: passthrough (only set when passthrough is used)
- Translator stats endpoint: GET /internal/stats/translators exposes passthrough vs translation request counts, success rates, fallback reason breakdowns, and latency data per translator (see System Endpoints)
- Debug logs: Log entries indicate which mode was selected and why
Architecture¶
Translation Layer Position¶
┌─────────────────────────────────────────────────────────────┐
│ Olla Proxy │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Native │ │ Translator │ │ Backend │ │
│ │ Endpoints │ │ Layer │ │ Routing │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ /olla/openai/* /olla/anthropic/* Load Balancer │
│ (pass-through) (passthrough or Health Checks │
│ translate) Connection Pool │
└─────────────────────────────────────────────────────────────┘
Key Points:
- Translation is optional and transparent
- Native endpoints bypass translation entirely
- Translated endpoints use the same backend infrastructure
- Passthrough mode bypasses translation when backends natively support the format
- No impact on native endpoint performance
Where Translation Happens¶
Translation occurs in the adapter layer of Olla:
internal/
├── adapter/
│ ├── translator/
│ │ ├── types.go # PassthroughCapable interface, ProfileLookup
│ │ └── anthropic/
│ │ ├── request.go # Request translation
│ │ ├── response.go # Response translation
│ │ ├── streaming.go # SSE translation
│ │ ├── tools.go # Tool/function translation
│ │ ├── passthrough.go # Passthrough support (CanPassthrough, PreparePassthrough)
│ │ └── translator.go # Main translator
│ ├── stats/
│ │ └── translator_collector.go # Translator metrics (passthrough/translation rates)
│ └── proxy/
│ ├── sherpa/ # Uses translator
│ └── olla/ # Uses translator
├── core/
│ ├── constants/
│ │ └── translator.go # TranslatorMode, FallbackReason constants
│ ├── domain/
│ │ └── profile_config.go # AnthropicSupportConfig
│ └── ports/
│ └── stats.go # TranslatorRequestEvent, TranslatorStats
├── app/
│ └── handlers/
│ └── handler_translation.go # Passthrough/translation decision logic
Process (Translation Mode):
- Request arrives at /olla/anthropic/v1/messages
- Handler checks if passthrough is possible (see below)
- If not, translator converts request to OpenAI format
- Proxy routes to backend (standard Olla routing)
- Backend responds in OpenAI format
- Translator converts response to Anthropic format
- Response returned to client
Process (Passthrough Mode):
- Request arrives at /olla/anthropic/v1/messages
- Handler checks if the translator implements PassthroughCapable
- CanPassthrough() checks endpoint profiles for anthropic_support.enabled: true
- If compatible, PreparePassthrough() extracts the model name and target path
- Request is forwarded directly to the backend without any format conversion
- Backend responds in native Anthropic format
- Response returned to client as-is
Memory Optimisation¶
The translator uses buffer pooling to minimise memory allocations:
- Buffer pool: 4KB initial capacity for most chat completions
- Object reuse: Buffers are returned to pool after use
- GC pressure reduction: Reduces garbage collection overhead during high-throughput operations
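Olla's pool is Go's sync.Pool-style reuse; the idea can be sketched language-neutrally (illustrative Python, not the actual Go implementation):

```python
class BufferPool:
    """Sketch of buffer pooling: reuse byte buffers instead of allocating per request."""
    def __init__(self, initial_capacity=4096):  # 4KB covers most chat completions
        self.initial_capacity = initial_capacity
        self._free = []

    def get(self):
        # Reuse a returned buffer when one is available; otherwise allocate a fresh one
        return self._free.pop() if self._free else bytearray(self.initial_capacity)

    def put(self, buf):
        del buf[:]  # reset contents so stale data never leaks between requests
        self._free.append(buf)
```

Under high throughput, returning buffers to the pool avoids a fresh allocation per request, which is where the reduced GC pressure comes from.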
Benefits¶
1. Client Flexibility¶
Use any client regardless of its API format:
- Claude Code (Anthropic API only) with Ollama
- OpenAI libraries with Anthropic-formatted backends
- Mix and match clients and backends freely
2. Backend Flexibility¶
Keep existing backend infrastructure:
- No backend reconfiguration needed
- No API format changes required
- Existing OpenAI-compatible backends work as-is
3. Unified Infrastructure¶
One proxy for everything:
- Single endpoint configuration
- One load balancer for all clients
- Unified monitoring and metrics
- Consistent health checking
4. Cost Optimisation¶
Use local models with cloud-designed clients:
- No cloud API costs
- Full local model support
- Automatic failover to cloud if needed (via LiteLLM)
5. Future-Proof¶
Easy to add new translations:
- Modular translator design
- Add new API formats without changing core proxy
- Support emerging LLM API standards
Limitations¶
Translation Overhead¶
Performance Impact (estimated from implementation):
- Request translation: 0.5-2ms per request
- Response translation: 1-5ms per request
- Streaming: ~0.1-0.5ms per chunk
- Passthrough mode: Near-zero overhead (no translation)
Memory Usage:
- Minimal for basic text (~1-5KB per request)
- Proportional to content size for vision models
- Buffer pool reduces allocation overhead
Recommendation: Use passthrough mode when backends support the native Anthropic format (vLLM, llama.cpp, LM Studio, Ollama) for zero translation overhead. When no translation is needed at all, use native endpoints for maximum performance.
Feature Parity¶
Not all features translate perfectly:
Anthropic → OpenAI Limitations:
- Extended thinking: Not supported (Anthropic-specific feature)
- Prompt caching: Not supported (Anthropic-specific feature)
- Some advanced parameters may not have OpenAI equivalents
Backend Limitations:
- Tool use: Requires function-calling capable model
- Vision: Requires multi-modal model (implementation supports base64 images)
- Token counting: Estimated, not exact (depends on backend tokeniser)
- Parallel queries (especially for agent work in Claude Code) are not supported
Recommendation: For tools like Claude Code, use Claude Code Router.
Streaming Format¶
Streaming translation requires:
- Full SSE event restructuring
- Potential for slight buffering (tool arguments are buffered until complete)
- Client must support SSE with named events
- Currently synchronous (async support planned for agentic workflows)
Recommendation: For tools like Claude Code, use Claude Code Router.
Configuration¶
Translation Configuration¶
Anthropic translation is enabled by default. To customise:
translators:
anthropic:
enabled: true # Enabled by default
max_message_size: 10485760 # Max request size (10MB)
passthrough_enabled: true # Enable passthrough for backends with native Anthropic support
passthrough_enabled Optimisation Flag
The passthrough_enabled field controls whether passthrough mode is active. When true (the default), Olla forwards requests directly to backends whose profiles declare anthropic_support.enabled: true, with zero translation overhead. Set to false to force all requests through the translation pipeline regardless of backend capabilities. This only applies when enabled: true -- when the translator is disabled, passthrough_enabled has no effect.
Disable Translation¶
To disable translation and use native endpoints only:
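Based on the translator configuration shown earlier, disabling should simply mean flipping the enabled flag (a sketch, assuming the same config keys):

```yaml
translators:
  anthropic:
    enabled: false   # Anthropic translation endpoints return 404
```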
Anthropic endpoints will return 404 when disabled. By default, translation is enabled.
Performance Tuning¶
For high-throughput translation:
proxy:
engine: "olla" # Use high-performance engine
profile: "streaming" # Low-latency streaming
translators:
anthropic:
enabled: true
max_message_size: 52428800 # Increase for large requests (50MB)
Use Cases¶
Use Case 1: Claude Code with Local Models¶
Scenario: Developer wants to use Claude Code but doesn't want cloud API costs.
Solution: Olla translates Claude Code's Anthropic requests to work with local Ollama.
translators:
anthropic:
enabled: true
discovery:
static:
endpoints:
- url: "http://localhost:11434"
type: "ollama"
Command:
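As a hedged sketch (the host, port, and exact invocation are assumptions here; check your Olla configuration and the integration guide for the real values), pointing Claude Code at Olla typically means overriding the Anthropic base URL:

```shell
# Hypothetical example: direct Claude Code's Anthropic client at Olla's
# translation endpoint instead of api.anthropic.com
export ANTHROPIC_BASE_URL="http://localhost:40114/olla/anthropic"
# then launch Claude Code as usual
```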
See detailed Claude Code Integration guide.
Use Case 2: Multi-Client Support¶
Scenario: Team uses mix of OpenAI and Anthropic clients with same backend.
Solution: Olla provides both /olla/openai/* and /olla/anthropic/* endpoints.
Configuration: Same as above, supports both clients simultaneously.
Use Case 3: Cloud Fallback¶
Scenario: Use local models when available, fall back to Anthropic cloud API.
Solution: Combine translation with LiteLLM backend.
translators:
anthropic:
enabled: true
discovery:
static:
endpoints:
- url: "http://localhost:11434"
type: "ollama"
priority: 100 # Prefer local
- url: "http://localhost:4000"
type: "litellm" # LiteLLM gateway to Anthropic
priority: 50 # Fallback
Use Case 4: Testing & Development¶
Scenario: Test Anthropic API integration before deploying to production.
Solution: Use Olla translation with local models for free testing.
Benefit: Develop against Anthropic API without API costs.
Performance Considerations¶
When to Use Translation¶
Good Use Cases:
- Client requires specific API format
- Testing API integrations locally
- Cost optimisation (local instead of cloud)
- Multi-client support needed
Avoid When:
- Both client and backend use same format (use native endpoints)
- Extreme latency requirements (<10ms total)
- High throughput (>10,000 req/s per instance)
Optimisation Tips¶
- Use High-Performance Engine
- Enable Streaming Profile
- Connection Pooling
- Local Backends:
  - Prefer Ollama/LM Studio on same machine
  - Avoid network latency
- Appropriate Timeouts
- Request Size Limits
Troubleshooting¶
Translation Errors¶
Issue: Requests fail with translation errors
Possible Causes:
- Invalid request format
- Unsupported content blocks
- Malformed JSON in request
Solutions:
- Check request format matches Anthropic Messages API schema
- Enable debug logging to see translation details:
- Verify max_message_size isn't too restrictive
- Check logs for specific validation errors
Passthrough Not Activating¶
Issue: Requests are being translated instead of using passthrough mode
Possible Causes:
- passthrough_enabled is false in the translator config
- Backend profile does not declare api.anthropic_support.enabled: true
- Not all healthy endpoints support native Anthropic format
Solutions:
- Verify passthrough_enabled: true in your translator config (this is the default)
- Check the backend profile for anthropic_support.enabled: true
- Check the X-Olla-Mode response header to confirm mode selection
- Enable debug logging to see detailed mode selection reasoning
- See Anthropic Translation Setup for detailed troubleshooting
Streaming Issues¶
Issue: Streaming responses are incomplete or malformed
Possible Causes:
- Backend doesn't support streaming
- Network issues interrupting stream
- Client doesn't support named SSE events
Solutions:
- Verify backend supports streaming:
- Check client supports SSE with event names
- Monitor logs for stream processing errors
Tool Use Not Working¶
Issue: Tool/function calls not translating correctly
Possible Causes:
- Backend model doesn't support function calling
- Tool schema incompatible with backend
- Tool choice mapping issue
Solutions:
- Verify model supports tools:
- Llama 3.1+ models support function calling
- Check model capabilities in documentation
- Review tool definitions in logs
- Test with simple tool first
Next Steps¶
- Anthropic API Reference - Complete API documentation
- Anthropic Translation Setup - Configuration guide
- Claude Code Integration - Claude Code setup
- Load Balancing - Distribute translated requests
- Model Routing - Route to appropriate backends
Related Concepts¶
- Proxy Engines - Choose the right engine for performance
- Health Checking - Ensure backend availability
- Model Unification - Unified model catalogue