API Translation

What is API Translation?

API translation is a capability in Olla that converts requests and responses between different LLM API formats in real-time. This enables clients designed for one API format (e.g., Anthropic Messages API) to work seamlessly with backends that implement a different format (e.g., OpenAI Chat Completions API).

Key Benefit: Use any client with any backend, regardless of their native API formats.

Why Translation is Needed

Different LLM providers use different API formats:

| Provider  | API Format                 | Example Client               |
|-----------|----------------------------|------------------------------|
| Anthropic | Messages API               | Claude Code, Crush CLI       |
| OpenAI    | Chat Completions API       | OpenAI libraries, most tools |
| Ollama    | Native + OpenAI-compatible | Ollama CLI, OpenWebUI        |
| vLLM      | OpenAI-compatible          | Standard OpenAI clients      |

Without translation, you would need:

  • Different proxy setups for different clients
  • Client-specific backend configurations
  • Separate infrastructure for each API format

With translation, Olla acts as a universal adapter:

  • One proxy for all clients
  • One configuration for all backends
  • Seamless interoperability
  • Minimal overhead and latency

How Translation Works

Request Flow

For an Anthropic Messages request (for example, from Claude Code) sent to Olla's Anthropic endpoint:

sequenceDiagram
    participant Client as Claude Code
    participant Olla as Olla Translator
    participant Backend as OpenAI Backend

    Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)

    Note over Olla: 1. Validate request
    Note over Olla: 2. Translate to OpenAI format
    Note over Olla: 3. Route to available backend

    Olla->>Backend: POST /v1/chat/completions<br/>(OpenAI format)
    Backend->>Olla: Response (OpenAI format)

    Note over Olla: 4. Translate to Anthropic format

    Olla->>Client: Response (Anthropic format)

Translation Stages

Stage 1: Request Translation

Input (Anthropic Messages API):

{
  "model": "llama4.0:latest",
  "max_tokens": 1024,
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7
}

Transformations:

  1. Extract system parameter
  2. Convert to first message with role: "system"
  3. Preserve other messages
  4. Map parameter names directly (e.g., max_tokens → max_tokens)
  5. Convert content blocks (if multi-modal)

Output (OpenAI Chat Completions):

{
  "model": "llama4.0:latest",
  "max_tokens": 1024,
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7
}
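
To make the mapping concrete, here is a minimal Go sketch of the request-side translation described above. The struct and function names (AnthropicRequest, OpenAIRequest, TranslateRequest) are illustrative, not Olla's actual types, and only plain-text messages are handled.

package translate

// Minimal request shapes; illustrative only, not Olla's internal types.
type AnthropicRequest struct {
	Model       string             `json:"model"`
	System      string             `json:"system,omitempty"`
	Messages    []AnthropicMessage `json:"messages"`
	MaxTokens   int                `json:"max_tokens"`
	Temperature *float64           `json:"temperature,omitempty"`
}

type AnthropicMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"` // plain text only; content blocks omitted
}

type OpenAIRequest struct {
	Model       string          `json:"model"`
	Messages    []OpenAIMessage `json:"messages"`
	MaxTokens   int             `json:"max_tokens"`
	Temperature *float64        `json:"temperature,omitempty"`
}

type OpenAIMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// TranslateRequest maps an Anthropic Messages request to an OpenAI Chat
// Completions request: the system parameter becomes the first message,
// and the remaining fields map across directly.
func TranslateRequest(in AnthropicRequest) OpenAIRequest {
	out := OpenAIRequest{
		Model:       in.Model,
		MaxTokens:   in.MaxTokens,
		Temperature: in.Temperature,
	}
	if in.System != "" {
		out.Messages = append(out.Messages, OpenAIMessage{Role: "system", Content: in.System})
	}
	for _, m := range in.Messages {
		out.Messages = append(out.Messages, OpenAIMessage{Role: m.Role, Content: m.Content})
	}
	return out
}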

Stage 2: Backend Processing

The backend (Ollama, LM Studio, vLLM, etc.) processes the request without knowing it originated from an Anthropic client. From the backend's perspective, it's a standard OpenAI request.

Stage 3: Response Translation

Input (OpenAI Chat Completions):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "llama4.0:latest",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 10,
    "total_tokens": 25
  }
}

Transformations:

  1. Restructure to Anthropic format
  2. Wrap content in content blocks
  3. Map finish reason ("stop" → "end_turn", "tool_calls" → "tool_use", "length" → "max_tokens")
  4. Extract usage information (prompt_tokens → input_tokens, completion_tokens → output_tokens)
  5. Generate Anthropic-compatible message ID (Base58-encoded with msg_01 prefix)

Output (Anthropic Messages API):

{
  "id": "msg_01ABC123XYZ",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Hello! How can I help you today?"
    }
  ],
  "model": "llama4.0:latest",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 15,
    "output_tokens": 10
  }
}
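
The response-side mapping can be sketched the same way. The types, the translateResponse helper, and the Base58 ID generator below are illustrative only; they simply show the stop-reason remapping, the usage field renaming, and wrapping the content string in a text block.

package translate

import (
	"crypto/rand"
	"math/big"
)

// mapStopReason converts OpenAI finish reasons to Anthropic stop reasons.
func mapStopReason(finish string) string {
	switch finish {
	case "stop":
		return "end_turn"
	case "tool_calls":
		return "tool_use"
	case "length":
		return "max_tokens"
	default:
		return "end_turn"
	}
}

// Base58 alphabet (no 0, O, I, l); the real ID format may differ.
const base58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// newMessageID generates an Anthropic-style ID with a msg_01 prefix.
func newMessageID() string {
	b := make([]byte, 22)
	for i := range b {
		n, _ := rand.Int(rand.Reader, big.NewInt(int64(len(base58))))
		b[i] = base58[n.Int64()]
	}
	return "msg_01" + string(b)
}

// AnthropicResponse mirrors the JSON shown above (simplified).
type AnthropicResponse struct {
	ID         string         `json:"id"`
	Type       string         `json:"type"`
	Role       string         `json:"role"`
	Content    []ContentText  `json:"content"`
	Model      string         `json:"model"`
	StopReason string         `json:"stop_reason"`
	Usage      Usage          `json:"usage"`
}

type ContentText struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

type Usage struct {
	InputTokens  int `json:"input_tokens"`
	OutputTokens int `json:"output_tokens"`
}

// translateResponse restructures the key OpenAI response fields into the
// Anthropic shape: content is wrapped in a text block, usage is renamed,
// and a fresh Anthropic-style ID is generated.
func translateResponse(model, content, finish string, promptTokens, completionTokens int) AnthropicResponse {
	return AnthropicResponse{
		ID:         newMessageID(),
		Type:       "message",
		Role:       "assistant",
		Content:    []ContentText{{Type: "text", Text: content}},
		Model:      model,
		StopReason: mapStopReason(finish),
		Usage:      Usage{InputTokens: promptTokens, OutputTokens: completionTokens},
	}
}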

Streaming Translation

Streaming responses require continuous translation of Server-Sent Events (SSE):

OpenAI Streaming (input from backend):

data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":"!"}}]}

data: [DONE]

Anthropic Streaming (output to client):

event: message_start
data: {"type":"message_start","message":{"id":"msg_01ABC","role":"assistant","model":"llama4.0:latest"}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":10}}

event: message_stop
data: {"type":"message_stop"}

Streaming Implementation Details

The translator uses a stateful streaming processor that:

  1. Tracks state - Maintains current content block, tool call buffers, and message metadata
  2. Buffers tool arguments - OpenAI streams tool arguments as partial JSON strings, which are buffered and parsed at the end
  3. Sends events in order - Ensures message_start is sent before any content blocks
  4. Handles context switching - Properly closes content blocks when switching between text and tool use
  5. Synchronous processing - Uses blocking scanner for safer error handling (async support is planned)
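
Below is a heavily simplified sketch of such a stateful processor, assuming text-only deltas (no tool calls) and illustrative function names. It scans the backend's OpenAI SSE lines synchronously and writes the corresponding Anthropic events in order.

package translate

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// extractDeltaText pulls choices[0].delta.content out of an OpenAI chunk.
func extractDeltaText(payload string) string {
	var chunk struct {
		Choices []struct {
			Delta struct {
				Content string `json:"content"`
			} `json:"delta"`
		} `json:"choices"`
	}
	if err := json.Unmarshal([]byte(payload), &chunk); err != nil || len(chunk.Choices) == 0 {
		return ""
	}
	return chunk.Choices[0].Delta.Content
}

// streamText converts a text-only OpenAI SSE stream into Anthropic events.
// Note: %q produces Go string escaping, which is close enough for a sketch;
// real code should JSON-encode each event payload.
func streamText(backend io.Reader, client io.Writer, msgID, model string) error {
	// message_start and content_block_start are sent before any deltas.
	fmt.Fprintf(client, "event: message_start\ndata: {\"type\":\"message_start\",\"message\":{\"id\":%q,\"role\":\"assistant\",\"model\":%q}}\n\n", msgID, model)
	fmt.Fprint(client, "event: content_block_start\ndata: {\"type\":\"content_block_start\",\"index\":0,\"content_block\":{\"type\":\"text\",\"text\":\"\"}}\n\n")

	sc := bufio.NewScanner(backend) // synchronous, line-by-line processing
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		if text := extractDeltaText(payload); text != "" {
			fmt.Fprintf(client, "event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":%q}}\n\n", text)
		}
	}

	// Close the block and the message once the backend signals completion.
	fmt.Fprint(client, "event: content_block_stop\ndata: {\"type\":\"content_block_stop\",\"index\":0}\n\n")
	fmt.Fprint(client, "event: message_delta\ndata: {\"type\":\"message_delta\",\"delta\":{\"stop_reason\":\"end_turn\"}}\n\n")
	fmt.Fprint(client, "event: message_stop\ndata: {\"type\":\"message_stop\"}\n\n")
	return sc.Err()
}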

Supported Translations

Anthropic Messages → OpenAI Chat Completions

Status: ✅ Fully implemented

Supported Features:

| Feature        | Anthropic Format            | OpenAI Format                      | Translation Status             |
|----------------|-----------------------------|------------------------------------|--------------------------------|
| Basic messages | messages array              | messages array                     | ✅ Full                        |
| System prompt  | system parameter            | First system message               | ✅ Full                        |
| Content blocks | content array               | String or array                    | ✅ Full                        |
| Streaming      | SSE events                  | SSE chunks                         | ✅ Full (synchronous)          |
| Tool use       | tools array                 | functions array                    | ✅ Full                        |
| Tool choice    | tool_choice (auto/any/tool) | tool_choice (auto/required/object) | ✅ Full with semantic mapping  |
| Vision         | Image content blocks        | Multi-part content                 | ✅ Full                        |
| Stop sequences | stop_sequences              | stop                               | ✅ Direct                      |
| Temperature    | temperature                 | temperature                        | ✅ Direct                      |
| Top P          | top_p                       | top_p                              | ✅ Direct                      |
| Top K          | top_k                       | N/A                                | ⚠️ Passed through if supported |
| Max tokens     | max_tokens                  | max_tokens                         | ✅ Direct                      |
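
As an example of content-block translation (the Content blocks and Vision rows above), the sketch below maps Anthropic text and base64 image blocks to OpenAI multi-part content. The Go types are illustrative; the wire shapes follow the two public API formats.

package translate

import "fmt"

// AnthropicBlock covers the two content block types handled here.
type AnthropicBlock struct {
	Type   string       `json:"type"` // "text" or "image"
	Text   string       `json:"text,omitempty"`
	Source *ImageSource `json:"source,omitempty"`
}

type ImageSource struct {
	Type      string `json:"type"`       // "base64"
	MediaType string `json:"media_type"` // e.g. "image/png"
	Data      string `json:"data"`       // base64 payload
}

// OpenAIPart is one element of a multi-part "content" array.
type OpenAIPart struct {
	Type     string    `json:"type"` // "text" or "image_url"
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}

type ImageURL struct {
	URL string `json:"url"`
}

// translateBlocks converts Anthropic content blocks into OpenAI parts,
// turning base64 images into data: URLs.
func translateBlocks(blocks []AnthropicBlock) []OpenAIPart {
	parts := make([]OpenAIPart, 0, len(blocks))
	for _, b := range blocks {
		switch b.Type {
		case "text":
			parts = append(parts, OpenAIPart{Type: "text", Text: b.Text})
		case "image":
			if b.Source != nil && b.Source.Type == "base64" {
				url := fmt.Sprintf("data:%s;base64,%s", b.Source.MediaType, b.Source.Data)
				parts = append(parts, OpenAIPart{Type: "image_url", ImageURL: &ImageURL{URL: url}})
			}
		}
	}
	return parts
}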

Tool Choice Mapping

The translator performs semantic mapping for tool choice parameters:

| Anthropic                  | OpenAI                                       | Behaviour                                 |
|----------------------------|----------------------------------------------|-------------------------------------------|
| "auto"                     | "auto"                                       | Let the model decide whether to use tools |
| "any"                      | "required"                                   | Force the model to use a tool (any tool)  |
| "none"                     | "none"                                       | Disable tool use                          |
| {"type":"tool","name":"X"} | {"type":"function","function":{"name":"X"}}  | Force a specific tool                     |

Architecture

Translation Layer Position

┌─────────────────────────────────────────────────────────────┐
│                        Olla Proxy                           │
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐ │
│  │   Native     │    │  Translator  │    │   Backend    │ │
│  │  Endpoints   │    │    Layer     │    │   Routing    │ │
│  └──────────────┘    └──────────────┘    └──────────────┘ │
│         │                   │                    │         │
│         ▼                   ▼                    ▼         │
│  /olla/openai/*    /olla/anthropic/*    Load Balancer    │
│  (pass-through)      (translate)        Health Checks     │
│                                          Connection Pool   │
└─────────────────────────────────────────────────────────────┘

Key Points:

  • Translation is optional and transparent
  • Native endpoints bypass translation entirely
  • Translated endpoints use the same backend infrastructure
  • No impact on native endpoint performance

Where Translation Happens

Translation occurs in the adapter layer of Olla:

internal/
├── adapter/
│   ├── translator/
│   │   └── anthropic/
│   │       ├── request.go          # Request translation
│   │       ├── response.go         # Response translation
│   │       ├── streaming.go        # SSE translation
│   │       ├── tools.go            # Tool/function translation
│   │       └── translator.go       # Main translator
│   └── proxy/
│       ├── sherpa/                 # Uses translator
│       └── olla/                   # Uses translator

Process:

  1. Request arrives at /olla/anthropic/v1/messages
  2. Handler invokes Anthropic translator
  3. Translator converts request to OpenAI format
  4. Proxy routes to backend (standard Olla routing)
  5. Backend responds in OpenAI format
  6. Translator converts response to Anthropic format
  7. Response returned to client

Memory Optimisation

The translator uses buffer pooling to minimise memory allocations:

  • Buffer pool: 4KB initial capacity for most chat completions
  • Object reuse: Buffers are returned to pool after use
  • GC pressure reduction: Reduces garbage collection overhead during high-throughput operations
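
A minimal buffer pool along these lines, using Go's sync.Pool, might look like this; the 4KB figure comes from the description above and everything else is illustrative:

package translate

import (
	"bytes"
	"sync"
)

// bufPool hands out buffers with 4KB initial capacity, enough for most
// chat completions without reallocation.
var bufPool = sync.Pool{
	New: func() interface{} {
		return bytes.NewBuffer(make([]byte, 0, 4*1024))
	},
}

func getBuffer() *bytes.Buffer {
	return bufPool.Get().(*bytes.Buffer)
}

func putBuffer(b *bytes.Buffer) {
	b.Reset() // clear contents before returning the buffer to the pool
	bufPool.Put(b)
}

Callers would pair getBuffer() with a deferred putBuffer() around each translation, so buffers are recycled rather than reallocated under load.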

Benefits

1. Client Flexibility

Use any client regardless of its API format:

  • Claude Code (Anthropic API only) with Ollama
  • OpenAI libraries with Anthropic-formatted backends
  • Mix and match clients and backends freely

2. Backend Flexibility

Keep existing backend infrastructure:

  • No backend reconfiguration needed
  • No API format changes required
  • Existing OpenAI-compatible backends work as-is

3. Unified Infrastructure

One proxy for everything:

  • Single endpoint configuration
  • One load balancer for all clients
  • Unified monitoring and metrics
  • Consistent health checking

4. Cost Optimisation

Use local models with cloud-designed clients:

  • No cloud API costs
  • Full local model support
  • Automatic failover to cloud if needed (via LiteLLM)

5. Future-Proof

Easy to add new translations:

  • Modular translator design
  • Add new API formats without changing core proxy
  • Support emerging LLM API standards
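
Conceptually, each format can sit behind a small translator interface so new formats plug in without touching the core proxy. The interface below is a hypothetical sketch, not Olla's actual API:

package translator

import (
	"context"
	"io"
)

// Translator converts between a client-facing API format and the
// OpenAI-compatible format used when talking to backends. A new API
// format would be added by implementing this interface and mounting it
// under its own /olla/<format>/ prefix.
type Translator interface {
	// Name identifies the client-facing format, e.g. "anthropic".
	Name() string

	// TranslateRequest rewrites an incoming client request body into an
	// OpenAI-compatible request body.
	TranslateRequest(ctx context.Context, body []byte) ([]byte, error)

	// TranslateResponse rewrites a non-streaming backend response body
	// back into the client-facing format.
	TranslateResponse(ctx context.Context, body []byte) ([]byte, error)

	// TranslateStream converts a backend SSE stream into client-facing
	// SSE events as they arrive.
	TranslateStream(ctx context.Context, backend io.Reader, client io.Writer) error
}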

Limitations

Translation Overhead

Performance Impact (estimated from implementation):

  • Request translation: 0.5-2ms per request
  • Response translation: 1-5ms per request
  • Streaming: ~0.1-0.5ms per chunk

Memory Usage:

  • Minimal for basic text (~1-5KB per request)
  • Proportional to content size for vision models
  • Buffer pool reduces allocation overhead

Recommendation: Use native endpoints when translation isn't needed for maximum performance.

Feature Parity

Not all features translate perfectly:

Anthropic → OpenAI Limitations:

  • Extended thinking: Not supported (Anthropic-specific feature)
  • Prompt caching: Not supported (Anthropic-specific feature)
  • Some advanced parameters may not have OpenAI equivalents

Backend Limitations:

  • Tool use: Requires function-calling capable model
  • Vision: Requires multi-modal model (implementation supports base64 images)
  • Token counting: Estimated, not exact (depends on backend tokeniser)
  • Parallel queries (especially for agent work in Claude Code) are not supported

Recommendation: For tools like Claude Code, use Claude Code Router.

Streaming Format

Streaming translation requires:

  • Full SSE event restructuring
  • Potential for slight buffering (tool arguments are buffered until complete)
  • Client must support SSE with named events
  • Currently synchronous (async support planned for agentic workflows)

Recommendation: For tools like Claude Code, use Claude Code Router.


Configuration

Enable Translation

translators:
  anthropic:
    enabled: true                   # Enable Anthropic translator
    max_message_size: 10485760     # Max request size (10MB)

Disable Translation

To use native endpoints only:

translators:
  anthropic:
    enabled: false

Anthropic endpoints will return 404 when disabled.

Performance Tuning

For high-throughput translation:

proxy:
  engine: "olla"                    # Use high-performance engine
  profile: "streaming"              # Low-latency streaming

translators:
  anthropic:
    enabled: true
    max_message_size: 52428800     # Increase for large requests (50MB)

Use Cases

Use Case 1: Claude Code with Local Models

Scenario: Developer wants to use Claude Code but doesn't want cloud API costs.

Solution: Olla translates Claude Code's Anthropic requests to work with local Ollama.

translators:
  anthropic:
    enabled: true

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        type: "ollama"

Command:

export ANTHROPIC_API_BASE_URL="http://localhost:40114/olla/anthropic/v1"
claude-code

See detailed Claude Code Integration guide.

Use Case 2: Multi-Client Support

Scenario: Team uses mix of OpenAI and Anthropic clients with same backend.

Solution: Olla provides both /olla/openai/* and /olla/anthropic/* endpoints.

Configuration: Same as above, supports both clients simultaneously.

Use Case 3: Cloud Fallback

Scenario: Use local models when available, fall back to Anthropic cloud API.

Solution: Combine translation with LiteLLM backend.

translators:
  anthropic:
    enabled: true

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        type: "ollama"
        priority: 100                # Prefer local

      - url: "http://localhost:4000"
        type: "litellm"              # LiteLLM gateway to Anthropic
        priority: 50                 # Fallback

Use Case 4: Testing & Development

Scenario: Test Anthropic API integration before deploying to production.

Solution: Use Olla translation with local models for free testing.

Benefit: Develop against Anthropic API without API costs.


Performance Considerations

When to Use Translation

Good Use Cases:

  • Client requires specific API format
  • Testing API integrations locally
  • Cost optimisation (local instead of cloud)
  • Multi-client support needed

Avoid When:

  • Both client and backend use same format (use native endpoints)
  • Extreme latency requirements (<10ms total)
  • High throughput (>10,000 req/s per instance)

Optimisation Tips

  1. Use High-Performance Engine:

    proxy:
      engine: "olla"  # Not "sherpa"
    

  2. Enable Streaming Profile:

    proxy:
      profile: "streaming"
    

  3. Connection Pooling:

    proxy:
      max_connections_per_endpoint: 100
    

  4. Local Backends:

     • Prefer Ollama/LM Studio on the same machine
     • Avoid network latency

  5. Appropriate Timeouts:

    proxy:
      response_timeout: 300s      # 5 minutes for long generations
      read_timeout: 60s
    

  6. Request Size Limits:

    translators:
      anthropic:
        max_message_size: 10485760  # 10MB default, adjust as needed
    


Troubleshooting

Translation Errors

Issue: Requests fail with translation errors

Possible Causes:

  • Invalid request format
  • Unsupported content blocks
  • Malformed JSON in request

Solutions:

  1. Check request format matches Anthropic Messages API schema
  2. Enable debug logging to see translation details:
    logging:
      level: debug
    
  3. Verify max_message_size isn't too restrictive
  4. Check logs for specific validation errors

Streaming Issues

Issue: Streaming responses are incomplete or malformed

Possible Causes:

  • Backend doesn't support streaming
  • Network issues interrupting stream
  • Client doesn't support named SSE events

Solutions:

  1. Verify backend supports streaming:
    curl -X POST http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"llama4.0:latest","messages":[{"role":"user","content":"test"}],"stream":true}'
    
  2. Check client supports SSE with event names
  3. Monitor logs for stream processing errors

Tool Use Not Working

Issue: Tool/function calls not translating correctly

Possible Causes:

  • Backend model doesn't support function calling
  • Tool schema incompatible with backend
  • Tool choice mapping issue

Solutions:

  1. Verify the model supports tools:

     • Llama 3.1+ models support function calling
     • Check model capabilities in documentation

  2. Review tool definitions in logs
  3. Test with a simple tool first

Next Steps