API Translation¶
What is API Translation?¶
API translation is a capability in Olla that converts requests and responses between different LLM API formats in real-time. This enables clients designed for one API format (e.g., Anthropic Messages API) to work seamlessly with backends that implement a different format (e.g., OpenAI Chat Completions API).
Key Benefit: Use any client with any backend, regardless of their native API formats.
Why Translation is Needed¶
Different LLM providers use different API formats:
| Provider | API Format | Example Client |
|---|---|---|
| Anthropic | Messages API | Claude Code, Crush CLI |
| OpenAI | Chat Completions API | OpenAI libraries, most tools |
| Ollama | Native + OpenAI-compatible | Ollama CLI, OpenWebUI |
| vLLM | OpenAI-compatible | Standard OpenAI clients |
Without translation, you would need:
- Different proxy setups for different clients
- Client-specific backend configurations
- Separate infrastructure for each API format
With translation, Olla acts as a universal adapter:
- One proxy for all clients
- One configuration for all backends
- Seamless interoperability
- Minimal overhead and latency
How Translation Works¶
Request Flow¶
For an Anthropic Messages request (say, from Claude Code) sent to Olla's Anthropic endpoint:
sequenceDiagram
participant Client as Claude Code
participant Olla as Olla Translator
participant Backend as OpenAI Backend
Client->>Olla: POST /olla/anthropic/v1/messages<br/>(Anthropic format)
Note over Olla: 1. Validate request
Note over Olla: 2. Translate to OpenAI format
Note over Olla: 3. Route to available backend
Olla->>Backend: POST /v1/chat/completions<br/>(OpenAI format)
Backend->>Olla: Response (OpenAI format)
Note over Olla: 4. Translate to Anthropic format
Olla->>Client: Response (Anthropic format)
Translation Stages¶
Stage 1: Request Translation¶
Input (Anthropic Messages API):
{
"model": "llama4.0:latest",
"max_tokens": 1024,
"system": "You are a helpful assistant.",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7
}
Transformations:
- Extract system parameter
- Convert to first message with role: "system"
- Preserve other messages
- Map parameter names directly (e.g., max_tokens → max_tokens)
- Convert content blocks (if multi-modal)
Output (OpenAI Chat Completions):
{
"model": "llama4.0:latest",
"max_tokens": 1024,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7
}
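As an illustrative sketch (Python here, though Olla itself is written in Go), the request-side transformation amounts to hoisting the system parameter into the messages array and copying directly-mapped parameters:

```python
def anthropic_to_openai_request(req: dict) -> dict:
    """Illustrative sketch of Anthropic Messages -> OpenAI Chat Completions translation."""
    messages = []
    # The Anthropic system parameter becomes the first OpenAI message
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    # Remaining messages are preserved in order
    messages.extend(req.get("messages", []))

    out = {"model": req["model"], "messages": messages}
    # Parameters with direct equivalents are copied; stop_sequences renames to stop
    for key in ("max_tokens", "temperature", "top_p", "stop_sequences"):
        if key in req:
            out["stop" if key == "stop_sequences" else key] = req[key]
    return out
```

This omits content-block conversion for multi-modal requests; it is a sketch of the shape of the transformation, not Olla's actual implementation.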
Stage 2: Backend Processing¶
The backend (Ollama, LM Studio, vLLM, etc.) processes the request without knowing it originated from an Anthropic client. From the backend's perspective, it's a standard OpenAI request.
Stage 3: Response Translation¶
Input (OpenAI Chat Completions):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "llama4.0:latest",
"choices": [{
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 10,
"total_tokens": 25
}
}
Transformations:
- Restructure to Anthropic format
- Wrap content in content blocks
- Map finish reason ("stop" → "end_turn", "tool_calls" → "tool_use", "length" → "max_tokens")
- Extract usage information (prompt_tokens → input_tokens, completion_tokens → output_tokens)
- Generate an Anthropic-compatible message ID (Base58-encoded with msg_01 prefix)
Output (Anthropic Messages API):
{
"id": "msg_01ABC123XYZ",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello! How can I help you today?"
}
],
"model": "llama4.0:latest",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 15,
"output_tokens": 10
}
}
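A minimal sketch of the response-side mapping (illustrative Python; the ID generation below is a simplified stand-in that only mimics the msg_01 prefix and Base58 alphabet, not Olla's actual scheme):

```python
import secrets

FINISH_REASON_MAP = {"stop": "end_turn", "tool_calls": "tool_use", "length": "max_tokens"}
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def new_message_id() -> str:
    # Anthropic-style ID: msg_01 prefix plus a Base58-alphabet suffix
    return "msg_01" + "".join(secrets.choice(BASE58_ALPHABET) for _ in range(22))

def openai_to_anthropic_response(resp: dict) -> dict:
    """Illustrative sketch of OpenAI Chat Completions -> Anthropic Messages translation."""
    choice = resp["choices"][0]
    usage = resp.get("usage", {})
    return {
        "id": new_message_id(),
        "type": "message",
        "role": "assistant",
        # A plain string reply is wrapped in a single text content block
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "model": resp["model"],
        "stop_reason": FINISH_REASON_MAP.get(choice.get("finish_reason"), "end_turn"),
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }
```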
Streaming Translation¶
Streaming responses require continuous translation of Server-Sent Events (SSE):
OpenAI Streaming (input from backend):
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: [DONE]
Anthropic Streaming (output to client):
event: message_start
data: {"type":"message_start","message":{"id":"msg_01ABC","role":"assistant","model":"llama4.0:latest"}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":10}}
event: message_stop
data: {"type":"message_stop"}
Streaming Implementation Details¶
The translator uses a stateful streaming processor that:
- Tracks state - Maintains current content block, tool call buffers, and message metadata
- Buffers tool arguments - OpenAI streams tool arguments as partial JSON strings, which are buffered and parsed at the end
- Sends events in order - Ensures message_start is sent before any content blocks
- Handles context switching - Properly closes content blocks when switching between text and tool use
- Synchronous processing - Uses a blocking scanner for safer error handling (async support is planned)
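The stateful processor can be sketched as a generator that wraps OpenAI deltas in the Anthropic event sequence (illustrative Python; text-only, with tool-argument buffering and usage reporting omitted):

```python
import json

def translate_stream(openai_sse_lines, model):
    """Sketch of OpenAI -> Anthropic SSE translation for a single text reply."""
    # message_start is always emitted before any content block events
    yield "message_start", {"type": "message_start",
                            "message": {"id": "msg_01EXAMPLE", "role": "assistant", "model": model}}
    yield "content_block_start", {"type": "content_block_start", "index": 0,
                                  "content_block": {"type": "text", "text": ""}}
    for line in openai_sse_lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            # Each OpenAI delta becomes an Anthropic content_block_delta event
            yield "content_block_delta", {"type": "content_block_delta", "index": 0,
                                          "delta": {"type": "text_delta", "text": delta["content"]}}
    yield "content_block_stop", {"type": "content_block_stop", "index": 0}
    yield "message_delta", {"type": "message_delta", "delta": {"stop_reason": "end_turn"}}
    yield "message_stop", {"type": "message_stop"}
```

Feeding the OpenAI stream shown earlier through this sketch yields the same event ordering as the Anthropic example above.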
Supported Translations¶
Anthropic Messages → OpenAI Chat Completions¶
Status: ✅ Fully implemented
Supported Features:
| Feature | Anthropic Format | OpenAI Format | Translation Status |
|---|---|---|---|
| Basic messages | messages array | messages array | ✅ Full |
| System prompt | system parameter | First system message | ✅ Full |
| Content blocks | content array | String or array | ✅ Full |
| Streaming | SSE events | SSE chunks | ✅ Full (synchronous) |
| Tool use | tools array | functions array | ✅ Full |
| Tool choice | tool_choice (auto/any/tool) | tool_choice (auto/required/object) | ✅ Full with semantic mapping |
| Vision | Image content blocks | Multi-part content | ⚠️ Partial (backend dependent) |
| Stop sequences | stop_sequences | stop | ✅ Direct |
| Temperature | temperature | temperature | ✅ Direct |
| Top P | top_p | top_p | ✅ Direct |
| Top K | top_k | N/A | ⚠️ Passed through if supported |
| Max tokens | max_tokens | max_tokens | ✅ Direct |
Tool Choice Mapping¶
The translator performs semantic mapping for tool choice parameters:
| Anthropic | OpenAI | Behaviour |
|---|---|---|
"auto" | "auto" | Let model decide whether to use tools |
"any" | "required" | Force model to use a tool (any tool) |
"none" | "none" | Disable tool use |
{"type":"tool","name":"X"} | {"type":"function","function":{"name":"X"}} | Force specific tool |
Passthrough Mode¶
What is Passthrough?¶
Passthrough mode is an optimisation that bypasses the translation pipeline entirely when a backend natively supports the incoming request format. For example, vLLM (v0.11.1+), llama.cpp (b4847+), LM Studio (v0.4.1+), and Ollama (v0.14.0+) all natively support the Anthropic Messages API. When Olla detects a compatible backend, it forwards the request directly without any Anthropic-to-OpenAI-and-back conversion.
Key Benefit: Zero translation overhead -- requests are forwarded as-is, preserving the original wire format.
How It Works¶
flowchart TD
A[Client sends Anthropic request] --> B{Backend supports native Anthropic?}
B -->|Yes| C[Passthrough Mode]
B -->|No| D[Translation Mode]
C --> E[Forward request directly to backend]
E --> F[Backend processes in native Anthropic format]
F --> G[Response returned as-is]
D --> H[Translate Anthropic → OpenAI]
H --> I[Route to backend]
I --> J[Translate OpenAI → Anthropic]
J --> G
Decision Flow:
- Request arrives at /olla/anthropic/v1/messages
- Olla checks whether the translator implements PassthroughCapable
- If yes, checks whether passthrough_enabled is true in the translator config
- If yes, checks available endpoints against their profile configurations
- If all endpoints' profiles have anthropic_support.enabled: true, passthrough mode is used
- If any endpoint does not support passthrough, Olla falls back to translation mode automatically
Passthrough vs Translation Comparison¶
| Aspect | Passthrough | Translation |
|---|---|---|
| Overhead | Near zero | ~1-5ms per request |
| Backend requirement | Native Anthropic support | OpenAI-compatible |
| Request modification | None (forwarded as-is) | Full format conversion |
| Response modification | None | Full format conversion |
| Streaming | Native SSE format | SSE format conversion |
| Response header | X-Olla-Mode: passthrough | No X-Olla-Mode header |
| Feature support | Backend-dependent | Translation-dependent |
Compatible Backends¶
Backends that support passthrough (native Anthropic Messages API):
| Backend | Min Version | Token Counting | Profile Config |
|---|---|---|---|
| vLLM | v0.11.1+ | No | config/profiles/vllm.yaml |
| llama.cpp | b4847+ | Yes | config/profiles/llamacpp.yaml |
| LM Studio | v0.4.1+ | No | config/profiles/lmstudio.yaml |
| Ollama | v0.14.0+ | No | config/profiles/ollama.yaml |
Backend Profile Configuration¶
Passthrough is configured in each backend's profile YAML under the api.anthropic_support section:
# Example: config/profiles/vllm.yaml
api:
anthropic_support:
enabled: true # Enable native Anthropic support
messages_path: /v1/messages # Backend path for Messages API
token_count: false # Whether token counting is supported
min_version: "0.11.1" # Minimum backend version required
limitations: # Optional known limitations
- no_token_counting
| Field | Type | Description |
|---|---|---|
| enabled | boolean | Whether the backend supports native Anthropic format |
| messages_path | string | Backend path for the Messages API (e.g., /v1/messages) |
| token_count | boolean | Whether the backend supports /v1/messages/count_tokens |
| min_version | string | Minimum backend version with Anthropic support |
| limitations | list | Known limitations (e.g., no_token_counting, token_counting_404) |
Fallback Behaviour¶
When passthrough is not possible, Olla falls back to translation mode automatically. The fallback reason is tracked in translator metrics:
| Fallback Reason | Description |
|---|---|
| no_compatible_endpoints | No healthy endpoints available |
| translator_does_not_support_passthrough | Translator lacks PassthroughCapable interface |
| cannot_passthrough | Endpoints don't declare native Anthropic support |
Observability¶
Passthrough mode is observable through:
- Response header: X-Olla-Mode: passthrough (only set when passthrough is used)
- Translator stats endpoint: GET /internal/stats/translators exposes passthrough vs translation request counts, success rates, fallback reason breakdowns, and latency data per translator (see System Endpoints)
- Debug logs: Log entries indicate which mode was selected and why
Architecture¶
Translation Layer Position¶
┌─────────────────────────────────────────────────────────────┐
│ Olla Proxy │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Native │ │ Translator │ │ Backend │ │
│ │ Endpoints │ │ Layer │ │ Routing │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ /olla/openai/* /olla/anthropic/* Load Balancer │
│ (pass-through) (passthrough or Health Checks │
│ translate) Connection Pool │
└─────────────────────────────────────────────────────────────┘
Key Points:
- Translation is optional and transparent
- Native endpoints bypass translation entirely
- Translated endpoints use the same backend infrastructure
- Passthrough mode bypasses translation when backends natively support the format
- No impact on native endpoint performance
Where Translation Happens¶
Translation occurs in the adapter layer of Olla:
internal/
├── adapter/
│ ├── translator/
│ │ ├── types.go # PassthroughCapable interface, ProfileLookup
│ │ └── anthropic/
│ │ ├── request.go # Request translation
│ │ ├── response.go # Response translation
│ │ ├── streaming.go # SSE translation
│ │ ├── tools.go # Tool/function translation
│ │ ├── passthrough.go # Passthrough support (CanPassthrough, PreparePassthrough)
│ │ └── translator.go # Main translator
│ ├── stats/
│ │ └── translator_collector.go # Translator metrics (passthrough/translation rates)
│ └── proxy/
│ ├── sherpa/ # Uses translator
│ └── olla/ # Uses translator
├── core/
│ ├── constants/
│ │ └── translator.go # TranslatorMode, FallbackReason constants
│ ├── domain/
│ │ └── profile_config.go # AnthropicSupportConfig
│ └── ports/
│ └── stats.go # TranslatorRequestEvent, TranslatorStats
├── app/
│ └── handlers/
│ └── handler_translation.go # Passthrough/translation decision logic
Process (Translation Mode):
- Request arrives at /olla/anthropic/v1/messages
- Handler checks if passthrough is possible (see below)
- If not, translator converts request to OpenAI format
- Proxy routes to backend (standard Olla routing)
- Backend responds in OpenAI format
- Translator converts response to Anthropic format
- Response returned to client
Process (Passthrough Mode):
- Request arrives at /olla/anthropic/v1/messages
- Handler checks if the translator implements PassthroughCapable
- CanPassthrough() checks endpoint profiles for anthropic_support.enabled: true
- If compatible, PreparePassthrough() extracts the model name and target path
- Request is forwarded directly to the backend without any format conversion
- Backend responds in native Anthropic format
- Response returned to client as-is
Memory Optimisation¶
The translator uses buffer pooling to minimise memory allocations:
- Buffer pool: 4KB initial capacity for most chat completions
- Object reuse: Buffers are returned to pool after use
- GC pressure reduction: Reduces garbage collection overhead during high-throughput operations
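Olla's pool is Go's sync.Pool-style reuse; the idea can be sketched language-neutrally (illustrative Python, not the actual Go implementation):

```python
class BufferPool:
    """Sketch of buffer pooling: reuse byte buffers instead of allocating per request."""
    def __init__(self, initial_capacity=4096):  # 4KB covers most chat completions
        self.initial_capacity = initial_capacity
        self._free = []

    def get(self):
        # Reuse a returned buffer when one is available; otherwise allocate a fresh one
        return self._free.pop() if self._free else bytearray(self.initial_capacity)

    def put(self, buf):
        del buf[:]  # reset contents so stale data never leaks between requests
        self._free.append(buf)
```

Under high throughput, returning buffers to the pool avoids a fresh allocation per request, which is where the reduced GC pressure comes from.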
Benefits¶
1. Client Flexibility¶
Use any client regardless of its API format:
- Claude Code (Anthropic API only) with Ollama
- OpenAI libraries with Anthropic-formatted backends
- Mix and match clients and backends freely
2. Backend Flexibility¶
Keep existing backend infrastructure:
- No backend reconfiguration needed
- No API format changes required
- Existing OpenAI-compatible backends work as-is
3. Unified Infrastructure¶
One proxy for everything:
- Single endpoint configuration
- One load balancer for all clients
- Unified monitoring and metrics
- Consistent health checking
4. Cost Optimisation¶
Use local models with cloud-designed clients:
- No cloud API costs
- Full local model support
- Automatic failover to cloud if needed (via LiteLLM)
5. Future-Proof¶
Easy to add new translations:
- Modular translator design
- Add new API formats without changing core proxy
- Support emerging LLM API standards
Limitations¶
Translation Overhead¶
Performance Impact (estimated from implementation):
- Request translation: 0.5-2ms per request
- Response translation: 1-5ms per request
- Streaming: ~0.1-0.5ms per chunk
- Passthrough mode: Near-zero overhead (no translation)
Memory Usage:
- Minimal for basic text (~1-5KB per request)
- Proportional to content size for vision models
- Buffer pool reduces allocation overhead
Recommendation: Use passthrough mode when backends support the native Anthropic format (vLLM, llama.cpp, LM Studio, Ollama) for zero translation overhead. When no translation is needed at all, use native endpoints for maximum performance.
Feature Parity¶
Not all features translate perfectly:
Anthropic → OpenAI Limitations:
- Extended thinking: Not supported (Anthropic-specific feature)
- Prompt caching: Not supported (Anthropic-specific feature)
- Some advanced parameters may not have OpenAI equivalents
Backend Limitations:
- Tool use: Requires function-calling capable model
- Vision: Requires multi-modal model (implementation supports base64 images)
- Token counting: Estimated, not exact (depends on backend tokeniser)
- Parallel queries (especially for agent work in Claude Code) are not supported
Recommendation: For tools like Claude Code, use Claude Code Router.
Streaming Format¶
Streaming translation requires:
- Full SSE event restructuring
- Potential for slight buffering (tool arguments are buffered until complete)
- Client must support SSE with named events
- Currently synchronous (async support planned for agentic workflows)
Recommendation: For tools like Claude Code, use Claude Code Router.
Configuration¶
Translation Configuration¶
Anthropic translation is enabled by default. To customise:
translators:
anthropic:
enabled: true # Enabled by default
max_message_size: 10485760 # Max request size (10MB)
passthrough_enabled: true # Enable passthrough for backends with native Anthropic support
passthrough_enabled Optimisation Flag
The passthrough_enabled field controls whether passthrough mode is active. When true (the default), Olla forwards requests directly to backends whose profiles declare anthropic_support.enabled: true, with zero translation overhead. Set to false to force all requests through the translation pipeline regardless of backend capabilities. This only applies when enabled: true -- when the translator is disabled, passthrough_enabled has no effect.
Disable Translation¶
To disable translation and use native endpoints only:
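Based on the translator configuration shown earlier, disabling should simply mean flipping the enabled flag (a sketch, assuming the same config keys):

```yaml
translators:
  anthropic:
    enabled: false   # Anthropic translation endpoints return 404
```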
Anthropic endpoints will return 404 when disabled. By default, translation is enabled.
Performance Tuning¶
For high-throughput translation:
proxy:
engine: "olla" # Use high-performance engine
profile: "streaming" # Low-latency streaming
translators:
anthropic:
enabled: true
max_message_size: 52428800 # Increase for large requests (50MB)
Use Cases¶
Use Case 1: Claude Code with Local Models¶
Scenario: Developer wants to use Claude Code but doesn't want cloud API costs.
Solution: Olla translates Claude Code's Anthropic requests to work with local Ollama.
translators:
anthropic:
enabled: true
discovery:
static:
endpoints:
- url: "http://localhost:11434"
type: "ollama"
Command:
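As a hedged sketch (the host, port, and exact invocation are assumptions here; check your Olla configuration and the integration guide for the real values), pointing Claude Code at Olla typically means overriding the Anthropic base URL:

```shell
# Hypothetical example: direct Claude Code's Anthropic client at Olla's
# translation endpoint instead of api.anthropic.com
export ANTHROPIC_BASE_URL="http://localhost:40114/olla/anthropic"
# then launch Claude Code as usual
```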
See detailed Claude Code Integration guide.
Use Case 2: Multi-Client Support¶
Scenario: Team uses mix of OpenAI and Anthropic clients with same backend.
Solution: Olla provides both /olla/openai/* and /olla/anthropic/* endpoints.
Configuration: Same as above, supports both clients simultaneously.
Use Case 3: Cloud Fallback¶
Scenario: Use local models when available, fall back to Anthropic cloud API.
Solution: Combine translation with LiteLLM backend.
translators:
anthropic:
enabled: true
discovery:
static:
endpoints:
- url: "http://localhost:11434"
type: "ollama"
priority: 100 # Prefer local
- url: "http://localhost:4000"
type: "litellm" # LiteLLM gateway to Anthropic
priority: 50 # Fallback
Use Case 4: Testing & Development¶
Scenario: Test Anthropic API integration before deploying to production.
Solution: Use Olla translation with local models for free testing.
Benefit: Develop against Anthropic API without API costs.
Performance Considerations¶
When to Use Translation¶
Good Use Cases:
- Client requires specific API format
- Testing API integrations locally
- Cost optimisation (local instead of cloud)
- Multi-client support needed
Avoid When:
- Both client and backend use same format (use native endpoints)
- Extreme latency requirements (<10ms total)
- High throughput (>10,000 req/s per instance)
Optimisation Tips¶
- Use High-Performance Engine
- Enable Streaming Profile
- Connection Pooling
- Local Backends:
  - Prefer Ollama/LM Studio on same machine
  - Avoid network latency
- Appropriate Timeouts
- Request Size Limits
Troubleshooting¶
Translation Errors¶
Issue: Requests fail with translation errors
Possible Causes:
- Invalid request format
- Unsupported content blocks
- Malformed JSON in request
Solutions:
- Check request format matches Anthropic Messages API schema
- Enable debug logging to see translation details:
- Verify max_message_size isn't too restrictive
- Check logs for specific validation errors
Passthrough Not Activating¶
Issue: Requests are being translated instead of using passthrough mode
Possible Causes:
- passthrough_enabled is false in the translator config
- Backend profile does not declare api.anthropic_support.enabled: true
- Not all healthy endpoints support native Anthropic format
Solutions:
- Verify passthrough_enabled: true in your translator config (this is the default)
- Check the backend profile for anthropic_support.enabled: true
- Check the X-Olla-Mode response header to confirm mode selection
- Enable debug logging to see detailed mode selection reasoning
- See Anthropic Translation Setup for detailed troubleshooting
Streaming Issues¶
Issue: Streaming responses are incomplete or malformed
Possible Causes:
- Backend doesn't support streaming
- Network issues interrupting stream
- Client doesn't support named SSE events
Solutions:
- Verify backend supports streaming:
- Check client supports SSE with event names
- Monitor logs for stream processing errors
Tool Use Not Working¶
Issue: Tool/function calls not translating correctly
Possible Causes:
- Backend model doesn't support function calling
- Tool schema incompatible with backend
- Tool choice mapping issue
Solutions:
- Verify model supports tools:
- Llama 3.1+ models support function calling
- Check model capabilities in documentation
- Review tool definitions in logs
- Test with simple tool first
Next Steps¶
- Anthropic API Reference - Complete API documentation
- Anthropic Translation Setup - Configuration guide
- Claude Code Integration - Claude Code setup
- Load Balancing - Distribute translated requests
- Model Routing - Route to appropriate backends
Related Concepts¶
- Proxy Engines - Choose the right engine for performance
- Health Checking - Ensure backend availability
- Model Unification - Unified model catalogue