Docker Model Runner API¶
Proxy endpoints for Docker Model Runner (DMR) inference servers, available through the /olla/dmr/ prefix.
Endpoints Overview¶
| Method | URI | Description |
|---|---|---|
| GET | /olla/dmr/engines/v1/models | List available models |
| POST | /olla/dmr/engines/v1/chat/completions | Chat completion |
| POST | /olla/dmr/engines/v1/completions | Text completion |
| POST | /olla/dmr/engines/v1/embeddings | Generate embeddings |
| GET | /olla/dmr/engines/llama.cpp/v1/models | List models (llama.cpp engine) |
| POST | /olla/dmr/engines/llama.cpp/v1/chat/completions | Chat completion (llama.cpp engine) |
| GET | /olla/dmr/engines/vllm/v1/models | List models (vLLM engine) |
| POST | /olla/dmr/engines/vllm/v1/chat/completions | Chat completion (vLLM engine) |
The /engines/v1/... paths use automatic engine selection. The explicit /engines/llama.cpp/v1/... and /engines/vllm/v1/... paths target a specific engine directly.
GET /olla/dmr/engines/v1/models¶
List models available on the Docker Model Runner instance.
Returns an empty data array when no models have been loaded yet — this is normal behaviour due to lazy model loading and does not indicate an unhealthy endpoint.
Request¶
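A minimal request, assuming Olla is listening on localhost:40114 as in the other examples on this page:
curl http://localhost:40114/olla/dmr/engines/v1/models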
Response¶
{
"object": "list",
"data": [
{
"id": "ai/smollm2",
"object": "model",
"created": 1734000000,
"owned_by": "ai"
},
{
"id": "ai/llama3.2",
"object": "model",
"created": 1734000001,
"owned_by": "ai"
}
]
}
POST /olla/dmr/engines/v1/chat/completions¶
OpenAI-compatible chat completion routed to the appropriate DMR engine.
Request¶
curl -X POST http://localhost:40114/olla/dmr/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is Docker Model Runner?"
}
],
"temperature": 0.7,
"max_tokens": 300,
"stream": false
}'
Response¶
{
"id": "chatcmpl-dmr-abc123",
"object": "chat.completion",
"created": 1734000000,
"model": "ai/smollm2",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Docker Model Runner is Docker's built-in LLM inference server that ships with Docker Desktop 4.40+. It enables you to pull, run, and serve large language models directly from Docker Hub as OCI artifacts, using llama.cpp for GGUF models and vLLM for safetensors models."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 54,
"total_tokens": 76
}
}
Streaming Response¶
When "stream": true:
data: {"id":"chatcmpl-dmr-abc123","object":"chat.completion.chunk","created":1734000000,"model":"ai/smollm2","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-dmr-abc123","object":"chat.completion.chunk","created":1734000000,"model":"ai/smollm2","choices":[{"index":0,"delta":{"content":"Docker"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-dmr-abc123","object":"chat.completion.chunk","created":1734000001,"model":"ai/smollm2","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
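To consume the stream with curl, send the same chat request with "stream": true; the -N flag disables output buffering so chunks are printed as they arrive:
curl -N -X POST http://localhost:40114/olla/dmr/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [{"role": "user", "content": "What is Docker Model Runner?"}],
    "stream": true
  }'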
POST /olla/dmr/engines/v1/completions¶
Text completion (non-chat) via DMR.
Request¶
curl -X POST http://localhost:40114/olla/dmr/engines/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"prompt": "Docker Model Runner is",
"max_tokens": 100,
"temperature": 0.8,
"stream": false
}'
Response¶
{
"id": "cmpl-dmr-xyz789",
"object": "text_completion",
"created": 1734000000,
"model": "ai/smollm2",
"choices": [
{
"text": " Docker's built-in LLM inference server, enabling developers to run AI models locally using the same tools they use for containers.",
"index": 0,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 28,
"total_tokens": 33
}
}
POST /olla/dmr/engines/v1/embeddings¶
Generate embeddings using a DMR model that supports embeddings.
Request¶
curl -X POST http://localhost:40114/olla/dmr/engines/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "ai/mxbai-embed-large",
"input": "Docker Model Runner provides local LLM inference",
"encoding_format": "float"
}'
Response¶
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0234, -0.0567, 0.0891, "..."]
}
],
"model": "ai/mxbai-embed-large",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
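The OpenAI-style embeddings API also accepts an array of strings in the input field; assuming the backing DMR engine supports batched input, multiple texts can be embedded in a single call:
curl -X POST http://localhost:40114/olla/dmr/engines/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/mxbai-embed-large",
    "input": [
      "Docker Model Runner provides local LLM inference",
      "Olla proxies requests to multiple inference backends"
    ]
  }'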
Explicit Engine Endpoints¶
Use these paths to target a specific inference engine, bypassing automatic engine selection.
llama.cpp Engine¶
# List models on llama.cpp engine
curl http://localhost:40114/olla/dmr/engines/llama.cpp/v1/models
# Chat completion via llama.cpp
curl -X POST http://localhost:40114/olla/dmr/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
vLLM Engine¶
# List models on vLLM engine (Linux + NVIDIA only)
curl http://localhost:40114/olla/dmr/engines/vllm/v1/models
# Chat completion via vLLM
curl -X POST http://localhost:40114/olla/dmr/engines/vllm/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/llama3.1-8b",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'
Model Naming¶
DMR uses a namespace/name format for all model identifiers:
| Example | Description |
|---|---|
| ai/smollm2 | SmolLM2 from the ai namespace |
| ai/llama3.2 | Llama 3.2 from the ai namespace |
| ai/phi4-mini | Phi-4 Mini from the ai namespace |
Always use the full namespace/name format in the model field of API requests.
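The same namespace/name identifier is used when pulling models from Docker Hub with the Docker Model Runner CLI (Docker Desktop 4.40+ with Model Runner enabled), for example:
# Pull a model onto the DMR instance
docker model pull ai/smollm2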
Request Headers¶
All requests are forwarded with:
- `X-Olla-Request-ID` - Unique request identifier
- `X-Forwarded-For` - Client IP address
- Custom headers from endpoint configuration
Response Headers¶
All responses include:
- `X-Olla-Endpoint` - Backend endpoint name (e.g., `local-dmr`)
- `X-Olla-Model` - Model used for the request
- `X-Olla-Backend-Type` - Always `docker-model-runner` for these endpoints
- `X-Olla-Response-Time` - Total processing time
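To inspect these headers on any response, include -i in the request so curl prints the response headers, for example:
# Show response headers alongside the body
curl -i http://localhost:40114/olla/dmr/engines/v1/models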
Configuration Example¶
endpoints:
- url: "http://localhost:12434"
name: "local-dmr"
type: "docker-model-runner"
priority: 95
model_url: "/engines/v1/models"
health_check_url: "/engines/v1/models"
check_interval: 10s
check_timeout: 5s
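Before pointing Olla at the endpoint, the DMR backend can be checked directly on the host port from the url above (an empty data array is still a healthy response, as noted earlier):
# Query the DMR instance directly, bypassing Olla
curl http://localhost:12434/engines/v1/models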
See the Docker Model Runner Integration Guide for full configuration and setup instructions.