
Docker Model Runner Integration

Home docs.docker.com/ai/model-runner/
Since Olla v0.0.17
Type docker-model-runner (use in endpoint configuration)
Profile dmr.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (via model listing)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • Native Anthropic Messages API
  • Anthropic Token Counting
  • Multi-Engine Support (llama.cpp + vLLM)
  • OCI Model Distribution
Unsupported
  • Dedicated Health Endpoint (uses /engines/v1/models)
  • Prometheus Metrics Endpoint
  • Model Management via API (use docker model CLI)
Attributes
  • OpenAI Compatible
  • Lazy Model Loading
  • OCI Artifact Distribution
  • Hardware Accelerated (Metal / CUDA)
  • Multi-Engine (llama.cpp & vLLM)
Prefixes dmr (requests route via /olla/dmr/)
Endpoints See below

Lazy Model Loading

Docker Model Runner loads models into memory on the first inference request, not at startup. The /engines/v1/models endpoint returns 200 with an empty data array when no models have been loaded yet. This is normal and does not indicate an unhealthy endpoint.
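A quick probe of a fresh instance illustrates this (the empty response shape also appears in the Troubleshooting section below):

curl http://localhost:12434/engines/v1/models
# {"object":"list","data":[]}   <- healthy, just nothing loaded yet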

Configuration

Basic Setup

Add Docker Model Runner to your Olla configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:12434"
        name: "local-dmr"
        type: "docker-model-runner"
        priority: 95
        model_url: "/engines/v1/models"
        health_check_url: "/engines/v1/models"
        check_interval: 10s
        check_timeout: 5s
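
Once Olla is up, you can confirm the endpoint is being proxied (40114 is the Olla listen port used throughout the examples on this page):

curl http://localhost:40114/olla/dmr/engines/v1/models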

From Within a Container

When Olla itself runs inside a Docker container, use the internal hostname:

discovery:
  static:
    endpoints:
      - url: "http://model-runner.docker.internal"
        name: "dmr-internal"
        type: "docker-model-runner"
        priority: 95
        model_url: "/engines/v1/models"
        health_check_url: "/engines/v1/models"
        check_interval: 10s
        check_timeout: 5s

Production Setup

Configure multiple DMR instances across machines (Linux with NVIDIA):

discovery:
  static:
    endpoints:
      - url: "http://gpu-host-1:12434"
        name: "dmr-gpu-1"
        type: "docker-model-runner"
        priority: 95
        model_url: "/engines/v1/models"
        health_check_url: "/engines/v1/models"
        check_interval: 10s
        check_timeout: 5s

      - url: "http://gpu-host-2:12434"
        name: "dmr-gpu-2"
        type: "docker-model-runner"
        priority: 95
        model_url: "/engines/v1/models"
        health_check_url: "/engines/v1/models"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"
  load_balancer: "least-connections"

Anthropic Messages API Support

Docker Model Runner natively supports the Anthropic Messages API at /anthropic/v1/messages, enabling Olla to forward Anthropic-format requests directly without translation overhead (passthrough mode).

When Olla detects a DMR endpoint with native Anthropic support (configured via the anthropic_support section in config/profiles/dmr.yaml), it bypasses the Anthropic-to-OpenAI translation pipeline and forwards requests directly to the backend.

Profile configuration (from config/profiles/dmr.yaml):

api:
  anthropic_support:
    enabled: true
    messages_path: /anthropic/v1/messages
    token_count: true

Key details:

  • Token counting (/anthropic/v1/messages/count_tokens): Supported
  • Passthrough mode is automatic — no client-side configuration needed
  • Responses include the X-Olla-Mode: passthrough header when passthrough is active
  • Falls back to translation mode if passthrough conditions are not met
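
As an illustration, a token-counting request through Olla (the /olla/anthropic route is taken from the SDK example later on this page; the body follows the Anthropic Messages format):

curl -X POST http://localhost:40114/olla/anthropic/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [{"role": "user", "content": "How many tokens is this?"}]
  }'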

For more information, see API Translation and Anthropic API Reference.

Endpoints Supported

The following endpoints are supported by the DMR integration profile:

Path | Description
/engines/v1/models | List Models & Health Check (engine-routed)
/engines/v1/chat/completions | Chat Completions (engine-routed, OpenAI format)
/engines/v1/completions | Text Completions (engine-routed, OpenAI format)
/engines/v1/embeddings | Embeddings (engine-routed, OpenAI format)
/engines/llama.cpp/v1/models | List Models (explicit llama.cpp engine)
/engines/llama.cpp/v1/chat/completions | Chat Completions (explicit llama.cpp engine)
/engines/llama.cpp/v1/completions | Text Completions (explicit llama.cpp engine)
/engines/llama.cpp/v1/embeddings | Embeddings (explicit llama.cpp engine)
/engines/vllm/v1/models | List Models (explicit vLLM engine)
/engines/vllm/v1/chat/completions | Chat Completions (explicit vLLM engine)
/engines/vllm/v1/completions | Text Completions (explicit vLLM engine)
/engines/vllm/v1/embeddings | Embeddings (explicit vLLM engine)

The engine-routed paths (/engines/v1/...) let DMR automatically select the appropriate backend engine based on the model format. Use the explicit engine paths when you need to target a specific engine directly.

Usage Examples

Chat Completion

curl -X POST http://localhost:40114/olla/dmr/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain what Docker Model Runner is"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/dmr/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "user", "content": "Write a short story about containers"}
    ],
    "stream": true,
    "temperature": 0.8
  }'

List Models

curl http://localhost:40114/olla/dmr/engines/v1/models

Example response:

{
  "object": "list",
  "data": [
    {
      "id": "ai/smollm2",
      "object": "model",
      "created": 1734000000,
      "owned_by": "ai"
    },
    {
      "id": "ai/llama3.2",
      "object": "model",
      "created": 1734000001,
      "owned_by": "ai"
    }
  ]
}
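
To pull out just the model identifiers for scripting (assumes jq is installed):

curl -s http://localhost:40114/olla/dmr/engines/v1/models | jq -r '.data[].id'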

Explicit Engine Routing

Target the llama.cpp engine directly (useful for GGUF-specific parameters):

curl -X POST http://localhost:40114/olla/dmr/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Docker Model Runner Specifics

Multi-Engine Architecture

DMR automatically routes inference requests to the appropriate backend engine based on the model format:

Engine | Model Format | Use Case
llama.cpp | GGUF (quantised) | Default; all platforms; CPU and GPU
vLLM | safetensors | Linux + NVIDIA; high throughput

The /engines/v1/... paths use automatic engine selection. The explicit /engines/llama.cpp/v1/... and /engines/vllm/v1/... paths target a specific engine regardless of model format.

Model Naming Convention

DMR uses a namespace/name format for model identifiers, matching Docker Hub image conventions:

ai/smollm2
ai/llama3.2
ai/phi4-mini
huggingface/meta-llama-3.1-8b

Use this exact format in API requests:

{
  "model": "ai/smollm2",
  "messages": [...]
}

Lazy Model Loading

DMR does not load models into memory until the first inference request. This means:

  • The /engines/v1/models endpoint returns an empty data array until at least one model has been used
  • The first request after startup (or after a model is idle) may have higher latency
  • Olla treats an empty model list as a healthy state for DMR endpoints — this does not trigger circuit breaker logic

Supported Platforms

Platform | Acceleration | Notes
macOS (Apple Silicon) | Metal (GPU) | Default; fully supported
Linux (x86_64 + NVIDIA) | CUDA | Requires NVIDIA driver 575.57.08+
Windows (WSL2 + NVIDIA) | CUDA | Requires NVIDIA driver 576.57+
Windows (ARM64 + Qualcomm) | Adreno GPU | 6xx series and later
Linux (CPU / AMD / Intel) | Vulkan | Supported since October 2025

Base URL Reference

Access Method | Base URL
Host (macOS/Linux/Windows) | http://localhost:12434
Docker container (Docker Desktop) | http://model-runner.docker.internal
Docker container (Linux Engine) | http://172.17.0.1:12434
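
One way to check reachability from inside a container on Docker Desktop is a throwaway curl container (curlimages/curl is simply a convenient public image; any image with curl works):

docker run --rm curlimages/curl -s http://model-runner.docker.internal/engines/v1/models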

Enabling Docker Model Runner

DMR ships with Docker Desktop 4.40+ but must be enabled before use.

Enable via CLI

# Enable DMR with TCP access on port 12434
docker desktop enable model-runner --tcp 12434
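
To confirm the runner is active (docker model status is part of the model CLI; output varies by version):

docker model status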

Pull Models

Models are distributed as OCI artifacts from Docker Hub:

# Pull a small model for testing
docker model pull ai/smollm2

# Pull Llama 3.2
docker model pull ai/llama3.2

# List locally available models
docker model ls

Verify the API is Reachable

# List models via the API
curl http://localhost:12434/engines/v1/models

# Test a chat completion
curl -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

Profile Customisation

To customise DMR behaviour, create config/profiles/dmr-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: docker-model-runner
version: "1.0"

# Add additional routing prefixes
routing:
  prefixes:
    - dmr
    - docker   # Add shorter alias

# Increase timeout for large models
characteristics:
  timeout: 10m

# Adjust concurrency for your hardware
resources:
  concurrency_limits:
    - min_memory_gb: 30
      max_concurrent: 1   # Reduce for constrained hardware
    - min_memory_gb: 8
      max_concurrent: 4
    - min_memory_gb: 0
      max_concurrent: 6
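
With the extra prefix in place, the same endpoints should also answer under the /olla/docker route:

curl http://localhost:40114/olla/docker/engines/v1/models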

See Profile Configuration for complete customisation options.

Troubleshooting

DMR Not Responding

Issue: Requests to http://localhost:12434 fail with connection refused.

Solution:

  1. Verify Docker Desktop 4.40+ is installed and running
  2. Enable DMR with TCP access:
    docker desktop enable model-runner --tcp 12434
    
  3. Confirm the port is listening:
    curl http://localhost:12434/engines/v1/models
    

Empty Model List

Issue: /engines/v1/models returns {"object":"list","data":[]}.

Solution: This is expected when no models have been used yet. Pull a model first:

docker model pull ai/smollm2

After pulling, the model appears in the list only after it has been loaded (i.e., after the first inference request). This is normal lazy-loading behaviour.
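
One way to trigger that first load is a one-shot prompt via the CLI (docker model run accepts an optional prompt argument; check docker model --help on your version):

# Load the model into memory with a throwaway prompt
docker model run ai/smollm2 "Hello"

# The model should now appear in the list
curl http://localhost:12434/engines/v1/models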

Slow First Response

Issue: The first request takes significantly longer than subsequent ones.

Solution: This is expected due to lazy loading. The model is loaded into memory on the first request. Subsequent requests use the already-loaded model. Consider sending a warm-up request at startup if latency on the first user request is a concern.
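
A minimal warm-up sketch, run once at startup (model name and prompt are placeholders):

curl -s -o /dev/null -X POST http://localhost:40114/olla/dmr/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}'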

Platform Not Supported

Issue: DMR fails to start or GPU acceleration is unavailable.

Solution: Check platform requirements:

  • macOS: Requires Apple Silicon (M1 or later)
  • Linux: Requires an NVIDIA GPU with driver 575.57.08+ for CUDA acceleration; CPU and AMD/Intel GPUs are supported via Vulkan
  • Windows: Requires WSL2 with NVIDIA driver 576.57+ for NVIDIA GPU; Qualcomm Adreno 6xx+ for ARM

Model Name Not Found

Issue: Inference requests fail with model not found errors.

Solution: Use the full namespace/name format:

# Incorrect
"model": "smollm2"

# Correct
"model": "ai/smollm2"

Verify available model names with:

curl http://localhost:12434/engines/v1/models

Best Practices

1. Use the Correct Endpoint for Health Checks

DMR has no dedicated /health endpoint. Olla uses /engines/v1/models for both model discovery and health checking. This endpoint always returns 200 (even with an empty list), so Olla correctly treats DMR as healthy at startup before any models are loaded.

2. Prefer Engine-Routed Paths

Use /engines/v1/... rather than explicit engine paths unless you have a specific reason to target a particular engine. DMR selects the optimal engine automatically based on model format.

3. Account for Lazy Loading in Timeouts

The DMR profile sets a 5-minute timeout by default to accommodate first-request model loading. If you are running large models (34B+), consider increasing the timeout:

characteristics:
  timeout: 10m

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true

4. Use Priority Routing

If you have both DMR and other backends (e.g., Ollama), set priority to route requests appropriately:

endpoints:
  - url: "http://localhost:12434"
    name: "dmr-local"
    type: "docker-model-runner"
    priority: 95   # High priority — local, built-in

  - url: "http://localhost:11434"
    name: "ollama-local"
    type: "ollama"
    priority: 80   # Lower priority — fallback

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/dmr/engines/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="ai/smollm2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:40114/olla/dmr/engines/v1",
    api_key="not-needed",
    model="ai/smollm2",
    temperature=0.7
)

response = llm.invoke("Hello!")
print(response.content)

Anthropic SDK (via Passthrough)

When the target backend is DMR, Olla passes Anthropic-format requests through directly:

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:40114/olla/anthropic",
    api_key="not-needed"
)

message = client.messages.create(
    model="ai/smollm2",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(message.content[0].text)

Next Steps