# LMDeploy Integration

| Home | github.com/InternLM/lmdeploy |
|---|---|
| Since | Olla v0.0.21 |
| Type | `lmdeploy` (use in endpoint configuration) |
| Profile | `lmdeploy.yaml` (see latest) |
| Features | |
| Unsupported | |
| Attributes | |
| Prefixes | `lmdeploy` |
| Endpoints | See below |
## Configuration

### Basic Setup
Register an LMDeploy api_server instance with Olla:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:23333"
        name: "local-lmdeploy"
        type: "lmdeploy"
        priority: 82
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
```
The default port for lmdeploy serve api_server is 23333. Register individual api_server instances directly — do not point Olla at the proxy_server component, which lacks a /health endpoint and only forwards a subset of routes.
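Before relying on Olla's health checks, you can probe the instance directly; a quick sanity check, assuming the default local port:

```bash
# Probe the api_server directly, bypassing Olla
curl http://localhost:23333/health

# Confirm the served model appears in the OpenAI-format listing
curl http://localhost:23333/v1/models
```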
### Authentication
LMDeploy supports optional Bearer-token authentication via the --api-keys flag. Configure the token in Olla's endpoint headers so it is forwarded on every proxied request:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:23333"
        name: "lmdeploy-prod"
        type: "lmdeploy"
        priority: 82
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
        headers:
          Authorization: "Bearer ${LMDEPLOY_API_KEY}"
```
The /health endpoint is auth-exempt on LMDeploy, so health checks will succeed even when a key is required for inference.
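You can confirm this behaviour against the server directly. A sketch, assuming the instance was started with `--api-keys my-secret-key`:

```bash
# Health check succeeds with no credentials
curl http://gpu-server:23333/health

# Inference without the key is rejected
curl -X POST http://gpu-server:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "ping"}]}'

# The same request succeeds once the Bearer token is supplied
curl -X POST http://gpu-server:23333/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "ping"}]}'
```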
### Multiple Instances

```yaml
discovery:
  static:
    endpoints:
      - url: "http://gpu1:23333"
        name: "lmdeploy-1"
        type: "lmdeploy"
        priority: 100
      - url: "http://gpu2:23333"
        name: "lmdeploy-2"
        type: "lmdeploy"
        priority: 100

proxy:
  engine: "olla"
  load_balancer: "least-connections"
```
## Endpoints Supported

| Path | Description |
|---|---|
| `/health` | Health Check |
| `/v1/models` | List Models (OpenAI format) |
| `/v1/chat/completions` | Chat Completions (OpenAI format) |
| `/v1/completions` | Text Completions (OpenAI format) |
| `/v1/encode` | Token Encoding (LMDeploy-specific) |
| `/generate` | Native Generation Endpoint |
| `/pooling` | Reward/Score Pooling (not `/v1/embeddings`) |
| `/is_sleeping` | Sleep State Probe |
## Usage Examples

### Chat Completion
```bash
curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is TurboMind?"}
    ],
    "temperature": 0.7,
    "max_tokens": 300
  }'
```
### Streaming

```bash
curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Write a short story"}],
    "stream": true,
    "temperature": 0.8
  }'
```
### Token Encoding

```bash
curl -X POST http://localhost:40114/olla/lmdeploy/v1/encode \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "input": "Hello, world!"
  }'
```
### Pooling (Reward/Score)

```bash
# Use /pooling, not /v1/embeddings (which returns HTTP 400)
curl -X POST http://localhost:40114/olla/lmdeploy/pooling \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "input": "The quick brown fox"
  }'
```
## Starting LMDeploy

### Basic Start

#### TurboMind Backend (Default, GPU)

```bash
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend turbomind \
  --server-port 23333 \
  --tp 1
```
#### PyTorch Backend

Use the pytorch backend when a model is not supported by TurboMind, or for CPU inference.
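A minimal sketch of the start command; the entrypoint is identical to the TurboMind example, only the backend flag changes:

```bash
# Same entrypoint; select the PyTorch engine instead of TurboMind
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend pytorch \
  --server-port 23333
```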
### With Authentication

```bash
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333 \
  --api-keys my-secret-key
```
### VLM Inference

Vision-language models use the same api_server entrypoint; there is no separate binary.
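For example (the model name here is illustrative; any VLM that LMDeploy supports starts the same way):

```bash
# A vision-language model served through the standard api_server
lmdeploy serve api_server OpenGVLab/InternVL2-8B \
  --server-port 23333
```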
### Docker

```bash
docker run --gpus all \
  -p 23333:23333 \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333
```
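To avoid re-downloading weights on every container start, a common pattern (not LMDeploy-specific) is to mount the host's HuggingFace cache; the container path below assumes the image runs as root:

```bash
# Persist model weights across container restarts
docker run --gpus all \
  -p 23333:23333 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333
```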
## LMDeploy Specifics

### Sleep/Wake
LMDeploy supports a sleep mode to release GPU memory when idle:
```bash
# Suspend the engine (GPU memory freed)
curl -X POST http://localhost:23333/sleep

# Resume the engine
curl -X POST http://localhost:23333/wakeup

# Check state (proxied via Olla)
curl http://localhost:40114/olla/lmdeploy/is_sleeping
```
Olla treats a sleeping engine as transiently unavailable and will route around it if other healthy instances exist. Once the engine wakes, health checks recover it automatically.
### Embeddings vs Pooling

LMDeploy does not implement `/v1/embeddings`. The correct path for reward-model scoring and embedding-style pooling is `/pooling`. This is a deliberate upstream design decision: LMDeploy exposes TurboMind's native pooling path rather than implementing the OpenAI embeddings spec.
### Model Naming

LMDeploy serves models by their HuggingFace identifiers:

- `internlm/internlm2_5-7b-chat`
- `meta-llama/Meta-Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.2`
- `Qwen/Qwen2.5-7B-Instruct`
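The name to pass in requests is whatever the instance itself reports, which you can list through Olla:

```bash
# List model identifiers exactly as the instance serves them
curl http://localhost:40114/olla/lmdeploy/v1/models
```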
### Proxy Server vs API Server

LMDeploy ships two server components:

| Component | Port | Use with Olla? |
|---|---|---|
| `api_server` | 23333 | Yes: has `/health`, full route support |
| `proxy_server` | 8000 | No: no `/health`, limited routes |
Always register individual api_server instances. The proxy_server is LMDeploy's own load balancer and is redundant when Olla is in the stack.
## Profile Customisation
Create config/profiles/lmdeploy-custom.yaml to override defaults. See Profile Configuration for the full schema.
```yaml
name: lmdeploy
version: "1.0"

# Add an alternative routing prefix
routing:
  prefixes:
    - lmdeploy
    - turbomind

# Increase timeout for large 70B models
characteristics:
  timeout: 5m
```
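With the extra prefix loaded, the same endpoints would also answer on the alternative route; a sketch, assuming the custom profile above is active:

```bash
# The turbomind prefix routes to the same LMDeploy endpoints
curl http://localhost:40114/olla/turbomind/v1/models
```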
## OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmdeploy/v1",
    api_key="not-needed"  # any placeholder works unless lmdeploy was started with --api-keys
)

response = client.chat.completions.create(
    model="internlm/internlm2_5-7b-chat",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
## Next Steps
- LMDeploy API Reference - Endpoint details and response formats
- Profile Configuration - Customise LMDeploy behaviour
- Load Balancing - Scale across multiple LMDeploy instances
- Health Checking - Circuit breakers and failover