
LMDeploy Integration

Home github.com/InternLM/lmdeploy
Since Olla v0.0.21
Type lmdeploy (use in endpoint configuration)
Profile lmdeploy.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (native)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • Token Encoding API
  • Reward/Score Pooling
  • VLM Inference (same api_server)
Unsupported
  • /v1/embeddings (returns HTTP 400 — use /pooling)
  • proxy_server component (no /health endpoint)
  • Model Management (loading/unloading)
Attributes
  • OpenAI Compatible
  • GPU Optimised (TurboMind C++/CUDA engine)
  • Continuous Batching
  • VLM Support
Prefixes /olla/lmdeploy/
Endpoints See below

Configuration

Basic Setup

Register an LMDeploy api_server instance with Olla:

discovery:
  static:
    endpoints:
      - url: "http://localhost:23333"
        name: "local-lmdeploy"
        type: "lmdeploy"
        priority: 82
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

The default port for lmdeploy serve api_server is 23333. Register individual api_server instances directly — do not point Olla at the proxy_server component, which lacks a /health endpoint and only forwards a subset of routes.

Authentication

LMDeploy supports optional Bearer-token authentication via the --api-keys flag. Configure the token in Olla's endpoint headers so it is forwarded on every proxied request:

discovery:
  static:
    endpoints:
      - url: "http://gpu-server:23333"
        name: "lmdeploy-prod"
        type: "lmdeploy"
        priority: 82
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
        headers:
          Authorization: "Bearer ${LMDEPLOY_API_KEY}"

The /health endpoint is auth-exempt on LMDeploy, so health checks will succeed even when a key is required for inference.
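To verify the behaviour, you can hit the instance directly: the health probe should succeed without a key, while inference requires the bearer token. The host, key variable and model name below are taken from the examples in this guide and are assumptions about your setup.

# Health check succeeds with no Authorization header
curl http://gpu-server:23333/health

# Inference requires the key when --api-keys is set
curl -X POST http://gpu-server:23333/v1/chat/completions \
  -H "Authorization: Bearer ${LMDEPLOY_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "ping"}]}'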

Multiple Instances

discovery:
  static:
    endpoints:
      - url: "http://gpu1:23333"
        name: "lmdeploy-1"
        type: "lmdeploy"
        priority: 100

      - url: "http://gpu2:23333"
        name: "lmdeploy-2"
        type: "lmdeploy"
        priority: 100

proxy:
  engine: "olla"
  load_balancer: "least-connections"
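To see the least-connections balancer spread work across both instances, fire a handful of concurrent requests through Olla and watch each api_server's logs. This is a sketch that assumes Olla's default port and the model used in the usage examples below.

# Send 8 concurrent chat requests through Olla; gpu1 and gpu2
# should each pick up a share of them
for i in $(seq 1 8); do
  curl -s -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 16}' &
done
wait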

Endpoints Supported

| Path | Description |
|------|-------------|
| /health | Health Check |
| /v1/models | List Models (OpenAI format) |
| /v1/chat/completions | Chat Completions (OpenAI format) |
| /v1/completions | Text Completions (OpenAI format) |
| /v1/encode | Token Encoding (LMDeploy-specific) |
| /generate | Native Generation Endpoint |
| /pooling | Reward/Score Pooling (not /v1/embeddings) |
| /is_sleeping | Sleep State Probe |

Usage Examples

Chat Completion

curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is TurboMind?"}
    ],
    "temperature": 0.7,
    "max_tokens": 300
  }'

Streaming

curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Write a short story"}],
    "stream": true,
    "temperature": 0.8
  }'

Token Encoding

curl -X POST http://localhost:40114/olla/lmdeploy/v1/encode \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "input": "Hello, world!"
  }'

Pooling (Reward/Score)

# Use /pooling — not /v1/embeddings (which returns HTTP 400)
curl -X POST http://localhost:40114/olla/lmdeploy/pooling \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "input": "The quick brown fox"
  }'

Starting LMDeploy

Basic Start

pip install lmdeploy

lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333

TurboMind Backend (Default, GPU)

lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend turbomind \
  --server-port 23333 \
  --tp 1

PyTorch Backend

Use the pytorch backend when a model is not supported by TurboMind, or for CPU inference:

lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend pytorch \
  --server-port 23333

With Authentication

lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333 \
  --api-keys my-secret-key

VLM Inference

Vision-language models use the same api_server entrypoint — no separate binary:

lmdeploy serve api_server InternLM/internlm-xcomposer2-7b \
  --server-port 23333

Docker

docker run --gpus all \
  -p 23333:23333 \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333
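A quick readiness check before registering the container with Olla:

# Expect a 200 once the model has finished loading
curl -i http://localhost:23333/health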

LMDeploy Specifics

Sleep/Wake

LMDeploy supports a sleep mode to release GPU memory when idle:

# Suspend the engine (GPU memory freed)
curl -X POST http://localhost:23333/sleep

# Resume the engine
curl -X POST http://localhost:23333/wakeup

# Check state (proxied via Olla)
curl http://localhost:40114/olla/lmdeploy/is_sleeping

Olla treats a sleeping engine as transiently unavailable and will route around it if other healthy instances exist. Once the engine wakes, health checks recover it automatically.
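With the multi-instance setup from earlier, you can exercise this behaviour by suspending one api_server and confirming Olla keeps serving from the other. The host names here are the ones used in the multi-instance example.

# Suspend gpu1 directly; Olla should route new requests to gpu2
curl -X POST http://gpu1:23333/sleep

# This request still succeeds via the remaining healthy instance
curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "Still there?"}]}'

# Wake gpu1 again; health checks return it to the pool
curl -X POST http://gpu1:23333/wakeup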

Embeddings vs Pooling

LMDeploy does not implement /v1/embeddings. The correct path for reward-model scoring and embedding-style pooling is /pooling. This is a deliberate upstream design decision — using TurboMind's native pooling path rather than the OpenAI embeddings spec.
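You can confirm this directly; the embeddings route is expected to fail with HTTP 400, while the equivalent /pooling call (shown in the usage examples above) succeeds.

# Expect an HTTP 400 from the unsupported embeddings route
curl -i -X POST http://localhost:40114/olla/lmdeploy/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "input": "The quick brown fox"}'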

Model Naming

LMDeploy serves models by their HuggingFace identifiers:

  • internlm/internlm2_5-7b-chat
  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.2
  • Qwen/Qwen2.5-7B-Instruct
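The exact identifiers a running instance exposes can be listed through Olla and used verbatim in the model field of requests:

# Returns the served model IDs in OpenAI list format
curl http://localhost:40114/olla/lmdeploy/v1/models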

Proxy Server vs API Server

LMDeploy ships two server components:

| Component | Port | Use with Olla? |
|-----------|------|----------------|
| api_server | 23333 | Yes — has /health, full route support |
| proxy_server | 8000 | No — no /health, limited routes |

Always register individual api_server instances. The proxy_server is LMDeploy's own load balancer and is redundant when Olla is in the stack.

Profile Customisation

Create config/profiles/lmdeploy-custom.yaml to override defaults. See Profile Configuration for the full schema.

name: lmdeploy
version: "1.0"

# Add a shorter routing prefix
routing:
  prefixes:
    - lmdeploy
    - turbomind

# Increase timeout for large 70B models
characteristics:
  timeout: 5m
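Assuming custom prefixes are routed the same way as the default lmdeploy prefix, requests can then use the new path:

# Same upstream, reached via the added "turbomind" prefix
curl http://localhost:40114/olla/turbomind/v1/models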

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmdeploy/v1",
    api_key="not-needed"  # placeholder; use the real key if lmdeploy was started with --api-keys
)

response = client.chat.completions.create(
    model="internlm/internlm2_5-7b-chat",
    messages=[{"role": "user", "content": "Hello!"}]
)

Next Steps