# LMDeploy Integration

| Home | github.com/InternLM/lmdeploy |
|---|---|
| Since | Olla v0.0.21 |
| Type | `lmdeploy` (use in endpoint configuration) |
| Profile | `lmdeploy.yaml` (see latest) |
| Features | |
| Unsupported | |
| Attributes | |
| Prefixes | `lmdeploy` |
| Endpoints | See below |
## Configuration

### Basic Setup
Register an LMDeploy api_server instance with Olla:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:23333"
        name: "local-lmdeploy"
        type: "lmdeploy"
        priority: 82
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
```
The default port for lmdeploy serve api_server is 23333. Register individual api_server instances directly — do not point Olla at the proxy_server component, which lacks a /health endpoint and only forwards a subset of routes.
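Before relying on Olla's health checks, you can probe the instance directly; a quick sanity check, assuming the default local port:

```bash
# Probe the api_server directly, bypassing Olla
curl http://localhost:23333/health

# Confirm the served model appears in the OpenAI-format listing
curl http://localhost:23333/v1/models
```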
### Authentication
LMDeploy supports optional Bearer-token authentication via the --api-keys flag. Configure the token in Olla's endpoint headers so it is forwarded on every proxied request:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:23333"
        name: "lmdeploy-prod"
        type: "lmdeploy"
        priority: 82
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
        headers:
          Authorization: "Bearer ${LMDEPLOY_API_KEY}"
```
The /health endpoint is auth-exempt on LMDeploy, so health checks will succeed even when a key is required for inference.
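You can confirm this behaviour against the server directly. A sketch, assuming the instance was started with `--api-keys my-secret-key`:

```bash
# Health check succeeds with no credentials
curl http://gpu-server:23333/health

# Inference without the key is rejected
curl -X POST http://gpu-server:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "ping"}]}'

# The same request succeeds once the Bearer token is supplied
curl -X POST http://gpu-server:23333/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm/internlm2_5-7b-chat", "messages": [{"role": "user", "content": "ping"}]}'
```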
### Multiple Instances

```yaml
discovery:
  static:
    endpoints:
      - url: "http://gpu1:23333"
        name: "lmdeploy-1"
        type: "lmdeploy"
        priority: 100
      - url: "http://gpu2:23333"
        name: "lmdeploy-2"
        type: "lmdeploy"
        priority: 100

proxy:
  engine: "olla"
  load_balancer: "least-connections"
```
## Endpoints Supported

| Path | Description |
|---|---|
| `/health` | Health Check |
| `/v1/models` | List Models (OpenAI format) |
| `/v1/chat/completions` | Chat Completions (OpenAI format) |
| `/v1/completions` | Text Completions (OpenAI format) |
| `/v1/encode` | Token Encoding (LMDeploy-specific) |
| `/generate` | Native Generation Endpoint |
| `/pooling` | Reward/Score Pooling (not `/v1/embeddings`) |
| `/is_sleeping` | Sleep State Probe |
## Usage Examples

### Chat Completion
```bash
curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is TurboMind?"}
    ],
    "temperature": 0.7,
    "max_tokens": 300
  }'
```
### Streaming

```bash
curl -X POST http://localhost:40114/olla/lmdeploy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Write a short story"}],
    "stream": true,
    "temperature": 0.8
  }'
```
### Token Encoding

```bash
curl -X POST http://localhost:40114/olla/lmdeploy/v1/encode \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "input": "Hello, world!"
  }'
```
### Pooling (Reward/Score)

```bash
# Use /pooling, not /v1/embeddings (which returns HTTP 400)
curl -X POST http://localhost:40114/olla/lmdeploy/pooling \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm/internlm2_5-7b-chat",
    "input": "The quick brown fox"
  }'
```
## Starting LMDeploy

### Basic Start

#### TurboMind Backend (Default, GPU)

```bash
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend turbomind \
  --server-port 23333 \
  --tp 1
```
#### PyTorch Backend

Use the pytorch backend when a model is not supported by TurboMind, or for CPU inference.
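A minimal sketch of the start command; the entrypoint is identical to the TurboMind example, only the backend flag changes:

```bash
# Same entrypoint; select the PyTorch engine instead of TurboMind
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --backend pytorch \
  --server-port 23333
```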
### With Authentication

```bash
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333 \
  --api-keys my-secret-key
```
### VLM Inference

Vision-language models use the same api_server entrypoint; there is no separate binary.
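For example (the model name here is illustrative; any VLM that LMDeploy supports starts the same way):

```bash
# A vision-language model served through the standard api_server
lmdeploy serve api_server OpenGVLab/InternVL2-8B \
  --server-port 23333
```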
### Docker

```bash
docker run --gpus all \
  -p 23333:23333 \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333
```
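To avoid re-downloading weights on every container start, a common pattern (not LMDeploy-specific) is to mount the host's HuggingFace cache; the container path below assumes the image runs as root:

```bash
# Persist model weights across container restarts
docker run --gpus all \
  -p 23333:23333 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  openmmlab/lmdeploy:latest \
  lmdeploy serve api_server internlm/internlm2_5-7b-chat \
  --server-port 23333
```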
## LMDeploy Specifics

### Sleep/Wake
LMDeploy supports a sleep mode to release GPU memory when idle:
```bash
# Suspend the engine (GPU memory freed)
curl -X POST http://localhost:23333/sleep

# Resume the engine
curl -X POST http://localhost:23333/wakeup

# Check state (proxied via Olla)
curl http://localhost:40114/olla/lmdeploy/is_sleeping
```
Olla treats a sleeping engine as transiently unavailable and will route around it if other healthy instances exist. Once the engine wakes, health checks recover it automatically.
### Embeddings vs Pooling

LMDeploy does not implement `/v1/embeddings`. The correct path for reward-model scoring and embedding-style pooling is `/pooling`. This is a deliberate upstream design decision: LMDeploy exposes TurboMind's native pooling path rather than implementing the OpenAI embeddings spec.
### Model Naming

LMDeploy serves models by their HuggingFace identifiers:

- `internlm/internlm2_5-7b-chat`
- `meta-llama/Meta-Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.2`
- `Qwen/Qwen2.5-7B-Instruct`
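The name to pass in requests is whatever the instance itself reports, which you can list through Olla:

```bash
# List model identifiers exactly as the instance serves them
curl http://localhost:40114/olla/lmdeploy/v1/models
```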
### Proxy Server vs API Server

LMDeploy ships two server components:

| Component | Port | Use with Olla? |
|---|---|---|
| `api_server` | 23333 | Yes: has `/health`, full route support |
| `proxy_server` | 8000 | No: no `/health`, limited routes |
Always register individual api_server instances. The proxy_server is LMDeploy's own load balancer and is redundant when Olla is in the stack.
## Profile Customisation
Create config/profiles/lmdeploy-custom.yaml to override defaults. See Profile Configuration for the full schema.
```yaml
name: lmdeploy
version: "1.0"

# Add an alternative routing prefix
routing:
  prefixes:
    - lmdeploy
    - turbomind

# Increase timeout for large 70B models
characteristics:
  timeout: 5m
```
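With the extra prefix loaded, the same endpoints would also answer on the alternative route; a sketch, assuming the custom profile above is active:

```bash
# The turbomind prefix routes to the same LMDeploy endpoints
curl http://localhost:40114/olla/turbomind/v1/models
```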
## OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmdeploy/v1",
    api_key="not-needed"  # any placeholder works unless lmdeploy was started with --api-keys
)

response = client.chat.completions.create(
    model="internlm/internlm2_5-7b-chat",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
## Next Steps
- LMDeploy API Reference - Endpoint details and response formats
- Profile Configuration - Customise LMDeploy behaviour
- Load Balancing - Scale across multiple LMDeploy instances
- Health Checking - Circuit breakers and failover