vLLM Integration¶
| Home | github.com/vllm-project/vllm |
|---|---|
| Since | Olla v0.0.16 |
| Type | `vllm` (use in endpoint configuration) |
| Profile | `vllm.yaml` (see latest) |
| Features | |
| Unsupported | |
| Attributes | |
| Prefixes | `vllm` |
| Endpoints | See below |
Configuration¶
Basic Setup¶
Add vLLM to your Olla configuration:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-vllm"
        type: "vllm"
        priority: 80
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
```
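Once the endpoint is registered, a quick way to confirm Olla can reach it is to list the models it reports (the port and `/olla/vllm` prefix match the examples later on this page):

```bash
# List models discovered from the vLLM endpoint (OpenAI format)
curl http://localhost:40114/olla/vllm/v1/models
```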
Production Setup¶
Configure vLLM for high-throughput production:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:8000"
        name: "vllm-prod"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
      - url: "http://gpu-server:8001"
        name: "vllm-prod-2"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"                      # Use high-performance engine
  load_balancer: "least-connections"
```
Anthropic Messages API Support¶
vLLM v0.11.1+ natively supports the Anthropic Messages API, enabling Olla to forward Anthropic-format requests directly without translation overhead (passthrough mode).
When Olla detects that a vLLM endpoint supports native Anthropic format (via the anthropic_support section in config/profiles/vllm.yaml), it will bypass the Anthropic-to-OpenAI translation pipeline and forward requests directly to /v1/messages on the backend.
Profile configuration (from config/profiles/vllm.yaml):
```yaml
api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false
    min_version: "0.11.1"
    limitations:
      - no_token_counting
```
Key details:

- Minimum vLLM version: v0.11.1
- Token counting (`/v1/messages/count_tokens`) is not supported
- Passthrough mode is automatic; no client-side configuration is needed
- Responses include the `X-Olla-Mode: passthrough` header when passthrough is active
- Olla falls back to translation mode if the passthrough conditions are not met
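As a rough sketch of what passthrough looks like from the client side, the request below sends an Anthropic Messages payload through Olla. The `/olla/vllm` prefix and port are assumed from the other examples on this page, and the body follows the standard Anthropic Messages shape; consult the API Translation guide for the exact routes Olla exposes:

```bash
curl -X POST http://localhost:40114/olla/vllm/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
  }'
```

When passthrough is active, the response carries the `X-Olla-Mode: passthrough` header.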
For more information, see API Translation and Anthropic API Reference.
Endpoints Supported¶
The following endpoints are supported by the vLLM integration profile:
| Path | Description |
|---|---|
| `/health` | Health Check (vLLM-specific) |
| `/metrics` | Prometheus Metrics |
| `/version` | vLLM Version Information |
| `/v1/models` | List Models (OpenAI format) |
| `/v1/chat/completions` | Chat Completions (OpenAI format) |
| `/v1/completions` | Text Completions (OpenAI format) |
| `/v1/embeddings` | Embeddings/Pooling API |
| `/tokenize` | Encode Text to Tokens |
| `/detokenize` | Decode Tokens to Text |
| `/v1/tokenize` | Versioned Tokenise Endpoint |
| `/v1/detokenize` | Versioned Detokenise Endpoint |
| `/rerank` | Reranking API |
| `/v1/rerank` | Versioned Reranking API |
| `/v2/rerank` | v2 Reranking API |
| `/get_tokenizer_info` | Tokeniser Configuration Info |
Usage Examples¶
Chat Completion¶
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
Streaming Response¶
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'
```
Tokenisation¶
```bash
# Encode text to tokens
curl -X POST http://localhost:40114/olla/vllm/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'

# Decode tokens to text
curl -X POST http://localhost:40114/olla/vllm/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "tokens": [15496, 11, 1917, 0],
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'
```
Reranking¶
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence",
      "The weather today is sunny",
      "ML algorithms learn from data"
    ]
  }'
```
Metrics Access¶
```bash
# Get Prometheus metrics
curl http://localhost:40114/olla/vllm/metrics

# Check health status
curl http://localhost:40114/olla/vllm/health

# Get version information
curl http://localhost:40114/olla/vllm/version
```
vLLM Specifics¶
High-Performance Features¶
vLLM includes several optimisations:
- PagedAttention: Memory-efficient attention mechanism
- Continuous Batching: Dynamic request batching
- Tensor Parallelism: Multi-GPU support
- Quantisation Support: INT4/INT8 for reduced memory
Resource Configuration¶
The vLLM profile includes GPU-optimised settings:
```yaml
characteristics:
  timeout: 2m
  max_concurrent_requests: 100  # High concurrency support
  streaming_support: true

resources:
  defaults:
    requires_gpu: true
    min_gpu_memory_gb: 8
```
Memory Requirements¶
vLLM requires additional GPU memory beyond the model weights for its KV cache:
| Model Size | GPU Memory Required | Recommended | Max Concurrent |
|---|---|---|---|
| 70B | 140GB | 160GB | 10 |
| 34B | 70GB | 80GB | 20 |
| 13B | 30GB | 40GB | 50 |
| 7B | 16GB | 24GB | 100 |
| 3B | 8GB | 12GB | 100 |
Model Naming¶
vLLM uses full HuggingFace model names:
- `meta-llama/Meta-Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.2`
- `codellama/CodeLlama-13b-Instruct-hf`
Starting vLLM Server¶
Basic Start¶
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000
```
Production Configuration¶
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --port 8000 \
  --host 0.0.0.0
```
Docker Deployment¶
```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct
```
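For longer-lived deployments, the same container can be described with Docker Compose. This is a minimal sketch that mirrors the `docker run` flags above and assumes the NVIDIA container toolkit is installed:

```yaml
# docker-compose.yml (sketch)
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Meta-Llama-3.1-8B-Instruct"]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```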
Profile Customisation¶
To customise vLLM behaviour, create config/profiles/vllm-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
```yaml
name: vllm
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - vllm
    - gpu  # Add custom prefix

# Adjust for larger models
characteristics:
  timeout: 5m  # Increase for 70B models

# Modify concurrency limits
resources:
  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 5   # Reduce for very large models
    - min_memory_gb: 50
      max_concurrent: 15  # Adjust based on GPU memory
```
See Profile Configuration for complete customisation options.
Monitoring¶
Prometheus Metrics¶
vLLM exposes detailed metrics at /metrics:
```yaml
# Example Prometheus configuration
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:40114']
    metrics_path: '/olla/vllm/metrics'
```
Key metrics include:

- `vllm:num_requests_running` - Active requests
- `vllm:num_requests_waiting` - Queued requests
- `vllm:gpu_cache_usage_perc` - GPU cache utilisation
- `vllm:time_to_first_token_seconds` - TTFT latency
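These metrics plug into standard Prometheus alerting. The rule below is a sketch only; the threshold and rule name are illustrative rather than project recommendations:

```yaml
# prometheus-alerts.yml (illustrative threshold)
groups:
  - name: vllm
    rules:
      - alert: VLLMRequestsQueuing
        expr: vllm:num_requests_waiting > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM requests are queuing; add instances or raise --max-num-seqs"
```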
Health Monitoring¶
```bash
# Check health endpoint
curl http://localhost:40114/olla/vllm/health

# Response when healthy
{"status": "healthy"}

# Response when unhealthy
{"status": "unhealthy", "reason": "model not loaded"}
```
Troubleshooting¶
Out of Memory (OOM)¶
Issue: CUDA out of memory errors
Solution:

1. Reduce `--gpu-memory-utilization` (default 0.9)
2. Decrease `--max-model-len`
3. Use quantisation (`--quantization awq` or `--quantization gptq`)
4. Enable tensor parallelism for multi-GPU
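For example, a reduced-footprint launch combining the first two options might look like this (the values are illustrative; tune them for your GPU and model):

```bash
# Illustrative values only; adjust for your hardware
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 8192 \
  --port 8000
```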
Slow First Token¶
Issue: High time to first token (TTFT)
Solution:

- Enable prefix caching: `--enable-prefix-caching`
- Increase GPU memory utilisation
- Use a smaller model or quantisation
Connection Timeout¶
Issue: Requests timeout during model loading
Solution: Increase timeout in profile:
```yaml
characteristics:
  timeout: 10m  # Increase for initial model load

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true
```
High Queue Wait Times¶
Issue: Requests queue up and `vllm:num_requests_waiting` climbs

Solution:

- Add more vLLM instances
- Use load balancing across multiple servers
- Increase `--max-num-seqs` (default 256)
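As a sketch, the batch ceiling can be raised when restarting the server; the value below is illustrative, and larger batches need more KV-cache memory:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-num-seqs 512 \
  --port 8000
```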
Best Practices¶
1. Use Appropriate GPU Memory¶
```bash
# Conservative setting for stability
--gpu-memory-utilization 0.9

# Aggressive setting for throughput
--gpu-memory-utilization 0.95
```
2. Configure Tensor Parallelism¶
For models requiring multiple GPUs:
```bash
# 70B model on 4x A100 80GB
--tensor-parallel-size 4

# 34B model on 2x A100 40GB
--tensor-parallel-size 2
```
3. Enable Prefix Caching¶
For chat applications that reuse system prompts, start the server with the prefix-caching flag mentioned under Troubleshooting:
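```bash
# Reuses cached KV for shared prompt prefixes such as a fixed system prompt
--enable-prefix-caching
```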
4. Monitor and Scale¶
```yaml
# Multiple vLLM instances
discovery:
  static:
    endpoints:
      - url: "http://gpu1:8000"
        name: "vllm-1"
        type: "vllm"
        priority: 100
      - url: "http://gpu2:8000"
        name: "vllm-2"
        type: "vllm"
        priority: 100

proxy:
  load_balancer: "least-connections"
```
Integration with Tools¶
OpenAI SDK¶
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm/v1",
    api_key="not-needed"  # vLLM doesn't require API keys
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
```
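Streaming works through the same client; a brief sketch using the OpenAI SDK's standard `stream=True` option:

```python
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a story about a robot"}],
    stream=True,
)

# Print tokens as they arrive
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```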
LangChain¶
```python
from langchain_community.llms import VLLMOpenAI

# VLLMOpenAI targets vLLM's OpenAI-compatible server (here, via Olla)
llm = VLLMOpenAI(
    openai_api_key="not-needed",
    openai_api_base="http://localhost:40114/olla/vllm/v1",
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0.7,
)
```
LlamaIndex¶
```python
from llama_index.llms import OpenAI

llm = OpenAI(
    api_base="http://localhost:40114/olla/vllm/v1",
    api_key="dummy",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct"
)
```
Next Steps¶
- Profile Configuration - Customise vLLM behaviour
- Model Unification - Understand model management
- Load Balancing - Scale with multiple vLLM instances
- Monitoring - Set up Prometheus monitoring