vLLM Integration¶
Home | github.com/vllm-project/vllm |
---|---|
Type | vllm (use in endpoint configuration) |
Profile | vllm.yaml (see latest) |
Prefixes | vllm (routes under /olla/vllm/) |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add vLLM to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-vllm"
        type: "vllm"
        priority: 80
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
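Once the endpoint is registered, a quick way to confirm Olla can reach vLLM is to call the proxied health and model-listing routes. A minimal Python sketch, assuming the requests library is installed and Olla listens on its default address http://localhost:40114 used throughout this page:
import requests  # assumed to be installed; any HTTP client works

OLLA = "http://localhost:40114"  # default Olla address used in the examples below

# Health of the vLLM backend, proxied through Olla
print(requests.get(f"{OLLA}/olla/vllm/health", timeout=5).status_code)

# Models discovered via the endpoint's /v1/models (OpenAI list format)
for model in requests.get(f"{OLLA}/olla/vllm/v1/models", timeout=5).json().get("data", []):
    print(model.get("id"))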
Production Setup¶
Configure vLLM for high-throughput production:
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:8000"
        name: "vllm-prod"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
      - url: "http://gpu-server:8001"
        name: "vllm-prod-2"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"  # Use high-performance engine
  load_balancer: "least-connections"
Endpoints Supported¶
The following endpoints are supported by the vLLM integration profile:
Path | Description |
---|---|
/health | Health Check (vLLM-specific) |
/metrics | Prometheus Metrics |
/version | vLLM Version Information |
/v1/models | List Models (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/v1/completions | Text Completions (OpenAI format) |
/v1/embeddings | Embeddings/Pooling API |
/tokenize | Encode Text to Tokens |
/detokenize | Decode Tokens to Text |
/v1/tokenize | Versioned Tokenise Endpoint |
/v1/detokenize | Versioned Detokenise Endpoint |
/rerank | Reranking API |
/v1/rerank | Versioned Reranking API |
/v2/rerank | v2 Reranking API |
/get_tokenizer_info | Tokeniser Configuration Info |
Usage Examples¶
Chat Completion¶
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
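The same request from Python shows where the generated text sits in the OpenAI-compatible response. A minimal sketch, assuming the requests library is installed:
import requests

resp = requests.post(
    "http://localhost:40114/olla/vllm/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms"},
        ],
        "temperature": 0.7,
        "max_tokens": 500,
    },
    timeout=120,
)
resp.raise_for_status()
# OpenAI-compatible response: the generated text is in choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])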
Streaming Response¶
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'
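The streamed response arrives as Server-Sent Events in the OpenAI-compatible format (data: lines terminated by data: [DONE]). A minimal Python sketch for consuming it, assuming the requests library is installed:
import json
import requests

resp = requests.post(
    "http://localhost:40114/olla/vllm/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": "Write a story about a robot"}],
        "stream": True,
        "temperature": 0.8,
    },
    stream=True,
    timeout=120,
)
resp.raise_for_status()

for raw in resp.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8")
    if not line.startswith("data: "):
        continue  # ignore anything that isn't an SSE data line
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    # Each chunk carries an incremental delta; content may be absent in the first chunk
    delta = json.loads(payload)["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)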
Tokenisation¶
# Encode text to tokens
curl -X POST http://localhost:40114/olla/vllm/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'

# Decode tokens to text
curl -X POST http://localhost:40114/olla/vllm/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "tokens": [15496, 11, 1917, 0],
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'
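A round trip through both endpoints from Python. The request fields mirror the curl examples above, and the response is assumed to include a tokens list; check your vLLM version's API reference if the schema differs:
import requests  # assumed to be installed

OLLA = "http://localhost:40114"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Encode text to tokens (fields follow the curl example above)
encoded = requests.post(
    f"{OLLA}/olla/vllm/tokenize",
    json={"text": "Hello, world!", "model": MODEL},
    timeout=30,
).json()
print(encoded)

# Decode the tokens back to text; assumes the response carries a "tokens" list
decoded = requests.post(
    f"{OLLA}/olla/vllm/detokenize",
    json={"tokens": encoded.get("tokens", []), "model": MODEL},
    timeout=30,
).json()
print(decoded)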
Reranking¶
curl -X POST http://localhost:40114/olla/vllm/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence",
      "The weather today is sunny",
      "ML algorithms learn from data"
    ]
  }'
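A Python sketch of the same rerank call. The response handling assumes the common rerank shape of a results list with an index and relevance_score per document; print the raw payload if your vLLM version returns something different:
import requests  # assumed to be installed

resp = requests.post(
    "http://localhost:40114/olla/vllm/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of artificial intelligence",
            "The weather today is sunny",
            "ML algorithms learn from data",
        ],
    },
    timeout=60,
)
resp.raise_for_status()

# Assumed response shape: {"results": [{"index": ..., "relevance_score": ...}, ...]}
for result in resp.json().get("results", []):
    print(result.get("index"), result.get("relevance_score"))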
Metrics Access¶
# Get Prometheus metrics
curl http://localhost:40114/olla/vllm/metrics
# Check health status
curl http://localhost:40114/olla/vllm/health
# Get version information
curl http://localhost:40114/olla/vllm/version
vLLM Specifics¶
High-Performance Features¶
vLLM includes several optimisations:
- PagedAttention: Memory-efficient attention mechanism
- Continuous Batching: Dynamic request batching
- Tensor Parallelism: Multi-GPU support
- Quantisation Support: INT4/INT8 for reduced memory
Resource Configuration¶
The vLLM profile includes GPU-optimised settings:
characteristics:
  timeout: 2m
  max_concurrent_requests: 100  # High concurrency support
  streaming_support: true

resources:
  defaults:
    requires_gpu: true
    min_gpu_memory_gb: 8
Memory Requirements¶
vLLM needs GPU memory beyond the model weights for its KV cache, so plan capacity accordingly:
Model Size | GPU Memory Required | Recommended | Max Concurrent |
---|---|---|---|
70B | 140GB | 160GB | 10 |
34B | 70GB | 80GB | 20 |
13B | 30GB | 40GB | 50 |
7B | 16GB | 24GB | 100 |
3B | 8GB | 12GB | 100 |
Model Naming¶
vLLM uses full HuggingFace model names:
meta-llama/Meta-Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.2
codellama/CodeLlama-13b-Instruct-hf
Starting vLLM Server¶
Basic Start¶
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8000
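Model loading can take several minutes, so it helps to wait for /health to succeed before routing traffic or registering the endpoint in Olla. A minimal Python polling sketch, assuming the requests library is installed and vLLM is on localhost:8000 as above:
import time
import requests  # assumed to be installed

def wait_for_vllm(base_url: str = "http://localhost:8000", timeout_s: int = 600) -> bool:
    """Poll vLLM's /health endpoint until the model has finished loading."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(5)
    return False

if __name__ == "__main__":
    print("vLLM ready" if wait_for_vllm() else "vLLM did not become healthy in time")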
Production Configuration¶
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --port 8000 \
    --host 0.0.0.0
Docker Deployment¶
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
Profile Customisation¶
To customise vLLM behaviour, create config/profiles/vllm-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: vllm
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - vllm
    - gpu  # Add custom prefix

# Adjust for larger models
characteristics:
  timeout: 5m  # Increase for 70B models

# Modify concurrency limits
resources:
  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 5   # Reduce for very large models
    - min_memory_gb: 50
      max_concurrent: 15  # Adjust based on GPU memory
See Profile Configuration for complete customisation options.
Monitoring¶
Prometheus Metrics¶
vLLM exposes detailed metrics at /metrics:
# Example Prometheus configuration
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:40114']
    metrics_path: '/olla/vllm/metrics'
Key metrics include (see the polling sketch below):
- vllm:num_requests_running - Active requests
- vllm:num_requests_waiting - Queued requests
- vllm:gpu_cache_usage_perc - GPU cache utilisation
- vllm:time_to_first_token_seconds - TTFT latency
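A minimal Python sketch that polls these gauges through Olla, assuming the requests library is installed; it simply filters the Prometheus text exposition for the metric names listed above:
import requests  # assumed to be installed

WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
    "vllm:time_to_first_token_seconds",
)

# Fetch the Prometheus text format via Olla and keep only the watched series
# (histogram metrics such as TTFT also expose _bucket/_sum/_count lines).
text = requests.get("http://localhost:40114/olla/vllm/metrics", timeout=10).text
for line in text.splitlines():
    if not line.startswith("#") and line.startswith(WATCHED):
        print(line)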
Health Monitoring¶
# Check health endpoint
curl http://localhost:40114/olla/vllm/health
# Response when healthy
{"status": "healthy"}
# Response when unhealthy
{"status": "unhealthy", "reason": "model not loaded"}
Troubleshooting¶
Out of Memory (OOM)¶
Issue: CUDA out of memory errors
Solution:
1. Reduce --gpu-memory-utilization (default 0.9)
2. Decrease --max-model-len
3. Use quantisation (--quantization awq or --quantization gptq)
4. Enable tensor parallelism for multi-GPU
Slow First Token¶
Issue: High time to first token (TTFT)
Solution:
- Enable prefix caching: --enable-prefix-caching
- Increase GPU memory utilisation
- Use smaller model or quantisation
Connection Timeout¶
Issue: Requests timeout during model loading
Solution: Increase timeout in profile:
characteristics:
  timeout: 10m  # Increase for initial model load

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true
High Queue Wait Times¶
Issue: Requests queue up and vllm:num_requests_waiting stays high
Solution:
- Add more vLLM instances
- Use load balancing across multiple servers
- Increase --max-num-seqs (default 256)
Best Practices¶
1. Use Appropriate GPU Memory¶
# Conservative setting for stability
--gpu-memory-utilization 0.9
# Aggressive setting for throughput
--gpu-memory-utilization 0.95
2. Configure Tensor Parallelism¶
For models requiring multiple GPUs:
# 70B model on 4x A100 80GB
--tensor-parallel-size 4
# 34B model on 2x A100 40GB
--tensor-parallel-size 2
3. Enable Prefix Caching¶
For chat applications with system prompts:
# Reuse shared prompt prefixes across requests
--enable-prefix-caching
4. Monitor and Scale¶
# Multiple vLLM instances
discovery:
  static:
    endpoints:
      - url: "http://gpu1:8000"
        name: "vllm-1"
        type: "vllm"
        priority: 100
      - url: "http://gpu2:8000"
        name: "vllm-2"
        type: "vllm"
        priority: 100

proxy:
  load_balancer: "least-connections"
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm/v1",
    api_key="not-needed"  # vLLM doesn't require API keys
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
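Streaming works through the same client. A brief sketch reusing the client defined above; the None check covers chunks that carry no content:
# Stream tokens as they are generated (reuses the client above)
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)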
LangChain¶
# VLLMOpenAI talks to a running vLLM server via its OpenAI-compatible API
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_base="http://localhost:40114/olla/vllm/v1",
    openai_api_key="not-needed",
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0.7
)
LlamaIndex¶
# OpenAILike (llama-index-llms-openai-like) avoids OpenAI model-name validation
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="http://localhost:40114/olla/vllm/v1",
    api_key="dummy",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    is_chat_model=True
)
Next Steps¶
- Profile Configuration - Customise vLLM behaviour
- Model Unification - Understand model management
- Load Balancing - Scale with multiple vLLM instances
- Monitoring - Set up Prometheus monitoring