
vLLM Integration

Home github.com/vllm-project/vllm
Type vllm (use in endpoint configuration)
Profile vllm.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (native)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • Prometheus Metrics
  • Tokenisation API
  • Reranking API
Unsupported
  • Model Management (loading/unloading)
  • Instance Management
  • Model Download
Attributes
  • OpenAI Compatible
  • High Concurrency (PagedAttention)
  • GPU Optimised
  • Continuous Batching
Prefixes vllm (requests are routed via /olla/vllm/)
Endpoints See below

Configuration

Basic Setup

Add vLLM to your Olla configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-vllm"
        type: "vllm"
        priority: 80
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

Production Setup

Configure vLLM for high-throughput production:

discovery:
  static:
    endpoints:
      - url: "http://gpu-server:8000"
        name: "vllm-prod"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

      - url: "http://gpu-server:8001"
        name: "vllm-prod-2"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"  # Use high-performance engine
  load_balancer: "least-connections"

Endpoints Supported

The following endpoints are supported by the vLLM integration profile:

Path Description
/health Health Check (vLLM-specific)
/metrics Prometheus Metrics
/version vLLM Version Information
/v1/models List Models (OpenAI format)
/v1/chat/completions Chat Completions (OpenAI format)
/v1/completions Text Completions (OpenAI format)
/v1/embeddings Embeddings/Pooling API
/tokenize Encode Text to Tokens
/detokenize Decode Tokens to Text
/v1/tokenize Versioned Tokenise Endpoint
/v1/detokenize Versioned Detokenise Endpoint
/rerank Reranking API
/v1/rerank Versioned Reranking API
/v2/rerank v2 Reranking API
/get_tokenizer_info Tokeniser Configuration Info

Usage Examples

Chat Completion

curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'
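
The same request from Python, using the OpenAI SDK to handle the SSE framing (a minimal sketch; assumes Olla is listening on localhost:40114 and the model above is loaded):

from openai import OpenAI

# Point the SDK at Olla's vLLM route; vLLM does not require an API key by default
client = OpenAI(base_url="http://localhost:40114/olla/vllm/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a story about a robot"}],
    stream=True,
    temperature=0.8,
)

# Each chunk carries an incremental delta; print content as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()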

Tokenisation

# Encode text to tokens
curl -X POST http://localhost:40114/olla/vllm/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'

# Decode tokens to text
curl -X POST http://localhost:40114/olla/vllm/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "tokens": [15496, 11, 1917, 0],
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'
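
The same round trip from Python with requests, checking that detokenising the returned IDs reproduces the input (a sketch; field names follow the payloads shown above, and response shapes may vary by vLLM version):

import requests

BASE = "http://localhost:40114/olla/vllm"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Encode text to token IDs
tok = requests.post(f"{BASE}/tokenize",
                    json={"text": "Hello, world!", "model": MODEL}).json()
print(tok)  # typically includes a "tokens" list and a token "count"

# Decode the token IDs back to text
detok = requests.post(f"{BASE}/detokenize",
                      json={"tokens": tok["tokens"], "model": MODEL}).json()
print(detok)  # typically returns the reconstructed text as a "prompt" field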

Reranking

curl -X POST http://localhost:40114/olla/vllm/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence",
      "The weather today is sunny",
      "ML algorithms learn from data"
    ]
  }'
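
To use the scores programmatically, the response can be parsed in Python. A sketch that assumes the common rerank response shape of a "results" list with "index" and "relevance_score" fields:

import requests

resp = requests.post(
    "http://localhost:40114/olla/vllm/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of artificial intelligence",
            "The weather today is sunny",
            "ML algorithms learn from data",
        ],
    },
).json()

# Sort by relevance; "index" refers back to the position in the documents list
for item in sorted(resp.get("results", []), key=lambda r: r["relevance_score"], reverse=True):
    print(f'{item["relevance_score"]:.3f}  doc[{item["index"]}]')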

Metrics Access

# Get Prometheus metrics
curl http://localhost:40114/olla/vllm/metrics

# Check health status
curl http://localhost:40114/olla/vllm/health

# Get version information
curl http://localhost:40114/olla/vllm/version

vLLM Specifics

High-Performance Features

vLLM includes several optimisations:

  • PagedAttention: Memory-efficient attention mechanism
  • Continuous Batching: Dynamic request batching
  • Tensor Parallelism: Multi-GPU support
  • Quantisation Support: INT4/INT8 for reduced memory

Resource Configuration

The vLLM profile includes GPU-optimised settings:

characteristics:
  timeout: 2m
  max_concurrent_requests: 100  # High concurrency support
  streaming_support: true

resources:
  defaults:
    requires_gpu: true
    min_gpu_memory_gb: 8

Memory Requirements

vLLM needs GPU memory beyond the model weights for its KV cache:

Model Size   GPU Memory Required   Recommended GPU Memory   Max Concurrent Requests
70B          140GB                 160GB                    10
34B          70GB                  80GB                     20
13B          30GB                  40GB                     50
7B           16GB                  24GB                     100
3B           8GB                   12GB                     100
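
The extra headroom comes largely from the KV cache, which grows with context length and concurrency. A rough back-of-the-envelope sketch, using Meta-Llama-3.1-8B-Instruct architecture values (32 layers, 8 KV heads, head dimension 128, FP16 cache) purely as an illustration:

# Rough KV-cache size: 2 (K and V) * layers * KV heads * head dim * bytes per element, per token
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, concurrent_seqs: int, bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len * concurrent_seqs / 1024**3

# Llama-3.1-8B-style model, 8k context, 16 concurrent sequences:
# roughly 16 GiB of cache on top of ~16 GB of FP16 weights
print(f"{kv_cache_gib(32, 8, 128, 8192, 16):.1f} GiB")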

Model Naming

vLLM uses full HuggingFace model names:

  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.2
  • codellama/CodeLlama-13b-Instruct-hf
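
To see the exact names a running server advertises, list models through Olla (a minimal sketch using requests and the proxy route used throughout this page):

import requests

# /v1/models returns an OpenAI-format list; "id" is the full HuggingFace model name
models = requests.get("http://localhost:40114/olla/vllm/v1/models").json()
for model in models.get("data", []):
    print(model["id"])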

Starting vLLM Server

Basic Start

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000

Production Configuration

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --port 8000 \
  --host 0.0.0.0

Docker Deployment

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct

Profile Customisation

To customise vLLM behaviour, create config/profiles/vllm-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: vllm
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - vllm
    - gpu      # Add custom prefix

# Adjust for larger models
characteristics:
  timeout: 5m     # Increase for 70B models

# Modify concurrency limits
resources:
  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 5    # Reduce for very large models
    - min_memory_gb: 50
      max_concurrent: 15   # Adjust based on GPU memory

See Profile Configuration for complete customisation options.

Monitoring

Prometheus Metrics

vLLM exposes detailed metrics at /metrics:

# Example Prometheus configuration
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:40114']
    metrics_path: '/olla/vllm/metrics'

Key metrics include:

  • vllm:num_requests_running - Active requests
  • vllm:num_requests_waiting - Queued requests
  • vllm:gpu_cache_usage_perc - GPU cache utilisation
  • vllm:time_to_first_token_seconds - TTFT latency
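
A small sketch for pulling just these counters out of the Prometheus text format via the Olla route, assuming the metric names match your vLLM version:

import requests

WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
    "vllm:time_to_first_token_seconds",
)

# /metrics returns Prometheus text exposition format: one sample per line
body = requests.get("http://localhost:40114/olla/vllm/metrics").text
for line in body.splitlines():
    if line.startswith(WATCHED):
        print(line)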

Health Monitoring

# Check health endpoint
curl http://localhost:40114/olla/vllm/health

# Response when healthy
{"status": "healthy"}

# Response when unhealthy
{"status": "unhealthy", "reason": "model not loaded"}

Troubleshooting

Out of Memory (OOM)

Issue: CUDA out of memory errors

Solution:

  1. Reduce --gpu-memory-utilization (default 0.9)
  2. Decrease --max-model-len
  3. Use quantisation (--quantization awq or --quantization gptq)
  4. Enable tensor parallelism for multi-GPU setups

Slow First Token

Issue: High time to first token (TTFT)

Solution:

  1. Enable prefix caching: --enable-prefix-caching
  2. Increase GPU memory utilisation
  3. Use smaller model or quantisation

Connection Timeout

Issue: Requests timeout during model loading

Solution: Increase timeout in profile:

characteristics:
  timeout: 10m  # Increase for initial model load

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true

High Queue Wait Times

Issue: Requests queue up and the vllm:num_requests_waiting metric stays high

Solution:

  1. Add more vLLM instances
  2. Use load balancing across multiple servers
  3. Increase --max-num-seqs (default 256)

Best Practices

1. Use Appropriate GPU Memory

# Conservative setting for stability (the vLLM default)
--gpu-memory-utilization 0.9

# Aggressive setting for throughput
--gpu-memory-utilization 0.95

2. Configure Tensor Parallelism

For models requiring multiple GPUs:

# 70B model on 4x A100 80GB
--tensor-parallel-size 4

# 34B model on 2x A100 40GB
--tensor-parallel-size 2

3. Enable Prefix Caching

For chat applications with system prompts:

--enable-prefix-caching
--block-size 16

4. Monitor and Scale

# Multiple vLLM instances
discovery:
  static:
    endpoints:
      - url: "http://gpu1:8000"
        name: "vllm-1"
        type: "vllm"
        priority: 100

      - url: "http://gpu2:8000"
        name: "vllm-2"
        type: "vllm"
        priority: 100

proxy:
  load_balancer: "least-connections"

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm/v1",
    api_key="not-needed"  # vLLM doesn't require API keys
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

LangChain

from langchain_community.llms import VLLMOpenAI

# Use the OpenAI-compatible client; LangChain's local VLLM class runs models in-process instead
llm = VLLMOpenAI(
    openai_api_base="http://localhost:40114/olla/vllm/v1",
    openai_api_key="not-needed",
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0.7
)

LlamaIndex

from llama_index.llms.openai_like import OpenAILike

# OpenAILike skips OpenAI model-name validation, so full HuggingFace names work
llm = OpenAILike(
    api_base="http://localhost:40114/olla/vllm/v1",
    api_key="dummy",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    is_chat_model=True
)

Next Steps