vLLM Integration¶
| Home | github.com/vllm-project/vllm |
|---|---|
| Since | Olla v0.0.16 |
| Type | `vllm` (use in endpoint configuration) |
| Profile | `vllm.yaml` (see latest) |
| Features | |
| Unsupported | |
| Attributes | |
| Prefixes | `vllm` |
| Endpoints | See below |
Configuration¶
Basic Setup¶
Add vLLM to your Olla configuration:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-vllm"
        type: "vllm"
        priority: 80
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
```
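Once the endpoint is registered, a quick way to confirm Olla can reach it is to list the models it reports (the port and `/olla/vllm` prefix match the examples later on this page):

```bash
# List models discovered from the vLLM endpoint (OpenAI format)
curl http://localhost:40114/olla/vllm/v1/models
```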
Production Setup¶
Configure vLLM for high-throughput production:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:8000"
        name: "vllm-prod"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
      - url: "http://gpu-server:8001"
        name: "vllm-prod-2"
        type: "vllm"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"                      # Use high-performance engine
  load_balancer: "least-connections"
```
Anthropic Messages API Support¶
vLLM v0.11.1+ natively supports the Anthropic Messages API, enabling Olla to forward Anthropic-format requests directly without translation overhead (passthrough mode).
When Olla detects that a vLLM endpoint supports native Anthropic format (via the anthropic_support section in config/profiles/vllm.yaml), it will bypass the Anthropic-to-OpenAI translation pipeline and forward requests directly to /v1/messages on the backend.
Profile configuration (from config/profiles/vllm.yaml):
```yaml
api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false
    min_version: "0.11.1"
    limitations:
      - no_token_counting
```
Key details:

- Minimum vLLM version: v0.11.1
- Token counting (`/v1/messages/count_tokens`) is not supported
- Passthrough mode is automatic; no client-side configuration is needed
- Responses include the `X-Olla-Mode: passthrough` header when passthrough is active
- Olla falls back to translation mode if the passthrough conditions are not met
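As a rough sketch of what passthrough looks like from the client side, the request below sends an Anthropic Messages payload through Olla. The `/olla/vllm` prefix and port are assumed from the other examples on this page, and the body follows the standard Anthropic Messages shape; consult the API Translation guide for the exact routes Olla exposes:

```bash
curl -X POST http://localhost:40114/olla/vllm/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
  }'
```

When passthrough is active, the response carries the `X-Olla-Mode: passthrough` header.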
For more information, see API Translation and Anthropic API Reference.
Endpoints Supported¶
The following endpoints are supported by the vLLM integration profile:
| Path | Description |
|---|---|
| `/health` | Health Check (vLLM-specific) |
| `/metrics` | Prometheus Metrics |
| `/version` | vLLM Version Information |
| `/v1/models` | List Models (OpenAI format) |
| `/v1/chat/completions` | Chat Completions (OpenAI format) |
| `/v1/completions` | Text Completions (OpenAI format) |
| `/v1/embeddings` | Embeddings/Pooling API |
| `/tokenize` | Encode Text to Tokens |
| `/detokenize` | Decode Tokens to Text |
| `/v1/tokenize` | Versioned Tokenise Endpoint |
| `/v1/detokenize` | Versioned Detokenise Endpoint |
| `/rerank` | Reranking API |
| `/v1/rerank` | Versioned Reranking API |
| `/v2/rerank` | v2 Reranking API |
| `/get_tokenizer_info` | Tokeniser Configuration Info |
Usage Examples¶
Chat Completion¶
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
Streaming Response¶
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'
```
Tokenisation¶
```bash
# Encode text to tokens
curl -X POST http://localhost:40114/olla/vllm/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'

# Decode tokens to text
curl -X POST http://localhost:40114/olla/vllm/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "tokens": [15496, 11, 1917, 0],
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"
  }'
```
Reranking¶
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence",
      "The weather today is sunny",
      "ML algorithms learn from data"
    ]
  }'
```
Metrics Access¶
```bash
# Get Prometheus metrics
curl http://localhost:40114/olla/vllm/metrics

# Check health status
curl http://localhost:40114/olla/vllm/health

# Get version information
curl http://localhost:40114/olla/vllm/version
```
vLLM Specifics¶
High-Performance Features¶
vLLM includes several optimisations:
- PagedAttention: Memory-efficient attention mechanism
- Continuous Batching: Dynamic request batching
- Tensor Parallelism: Multi-GPU support
- Quantisation Support: INT4/INT8 for reduced memory
Resource Configuration¶
The vLLM profile includes GPU-optimised settings:
```yaml
characteristics:
  timeout: 2m
  max_concurrent_requests: 100  # High concurrency support
  streaming_support: true

resources:
  defaults:
    requires_gpu: true
    min_gpu_memory_gb: 8
```
Memory Requirements¶
vLLM requires additional GPU memory beyond the model weights for its KV cache:
| Model Size | GPU Memory Required | Recommended | Max Concurrent |
|---|---|---|---|
| 70B | 140GB | 160GB | 10 |
| 34B | 70GB | 80GB | 20 |
| 13B | 30GB | 40GB | 50 |
| 7B | 16GB | 24GB | 100 |
| 3B | 8GB | 12GB | 100 |
Model Naming¶
vLLM uses full HuggingFace model names:
- `meta-llama/Meta-Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.2`
- `codellama/CodeLlama-13b-Instruct-hf`
Starting vLLM Server¶
Basic Start¶
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000
```
Production Configuration¶
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --port 8000 \
  --host 0.0.0.0
```
Docker Deployment¶
```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct
```
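For longer-lived deployments, the same container can be described with Docker Compose. This is a minimal sketch that mirrors the `docker run` flags above and assumes the NVIDIA container toolkit is installed:

```yaml
# docker-compose.yml (sketch)
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Meta-Llama-3.1-8B-Instruct"]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```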
Profile Customisation¶
To customise vLLM behaviour, create config/profiles/vllm-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
```yaml
name: vllm
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - vllm
    - gpu  # Add custom prefix

# Adjust for larger models
characteristics:
  timeout: 5m  # Increase for 70B models

# Modify concurrency limits
resources:
  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 5   # Reduce for very large models
    - min_memory_gb: 50
      max_concurrent: 15  # Adjust based on GPU memory
```
See Profile Configuration for complete customisation options.
Monitoring¶
Prometheus Metrics¶
vLLM exposes detailed metrics at /metrics:
```yaml
# Example Prometheus configuration
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:40114']
    metrics_path: '/olla/vllm/metrics'
```
Key metrics include:

- `vllm:num_requests_running` - Active requests
- `vllm:num_requests_waiting` - Queued requests
- `vllm:gpu_cache_usage_perc` - GPU cache utilisation
- `vllm:time_to_first_token_seconds` - TTFT latency
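These metrics plug into standard Prometheus alerting. The rule below is a sketch only; the threshold and rule name are illustrative rather than project recommendations:

```yaml
# prometheus-alerts.yml (illustrative threshold)
groups:
  - name: vllm
    rules:
      - alert: VLLMRequestsQueuing
        expr: vllm:num_requests_waiting > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM requests are queuing; add instances or raise --max-num-seqs"
```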
Health Monitoring¶
```bash
# Check health endpoint
curl http://localhost:40114/olla/vllm/health

# Response when healthy
{"status": "healthy"}

# Response when unhealthy
{"status": "unhealthy", "reason": "model not loaded"}
```
Troubleshooting¶
Out of Memory (OOM)¶
Issue: CUDA out of memory errors
Solution:

1. Reduce `--gpu-memory-utilization` (default 0.9)
2. Decrease `--max-model-len`
3. Use quantisation (`--quantization awq` or `--quantization gptq`)
4. Enable tensor parallelism for multi-GPU
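For example, a reduced-footprint launch combining the first two options might look like this (the values are illustrative; tune them for your GPU and model):

```bash
# Illustrative values only; adjust for your hardware
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 8192 \
  --port 8000
```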
Slow First Token¶
Issue: High time to first token (TTFT)
Solution:

- Enable prefix caching: `--enable-prefix-caching`
- Increase GPU memory utilisation
- Use a smaller model or quantisation
Connection Timeout¶
Issue: Requests timeout during model loading
Solution: Increase timeout in profile:
```yaml
characteristics:
  timeout: 10m  # Increase for initial model load

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true
```
High Queue Wait Times¶
Issue: Requests queue up and `vllm:num_requests_waiting` climbs

Solution:

- Add more vLLM instances
- Use load balancing across multiple servers
- Increase `--max-num-seqs` (default 256)
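As a sketch, the batch ceiling can be raised when restarting the server; the value below is illustrative, and larger batches need more KV-cache memory:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-num-seqs 512 \
  --port 8000
```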
Best Practices¶
1. Use Appropriate GPU Memory¶
```bash
# Conservative setting for stability
--gpu-memory-utilization 0.9

# Aggressive setting for throughput
--gpu-memory-utilization 0.95
```
2. Configure Tensor Parallelism¶
For models requiring multiple GPUs:
```bash
# 70B model on 4x A100 80GB
--tensor-parallel-size 4

# 34B model on 2x A100 40GB
--tensor-parallel-size 2
```
3. Enable Prefix Caching¶
For chat applications that reuse system prompts, start the server with the prefix-caching flag mentioned under Troubleshooting:
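```bash
# Reuses cached KV for shared prompt prefixes such as a fixed system prompt
--enable-prefix-caching
```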
4. Monitor and Scale¶
```yaml
# Multiple vLLM instances
discovery:
  static:
    endpoints:
      - url: "http://gpu1:8000"
        name: "vllm-1"
        type: "vllm"
        priority: 100
      - url: "http://gpu2:8000"
        name: "vllm-2"
        type: "vllm"
        priority: 100

proxy:
  load_balancer: "least-connections"
```
Integration with Tools¶
OpenAI SDK¶
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm/v1",
    api_key="not-needed"  # vLLM doesn't require API keys
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
```
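Streaming works through the same client; a brief sketch using the OpenAI SDK's standard `stream=True` option:

```python
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a story about a robot"}],
    stream=True,
)

# Print tokens as they arrive
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```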
LangChain¶
```python
from langchain_community.llms import VLLMOpenAI

# VLLMOpenAI targets vLLM's OpenAI-compatible server (here, via Olla)
llm = VLLMOpenAI(
    openai_api_key="not-needed",
    openai_api_base="http://localhost:40114/olla/vllm/v1",
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0.7,
)
```
LlamaIndex¶
```python
from llama_index.llms import OpenAI

llm = OpenAI(
    api_base="http://localhost:40114/olla/vllm/v1",
    api_key="dummy",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct"
)
```
Next Steps¶
- Profile Configuration - Customise vLLM behaviour
- Model Unification - Understand model management
- Load Balancing - Scale with multiple vLLM instances
- Monitoring - Set up Prometheus monitoring