# vLLM API
Proxy endpoints for vLLM inference servers, available through the `/olla/vllm/` prefix.
## Endpoints Overview
| Method | URI | Description |
|--------|-----|-------------|
| GET | `/olla/vllm/health` | Health check |
| GET | `/olla/vllm/v1/models` | List available models |
| POST | `/olla/vllm/v1/chat/completions` | Chat completion |
| POST | `/olla/vllm/v1/completions` | Text completion |
| POST | `/olla/vllm/v1/embeddings` | Generate embeddings |
| GET | `/olla/vllm/metrics` | Prometheus metrics |
## GET /olla/vllm/health
Check vLLM server health status.
### Request
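A health check is a plain GET through the proxy, using the same local address as the other examples on this page:

```bash
curl http://localhost:40114/olla/vllm/health
```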
### Response
```json
{
  "status": "healthy",
  "model_loaded": true,
  "gpu_memory_usage": 0.65,
  "num_requests_running": 2,
  "num_requests_waiting": 0
}
```
## GET /olla/vllm/v1/models
List models available on the vLLM server.
### Request
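As with the health check, listing models is a simple GET against the proxy:

```bash
curl http://localhost:40114/olla/vllm/v1/models
```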
### Response
```json
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Meta-Llama-3-8B-Instruct",
      "object": "model",
      "created": 1705334400,
      "owned_by": "vllm",
      "root": "meta-llama/Meta-Llama-3-8B-Instruct",
      "parent": null,
      "max_model_len": 8192,
      "permission": []
    },
    {
      "id": "mistralai/Mistral-7B-Instruct-v0.2",
      "object": "model",
      "created": 1705334400,
      "owned_by": "vllm",
      "root": "mistralai/Mistral-7B-Instruct-v0.2",
      "parent": null,
      "max_model_len": 32768,
      "permission": []
    }
  ]
}
```
## POST /olla/vllm/v1/chat/completions
OpenAI-compatible chat completion with vLLM optimizations.
### Request
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful AI assistant."
      },
      {
        "role": "user",
        "content": "Explain the benefits of using vLLM for inference"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 300,
    "stream": false,
    "guided_decoding_backend": "outlines",
    "guided_json": {
      "type": "object",
      "properties": {
        "benefits": {
          "type": "array",
          "items": {"type": "string"}
        },
        "summary": {"type": "string"}
      }
    }
  }'
```
### Response
```json
{
  "id": "chatcmpl-vllm-abc123",
  "object": "chat.completion",
  "created": 1705334400,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\n  \"benefits\": [\n    \"High throughput with continuous batching\",\n    \"PagedAttention for efficient memory management\",\n    \"Tensor parallelism for multi-GPU serving\",\n    \"Optimized CUDA kernels for faster inference\",\n    \"Support for quantization methods like AWQ and GPTQ\"\n  ],\n  \"summary\": \"vLLM provides state-of-the-art serving throughput with efficient memory management and GPU utilization, making it ideal for production LLM deployment.\"\n}"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 35,
    "completion_tokens": 98,
    "total_tokens": 133
  }
}
```
### Streaming Response
When "stream": true
:
data: {"id":"chatcmpl-vllm-abc123","object":"chat.completion.chunk","created":1705334400,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-vllm-abc123","object":"chat.completion.chunk","created":1705334400,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{"content":"vLLM"},"logprobs":null,"finish_reason":null}]}
...
data: {"id":"chatcmpl-vllm-abc123","object":"chat.completion.chunk","created":1705334401,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
data: [DONE]
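A stream like the one above can be produced by resending the chat request with `"stream": true`; `curl -N` disables output buffering so the chunks print as they arrive:

```bash
# Streaming chat completion through the proxy (simplified request body)
curl -N -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain the benefits of using vLLM for inference"}
    ],
    "max_tokens": 300,
    "stream": true
  }'
```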
## POST /olla/vllm/v1/completions
Text completion with vLLM-specific optimizations.
### Request
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "The advantages of PagedAttention in vLLM are:",
    "max_tokens": 200,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "best_of": 1,
    "use_beam_search": false,
    "stream": false
  }'
```
### Response
```json
{
  "id": "cmpl-vllm-xyz789",
  "object": "text_completion",
  "created": 1705334400,
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "choices": [
    {
      "text": "\n\n1. **Memory Efficiency**: PagedAttention manages attention key-value (KV) cache memory in non-contiguous blocks, eliminating memory fragmentation and allowing for higher batch sizes.\n\n2. **Dynamic Memory Allocation**: It allocates memory on-demand as sequences grow, rather than pre-allocating maximum sequence length, significantly reducing memory waste.\n\n3. **Memory Sharing**: Enables efficient memory sharing across parallel sampling requests and beam search, reducing redundant memory usage.\n\n4. **Higher Throughput**: By optimizing memory usage, PagedAttention allows vLLM to handle more concurrent requests, increasing overall serving throughput.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 118,
    "total_tokens": 130
  }
}
```
## POST /olla/vllm/v1/embeddings
Generate embeddings using vLLM (if the served model supports embeddings).
### Request
```bash
curl -X POST http://localhost:40114/olla/vllm/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-large-en-v1.5",
    "input": "vLLM is a high-throughput inference engine",
    "encoding_format": "float"
  }'
```
### Response
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ],
  "model": "BAAI/bge-large-en-v1.5",
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}
```
## GET /olla/vllm/metrics
Prometheus-compatible metrics endpoint for monitoring.
### Request
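Metrics are fetched with a plain GET; a Prometheus scrape job would typically point at this URL:

```bash
curl http://localhost:40114/olla/vllm/metrics
```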
### Response (Prometheus format)
```
# HELP vllm_request_duration_seconds Request duration in seconds
# TYPE vllm_request_duration_seconds histogram
vllm_request_duration_seconds_bucket{model="meta-llama/Meta-Llama-3-8B-Instruct",le="0.1"} 45
vllm_request_duration_seconds_bucket{model="meta-llama/Meta-Llama-3-8B-Instruct",le="0.5"} 120
vllm_request_duration_seconds_bucket{model="meta-llama/Meta-Llama-3-8B-Instruct",le="1"} 180
vllm_request_duration_seconds_sum{model="meta-llama/Meta-Llama-3-8B-Instruct"} 125.5
vllm_request_duration_seconds_count{model="meta-llama/Meta-Llama-3-8B-Instruct"} 200

# HELP vllm_num_requests_running Number of requests currently running
# TYPE vllm_num_requests_running gauge
vllm_num_requests_running{model="meta-llama/Meta-Llama-3-8B-Instruct"} 3

# HELP vllm_num_requests_waiting Number of requests waiting in queue
# TYPE vllm_num_requests_waiting gauge
vllm_num_requests_waiting{model="meta-llama/Meta-Llama-3-8B-Instruct"} 0

# HELP vllm_gpu_memory_usage_bytes GPU memory usage in bytes
# TYPE vllm_gpu_memory_usage_bytes gauge
vllm_gpu_memory_usage_bytes{gpu="0"} 12884901888
```
## vLLM-Specific Parameters
### Sampling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `best_of` | integer | 1 | Number of sequences to generate; the best one is returned |
| `use_beam_search` | boolean | false | Use beam search instead of sampling |
| `top_k` | integer | -1 | Top-k sampling (-1 = disabled) |
| `min_p` | float | 0.0 | Min-p sampling threshold |
| `repetition_penalty` | float | 1.0 | Repetition penalty |
| `length_penalty` | float | 1.0 | Length penalty for beam search |
| `early_stopping` | boolean | false | Stop beam search early |
| `ignore_eos` | boolean | false | Continue generation after EOS |
| `min_tokens` | integer | 0 | Minimum tokens to generate |
| `skip_special_tokens` | boolean | true | Skip special tokens in output |
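The sketch below combines the beam-search-related parameters from the table in a single completion request. Exact parameter support varies between vLLM versions, so treat this as illustrative rather than canonical:

```bash
# Beam search sketch: best_of candidate beams, length penalty, early stopping
curl -X POST http://localhost:40114/olla/vllm/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Summarise PagedAttention in one sentence:",
    "max_tokens": 64,
    "min_tokens": 16,
    "use_beam_search": true,
    "best_of": 4,
    "length_penalty": 1.2,
    "early_stopping": true,
    "temperature": 0
  }'
```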
### Guided Generation
| Parameter | Type | Description |
|-----------|------|-------------|
| `guided_json` | object | JSON schema for structured output |
| `guided_regex` | string | Regular expression for guided generation |
| `guided_choice` | array | List of allowed choices |
| `guided_grammar` | string | Context-free grammar |
| `guided_decoding_backend` | string | Backend for guided generation (`outlines` or `lm-format-enforcer`) |
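The chat completion example earlier already shows `guided_json`; a sketch of `guided_choice`, which constrains the model to one of a fixed set of answers, could look like this:

```bash
# Guided choice sketch: the model must reply with one of the listed strings
curl -X POST http://localhost:40114/olla/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Is vLLM primarily an inference engine or a training framework?"}
    ],
    "max_tokens": 10,
    "guided_choice": ["inference engine", "training framework"],
    "guided_decoding_backend": "outlines"
  }'
```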
### Advanced Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `logprobs` | integer | Number of log probabilities to return |
| `prompt_logprobs` | integer | Log probabilities for prompt tokens |
| `detokenize` | boolean | Return detokenized text |
| `echo` | boolean | Include prompt in response |
| `add_generation_prompt` | boolean | Add generation prompt for chat models |
| `add_special_tokens` | boolean | Add special tokens to prompt |
| `include_stop_str_in_output` | boolean | Include stop string in output |
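As an illustration, the request below asks for per-token log probabilities and echoes the prompt back in the completion. The exact combination of parameters is an assumption to verify against your vLLM version:

```bash
# Advanced parameters sketch: top-5 logprobs per token, prompt echoed in output
curl -X POST http://localhost:40114/olla/vllm/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "vLLM stands for",
    "max_tokens": 20,
    "logprobs": 5,
    "echo": true
  }'
```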
## Performance Features
vLLM provides several performance optimizations:
- Continuous Batching: Dynamic batching of requests for higher throughput
- PagedAttention: Efficient KV cache memory management
- Tensor Parallelism: Multi-GPU serving support
- Quantization: Support for AWQ, GPTQ, and SqueezeLLM
- Speculative Decoding: Faster inference with draft models
- Prefix Caching: Automatic caching of common prefixes
## Configuration Example
```yaml
endpoints:
  - url: "http://192.168.0.100:8000"
    name: "vllm-server"
    type: "vllm"
    priority: 90
    model_url: "/v1/models"
    health_check_url: "/health"
    check_interval: 5s
    check_timeout: 2s
    headers:
      X-API-Key: "${VLLM_API_KEY}"
```
## Request Headers
All requests are forwarded with:
- `X-Olla-Request-ID` - Unique request identifier
- `X-Forwarded-For` - Client IP address
- Custom headers from endpoint configuration
## Response Headers
All responses include:
- `X-Olla-Endpoint` - Backend endpoint name (e.g., "vllm-server")
- `X-Olla-Model` - Model used for the request
- `X-Olla-Backend-Type` - Always "vllm" for these endpoints
- `X-Olla-Response-Time` - Total processing time
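To inspect these headers on any request, include `-i` so curl prints the response headers along with the body:

```bash
# Print response headers (including the X-Olla-* headers) with the body
curl -i http://localhost:40114/olla/vllm/v1/models
```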