SGLang Integration¶
Home | github.com/sgl-project/sglang |
---|---|
Since | Olla v0.1.0 |
Type | sglang (use in endpoint configuration) |
Profile | sglang.yaml (see latest) |
Prefixes | sglang (routes served under /olla/sglang/) |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add SGLang to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:30000"
        name: "local-sglang"
        type: "sglang"
        priority: 85
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
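With Olla running, confirm that requests reach SGLang by listing models through the sglang route prefix (port 40114 is the Olla port used throughout the examples on this page):
curl http://localhost:40114/olla/sglang/v1/models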
Production Setup¶
Configure SGLang for high-throughput production:
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:30000"
        name: "sglang-prod"
        type: "sglang"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
      - url: "http://gpu-server:30001"
        name: "sglang-prod-2"
        type: "sglang"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"                      # Use high-performance engine
  load_balancer: "least-connections"
Endpoints Supported¶
The following endpoints are supported by the SGLang integration profile:
Path | Description |
---|---|
/health | Health Check (SGLang-specific) |
/metrics | Prometheus Metrics |
/version | SGLang Version Information |
/v1/models | List Models (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/v1/completions | Text Completions (OpenAI format) |
/v1/embeddings | Embeddings/Pooling API |
/generate | SGLang Native Generation (Frontend Language) |
/batch | Batch Processing (Frontend Language) |
/extend | Conversation Extension (Frontend Language) |
/v1/chat/completions/vision | Vision Chat Completions (Multimodal) |
Usage Examples¶
Chat Completion¶
curl -X POST http://localhost:40114/olla/sglang/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain SGLang RadixAttention in simple terms"}
],
"temperature": 0.7,
"max_tokens": 500
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/sglang/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"messages": [
{"role": "user", "content": "Write a story about efficient AI inference"}
],
"stream": true,
"temperature": 0.8
}'
Frontend Language Generation¶
# SGLang native generation endpoint
curl -X POST http://localhost:40114/olla/sglang/generate \
-H "Content-Type: application/json" \
-d '{
"text": "def fibonacci(n):",
"sampling_params": {
"temperature": 0.3,
"max_new_tokens": 200
}
}'
Batch Processing¶
curl -X POST http://localhost:40114/olla/sglang/batch \
-H "Content-Type: application/json" \
-d '{
"requests": [
{
"text": "Translate to French: Hello world",
"sampling_params": {"temperature": 0.1, "max_new_tokens": 50}
},
{
"text": "Translate to Spanish: Hello world",
"sampling_params": {"temperature": 0.1, "max_new_tokens": 50}
}
]
}'
Vision Chat Completions¶
curl -X POST http://localhost:40114/olla/sglang/v1/chat/completions/vision \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What do you see in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
],
"temperature": 0.7,
"max_tokens": 300
}'
Conversation Extension¶
curl -X POST http://localhost:40114/olla/sglang/extend \
-H "Content-Type: application/json" \
-d '{
"rid": "conversation-123",
"text": "Continue the previous discussion about machine learning",
"sampling_params": {
"temperature": 0.7,
"max_new_tokens": 150
}
}'
Metrics Access¶
# Get Prometheus metrics
curl http://localhost:40114/olla/sglang/metrics
# Check health status
curl http://localhost:40114/olla/sglang/health
# Get version information
curl http://localhost:40114/olla/sglang/version
SGLang Specifics¶
High-Performance Features¶
SGLang includes several optimisations beyond standard inference:
- RadixAttention: Tree-based prefix caching more advanced than PagedAttention
- Speculative Decoding: Enhanced performance through speculative execution
- Frontend Language: Flexible programming interface for LLM applications
- Disaggregation: Separate prefill and decode phases for efficiency
- Enhanced Multimodal: Advanced vision and multimodal capabilities
RadixAttention vs PagedAttention¶
SGLang's RadixAttention provides superior memory efficiency compared to vLLM's PagedAttention:
Feature | PagedAttention (vLLM) | RadixAttention (SGLang) |
---|---|---|
Memory Structure | Block-based | Tree-based |
Prefix Sharing | Limited | Advanced |
Cache Hit Rate | ~60-70% | ~85-95% |
Memory Efficiency | Good | Excellent |
Complex Conversations | Standard | Optimised |
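To make the comparison concrete, the following toy sketch (plain Python, not SGLang's implementation) shows the idea behind tree-based prefix caching: requests that share a prompt prefix walk the same branch of a token tree and reuse the entries already cached along it.
# Toy illustration only: a prefix tree where cached nodes stand in for KV entries.
class RadixNode:
    def __init__(self):
        self.children = {}   # token -> RadixNode
        self.cached = False  # True once this prefix has been "computed"

class ToyPrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def process(self, tokens):
        """Walk the tree and count how many prefix tokens were already cached."""
        node, reused = self.root, 0
        for tok in tokens:
            node = node.children.setdefault(tok, RadixNode())
            if node.cached:
                reused += 1
            else:
                node.cached = True  # pretend the KV entry is computed and stored
        return reused

cache = ToyPrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
print(cache.process(system_prompt + ["Hello"]))              # 0 tokens reused (cold cache)
print(cache.process(system_prompt + ["Tell", "a", "joke"]))  # 6 tokens reused (shared prefix)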
Resource Configuration¶
The SGLang profile includes GPU-optimised settings with enhanced efficiency:
characteristics:
  timeout: 2m
  max_concurrent_requests: 150  # Higher than vLLM (100)
  streaming_support: true

resources:
  defaults:
    requires_gpu: true
    min_gpu_memory_gb: 6
Memory Requirements¶
SGLang requires less memory than vLLM due to RadixAttention efficiency:
Model Size | GPU Memory Required | Recommended | Max Concurrent |
---|---|---|---|
70B | 120GB | 140GB | 15 |
34B | 60GB | 70GB | 30 |
13B | 25GB | 35GB | 75 |
7B | 14GB | 20GB | 150 |
3B | 6GB | 10GB | 150 |
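If you want a custom profile to reflect these tiers explicitly, they can be expressed with the concurrency_limits fields shown in the customisation example later on this page. This is a sketch only: the memory thresholds are illustrative cut-offs, not the profile's actual defaults.
resources:
  concurrency_limits:
    - min_memory_gb: 100   # 70B-class models
      max_concurrent: 15
    - min_memory_gb: 50    # 34B-class models
      max_concurrent: 30
    - min_memory_gb: 20    # 13B-class models
      max_concurrent: 75
    - min_memory_gb: 0     # 7B and smaller
      max_concurrent: 150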
Model Naming¶
SGLang uses full HuggingFace model names, just as vLLM does:
meta-llama/Meta-Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.2
llava-hf/llava-1.5-7b-hf
Starting SGLang Server¶
Basic Start¶
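Launch SGLang with its defaults, exposing it on the port used throughout this page:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 30000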
Production Configuration¶
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--tp-size 4 \
--mem-fraction-static 0.85 \
--max-running-requests 150 \
--port 30000 \
--host 0.0.0.0 \
--enable-flashinfer
Docker Deployment¶
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 30000:30000 \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
With Speculative Decoding¶
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--draft-model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-draft-length 4 \
--port 30000
Vision Model Setup¶
python -m sglang.launch_server \
--model-path llava-hf/llava-1.5-13b-hf \
--tokenizer-path llava-hf/llava-1.5-13b-hf \
--chat-template llava \
--port 30000
Profile Customisation¶
To customise SGLang behaviour, create config/profiles/sglang-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: sglang
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - sglang
    - radix  # Add custom prefix for RadixAttention

# Adjust for larger models
characteristics:
  timeout: 5m                   # Increase for 70B models
  max_concurrent_requests: 200  # Leverage SGLang's efficiency

# Modify concurrency limits
resources:
  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 20  # Higher than vLLM due to efficiency
    - min_memory_gb: 50
      max_concurrent: 40  # Take advantage of RadixAttention

# Enable SGLang-specific features
features:
  radix_attention:
    enabled: true
  speculative_decoding:
    enabled: true
  frontend_language:
    enabled: true
See Profile Configuration for complete customisation options.
Monitoring¶
Prometheus Metrics¶
SGLang exposes detailed metrics at /metrics:
# Example Prometheus configuration
scrape_configs:
  - job_name: 'sglang'
    static_configs:
      - targets: ['localhost:40114']
    metrics_path: '/olla/sglang/metrics'
Key SGLang-specific metrics include:
- sglang:num_requests_running - Active requests
- sglang:num_requests_waiting - Queued requests
- sglang:radix_cache_usage_perc - RadixAttention cache utilisation
- sglang:radix_cache_hit_rate - Cache hit efficiency
- sglang:time_to_first_token_seconds - TTFT latency
- sglang:spec_decode_num_accepted_tokens_total - Speculative decoding stats
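As an illustration, the hit-rate metric can drive an alert using standard Prometheus alerting rule syntax. This is a sketch: it assumes the metric is exported as a 0-1 ratio and reuses the 80% threshold from the troubleshooting section below.
groups:
  - name: sglang-olla
    rules:
      - alert: SGLangLowRadixCacheHitRate
        expr: sglang:radix_cache_hit_rate < 0.8  # assumes a 0-1 ratio; adjust if exported as a percentage
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RadixAttention cache hit rate below 80%"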
Health Monitoring¶
# Check health endpoint
curl http://localhost:40114/olla/sglang/health
# Response when healthy
{"status": "healthy", "radix_cache_ready": true}
# Response when unhealthy
{"status": "unhealthy", "reason": "model not loaded"}
Troubleshooting¶
Out of Memory (OOM)¶
Issue: CUDA out of memory errors
Solution:
1. Reduce --mem-fraction-static (default 0.9)
2. Decrease --max-running-requests
3. Use quantisation with --quantization fp8 or --quantization int4
4. Enable tensor parallelism for multi-GPU
Low Cache Hit Rate¶
Issue: RadixAttention cache hit rate below 80%
Solution:
- Enable longer context retention
- Increase RadixAttention cache size
- Use consistent prompt formats
- Monitor prefix sharing patterns (see the check below)
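To check the current hit rate, filter the Prometheus metrics exposed through Olla:
curl -s http://localhost:40114/olla/sglang/metrics | grep radix_cache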
Frontend Language Errors¶
Issue: /generate or /batch endpoints failing
Solution:
# Ensure SGLang server started with Frontend Language support
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-frontend-language \
--port 30000
Connection Timeout¶
Issue: Requests timeout during model loading
Solution: Increase timeout in profile:
characteristics:
  timeout: 10m  # Increase for initial model load

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true
High Queue Wait Times¶
Issue: Requests queuing with "num_requests_waiting" high
Solution:
- Add more SGLang instances
- Use load balancing across multiple servers
- Increase --max-running-requests (default 1024)
- Enable disaggregation for better resource utilisation
Best Practices¶
1. Optimise RadixAttention¶
# Enable advanced prefix caching (reserve 40% of GPU memory for the radix cache)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --radix-cache-size 0.4 \
  --enable-flashinfer
2. Configure Tensor Parallelism¶
For models requiring multiple GPUs, shard the model with the same --tp-size flag used in the production configuration above:
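# Set --tp-size to the number of GPUs available
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --port 30000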
3. Enable Speculative Decoding¶
For maximum performance with compatible models:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--draft-model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-draft-length 4
4. Monitor and Scale¶
# Multiple SGLang instances
discovery:
static:
endpoints:
- url: "http://gpu1:30000"
name: "sglang-1"
type: "sglang"
priority: 100
- url: "http://gpu2:30000"
name: "sglang-2"
type: "sglang"
priority: 100
proxy:
load_balancer: "least-connections"
5. Leverage Frontend Language¶
Use SGLang's native endpoints for advanced use cases:
# Complex conversation flows
import requests
# Start conversation
response = requests.post("http://localhost:40114/olla/sglang/generate", json={
"text": "System: You are a helpful assistant.\nUser: Hello",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 100}
})
# Extend conversation
conversation_id = response.json()["meta"]["rid"]
requests.post("http://localhost:40114/olla/sglang/extend", json={
"rid": conversation_id,
"text": "\nUser: Tell me about AI",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 200}
})
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:40114/olla/sglang/v1",
api_key="not-needed" # SGLang doesn't require API keys
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "Hello!"}
]
)
SGLang Python Client¶
import sglang as sgl

# Connect to SGLang server via Olla
sgl.set_default_backend(sgl.RuntimeEndpoint(
    "http://localhost:40114/olla/sglang"
))

@sgl.function
def multi_turn_question(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

state = multi_turn_question.run(
    question1="What is SGLang?",
    question2="How does RadixAttention work?"
)
LangChain¶
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_base="http://localhost:40114/olla/sglang/v1",
openai_api_key="dummy",
model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
temperature=0.7
)
LlamaIndex¶
from llama_index.llms import OpenAI
llm = OpenAI(
api_base="http://localhost:40114/olla/sglang/v1",
api_key="dummy",
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
)
Advanced Features¶
RadixAttention Configuration¶
# Custom RadixAttention settings
features:
  radix_attention:
    enabled: true
    cache_size_ratio: 0.4   # 40% GPU memory for cache
    max_tree_depth: 64      # Maximum prefix tree depth
    eviction_policy: "lru"  # Least recently used eviction
Speculative Decoding Setup¶
features:
  speculative_decoding:
    enabled: true
    draft_model_ratio: 0.5     # Draft model size ratio
    acceptance_threshold: 0.8  # Token acceptance threshold
Multimodal Configuration¶
features:
  multimodal:
    enabled: true
    max_image_resolution: 1024  # Maximum image size
    supported_formats:
      - jpeg
      - png
      - webp
Performance Comparison¶
SGLang vs vLLM Benchmarks¶
Metric | SGLang | vLLM | Improvement |
---|---|---|---|
Throughput (req/s) | 250 | 180 | +39% |
Memory Usage | -15% | baseline | 15% less |
Cache Hit Rate | 90% | 65% | +38% |
TTFT (ms) | 45 | 65 | -31% |
Max Concurrent | 150 | 100 | +50% |
Results with Llama-3.1-8B on A100 80GB
Next Steps¶
- Profile Configuration - Customise SGLang behaviour
- Model Unification - Understand model management
- Load Balancing - Scale with multiple SGLang instances
- Monitoring - Set up Prometheus monitoring
- Frontend Language Guide - Learn SGLang programming