SGLang Integration¶
Home | github.com/sgl-project/sglang |
---|---|
Since | Olla v0.1.0 |
Type | sglang (use in endpoint configuration) |
Profile | sglang.yaml (see latest) |
Prefixes | sglang (routes served under /olla/sglang/) |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add SGLang to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:30000"
        name: "local-sglang"
        type: "sglang"
        priority: 85
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
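With Olla running, confirm that requests reach SGLang by listing models through the sglang route prefix (port 40114 is the Olla port used throughout the examples on this page):
curl http://localhost:40114/olla/sglang/v1/models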
Production Setup¶
Configure SGLang for high-throughput production:
discovery:
  static:
    endpoints:
      - url: "http://gpu-server:30000"
        name: "sglang-prod"
        type: "sglang"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s
      - url: "http://gpu-server:30001"
        name: "sglang-prod-2"
        type: "sglang"
        priority: 100
        health_check_url: "/health"
        check_interval: 10s
        check_timeout: 5s

proxy:
  engine: "olla"                      # Use high-performance engine
  load_balancer: "least-connections"
Endpoints Supported¶
The following endpoints are supported by the SGLang integration profile:
Path | Description |
---|---|
/health | Health Check (SGLang-specific) |
/metrics | Prometheus Metrics |
/version | SGLang Version Information |
/v1/models | List Models (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/v1/completions | Text Completions (OpenAI format) |
/v1/embeddings | Embeddings/Pooling API |
/generate | SGLang Native Generation (Frontend Language) |
/batch | Batch Processing (Frontend Language) |
/extend | Conversation Extension (Frontend Language) |
/v1/chat/completions/vision | Vision Chat Completions (Multimodal) |
Usage Examples¶
Chat Completion¶
curl -X POST http://localhost:40114/olla/sglang/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain SGLang RadixAttention in simple terms"}
],
"temperature": 0.7,
"max_tokens": 500
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/sglang/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"messages": [
{"role": "user", "content": "Write a story about efficient AI inference"}
],
"stream": true,
"temperature": 0.8
}'
Frontend Language Generation¶
# SGLang native generation endpoint
curl -X POST http://localhost:40114/olla/sglang/generate \
-H "Content-Type: application/json" \
-d '{
"text": "def fibonacci(n):",
"sampling_params": {
"temperature": 0.3,
"max_new_tokens": 200
}
}'
Batch Processing¶
curl -X POST http://localhost:40114/olla/sglang/batch \
-H "Content-Type: application/json" \
-d '{
"requests": [
{
"text": "Translate to French: Hello world",
"sampling_params": {"temperature": 0.1, "max_new_tokens": 50}
},
{
"text": "Translate to Spanish: Hello world",
"sampling_params": {"temperature": 0.1, "max_new_tokens": 50}
}
]
}'
Vision Chat Completions¶
curl -X POST http://localhost:40114/olla/sglang/v1/chat/completions/vision \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What do you see in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
],
"temperature": 0.7,
"max_tokens": 300
}'
Conversation Extension¶
curl -X POST http://localhost:40114/olla/sglang/extend \
-H "Content-Type: application/json" \
-d '{
"rid": "conversation-123",
"text": "Continue the previous discussion about machine learning",
"sampling_params": {
"temperature": 0.7,
"max_new_tokens": 150
}
}'
Metrics Access¶
# Get Prometheus metrics
curl http://localhost:40114/olla/sglang/metrics
# Check health status
curl http://localhost:40114/olla/sglang/health
# Get version information
curl http://localhost:40114/olla/sglang/version
SGLang Specifics¶
High-Performance Features¶
SGLang includes several optimisations beyond standard inference:
- RadixAttention: Tree-based prefix caching more advanced than PagedAttention
- Speculative Decoding: Enhanced performance through speculative execution
- Frontend Language: Flexible programming interface for LLM applications
- Disaggregation: Separate prefill and decode phases for efficiency
- Enhanced Multimodal: Advanced vision and multimodal capabilities
RadixAttention vs PagedAttention¶
SGLang's RadixAttention provides superior memory efficiency compared to vLLM's PagedAttention:
Feature | PagedAttention (vLLM) | RadixAttention (SGLang) |
---|---|---|
Memory Structure | Block-based | Tree-based |
Prefix Sharing | Limited | Advanced |
Cache Hit Rate | ~60-70% | ~85-95% |
Memory Efficiency | Good | Excellent |
Complex Conversations | Standard | Optimised |
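To make the comparison concrete, the following toy sketch (plain Python, not SGLang's implementation) shows the idea behind tree-based prefix caching: requests that share a prompt prefix walk the same branch of a token tree and reuse the entries already cached along it.
# Toy illustration only: a prefix tree where cached nodes stand in for KV entries.
class RadixNode:
    def __init__(self):
        self.children = {}   # token -> RadixNode
        self.cached = False  # True once this prefix has been "computed"

class ToyPrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def process(self, tokens):
        """Walk the tree and count how many prefix tokens were already cached."""
        node, reused = self.root, 0
        for tok in tokens:
            node = node.children.setdefault(tok, RadixNode())
            if node.cached:
                reused += 1
            else:
                node.cached = True  # pretend the KV entry is computed and stored
        return reused

cache = ToyPrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
print(cache.process(system_prompt + ["Hello"]))              # 0 tokens reused (cold cache)
print(cache.process(system_prompt + ["Tell", "a", "joke"]))  # 6 tokens reused (shared prefix)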
Resource Configuration¶
The SGLang profile includes GPU-optimised settings with enhanced efficiency:
characteristics:
  timeout: 2m
  max_concurrent_requests: 150  # Higher than vLLM (100)
  streaming_support: true

resources:
  defaults:
    requires_gpu: true
    min_gpu_memory_gb: 6
Memory Requirements¶
SGLang requires less memory than vLLM due to RadixAttention efficiency:
Model Size | GPU Memory Required | Recommended | Max Concurrent |
---|---|---|---|
70B | 120GB | 140GB | 15 |
34B | 60GB | 70GB | 30 |
13B | 25GB | 35GB | 75 |
7B | 14GB | 20GB | 150 |
3B | 6GB | 10GB | 150 |
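If you want a custom profile to reflect these tiers explicitly, they can be expressed with the concurrency_limits fields shown in the customisation example later on this page. This is a sketch only: the memory thresholds are illustrative cut-offs, not the profile's actual defaults.
resources:
  concurrency_limits:
    - min_memory_gb: 100   # 70B-class models
      max_concurrent: 15
    - min_memory_gb: 50    # 34B-class models
      max_concurrent: 30
    - min_memory_gb: 20    # 13B-class models
      max_concurrent: 75
    - min_memory_gb: 0     # 7B and smaller
      max_concurrent: 150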
Model Naming¶
SGLang uses full HuggingFace model names, just as vLLM does:
meta-llama/Meta-Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.2
llava-hf/llava-1.5-7b-hf
Starting SGLang Server¶
Basic Start¶
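Launch SGLang with its defaults, exposing it on the port used throughout this page:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 30000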
Production Configuration¶
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--tp-size 4 \
--mem-fraction-static 0.85 \
--max-running-requests 150 \
--port 30000 \
--host 0.0.0.0 \
--enable-flashinfer
Docker Deployment¶
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 30000:30000 \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
With Speculative Decoding¶
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--draft-model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-draft-length 4 \
--port 30000
Vision Model Setup¶
python -m sglang.launch_server \
--model-path llava-hf/llava-1.5-13b-hf \
--tokenizer-path llava-hf/llava-1.5-13b-hf \
--chat-template llava \
--port 30000
Profile Customisation¶
To customise SGLang behaviour, create config/profiles/sglang-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: sglang
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - sglang
    - radix  # Add custom prefix for RadixAttention

# Adjust for larger models
characteristics:
  timeout: 5m                   # Increase for 70B models
  max_concurrent_requests: 200  # Leverage SGLang's efficiency

# Modify concurrency limits
resources:
  concurrency_limits:
    - min_memory_gb: 100
      max_concurrent: 20  # Higher than vLLM due to efficiency
    - min_memory_gb: 50
      max_concurrent: 40  # Take advantage of RadixAttention

# Enable SGLang-specific features
features:
  radix_attention:
    enabled: true
  speculative_decoding:
    enabled: true
  frontend_language:
    enabled: true
See Profile Configuration for complete customisation options.
Monitoring¶
Prometheus Metrics¶
SGLang exposes detailed metrics at /metrics:
# Example Prometheus configuration
scrape_configs:
  - job_name: 'sglang'
    static_configs:
      - targets: ['localhost:40114']
    metrics_path: '/olla/sglang/metrics'
Key SGLang-specific metrics include:
- sglang:num_requests_running - Active requests
- sglang:num_requests_waiting - Queued requests
- sglang:radix_cache_usage_perc - RadixAttention cache utilisation
- sglang:radix_cache_hit_rate - Cache hit efficiency
- sglang:time_to_first_token_seconds - TTFT latency
- sglang:spec_decode_num_accepted_tokens_total - Speculative decoding stats
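As an illustration, the hit-rate metric can drive an alert using standard Prometheus alerting rule syntax. This is a sketch: it assumes the metric is exported as a 0-1 ratio and reuses the 80% threshold from the troubleshooting section below.
groups:
  - name: sglang-olla
    rules:
      - alert: SGLangLowRadixCacheHitRate
        expr: sglang:radix_cache_hit_rate < 0.8  # assumes a 0-1 ratio; adjust if exported as a percentage
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RadixAttention cache hit rate below 80%"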
Health Monitoring¶
# Check health endpoint
curl http://localhost:40114/olla/sglang/health
# Response when healthy
{"status": "healthy", "radix_cache_ready": true}
# Response when unhealthy
{"status": "unhealthy", "reason": "model not loaded"}
Troubleshooting¶
Out of Memory (OOM)¶
Issue: CUDA out of memory errors
Solution:
1. Reduce --mem-fraction-static (default 0.9)
2. Decrease --max-running-requests
3. Use quantisation with --quantization fp8 or --quantization int4
4. Enable tensor parallelism for multi-GPU
Low Cache Hit Rate¶
Issue: RadixAttention cache hit rate below 80%
Solution:
- Enable longer context retention
- Increase RadixAttention cache size
- Use consistent prompt formats
- Monitor prefix sharing patterns (see the check below)
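To check the current hit rate, filter the Prometheus metrics exposed through Olla:
curl -s http://localhost:40114/olla/sglang/metrics | grep radix_cache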
Frontend Language Errors¶
Issue: /generate or /batch endpoints failing
Solution:
# Ensure SGLang server started with Frontend Language support
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-frontend-language \
--port 30000
Connection Timeout¶
Issue: Requests timeout during model loading
Solution: Increase timeout in profile:
characteristics:
  timeout: 10m  # Increase for initial model load

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true
High Queue Wait Times¶
Issue: Requests queuing with "num_requests_waiting" high
Solution:
- Add more SGLang instances
- Use load balancing across multiple servers
- Increase --max-running-requests (default 1024)
- Enable disaggregation for better resource utilisation
Best Practices¶
1. Optimise RadixAttention¶
# Enable advanced prefix caching (reserve 40% of GPU memory for the radix cache)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --radix-cache-size 0.4 \
  --enable-flashinfer
2. Configure Tensor Parallelism¶
For models requiring multiple GPUs, shard the model with the same --tp-size flag used in the production configuration above:
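# Set --tp-size to the number of GPUs available
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --port 30000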
3. Enable Speculative Decoding¶
For maximum performance with compatible models:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--draft-model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-draft-length 4
4. Monitor and Scale¶
# Multiple SGLang instances
discovery:
static:
endpoints:
- url: "http://gpu1:30000"
name: "sglang-1"
type: "sglang"
priority: 100
- url: "http://gpu2:30000"
name: "sglang-2"
type: "sglang"
priority: 100
proxy:
load_balancer: "least-connections"
5. Leverage Frontend Language¶
Use SGLang's native endpoints for advanced use cases:
# Complex conversation flows
import requests
# Start conversation
response = requests.post("http://localhost:40114/olla/sglang/generate", json={
"text": "System: You are a helpful assistant.\nUser: Hello",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 100}
})
# Extend conversation
conversation_id = response.json()["meta"]["rid"]
requests.post("http://localhost:40114/olla/sglang/extend", json={
"rid": conversation_id,
"text": "\nUser: Tell me about AI",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 200}
})
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:40114/olla/sglang/v1",
api_key="not-needed" # SGLang doesn't require API keys
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "Hello!"}
]
)
SGLang Python Client¶
import sglang as sgl

# Connect to SGLang server via Olla
sgl.set_default_backend(sgl.RuntimeEndpoint(
    "http://localhost:40114/olla/sglang"
))

@sgl.function
def multi_turn_question(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

state = multi_turn_question.run(
    question1="What is SGLang?",
    question2="How does RadixAttention work?"
)
LangChain¶
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_base="http://localhost:40114/olla/sglang/v1",
openai_api_key="dummy",
model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
temperature=0.7
)
LlamaIndex¶
from llama_index.llms import OpenAI
llm = OpenAI(
api_base="http://localhost:40114/olla/sglang/v1",
api_key="dummy",
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
)
Advanced Features¶
RadixAttention Configuration¶
# Custom RadixAttention settings
features:
  radix_attention:
    enabled: true
    cache_size_ratio: 0.4   # 40% GPU memory for cache
    max_tree_depth: 64      # Maximum prefix tree depth
    eviction_policy: "lru"  # Least recently used eviction
Speculative Decoding Setup¶
features:
  speculative_decoding:
    enabled: true
    draft_model_ratio: 0.5     # Draft model size ratio
    acceptance_threshold: 0.8  # Token acceptance threshold
Multimodal Configuration¶
features:
  multimodal:
    enabled: true
    max_image_resolution: 1024  # Maximum image size
    supported_formats:
      - jpeg
      - png
      - webp
Performance Comparison¶
SGLang vs vLLM Benchmarks¶
Metric | SGLang | vLLM | Improvement |
---|---|---|---|
Throughput (req/s) | 250 | 180 | +39% |
Memory Usage | -15% | baseline | 15% less |
Cache Hit Rate | 90% | 65% | +38% |
TTFT (ms) | 45 | 65 | -31% |
Max Concurrent | 150 | 100 | +50% |
Results with Llama-3.1-8B on A100 80GB
Next Steps¶
- Profile Configuration - Customise SGLang behaviour
- Model Unification - Understand model management
- Load Balancing - Scale with multiple SGLang instances
- Monitoring - Set up Prometheus monitoring
- Frontend Language Guide - Learn SGLang programming