
Lemonade SDK Proxy Support

Home: github.com/lemonade-sdk/lemonade
Since: Olla v0.0.19
Type: lemonade (use in endpoint configuration)
Profile: lemonade.yaml (see latest)
Features
  • Transparent Proxy Forwarding
  • Model Discovery & Unification
  • Load Balancing Across Instances
  • Health Monitoring
  • OpenAI Format Conversion
  • Recipe & Checkpoint Metadata Preservation
Backend Features (Proxied)
  • AMD Hardware Acceleration (NPU, iGPU, CPU)
  • Multi-Recipe Support (ONNX & GGUF)
  • Model Lifecycle Management
  • System Information & Statistics
Prefixes: /olla/lemonade/
Docs: Lemonade SDK Official Documentation

Overview

Olla provides proxy support for Lemonade SDK backends, enabling:

  • Transparent Request Forwarding: All requests to /olla/lemonade/* are forwarded to configured Lemonade SDK backends
  • Load Balancing: Distribute requests across multiple Lemonade instances
  • Health Monitoring: Automatic health checks and circuit breaking for unhealthy backends
  • Model Unification: Lemonade models appear in Olla's unified model catalogue alongside other providers
  • OpenAI Compatibility: /v1/models endpoint returns OpenAI-format responses

What Olla Provides:

  • Routing and load balancing to Lemonade backends
  • Model discovery and catalogue integration
  • Health checks and availability monitoring
  • Format conversion for model listings

What Lemonade SDK Provides:

  • Actual LLM inference with hardware optimisation
  • Model lifecycle management (pull, load, unload, delete)
  • AMD-specific acceleration (NPU, iGPU, CPU)
  • System information and runtime statistics

Olla acts as a transparent proxy: it forwards requests to Lemonade SDK and adds operational features such as load balancing and health monitoring.

Configuration

Basic Setup

Configure a single Lemonade SDK endpoint:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-lemonade"
        type: "lemonade"
        priority: 85
        model_url: "/api/v1/models"
        health_check_url: "/api/v1/health"
        check_interval: 5s
        check_timeout: 2s

Configuration Fields:

  • url: Lemonade SDK server address (default: http://localhost:8000)
  • type: Must be "lemonade" to use the Lemonade profile
  • priority: Routing priority (0-100, higher = preferred)
  • model_url: Endpoint for model discovery (default: /api/v1/models)
  • health_check_url: Health check endpoint (default: /api/v1/health)

Production Setup: Multiple Instances

Load balance across multiple Lemonade SDK instances:

discovery:
  static:
    endpoints:
      # Primary NPU-optimised instance
      - url: "http://npu-server:8000"
        name: "lemonade-npu"
        type: "lemonade"
        priority: 100
        tags:
          hardware: npu
          location: office

      # Secondary GPU-optimised instance
      - url: "http://gpu-server:8000"
        name: "lemonade-gpu"
        type: "lemonade"
        priority: 90
        tags:
          hardware: igpu
          location: office

      # Fallback CPU instance
      - url: "http://cpu-server:8000"
        name: "lemonade-cpu"
        type: "lemonade"
        priority: 75
        tags:
          hardware: cpu
          location: backup

Olla will:

  • Route requests to the highest-priority healthy backend
  • Automatically failover if a backend becomes unhealthy
  • Periodically check health and restore failed backends
  • Provide a unified model catalogue across all instances
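You can observe this behaviour from the client side. A minimal sketch using Python's requests library (an assumption; any HTTP client works) that tallies which backend serves each request via the X-Olla-Endpoint response header:

import requests
from collections import Counter

# Tally which backend serves each request. With priority-based balancing,
# every request should hit the highest-priority healthy endpoint until
# Olla fails over to the next one.
served = Counter()
for _ in range(10):
    resp = requests.get("http://localhost:40114/olla/lemonade/api/v1/models")
    served[resp.headers.get("X-Olla-Endpoint", "unknown")] += 1

print(served)  # e.g. Counter({'lemonade-npu': 10})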

How It Works

Request Flow

Client Request → Olla Proxy → Lemonade SDK Backend
                 (load balancing, health checking, header injection)
Client ← Response ← Lemonade SDK

Step-by-Step:

  1. Client sends a request to /olla/lemonade/api/v1/chat/completions
  2. Olla selects a backend based on:
     • Backend health status (circuit breaker)
     • Priority configuration
     • Load balancing strategy
  3. The request is forwarded to http://backend:8000/api/v1/chat/completions
  4. Lemonade processes the inference request
  5. Olla adds headers to the response:
     • X-Olla-Endpoint: backend name used
     • X-Olla-Model: model identifier
     • X-Olla-Backend-Type: lemonade
     • X-Olla-Response-Time: processing time
  6. The response is returned to the client
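This flow can be exercised end to end in a few lines. A minimal sketch with Python's requests library (an assumption; any HTTP client works); the model name, port, and headers follow the examples in this guide:

import requests

# Send a chat completion through Olla's Lemonade prefix
resp = requests.post(
    "http://localhost:40114/olla/lemonade/api/v1/chat/completions",
    json={
        "model": "Qwen2.5-0.5B-Instruct-CPU",
        "messages": [{"role": "user", "content": "Hi"}],
        "max_tokens": 50,
    },
    timeout=300,  # generous: the first request may trigger model loading
)

# Inspect the headers Olla injected in step 5
for name in ("X-Olla-Endpoint", "X-Olla-Model",
             "X-Olla-Backend-Type", "X-Olla-Response-Time"):
    print(name, "=", resp.headers.get(name))

print(resp.json()["choices"][0]["message"]["content"])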

Model Discovery

Olla periodically queries /api/v1/models from each Lemonade backend:

// Lemonade SDK Response
{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-0.5B-Instruct-CPU",
      "checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
      "recipe": "oga-cpu",
      "created": 1759361710,
      "owned_by": "lemonade"
    }
  ]
}

Olla extracts:

  • Model ID: Used for routing requests
  • Checkpoint: HuggingFace model path (preserved in metadata)
  • Recipe: Inference engine (oga-cpu, oga-npu, llamacpp, etc.)
  • Format: Inferred from recipe (ONNX for oga-*, GGUF for llamacpp)

Unified Catalogue:

Models from all Lemonade backends appear in Olla's unified catalogue at /internal/status/models alongside models from Ollama, OpenAI, and other providers.
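To inspect the merged catalogue programmatically, a short sketch (only the URL is taken from this guide; the response schema varies between Olla versions, so the JSON is printed raw rather than parsed):

import requests

# Fetch Olla's unified model catalogue (all providers, not just Lemonade)
resp = requests.get("http://localhost:40114/internal/status/models")
resp.raise_for_status()

# Schema is version-dependent: print it raw, then filter on whatever
# provider/metadata fields your version exposes
print(resp.json())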

What Olla Adds vs What Lemonade Provides

Feature                 Olla   Lemonade SDK
LLM Inference                  ✅ Hardware-optimised
Model Loading                  ✅ Dynamic memory management
Hardware Detection             ✅ NPU/iGPU/CPU detection
Load Balancing          ✅
Health Monitoring       ✅
Multi-Backend Routing   ✅
Unified Catalogue       ✅
Failover                ✅

Model Format Support

Lemonade SDK uses recipes to identify inference engines and model formats.

Recipe System

Recipe     Engine                 Format    Hardware   Description
oga-cpu    ONNX Runtime           ONNX      CPU        CPU-optimised inference
oga-npu    ONNX Runtime           ONNX      NPU        AMD Ryzen AI NPU acceleration
oga-igpu   ONNX Runtime           ONNX      iGPU       DirectML iGPU acceleration
llamacpp   llama.cpp              GGUF      CPU/GPU    GGUF quantised models
flm        Fast Language Models   Various   CPU/GPU    Alternative inference engine

Checkpoint and Recipe Metadata

Olla preserves Lemonade-specific metadata:

{
  "id": "phi-3.5-mini-instruct-npu",
  "provider": "lemonade",
  "metadata": {
    "checkpoint": "amd/Phi-3.5-mini-instruct-onnx-npu",
    "recipe": "oga-npu",
    "format": "onnx"
  }
}

This metadata is:

  • Discovered during model polling
  • Stored in the unified registry
  • Returned in Lemonade-format responses

Format Detection

Olla infers the model format from the recipe, and model capabilities from name patterns defined in the profile:

# From lemonade.yaml profile
models:
  name_format: "{{.Name}}"  # Use friendly names
  capability_patterns:
    chat:
      - "*-Instruct-*"
      - "*-Chat-*"
    embeddings:
      - "*embed*"
    code:
      - "*code*"
      - "*Coder*"

Automatic Detection:

  • oga-* recipes → ONNX format
  • llamacpp recipe → GGUF format
  • Capabilities inferred from model name patterns
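The detection logic amounts to simple pattern matching. A minimal sketch of the rules above (an illustration only, not Olla's actual implementation):

import fnmatch

def infer_format(recipe: str) -> str:
    """Map a Lemonade recipe to a model format, per the rules above."""
    if recipe.startswith("oga-"):
        return "onnx"
    if recipe == "llamacpp":
        return "gguf"
    return "unknown"

# Patterns mirror the lemonade.yaml profile excerpt above
CAPABILITY_PATTERNS = {
    "chat": ["*-Instruct-*", "*-Chat-*"],
    "embeddings": ["*embed*"],
    "code": ["*code*", "*Coder*"],
}

def infer_capabilities(model_id: str) -> list[str]:
    """Match a model name against the profile's capability patterns."""
    return [cap for cap, patterns in CAPABILITY_PATTERNS.items()
            if any(fnmatch.fnmatch(model_id, p) for p in patterns)]

print(infer_format("oga-npu"))                          # onnx
print(infer_capabilities("Qwen2.5-0.5B-Instruct-CPU"))  # ['chat']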

Usage Examples

Chat Completion (OpenAI-Compatible)

curl -X POST http://localhost:40114/olla/lemonade/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-0.5B-Instruct-CPU",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is AMD Ryzen AI?"}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'

Response includes Olla headers:

X-Olla-Endpoint: lemonade-npu
X-Olla-Model: Qwen2.5-0.5B-Instruct-CPU
X-Olla-Backend-Type: lemonade
X-Olla-Response-Time: 234ms

Model Listing (OpenAI Format)

# OpenAI-compatible format
curl http://localhost:40114/olla/lemonade/v1/models

# Or Lemonade's native path
curl http://localhost:40114/olla/lemonade/api/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-0.5B-Instruct-CPU",
      "object": "model",
      "created": 1759361710,
      "owned_by": "lemonade",
      "checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
      "recipe": "oga-cpu"
    }
  ]
}

Accessing Lemonade SDK Features

All Lemonade SDK endpoints are accessible through the /olla/lemonade/ prefix:

# System information (proxied to backend)
curl http://localhost:40114/olla/lemonade/api/v1/system-info

# Runtime statistics (proxied to backend)
curl http://localhost:40114/olla/lemonade/api/v1/stats

# Model management (proxied to backend)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/load \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

These requests are forwarded transparently to the Lemonade SDK backend. Olla does not process or modify them beyond routing.

Lemonade SDK Features (Proxied)

The following features are provided by Lemonade SDK and accessed through Olla's proxy:

Model Lifecycle Management

# Download/install model (handled by Lemonade)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/pull \
  -d '{"checkpoint": "amd/Phi-3.5-mini-instruct-onnx-npu"}'

# Load model into memory (handled by Lemonade)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/load \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

# Unload from memory (handled by Lemonade)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/unload \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

# Delete from disk (handled by Lemonade)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/delete \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

Hardware Optimisation

Lemonade SDK automatically:

  • Detects available hardware (NPU, iGPU, CPU)
  • Selects appropriate recipe for each model
  • Optimises inference for AMD Ryzen AI platforms

Olla does not handle hardware detection; it simply routes requests to the backend.

System Information

curl http://localhost:40114/olla/lemonade/api/v1/system-info

Returns Lemonade SDK's hardware detection results, memory status, and available acceleration engines.

For detailed API documentation, see the Lemonade SDK API Reference or Lemonade SDK Documentation.

Profile Configuration

The Lemonade profile (config/profiles/lemonade.yaml) defines routing behaviour, model patterns, and capabilities.

Key Profile Sections

# Routing prefixes
routing:
  prefixes:
    - lemonade  # Accessible at /olla/lemonade/

# API compatibility
api:
  openai_compatible: true
  model_discovery_path: /api/v1/models
  health_check_path: /api/v1/health

# Model capability detection
models:
  name_format: "{{.Name}}"
  capability_patterns:
    chat:
      - "*-Instruct-*"
      - "*-Chat-*"
    embeddings:
      - "*embed*"
    code:
      - "*code*"
      - "*Coder*"

# Recipe to format mapping
features:
  backends:
    enabled: true
    supported_engines:
      - "oga-cpu"      # ONNX Runtime for CPU
      - "oga-npu"      # ONNX Runtime for NPU
      - "oga-igpu"     # ONNX Runtime for iGPU
      - "llamacpp"     # llama.cpp for GGUF
      - "flm"          # Fast Language Models

Customising the Profile

Create config/profiles/lemonade-custom.yaml:

name: lemonade
version: "1.0"

# Override default timeouts
characteristics:
  timeout: 5m  # Increase for large models
  max_concurrent_requests: 50

# Add custom routing prefixes
routing:
  prefixes:
    - lemonade
    - amd      # Custom prefix: /olla/amd/

# Custom capability patterns
models:
  capability_patterns:
    reasoning:
      - "*DeepSeek-R1*"
      - "*Cogito*"
    vision:
      - "*VL-*"
      - "*Scout*"

See Profile System for complete customisation options.

Monitoring

Health Checks

Olla monitors Lemonade backend health:

# Olla's health status (includes Lemonade backends)
curl http://localhost:40114/internal/health

Response:

{
  "status": "healthy",
  "endpoints": {
    "lemonade-npu": {
      "url": "http://npu-server:8000",
      "healthy": true,
      "last_check": "2025-01-15T10:30:00Z"
    }
  }
}

Endpoint Status

# View all endpoints
curl http://localhost:40114/internal/status/endpoints

Shows:

  • Health status per backend
  • Circuit breaker state
  • Request counts
  • Error rates

Model Availability

# Unified model catalogue
curl http://localhost:40114/internal/status/models

Lists models from all Lemonade backends alongside other providers.

Response Headers

Every proxied response includes Olla headers:

HTTP/1.1 200 OK
Content-Type: application/json
X-Olla-Endpoint: lemonade-npu
X-Olla-Model: Phi-3.5-mini-instruct-NPU
X-Olla-Backend-Type: lemonade
X-Olla-Response-Time: 156ms
X-Olla-Request-ID: req-abc123

Use these headers for:

  • Debugging: which backend processed the request
  • Monitoring: response time tracking
  • Auditing: request tracing

Load Balancing Strategies

Configure how Olla selects Lemonade backends:

Priority-Based (Default)

balancer:
  strategy: priority

endpoints:
  - url: "http://npu-server:8000"
    priority: 100  # Always prefer NPU
  - url: "http://cpu-server:8000"
    priority: 75   # Fallback to CPU

Always uses highest-priority healthy backend.

Round-Robin

balancer:
  strategy: round_robin

endpoints:
  - url: "http://server1:8000"
    priority: 100
  - url: "http://server2:8000"
    priority: 100  # Equal priority

Distributes requests evenly across backends.

Least Connections

balancer:
  strategy: least_connections

endpoints:
  - url: "http://server1:8000"
  - url: "http://server2:8000"

Routes to backend with fewest active connections.

See Load Balancing for detailed configuration.

Troubleshooting

Backend Unreachable

Issue: "failed to connect to backend"

Solution:

  1. Verify Lemonade SDK is running:

     curl http://localhost:8000/api/v1/health

  2. Check the endpoint configuration:

     endpoints:
       - url: "http://localhost:8000"  # Correct port?
         type: "lemonade"

  3. Review Olla logs:

     olla serve --log-level debug

Model Not Found

Issue: Model appears in /v1/models but inference fails

Solution:

  1. Check whether the model is loaded in Lemonade:

     curl http://localhost:8000/api/v1/health

  2. Load the model explicitly:

     curl -X POST http://localhost:40114/olla/lemonade/api/v1/load \
       -d '{"model": "Qwen2.5-0.5B-Instruct-CPU"}'

  3. Verify the model ID matches Lemonade's format (check the /api/v1/models response)

Circuit Breaker Triggered

Issue: "circuit breaker open for endpoint"

Solution:

  1. Check backend health:

     curl http://localhost:40114/internal/status/endpoints

  2. Wait for automatic recovery (default: 30s)

  3. To recover sooner, fix the backend; the circuit breaker closes after the next successful health check
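To script the recovery wait instead of checking by hand, a small sketch that polls Olla's health endpoint (the JSON shape follows the example under Health Checks above):

import time
import requests

# Poll until every backend reports healthy; shape follows the
# /internal/health example shown earlier in this guide
while True:
    health = requests.get("http://localhost:40114/internal/health",
                          timeout=5).json()
    unhealthy = [name for name, ep in health.get("endpoints", {}).items()
                 if not ep.get("healthy")]
    if not unhealthy:
        print("all backends healthy")
        break
    print("still waiting for:", ", ".join(unhealthy))
    time.sleep(10)  # roughly the configured check_interval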

Slow Responses

Issue: First request to a model is slow

Explanation: Lemonade SDK loads models on-demand. First request triggers loading.

Solutions:

  1. Pre-load models via the Lemonade API:

     curl -X POST http://localhost:8000/api/v1/load \
       -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

  2. Increase the response timeout in Olla's config:

     proxy:
       response_timeout: 300s  # 5 minutes

Best Practices

1. Use Model Unification

Enable unified model catalogue:

model_registry:
  enable_unifier: true
  unification:
    enabled: true

Benefits:

  • Single model catalogue across all backends
  • Automatic failover to equivalent models
  • Simplified client code

2. Configure Appropriate Timeouts

Lemonade models load on first request:

proxy:
  response_timeout: 300s  # Allow time for model loading
  connection_timeout: 30s

discovery:
  static:
    endpoints:
      - check_timeout: 5s   # Health check timeout
        check_interval: 10s # Check frequency

3. Tag Backends by Hardware

Use tags for hardware-specific routing:

endpoints:
  - url: "http://npu-server:8000"
    name: "lemonade-npu"
    tags:
      hardware: npu
      recipe: oga-npu
  - url: "http://cpu-server:8000"
    name: "lemonade-cpu"
    tags:
      hardware: cpu
      recipe: oga-cpu

4. Monitor Health Actively

# Set up periodic health monitoring
watch -n 5 'curl -s http://localhost:40114/internal/health | jq'

# Check endpoint status
curl http://localhost:40114/internal/status/endpoints

5. Pre-load Critical Models

Avoid first-request latency:

#!/bin/bash
# preload-models.sh
curl -X POST http://localhost:8000/api/v1/load \
  -d '{"model": "Qwen2.5-0.5B-Instruct-CPU"}'

curl -X POST http://localhost:8000/api/v1/load \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lemonade/api",
    api_key="not-needed"  # Lemonade doesn't require keys
)

response = client.chat.completions.create(
    model="Qwen2.5-0.5B-Instruct-CPU",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
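Note: the OpenAI client joins relative paths such as /chat/completions onto base_url, so the base must end at /api/v1 to line up with Lemonade's native routes shown earlier.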

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:40114/olla/lemonade/api/v1",
    model="Phi-3.5-mini-instruct-NPU",
    api_key="not-needed"
)

response = llm.invoke("What is AMD Ryzen AI?")

cURL Testing

# Test chat completion
curl -X POST http://localhost:40114/olla/lemonade/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-0.5B-Instruct-CPU",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 50
  }'

Next Steps