Lemonade SDK API (via Proxy)

Olla proxies requests to Lemonade SDK backends under the /olla/lemonade/ prefix. All requests are forwarded to the configured Lemonade backend with minimal modification.

What this document covers:

  • How Olla routes requests to Lemonade SDK
  • The endpoints Olla handles specially (/v1/models and /api/v1/models)
  • Standard Olla response headers
  • Brief reference to Lemonade SDK endpoints

What this document does NOT cover:

  • Detailed Lemonade SDK API specifications (see Lemonade SDK Documentation)
  • Backend-only features like model lifecycle management
  • Hardware detection and system information

Proxy Routing

Olla forwards all requests under /olla/lemonade/ to the configured Lemonade SDK backend:

/olla/lemonade/** → http://backend:8000/**

Examples:

  • /olla/lemonade/api/v1/chat/completions → http://backend:8000/api/v1/chat/completions
  • /olla/lemonade/api/v1/health → http://backend:8000/api/v1/health
  • /olla/lemonade/api/v1/system-info → http://backend:8000/api/v1/system-info

Requests are forwarded transparently with:

  • Original request body preserved
  • Original headers forwarded
  • Olla headers added to response
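
To sanity-check the mapping, you can compare a proxied response with a direct backend response. A minimal sketch, assuming Olla on localhost:40114 and a Lemonade backend on localhost:8000 (and a bash shell for process substitution):

# The proxied body should be identical to the direct backend body
diff \
  <(curl -s http://localhost:40114/olla/lemonade/api/v1/system-info) \
  <(curl -s http://localhost:8000/api/v1/system-info)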

Model Discovery (Special Handling)

GET /olla/lemonade/v1/models

Special handling: Olla intercepts this endpoint and returns models in OpenAI format from its unified catalogue.

curl http://localhost:40114/olla/lemonade/v1/models

Response (OpenAI format):

{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-0.5B-Instruct-CPU",
      "object": "model",
      "created": 1759361710,
      "owned_by": "lemonade",
      "checkpoint": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
      "recipe": "oga-cpu"
    },
    {
      "id": "Phi-3.5-mini-instruct-NPU",
      "object": "model",
      "created": 1759361710,
      "owned_by": "lemonade",
      "checkpoint": "amd/Phi-3.5-mini-instruct-onnx-npu",
      "recipe": "oga-npu"
    }
  ]
}

Extended Fields (Lemonade-specific):

  • checkpoint: HuggingFace model path or GGUF filename
  • recipe: Inference engine (oga-cpu, oga-npu, oga-igpu, llamacpp, flm)
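
These fields can be used to filter the catalogue by inference engine. A minimal sketch, assuming jq is available:

# List only models served via the NPU recipe
curl -s http://localhost:40114/olla/lemonade/v1/models \
  | jq -r '.data[] | select(.recipe == "oga-npu") | .id'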

GET /olla/lemonade/api/v1/models

Special handling: identical to /v1/models; returns the same OpenAI-format response from the unified catalogue.

curl http://localhost:40114/olla/lemonade/api/v1/models

Response format identical to /v1/models above.


Lemonade SDK Endpoints (Proxied)

The following endpoints are provided by Lemonade SDK. Olla forwards requests transparently.

For detailed request/response schemas, see the Lemonade SDK documentation.

Health & System

Method  Endpoint                            Description                          Handled By
GET     /olla/lemonade/api/v1/health        Health check with loaded model info  Lemonade SDK
GET     /olla/lemonade/api/v1/stats         Runtime statistics                   Lemonade SDK
GET     /olla/lemonade/api/v1/system-info   Hardware detection and capabilities  Lemonade SDK

Example:

# Health check (proxied to backend)
curl http://localhost:40114/olla/lemonade/api/v1/health

# Response from Lemonade SDK
{
  "status": "ok",
  "checkpoint_loaded": "amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx",
  "model_loaded": "Qwen2.5-0.5B-Instruct-CPU"
}
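
For scripting, the health endpoint pairs well with curl's -f flag, which makes curl fail on non-2xx responses. A minimal sketch:

# Block until the Lemonade backend reports healthy
until curl -sf http://localhost:40114/olla/lemonade/api/v1/health > /dev/null; do
  echo "waiting for Lemonade backend..."
  sleep 2
done
echo "backend is up"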

Inference (OpenAI-Compatible)

Method  Endpoint                                 Description          Handled By
POST    /olla/lemonade/api/v1/chat/completions   Chat completion      Lemonade SDK
POST    /olla/lemonade/api/v1/completions        Text completion      Lemonade SDK
POST    /olla/lemonade/api/v1/embeddings         Generate embeddings  Lemonade SDK

Example:

# Chat completion (proxied to backend)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-0.5B-Instruct-CPU",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'

Request format follows OpenAI's chat completion API. See OpenAI API Reference for details.
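
Streaming requests pass through the proxy unchanged. A hedged sketch, assuming the backend honours the OpenAI stream parameter:

# Stream the response as server-sent events (-N disables curl buffering)
curl -N -X POST http://localhost:40114/olla/lemonade/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-0.5B-Instruct-CPU",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'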

Model Lifecycle

Method  Endpoint                       Description                              Handled By
POST    /olla/lemonade/api/v1/pull     Download/install model from HuggingFace  Lemonade SDK
POST    /olla/lemonade/api/v1/load     Load model into memory                   Lemonade SDK
POST    /olla/lemonade/api/v1/unload   Unload model from memory                 Lemonade SDK
POST    /olla/lemonade/api/v1/delete   Delete model from disk                   Lemonade SDK

Example:

# Load model (proxied to backend)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/load \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

# Response from Lemonade SDK
{
  "status": "success",
  "message": "Model loaded successfully",
  "model": "Phi-3.5-mini-instruct-NPU"
}

Note: Olla does NOT manage model lifecycle; these operations are handled entirely by the Lemonade SDK backend.
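
For completeness, a sketch of a pull-then-load sequence. The {"model": ...} body for /pull is an assumption modelled on the /load example above; check the Lemonade SDK documentation for the exact schema:

# Download a model, then load it (request body for /pull is assumed)
curl -X POST http://localhost:40114/olla/lemonade/api/v1/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'

curl -X POST http://localhost:40114/olla/lemonade/api/v1/load \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3.5-mini-instruct-NPU"}'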


Request Headers (Added by Olla)

Olla adds the following header to proxied requests:

Header             Value       Purpose
X-Olla-Request-ID  req-{uuid}  Request tracing

Example:

POST /api/v1/chat/completions HTTP/1.1
Host: backend:8000
Content-Type: application/json
X-Olla-Request-ID: req-abc123

Response Headers (Added by Olla)

Olla adds the following headers to all responses from Lemonade SDK:

Header                Example                    Description
X-Olla-Endpoint       lemonade-npu               Backend name that processed the request
X-Olla-Model          Qwen2.5-0.5B-Instruct-CPU  Model identifier used
X-Olla-Backend-Type   lemonade                   Provider type
X-Olla-Response-Time  234ms                      Total processing time
X-Olla-Request-ID     req-abc123                 Request identifier

Example Response:

HTTP/1.1 200 OK
Content-Type: application/json
X-Olla-Endpoint: lemonade-npu
X-Olla-Model: Phi-3.5-mini-instruct-NPU
X-Olla-Backend-Type: lemonade
X-Olla-Response-Time: 156ms
X-Olla-Request-ID: req-abc123

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "created": 1759361710,
  "model": "Phi-3.5-mini-instruct-NPU",
  "choices": [...]
}

Use cases:

  • Debugging: Identify which backend processed the request
  • Monitoring: Track response times across backends
  • Auditing: Trace requests through logs
  • Load balancing verification: Confirm routing behaviour
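
To inspect these headers from the command line, one option is to dump headers while discarding the body:

# Print only the X-Olla-* response headers
curl -s -D - -o /dev/null \
  -X POST http://localhost:40114/olla/lemonade/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5-0.5B-Instruct-CPU", "messages": [{"role": "user", "content": "Hi"}]}' \
  | grep -i '^x-olla'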

Recipe and Checkpoint Metadata

Lemonade SDK uses recipes to identify inference engines. Olla preserves this metadata in model listings.

Recipe Types

Recipe    Engine                Format   Hardware
oga-cpu   ONNX Runtime          ONNX     CPU
oga-npu   ONNX Runtime          ONNX     AMD Ryzen AI NPU
oga-igpu  ONNX Runtime          ONNX     DirectML iGPU
llamacpp  llama.cpp             GGUF     CPU/GPU
flm       Fast Language Models  Various  CPU/GPU

Checkpoint Format

Checkpoints identify HuggingFace model paths or GGUF filenames:

ONNX models:

amd/Qwen2.5-0.5B-Instruct-quantized_int4-float16-cpu-onnx
amd/Phi-3.5-mini-instruct-onnx-npu

GGUF models:

Llama-3.2-3B-Instruct-Q4_K_M.gguf

Olla extracts and preserves this metadata in the unified catalogue but does not use it for routing decisions.
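
To view this metadata at a glance, a small sketch assuming jq:

# Tabulate model id, recipe, and checkpoint from the unified catalogue
curl -s http://localhost:40114/olla/lemonade/v1/models \
  | jq -r '.data[] | [.id, .recipe, .checkpoint] | @tsv'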


Configuration Example

Basic endpoint configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-lemonade"
        type: "lemonade"
        priority: 85
        model_url: "/api/v1/models"
        health_check_url: "/api/v1/health"
        check_interval: 5s
        check_timeout: 2s

Fields:

  • url: Lemonade SDK server address
  • type: Must be "lemonade" to use Lemonade profile
  • model_url: Endpoint for model discovery (default: /api/v1/models)
  • health_check_url: Health check endpoint (default: /api/v1/health)

See the Lemonade Integration Guide for detailed configuration examples.


Error Responses

Backend Unavailable

{
  "error": {
    "message": "No healthy endpoint available for provider: lemonade",
    "type": "endpoint_unavailable",
    "code": 503
  }
}

Cause: All Lemonade backends are unhealthy or unreachable.

Solution: Check backend health at /internal/status/endpoints.

Circuit Breaker Open

{
  "error": {
    "message": "Circuit breaker open for endpoint: lemonade-npu",
    "type": "circuit_breaker",
    "code": 503
  }
}

Cause: Backend failed too many health checks and was marked unhealthy.

Solution: Wait for automatic recovery (default: 30s) or fix backend issues.

Model Not Loaded

{
  "error": {
    "message": "Model not loaded",
    "type": "model_not_loaded",
    "code": 400
  }
}

Cause: Lemonade SDK backend has no model loaded.

Solution: Load a model via the /api/v1/load endpoint (see the Model Lifecycle example above).


Monitoring

Health Check

# Olla's health endpoint (includes all backends)
curl http://localhost:40114/internal/health

Response:

{
  "status": "healthy",
  "endpoints": {
    "lemonade-npu": {
      "url": "http://npu-server:8000",
      "healthy": true,
      "last_check": "2025-01-15T10:30:00Z"
    }
  }
}
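
A sketch that flags unhealthy backends from this response, assuming jq and the schema shown above:

# Print the names of any endpoints reporting healthy: false
curl -s http://localhost:40114/internal/health \
  | jq -r '.endpoints | to_entries[] | select(.value.healthy == false) | .key'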

Endpoint Status

# Detailed endpoint status
curl http://localhost:40114/internal/status/endpoints

Shows health, request counts, error rates, and circuit breaker state for each Lemonade backend.

Model Catalogue

# Unified model catalogue
curl http://localhost:40114/internal/status/models

Lists models from all Lemonade backends alongside other providers.


Next Steps