Skip to content

oMLX API

Proxy endpoints for oMLX inference servers running on Apple Silicon. Available through the /olla/omlx/ prefix.

oMLX is a multi-model server: a single instance hosts many models concurrently, loading them on demand and evicting the least-recently-used ones under memory pressure. It is OpenAI-compatible on the wire and additionally implements the Anthropic Messages API, a reranking endpoint, and an oMLX-specific model-status endpoint.

Endpoints Overview

Method URI Description
GET /olla/omlx/health Health check
GET /olla/omlx/v1/models List available models
GET /olla/omlx/v1/models/status Loaded-model residency state
POST /olla/omlx/v1/chat/completions Chat completion
POST /olla/omlx/v1/completions Text completion
POST /olla/omlx/v1/embeddings Generate embeddings
POST /olla/omlx/v1/rerank Rerank documents
POST /olla/omlx/v1/responses OpenAI Responses API

Anthropic-format requests are served through Olla's Anthropic endpoint (/olla/anthropic/v1/messages) in passthrough mode. See the Anthropic API Reference.


GET /olla/omlx/health

Check oMLX server health status.

Request

curl -X GET http://localhost:40114/olla/omlx/health

Response

{
  "status": "healthy"
}

GET /olla/omlx/v1/models

List the models the oMLX server has discovered. The id is the configured alias where one is set, otherwise the model's directory name. max_model_len reports the effective context window and is preserved by Olla during discovery.

Request

curl -X GET http://localhost:40114/olla/omlx/v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-7B-Instruct-4bit",
      "object": "model",
      "created": 1705334400,
      "owned_by": "omlx",
      "max_model_len": 32768
    }
  ]
}

GET /olla/omlx/v1/models/status

Report which models are currently resident in memory. This oMLX-specific endpoint is keyed by directory name (not alias) and is useful for understanding cold-start behaviour. Olla forwards it unchanged.

Request

curl -X GET http://localhost:40114/olla/omlx/v1/models/status

Response

{
  "model_count": 3,
  "loaded_count": 1,
  "models": [
    {
      "id": "Qwen2.5-7B-Instruct-4bit",
      "loaded": true,
      "is_loading": false,
      "pinned": true,
      "estimated_size": 4500000000,
      "last_access": 1705334400.0
    },
    {
      "id": "Llama-3.2-3B-Instruct-4bit",
      "loaded": false,
      "is_loading": false,
      "pinned": false,
      "estimated_size": 1800000000,
      "last_access": null
    }
  ]
}

POST /olla/omlx/v1/chat/completions

OpenAI-compatible chat completion. The first request for a model that is not resident triggers a load and may take several seconds.

Request

curl -X POST http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful AI assistant."
      },
      {
        "role": "user",
        "content": "What is MLX?"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 300,
    "stream": false
  }'

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1705334400,
  "model": "Qwen2.5-7B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "MLX is an array framework for machine learning on Apple Silicon, built by Apple's machine learning research team. It uses the unified memory architecture of M-series chips for efficient GPU-accelerated computation."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 40,
    "total_tokens": 65
  }
}

Streaming Response

When "stream": true:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"Qwen2.5-7B-Instruct-4bit","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"Qwen2.5-7B-Instruct-4bit","choices":[{"index":0,"delta":{"content":"MLX"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334401,"model":"Qwen2.5-7B-Instruct-4bit","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

POST /olla/omlx/v1/completions

Text completion.

Request

curl -X POST http://localhost:40114/olla/omlx/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "prompt": "Apple Silicon is designed for",
    "max_tokens": 200,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": false
  }'

Response

{
  "id": "cmpl-xyz789",
  "object": "text_completion",
  "created": 1705334400,
  "model": "Qwen2.5-7B-Instruct-4bit",
  "choices": [
    {
      "text": " high-performance, energy-efficient computing. The unified memory architecture lets the CPU, GPU, and Neural Engine share one memory pool, removing the overhead of copying data between processors.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 36,
    "total_tokens": 42
  }
}

POST /olla/omlx/v1/embeddings

Generate embeddings from an embedding model loaded by oMLX.

Request

curl -X POST http://localhost:40114/olla/omlx/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/bge-small-en-v1.5",
    "input": "MLX is optimised for Apple Silicon",
    "encoding_format": "float"
  }'

Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ],
  "model": "mlx-community/bge-small-en-v1.5",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

POST /olla/omlx/v1/rerank

Rerank a set of documents against a query (Cohere/Jina-compatible), using a reranker model loaded by oMLX.

Request

curl -X POST http://localhost:40114/olla/omlx/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/bge-reranker-base",
    "query": "What is unified memory?",
    "documents": [
      "Apple Silicon shares memory between CPU and GPU.",
      "MLX is an array framework for Apple Silicon.",
      "Discrete GPUs use separate VRAM."
    ],
    "top_n": 2
  }'

Response

{
  "results": [
    {"index": 0, "relevance_score": 0.91},
    {"index": 1, "relevance_score": 0.44}
  ],
  "model": "mlx-community/bge-reranker-base",
  "usage": {
    "total_tokens": 38
  }
}

Sampling Parameters

Standard OpenAI-compatible sampling parameters are supported.

Parameter Type Default Description
temperature float 1.0 Sampling temperature
top_p float 1.0 Nucleus sampling threshold
top_k integer - Top-k sampling
max_tokens integer - Maximum tokens to generate
stop string/array - Stop sequences
stream boolean false Enable streaming response
frequency_penalty float 0.0 Frequency penalty
presence_penalty float 0.0 Presence penalty

Configuration Example

discovery:
  static:
    endpoints:
      - url: "http://192.168.0.100:8000"
        name: "omlx-server"
        type: "omlx"
        priority: 75
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

Request Headers

All requests are forwarded with:

  • X-Olla-Request-ID - Unique request identifier
  • X-Forwarded-For - Client IP address
  • Custom headers from endpoint configuration

Response Headers

All responses include:

  • X-Olla-Endpoint - Backend endpoint name (e.g., "omlx-server")
  • X-Olla-Model - Model used for the request
  • X-Olla-Backend-Type - Always "omlx" for these endpoints
  • X-Olla-Response-Time - Total processing time