oMLX API
Proxy endpoints for oMLX inference servers running on Apple Silicon. All endpoints are available under the /olla/omlx/ prefix.
oMLX is a multi-model server: a single instance hosts many models concurrently, loading them on demand and evicting the least-recently-used ones under memory pressure. It is OpenAI-compatible on the wire and additionally implements the Anthropic Messages API, a reranking endpoint, and an oMLX-specific model-status endpoint.
Endpoints Overview
| Method | URI | Description |
|---|---|---|
| GET | /olla/omlx/health | Health check |
| GET | /olla/omlx/v1/models | List available models |
| GET | /olla/omlx/v1/models/status | Loaded-model residency state |
| POST | /olla/omlx/v1/chat/completions | Chat completion |
| POST | /olla/omlx/v1/completions | Text completion |
| POST | /olla/omlx/v1/embeddings | Generate embeddings |
| POST | /olla/omlx/v1/rerank | Rerank documents |
| POST | /olla/omlx/v1/responses | OpenAI Responses API |
Anthropic-format requests are served through Olla's Anthropic endpoint (/olla/anthropic/v1/messages) in passthrough mode. See the Anthropic API Reference.
GET /olla/omlx/health
Check oMLX server health status.
Request
curl -X GET http://localhost:40114/olla/omlx/health
Response
GET /olla/omlx/v1/models
List the models the oMLX server has discovered. The id is the configured alias where one is set, otherwise the model's directory name. max_model_len reports the effective context window and is preserved by Olla during discovery.
Request
curl -X GET http://localhost:40114/olla/omlx/v1/models
Response
{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-7B-Instruct-4bit",
      "object": "model",
      "created": 1705334400,
      "owned_by": "omlx",
      "max_model_len": 32768
    }
  ]
}
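Because max_model_len survives discovery, a client can use it to check a request against the context window before sending it. The following is a minimal sketch, assuming the response shape shown above; the helper names context_windows and fits_context are hypothetical, and the rule that prompt tokens plus max_tokens must fit in the window is a common convention rather than documented oMLX behaviour.

```python
import json

# Sample payload copied from the response above.
MODELS_JSON = """
{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-7B-Instruct-4bit",
      "object": "model",
      "created": 1705334400,
      "owned_by": "omlx",
      "max_model_len": 32768
    }
  ]
}
"""


def context_windows(models_response: dict) -> dict:
    """Map each model id to its max_model_len, skipping entries that lack one."""
    return {
        model["id"]: model["max_model_len"]
        for model in models_response.get("data", [])
        if isinstance(model.get("max_model_len"), int)
    }


def fits_context(windows: dict, model_id: str, prompt_tokens: int, max_tokens: int) -> bool:
    """Return True when the prompt plus the requested output fits the window."""
    return prompt_tokens + max_tokens <= windows[model_id]


windows = context_windows(json.loads(MODELS_JSON))
print(windows)
print(fits_context(windows, "Qwen2.5-7B-Instruct-4bit", prompt_tokens=25, max_tokens=300))
```

An unknown model id raises KeyError here, which keeps a typo in the model name from being silently treated as "fits".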
GET /olla/omlx/v1/models/status
Report which models are currently resident in memory. This oMLX-specific endpoint is keyed by directory name (not alias) and is useful for understanding cold-start behaviour. Olla forwards it unchanged.
Request
curl -X GET http://localhost:40114/olla/omlx/v1/models/status
Response
{
  "model_count": 2,
  "loaded_count": 1,
  "models": [
    {
      "id": "Qwen2.5-7B-Instruct-4bit",
      "loaded": true,
      "is_loading": false,
      "pinned": true,
      "estimated_size": 4500000000,
      "last_access": 1705334400.0
    },
    {
      "id": "Llama-3.2-3B-Instruct-4bit",
      "loaded": false,
      "is_loading": false,
      "pinned": false,
      "estimated_size": 1800000000,
      "last_access": null
    }
  ]
}
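A client can read this payload to predict whether a request will hit a cold start. The sketch below assumes the field names shown in the sample response and treats estimated_size as a byte count, which the sample values suggest but the text does not state. The function names are hypothetical.

```python
# Trimmed copy of the sample response above, with JSON literals
# translated to Python (true -> True, null -> None).
STATUS = {
    "models": [
        {
            "id": "Qwen2.5-7B-Instruct-4bit",
            "loaded": True,
            "is_loading": False,
            "pinned": True,
            "estimated_size": 4500000000,
            "last_access": 1705334400.0,
        },
        {
            "id": "Llama-3.2-3B-Instruct-4bit",
            "loaded": False,
            "is_loading": False,
            "pinned": False,
            "estimated_size": 1800000000,
            "last_access": None,
        },
    ]
}


def resident_models(status: dict) -> list:
    """Ids of the models currently loaded in memory."""
    return [m["id"] for m in status["models"] if m["loaded"]]


def resident_bytes(status: dict) -> int:
    """Sum of estimated_size over loaded models (assumed to be bytes)."""
    return sum(m["estimated_size"] for m in status["models"] if m["loaded"])


def needs_load(status: dict, model_id: str) -> bool:
    """True when the next request for model_id would trigger a load."""
    for m in status["models"]:
        if m["id"] == model_id:
            return not m["loaded"]
    raise KeyError(model_id)


print(resident_models(STATUS))
print(needs_load(STATUS, "Llama-3.2-3B-Instruct-4bit"))
```

Since the endpoint is keyed by directory name, pass the directory name to needs_load even when requests elsewhere use an alias.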
POST /olla/omlx/v1/chat/completions
OpenAI-compatible chat completion. The first request for a model that is not resident triggers a load and may take several seconds.
Request
curl -X POST http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful AI assistant."
      },
      {
        "role": "user",
        "content": "What is MLX?"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 300,
    "stream": false
  }'
Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1705334400,
  "model": "Qwen2.5-7B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "MLX is an array framework for machine learning on Apple Silicon, built by Apple's machine learning research team. It uses the unified memory architecture of M-series chips for efficient GPU-accelerated computation."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 40,
    "total_tokens": 65
  }
}
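The same call can be made from Python with only the standard library. This is a sketch rather than an official client: the host and port come from the curl example, the helper names are hypothetical, and the 120-second default timeout is an assumed allowance for a cold-start model load, not a documented value.

```python
import json
import urllib.request

# Host, port, and prefix taken from the curl example above.
BASE_URL = "http://localhost:40114/olla/omlx"


def build_chat_request(model: str, messages: list, **options) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    payload = {"model": model, "messages": messages, "stream": False, **options}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON body.

    The generous timeout is an assumption meant to cover a model load.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def reply_text(completion: dict) -> str:
    """Extract the assistant message from a chat.completion response."""
    return completion["choices"][0]["message"]["content"]
```

With a server running, `reply_text(send(build_chat_request("Qwen2.5-7B-Instruct-4bit", [{"role": "user", "content": "What is MLX?"}], max_tokens=300)))` would return the assistant's text. Keeping request construction separate from sending makes the payload easy to inspect without a live backend.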
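Streaming Response
When "stream" is set to true, the response arrives as server-sent events: one JSON chunk per data: line, terminated by data: [DONE].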
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"Qwen2.5-7B-Instruct-4bit","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"Qwen2.5-7B-Instruct-4bit","choices":[{"index":0,"delta":{"content":"MLX"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334401,"model":"Qwen2.5-7B-Instruct-4bit","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
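A consumer of this stream needs to strip the data: prefix, stop at [DONE], and concatenate the delta.content fragments. The sketch below works on any iterable of text lines and assumes the chunk shape shown above; collect_stream is a hypothetical helper name.

```python
import json


def collect_stream(lines):
    """Join delta.content fragments from SSE lines; return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators or anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# Shortened versions of the chunks shown above.
SAMPLE = [
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"MLX"},"finish_reason":null}]}',
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]

text, reason = collect_stream(SAMPLE)
print(text, reason)
```

The first chunk carries only the role and the last carries only finish_reason, so both contribute no text.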
POST /olla/omlx/v1/completions
OpenAI-compatible text completion from a raw prompt, without chat message structure.
Request
curl -X POST http://localhost:40114/olla/omlx/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "prompt": "Apple Silicon is designed for",
    "max_tokens": 200,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": false
  }'
Response
{
  "id": "cmpl-xyz789",
  "object": "text_completion",
  "created": 1705334400,
  "model": "Qwen2.5-7B-Instruct-4bit",
  "choices": [
    {
      "text": " high-performance, energy-efficient computing. The unified memory architecture lets the CPU, GPU, and Neural Engine share one memory pool, removing the overhead of copying data between processors.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 36,
    "total_tokens": 42
  }
}
POST /olla/omlx/v1/embeddings
Generate embeddings from an embedding model loaded by oMLX.
Request
curl -X POST http://localhost:40114/olla/omlx/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/bge-small-en-v1.5",
    "input": "MLX is optimised for Apple Silicon",
    "encoding_format": "float"
  }'
Response
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ],
  "model": "mlx-community/bge-small-en-v1.5",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
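Embedding vectors are usually compared with cosine similarity. The sketch below pulls the vectors out of a response in index order and compares two of them. It assumes the response shape shown above; several data entries would only appear if the request carried a list of inputs, which the OpenAI API allows but which is not shown here for oMLX. Both function names are hypothetical.

```python
import math


def embedding_vectors(response: dict) -> list:
    """Return the embedding vectors from a response, ordered by index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Tiny made-up vectors for illustration; real embeddings are much longer.
response = {
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ]
}
first, second = embedding_vectors(response)
print(cosine_similarity(first, second))
```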
POST /olla/omlx/v1/rerank
Rerank a set of documents against a query (Cohere/Jina-compatible), using a reranker model loaded by oMLX.
Request
curl -X POST http://localhost:40114/olla/omlx/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/bge-reranker-base",
    "query": "What is unified memory?",
    "documents": [
      "Apple Silicon shares memory between CPU and GPU.",
      "MLX is an array framework for Apple Silicon.",
      "Discrete GPUs use separate VRAM."
    ],
    "top_n": 2
  }'
Response
{
  "results": [
    {"index": 0, "relevance_score": 0.91},
    {"index": 1, "relevance_score": 0.44}
  ],
  "model": "mlx-community/bge-reranker-base",
  "usage": {
    "total_tokens": 38
  }
}
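The results refer to documents by their position in the request, so the caller has to map each index back to the original text. A minimal sketch, assuming the response shape shown above; ranked_documents is a hypothetical helper name.

```python
def ranked_documents(documents: list, rerank_response: dict) -> list:
    """Pair each result's score with its document text, best score first."""
    ranked = [
        (result["relevance_score"], documents[result["index"]])
        for result in rerank_response["results"]
    ]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Documents and results copied from the request and response above.
documents = [
    "Apple Silicon shares memory between CPU and GPU.",
    "MLX is an array framework for Apple Silicon.",
    "Discrete GPUs use separate VRAM.",
]
response = {
    "results": [
        {"index": 0, "relevance_score": 0.91},
        {"index": 1, "relevance_score": 0.44},
    ]
}

for score, text in ranked_documents(documents, response):
    print(f"{score:.2f}  {text}")
```

With top_n set to 2, the third document does not appear in the results at all, so it is absent from the ranked list rather than placed last.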
Sampling Parameters
Standard OpenAI-compatible sampling parameters are supported.
| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature |
| top_p | float | 1.0 | Nucleus sampling threshold |
| top_k | integer | - | Top-k sampling |
| max_tokens | integer | - | Maximum tokens to generate |
| stop | string/array | - | Stop sequences |
| stream | boolean | false | Enable streaming response |
| frequency_penalty | float | 0.0 | Frequency penalty |
| presence_penalty | float | 0.0 | Presence penalty |
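One way to apply the table on the client side is to send only the parameters that were explicitly set, so the server defaults cover the rest. The sketch below is a hypothetical helper, not part of Olla or oMLX. It rejects names that are not in the table, which is a deliberately strict choice: the backend may accept parameters that are not listed here.

```python
# Parameter names taken from the table above.
SAMPLING_KEYS = {
    "temperature",
    "top_p",
    "top_k",
    "max_tokens",
    "stop",
    "stream",
    "frequency_penalty",
    "presence_penalty",
}


def sampling_options(**options) -> dict:
    """Keep explicitly set sampling parameters; drop None so defaults apply."""
    unknown = set(options) - SAMPLING_KEYS
    if unknown:
        raise ValueError(f"not in the documented parameter table: {sorted(unknown)}")
    return {name: value for name, value in options.items() if value is not None}


payload = {
    "model": "Qwen2.5-7B-Instruct-4bit",
    "prompt": "Apple Silicon is designed for",
    **sampling_options(temperature=0.8, top_p=0.95, top_k=None, max_tokens=200),
}
print(payload)
```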
Configuration Example
discovery:
  static:
    endpoints:
      - url: "http://192.168.0.100:8000"
        name: "omlx-server"
        type: "omlx"
        priority: 75
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
Request Headers
All requests are forwarded with:
- X-Olla-Request-ID - Unique request identifier
- X-Forwarded-For - Client IP address
- Custom headers from endpoint configuration
Response Headers
All responses include:
- X-Olla-Endpoint - Backend endpoint name (e.g., "omlx-server")
- X-Olla-Model - Model used for the request
- X-Olla-Backend-Type - Always "omlx" for these endpoints
- X-Olla-Response-Time - Total processing time
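These headers make it possible to log which backend served a request. The sketch below collects them from any mapping of header names to values and matches names case-insensitively, since HTTP header names are not case-sensitive. The function name olla_metadata is hypothetical, and X-Olla-Response-Time is kept as a raw string because its format is not specified here.

```python
def olla_metadata(headers) -> dict:
    """Collect the X-Olla-* response headers; missing ones map to None."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "endpoint": lowered.get("x-olla-endpoint"),
        "model": lowered.get("x-olla-model"),
        "backend_type": lowered.get("x-olla-backend-type"),
        "response_time": lowered.get("x-olla-response-time"),
    }


# Illustrative header values based on the examples in this page.
sample_headers = {
    "Content-Type": "application/json",
    "x-olla-endpoint": "omlx-server",
    "X-Olla-Model": "Qwen2.5-7B-Instruct-4bit",
    "X-OLLA-BACKEND-TYPE": "omlx",
}
print(olla_metadata(sample_headers))
```

The function accepts a plain dict as well as the headers object of an `http.client.HTTPResponse`, since both provide an `items()` method.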