oMLX Integration¶

Home	omlx.ai (source: github.com/jundot/omlx)
Since	Olla `v0.0.28`
Type	`omlx` (use in endpoint configuration)
Profile	`omlx.yaml` (see latest)
Features	Proxy Forwarding · Health Check (native) · Model Unification · Model Detection & Normalisation · OpenAI API Compatibility · Native Anthropic Messages API · Embeddings API · Reranking API
Unsupported	Native Token Counting (on the `/olla/anthropic` route Olla uses its local estimator; oMLX's `/v1/messages/count_tokens` is not yet forwarded there) · Model Load/Unload Control (oMLX manages residency itself; Olla does not proxy lifecycle endpoints) · Prometheus Metrics
Attributes	Apple Silicon Only (M1/M2/M3/M4, macOS 15.0+) · MLX Framework Acceleration · Unified Memory Architecture · Multi-Model Server (concurrent, lazy-loaded) · Tiered KV Cache (hot RAM + cold SSD) · LRU/TTL Eviction & Model Pinning
Prefixes	`/omlx` (see Routing Prefixes)
Endpoints	See below

oMLX is a multi-model inference server for Apple Silicon, managed from the macOS menu bar. Unlike single-model MLX servers, a single oMLX instance serves many models concurrently, loading them on demand and evicting the least-recently-used ones when memory runs low. Because it is OpenAI-compatible on the wire, Olla reuses the standard OpenAI parser and forwards requests with no translation overhead.

Configuration¶

Basic Setup¶

Add oMLX to your Olla configuration. A single endpoint exposes every model the server has discovered:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-omlx"
        type: "omlx"
        priority: 75
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

Allow for cold starts

oMLX loads models lazily, so the first request for a model that is not resident triggers a load that can take several seconds. The profile defaults to a 3-minute timeout to absorb this. Pin frequently used models in the oMLX admin panel to avoid cold starts on hot paths.
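
Where pinning is not an option, a warm-up request can pay the load cost ahead of real traffic. The following is a minimal sketch, not part of Olla or oMLX: the helper names are hypothetical, the base URL assumes Olla is listening on `localhost:40114` as in the usage examples, and the 180-second timeout simply mirrors the profile's 3-minute default.

```python
import json
import urllib.request

# Assumed Olla address and oMLX prefix, as used in the usage examples.
OLLA_BASE = "http://localhost:40114/olla/omlx"


def build_warmup_request(model: str, base_url: str = OLLA_BASE) -> urllib.request.Request:
    """Build a one-token chat request whose only purpose is to trigger a model load."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(models, timeout_s: float = 180.0, send=urllib.request.urlopen):
    """Send one warm-up request per model; return {model: HTTP status or error text}."""
    results = {}
    for model in models:
        try:
            with send(build_warmup_request(model), timeout=timeout_s) as resp:
                results[model] = resp.status
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            results[model] = f"failed: {exc}"
    return results
```

Calling `warm_up(["Qwen2.5-7B-Instruct-4bit"])` after the server starts would load that model before the first user request arrives. The `send` parameter exists only so the function can be exercised without a live server.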

Apple Silicon Network Setup¶

Place multiple Macs behind Olla and balance across them. Because each oMLX instance is multi-model, you do not need one endpoint per model:

discovery:
  static:
    endpoints:
      - url: "http://mac-studio:8000"
        name: "omlx-studio"
        type: "omlx"
        priority: 90
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

      - url: "http://mac-mini:8000"
        name: "omlx-mini"
        type: "omlx"
        priority: 80
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

proxy:
  engine: "olla"        # High-performance engine
  load_balancer: "priority"
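
The effect of priority balancing on the configuration above can be pictured with a small sketch. This is an illustration only, not Olla's implementation; it assumes that a higher `priority` value is preferred (as the example values suggest) and that unhealthy endpoints are skipped. The `Endpoint` class and `pick_endpoint` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int
    healthy: bool = True


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the highest priority, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return max(healthy, key=lambda e: e.priority, default=None)
```

With `omlx-studio` at 90 and `omlx-mini` at 80, requests go to the Mac Studio while it is healthy and fall back to the Mac mini when it is not.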

Anthropic Messages API Support¶

oMLX natively implements the Anthropic Messages API, so Olla forwards Anthropic-format requests directly without the Anthropic-to-OpenAI-to-Anthropic translation round trip (passthrough mode).

When Olla detects native Anthropic support (via the anthropic_support section in config/profiles/omlx.yaml), it bypasses the translation pipeline and sends requests straight to /v1/messages on the backend.

Profile configuration (from config/profiles/omlx.yaml):

api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false

Key details:

Passthrough mode is automatic -- no client-side configuration needed
Responses include the X-Olla-Mode: passthrough header when passthrough is active
Falls back to translation mode if passthrough conditions are not met
Token counting (/v1/messages/count_tokens): oMLX implements this natively, but Olla currently answers token-count requests with its local estimator rather than forwarding them, so token_count is left false
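
A client can confirm which mode served a request by inspecting the response headers. The sketch below is a hypothetical helper; treating a missing `X-Olla-Mode` header as translation mode is an assumption based on the fallback behaviour described above.

```python
def olla_mode(headers) -> str:
    """Return 'passthrough' or 'translation' based on the X-Olla-Mode response header."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-olla-mode" and value.strip().lower() == "passthrough":
            return "passthrough"
    return "translation"
```

Any mapping of header names to values works here, including the `headers` attribute of most HTTP client responses.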

oMLX also ships a Claude Code context-scaling mode that rescales reported token counts so auto-compact fires at the right time on smaller-context models. This pairs well with pointing Claude Code at Olla's Anthropic endpoint.

For more information, see API Translation and Anthropic API Reference.

Endpoints Supported¶

The following endpoints are supported by the oMLX integration profile:

Path	Description
`/health`	Health Check
`/v1/models`	List Models (OpenAI format; returns aliases where configured)
`/v1/models/status`	Loaded-model state (oMLX-specific: residency, size, last access)
`/v1/chat/completions`	Chat Completions (OpenAI format)
`/v1/completions`	Text Completions (OpenAI format)
`/v1/embeddings`	Embeddings API
`/v1/rerank`	Reranking API (Cohere/Jina-compatible)
`/v1/messages`	Anthropic Messages API (native passthrough)
`/v1/messages/count_tokens`	Anthropic token count. Requests under the `/omlx` prefix are forwarded to oMLX; the `/olla/anthropic` route answers with Olla's local estimator (`token_count: false`)
`/v1/responses`	OpenAI Responses API

Usage Examples¶

Chat Completion¶

curl -X POST http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Streaming Response¶

curl -X POST http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'
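
With `"stream": true`, the response arrives as server-sent events. The sketch below shows one way to pull the text out of those events, assuming oMLX emits the standard OpenAI chat-completion chunk format (`data: {...}` lines ending with `data: [DONE]`). The function name is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Feed it the line iterator of any HTTP client's streaming response and join or print the yielded fragments as they arrive.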

Anthropic Messages API (Passthrough)¶

curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "max_tokens": 500,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Reranking¶

curl -X POST http://localhost:40114/olla/omlx/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/bge-reranker-base",
    "query": "What is unified memory?",
    "documents": [
      "Apple Silicon shares memory between CPU and GPU.",
      "MLX is an array framework for Apple Silicon.",
      "Discrete GPUs use separate VRAM."
    ],
    "top_n": 2
  }'
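
The rerank response refers to documents by position rather than repeating their text. The sketch below maps results back to the original strings; it assumes a Cohere-style response body with a `results` array whose entries carry `index` and `relevance_score`, which should be verified against your oMLX version. The function name is hypothetical.

```python
def rank_documents(documents, response):
    """Pair each returned result with its source document, best score first."""
    results = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in results]
```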

Loaded State, Models and Health¶

# List available models (aliases shown where configured)
curl http://localhost:40114/olla/omlx/v1/models

# Inspect which models are currently resident in memory
curl http://localhost:40114/olla/omlx/v1/models/status

# Check health status
curl http://localhost:40114/olla/omlx/health

oMLX Specifics¶

Multi-Model Serving¶

A single oMLX instance hosts many models at once and manages residency automatically. This is the key difference from single-model MLX servers and shapes how you configure Olla:

Lazy loading: models load on first request. Expect a cold-start delay the first time a model is used after startup or eviction.
LRU eviction: the least-recently-used model is unloaded automatically when memory runs low.
Model pinning: pin frequently used models in the admin panel so they stay resident.
Per-model TTL: set an idle timeout per model to auto-unload after inactivity.
Process memory enforcement: a total memory ceiling (default: system RAM minus 8GB) prevents system-wide out-of-memory conditions.

Because the server already discovers and exposes every model through /v1/models, you typically need one Olla endpoint per oMLX instance, not one per model.
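
The interaction between lazy loading, LRU eviction, pinning and the memory ceiling can be pictured with a toy model. This is an illustration of the general technique only, not oMLX's code; the class, its method names and the sizes used are all hypothetical.

```python
from collections import OrderedDict


class ResidentSet:
    """Toy model of LRU eviction with pinning under a fixed memory ceiling (in GB)."""

    def __init__(self, ceiling_gb: float):
        self.ceiling_gb = ceiling_gb
        self.loaded = OrderedDict()  # name -> size_gb, least recently used first
        self.pinned = set()

    def pin(self, name: str) -> None:
        self.pinned.add(name)

    def request(self, name: str, size_gb: float) -> list:
        """Touch or load a model; return the names evicted to make room."""
        if name in self.loaded:
            self.loaded.move_to_end(name)  # warm hit: mark as most recently used
            return []
        evicted = []
        while sum(self.loaded.values()) + size_gb > self.ceiling_gb:
            # Oldest unpinned model goes first; pinned models are never chosen.
            victim = next((m for m in self.loaded if m not in self.pinned), None)
            if victim is None:
                raise MemoryError(f"cannot fit {name}: nothing left to evict")
            del self.loaded[victim]
            evicted.append(victim)
        self.loaded[name] = size_gb  # cold start: model is now resident
        return evicted
```

In this model, a request for a resident model costs nothing, while a request for a non-resident one may push out the least recently used unpinned model first.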

Tiered KV Cache (Hot + Cold)¶

oMLX keeps KV cache blocks across two tiers: a hot tier in RAM and a cold tier on SSD (safetensors). When the hot tier fills, blocks spill to disk. A later request whose prompt shares a matching prefix restores those blocks instead of recomputing them, even after a server restart. This makes multi-turn, tool-heavy workloads (such as coding agents) far cheaper to resume.
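
The spill-and-restore idea can be sketched with a toy two-tier cache. This is a simplification for illustration, not oMLX's implementation: it matches whole keys rather than the longest shared prefix, uses an in-memory dict to stand in for SSD storage, and every name in it is hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache: a bounded hot tier that spills its oldest entries to a cold tier."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # stands in for RAM, oldest entry first
        self.cold = {}            # stands in for SSD storage

    def put(self, prefix: str, block) -> None:
        self.hot[prefix] = block
        self.hot.move_to_end(prefix)
        while len(self.hot) > self.hot_capacity:
            old_prefix, old_block = self.hot.popitem(last=False)
            self.cold[old_prefix] = old_block  # spill instead of discarding

    def get(self, prefix: str):
        """Return (block, tier), where tier is 'hot', 'cold' or 'miss'."""
        if prefix in self.hot:
            self.hot.move_to_end(prefix)
            return self.hot[prefix], "hot"
        if prefix in self.cold:
            block = self.cold.pop(prefix)
            self.put(prefix, block)  # restore to the hot tier
            return block, "cold"
        return None, "miss"  # the caller would have to recompute
```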

Model Naming and Aliases¶

oMLX discovers models from subdirectories of the model directory, so a model ID is either the directory name (such as Qwen2.5-7B-Instruct-4bit) or a custom alias configured for that model in the admin panel.

Aliases vs directory names

/v1/models returns the alias when one is configured, and requests accept both the alias and the directory name. The oMLX-specific /v1/models/status endpoint reports residency keyed by directory name. Olla discovers models from /v1/models, so the unified model list shows the alias wherever one is configured and the directory name otherwise.

Resource Configuration¶

The oMLX profile uses Apple Silicon-oriented defaults. Concurrency is deliberately conservative because several models may share unified memory simultaneously:

characteristics:
  timeout: 3m                 # Absorbs cold-start model loads
  max_concurrent_requests: 4
  streaming_support: true

resources:
  defaults:
    requires_gpu: false       # Unified memory, no discrete GPU

Memory Requirements¶

Unified memory is shared between macOS and every loaded model. Because oMLX holds several models at once, size your Mac for the sum of the models you intend to keep resident, plus headroom for the OS:

Mac unified memory	Comfortable resident set	Notes
16GB	One 7-8B (4bit) model	Pin one model; expect evictions
32GB	A 7-8B model plus an embedding/rerank model	Good for a single-user coding setup
64GB	Several 7-13B models, or one 30B (4bit)	Comfortable multi-model
128GB+	A 70B (4bit) model with room for helpers	Mac Studio territory
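
For a rough check of whether a planned resident set fits, weight size can be approximated as parameters × bits per weight ÷ 8. The sketch below applies that arithmetic; it is a lower bound that ignores KV cache, activations and quantisation metadata, and the 8GB reserve mirrors the default memory ceiling mentioned earlier. Both function names are hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight size in GB: parameters x bits / 8 (1 GB taken as 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(models, system_ram_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the summed weight sizes stay within system RAM minus a reserve."""
    total = sum(weights_gb(params, bits) for params, bits in models)
    return total <= system_ram_gb - reserve_gb
```

For example, `weights_gb(7, 4)` gives 3.5, so a 4bit 7B model needs roughly 3.5GB for weights alone, while `weights_gb(70, 4)` gives 35.0 and `fits([(70, 4)], 32)` is false.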

Starting oMLX Server¶

oMLX runs only on Apple Silicon Macs (M1/M2/M3/M4) with macOS 15.0+ (Sequoia) and Python 3.10+.

macOS App¶

Download the .dmg from Releases, drag oMLX to Applications, and launch it. The welcome flow walks through choosing a model directory, starting the server, and downloading a first model. The server listens on http://localhost:8000 by default.

Homebrew¶

brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Run as a managed background service (auto-restarts on crash)
omlx start

From Source¶

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .

# Foreground server attached to this terminal
omlx serve --model-dir ~/models

The server auto-discovers LLMs, VLMs, embedding models, and rerankers from subdirectories of the model directory. Any OpenAI-compatible client can then connect to http://localhost:8000/v1.

Profile Customisation¶

To customise oMLX behaviour, create config/profiles/omlx-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation¶

name: omlx
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - omlx
    - mlx       # Add an alternate prefix

# Allow longer cold starts for very large models
characteristics:
  timeout: 5m

# Raise concurrency on a high-memory Mac
resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 8

See Profile Configuration for complete customisation options.

Troubleshooting¶

Apple Silicon and macOS Version¶

Issue: oMLX fails to start or install

Solution: oMLX requires an Apple Silicon Mac running macOS 15.0+ (Sequoia). It does not run on Intel Macs, Linux, or Windows. Verify your hardware:

sysctl -n machdep.cpu.brand_string   # Should show "Apple M1", "Apple M2", etc.
sw_vers -productVersion              # Should be 15.0 or later
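
The same check can be scripted, for example as a pre-flight step in an installer. This is a minimal sketch with hypothetical function names; note that a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon, so run it with a native arm64 Python.

```python
import platform


def meets_requirements(system: str, machine: str, macos_version: str) -> bool:
    """True for an Apple Silicon Mac on macOS 15.0 or later."""
    if system != "Darwin" or machine != "arm64":
        return False
    major = int(macos_version.split(".")[0]) if macos_version else 0
    return major >= 15


def check_this_machine() -> bool:
    """Apply the check to the machine running this script."""
    return meets_requirements(platform.system(), platform.machine(), platform.mac_ver()[0])
```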

Cold-Start Latency¶

Issue: The first request to a model is slow or times out

Solution: oMLX loads models lazily, so the first request after startup or eviction pays a load cost. Either raise the timeout or keep the model resident:

characteristics:
  timeout: 5m

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true

To keep the model resident instead, pin it in the oMLX admin panel so it stays loaded between requests.

Eviction Thrashing¶

Issue: Models keep unloading and reloading, hurting latency

Solution: Too many models are competing for unified memory. Reduce the resident set:

Pin only the models you use most
Set a per-model TTL so idle models unload cleanly
Choose smaller quantisations (e.g. 4bit) to fit more models
Raise the process memory ceiling only if you have headroom for macOS

Model Name Not Found¶

Issue: A request returns a model-not-found error even though the model exists

Solution: oMLX accepts both the alias and the directory name, but the name Olla advertises in /olla/models is whatever /v1/models returns (the alias when configured). List the models through Olla and use the name shown:

curl http://localhost:40114/olla/omlx/v1/models
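
When the requested name is only slightly off, comparing it with the advertised list often reveals the problem. The sketch below is a hypothetical helper that takes the parsed JSON from the command above, assuming the standard OpenAI list shape (`{"data": [{"id": ...}]}`), and suggests similar names.

```python
import difflib


def model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-format /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def suggest_model(requested: str, models_response: dict, limit: int = 3) -> list:
    """Return [requested] if it is advertised, otherwise up to `limit` similar IDs."""
    ids = model_ids(models_response)
    if requested in ids:
        return [requested]
    return difflib.get_close_matches(requested, ids, n=limit, cutoff=0.5)
```

An empty result means nothing similar is advertised, which usually points to a model that has not been discovered rather than a naming mismatch.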

Best Practices¶

1. One Endpoint per Instance¶

Because oMLX is multi-model, configure a single Olla endpoint per server and let oMLX manage model residency. Avoid creating one endpoint per model.

2. Pin Your Hot Models¶

Pin the models on your critical paths (for example, the model behind your coding agent) so they never pay a cold-start cost.

3. Size for the Resident Set¶

Plan memory around the sum of models you keep loaded, not the largest single model, and leave several GB of headroom for macOS.

4. Use the Olla Engine for Multiple Instances¶

When balancing across several Macs, use the olla engine with the priority load balancer to prefer your fastest hardware.

Integration with Tools¶

OpenAI SDK¶

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/omlx/v1",
    api_key="not-needed"  # oMLX does not require API keys by default
)

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

Claude Code¶

# Point Claude Code at Olla's Anthropic endpoint
export ANTHROPIC_BASE_URL="http://localhost:40114/olla/anthropic"

# Requests use passthrough mode to oMLX automatically
claude

LangChain¶

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:40114/olla/omlx/v1",
    api_key="not-needed",
    model="Qwen2.5-7B-Instruct-4bit",
    temperature=0.7
)

Next Steps¶

Profile Configuration - Customise oMLX behaviour
Model Unification - Understand model management
Load Balancing - Scale with multiple oMLX instances
API Translation - Anthropic passthrough and translation modes