Skip to content

oMLX Integration

Home omlx.ai (source: github.com/jundot/omlx)
Since Olla v0.0.28
Type omlx (use in endpoint configuration)
Profile omlx.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (native)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • Native Anthropic Messages API
  • Embeddings API
  • Reranking API
Unsupported
  • Native Token Counting (Olla uses the local estimator; oMLX's /v1/messages/count_tokens is not yet forwarded)
  • Model Load/Unload Control (oMLX manages residency itself; Olla does not proxy lifecycle endpoints)
  • Prometheus Metrics
Attributes
  • Apple Silicon Only (M1/M2/M3/M4, macOS 15.0+)
  • MLX Framework Acceleration
  • Unified Memory Architecture
  • Multi-Model Server (concurrent, lazy-loaded)
  • Tiered KV Cache (hot RAM + cold SSD)
  • LRU/TTL Eviction & Model Pinning
Prefixes
Endpoints See below

oMLX is a multi-model inference server for Apple Silicon, managed from the macOS menu bar. Unlike single-model MLX servers, a single oMLX instance serves many models concurrently, loading them on demand and evicting the least-recently-used ones when memory runs low. Because it is OpenAI-compatible on the wire, Olla reuses the standard OpenAI parser and forwards requests with no translation overhead.

Configuration

Basic Setup

Add oMLX to your Olla configuration. A single endpoint exposes every model the server has discovered:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-omlx"
        type: "omlx"
        priority: 75
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

Allow for cold starts

oMLX loads models lazily, so the first request for a model that is not resident triggers a load that can take several seconds. The profile defaults to a 3-minute timeout to absorb this. Pin frequently used models in the oMLX admin panel to avoid cold starts on hot paths.

Apple Silicon Network Setup

Place multiple Macs behind Olla and balance across them. Because each oMLX instance is multi-model, you do not need one endpoint per model:

discovery:
  static:
    endpoints:
      - url: "http://mac-studio:8000"
        name: "omlx-studio"
        type: "omlx"
        priority: 90
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

      - url: "http://mac-mini:8000"
        name: "omlx-mini"
        type: "omlx"
        priority: 80
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

proxy:
  engine: "olla"        # High-performance engine
  load_balancer: "priority"

Anthropic Messages API Support

oMLX natively implements the Anthropic Messages API, so Olla forwards Anthropic-format requests directly without the Anthropic-to-OpenAI-to-Anthropic translation round trip (passthrough mode).

When Olla detects native Anthropic support (via the anthropic_support section in config/profiles/omlx.yaml), it bypasses the translation pipeline and sends requests straight to /v1/messages on the backend.

Profile configuration (from config/profiles/omlx.yaml):

api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false

Key details:

  • Passthrough mode is automatic -- no client-side configuration needed
  • Responses include the X-Olla-Mode: passthrough header when passthrough is active
  • Falls back to translation mode if passthrough conditions are not met
  • Token counting (/v1/messages/count_tokens): oMLX implements this natively, but Olla currently answers token-count requests with its local estimator rather than forwarding them, so token_count is left false

oMLX also ships a Claude Code context-scaling mode that rescales reported token counts so auto-compact fires at the right time on smaller-context models. This pairs well with pointing Claude Code at Olla's Anthropic endpoint.

For more information, see API Translation and Anthropic API Reference.

Endpoints Supported

The following endpoints are supported by the oMLX integration profile:

Path Description
/health Health Check
/v1/models List Models (OpenAI format; returns aliases where configured)
/v1/models/status Loaded-model state (oMLX-specific: residency, size, last access)
/v1/chat/completions Chat Completions (OpenAI format)
/v1/completions Text Completions (OpenAI format)
/v1/embeddings Embeddings API
/v1/rerank Reranking API (Cohere/Jina-compatible)
/v1/messages Anthropic Messages API (native passthrough)
/v1/messages/count_tokens Anthropic token count -- forwarded to oMLX via this prefix; the /olla/anthropic route uses Olla's local estimator (token_count: false)
/v1/responses OpenAI Responses API

Usage Examples

Chat Completion

curl -X POST http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'

Anthropic Messages API (Passthrough)

curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "max_tokens": 500,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Reranking

curl -X POST http://localhost:40114/olla/omlx/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/bge-reranker-base",
    "query": "What is unified memory?",
    "documents": [
      "Apple Silicon shares memory between CPU and GPU.",
      "MLX is an array framework for Apple Silicon.",
      "Discrete GPUs use separate VRAM."
    ],
    "top_n": 2
  }'

Loaded State, Models and Health

# List available models (aliases shown where configured)
curl http://localhost:40114/olla/omlx/v1/models

# Inspect which models are currently resident in memory
curl http://localhost:40114/olla/omlx/v1/models/status

# Check health status
curl http://localhost:40114/olla/omlx/health

oMLX Specifics

Multi-Model Serving

A single oMLX instance hosts many models at once and manages residency automatically. This is the key difference from single-model MLX servers and shapes how you configure Olla:

  • Lazy loading: models load on first request. Expect a cold-start delay the first time a model is used after startup or eviction.
  • LRU eviction: the least-recently-used model is unloaded automatically when memory runs low.
  • Model pinning: pin frequently used models in the admin panel so they stay resident.
  • Per-model TTL: set an idle timeout per model to auto-unload after inactivity.
  • Process memory enforcement: a total memory ceiling (default: system RAM minus 8GB) prevents system-wide out-of-memory conditions.

Because the server already discovers and exposes every model through /v1/models, you typically need one Olla endpoint per oMLX instance, not one per model.

Tiered KV Cache (Hot + Cold)

oMLX keeps KV cache blocks across two tiers: a hot tier in RAM and a cold tier on SSD (safetensors). When the hot cache fills, blocks spill to disk and are restored from a matching prefix on the next request instead of being recomputed -- even after a server restart. This makes multi-turn, tool-heavy workloads (such as coding agents) far cheaper to resume.

Model Naming and Aliases

oMLX discovers models from subdirectories, so model IDs are directory names such as Qwen2.5-7B-Instruct-4bit, or a custom alias configured per model in the admin panel.

Aliases vs directory names

/v1/models returns the alias when one is configured, and requests accept both the alias and the directory name. The oMLX-specific /v1/models/status endpoint reports residency keyed by directory name. Olla discovers models from /v1/models, so the names you see in unified models are the aliases.

Resource Configuration

The oMLX profile uses Apple Silicon-oriented defaults. Concurrency is deliberately conservative because several models may share unified memory simultaneously:

characteristics:
  timeout: 3m                 # Absorbs cold-start model loads
  max_concurrent_requests: 4
  streaming_support: true

resources:
  defaults:
    requires_gpu: false       # Unified memory, no discrete GPU

Memory Requirements

Unified memory is shared between macOS and every loaded model. Because oMLX holds several models at once, size your Mac for the sum of the models you intend to keep resident, plus headroom for the OS:

Mac unified memory Comfortable resident set Notes
16GB One 7-8B (4bit) model Pin one model; expect evictions
32GB A 7-8B model plus an embedding/rerank model Good for a single-user coding setup
64GB Several 7-13B models, or one 30B (4bit) Comfortable multi-model
128GB+ A 70B (4bit) model with room for helpers Mac Studio territory

Starting oMLX Server

oMLX runs only on Apple Silicon Macs (M1/M2/M3/M4) with macOS 15.0+ (Sequoia) and Python 3.10+.

macOS App

Download the .dmg from Releases, drag oMLX to Applications, and launch it. The welcome flow walks through choosing a model directory, starting the server, and downloading a first model. The server listens on http://localhost:8000 by default.

Homebrew

brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Run as a managed background service (auto-restarts on crash)
omlx start

From Source

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .

# Foreground server attached to this terminal
omlx serve --model-dir ~/models

The server auto-discovers LLMs, VLMs, embedding models, and rerankers from subdirectories of the model directory. Any OpenAI-compatible client can then connect to http://localhost:8000/v1.

Profile Customisation

To customise oMLX behaviour, create config/profiles/omlx-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: omlx
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - omlx
    - mlx       # Add an alternate prefix

# Allow longer cold starts for very large models
characteristics:
  timeout: 5m

# Raise concurrency on a high-memory Mac
resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 8

See Profile Configuration for complete customisation options.

Troubleshooting

Apple Silicon and macOS Version

Issue: oMLX fails to start or install

Solution: oMLX requires an Apple Silicon Mac running macOS 15.0+ (Sequoia). It does not run on Intel Macs, Linux, or Windows. Verify your hardware:

sysctl -n machdep.cpu.brand_string   # Should show "Apple M1", "Apple M2", etc.
sw_vers -productVersion              # Should be 15.0 or later

Cold-Start Latency

Issue: The first request to a model is slow or times out

Solution: oMLX loads models lazily, so the first request after startup or eviction pays a load cost. Either raise the timeout or keep the model resident:

characteristics:
  timeout: 5m

resources:
  timeout_scaling:
    base_timeout_seconds: 300
    load_time_buffer: true

Pin the model in the oMLX admin panel so it stays loaded between requests.

Eviction Thrashing

Issue: Models keep unloading and reloading, hurting latency

Solution: Too many models are competing for unified memory. Reduce the resident set:

  1. Pin only the models you use most
  2. Set a per-model TTL so idle models unload cleanly
  3. Choose smaller quantisations (e.g. 4bit) to fit more models
  4. Raise the process memory ceiling only if you have headroom for macOS

Model Name Not Found

Issue: A request returns a model-not-found error even though the model exists

Solution: oMLX accepts both the alias and the directory name, but the name Olla advertises in /olla/models is whatever /v1/models returns (the alias when configured). List the models through Olla and use the name shown:

curl http://localhost:40114/olla/omlx/v1/models

Best Practices

1. One Endpoint per Instance

Because oMLX is multi-model, configure a single Olla endpoint per server and let oMLX manage model residency. Avoid creating one endpoint per model.

2. Pin Your Hot Models

Pin the models on your critical paths (for example, the model behind your coding agent) so they never pay a cold-start cost.

3. Size for the Resident Set

Plan memory around the sum of models you keep loaded, not the largest single model, and leave several GB of headroom for macOS.

4. Use the Olla Engine for Multiple Instances

When balancing across several Macs, use the olla engine with the priority load balancer to prefer your fastest hardware.

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/omlx/v1",
    api_key="not-needed"  # oMLX does not require API keys by default
)

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

Claude Code

# Point Claude Code at Olla's Anthropic endpoint
export ANTHROPIC_BASE_URL="http://localhost:40114/olla/anthropic"

# Requests use passthrough mode to oMLX automatically
claude

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:40114/olla/omlx/v1",
    api_key="not-needed",
    model="Qwen2.5-7B-Instruct-4bit",
    temperature=0.7
)

Next Steps