vLLM-MLX API

Proxy endpoints for vLLM-MLX inference servers running on Apple Silicon. All endpoints are available under the /olla/vllm-mlx/ prefix.

vLLM-MLX serves a single model per instance using MLX-format weights from Hugging Face (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit). It exposes a standard OpenAI-compatible API without guided generation or other advanced vLLM features.

Endpoints Overview

Method  URI                                 Description
GET     /olla/vllm-mlx/health               Health check
GET     /olla/vllm-mlx/v1/models            List available models
POST    /olla/vllm-mlx/v1/chat/completions  Chat completion
POST    /olla/vllm-mlx/v1/completions       Text completion
POST    /olla/vllm-mlx/v1/embeddings        Generate embeddings

GET /olla/vllm-mlx/health

Check vLLM-MLX server health status.

Request

curl -X GET http://localhost:40114/olla/vllm-mlx/health

Response

{
  "status": "healthy"
}
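
The same check from Python, as a minimal sketch using the requests library (host and port match the curl example above):

import requests

# Query the Olla proxy's health endpoint for the vLLM-MLX backend.
resp = requests.get("http://localhost:40114/olla/vllm-mlx/health", timeout=5)
resp.raise_for_status()
print(resp.json()["status"])  # expected: "healthy"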

GET /olla/vllm-mlx/v1/models

List the model available on the vLLM-MLX server. Each instance serves a single model.

Request

curl -X GET http://localhost:40114/olla/vllm-mlx/v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
      "object": "model",
      "created": 1705334400,
      "owned_by": "vllm-mlx",
      "root": "mlx-community/Llama-3.2-3B-Instruct-4bit",
      "parent": null,
      "permission": []
    }
  ]
}
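
Because the API is OpenAI-compatible, the official openai Python client can list models when pointed at the proxy prefix. A sketch; the base_url and placeholder api_key are assumptions for a local setup where the server does not enforce API keys:

from openai import OpenAI

# Point the standard OpenAI client at the Olla proxy prefix.
client = OpenAI(
    base_url="http://localhost:40114/olla/vllm-mlx/v1",
    api_key="unused",  # placeholder; assumes the server does not require a key
)

for model in client.models.list():
    print(model.id)  # e.g. mlx-community/Llama-3.2-3B-Instruct-4bit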

POST /olla/vllm-mlx/v1/chat/completions

OpenAI-compatible chat completion.

Request

curl -X POST http://localhost:40114/olla/vllm-mlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful AI assistant."
      },
      {
        "role": "user",
        "content": "What is MLX?"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 300,
    "stream": false
  }'

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1705334400,
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "MLX is an array framework for machine learning on Apple Silicon, developed by Apple's machine learning research team. It provides efficient GPU-accelerated computation using the unified memory architecture of Apple's M-series chips."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
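
The same request via the openai Python client, as a sketch under the same local-setup assumptions as above:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm-mlx/v1",
    api_key="unused",  # placeholder
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is MLX?"},
    ],
    temperature=0.7,
    max_tokens=300,
)
print(response.choices[0].message.content)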

Streaming Response

When "stream": true:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","choices":[{"index":0,"delta":{"content":"MLX"},"logprobs":null,"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334401,"model":"mlx-community/Llama-3.2-3B-Instruct-4bit","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
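
Consuming the stream with the openai Python client (a sketch; the client yields one chunk per data: line and stops at [DONE] automatically):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm-mlx/v1",
    api_key="unused",  # placeholder
)

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is MLX?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # skip chunks without choices (defensive)
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries an empty delta with finish_reason
        print(delta, end="", flush=True)
print()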

POST /olla/vllm-mlx/v1/completions

OpenAI-compatible text completion.

Request

curl -X POST http://localhost:40114/olla/vllm-mlx/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "prompt": "Apple Silicon is designed for",
    "max_tokens": 200,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": false
  }'

Response

{
  "id": "cmpl-xyz789",
  "object": "text_completion",
  "created": 1705334400,
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "choices": [
    {
      "text": " high-performance computing with exceptional energy efficiency. The unified memory architecture allows the CPU, GPU, and Neural Engine to share the same memory pool, eliminating the overhead of copying data between processors.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 38,
    "total_tokens": 44
  }
}
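
The equivalent call with the openai Python client uses the legacy completions method (a sketch under the same assumptions as the earlier examples):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm-mlx/v1",
    api_key="unused",  # placeholder
)

response = client.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    prompt="Apple Silicon is designed for",
    max_tokens=200,
    temperature=0.8,
    top_p=0.95,
)
print(response.choices[0].text)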

POST /olla/vllm-mlx/v1/embeddings

Generate embeddings, provided the loaded model supports them.

Request

curl -X POST http://localhost:40114/olla/vllm-mlx/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "input": "MLX is optimised for Apple Silicon",
    "encoding_format": "float"
  }'

Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0567, 0.0891, ...]
    }
  ],
  "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
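
From Python, as a sketch (it assumes the loaded model actually supports embeddings, per the note above):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm-mlx/v1",
    api_key="unused",  # placeholder
)

response = client.embeddings.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    input="MLX is optimised for Apple Silicon",
    encoding_format="float",
)
vector = response.data[0].embedding
print(len(vector))  # embedding dimensionality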

Sampling Parameters

Standard OpenAI-compatible sampling parameters are supported.

Parameter          Type          Default  Description
temperature        float         1.0      Sampling temperature
top_p              float         1.0      Nucleus sampling threshold
max_tokens         integer       -        Maximum tokens to generate
stop               string/array  -        Stop sequences
stream             boolean       false    Enable streaming response
frequency_penalty  float         0.0      Frequency penalty
presence_penalty   float         0.0      Presence penalty
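
A request combining several of these parameters, as a sketch (the values are arbitrary):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/vllm-mlx/v1",
    api_key="unused",  # placeholder
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Name three MLX features."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=150,
    stop=["\n\n"],          # cut generation at the first blank line
    frequency_penalty=0.2,  # discourage verbatim repetition
    presence_penalty=0.1,   # nudge towards new topics
)
print(response.choices[0].message.content)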

Configuration Example

endpoints:
  - url: "http://192.168.0.100:8000"
    name: "vllm-mlx-server"
    type: "vllm-mlx"
    priority: 80
    model_url: "/v1/models"
    health_check_url: "/health"
    check_interval: 5s
    check_timeout: 2s

Request Headers

All requests are forwarded with:

  • X-Olla-Request-ID - Unique request identifier
  • X-Forwarded-For - Client IP address
  • Custom headers from endpoint configuration

Response Headers

All responses include:

  • X-Olla-Endpoint - Backend endpoint name (e.g., "vllm-mlx-server")
  • X-Olla-Model - Model used for the request
  • X-Olla-Backend-Type - Always "vllm-mlx" for these endpoints
  • X-Olla-Response-Time - Total processing time
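
These headers can be inspected programmatically; a minimal sketch with requests:

import requests

resp = requests.post(
    "http://localhost:40114/olla/vllm-mlx/v1/chat/completions",
    json={
        "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 10,
    },
    timeout=30,
)
print(resp.headers.get("X-Olla-Endpoint"))       # e.g. "vllm-mlx-server"
print(resp.headers.get("X-Olla-Backend-Type"))   # "vllm-mlx"
print(resp.headers.get("X-Olla-Response-Time"))  # total processing time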