# Ollama API
Proxy endpoints for Ollama instances. All Ollama API endpoints are available through the `/olla/ollama/` prefix.
## Endpoints Overview
Method | URI | Description |
---|---|---|
GET | /olla/ollama/ | Health check |
POST | /olla/ollama/api/generate | Generate completion |
POST | /olla/ollama/api/chat | Chat completion |
POST | /olla/ollama/api/embeddings | Generate embeddings |
GET | /olla/ollama/api/tags | List local models |
POST | /olla/ollama/api/show | Show model information |
POST | /olla/ollama/api/pull | Pull a model |
POST | /olla/ollama/api/push | Push a model |
POST | /olla/ollama/api/copy | Copy a model |
DELETE | /olla/ollama/api/delete | Delete a model |
POST | /olla/ollama/api/create | Create a model |
GET | /olla/ollama/api/ps | List running models |
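
For a quick sanity check, the health endpoint at the top of the table can be hit directly (the examples on this page assume Olla is listening on `localhost:40114`):

```bash
# Health check against the Ollama proxy prefix
curl http://localhost:40114/olla/ollama/
```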
## OpenAI-Compatible Endpoints
Method | URI | Description |
---|---|---|
GET | /olla/ollama/v1/models | List models (OpenAI format) |
POST | /olla/ollama/v1/chat/completions | Chat completion (OpenAI format) |
POST | /olla/ollama/v1/completions | Text completion (OpenAI format) |
POST | /olla/ollama/v1/embeddings | Generate embeddings (OpenAI format) |
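
As a quick check of the OpenAI-compatible surface, listing models works like a standard OpenAI API call (same local address as the other examples on this page):

```bash
# List models in OpenAI format via the proxy
curl http://localhost:40114/olla/ollama/v1/models
```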
## POST /olla/ollama/api/generate
Generate text completion using Ollama's native API.
### Request
```bash
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "prompt": "Why is the sky blue?",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "top_p": 0.9,
      "num_predict": 200
    }
  }'
```
### Response

```json
{
"model": "llama3.2:latest",
"created_at": "2024-01-15T10:30:00Z",
"response": "The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters Earth's atmosphere, it collides with gas molecules. Blue light waves are shorter and scatter more than other colors, making the sky appear blue to our eyes.",
"done": true,
"context": [1234, 5678, ...],
"total_duration": 5250000000,
"load_duration": 1200000000,
"prompt_eval_count": 8,
"prompt_eval_duration": 350000000,
"eval_count": 45,
"eval_duration": 3700000000
}
```
### Streaming Response
When `"stream": true`:

```
{"model":"llama3.2:latest","created_at":"2024-01-15T10:30:00Z","response":"The","done":false}
{"model":"llama3.2:latest","created_at":"2024-01-15T10:30:00Z","response":" sky","done":false}
{"model":"llama3.2:latest","created_at":"2024-01-15T10:30:00Z","response":" appears","done":false}
...
{"model":"llama3.2:latest","created_at":"2024-01-15T10:30:01Z","response":"","done":true,"total_duration":5250000000}
```
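To watch the stream from a terminal, the same request can be sent with `"stream": true` and curl's `-N` (no-buffer) flag so each JSON line is printed as it arrives; this is a sketch that reuses the request above:

```bash
# Stream the generate response as newline-delimited JSON objects
curl -N -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:latest", "prompt": "Why is the sky blue?", "stream": true}'
```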
## POST /olla/ollama/api/chat
Chat completion with conversation history.
### Request

```bash
curl -X POST http://localhost:40114/olla/ollama/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"stream": false,
"options": {
"temperature": 0.7
}
}'
```
### Response

```json
{
"model": "llama3.2:latest",
"created_at": "2024-01-15T10:30:00Z",
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"done": true,
"total_duration": 2150000000,
"load_duration": 1200000000,
"prompt_eval_count": 10,
"prompt_eval_duration": 150000000,
"eval_count": 8,
"eval_duration": 800000000
}
```
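The endpoint is stateless, so conversation history is carried by resending earlier turns in `messages`. A follow-up request might look like this (a sketch reusing the request shape above):

```bash
# Continue the conversation by including prior user and assistant turns
curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."},
      {"role": "user", "content": "What is its population?"}
    ],
    "stream": false
  }'
```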
## POST /olla/ollama/v1/chat/completions
OpenAI-compatible chat completion endpoint.
### Request

```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is 2+2?"
}
],
"temperature": 0.7,
"max_tokens": 100,
"stream": false
}'
```
### Response

```json
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1705334400,
"model": "llama3.2:latest",
"system_fingerprint": "fp_abc123",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "2 + 2 equals 4."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
}
}
```
### Streaming Response
When `"stream": true`:

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"llama3.2:latest","choices":[{"index":0,"delta":{"role":"assistant","content":"2"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"llama3.2:latest","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1705334400,"model":"llama3.2:latest","choices":[{"index":0,"delta":{"content":" 2"},"finish_reason":null}]}
data: [DONE]
```
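The stream is delivered as server-sent events: each line carries a `data:` prefix and the stream ends with the `[DONE]` sentinel, which the client must handle. From the command line it can be consumed with curl's no-buffer mode (a sketch mirroring the request above):

```bash
# Request an SSE stream from the OpenAI-compatible chat endpoint
curl -N -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:latest", "messages": [{"role": "user", "content": "What is 2+2?"}], "stream": true}'
```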
## POST /olla/ollama/api/embeddings
Generate embeddings for text input.
### Request

```bash
curl -X POST http://localhost:40114/olla/ollama/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"prompt": "The quick brown fox jumps over the lazy dog"
}'
```
### Response

```json
{
"embedding": [0.123, -0.456, 0.789, ...],
"model": "nomic-embed-text",
"total_duration": 150000000,
"load_duration": 50000000,
"prompt_eval_count": 9
}
```
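The OpenAI-compatible embeddings route listed in the overview accepts the OpenAI request shape, with `input` in place of `prompt` (a sketch using the same embedding model as above):

```bash
# OpenAI-format embeddings through the proxy
curl -X POST http://localhost:40114/olla/ollama/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "input": "The quick brown fox jumps over the lazy dog"}'
```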
## GET /olla/ollama/api/tags
List all locally available models.
### Request
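```bash
curl http://localhost:40114/olla/ollama/api/tags
```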
### Response

```json
{
"models": [
{
"name": "llama3.2:latest",
"model": "llama3.2:latest",
"modified_at": "2024-01-15T08:00:00Z",
"size": 4089352192,
"digest": "sha256:abc123def456",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"families": ["llama"],
"parameter_size": "3.2B",
"quantization_level": "Q4_0"
}
},
{
"name": "mistral:latest",
"model": "mistral:latest",
"modified_at": "2024-01-14T12:00:00Z",
"size": 4113859584,
"digest": "sha256:def789ghi012",
"details": {
"parent_model": "",
"format": "gguf",
"family": "mistral",
"families": ["mistral"],
"parameter_size": "7B",
"quantization_level": "Q4_0"
}
}
]
}
```
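For scripting, the model names can be extracted from the response, for example with `jq` (assuming it is installed):

```bash
# Print just the model names reported by the backend
curl -s http://localhost:40114/olla/ollama/api/tags | jq -r '.models[].name'
```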
## POST /olla/ollama/api/show
Get detailed information about a specific model.
### Request

```bash
curl -X POST http://localhost:40114/olla/ollama/api/show \
-H "Content-Type: application/json" \
-d '{
"name": "llama3.2:latest"
}'
```
### Response

```json
{
"modelfile": "FROM llama3.2.Q4_0.gguf\nPARAMETER temperature 0.7\nPARAMETER top_p 0.9",
"parameters": "temperature 0.7\ntop_p 0.9",
"template": "{{ .System }}\nUser: {{ .Prompt }}\nAssistant:",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"families": ["llama"],
"parameter_size": "3.2B",
"quantization_level": "Q4_0"
},
"model_info": {
"general.architecture": "llama",
"general.file_type": "Q4_0",
"general.parameter_count": 3200000000,
"general.quantization_version": 2
}
}
```
## Model Management
The following endpoints are also proxied; they are typically used for model management (a pull example follows the list):
- `POST /olla/ollama/api/pull` - Download a model from the Ollama registry
- `POST /olla/ollama/api/push` - Upload a model to the Ollama registry
- `POST /olla/ollama/api/copy` - Create a copy of a model
- `DELETE /olla/ollama/api/delete` - Remove a model
- `POST /olla/ollama/api/create` - Create a model from a Modelfile
- `GET /olla/ollama/api/ps` - List currently loaded models
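
For example, a pull request through the proxy looks like this (a sketch; the model name is illustrative, and `"stream": false` returns a single status object instead of progress updates):

```bash
# Ask the backend to download a model from the Ollama registry
curl -X POST http://localhost:40114/olla/ollama/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2:latest", "stream": false}'
```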
## Request Headers
All requests are forwarded with these additional headers:
- `X-Olla-Request-ID` - Unique request identifier
- `X-Forwarded-For` - Client IP address
- `X-Forwarded-Host` - Original host
- `X-Forwarded-Proto` - Original protocol
## Response Headers
All responses include the following headers (an example of inspecting them follows the list):
- `X-Olla-Endpoint` - Backend endpoint name (e.g., "local-ollama")
- `X-Olla-Model` - Model used for the request
- `X-Olla-Backend-Type` - Always "ollama" for these endpoints
- `X-Olla-Response-Time` - Total processing time
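
A quick way to confirm these headers is to include response headers in curl's output and filter for the Olla-specific ones (a sketch; the grep match is case-insensitive):

```bash
# Show only the X-Olla-* response headers for a simple request
curl -si http://localhost:40114/olla/ollama/api/tags | grep -i '^x-olla'
```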