Ollama Integration¶
| Home | github.com/ollama/ollama |
|---|---|
| Since | Olla v0.0.1 |
| Type | ollama (use in endpoint configuration) |
| Profile | ollama.yaml (see latest) |
| Features | |
| Unsupported | |
| Attributes | |
| Prefixes | ollama |
| Endpoints | See below |
Configuration¶
Basic Setup¶
Add Ollama to your Olla configuration:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s
```
Multiple Ollama Instances¶
Configure multiple Ollama servers for load balancing:
```yaml
discovery:
  static:
    endpoints:
      # Primary GPU server
      - url: "http://gpu-server:11434"
        name: "ollama-gpu"
        type: "ollama"
        priority: 100

      # Secondary server
      - url: "http://backup-server:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 75

      # Development machine
      - url: "http://dev-machine:11434"
        name: "ollama-dev"
        type: "ollama"
        priority: 50
```
Remote Ollama Configuration¶
For remote Ollama servers:
```yaml
discovery:
  static:
    endpoints:
      - url: "https://ollama.example.com"
        name: "ollama-cloud"
        type: "ollama"
        priority: 80
        check_interval: 10s
        check_timeout: 5s
```
Authentication Not Supported
Olla does not currently support authentication headers for endpoints. If your Ollama server requires authentication, you'll need to use a reverse proxy or wait for this feature to be added.
Anthropic Messages API Support¶
Ollama v0.14.0+ natively supports the Anthropic Messages API, enabling Olla to forward Anthropic-format requests directly without translation overhead (passthrough mode).
When Olla detects that an Ollama endpoint supports native Anthropic format (via the anthropic_support section in config/profiles/ollama.yaml), it will bypass the Anthropic-to-OpenAI translation pipeline and forward requests directly to /v1/messages on the backend.
Profile configuration (from config/profiles/ollama.yaml):
```yaml
api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false
    min_version: "0.14.0"
    limitations:
      - token_counting_404
```
Key details:
- Minimum Ollama version: v0.14.0
- Token counting (/v1/messages/count_tokens): not supported (returns 404)
- Passthrough mode is automatic; no client-side configuration is needed
- Responses include the X-Olla-Mode: passthrough header when passthrough is active
- Olla falls back to translation mode if the passthrough conditions are not met
Ollama Anthropic Compatibility
For details on Ollama's Anthropic compatibility, see the Ollama Anthropic compatibility documentation.
For more information, see API Translation and Anthropic API Reference.
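The sketch below sends an Anthropic-format request through Olla. It assumes the Messages route is exposed under the same /olla/ollama prefix used elsewhere on this page and that the backend runs Ollama v0.14.0 or later; the -i flag shows the response headers so you can confirm X-Olla-Mode: passthrough.

```bash
# Anthropic Messages request via Olla (route prefix is an assumption)
curl -i -X POST http://localhost:40114/olla/ollama/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```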
Endpoints Supported¶
The following endpoints are supported by the Ollama integration profile:
| Path | Description |
|---|---|
| / | Health Check |
| /api/generate | Text Completion (Ollama format) |
| /api/chat | Chat Completion (Ollama format) |
| /api/embeddings | Generate Embeddings |
| /api/tags | List Local Models |
| /api/show | Show Model Information |
| /v1/models | List Models (OpenAI format) |
| /v1/chat/completions | Chat Completions (OpenAI format) |
| /v1/completions | Text Completions (OpenAI format) |
| /v1/embeddings | Embeddings (OpenAI format) |
Usage Examples¶
Chat Completion (Ollama Format)¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the meaning of life?"}
    ],
    "stream": false
  }'
```
Text Generation (Ollama Format)¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:latest",
    "prompt": "Once upon a time",
    "options": {
      "temperature": 0.8,
      "num_predict": 100
    }
  }'
```
Streaming Response¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming"}
    ],
    "stream": true
  }'
```
OpenAI Compatibility¶
```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'
```
Embeddings¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text:latest",
    "prompt": "The quick brown fox jumps over the lazy dog"
  }'
```
List Available Models¶
```bash
# Ollama format
curl http://localhost:40114/olla/ollama/api/tags

# OpenAI format
curl http://localhost:40114/olla/ollama/v1/models
```
Model Information¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/show \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2:latest"}'
```
Ollama Specifics¶
Model Loading Behaviour¶
Ollama has unique model loading characteristics:
- Dynamic Loading: Models load on first request
- Memory Management: Unloads models after idle timeout
- Loading Delay: First request to a model can be slow
- Concurrent Models: Limited by available memory
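Because models load on first use, you can warm one up ahead of time. A minimal sketch using Ollama's /api/generate preload behaviour (a request with no prompt simply loads the model; keep_alive controls how long it stays resident), routed through Olla; the 30m value is illustrative:

```bash
# Load the model into memory and keep it resident for 30 minutes
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "keep_alive": "30m"
  }'
```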
Model Naming Convention¶
Ollama uses a name:tag naming format, optionally prefixed with a namespace. Examples:

- llama3.2:latest
- llama3.2:3b
- mistral:7b-instruct-q4_0
- library/codellama:13b
Quantisation Levels¶
Ollama supports various quantisation levels:
| Quantisation | Memory Usage | Performance | Quality |
|---|---|---|---|
| Q4_0 | ~50% | Fast | Good |
| Q4_1 | ~55% | Fast | Better |
| Q5_0 | ~60% | Moderate | Better |
| Q5_1 | ~65% | Moderate | Better |
| Q8_0 | ~85% | Slower | Best |
| F16 | 100% | Slowest | Highest |
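The quantisation level is selected via the model tag when pulling, for example:

```bash
# Pull a 4-bit quantised Mistral variant
ollama pull mistral:7b-instruct-q4_0
```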
Options Parameters¶
Ollama-specific generation options:
```json
{
  "options": {
    "temperature": 0.8,      // Randomness (0-1)
    "top_k": 40,             // Top K sampling
    "top_p": 0.9,            // Nucleus sampling
    "num_predict": 128,      // Max tokens to generate
    "stop": ["\n", "User:"], // Stop sequences
    "seed": 42,              // Reproducible generation
    "num_ctx": 2048,         // Context window size
    "repeat_penalty": 1.1,   // Repetition penalty
    "mirostat": 2,           // Mirostat sampling
    "mirostat_tau": 5.0,     // Mirostat target entropy
    "mirostat_eta": 0.1      // Mirostat learning rate
  }
}
```
Starting Ollama¶
Local Installation¶
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model
ollama pull llama3.2:latest

# Test directly
ollama run llama3.2:latest "Hello"
```
Docker Deployment¶
```bash
# CPU only
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# With GPU support
docker run -d \
  --gpus all \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```
Docker Compose¶
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
    driver: local
```
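Once the stack is running, models can be pulled inside the container (using the container name defined in the compose file above):

```bash
# Start the service and pull a model inside the container
docker compose up -d
docker exec -it ollama ollama pull llama3.2:latest
```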
Profile Customisation¶
To customise Ollama behaviour, create config/profiles/ollama-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
```yaml
name: ollama
version: "1.0"

# Add custom routing prefixes
routing:
  prefixes:
    - ollama
    - ai  # Add custom prefix

# Adjust for slow model loading
characteristics:
  timeout: 10m  # Increase from 5m for large models

# Model capability detection
models:
  capability_patterns:
    vision:
      - "*llava*"
      - "*bakllava*"
      - "vision*"
    embeddings:
      - "*embed*"
      - "nomic-embed-text*"
      - "mxbai-embed*"
    code:
      - "*code*"
      - "codellama*"
      - "deepseek-coder*"
      - "qwen*coder*"

  # Context window detection
  context_patterns:
    - pattern: "*-32k*"
      context: 32768
    - pattern: "*-16k*"
      context: 16384
    - pattern: "llama3*"
      context: 8192
```
See Profile Configuration for complete customisation options.
Environment Variables¶
Ollama behaviour can be controlled via environment variables:
| Variable | Description | Default |
|---|---|---|
| OLLAMA_HOST | Bind address | 127.0.0.1:11434 |
| OLLAMA_MODELS | Model storage path | ~/.ollama/models |
| OLLAMA_KEEP_ALIVE | Model idle timeout | 5m |
| OLLAMA_MAX_LOADED_MODELS | Max concurrent models | Unlimited |
| OLLAMA_NUM_PARALLEL | Parallel request handling | 1 |
| OLLAMA_MAX_QUEUE | Max queued requests | 512 |
| OLLAMA_DEBUG | Enable debug logging | false |
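For example, to expose Ollama on all interfaces and allow two models to stay loaded when launching it manually (values are illustrative):

```bash
# Bind to all interfaces and keep up to two models loaded
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```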
Multi-Modal Support¶
Vision Models (LLaVA)¶
Ollama supports vision models for image analysis:
```bash
# Pull a vision model
ollama pull llava:latest

# Use with image
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava:latest",
    "prompt": "What is in this image?",
    "images": ["base64_encoded_image_data"]
  }'
```
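To build the base64 payload from an actual image file, something like the following works with GNU coreutils (the -w0 flag disables line wrapping and differs on macOS; photo.jpg is a placeholder):

```bash
# Encode an image and ask the vision model about it via Olla
IMG=$(base64 -w0 photo.jpg)
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"llava:latest\", \"prompt\": \"What is in this image?\", \"images\": [\"$IMG\"], \"stream\": false}"
```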
Supported Vision Models¶
- llava:latest - General vision model
- llava:13b - Larger vision model
- bakllava:latest - Alternative vision model
Troubleshooting¶
Model Not Found¶
Issue: "model not found" error
Solution:

1. Ensure the model has been pulled on the target endpoint
2. Verify the model name matches what the endpoint reports (see the check below)
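A quick check, assuming the local endpoint from the basic setup above:

```bash
# List the models the endpoint actually has (via Olla)
curl http://localhost:40114/olla/ollama/api/tags

# Pull the missing model directly on the Ollama host
ollama pull llama3.2:latest
```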
Slow First Request¶
Issue: First request to a model is very slow
Solution:

1. Pre-load models before they are needed (see the warm-up example under Model Loading Behaviour)
2. Increase Ollama's keep-alive so models stay resident for longer (see the example below)
3. Adjust the proxy timeout in Olla (see Configure Appropriate Timeouts below)
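For example, keeping idle models resident for longer via the OLLAMA_KEEP_ALIVE variable from the Environment Variables table (the 30m value is illustrative):

```bash
# Keep idle models loaded for 30 minutes instead of the default 5
OLLAMA_KEEP_ALIVE=30m ollama serve
```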
Out of Memory¶
Issue: "out of memory" errors
Solution:

1. Limit the number of concurrently loaded models
2. Use a smaller quantisation (for example Q4_0 instead of Q8_0)
3. Configure memory limits for the container or host (see the sketch below)
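A minimal sketch for a Docker deployment, reusing the earlier run command with an added memory cap and the OLLAMA_MAX_LOADED_MODELS variable (the 16g limit is illustrative):

```bash
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  --memory 16g \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  ollama/ollama
```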
Connection Refused¶
Issue: Cannot connect to Ollama
Solution:

1. Check that Ollama is running and responding
2. Verify the bind address (OLLAMA_HOST) allows connections from Olla's host
3. Check firewall rules for port 11434 (see the examples below)
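Quick checks, assuming the default port; the firewall command is an example for Ubuntu's ufw and will differ on other systems:

```bash
# Is Ollama answering locally on its host?
curl http://localhost:11434/api/tags

# Allow the Ollama port through the firewall (Ubuntu/ufw example)
sudo ufw allow 11434/tcp
```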
Best Practices¶
1. Use Model Unification¶
With multiple Ollama instances, enable model unification in your Olla configuration (see Model Unification for the relevant settings). This provides a single model catalogue across all instances.
2. Configure Appropriate Timeouts¶
Account for model loading times:
```yaml
proxy:
  response_timeout: 600s   # 10 minutes for large models (default)
  connection_timeout: 30s  # Default connection timeout

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        check_timeout: 5s  # Allow time for health checks
```
3. Optimise for Your Hardware¶
For GPU Servers¶
```yaml
endpoints:
  - url: "http://gpu-server:11434"
    name: "ollama-gpu"
    priority: 100  # Prefer GPU
    resources:
      concurrency_limits:
        - min_memory_gb: 0
          max_concurrent: 4  # GPU can handle multiple
```
For CPU Servers¶
```yaml
endpoints:
  - url: "http://cpu-server:11434"
    name: "ollama-cpu"
    priority: 50  # Lower priority
    resources:
      concurrency_limits:
        - min_memory_gb: 0
          max_concurrent: 1  # CPU limited to one
```
4. Monitor Performance¶
Use Olla's status endpoints:
```bash
# Check health
curl http://localhost:40114/internal/health

# View endpoint status
curl http://localhost:40114/internal/status/endpoints

# Monitor model availability
curl http://localhost:40114/internal/status/models
```
Integration with Tools¶
OpenAI SDK¶
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
```
LangChain¶
```python
from langchain_community.llms import Ollama

llm = Ollama(
    base_url="http://localhost:40114/olla/ollama",
    model="llama3.2:latest"
)

response = llm.invoke("Tell me a joke")
```
Continue.dev¶
Configure Continue to use Olla with Ollama:
```json
{
  "models": [{
    "title": "Ollama via Olla",
    "provider": "ollama",
    "model": "llama3.2:latest",
    "apiBase": "http://localhost:40114/olla/ollama"
  }]
}
```
Aider¶
```bash
# Use with Aider
aider --openai-api-base http://localhost:40114/olla/ollama/v1 \
      --model llama3.2:latest
```
Next Steps¶
- Profile Configuration - Customise Ollama behaviour
- Model Unification - Understand model management
- Load Balancing - Configure multi-instance setups
- OpenWebUI Integration - Set up web interface