Ollama Integration¶
Home | github.com/ollama/ollama |
---|---|
Type | ollama (use in endpoint configuration) |
Profile | ollama.yaml (see latest) |
Features | |
Unsupported | |
Attributes | |
Prefixes | ollama (routes under /olla/ollama/) |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add Ollama to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s
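Once Olla is running with this configuration, a quick way to confirm the endpoint is registered is to list its models through the /olla/ollama/ prefix. A minimal sketch using Python's standard library, assuming Olla's default port of 40114:

import json
import urllib.request

# List the models Olla can see on the local Ollama endpoint.
with urllib.request.urlopen("http://localhost:40114/olla/ollama/api/tags") as resp:
    data = json.load(resp)

# Ollama's /api/tags response contains a "models" array.
for model in data.get("models", []):
    print(model.get("name"))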
Multiple Ollama Instances¶
Configure multiple Ollama servers for load balancing:
discovery:
  static:
    endpoints:
      # Primary GPU server
      - url: "http://gpu-server:11434"
        name: "ollama-gpu"
        type: "ollama"
        priority: 100

      # Secondary server
      - url: "http://backup-server:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 75

      # Development machine
      - url: "http://dev-machine:11434"
        name: "ollama-dev"
        type: "ollama"
        priority: 50
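To see which instance served a given request, you can inspect Olla's response headers. The sketch below assumes Olla attaches an X-Olla-Endpoint header naming the backend that handled the request; check your Olla version's documentation for the exact header names.

import urllib.request

# Send a few requests through Olla and report which backend served each one.
# Assumes Olla exposes the serving endpoint via an X-Olla-Endpoint response
# header; adjust the header name if your version differs.
for i in range(5):
    req = urllib.request.Request("http://localhost:40114/olla/ollama/api/tags")
    with urllib.request.urlopen(req) as resp:
        served_by = resp.headers.get("X-Olla-Endpoint", "unknown")
        print(f"request {i + 1}: served by {served_by}")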
Remote Ollama Configuration¶
For remote Ollama servers:
discovery:
  static:
    endpoints:
      - url: "https://ollama.example.com"
        name: "ollama-cloud"
        type: "ollama"
        priority: 80
        check_interval: 10s
        check_timeout: 5s
Authentication Not Supported
Olla does not currently support authentication headers for endpoints. If your Ollama server requires authentication, you'll need to use a reverse proxy or wait for this feature to be added.
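In production the usual workaround is a proper reverse proxy (nginx, Caddy) that injects the credential. As a quick interim measure, a small header-injecting forwarder can sit between Olla and the authenticated server. The following is a minimal, non-streaming sketch using Python's standard library; the upstream URL, bearer token, and listen port 11435 are placeholders.

import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://ollama.example.com"   # authenticated Ollama server (placeholder)
TOKEN = "replace-with-your-token"         # placeholder credential

class AuthProxy(BaseHTTPRequestHandler):
    def _forward(self):
        # Read the incoming body (if any) and replay it upstream with an auth header.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            method=self.command,
            headers={
                "Content-Type": self.headers.get("Content-Type", "application/json"),
                "Authorization": f"Bearer {TOKEN}",
            },
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    do_GET = _forward
    do_POST = _forward

# Point the Olla endpoint URL at http://localhost:11435 instead of the
# authenticated server; the proxy adds the Authorization header on the way through.
HTTPServer(("0.0.0.0", 11435), AuthProxy).serve_forever()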
Endpoints Supported¶
The following endpoints are supported by the Ollama integration profile:
Path | Description |
---|---|
/ | Health Check |
/api/generate | Text Completion (Ollama format) |
/api/chat | Chat Completion (Ollama format) |
/api/embeddings | Generate Embeddings |
/api/tags | List Local Models |
/api/show | Show Model Information |
/v1/models | List Models (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/v1/completions | Text Completions (OpenAI format) |
/v1/embeddings | Embeddings (OpenAI format) |
Usage Examples¶
Chat Completion (Ollama Format)¶
curl -X POST http://localhost:40114/olla/ollama/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the meaning of life?"}
],
"stream": false
}'
Text Generation (Ollama Format)¶
curl -X POST http://localhost:40114/olla/ollama/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "mistral:latest",
"prompt": "Once upon a time",
"options": {
"temperature": 0.8,
"num_predict": 100
}
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/ollama/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{"role": "user", "content": "Write a haiku about programming"}
],
"stream": true
}'
OpenAI Compatibility¶
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 150
}'
Embeddings¶
curl -X POST http://localhost:40114/olla/ollama/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text:latest",
"prompt": "The quick brown fox jumps over the lazy dog"
}'
List Available Models¶
# Ollama format
curl http://localhost:40114/olla/ollama/api/tags
# OpenAI format
curl http://localhost:40114/olla/ollama/v1/models
Model Information¶
curl -X POST http://localhost:40114/olla/ollama/api/show \
-H "Content-Type: application/json" \
-d '{"name": "llama3.2:latest"}'
Ollama Specifics¶
Model Loading Behaviour¶
Ollama has unique model loading characteristics:
- Dynamic Loading: Models load on first request
- Memory Management: Unloads models after idle timeout
- Loading Delay: First request to a model can be slow (see the warm-up sketch after this list)
- Concurrent Models: Limited by available memory
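Because the first request after a model is loaded (or reloaded after the idle timeout) pays the loading cost, it can help to warm a model up before real traffic arrives. A minimal warm-up sketch through Olla, assuming the default Olla port and a model that is already pulled on the endpoint:

import json
import time
import urllib.request

MODEL = "llama3.2:latest"  # any model already pulled on the endpoint

def generate(prompt: str) -> float:
    """Send a small non-streaming generate request and return elapsed seconds."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False,
                       "options": {"num_predict": 1}}).encode()
    req = urllib.request.Request(
        "http://localhost:40114/olla/ollama/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

print(f"warm-up (model load): {generate('hi'):.1f}s")
print(f"second request:       {generate('hi'):.1f}s")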
Model Naming Convention¶
Ollama model names follow a model:tag format, optionally prefixed with a namespace (for example library/). Examples:
- llama3.2:latest
- llama3.2:3b
- mistral:7b-instruct-q4_0
- library/codellama:13b
Quantisation Levels¶
Ollama supports various quantisation levels:
Quantisation | Memory Usage | Performance | Quality |
---|---|---|---|
Q4_0 | ~50% | Fast | Good |
Q4_1 | ~55% | Fast | Better |
Q5_0 | ~60% | Moderate | Better |
Q5_1 | ~65% | Moderate | Better |
Q8_0 | ~85% | Slower | Best |
F16 | 100% | Slowest | Highest |
Options Parameters¶
Ollama-specific generation options:
{
  "options": {
    "temperature": 0.8,       // Randomness (0-1)
    "top_k": 40,              // Top K sampling
    "top_p": 0.9,             // Nucleus sampling
    "num_predict": 128,       // Max tokens to generate
    "stop": ["\n", "User:"],  // Stop sequences
    "seed": 42,               // Reproducible generation
    "num_ctx": 2048,          // Context window size
    "repeat_penalty": 1.1,    // Repetition penalty
    "mirostat": 2,            // Mirostat sampling
    "mirostat_tau": 5.0,      // Mirostat target entropy
    "mirostat_eta": 0.1       // Mirostat learning rate
  }
}
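These options travel inside the options object of a generate or chat request. For example, a deterministic, length-limited completion through Olla (a sketch; the model and values are illustrative):

import json
import urllib.request

# Send Ollama-specific options with a generate request through Olla.
body = json.dumps({
    "model": "mistral:latest",
    "prompt": "List three uses for a paperclip.",
    "stream": False,
    "options": {
        "temperature": 0.2,   # low randomness
        "seed": 42,           # reproducible output
        "num_predict": 64,    # cap generated tokens
    },
}).encode()

req = urllib.request.Request(
    "http://localhost:40114/olla/ollama/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])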
Starting Ollama¶
Local Installation¶
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
# Pull a model
ollama pull llama3.2:latest
# Test directly
ollama run llama3.2:latest "Hello"
Docker Deployment¶
# CPU only
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama
# With GPU support
docker run -d \
--gpus all \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama
Docker Compose¶
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
    driver: local
Profile Customisation¶
To customise Ollama behaviour, create config/profiles/ollama-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: ollama
version: "1.0"

# Add custom routing prefixes
routing:
  prefixes:
    - ollama
    - ai  # Add custom prefix

# Adjust for slow model loading
characteristics:
  timeout: 10m  # Increase from 5m for large models

# Model capability detection
models:
  capability_patterns:
    vision:
      - "*llava*"
      - "*bakllava*"
      - "vision*"
    embeddings:
      - "*embed*"
      - "nomic-embed-text*"
      - "mxbai-embed*"
    code:
      - "*code*"
      - "codellama*"
      - "deepseek-coder*"
      - "qwen*coder*"

  # Context window detection
  context_patterns:
    - pattern: "*-32k*"
      context: 32768
    - pattern: "*-16k*"
      context: 16384
    - pattern: "llama3*"
      context: 8192
See Profile Configuration for complete customisation options.
Environment Variables¶
Ollama behaviour can be controlled via environment variables:
Variable | Description | Default |
---|---|---|
OLLAMA_HOST | Bind address | 127.0.0.1:11434 |
OLLAMA_MODELS | Model storage path | ~/.ollama/models |
OLLAMA_KEEP_ALIVE | Model idle timeout | 5m |
OLLAMA_MAX_LOADED_MODELS | Max concurrent models | Unlimited |
OLLAMA_NUM_PARALLEL | Parallel request handling | 1 |
OLLAMA_MAX_QUEUE | Max queued requests | 512 |
OLLAMA_DEBUG | Enable debug logging | false |
Multi-Modal Support¶
Vision Models (LLaVA)¶
Ollama supports vision models for image analysis:
# Pull a vision model
ollama pull llava:latest
# Use with image
curl -X POST http://localhost:40114/olla/ollama/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llava:latest",
"prompt": "What is in this image?",
"images": ["base64_encoded_image_data"]
}'
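The images field expects base64-encoded image bytes. A small sketch that encodes a local file and sends it through Olla (the file path is a placeholder):

import base64
import json
import urllib.request

# Encode a local image as base64 for Ollama's "images" field.
with open("photo.jpg", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode()

body = json.dumps({
    "model": "llava:latest",
    "prompt": "What is in this image?",
    "stream": False,
    "images": [image_b64],
}).encode()

req = urllib.request.Request(
    "http://localhost:40114/olla/ollama/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])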
Supported Vision Models¶
- llava:latest - General vision model
- llava:13b - Larger vision model
- bakllava:latest - Alternative vision model
Troubleshooting¶
Model Not Found¶
Issue: "model not found" error
Solution:
1. Ensure the model is pulled on the endpoint: ollama pull llama3.2:latest
2. Verify the model name format, including the tag: ollama list
Slow First Request¶
Issue: First request to a model is very slow
Solution:
1. Pre-load models with a warm-up request before real traffic arrives
2. Increase the keep-alive so models stay loaded, e.g. OLLAMA_KEEP_ALIVE=30m
3. Adjust the response timeout in Olla (proxy.response_timeout) to allow for model loading
Out of Memory¶
Issue: "out of memory" errors
Solution:
1. Limit concurrent models, e.g. OLLAMA_MAX_LOADED_MODELS=1
2. Use a smaller quantisation (e.g. Q4_0 instead of Q8_0)
3. Configure memory limits for the Ollama container or host
Connection Refused¶
Issue: Cannot connect to Ollama
Solution:
1. Check Ollama is running: curl http://localhost:11434/
2. Verify the bind address: set OLLAMA_HOST=0.0.0.0 if Olla connects from another machine
3. Check that firewalls allow traffic on port 11434
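A small connectivity check that covers the first two steps, hitting Ollama directly and then the same backend through Olla (URLs assume the defaults used elsewhere on this page):

import urllib.request

# Check Ollama directly, then the same backend through Olla.
CHECKS = {
    "ollama (direct)": "http://localhost:11434/",
    "ollama via olla": "http://localhost:40114/olla/ollama/api/tags",
}

for name, url in CHECKS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: OK ({resp.status})")
    except Exception as exc:  # connection refused, timeout, HTTP error, ...
        print(f"{name}: FAILED ({exc})")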
Best Practices¶
1. Use Model Unification¶
With multiple Ollama instances, enable model unification (see Model Unification for the configuration details). This provides a single model catalogue across all instances.
2. Configure Appropriate Timeouts¶
Account for model loading times:
proxy:
  response_timeout: 600s   # 10 minutes for large models (default)
  connection_timeout: 30s  # Default connection timeout

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        check_timeout: 5s  # Allow time for health checks
3. Optimise for Your Hardware¶
For GPU Servers¶
endpoints:
  - url: "http://gpu-server:11434"
    name: "ollama-gpu"
    priority: 100  # Prefer GPU
    resources:
      concurrency_limits:
        - min_memory_gb: 0
          max_concurrent: 4  # GPU can handle multiple
For CPU Servers¶
endpoints:
  - url: "http://cpu-server:11434"
    name: "ollama-cpu"
    priority: 50  # Lower priority
    resources:
      concurrency_limits:
        - min_memory_gb: 0
          max_concurrent: 1  # CPU limited to one
4. Monitor Performance¶
Use Olla's status endpoints:
# Check health
curl http://localhost:40114/internal/health
# View endpoint status
curl http://localhost:40114/internal/status/endpoints
# Monitor model availability
curl http://localhost:40114/internal/status/models
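These endpoints can also be polled on an interval for basic monitoring. A sketch that checks them every 30 seconds and prints whatever they return (the response format isn't documented here, so the sketch just prints the raw body):

import time
import urllib.request

# Poll Olla's internal status endpoints and print the raw responses.
ENDPOINTS = [
    "http://localhost:40114/internal/health",
    "http://localhost:40114/internal/status/endpoints",
]

while True:
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(f"{url} -> {resp.status}: {resp.read().decode()[:200]}")
        except Exception as exc:
            print(f"{url} -> FAILED ({exc})")
    time.sleep(30)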
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
LangChain¶
from langchain_community.llms import Ollama

llm = Ollama(
    base_url="http://localhost:40114/olla/ollama",
    model="llama3.2:latest"
)

response = llm.invoke("Tell me a joke")
Continue.dev¶
Configure Continue to use Olla with Ollama:
{
  "models": [{
    "title": "Ollama via Olla",
    "provider": "ollama",
    "model": "llama3.2:latest",
    "apiBase": "http://localhost:40114/olla/ollama"
  }]
}
Aider¶
# Use with Aider
aider --openai-api-base http://localhost:40114/olla/ollama/v1 \
--model llama3.2:latest
Next Steps¶
- Profile Configuration - Customise Ollama behaviour
- Model Unification - Understand model management
- Load Balancing - Configure multi-instance setups
- OpenWebUI Integration - Set up web interface