Ollama Integration¶
Home | github.com/ollama/ollama |
---|---|
Type | ollama (use in endpoint configuration) |
Profile | ollama.yaml (see latest) |
Features | |
Unsupported | |
Attributes | |
Prefixes | ollama (routes under /olla/ollama/) |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add Ollama to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s
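Once Olla is running with this configuration, a quick way to confirm the endpoint is registered is to list its models through the /olla/ollama/ prefix. A minimal sketch using Python's standard library, assuming Olla's default port of 40114:

import json
import urllib.request

# List the models Olla can see on the local Ollama endpoint.
with urllib.request.urlopen("http://localhost:40114/olla/ollama/api/tags") as resp:
    data = json.load(resp)

# Ollama's /api/tags response contains a "models" array.
for model in data.get("models", []):
    print(model.get("name"))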
Multiple Ollama Instances¶
Configure multiple Ollama servers for load balancing:
discovery:
  static:
    endpoints:
      # Primary GPU server
      - url: "http://gpu-server:11434"
        name: "ollama-gpu"
        type: "ollama"
        priority: 100

      # Secondary server
      - url: "http://backup-server:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 75

      # Development machine
      - url: "http://dev-machine:11434"
        name: "ollama-dev"
        type: "ollama"
        priority: 50
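To see which instance served a given request, you can inspect Olla's response headers. The sketch below assumes Olla attaches an X-Olla-Endpoint header naming the backend that handled the request; check your Olla version's documentation for the exact header names.

import urllib.request

# Send a few requests through Olla and report which backend served each one.
# Assumes Olla exposes the serving endpoint via an X-Olla-Endpoint response
# header; adjust the header name if your version differs.
for i in range(5):
    req = urllib.request.Request("http://localhost:40114/olla/ollama/api/tags")
    with urllib.request.urlopen(req) as resp:
        served_by = resp.headers.get("X-Olla-Endpoint", "unknown")
        print(f"request {i + 1}: served by {served_by}")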
Remote Ollama Configuration¶
For remote Ollama servers:
discovery:
  static:
    endpoints:
      - url: "https://ollama.example.com"
        name: "ollama-cloud"
        type: "ollama"
        priority: 80
        check_interval: 10s
        check_timeout: 5s
Authentication Not Supported
Olla does not currently support authentication headers for endpoints. If your Ollama server requires authentication, you'll need to use a reverse proxy or wait for this feature to be added.
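In production the usual workaround is a proper reverse proxy (nginx, Caddy) that injects the credential. As a quick interim measure, a small header-injecting forwarder can sit between Olla and the authenticated server. The following is a minimal, non-streaming sketch using Python's standard library; the upstream URL, bearer token, and listen port 11435 are placeholders.

import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://ollama.example.com"   # authenticated Ollama server (placeholder)
TOKEN = "replace-with-your-token"         # placeholder credential

class AuthProxy(BaseHTTPRequestHandler):
    def _forward(self):
        # Read the incoming body (if any) and replay it upstream with an auth header.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            method=self.command,
            headers={
                "Content-Type": self.headers.get("Content-Type", "application/json"),
                "Authorization": f"Bearer {TOKEN}",
            },
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    do_GET = _forward
    do_POST = _forward

# Point the Olla endpoint URL at http://localhost:11435 instead of the
# authenticated server; the proxy adds the Authorization header on the way through.
HTTPServer(("0.0.0.0", 11435), AuthProxy).serve_forever()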
Endpoints Supported¶
The following endpoints are supported by the Ollama integration profile:
Path | Description |
---|---|
/ | Health Check |
/api/generate | Text Completion (Ollama format) |
/api/chat | Chat Completion (Ollama format) |
/api/embeddings | Generate Embeddings |
/api/tags | List Local Models |
/api/show | Show Model Information |
/v1/models | List Models (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/v1/completions | Text Completions (OpenAI format) |
/v1/embeddings | Embeddings (OpenAI format) |
Usage Examples¶
Chat Completion (Ollama Format)¶
curl -X POST http://localhost:40114/olla/ollama/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the meaning of life?"}
],
"stream": false
}'
Text Generation (Ollama Format)¶
curl -X POST http://localhost:40114/olla/ollama/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "mistral:latest",
"prompt": "Once upon a time",
"options": {
"temperature": 0.8,
"num_predict": 100
}
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/ollama/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{"role": "user", "content": "Write a haiku about programming"}
],
"stream": true
}'
OpenAI Compatibility¶
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 150
}'
Embeddings¶
curl -X POST http://localhost:40114/olla/ollama/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text:latest",
"prompt": "The quick brown fox jumps over the lazy dog"
}'
List Available Models¶
# Ollama format
curl http://localhost:40114/olla/ollama/api/tags
# OpenAI format
curl http://localhost:40114/olla/ollama/v1/models
Model Information¶
curl -X POST http://localhost:40114/olla/ollama/api/show \
-H "Content-Type: application/json" \
-d '{"name": "llama3.2:latest"}'
Ollama Specifics¶
Model Loading Behaviour¶
Ollama has unique model loading characteristics:
- Dynamic Loading: Models load on first request
- Memory Management: Unloads models after idle timeout
- Loading Delay: First request to a model can be slow (see the warm-up sketch after this list)
- Concurrent Models: Limited by available memory
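Because the first request after a model is loaded (or reloaded after the idle timeout) pays the loading cost, it can help to warm a model up before real traffic arrives. A minimal warm-up sketch through Olla, assuming the default Olla port and a model that is already pulled on the endpoint:

import json
import time
import urllib.request

MODEL = "llama3.2:latest"  # any model already pulled on the endpoint

def generate(prompt: str) -> float:
    """Send a small non-streaming generate request and return elapsed seconds."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False,
                       "options": {"num_predict": 1}}).encode()
    req = urllib.request.Request(
        "http://localhost:40114/olla/ollama/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

print(f"warm-up (model load): {generate('hi'):.1f}s")
print(f"second request:       {generate('hi'):.1f}s")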
Model Naming Convention¶
Ollama model names follow a model:tag format, optionally prefixed with a namespace (for example library/). Examples:
- llama3.2:latest
- llama3.2:3b
- mistral:7b-instruct-q4_0
- library/codellama:13b
Quantisation Levels¶
Ollama supports various quantisation levels:
Quantisation | Memory Usage | Performance | Quality |
---|---|---|---|
Q4_0 | ~50% | Fast | Good |
Q4_1 | ~55% | Fast | Better |
Q5_0 | ~60% | Moderate | Better |
Q5_1 | ~65% | Moderate | Better |
Q8_0 | ~85% | Slower | Best |
F16 | 100% | Slowest | Highest |
Options Parameters¶
Ollama-specific generation options:
{
  "options": {
    "temperature": 0.8,       // Randomness (0-1)
    "top_k": 40,              // Top K sampling
    "top_p": 0.9,             // Nucleus sampling
    "num_predict": 128,       // Max tokens to generate
    "stop": ["\n", "User:"],  // Stop sequences
    "seed": 42,               // Reproducible generation
    "num_ctx": 2048,          // Context window size
    "repeat_penalty": 1.1,    // Repetition penalty
    "mirostat": 2,            // Mirostat sampling
    "mirostat_tau": 5.0,      // Mirostat target entropy
    "mirostat_eta": 0.1       // Mirostat learning rate
  }
}
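These options travel inside the options object of a generate or chat request. For example, a deterministic, length-limited completion through Olla (a sketch; the model and values are illustrative):

import json
import urllib.request

# Send Ollama-specific options with a generate request through Olla.
body = json.dumps({
    "model": "mistral:latest",
    "prompt": "List three uses for a paperclip.",
    "stream": False,
    "options": {
        "temperature": 0.2,   # low randomness
        "seed": 42,           # reproducible output
        "num_predict": 64,    # cap generated tokens
    },
}).encode()

req = urllib.request.Request(
    "http://localhost:40114/olla/ollama/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])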
Starting Ollama¶
Local Installation¶
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
# Pull a model
ollama pull llama3.2:latest
# Test directly
ollama run llama3.2:latest "Hello"
Docker Deployment¶
# CPU only
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama
# With GPU support
docker run -d \
--gpus all \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
ollama/ollama
Docker Compose¶
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
    driver: local
Profile Customisation¶
To customise Ollama behaviour, create config/profiles/ollama-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: ollama
version: "1.0"

# Add custom routing prefixes
routing:
  prefixes:
    - ollama
    - ai  # Add custom prefix

# Adjust for slow model loading
characteristics:
  timeout: 10m  # Increase from 5m for large models

# Model capability detection
models:
  capability_patterns:
    vision:
      - "*llava*"
      - "*bakllava*"
      - "vision*"
    embeddings:
      - "*embed*"
      - "nomic-embed-text*"
      - "mxbai-embed*"
    code:
      - "*code*"
      - "codellama*"
      - "deepseek-coder*"
      - "qwen*coder*"

  # Context window detection
  context_patterns:
    - pattern: "*-32k*"
      context: 32768
    - pattern: "*-16k*"
      context: 16384
    - pattern: "llama3*"
      context: 8192
See Profile Configuration for complete customisation options.
Environment Variables¶
Ollama behaviour can be controlled via environment variables:
Variable | Description | Default |
---|---|---|
OLLAMA_HOST | Bind address | 127.0.0.1:11434 |
OLLAMA_MODELS | Model storage path | ~/.ollama/models |
OLLAMA_KEEP_ALIVE | Model idle timeout | 5m |
OLLAMA_MAX_LOADED_MODELS | Max concurrent models | Unlimited |
OLLAMA_NUM_PARALLEL | Parallel request handling | 1 |
OLLAMA_MAX_QUEUE | Max queued requests | 512 |
OLLAMA_DEBUG | Enable debug logging | false |
Multi-Modal Support¶
Vision Models (LLaVA)¶
Ollama supports vision models for image analysis:
# Pull a vision model
ollama pull llava:latest
# Use with image
curl -X POST http://localhost:40114/olla/ollama/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llava:latest",
"prompt": "What is in this image?",
"images": ["base64_encoded_image_data"]
}'
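The images field expects base64-encoded image bytes. A small sketch that encodes a local file and sends it through Olla (the file path is a placeholder):

import base64
import json
import urllib.request

# Encode a local image as base64 for Ollama's "images" field.
with open("photo.jpg", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode()

body = json.dumps({
    "model": "llava:latest",
    "prompt": "What is in this image?",
    "stream": False,
    "images": [image_b64],
}).encode()

req = urllib.request.Request(
    "http://localhost:40114/olla/ollama/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])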
Supported Vision Models¶
- llava:latest - General vision model
- llava:13b - Larger vision model
- bakllava:latest - Alternative vision model
Troubleshooting¶
Model Not Found¶
Issue: "model not found" error
Solution:
1. Ensure the model is pulled on the endpoint: ollama pull llama3.2:latest
2. Verify the model name format, including the tag: ollama list
Slow First Request¶
Issue: First request to a model is very slow
Solution:
1. Pre-load models with a warm-up request before real traffic arrives
2. Increase the keep-alive so models stay loaded, e.g. OLLAMA_KEEP_ALIVE=30m
3. Adjust the response timeout in Olla (proxy.response_timeout) to allow for model loading
Out of Memory¶
Issue: "out of memory" errors
Solution:
1. Limit concurrent models, e.g. OLLAMA_MAX_LOADED_MODELS=1
2. Use a smaller quantisation (e.g. Q4_0 instead of Q8_0)
3. Configure memory limits for the Ollama container or host
Connection Refused¶
Issue: Cannot connect to Ollama
Solution:
1. Check Ollama is running: curl http://localhost:11434/
2. Verify the bind address: set OLLAMA_HOST=0.0.0.0 if Olla connects from another machine
3. Check that firewalls allow traffic on port 11434
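A small connectivity check that covers the first two steps, hitting Ollama directly and then the same backend through Olla (URLs assume the defaults used elsewhere on this page):

import urllib.request

# Check Ollama directly, then the same backend through Olla.
CHECKS = {
    "ollama (direct)": "http://localhost:11434/",
    "ollama via olla": "http://localhost:40114/olla/ollama/api/tags",
}

for name, url in CHECKS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: OK ({resp.status})")
    except Exception as exc:  # connection refused, timeout, HTTP error, ...
        print(f"{name}: FAILED ({exc})")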
Best Practices¶
1. Use Model Unification¶
With multiple Ollama instances, enable model unification (see Model Unification for the configuration details). This provides a single model catalogue across all instances.
2. Configure Appropriate Timeouts¶
Account for model loading times:
proxy:
  response_timeout: 600s   # 10 minutes for large models (default)
  connection_timeout: 30s  # Default connection timeout

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        check_timeout: 5s  # Allow time for health checks
3. Optimise for Your Hardware¶
For GPU Servers¶
endpoints:
  - url: "http://gpu-server:11434"
    name: "ollama-gpu"
    priority: 100  # Prefer GPU
    resources:
      concurrency_limits:
        - min_memory_gb: 0
          max_concurrent: 4  # GPU can handle multiple
For CPU Servers¶
endpoints:
  - url: "http://cpu-server:11434"
    name: "ollama-cpu"
    priority: 50  # Lower priority
    resources:
      concurrency_limits:
        - min_memory_gb: 0
          max_concurrent: 1  # CPU limited to one
4. Monitor Performance¶
Use Olla's status endpoints:
# Check health
curl http://localhost:40114/internal/health
# View endpoint status
curl http://localhost:40114/internal/status/endpoints
# Monitor model availability
curl http://localhost:40114/internal/status/models
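These endpoints can also be polled on an interval for basic monitoring. A sketch that checks them every 30 seconds and prints whatever they return (the response format isn't documented here, so the sketch just prints the raw body):

import time
import urllib.request

# Poll Olla's internal status endpoints and print the raw responses.
ENDPOINTS = [
    "http://localhost:40114/internal/health",
    "http://localhost:40114/internal/status/endpoints",
]

while True:
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(f"{url} -> {resp.status}: {resp.read().decode()[:200]}")
        except Exception as exc:
            print(f"{url} -> FAILED ({exc})")
    time.sleep(30)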
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
LangChain¶
from langchain_community.llms import Ollama

llm = Ollama(
    base_url="http://localhost:40114/olla/ollama",
    model="llama3.2:latest"
)

response = llm.invoke("Tell me a joke")
Continue.dev¶
Configure Continue to use Olla with Ollama:
{
  "models": [{
    "title": "Ollama via Olla",
    "provider": "ollama",
    "model": "llama3.2:latest",
    "apiBase": "http://localhost:40114/olla/ollama"
  }]
}
Aider¶
# Use with Aider
aider --openai-api-base http://localhost:40114/olla/ollama/v1 \
--model llama3.2:latest
Next Steps¶
- Profile Configuration - Customise Ollama behaviour
- Model Unification - Understand model management
- Load Balancing - Configure multi-instance setups
- OpenWebUI Integration - Set up web interface