
Ollama Integration

Home github.com/ollama/ollama
Type ollama (use in endpoint configuration)
Profile ollama.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (native)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • GGUF Model Support
Unsupported
  • Model Management (pull/push/delete)
  • Instance Management
  • Model Creation (FROM commands)
Attributes
  • OpenAI Compatible
  • Dynamic Model Loading
  • Multi-Modal Support (LLaVA)
  • Quantisation Support
Prefixes ollama (requests route via /olla/ollama/)
Endpoints See below

Configuration

Basic Setup

Add Ollama to your Olla configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s

Multiple Ollama Instances

Configure multiple Ollama servers for load balancing:

discovery:
  static:
    endpoints:
      # Primary GPU server
      - url: "http://gpu-server:11434"
        name: "ollama-gpu"
        type: "ollama"
        priority: 100

      # Secondary server
      - url: "http://backup-server:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 75

      # Development machine
      - url: "http://dev-machine:11434"
        name: "ollama-dev"
        type: "ollama"
        priority: 50

Remote Ollama Configuration

For remote Ollama servers:

discovery:
  static:
    endpoints:
      - url: "https://ollama.example.com"
        name: "ollama-cloud"
        type: "ollama"
        priority: 80
        check_interval: 10s
        check_timeout: 5s

Authentication Not Supported

Olla does not currently support authentication headers for endpoints. If your Ollama server requires authentication, place a reverse proxy in front of it that injects the required credentials (a minimal sketch follows), or wait for this feature to be added.
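
As a stop-gap, the sketch below shows one way such a header-injecting proxy could look. It is purely illustrative: the upstream address, listen port and token are placeholders, responses are buffered (so streaming is not supported) and there is no error handling. Point Olla's endpoint URL at the proxy instead of the Ollama server.

import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://ollama.internal:11434"   # placeholder: the authenticated Ollama server
API_TOKEN = "replace-me"                    # placeholder: token the upstream expects

class AuthInjectingProxy(BaseHTTPRequestHandler):
    """Forwards requests to UPSTREAM, adding an Authorization header."""

    def _forward(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(UPSTREAM + self.path, data=body, method=self.command)
        if "Content-Type" in self.headers:
            req.add_header("Content-Type", self.headers["Content-Type"])
        req.add_header("Authorization", f"Bearer {API_TOKEN}")  # inject the credential
        with urllib.request.urlopen(req) as resp:               # sketch only: no error handling
            payload = resp.read()                               # buffers the whole response
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    do_GET = _forward
    do_POST = _forward

if __name__ == "__main__":
    # Configure Olla with url: "http://localhost:8081" to route through this proxy
    HTTPServer(("0.0.0.0", 8081), AuthInjectingProxy).serve_forever()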

Endpoints Supported

The following endpoints are supported by the Ollama integration profile:

Path Description
/ Health Check
/api/generate Text Completion (Ollama format)
/api/chat Chat Completion (Ollama format)
/api/embeddings Generate Embeddings
/api/tags List Local Models
/api/show Show Model Information
/v1/models List Models (OpenAI format)
/v1/chat/completions Chat Completions (OpenAI format)
/v1/completions Text Completions (OpenAI format)
/v1/embeddings Embeddings (OpenAI format)

Usage Examples

Chat Completion (Ollama Format)

curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the meaning of life?"}
    ],
    "stream": false
  }'

Text Generation (Ollama Format)

curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:latest",
    "prompt": "Once upon a time",
    "options": {
      "temperature": 0.8,
      "num_predict": 100
    }
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming"}
    ],
    "stream": true
  }'

OpenAI Compatibility

curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'

Embeddings

curl -X POST http://localhost:40114/olla/ollama/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text:latest",
    "prompt": "The quick brown fox jumps over the lazy dog"
  }'
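
To show how the returned vector can be consumed, here is a small Python sketch (not part of Olla itself) that requests two embeddings through Olla and compares them with cosine similarity. It assumes the nomic-embed-text model from the example above is already pulled.

import json
import math
import urllib.request

OLLA = "http://localhost:40114/olla/ollama"

def embed(text: str) -> list[float]:
    # Request an embedding for a single prompt via Olla's Ollama route
    payload = json.dumps({"model": "nomic-embed-text:latest", "prompt": text}).encode()
    req = urllib.request.Request(f"{OLLA}/api/embeddings", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(embed("The quick brown fox"), embed("A fast auburn fox")))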

List Available Models

# Ollama format
curl http://localhost:40114/olla/ollama/api/tags

# OpenAI format
curl http://localhost:40114/olla/ollama/v1/models

Model Information

curl -X POST http://localhost:40114/olla/ollama/api/show \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2:latest"}'

Ollama Specifics

Model Loading Behaviour

Ollama has unique model loading characteristics:

  • Dynamic Loading: Models load on first request
  • Memory Management: Unloads models after idle timeout
  • Loading Delay: First request to a model can be slow (a pre-warm sketch follows this list)
  • Concurrent Models: Limited by available memory
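
Because models load lazily, it can help to warm them up before real traffic arrives. The Python sketch below sends an empty prompt through Olla, which loads the model without generating text; the model name and keep_alive value are illustrative.

import json
import urllib.request

OLLA = "http://localhost:40114/olla/ollama"

def warm(model: str, keep_alive: str = "30m") -> None:
    payload = json.dumps({
        "model": model,
        "prompt": "",              # empty prompt: loads the model without generating
        "keep_alive": keep_alive,  # keep the model resident after loading
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{OLLA}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        json.load(resp)  # response confirms the model is loaded

warm("llama3.2:latest")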

Model Naming Convention

Ollama uses a specific naming format:

model:tag
model:version
namespace/model:tag

Examples:
  • llama3.2:latest
  • llama3.2:3b
  • mistral:7b-instruct-q4_0
  • library/codellama:13b
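
If you need to handle these names programmatically, here is a small illustrative Python helper (not part of Olla) that splits a reference into namespace, model and tag, defaulting the namespace to library and the tag to latest as Ollama does.

def parse_model_ref(ref: str) -> dict:
    # Split "namespace/model:tag" into its parts, applying Ollama's defaults
    namespace, _, rest = ref.rpartition("/")
    model, _, tag = rest.partition(":")
    return {
        "namespace": namespace or "library",  # Ollama's default namespace
        "model": model,
        "tag": tag or "latest",               # missing tag defaults to latest
    }

print(parse_model_ref("mistral:7b-instruct-q4_0"))
# {'namespace': 'library', 'model': 'mistral', 'tag': '7b-instruct-q4_0'}
print(parse_model_ref("library/codellama:13b"))
# {'namespace': 'library', 'model': 'codellama', 'tag': '13b'}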

Quantisation Levels

Ollama supports various quantisation levels:

Quantisation Memory Usage Performance Quality
Q4_0 ~50% Fast Good
Q4_1 ~55% Fast Better
Q5_0 ~60% Moderate Better
Q5_1 ~65% Moderate Better
Q8_0 ~85% Slower Best
F16 100% Slowest Highest

Options Parameters

Ollama-specific generation options:

{
  "options": {
    "temperature": 0.8,      // Randomness (0-1)
    "top_k": 40,            // Top K sampling
    "top_p": 0.9,           // Nucleus sampling
    "num_predict": 128,     // Max tokens to generate
    "stop": ["\\n", "User:"], // Stop sequences
    "seed": 42,             // Reproducible generation
    "num_ctx": 2048,        // Context window size
    "repeat_penalty": 1.1,  // Repetition penalty
    "mirostat": 2,          // Mirostat sampling
    "mirostat_tau": 5.0,    // Mirostat target entropy
    "mirostat_eta": 0.1     // Mirostat learning rate
  }
}
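
As a concrete illustration, the sketch below passes a subset of these options through Olla's /api/generate route in Python; the option values (including the fixed seed) are only examples.

import json
import urllib.request

payload = json.dumps({
    "model": "mistral:latest",
    "prompt": "Once upon a time",
    "stream": False,
    "options": {
        "temperature": 0.8,
        "num_predict": 100,
        "seed": 42,         # fixed seed for reproducible output
        "stop": ["\n\n"],
    },
}).encode()

req = urllib.request.Request(
    "http://localhost:40114/olla/ollama/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])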

Starting Ollama

Local Installation

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model
ollama pull llama3.2:latest

# Test directly
ollama run llama3.2:latest "Hello"

Docker Deployment

# CPU only
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# With GPU support
docker run -d \
  --gpus all \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

Docker Compose

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
    driver: local

Profile Customisation

To customise Ollama behaviour, create config/profiles/ollama-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: ollama
version: "1.0"

# Add custom routing prefixes
routing:
  prefixes:
    - ollama
    - ai        # Add custom prefix

# Adjust for slow model loading
characteristics:
  timeout: 10m   # Increase from 5m for large models

# Model capability detection
models:
  capability_patterns:
    vision:
      - "*llava*"
      - "*bakllava*"
      - "vision*"
    embeddings:
      - "*embed*"
      - "nomic-embed-text*"
      - "mxbai-embed*"
    code:
      - "*code*"
      - "codellama*"
      - "deepseek-coder*"
      - "qwen*coder*"

  # Context window detection
  context_patterns:
    - pattern: "*-32k*"
      context: 32768
    - pattern: "*-16k*"
      context: 16384
    - pattern: "llama3*"
      context: 8192

See Profile Configuration for complete customisation options.

Environment Variables

Ollama behaviour can be controlled via environment variables:

Variable Description Default
OLLAMA_HOST Bind address 127.0.0.1:11434
OLLAMA_MODELS Model storage path ~/.ollama/models
OLLAMA_KEEP_ALIVE Model idle timeout 5m
OLLAMA_MAX_LOADED_MODELS Max concurrent models Unlimited
OLLAMA_NUM_PARALLEL Parallel request handling 1
OLLAMA_MAX_QUEUE Max queued requests 512
OLLAMA_DEBUG Enable debug logging false

Multi-Modal Support

Vision Models (LLaVA)

Ollama supports vision models for image analysis:

# Pull a vision model
ollama pull llava:latest

# Use with image
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava:latest",
    "prompt": "What is in this image?",
    "images": ["base64_encoded_image_data"]
  }'
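
The images field expects base64-encoded image data. The Python sketch below fills in the placeholder from the curl example: it reads a local file (the path is a placeholder), encodes it, and sends it through Olla.

import base64
import json
import urllib.request

with open("photo.jpg", "rb") as f:                   # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = json.dumps({
    "model": "llava:latest",
    "prompt": "What is in this image?",
    "images": [image_b64],                           # Ollama expects base64 strings
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:40114/olla/ollama/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])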

Supported Vision Models

  • llava:latest - General vision model
  • llava:13b - Larger vision model
  • bakllava:latest - Alternative vision model

Troubleshooting

Model Not Found

Issue: "model not found" error

Solution: 1. Ensure model is pulled:

ollama list  # Check available models
ollama pull llama3.2:latest  # Pull if missing

  2. Verify model name format:
    # Correct
    "model": "llama3.2:latest"
    
    # Incorrect
    "model": "llama3.2"  # Missing tag
    

Slow First Request

Issue: First request to a model is very slow

Solution: 1. Pre-load models:

ollama run llama3.2:latest ""  # Load into memory

  2. Increase keep-alive:

    OLLAMA_KEEP_ALIVE=24h ollama serve
    

  3. Adjust timeout in Olla:

    characteristics:
      timeout: 10m  # Allow for model loading
    

Out of Memory

Issue: "out of memory" errors

Solution: 1. Limit concurrent models:

OLLAMA_MAX_LOADED_MODELS=1 ollama serve

  2. Use smaller quantisation:

    ollama pull llama3.2:3b-q4_0  # Smaller variant
    

  3. Configure memory limits:

    resources:
      model_sizes:
        - patterns: ["7b"]
          min_memory_gb: 8
          max_concurrent: 1
    

Connection Refused

Issue: Cannot connect to Ollama

Solution: 1. Check Ollama is running:

ps aux | grep ollama
systemctl status ollama  # If using systemd

  2. Verify bind address:

    OLLAMA_HOST=0.0.0.0:11434 ollama serve
    

  3. Check firewall:

    sudo ufw allow 11434  # Ubuntu/Debian
    

Best Practices

1. Use Model Unification

With multiple Ollama instances, enable unification:

model_registry:
  enable_unifier: true
  unification:
    enabled: true

This provides a single model catalogue across all instances.

2. Configure Appropriate Timeouts

Account for model loading times:

proxy:
  response_timeout: 600s  # 10 minutes for large models (default)
  connection_timeout: 30s  # Default connection timeout

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        check_timeout: 5s  # Allow time for health checks

3. Optimise for Your Hardware

For GPU Servers

endpoints:
  - url: "http://gpu-server:11434"
    name: "ollama-gpu"
    priority: 100  # Prefer GPU

resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 4  # GPU can handle multiple

For CPU Servers

endpoints:
  - url: "http://cpu-server:11434"
    name: "ollama-cpu"
    priority: 50  # Lower priority

resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 1  # CPU limited to one

4. Monitor Performance

Use Olla's status endpoints:

# Check health
curl http://localhost:40114/internal/health

# View endpoint status
curl http://localhost:40114/internal/status/endpoints

# Monitor model availability
curl http://localhost:40114/internal/status/models
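
If you prefer to poll these endpoints from code, a minimal Python loop such as the one below (paths taken from the curl commands above) can serve as a starting point for dashboards or alerting.

import urllib.request

BASE = "http://localhost:40114/internal"

# Walk the status endpoints listed above and print a trimmed view of each response
for path in ("/health", "/status/endpoints", "/status/models"):
    with urllib.request.urlopen(BASE + path) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        print(f"{path}: HTTP {resp.status}")
        print(body[:400])  # truncate for readability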

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

LangChain

from langchain_community.llms import Ollama

llm = Ollama(
    base_url="http://localhost:40114/olla/ollama",
    model="llama3.2:latest"
)

response = llm.invoke("Tell me a joke")

Continue.dev

Configure Continue to use Olla with Ollama:

{
  "models": [{
    "title": "Ollama via Olla",
    "provider": "ollama",
    "model": "llama3.2:latest",
    "apiBase": "http://localhost:40114/olla/ollama"
  }]
}

Aider

# Use with Aider
aider --openai-api-base http://localhost:40114/olla/ollama/v1 \
      --model llama3.2:latest

Next Steps