
Ollama Integration

Home github.com/ollama/ollama
Since Olla v0.0.1
Type ollama (use in endpoint configuration)
Profile ollama.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (native)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • GGUF Model Support
  • Native Anthropic Messages API (v0.14.0+)
Unsupported
  • Model Management (pull/push/delete)
  • Instance Management
  • Model Creation (FROM commands)
Attributes
  • OpenAI Compatible
  • Dynamic Model Loading
  • Multi-Modal Support (LLaVA)
  • Quantisation Support
Prefixes ollama (requests routed under /olla/ollama/)
Endpoints See below

Configuration

Basic Setup

Add Ollama to your Olla configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 2s
        check_timeout: 1s
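
With this in place, a quick check that Olla can reach the instance is to list its models through the proxy (using the proxy address from the examples later on this page):

# List the models Olla sees on the endpoint (proxied to Ollama's /api/tags)
curl http://localhost:40114/olla/ollama/api/tags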

Multiple Ollama Instances

Configure multiple Ollama servers for load balancing:

discovery:
  static:
    endpoints:
      # Primary GPU server
      - url: "http://gpu-server:11434"
        name: "ollama-gpu"
        type: "ollama"
        priority: 100

      # Secondary server
      - url: "http://backup-server:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 75

      # Development machine
      - url: "http://dev-machine:11434"
        name: "ollama-dev"
        type: "ollama"
        priority: 50
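
To confirm that all three instances are registered and healthy, query Olla's endpoint status API (the same path used in the monitoring examples later on this page):

# Shows each configured endpoint with its health state and priority
curl http://localhost:40114/internal/status/endpoints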

Remote Ollama Configuration

For remote Ollama servers:

discovery:
  static:
    endpoints:
      - url: "https://ollama.example.com"
        name: "ollama-cloud"
        type: "ollama"
        priority: 80
        check_interval: 10s
        check_timeout: 5s

Authentication Not Supported

Olla does not currently support authentication headers for endpoints. If your Ollama server requires authentication, place a reverse proxy in front of it that injects the required credentials, or wait for this feature to be added.

Anthropic Messages API Support

Ollama v0.14.0+ natively supports the Anthropic Messages API, enabling Olla to forward Anthropic-format requests directly without translation overhead (passthrough mode).

When Olla detects that an Ollama endpoint supports native Anthropic format (via the anthropic_support section in config/profiles/ollama.yaml), it will bypass the Anthropic-to-OpenAI translation pipeline and forward requests directly to /v1/messages on the backend.

Profile configuration (from config/profiles/ollama.yaml):

api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false
    min_version: "0.14.0"
    limitations:
      - token_counting_404

Key details:

  • Minimum Ollama version: v0.14.0
  • Token counting (/v1/messages/count_tokens): Not supported (returns 404)
  • Passthrough mode is automatic; no client-side configuration is needed
  • Responses include X-Olla-Mode: passthrough header when passthrough is active
  • Falls back to translation mode if passthrough conditions are not met
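
A quick way to exercise passthrough is to send an Anthropic-format request through the proxy and inspect the response headers. This sketch assumes the Messages route is exposed under the same /olla/ollama/ prefix used throughout this page; check your routing configuration if the path differs:

# -i prints response headers so you can confirm X-Olla-Mode: passthrough
curl -i -X POST http://localhost:40114/olla/ollama/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "llama3.2:latest",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'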

Ollama Anthropic Compatibility

For details on Ollama's Anthropic compatibility, see the Ollama Anthropic compatibility documentation.

For more information, see API Translation and Anthropic API Reference.

Endpoints Supported

The following endpoints are supported by the Ollama integration profile:

Path Description
/ Health Check
/api/generate Text Completion (Ollama format)
/api/chat Chat Completion (Ollama format)
/api/embeddings Generate Embeddings
/api/tags List Local Models
/api/show Show Model Information
/v1/models List Models (OpenAI format)
/v1/chat/completions Chat Completions (OpenAI format)
/v1/completions Text Completions (OpenAI format)
/v1/embeddings Embeddings (OpenAI format)

Usage Examples

Chat Completion (Ollama Format)

curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the meaning of life?"}
    ],
    "stream": false
  }'

Text Generation (Ollama Format)

curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:latest",
    "prompt": "Once upon a time",
    "options": {
      "temperature": 0.8,
      "num_predict": 100
    }
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/ollama/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming"}
    ],
    "stream": true
  }'

OpenAI Compatibility

curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'

Embeddings

curl -X POST http://localhost:40114/olla/ollama/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text:latest",
    "prompt": "The quick brown fox jumps over the lazy dog"
  }'

List Available Models

# Ollama format
curl http://localhost:40114/olla/ollama/api/tags

# OpenAI format
curl http://localhost:40114/olla/ollama/v1/models

Model Information

curl -X POST http://localhost:40114/olla/ollama/api/show \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2:latest"}'

Ollama Specifics

Model Loading Behaviour

Ollama has unique model loading characteristics:

  • Dynamic Loading: Models load on first request
  • Memory Management: Unloads models after idle timeout
  • Loading Delay: First request to a model can be slow (see the warm-up sketch after this list)
  • Concurrent Models: Limited by available memory
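
To avoid paying the load cost on a user-facing request, you can warm a model through Olla ahead of time. A minimal sketch using Ollama's keep_alive request field; a generate request with no prompt loads the model without producing any text:

# Load the model into memory and keep it resident for 30 minutes
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:latest",
    "keep_alive": "30m"
  }'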

Model Naming Convention

Ollama uses a specific naming format:

model:tag
model:version
namespace/model:tag

Examples:
  • llama3.2:latest
  • llama3.2:3b
  • mistral:7b-instruct-q4_0
  • library/codellama:13b

Quantisation Levels

Ollama supports various quantisation levels:

Quantisation Memory Usage (vs F16) Performance Quality
Q4_0 ~50% Fast Good
Q4_1 ~55% Fast Better
Q5_0 ~60% Moderate Better
Q5_1 ~65% Moderate Better
Q8_0 ~85% Slower Best
F16 100% Slowest Highest

Options Parameters

Ollama-specific generation options:

{
  "options": {
    "temperature": 0.8,      // Randomness (0-1)
    "top_k": 40,            // Top K sampling
    "top_p": 0.9,           // Nucleus sampling
    "num_predict": 128,     // Max tokens to generate
    "stop": ["\\n", "User:"], // Stop sequences
    "seed": 42,             // Reproducible generation
    "num_ctx": 2048,        // Context window size
    "repeat_penalty": 1.1,  // Repetition penalty
    "mirostat": 2,          // Mirostat sampling
    "mirostat_tau": 5.0,    // Mirostat target entropy
    "mirostat_eta": 0.1     // Mirostat learning rate
  }
}

Starting Ollama

Local Installation

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model
ollama pull llama3.2:latest

# Test directly
ollama run llama3.2:latest "Hello"

Docker Deployment

# CPU only
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# With GPU support
docker run -d \
  --gpus all \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

Docker Compose

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
    driver: local

Profile Customisation

To customise Ollama behaviour, create config/profiles/ollama-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: ollama
version: "1.0"

# Add custom routing prefixes
routing:
  prefixes:
    - ollama
    - ai        # Add custom prefix

# Adjust for slow model loading
characteristics:
  timeout: 10m   # Increase from 5m for large models

# Model capability detection
models:
  capability_patterns:
    vision:
      - "*llava*"
      - "*bakllava*"
      - "vision*"
    embeddings:
      - "*embed*"
      - "nomic-embed-text*"
      - "mxbai-embed*"
    code:
      - "*code*"
      - "codellama*"
      - "deepseek-coder*"
      - "qwen*coder*"

  # Context window detection
  context_patterns:
    - pattern: "*-32k*"
      context: 32768
    - pattern: "*-16k*"
      context: 16384
    - pattern: "llama3*"
      context: 8192

See Profile Configuration for complete customisation options.

Environment Variables

Ollama behaviour can be controlled via environment variables:

Variable Description Default
OLLAMA_HOST Bind address 127.0.0.1:11434
OLLAMA_MODELS Model storage path ~/.ollama/models
OLLAMA_KEEP_ALIVE Model idle timeout 5m
OLLAMA_MAX_LOADED_MODELS Max concurrent models Unlimited
OLLAMA_NUM_PARALLEL Parallel request handling 1
OLLAMA_MAX_QUEUE Max queued requests 512
OLLAMA_DEBUG Enable debug logging false
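
For example, to expose Ollama on all interfaces with a longer idle timeout and at most two resident models (values drawn from the table above):

# Bind to all interfaces, keep models warm for 30 minutes, cap loaded models at 2
OLLAMA_HOST=0.0.0.0:11434 \
OLLAMA_KEEP_ALIVE=30m \
OLLAMA_MAX_LOADED_MODELS=2 \
ollama serve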

Multi-Modal Support

Vision Models (LLaVA)

Ollama supports vision models for image analysis:

# Pull a vision model
ollama pull llava:latest

# Use with image
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava:latest",
    "prompt": "What is in this image?",
    "images": ["base64_encoded_image_data"]
  }'
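
The images field takes base64-encoded image bytes. A small sketch for building the payload from a local file (base64 -w0 is GNU coreutils; on macOS use base64 -i photo.jpg):

# Encode an image and send it to a vision model through Olla
IMG=$(base64 -w0 photo.jpg)
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"llava:latest\",
    \"prompt\": \"Describe this image\",
    \"images\": [\"$IMG\"]
  }"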

Supported Vision Models

  • llava:latest - General vision model
  • llava:13b - Larger vision model
  • bakllava:latest - Alternative vision model

Troubleshooting

Model Not Found

Issue: "model not found" error

Solution:

  1. Ensure model is pulled:

ollama list  # Check available models
ollama pull llama3.2:latest  # Pull if missing

  2. Verify model name format:
    # Correct
    "model": "llama3.2:latest"
    
    # Incorrect
    "model": "llama3.2"  # Missing tag
    

Slow First Request

Issue: First request to a model is very slow

Solution:

  1. Pre-load models:

ollama run llama3.2:latest ""  # Load into memory

  2. Increase keep-alive:

    OLLAMA_KEEP_ALIVE=24h ollama serve
    

  3. Adjust timeout in Olla:

    characteristics:
      timeout: 10m  # Allow for model loading
    

Out of Memory

Issue: "out of memory" errors

Solution:

  1. Limit concurrent models:

OLLAMA_MAX_LOADED_MODELS=1 ollama serve

  2. Use smaller quantisation:

    ollama pull llama3.2:3b-q4_0  # Smaller variant
    

  3. Configure memory limits:

    resources:
      model_sizes:
        - patterns: ["7b"]
          min_memory_gb: 8
          max_concurrent: 1
    

Connection Refused

Issue: Cannot connect to Ollama

Solution:

  1. Check Ollama is running:

ps aux | grep ollama
systemctl status ollama  # If using systemd

  2. Verify bind address:

    OLLAMA_HOST=0.0.0.0:11434 ollama serve
    

  3. Check firewall:

    sudo ufw allow 11434  # Ubuntu/Debian
    

Best Practices

1. Use Model Unification

With multiple Ollama instances, enable unification:

model_registry:
  enable_unifier: true
  unification:
    enabled: true

This provides a single model catalogue across all instances.
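
You can then browse the unified catalogue via Olla's status API (the same endpoint used in the monitoring examples below):

# Unified view of models across all configured Ollama instances
curl http://localhost:40114/internal/status/models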

2. Configure Appropriate Timeouts

Account for model loading times:

proxy:
  response_timeout: 600s  # 10 minutes for large models (default)
  connection_timeout: 30s  # Default connection timeout

discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        check_timeout: 5s  # Allow time for health checks

3. Optimise for Your Hardware

For GPU Servers

endpoints:
  - url: "http://gpu-server:11434"
    name: "ollama-gpu"
    priority: 100  # Prefer GPU

resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 4  # GPU can handle multiple

For CPU Servers

endpoints:
  - url: "http://cpu-server:11434"
    name: "ollama-cpu"
    priority: 50  # Lower priority

resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 1  # CPU limited to one

4. Monitor Performance

Use Olla's status endpoints:

# Check health
curl http://localhost:40114/internal/health

# View endpoint status
curl http://localhost:40114/internal/status/endpoints

# Monitor model availability
curl http://localhost:40114/internal/status/models

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/ollama/v1",
    api_key="not-needed"  # Ollama doesn't require API keys
)

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
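
Streaming works through the same client; this is standard OpenAI SDK usage, nothing Olla-specific:

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": "Write a haiku about proxies"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)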

LangChain

from langchain_community.llms import Ollama

llm = Ollama(
    base_url="http://localhost:40114/olla/ollama",
    model="llama3.2:latest"
)

response = llm.invoke("Tell me a joke")
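
The langchain_community.llms.Ollama class has been superseded by the langchain-ollama package. If you use the newer package, the equivalent chat model (assuming the same base_url routing through Olla) looks like:

from langchain_ollama import ChatOllama

llm = ChatOllama(
    base_url="http://localhost:40114/olla/ollama",
    model="llama3.2:latest"
)

response = llm.invoke("Tell me a joke")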

Continue.dev

Configure Continue to use Olla with Ollama:

{
  "models": [{
    "title": "Ollama via Olla",
    "provider": "ollama",
    "model": "llama3.2:latest",
    "apiBase": "http://localhost:40114/olla/ollama"
  }]
}

Aider

# Use with Aider
aider --openai-api-base http://localhost:40114/olla/ollama/v1 \
      --model llama3.2:latest

Next Steps