Model Unification - Single Catalogue per Provider Type¶
Supported Settings (Default Configuration):
- enabled (default: true) - Enable automatic model discovery
- interval (default: 5m) - How often to refresh model lists
- concurrent_workers (default: 3) - Parallel discovery workers
Environment Variables:
- OLLA_DISCOVERY_MODEL_DISCOVERY_ENABLED
- OLLA_DISCOVERY_MODEL_DISCOVERY_INTERVAL
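These settings follow the OLLA_DISCOVERY_MODEL_DISCOVERY_* environment-variable naming, which suggests they live under the discovery subsystem. A minimal YAML sketch of that placement (the exact nesting is an assumption; check the configuration reference for your Olla version):
# Hypothetical placement inferred from the environment-variable names
discovery:
  model_discovery:
    enabled: true            # enable automatic model discovery
    interval: 5m             # refresh model lists every 5 minutes
    concurrent_workers: 3    # parallel discovery workers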
Model unification creates a consolidated view of the models available across multiple endpoints of the same type. When you run multiple Ollama instances or multiple LM Studio servers, Olla deduplicates and merges their model lists to show you what's available, and where, for each provider type.
Key Concept: Per-Provider Unification¶
Important: Model unification happens within each provider type, not across different providers:
- Multiple Ollama instances → Unified Ollama model catalogue
- Multiple LM Studio instances → Unified LM Studio model catalogue
- Multiple vLLM servers → Unified vLLM model catalogue
Models are NOT unified across different provider types. This means llama3.2 on Ollama remains separate from meta/llama3.2 or llama3.2 on LM Studio, as they may have different formats, quantizations, or capabilities.
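You can see this separation directly: each provider's model listing endpoint (covered under Monitoring below) returns only that provider type's unified catalogue:
# Each provider type keeps its own catalogue
curl http://localhost:40114/olla/ollama/v1/models     # Ollama models only
curl http://localhost:40114/olla/lmstudio/v1/models   # LM Studio models only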
Why Per-Provider Unification?¶
Different providers handle models differently:
- Format differences: Ollama uses its own format, LM Studio uses GGUF, vLLM uses HuggingFace format
- API differences: Each provider has unique API endpoints and capabilities
- Metadata differences: Model information varies significantly between providers
- Performance characteristics: Same model may perform differently on different platforms
How It Works¶
1. Model Discovery¶
Each endpoint of the same type reports its models:
# Two Ollama instances
ollama-server-1:
- llama3.2:latest
- mistral:7b
- codellama:13b
ollama-server-2:
- llama3.2:latest # Duplicate
- mixtral:8x7b
- phi3:mini
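Discovery polls each endpoint's native model listing API; for Ollama that is /api/tags. You can run the same queries yourself to see what each instance reports (hostnames match the example above):
# What discovery sees on each instance
curl http://ollama-server-1:11434/api/tags
curl http://ollama-server-2:11434/api/tags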
2. Deduplication¶
Models with the same digest or name are identified as duplicates and merged:
# After deduplication
unified-ollama-models:
- llama3.2:latest (available on: ollama-server-1, ollama-server-2)
- mistral:7b (available on: ollama-server-1)
- codellama:13b (available on: ollama-server-1)
- mixtral:8x7b (available on: ollama-server-2)
- phi3:mini (available on: ollama-server-2)
3. Unified Access¶
Request any model, and Olla routes to an available endpoint:
# Request llama3.2 via Ollama endpoints - Olla picks from server-1 or server-2
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2:latest", "messages": [{"role": "user", "content": "Hello"}]}'
# Or use the native Ollama API
curl -X POST http://localhost:40114/olla/ollama/api/chat \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2:latest", "messages": [{"role": "user", "content": "Hello"}]}'
Configuration¶
Enable model unification in your configuration:
model_registry:
type: "memory"
enable_unifier: true
unification:
enabled: true
stale_threshold: 24h # Remove models not seen for 24 hours
cleanup_interval: 10m # Check for stale models every 10 minutes
Example: Multiple Ollama Instances¶
Configuration¶
discovery:
static:
endpoints:
# Production Ollama cluster
- url: "http://ollama-prod-1:11434"
name: "prod-1"
type: "ollama"
priority: 100
- url: "http://ollama-prod-2:11434"
name: "prod-2"
type: "ollama"
priority: 100
- url: "http://ollama-prod-3:11434"
name: "prod-3"
type: "ollama"
priority: 100
Result¶
With unification, you get a single model list:
{
"models": [
{
"id": "llama3.2:latest",
"available_on": ["prod-1", "prod-2", "prod-3"],
"load_balanced": true
},
{
"id": "mistral:7b",
"available_on": ["prod-1", "prod-3"],
"load_balanced": true
},
{
"id": "specialized-model:latest",
"available_on": ["prod-2"],
"load_balanced": false
}
]
}
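You can inspect the unified catalogue at any time; field names vary between the status and proxy endpoints (see Monitoring below for the exact response formats):
# Inspect the unified catalogue
curl http://localhost:40114/internal/status/models
curl http://localhost:40114/olla/models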
Mixed Provider Types¶
When you have different provider types, each maintains its own unified catalogue:
discovery:
static:
endpoints:
# Ollama instances (unified together)
- url: "http://ollama-1:11434"
type: "ollama"
- url: "http://ollama-2:11434"
type: "ollama"
# LM Studio instances (unified together, separate from Ollama)
- url: "http://lmstudio-1:1234"
type: "lm-studio"
- url: "http://lmstudio-2:1234"
type: "lm-studio"
Result: Two separate unified catalogues:
- Unified Ollama models from ollama-1 and ollama-2
- Unified LM Studio models from lmstudio-1 and lmstudio-2
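The aggregate /olla/models endpoint still returns a single response covering every provider, but each entry carries its provider field (see the unified format under Monitoring), so the catalogues never merge. A quick check with jq:
# List each model with the provider catalogue it belongs to
curl -s http://localhost:40114/olla/models | jq '.models[] | {id, provider}'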
Deduplication Strategies¶
Digest Matching (Most Reliable)¶
For providers that expose model digests (like Ollama):
# Same model file = same digest = unified
server-1: llama3.2 (sha256:abc123...)
server-2: llama3.2 (sha256:abc123...)
Result: Single model entry with 2 endpoints
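To confirm that two instances really hold the same model file, compare the digests each one reports via Ollama's /api/tags (jq used for readability):
# Identical digests across instances unify into a single entry
curl -s http://ollama-server-1:11434/api/tags | jq '.models[] | {name, digest}'
curl -s http://ollama-server-2:11434/api/tags | jq '.models[] | {name, digest}'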
Name Matching (Fallback)¶
When digests aren't available:
# Same name = potentially same model
server-1: mistral-7b-instruct
server-2: mistral-7b-instruct
Result: Unified if other parameters match
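The decision logic amounts to: key each model by its digest when one is present, otherwise fall back to its name. An illustrative sketch in Go (not Olla's actual implementation):
package main

import "fmt"

// Model as reported by a single endpoint.
type Model struct {
	Name     string
	Digest   string // empty if the provider doesn't expose digests
	Endpoint string
}

// unify merges per-endpoint model lists into one catalogue entry per
// digest (preferred) or name (fallback), tracking where each model lives.
func unify(models []Model) map[string][]string {
	catalogue := make(map[string][]string) // dedup key -> endpoints
	for _, m := range models {
		key := m.Digest
		if key == "" {
			key = m.Name // name matching fallback
		}
		catalogue[key] = append(catalogue[key], m.Endpoint)
	}
	return catalogue
}

func main() {
	reported := []Model{
		{"llama3.2:latest", "sha256:abc123", "server-1"},
		{"llama3.2:latest", "sha256:abc123", "server-2"}, // same digest: unified
		{"mistral-7b-instruct", "", "server-1"},
		{"mistral-7b-instruct", "", "server-2"}, // no digest, same name: unified
	}
	for key, endpoints := range unify(reported) {
		fmt.Println(key, "->", endpoints)
	}
}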
Load Balancing Unified Models¶
When a model is available on multiple endpoints of the same type, the configured strategy decides where each request goes (a config sketch follows this list):
- Priority-based: Route to highest priority endpoint with the model
- Round-robin: Distribute requests across all endpoints with the model
- Least-connections: Route to least busy endpoint with the model
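A sketch of selecting a strategy in configuration; the strategy names come from the list above, but the exact setting name is an assumption, so check the configuration reference for your Olla version:
# Choose how requests are balanced across endpoints serving the same model
proxy:
  load_balancer: "priority"   # or "round-robin", "least-connections"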
Monitoring¶
Model Listing Endpoints¶
Olla provides multiple ways to retrieve model information:
Unified Models Endpoint¶
The /olla/models endpoint returns all models across all providers, with optional response formats:
# Default unified format - comprehensive model information
curl http://localhost:40114/olla/models
# OpenAI-compatible format
curl http://localhost:40114/olla/models?format=openai
# Ollama native format
curl http://localhost:40114/olla/models?format=ollama
# LM Studio format
curl http://localhost:40114/olla/models?format=lmstudio
# vLLM format
curl http://localhost:40114/olla/models?format=vllm
Provider-Specific Endpoints¶
Each provider has its own model listing endpoints:
# Ollama models
curl http://localhost:40114/olla/ollama/api/tags # Native Ollama format
curl http://localhost:40114/olla/ollama/v1/models # OpenAI-compatible format
# LM Studio models
curl http://localhost:40114/olla/lmstudio/v1/models # OpenAI format
curl http://localhost:40114/olla/lmstudio/api/v0/models # Enhanced LM Studio format
# OpenAI models
curl http://localhost:40114/olla/openai/v1/models # Standard OpenAI format
# vLLM models
curl http://localhost:40114/olla/vllm/v1/models # OpenAI-compatible format
Internal Status Endpoints¶
For monitoring and debugging:
# View all models and their endpoints
curl http://localhost:40114/internal/status/models
# View detailed model information
curl http://localhost:40114/internal/status/models?detailed=true
Response Format Examples¶
Unified Format (default):
{
"models": [
{
"id": "llama3.2:latest",
"name": "llama3.2",
"provider": "ollama",
"endpoints": ["ollama-1", "ollama-2"],
"capabilities": {
"chat": true,
"completion": true,
"embeddings": false,
"vision": false
},
"context_length": 8192,
"created": "2024-01-15T10:00:00Z"
}
]
}
OpenAI Format (?format=openai):
{
"object": "list",
"data": [
{
"id": "llama3.2:latest",
"object": "model",
"created": 1705316400,
"owned_by": "olla"
}
]
}
Ollama Format (?format=ollama):
{
"models": [
{
"name": "llama3.2:latest",
"model": "llama3.2:latest",
"modified_at": "2024-01-15T10:00:00Z",
"size": 4000000000,
"digest": "sha256:abc123...",
"details": {
"family": "llama",
"parameter_size": "7B",
"quantization_level": "Q4_K_M"
}
}
]
}
Benefits¶
Resource Efficiency¶
- Deduplication: No redundant model entries in your catalogue
- Smart routing: Requests go to available endpoints automatically
- Failover: If one endpoint fails, requests route to others with the same model
Operational Simplicity¶
- Single catalogue: One model list per provider type instead of many
- Transparent access: Users don't need to know which endpoint has which model
- Dynamic updates: Model list updates as endpoints come and go
Scalability¶
- Horizontal scaling: Add more endpoints without changing client configuration
- Model distribution: Spread different models across endpoints
- Load distribution: Balance requests across endpoints with the same model
Common Patterns¶
High Availability¶
Deploy the same models on multiple endpoints:
# All servers have core models for redundancy
ollama-1: [llama3.2, mistral, codellama]
ollama-2: [llama3.2, mistral, codellama]
ollama-3: [llama3.2, mistral, codellama]
Specialised Endpoints¶
Different endpoints serve different models:
# Specialised model distribution
ollama-gpu-1: [llama3.2:70b, mixtral:8x7b] # Large models
ollama-gpu-2: [codellama, starcoder] # Code models
ollama-cpu-1: [phi3:mini, tinyllama] # Small models
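A matching discovery configuration, using only the fields shown earlier (URLs, names, and priorities are illustrative):
discovery:
  static:
    endpoints:
      - url: "http://ollama-gpu-1:11434"
        name: "gpu-1"
        type: "ollama"
        priority: 100   # large models, preferred when they have the model
      - url: "http://ollama-gpu-2:11434"
        name: "gpu-2"
        type: "ollama"
        priority: 100
      - url: "http://ollama-cpu-1:11434"
        name: "cpu-1"
        type: "ollama"
        priority: 50    # small models on CPU, lower priority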
Development vs Production¶
Separate model sets by environment:
# Development has experimental models
ollama-dev: [llama3.2, experimental-model, test-model]
# Production has stable models only
ollama-prod: [llama3.2, mistral, codellama]
Limitations¶
- No cross-provider unification: Ollama models stay separate from LM Studio models
- Name conflicts: Models with the same name but different underlying files may be incorrectly unified
- Metadata sync: Model metadata updates may take time to propagate
Next Steps¶
- Configure Load Balancing for unified models
- Set up Health Checking for endpoint monitoring
- Review Configuration Examples for common setups