Profile Configuration

📝 Default Configuration

# Built-in profiles (auto-loaded)
# Ollama: config/profiles/ollama.yaml
# LM Studio: config/profiles/lmstudio.yaml
# vLLM: config/profiles/vllm.yaml
# OpenAI: config/profiles/openai.yaml

endpoints:
  - type: "ollama"      # Uses ollama profile
  - type: "lm-studio"   # Uses lmstudio profile
  - type: "vllm"        # Uses vllm profile

Key Features:

  • Profiles are auto-loaded from config/profiles/
  • Custom profiles override built-in ones by name
  • Selected via endpoint type field

Custom Profiles: Place YAML files in config/profiles/ to add new profiles or override built-in ones

Profiles are the core of Olla's backend integration system. They define how Olla communicates with different LLM platforms, what APIs are exposed, how requests are routed and how responses are parsed.

Overview

Each backend type (Ollama, LM Studio, vLLM, OpenAI-compatible) has a profile that controls:

  • URL Routing - Which URL prefixes map to this backend
  • API Filtering - Which API paths are allowed through the proxy
  • Model Discovery - How to find and parse available models
  • Request Handling - How to parse and route requests
  • Native Intercepts - Special handlers for platform-specific endpoints
  • Resource Management - Memory requirements and concurrency limits
  • Capability Detection - Identifying model features (vision, embeddings, code)

Profile Loading

Profiles are loaded during Olla startup in a specific order:

  1. Built-in profiles ship with Olla in the config/profiles directory
  2. Custom profiles found in config/profiles/*.yaml override or extend the built-ins
  3. Profile selection happens based on the type field in the endpoint configuration, as shown below
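
For example, an endpoint entry like the following selects the built-in ollama profile via its type field (the url and name values here are purely illustrative):

endpoints:
  - url: "http://localhost:11434"   # Illustrative backend address
    name: "local-ollama"            # Illustrative endpoint name
    type: "ollama"                  # Selects the ollama profile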

Profile Overrides

A custom profile with the same name as a built-in will completely replace the built-in profile.

For example, to extend vLLM you can copy vllm.yaml to vllm-custom.yaml and modify the copy; because it keeps name: vllm, it overrides the built-in vllm profile.

Core Concepts

Basic Meta

Basic metadata about an LLM backend is provided in these fields.

For example, the ollama.yaml file contains:

name: ollama
version: "1.0"
display_name: "Ollama"
description: "Local Ollama instance for running GGUF models"
  • name - Unique name of this profile; a later profile with the same name overrides it
  • version - Version of the profile format (currently 1.0)
  • display_name - Human-readable name for the profile
  • description - A short description shown in the interface

Routing Prefixes

The routing.prefixes section defines URL paths that route to this backend.

routing:
  prefixes:
    - ollama      # Routes /olla/ollama/* to this backend
    - ma          # Routes /olla/ma/* to this backend

How it works:

  • Each prefix creates a URL namespace under /olla/
  • Requests to /olla/{prefix}/* are routed to endpoints with matching profile
  • Multiple prefixes allow flexibility (e.g., lmstudio, lm-studio, lm_studio)
  • The prefix is stripped from requests before forwarding to the backend (e.g. /olla/ma/v1/chat => /v1/chat sent to the backend)
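
For example, assuming an Ollama endpoint is configured, a chat request through the proxy looks like this (model name illustrative); Olla strips /olla/ollama before forwarding, so the backend receives /api/chat:

curl http://localhost:40114/olla/ollama/api/chat \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'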

For convenience, some profiles (like lmstudio.yaml) specify multiple variations of the name:

routing:
  prefixes:
    - lmstudio
    - lm-studio
    - lm_studio

You can observe the various prefix routes being registered at Olla's startup for the above:

ROUTE                             | METHOD | DESCRIPTION
/olla/lmstudio/v1/models          | GET    | lmstudio models (OpenAI format)
/olla/lmstudio/api/v1/models      | GET    | lmstudio models (OpenAI format alt path)
/olla/lmstudio/api/v0/models      | GET    | lmstudio enhanced models
/olla/lmstudio/                   |        | lmstudio proxy
/olla/lm-studio/v1/models         | GET    | lm-studio models (OpenAI format)
/olla/lm-studio/api/v1/models     | GET    | lm-studio models (OpenAI format alt path)
/olla/lm-studio/api/v0/models     | GET    | lm-studio enhanced models
/olla/lm-studio/                  |        | lm-studio proxy
/olla/lm_studio/v1/models         | GET    | lm_studio models (OpenAI format)
/olla/lm_studio/api/v1/models     | GET    | lm_studio models (OpenAI format alt path)
/olla/lm_studio/api/v0/models     | GET    | lm_studio enhanced models
/olla/lm_studio/                  |        | lm_studio proxy
...
To reduce the number of registered routes, comment out the prefixes you don't wish to use:

routing:
  prefixes:
    - lmstudio
    #- lm-studio
    #- lm_studio

API Path Filtering

The api.paths section acts as an allowlist - only these paths can be proxied to the backend.

api:
  paths:
    - /                    # 0: Health check
    - /api/generate        # 1: Text completion
    - /api/chat            # 2: Chat completion
    - /v1/models           # 3: OpenAI models list
    - /v1/chat/completions # 4: OpenAI chat

How it works:

  • Only paths in this list are forwarded to the backend
  • Unlisted paths return 404 or 501 (Not Implemented)
  • Paths are exact matches (no wildcards)
  • Order matters for path_indices references

Security benefit:

  • Prevents access to administrative or dangerous endpoints
  • Limits attack surface to known-safe operations
  • Blocks model management operations (pull/push/delete)
  • Blocks unsupported or unknown endpoints to general users
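
As a rough illustration of the allowlist in action (model name illustrative, response codes per the behaviour described above):

# Allowed: /api/chat is in the paths list, so the request is proxied
curl http://localhost:40114/olla/ollama/api/chat \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hi"}]}'

# Blocked: /api/delete is not in the paths list, so Olla rejects it (404 or 501)
curl -i -X DELETE http://localhost:40114/olla/ollama/api/delete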

Native Intercepts

Some profiles have native handlers that intercept specific endpoints instead of proxying them.

Built-in intercepts:

PROFILE  | ENDPOINT   | PURPOSE
ollama   | /api/tags  | Aggregates models across instances
ollama   | /v1/models | Converts to OpenAI format
lmstudio | /v1/models | Handles LM Studio format
all      | /api/pull  | Blocks model management

Generally, any time a backend returns models, Olla intercepts the call and returns a unified representation of the models across all instances of that backend type.

How it works:

  1. Request arrives at /olla/ollama/api/tags
  2. Olla checks if a native handler exists
  3. If yes, handler aggregates data from all healthy endpoints
  4. If no, request is proxied to a single endpoint

Why intercepts exist:

  • Model aggregation - Combine model lists from multiple instances
  • Format conversion - Transform between Ollama/OpenAI formats
  • Safety - Block dangerous operations (model deletion)
  • Optimisation - Cache responses across instances
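
For example, querying the intercepted tags endpoint returns a single merged model list rather than the response of one backend:

# Answered by Olla's native handler, aggregating all healthy Ollama endpoints
curl http://localhost:40114/olla/ollama/api/tags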

Model Discovery

The api.model_discovery_path defines how Olla finds available models.

api:
  model_discovery_path: /api/tags  # Ollama
  # or
  model_discovery_path: /v1/models # OpenAI-compatible

Discovery process:

  1. Health check verifies endpoint is online
  2. GET request to {endpoint_url}{model_discovery_path}
  3. Response parsed according to request.response_format
  4. Models stored in registry with endpoint attribution
  5. Unified view created across same-type endpoints
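
You can reproduce step 2 manually to see what Olla will receive during discovery (URLs are examples for a local Ollama and an OpenAI-compatible server):

# Ollama: model_discovery_path is /api/tags
curl http://localhost:11434/api/tags

# OpenAI-compatible (vLLM, LM Studio, etc.): model_discovery_path is /v1/models
curl http://localhost:8000/v1/models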

Response Parsing

The request.response_format determines how model responses are parsed.

request:
  response_format: "ollama"  # or "openai", "lmstudio", "vllm"

Format mappings:

FORMAT   | MODELS FIELD PATH | PARSER
ollama   | models            | Ollama JSON structure
openai   | data              | OpenAI models array
lmstudio | data              | OpenAI-compatible
vllm     | data              | OpenAI-compatible
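
The practical difference is which top-level field holds the model list. Abbreviated sketches (model names illustrative, most fields trimmed):

# response_format: "ollama" - models listed under "models"
{"models": [{"name": "llama3:8b"}]}

# response_format: "openai" / "lmstudio" / "vllm" - models listed under "data"
{"object": "list", "data": [{"id": "meta-llama/Llama-3-8B", "object": "model"}]}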

Capability Detection

The models.capability_patterns section uses glob patterns to detect model features.

models:
  capability_patterns:
    vision:
      - "*llava*"        # Matches llava, llava-13b, etc.
      - "*vision*"       # Matches gpt-4-vision
    embeddings:
      - "*embed*"        # Matches embed, embedding, text-embed
      - "nomic-embed-text" # Exact match
    code:
      - "*code*"         # Matches codellama, deepseek-coder
      - "qwen*coder*"    # Matches qwen-coder, qwencoder-7b

Pattern matching:

  • Uses Go filepath.Match glob patterns
  • * matches any characters
  • Case-sensitive matching
  • First matching pattern wins
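
A few worked examples of how names resolve against the patterns above (model names illustrative):

# llava-13b        -> vision      (matches "*llava*")
# nomic-embed-text -> embeddings  (exact match)
# deepseek-coder   -> code        (matches "*code*")
# LLaVA-13b        -> no match    (matching is case-sensitive)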

Context Window Detection

The models.context_patterns maps model names to context sizes.

models:
  context_patterns:
    - pattern: "*llama-3.1*"
      context: 131072    # 128K context
    - pattern: "*-32k*"
      context: 32768     # 32K context
    - pattern: "*-16k*"
      context: 16384     # 16K context
    - pattern: "llama3*"
      context: 8192      # Default 8K

Matching order:

  • Patterns evaluated top to bottom
  • First match sets context size
  • No match uses platform default
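
Worked examples against the patterns above, evaluated top to bottom (model names illustrative):

# llama-3.1-8b   -> 131072 (matches "*llama-3.1*")
# mistral-7b-32k -> 32768  (matches "*-32k*")
# llama3-8b      -> 8192   (matches "llama3*")
# phi-3-mini     -> platform default (no pattern matches)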

Complete Profile Structure

# Profile identification
name: ollama              # Unique identifier (required)
version: "1.0"            # Profile version
display_name: "Ollama"    # Human-readable name
description: "Local Ollama instance"

# URL routing
routing:
  prefixes:               # URL prefixes for this backend
    - ollama
    - ai

# API configuration
api:
  openai_compatible: true        # Supports OpenAI API format
  paths:                          # Allowed API paths (allowlist)
    - /
    - /api/generate
    - /api/chat
    - /api/tags
    - /v1/models
    - /v1/chat/completions
  model_discovery_path: /api/tags # Where to find models
  health_check_path: /            # Health check endpoint
  metrics_path: /metrics          # Prometheus metrics (vLLM)

# Platform characteristics
characteristics:
  timeout: 5m                     # Request timeout
  max_concurrent_requests: 10    # Concurrent request limit
  default_priority: 100           # Default endpoint priority
  streaming_support: true         # Supports streaming responses

# Request/response handling
request:
  model_field_paths:              # Where to find model name in request
    - "model"
    - "model_name"
  response_format: "ollama"       # Response parser type
  parsing_rules:
    chat_completions_path: "/api/chat"
    completions_path: "/api/generate"
    model_field_name: "model"
    supports_streaming: true

# Model handling
models:
  name_format: "{{.Name}}"        # Model name template
  capability_patterns:            # Feature detection patterns
    vision:
      - "*llava*"
    embeddings:
      - "*embed*"
    code:
      - "*code*"
  context_patterns:               # Context size detection
    - pattern: "*-32k*"
      context: 32768

# Auto-detection hints
detection:
  user_agent_patterns:            # User-Agent headers
    - "ollama/"
  headers:                        # Response headers
    - "X-Ollama-Version"
  path_indicators:                # Unique API paths
    - "/api/tags"
  default_ports:                  # Common ports
    - 11434

# Resource management
resources:
  model_sizes:                    # Memory requirements by model size
    - patterns: ["70b", "72b"]
      min_memory_gb: 40
      recommended_memory_gb: 48
      min_gpu_memory_gb: 40
      estimated_load_time_ms: 300000

  quantization:                   # Quantisation memory multipliers
    multipliers:
      q4: 0.5
      q5: 0.625
      q8: 0.875

  defaults:                       # Default requirements
    min_memory_gb: 4
    recommended_memory_gb: 8
    requires_gpu: false
    estimated_load_time_ms: 5000

  concurrency_limits:             # Dynamic concurrency
    - min_memory_gb: 30
      max_concurrent: 1
    - min_memory_gb: 0
      max_concurrent: 8

  timeout_scaling:                # Dynamic timeout adjustment
    base_timeout_seconds: 30
    load_time_buffer: true

# Path indices (optional)
path_indices:                     # Map names to path array indices
  health: 0                       # paths[0] = /
  completions: 1                  # paths[1] = /api/generate
  chat_completions: 2             # paths[2] = /api/chat

# Platform-specific features (optional)
features:
  metrics:                        # vLLM Prometheus metrics
    enabled: true
    prefix: "vllm:"
  tokenization:                   # vLLM tokenisation API
    enabled: true
    endpoints:
      - /tokenize
      - /detokenize

Creating Custom Profiles

Basic Custom Profile

To support a new LLM platform, create config/profiles/myplatform.yaml:

name: myplatform
version: "1.0"
display_name: "My Platform"

routing:
  prefixes:
    - myplatform
    - mp

api:
  openai_compatible: false
  paths:
    - /health
    - /models
    - /generate
  model_discovery_path: /models
  health_check_path: /health

characteristics:
  timeout: 2m
  max_concurrent_requests: 5
  streaming_support: true
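
Once the profile is in place, reference it from an endpoint's type field and Olla will route /olla/myplatform/* and /olla/mp/* to that backend (the url value here is illustrative):

endpoints:
  - url: "http://localhost:9000"   # Illustrative backend address
    type: "myplatform"             # Selects the custom profile above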

Extending Existing Profiles

To modify an existing profile, create a file with the same name:

# config/profiles/ollama.yaml
name: ollama  # Same name replaces built-in
version: "1.1"

routing:
  prefixes:
    - ollama
    - ai        # Add custom prefix
    - llm       # Add another prefix

characteristics:
  timeout: 10m  # Increase timeout for large models

Adding New Endpoints

To allow additional API paths:

api:
  paths:
    - /                    # Existing
    - /api/generate        # Existing
    - /api/chat            # Existing
    - /api/experimental    # New endpoint
    - /v2/chat             # New version
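
Because path_indices refer to positions in this list (order matters, as noted earlier), append new paths at the end rather than inserting them between existing entries.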

Security Risk

Only add paths you understand and trust. Unknown endpoints could expose dangerous operations.

Troubleshooting

Profile Not Loading

Issue: Custom profile not being used

Diagnosis:

# Check Olla logs during startup
docker logs olla | grep "Loading profile"

# Verify profile is valid YAML
yamllint config/profiles/myprofile.yaml

Common causes:

  • Invalid YAML syntax
  • Missing required field (name)
  • File not in config/profiles/ directory
  • Permission issues reading file

Routes Not Working

Issue: URLs return 404 despite profile configuration

Diagnosis:

# Check registered routes
curl http://localhost:40114/internal/status | jq .routes

# Verify prefix registration
docker logs olla | grep "Registering routes for provider"

Common causes:

  • Profile missing routing.prefixes
  • Endpoint type doesn't match profile name
  • Path not in api.paths allowlist

Models Not Discovered

Issue: No models appearing from backend

Diagnosis:

# Test discovery endpoint directly
curl http://backend:11434/api/tags

# Check discovery in Olla
curl http://localhost:40114/internal/status/models

Common causes:

  • Wrong model_discovery_path
  • Incorrect response_format
  • Backend returning unexpected JSON structure
  • Health check failing

Native Intercepts Not Working

Issue: Requests being proxied instead of intercepted

Diagnosis:

# Check response headers
curl -I http://localhost:40114/olla/ollama/api/tags

# Look for X-Olla-Endpoint header (should be absent for intercepts)

Common causes:

  • Profile not recognised as native type
  • Handler not registered for specific path
  • Custom profile overriding built-in

Best Practices

1. Minimal Path Exposure

Only include paths your application needs:

api:
  paths:
    - /v1/chat/completions  # Chat only
    - /v1/models            # Model discovery
    # Don't include admin or management endpoints

2. Appropriate Timeouts

Set timeouts based on model characteristics:

characteristics:
  timeout: 10m  # Large models need more time

resources:
  timeout_scaling:
    base_timeout_seconds: 180
    load_time_buffer: true  # Add model load time

3. Accurate Capability Detection

Use specific patterns for capability detection:

models:
  capability_patterns:
    vision:
      - "*llava*"           # Ollama vision models
      - "gpt-4-vision*"     # OpenAI vision
      - "claude-3*"         # Anthropic vision

4. Resource Limits

Configure realistic resource requirements:

resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 1  # LM Studio limitation

5. Version Your Profiles

Track your custom profile changes:

name: myplatform
version: "1.2"  # Increment for changes
# Changelog:
# 1.2 - Added /v2/chat endpoint
# 1.1 - Increased timeout to 10m
# 1.0 - Initial version

However, don't change the version field of the natively supported (built-in) profiles.

Profile Reference

Built-in Profiles

PROFILE           | TYPE VALUE | PREFIXES                       | NATIVE INTERCEPTS
ollama            | ollama     | ollama                         | /api/tags, /v1/models
lmstudio          | lm-studio  | lmstudio, lm-studio, lm_studio | /v1/models
vllm              | vllm       | vllm                           | None
openai-compatible | openai     | openai, openai-compatible      | None

Required Fields

FIELD     | TYPE   | DESCRIPTION
name      | string | Unique profile identifier
api.paths | array  | Allowed API paths

Optional Fields

All other fields have sensible defaults and are optional.

Next Steps