Remote Backend Auth (Experimental)

Not officially supported

Olla is designed for local, self-hosted inference backends. Remote cloud APIs are not a first-class use case. The recipes below work today for users who want to experiment, but we make no guarantees about continued compatibility, and issues specific to cloud providers will not be prioritised.

If you want to use hosted APIs, consider LiteLLM as an intermediary. It handles the provider-specific quirks, and Olla then talks to LiteLLM as a local OpenAI-compatible endpoint.

Why Cloud APIs Are Not First-Class

Cloud inference APIs have operational characteristics that Olla does not currently handle:

  • Rate limit headers (x-ratelimit-*, retry-after): Olla does not parse or propagate provider-specific rate limit signalling beyond honouring 429 for health state.
  • Path-prefix base URLs: Some APIs require a base path in the URL (e.g. https://api.groq.com/openai/v1). See below for how this interacts with health and model discovery.
  • Cold-start latency: Serverless-backed providers can have high first-token latency that exceeds Olla's default health check timeouts.
  • Model namespacing: Many cloud APIs use provider/model-name format. Olla's model discovery and unification are tuned for local naming conventions.
  • No local health check: Cloud APIs typically do not expose a /health endpoint. Health checks against /v1/models are real API calls and may consume quota.

URL Construction for Path-Prefixed Bases

Olla joins discovery paths onto the base URL path using path.Join. For a base like https://api.groq.com/openai/v1, the default health or model path /v1/models is joined as /openai/v1/v1/models. The doubled prefix silently breaks health checks and model discovery.

Set explicit absolute health_check_url and model_url values to bypass the join entirely. ResolveURLPath returns absolute URLs as-is, so https://api.groq.com/openai/v1/models goes to the wire unchanged. This only affects discovery; proxy-time URL building is controlled separately by preserve_path.
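
The doubled prefix is easy to reproduce with Go's standard library. The sketch below is illustrative only: resolveDiscoveryURL is a hypothetical stand-in written for this page, not Olla's actual ResolveURLPath, and it assumes that "absolute" means the target starts with http:// or https://.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// resolveDiscoveryURL is a hypothetical helper (not Olla's ResolveURLPath).
// Absolute targets are returned unchanged; relative targets are joined onto
// the base URL's path with path.Join, which is where the doubling comes from.
func resolveDiscoveryURL(base, target string) (string, error) {
	if strings.HasPrefix(target, "http://") || strings.HasPrefix(target, "https://") {
		return target, nil
	}
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = path.Join(u.Path, target)
	return u.String(), nil
}

func main() {
	base := "https://api.groq.com/openai/v1"

	// Default relative path: the /v1 segment ends up in the URL twice.
	doubled, err := resolveDiscoveryURL(base, "/v1/models")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(doubled)

	// Explicit absolute URL: passed through as-is.
	absolute, err := resolveDiscoveryURL(base, "https://api.groq.com/openai/v1/models")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(absolute)
}
```

The first call yields https://api.groq.com/openai/v1/v1/models and the second yields https://api.groq.com/openai/v1/models, which is why the recipes on this page set both URLs explicitly.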

What We Don't Promise

  • Health check accuracy for cloud endpoints
  • Correct model listing or unification across local and remote endpoints
  • Retry or backoff behaviour that respects provider-specific rate limiting
  • Compatibility with provider authentication changes

Recipes

These configurations work at the time of writing. Treat them as starting points, not production-tested deployments.

Ollama Cloud

Ollama Cloud (https://ollama.com) accepts bearer authentication. Create an API key at ollama.com/settings/keys and supply it as the token, shown here as ${OLLAMA_CLOUD_API_KEY}.

discovery:
  static:
    endpoints:
      - url: "https://ollama.com"
        name: "ollama-cloud"
        type: "ollama"
        priority: 10          # lower than local instances
        check_interval: 60s   # avoid hammering cloud health checks
        check_timeout: 10s
        auth:
          type: bearer
          token: "${OLLAMA_CLOUD_API_KEY}"

Known limitations:

  • The Ollama Cloud API surface may differ from local Ollama. Model names include the namespace (e.g. hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF).
  • Health check hits /, which works on the Ollama Cloud base URL.
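
For debugging, it can help to know what a bearer auth block amounts to on the wire. The sketch below assumes the conventional form, an Authorization: Bearer <token> request header. newBearerRequest is a hypothetical helper written for this page, not part of Olla, and reading the key from the OLLAMA_CLOUD_API_KEY environment variable is likewise an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// newBearerRequest is a hypothetical helper (not part of Olla). It builds a
// request carrying the token in the conventional Authorization header.
func newBearerRequest(method, rawURL, token string) (*http.Request, error) {
	req, err := http.NewRequest(method, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// Assumption: the key is exported under the same name the recipe uses.
	token := os.Getenv("OLLAMA_CLOUD_API_KEY")

	req, err := newBearerRequest(http.MethodGet, "https://ollama.com/", token)
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// Report whether a token was found without printing the secret itself.
	fmt.Println(req.Method, req.URL)
	fmt.Println("token present:", token != "")
}
```

The program only constructs the request and never sends it, so running it does not consume quota.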

OpenRouter

OpenRouter exposes an OpenAI-compatible API at https://openrouter.ai/api/v1. Because of the /api/v1 path prefix, you need preserve_path: true to prevent Olla from stripping it.

discovery:
  static:
    endpoints:
      - url: "https://openrouter.ai/api/v1"
        name: "openrouter"
        type: "openai-compatible"
        priority: 10
        preserve_path: true   # required: prevents stripping the /api/v1 prefix
        health_check_url: "https://openrouter.ai/api/v1/models"
        model_url: "https://openrouter.ai/api/v1/models"
        check_interval: 120s
        check_timeout: 15s
        auth:
          type: bearer
          token: "${OPENROUTER_API_KEY}"

Known limitations:

  • Health checks probe /api/v1/models, which incurs an API call. Set check_interval high to avoid burning quota.
  • OpenRouter requires an HTTP-Referer header for attribution on some tiers. Use headers: to set it:
      headers:
        HTTP-Referer: "https://your-app.example.com"
        X-Title: "Your App Name"
  • Model names include the provider prefix (e.g. openai/gpt-4o, anthropic/claude-3.5-sonnet). These will not unify with local model names.

Groq

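Groq provides a fast OpenAI-compatible inference API at https://api.groq.com/openai/v1. As with OpenRouter, the path prefix calls for preserve_path: true and explicit health_check_url and model_url values.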

discovery:
  static:
    endpoints:
      - url: "https://api.groq.com/openai/v1"
        name: "groq"
        type: "openai-compatible"
        priority: 10
        preserve_path: true
        health_check_url: "https://api.groq.com/openai/v1/models"
        model_url: "https://api.groq.com/openai/v1/models"
        check_interval: 120s
        check_timeout: 10s
        auth:
          type: bearer
          token: "${GROQ_API_KEY}"

Known limitations:

  • Health checks probe /openai/v1/models, which incurs an API call, so the same cost caveat as OpenRouter applies.
  • Groq's rate limits are aggressive on the free tier. A misconfigured health interval can exhaust rate limits before any inference requests are made.

Mixing Local and Remote

You can combine local and remote endpoints. Set priorities so local endpoints are strongly preferred and remote endpoints act as overflow:

discovery:
  static:
    endpoints:
      # Local, always preferred
      - url: "http://localhost:8000"
        name: "local-vllm"
        type: "vllm"
        priority: 100

      # Remote fallback
      - url: "https://api.groq.com/openai/v1"
        name: "groq-fallback"
        type: "openai-compatible"
        priority: 5
        preserve_path: true
        health_check_url: "https://api.groq.com/openai/v1/models"
        model_url: "https://api.groq.com/openai/v1/models"
        check_interval: 120s
        auth:
          type: bearer
          token: "${GROQ_API_KEY}"

With load_balancer: priority, requests reach the remote endpoint only when all local endpoints are unhealthy.
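
The overflow behaviour can be pictured as "pick the healthy endpoint with the highest priority". The sketch below illustrates that idea only and is not Olla's balancer: the endpoint type and pickByPriority are hypothetical, and it assumes that a larger priority value wins (as the recipes on this page imply) and that ties go to the first endpoint listed.

```go
package main

import "fmt"

// endpoint is a hypothetical, simplified view of a configured backend.
type endpoint struct {
	Name     string
	Priority int
	Healthy  bool
}

// pickByPriority returns the healthy endpoint with the largest priority.
// The boolean is false when no endpoint is healthy.
func pickByPriority(eps []endpoint) (endpoint, bool) {
	var best endpoint
	found := false
	for _, ep := range eps {
		if !ep.Healthy {
			continue
		}
		if !found || ep.Priority > best.Priority {
			best, found = ep, true
		}
	}
	return best, found
}

func main() {
	eps := []endpoint{
		{Name: "local-vllm", Priority: 100, Healthy: true},
		{Name: "groq-fallback", Priority: 5, Healthy: true},
	}

	// While the local endpoint is healthy, it takes the traffic.
	if ep, ok := pickByPriority(eps); ok {
		fmt.Println("selected:", ep.Name)
	}

	// Once the local endpoint is marked unhealthy, the remote one takes over.
	eps[0].Healthy = false
	if ep, ok := pickByPriority(eps); ok {
		fmt.Println("selected:", ep.Name)
	}
}
```

The fallback depends entirely on health state. A local endpoint that is slow but still passes its health check keeps receiving traffic in this model.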

Community Contributions

If you build cloud-specific profile YAML files or improve health check behaviour for cloud APIs, PRs are welcome. See Contributing.

See Also