LiteLLM Integration¶
Home | github.com/BerriAI/litellm |
---|---|
Since | Olla v0.0.17 |
Type | litellm (use in endpoint configuration) |
Profile | litellm.yaml (see latest) |
Features | |
Unsupported | |
Attributes | |
Prefixes | /olla/litellm |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add LiteLLM to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:4000"
        name: "litellm-gateway"
        type: "litellm"
        priority: 75
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
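Once the endpoint is registered, a quick way to confirm Olla can see it is to request the proxied model list (this assumes Olla is listening on its default port 40114, as in the examples later on this page):
# Verify Olla can reach the LiteLLM endpoint
curl http://localhost:40114/olla/litellm/v1/models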
Multiple LiteLLM Instances¶
Configure multiple LiteLLM servers for high availability:
discovery:
  static:
    endpoints:
      # Primary LiteLLM instance
      - url: "http://litellm-primary:4000"
        name: "litellm-primary"
        type: "litellm"
        priority: 90
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
      # Secondary LiteLLM instance
      - url: "http://litellm-secondary:4000"
        name: "litellm-secondary"
        type: "litellm"
        priority: 70
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s
Starting LiteLLM¶
LiteLLM can run in two modes:
Basic Proxy Mode (Most Common)¶
Simple API translation without a database - suitable for most use cases:
# Install LiteLLM
pip install 'litellm[proxy]'
# Start with environment variables
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
# Run proxy
litellm --model gpt-3.5-turbo \
--model claude-3-haiku-20240307 \
--port 4000
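Before pointing Olla at the proxy, check that it is responding locally (assuming the default port 4000 used above):
# Health check and model list directly against LiteLLM
curl http://localhost:4000/health
curl http://localhost:4000/v1/models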
Configuration File Mode¶
Create a litellm_config.yaml:
model_list:
  # OpenAI models
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: ${OPENAI_API_KEY}
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/gpt-3.5-turbo
      api_key: ${OPENAI_API_KEY}
  # Anthropic models
  - model_name: claude-3-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: ${ANTHROPIC_API_KEY}
  - model_name: claude-3-sonnet
    litellm_params:
      model: anthropic/claude-3-sonnet-20240229
      api_key: ${ANTHROPIC_API_KEY}
  # AWS Bedrock models
  - model_name: claude-3-bedrock
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet
      aws_access_key_id: ${AWS_ACCESS_KEY_ID}
      aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
      aws_region_name: us-east-1
  # Google models
  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-pro
      api_key: ${GEMINI_API_KEY}
  # Together AI models
  - model_name: llama-70b
    litellm_params:
      model: together_ai/meta-llama/Llama-3-70b-chat-hf
      api_key: ${TOGETHER_API_KEY}

# Optional: Caching configuration
litellm_settings:
  cache: true
  cache_ttl: 3600
  # Optional: Spend tracking
  max_budget: 100       # $100 budget
  budget_duration: 30d  # 30 days
Start the proxy with the configuration file:
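# Run the proxy against the config file (mirrors the flags used in the Docker example below)
litellm --config litellm_config.yaml --port 4000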
Endpoints Supported¶
Core Endpoints (Always Available)¶
These endpoints work in all LiteLLM deployments:
Endpoint | Method | Description | Prefix Required |
---|---|---|---|
/v1/chat/completions | POST | Chat completions | /olla/litellm |
/v1/completions | POST | Text completions | /olla/litellm |
/v1/embeddings | POST | Generate embeddings | /olla/litellm |
/v1/models | GET | List available models | /olla/litellm |
/health | GET | Health check | /olla/litellm |
Advanced Endpoints (Database Required)¶
These endpoints only work with PostgreSQL database backend:
Endpoint | Method | Description | Requirements |
---|---|---|---|
/key/generate | POST | Generate API key | Database + Admin auth |
/user/info | GET | User information | Database |
/team/info | GET | Team information | Database |
/spend/calculate | GET | Calculate spend | Database |
Note: Most users run LiteLLM in basic proxy mode without a database, so these endpoints are not included in Olla's default LiteLLM profile; you will need to add them yourself.
Supported Providers¶
LiteLLM provides access to 100+ LLM providers:
Major Cloud Providers¶
- OpenAI: GPT-5, GPT-4, GPT-3.5, Embeddings
- Anthropic: Claude 4.x / 3.x (Opus, Sonnet, Haiku), Claude 2
- Google: Gemini Pro, PaLM, Vertex AI
- AWS Bedrock: Claude, Llama, Mistral, Titan
- Azure: Azure OpenAI Service
- Cohere: Command, Embed
Open Model Platforms¶
- Together AI: Llama, Mixtral, Qwen
- Replicate: Various open models
- Hugging Face: Inference API & Endpoints
- Anyscale: Llama, Mistral
- Perplexity: pplx models
- Groq: Fast inference
Specialized Providers¶
- Voyage AI: Embeddings
- AI21: Jurassic models
- NLP Cloud: Various models
- Aleph Alpha: Luminous models
- Databricks: DBRX
- DeepInfra: Open models
Usage Examples¶
Basic Chat Completion¶
# Using LiteLLM through Olla
curl -X POST http://localhost:40114/olla/litellm/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
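Streaming works through the same prefix. As a sketch, add the standard OpenAI-compatible stream flag and use curl -N to disable buffering (assumes the upstream provider supports streaming):
curl -N -X POST http://localhost:40114/olla/litellm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'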
Provider-Prefixed Models¶
LiteLLM supports provider-prefixed model names:
import openai

# api_key is required by the OpenAI client; any placeholder works unless your
# LiteLLM proxy or Olla is configured to require a real key
client = openai.OpenAI(base_url="http://localhost:40114/olla/litellm/v1", api_key="unused")
# Routes to OpenAI via LiteLLM
response = client.chat.completions.create(
model="openai/gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
# Routes to Anthropic via LiteLLM
response = client.chat.completions.create(
model="anthropic/claude-3-opus-20240229",
messages=[{"role": "user", "content": "Hello!"}]
)
# Routes to AWS Bedrock via LiteLLM
response = client.chat.completions.create(
model="bedrock/anthropic.claude-3-sonnet",
messages=[{"role": "user", "content": "Hello!"}]
)
List Available Models¶
# Get all models from LiteLLM
curl http://localhost:40114/olla/litellm/v1/models
# Response includes all configured models
{
  "data": [
    {"id": "gpt-4", "object": "model"},
    {"id": "claude-3-opus", "object": "model"},
    {"id": "gemini-pro", "object": "model"},
    {"id": "llama-70b", "object": "model"}
  ]
}
Advanced Configuration¶
Cost-Optimised Routing¶
Configure Olla to route to cheaper providers first:
endpoints:
  # Local models (free)
  - url: "http://localhost:11434"
    name: "local-ollama"
    type: "ollama"
    priority: 100
  # LiteLLM with budget models
  - url: "http://litellm-budget:4000"
    name: "litellm-budget"
    type: "litellm"
    priority: 75
  # LiteLLM with premium models
  - url: "http://litellm-premium:4000"
    name: "litellm-premium"
    type: "litellm"
    priority: 50
Multi-Region Setup¶
endpoints:
  # US East region
  - url: "http://litellm-us-east:4000"
    name: "litellm-us-east"
    type: "litellm"
    priority: 100
  # EU West region
  - url: "http://litellm-eu-west:4000"
    name: "litellm-eu-west"
    type: "litellm"
    priority: 100

# Load balance across regions
proxy:
  load_balancer: "least-connections"
Model Capabilities¶
LiteLLM models are automatically categorised by Olla:
Chat Models¶
- GPT-4, GPT-3.5 (OpenAI)
- Claude 3 Opus, Sonnet, Haiku (Anthropic)
- Gemini Pro (Google)
- Llama 3 (Meta via various providers)
- Mistral, Mixtral (Mistral AI)
Embedding Models¶
- text-embedding-ada-002 (OpenAI)
- voyage-* (Voyage AI)
- embed-* (Cohere)
- titan-embed (AWS Bedrock)
Vision Models¶
- GPT-4 Vision (OpenAI)
- Claude 3 models (Anthropic)
- Gemini Pro Vision (Google)
Code Models¶
- GPT-4 (OpenAI)
- Claude 3 (Anthropic)
- CodeLlama (Meta via various providers)
- DeepSeek Coder (via Together AI)
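To exercise a non-chat capability, such as the embedding models listed above, call /v1/embeddings through the same prefix. A sketch, assuming text-embedding-ada-002 is configured in your LiteLLM model_list:
curl -X POST http://localhost:40114/olla/litellm/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "The quick brown fox"
  }'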
Response Headers¶
When requests go through LiteLLM, Olla adds tracking headers:
X-Olla-Endpoint: litellm-gateway
X-Olla-Backend-Type: litellm
X-Olla-Model: gpt-4
X-Olla-Request-ID: req_abc123
X-Olla-Response-Time: 2.341s
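To see these headers on a live request, dump them with curl (-D - writes the response headers to stdout, -o /dev/null discards the body):
curl -s -D - -o /dev/null -X POST http://localhost:40114/olla/litellm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello!"}]}'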
Health Monitoring¶
Olla continuously monitors LiteLLM health:
# Check LiteLLM status
curl http://localhost:40114/internal/status/endpoints
# Response
{
  "endpoints": [
    {
      "name": "litellm-gateway",
      "url": "http://localhost:4000",
      "status": "healthy",
      "type": "litellm",
      "models_count": 25
    }
  ]
}
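For a quick summary at the command line, filter the status JSON with jq (assuming jq is installed):
# Show just the name and status of each endpoint
curl -s http://localhost:40114/internal/status/endpoints | jq '.endpoints[] | {name, status}'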
Troubleshooting¶
LiteLLM Not Starting¶
- Check Python installation: python --version
- Install LiteLLM with the proxy extras: pip install 'litellm[proxy]'
- Verify API keys are set in the environment
- Check port availability: lsof -i :4000
Model Not Found¶
- Verify model is configured in LiteLLM config
- Check model name matches exactly
- Ensure API key for provider is valid
- Check provider-specific requirements
Slow Response Times¶
- Check LiteLLM logs for rate limiting
- Monitor provider API status
- Enable caching in LiteLLM config
- Consider using fallback models
Connection Errors¶
- Verify LiteLLM is running: curl http://localhost:4000/health
- Check firewall rules
- Verify network connectivity
- Check Olla can reach LiteLLM endpoint
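To confirm the full path, hit LiteLLM's health check through Olla itself, using the /olla/litellm prefix from the endpoints table above:
# LiteLLM health check, routed through Olla
curl http://localhost:40114/olla/litellm/health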
Best Practices¶
- Environment Variables: Store API keys securely
- Enable Caching (Optional): Reduce costs with in-memory response caching
- Use Fallbacks: Configure backup models for reliability
- Monitor Health: Check LiteLLM status
Note: Budget limits and spend tracking require database backend.
Docker Deployment¶
Run LiteLLM with Docker:
# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    command: --config /app/config.yaml --port 4000
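Bring it up and verify it before adding it to Olla (assumes Docker Compose v2 and the port mapping above):
docker compose up -d
curl http://localhost:4000/health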
Integration with Other Providers¶
LiteLLM works seamlessly with other Olla providers:
endpoints:
  # Local Ollama (highest priority)
  - url: "http://localhost:11434"
    type: "ollama"
    priority: 100
  # LM Studio (medium priority)
  - url: "http://localhost:1234"
    type: "lm-studio"
    priority: 75
  # LiteLLM for cloud (lower priority)
  - url: "http://localhost:4000"
    type: "litellm"
    priority: 50
This setup provides:

1. Local model preference for speed and cost
2. Automatic fallback to cloud APIs
3. Unified API across all providers
4. Single endpoint for all models