
Monitoring Best Practices

This guide covers monitoring and observability for Olla deployments.

📝 Default Monitoring Configuration

# Built-in endpoints (always enabled)
# /internal/health - Basic health check
# /internal/status - Detailed status
# /internal/status/endpoints - Endpoint details
# /internal/stats/models - Model statistics
# /internal/stats/translators - Translator statistics

logging:
  level: "info"
  format: "json"

Key Features:

  • Health endpoints enabled by default
  • JSON logging for structured monitoring
  • No external dependencies required

Environment Variables: OLLA_LOG_LEVEL, OLLA_LOG_FORMAT

Monitoring Overview

Effective monitoring helps you:

  • Detect issues before users do
  • Understand system performance
  • Plan capacity and scaling
  • Troubleshoot problems quickly
  • Track SLA compliance

Built-in Monitoring

Health Endpoint

Basic health check:

curl http://localhost:40114/internal/health

Response:

{
  "status": "healthy",
  "endpoints_healthy": 3,
  "endpoints_total": 3
}

Use for:

  • Load balancer health checks
  • Kubernetes liveness probes
  • Basic availability monitoring
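
A minimal probe sketch (assuming the default port) that exits non-zero when the health check fails, suitable for a container health check or a cron probe. It only verifies the HTTP response; see the fuller script under Monitoring Scripts for checking endpoint counts:

#!/bin/bash
# Exit 0 when Olla responds successfully, non-zero otherwise
curl -sf http://localhost:40114/internal/health > /dev/null || exit 1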

Status Endpoint

Detailed system status:

curl http://localhost:40114/internal/status

Response includes:

  • Endpoint health details
  • Request statistics
  • Model registry information
  • Circuit breaker states
  • Performance metrics
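
Specific fields can be pulled out with jq. The field names below follow the examples used elsewhere in this guide; adjust them to match your Olla version:

# Endpoint health details
curl -s http://localhost:40114/internal/status | jq '.endpoints'

# Request totals and average latency
curl -s http://localhost:40114/internal/status | \
  jq '{requests: .system.total_requests, failures: .system.total_failures, avg_latency: .system.avg_latency}'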

Key Metrics

Golden Signals

Monitor these four golden signals:

1. Latency

Track response time percentiles:

# Get average latency
curl http://localhost:40114/internal/status | jq '.system.avg_latency'

Key metrics:

  • P50 (median): Typical performance seen by half of requests
  • P95: The slow tail that most users still hit occasionally
  • P99: Worst-case outliers

Alerting thresholds:

Percentile Good Warning Critical
P50 <100ms <500ms >1s
P95 <500ms <2s >5s
P99 <2s <5s >10s
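
A sketch of a threshold check against the average latency reported by the status endpoint (this assumes the value is numeric milliseconds; adjust the parsing if your build reports a formatted string):

#!/bin/bash
# Warn when average latency exceeds 500ms
AVG=$(curl -s http://localhost:40114/internal/status | jq -r '.system.avg_latency')
if (( $(echo "$AVG > 500" | bc -l) )); then
    echo "WARNING: average latency is ${AVG}ms"
    exit 1
fi
echo "OK: average latency is ${AVG}ms"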

2. Traffic

Monitor request rates:

# Total requests
curl http://localhost:40114/internal/status | jq '.system.total_requests'

Track:

  • Requests per second
  • Request patterns over time
  • Peak vs average traffic
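
The status endpoint exposes a running total rather than a rate, so requests per second can be derived by sampling the counter twice. A sketch, assuming total_requests is a plain numeric counter:

#!/bin/bash
# Approximate requests per second over a 60-second window
BEFORE=$(curl -s http://localhost:40114/internal/status | jq '.system.total_requests')
sleep 60
AFTER=$(curl -s http://localhost:40114/internal/status | jq '.system.total_requests')
echo "RPS: $(( (AFTER - BEFORE) / 60 ))"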

3. Errors

Track error rates:

# Error statistics
curl http://localhost:40114/internal/status | jq '.system.total_failures'

Monitor:

  • HTTP error codes (4xx, 5xx)
  • Circuit breaker trips
  • Timeout errors
  • Connection failures

Alert on:

  • Error rate > 1%
  • 5xx errors > 0.1%
  • Circuit breaker trips
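
A sketch of an error-rate check built from the same counters (assumes both fields are numeric):

#!/bin/bash
# Alert when the failure ratio exceeds 1%
STATUS=$(curl -s http://localhost:40114/internal/status)
REQUESTS=$(echo "$STATUS" | jq '.system.total_requests')
FAILURES=$(echo "$STATUS" | jq '.system.total_failures')
[ "$REQUESTS" -eq 0 ] && { echo "OK: no traffic yet"; exit 0; }
RATE=$(echo "scale=4; $FAILURES / $REQUESTS * 100" | bc -l)
if (( $(echo "$RATE > 1" | bc -l) )); then
    echo "WARNING: error rate is ${RATE}%"
    exit 1
fi
echo "OK: error rate is ${RATE}%"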

4. Saturation

Monitor resource usage:

  • CPU utilisation
  • Memory usage
  • Connection pool saturation
  • Queue depths

Response Headers

Olla adds monitoring headers to responses:

curl -I http://localhost:40114/olla/ollama/v1/models

# Headers:
X-Olla-Endpoint: local-ollama
X-Olla-Model: llama3.2
X-Olla-Backend-Type: ollama
X-Olla-Request-ID: 550e8400-e29b-41d4-a716-446655440000
X-Olla-Response-Time: 125ms

Use these for:

  • Request tracing
  • Performance analysis
  • Debugging routing decisions
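
For example, the routing and timing headers can be captured per request without reading the response body (header names as listed above):

# Log which endpoint served the request and how long it took
curl -s -o /dev/null -D - http://localhost:40114/olla/ollama/v1/models | \
  grep -i '^X-Olla-'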

Logging

Log Configuration

Configure appropriate logging:

logging:
  level: "info"    # info for production, debug for troubleshooting
  format: "json"   # Structured logs for parsing
  output: "stdout" # Or file path

Log Levels

Level Use Case Volume
debug Development/troubleshooting Very high
info Normal operations Moderate
warn Potential issues Low
error Actual problems Very low

Structured Logging

JSON format enables parsing:

{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "Request completed",
  "endpoint": "local-ollama",
  "method": "POST",
  "path": "/v1/chat/completions",
  "status": 200,
  "duration_ms": 125,
  "request_id": "550e8400-e29b-41d4-a716-446655440000"
}
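
Because each log line is a JSON object, jq can filter them directly. A couple of sketches using the fields from the example above (the log path is illustrative):

# Requests slower than one second
jq -c 'select(.duration_ms != null and .duration_ms > 1000)' /var/log/olla.log

# Request counts per endpoint
jq -r 'select(.endpoint != null) | .endpoint' /var/log/olla.log | sort | uniq -c | sort -rn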

Prometheus Metrics

Exporting Metrics

While Olla doesn't have built-in Prometheus support, you can scrape the status endpoint:

# prometheus_exporter.py
import requests
import time
from prometheus_client import start_http_server, Gauge, Histogram

# Define metrics
requests_total = Gauge('olla_requests_total', 'Total requests')
errors_total = Gauge('olla_errors_total', 'Total errors')
endpoints_healthy = Gauge('olla_endpoints_healthy', 'Healthy endpoints')
response_time_p50 = Gauge('olla_response_time_p50', 'P50 response time')

# Provider metrics (from extracted data)
tokens_per_second = Histogram('olla_tokens_per_second', 'Token generation speed', 
                              ['endpoint', 'model'])
prompt_tokens = Histogram('olla_prompt_tokens', 'Prompt token count',
                          ['endpoint', 'model'])
completion_tokens = Histogram('olla_completion_tokens', 'Completion token count',
                              ['endpoint', 'model'])

def collect_metrics():
    while True:
        try:
            resp = requests.get('http://localhost:40114/internal/status', timeout=5)
            data = resp.json()

            requests_total.set(data['system']['total_requests'])
            errors_total.set(data['system']['total_failures'])
            endpoints_healthy.set(len([e for e in data['endpoints'] if e['status'] == 'healthy']))

            # Extract provider metrics if available
            for endpoint_name, stats in data.get('proxy', {}).get('endpoints', {}).items():
                if 'avg_tokens_per_second' in stats:
                    tokens_per_second.labels(
                        endpoint=endpoint_name,
                        model=stats.get('primary_model', 'unknown')
                    ).observe(stats['avg_tokens_per_second'])
        except (requests.RequestException, ValueError, KeyError) as exc:
            print(f"scrape failed: {exc}")
        time.sleep(15)

if __name__ == '__main__':
    start_http_server(8000)
    collect_metrics()
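
Once the exporter is running (with the requests and prometheus_client packages installed), confirm that Prometheus-format metrics are being served; prometheus_client exposes them on /metrics:

python3 prometheus_exporter.py &
curl -s http://localhost:8000/metrics | grep '^olla_'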

Grafana Dashboard

Key panels for Grafana:

  1. Request Rate: rate(olla_requests_total[5m])
  2. Error Rate: rate(olla_errors_total[5m])
  3. Latency: olla_response_time_p50, p95, p99
  4. Endpoint Health: olla_endpoints_healthy
  5. Success Rate: 1 - (rate(errors) / rate(requests))
  6. Token Generation Speed: olla_tokens_per_second (from provider metrics)
  7. Token Usage: olla_prompt_tokens + olla_completion_tokens
  8. Translator Passthrough Rate: olla_translator_passthrough_rate per translator (from /internal/stats/translators)
  9. Translator Fallback Reasons: Breakdown of fallback_* counters per translator
  10. Translator Latency: average_latency per translator (from /internal/stats/translators)

Provider Metrics

Olla automatically extracts performance metrics from LLM provider responses:

Available Metrics

Metric              Description          Source
tokens_per_second   Generation speed     Ollama, LM Studio
prompt_tokens       Input token count    All providers
completion_tokens   Output token count   All providers
total_duration_ms   End-to-end time      Ollama
eval_duration_ms    Generation time      Ollama
time_per_token_ms   Per-token latency    Calculated

Accessing Provider Metrics

  1. Debug Logs: Metrics appear in debug-level logs
  2. Status Endpoint: Aggregated in /internal/status
  3. Custom Extraction: Parse from debug logs

See Provider Metrics Documentation for configuration details.
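
As a quick check, the aggregated figures can be pulled from the status endpoint. The field path below mirrors the exporter sketch earlier in this guide and may differ in your build:

# Tokens-per-second per endpoint, where reported
curl -s http://localhost:40114/internal/status | \
  jq '.proxy.endpoints // {} | to_entries[] | {endpoint: .key, tokens_per_second: .value.avg_tokens_per_second}'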

Translator Metrics

Olla tracks comprehensive metrics for API translation requests, providing visibility into passthrough vs translation usage, fallback behaviour, and performance.

Available Translator Metrics

Translator metrics are collected per-translator (e.g., "anthropic") and include:

Metric                                              Type     Description
total_requests                                      Counter  Total requests processed
successful_requests                                 Counter  Requests that completed successfully
failed_requests                                     Counter  Requests that failed
passthrough_requests                                Counter  Requests forwarded directly (native format)
translation_requests                                Counter  Requests that required format conversion
streaming_requests                                  Counter  Streaming (SSE) requests
non_streaming_requests                              Counter  Non-streaming requests
fallback_no_compatible_endpoints                    Counter  Fallbacks due to no healthy endpoints
fallback_translator_does_not_support_passthrough    Counter  Fallbacks due to translator lacking passthrough
fallback_cannot_passthrough                         Counter  Fallbacks due to no backends with native support
avg_latency_ms                                      Gauge    Average request latency in milliseconds
total_latency_ms                                    Counter  Cumulative latency across all requests

Key Metrics to Track

Passthrough Efficiency: Monitor the ratio of passthrough_requests to translation_requests. A high passthrough rate indicates backends are being used optimally.

Fallback Reasons: Track fallback_* counters to understand why passthrough isn't being used:

  • fallback_no_compatible_endpoints - No healthy endpoints available (operational issue)
  • fallback_cannot_passthrough - Backends don't declare native Anthropic support (configuration issue)
  • fallback_translator_does_not_support_passthrough - Expected for translators without passthrough capability

Success Rate: Compare successful_requests vs failed_requests to detect translation issues.

Response Header Observability

The X-Olla-Mode: passthrough response header is included when passthrough mode is active. This allows external monitoring tools to track mode usage:

# Check which mode was used for a request
curl -sI -X POST http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"llama4:latest","max_tokens":10,"messages":[{"role":"user","content":"Hi"}]}' \
  | grep X-Olla-Mode

Translator Stats HTTP Endpoint

The /internal/stats/translators endpoint exposes all translator metrics via HTTP, making them easy to query from scripts, monitoring tools, and dashboards.

# Query translator statistics
curl -s http://localhost:40114/internal/stats/translators | jq .

The response includes per-translator statistics and an aggregate summary:

{
  "timestamp": "2026-02-13T10:30:00Z",
  "translators": [
    {
      "translator_name": "anthropic",
      "total_requests": 1500,
      "successful_requests": 1450,
      "failed_requests": 50,
      "success_rate": "96.7%",
      "passthrough_rate": "80.0%",
      "passthrough_requests": 1200,
      "translation_requests": 300,
      "streaming_requests": 800,
      "non_streaming_requests": 700,
      "fallback_no_compatible_endpoints": 5,
      "fallback_translator_does_not_support_passthrough": 0,
      "fallback_cannot_passthrough": 295,
      "average_latency": "245ms"
    }
  ],
  "summary": {
    "total_translators": 1,
    "active_translators": 1,
    "total_requests": 1500,
    "overall_success_rate": "96.7%",
    "total_passthrough": 1200,
    "total_translations": 300,
    "overall_passthrough_rate": "80.0%",
    "total_streaming": 800,
    "total_non_streaming": 700
  }
}

Translators are sorted by request count (most active first), and all rates and latencies use human-readable formatting.

Monitoring with the Translator Stats Endpoint

Watch passthrough efficiency in real-time:

watch -n 10 'curl -s http://localhost:40114/internal/stats/translators | jq ".summary.overall_passthrough_rate"'

Check fallback reasons for a specific translator:

curl -s http://localhost:40114/internal/stats/translators | \
  jq '.translators[] | select(.translator_name == "anthropic") | {
    passthrough_rate,
    fallback_no_compatible_endpoints,
    fallback_cannot_passthrough,
    fallback_translator_does_not_support_passthrough
  }'

Alert on low success rate:

#!/bin/bash
# check_translator_health.sh
STATS=$(curl -s http://localhost:40114/internal/stats/translators)
SUCCESS_RATE=$(echo "$STATS" | jq -r '.summary.overall_success_rate' | tr -d '%')

if (( $(echo "$SUCCESS_RATE < 95" | bc -l) )); then
    echo "WARNING: Translator success rate is $SUCCESS_RATE%"
    exit 1
fi
echo "OK: Translator success rate is $SUCCESS_RATE%"
exit 0

Scrape for Prometheus:

Add the translator stats endpoint to your Prometheus exporter alongside the status endpoint:

# Add to prometheus_exporter.py
translator_requests = Gauge('olla_translator_requests_total', 'Total translator requests', ['translator'])
translator_passthrough_rate = Gauge('olla_translator_passthrough_rate', 'Passthrough rate', ['translator'])
translator_success_rate = Gauge('olla_translator_success_rate', 'Success rate', ['translator'])

def collect_translator_metrics():
    resp = requests.get('http://localhost:40114/internal/stats/translators')
    data = resp.json()
    for t in data['translators']:
        name = t['translator_name']
        translator_requests.labels(translator=name).set(t['total_requests'])
        # Parse percentage strings for numeric gauge values
        pt_rate = float(t['passthrough_rate'].rstrip('%'))
        translator_passthrough_rate.labels(translator=name).set(pt_rate)
        sr = float(t['success_rate'].rstrip('%'))
        translator_success_rate.labels(translator=name).set(sr)
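
# Note: call collect_translator_metrics() from the same polling loop as
# collect_metrics(); the snippet above only defines the gauges and the collector.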

See the System Endpoints API Reference for the complete response field reference.

Implementation Details

Translator metrics are collected using thread-safe xsync counters in internal/adapter/stats/translator_collector.go. Metrics are recorded at all decision points in the translation handler (internal/app/handlers/handler_translation.go), including:

  • Early exits (body read errors, transform errors)
  • Endpoint lookup failures
  • Passthrough mode selection
  • Translation mode fallback with reason tracking
  • Request completion (success or failure)

Health Monitoring

Endpoint Health

Monitor individual endpoint health:

# Check endpoint status
curl http://localhost:40114/internal/status | jq '.endpoints'

Track:

  • Health status per endpoint
  • Last check time
  • Failure counts
  • Circuit breaker state
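
To list only the endpoints that are currently unhealthy (field names match the health-check script later in this guide):

# Unhealthy endpoints only
curl -s http://localhost:40114/internal/status | \
  jq '[.endpoints[] | select(.status != "healthy")]'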

Circuit Breaker Monitoring

Circuit breaker state affects endpoint status. When tripped, endpoints show as unhealthy in the status response.

Alert when:

  • Circuit opens (immediate)
  • Circuit remains open > 5 minutes
  • Multiple circuits open simultaneously

Performance Monitoring

Latency Tracking

Monitor latency at different levels:

  1. Olla Overhead: Time added by proxy
  2. Backend Latency: Time spent at backend
  3. Network Latency: Connection time

# Response headers show total time
X-Olla-Response-Time: 125ms
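
Comparing that header against the client-observed total gives a rough view of where time is spent; a sketch using curl's write-out timing:

# Client-observed total vs proxy-reported time
curl -s -o /dev/null -D - -w 'client total: %{time_total}s\n' \
  http://localhost:40114/olla/ollama/v1/models | grep -iE '^X-Olla-Response-Time|client total'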

Throughput Monitoring

Track requests per second:

# Active connections
curl http://localhost:40114/internal/status | \
  jq '.system.active_connections'

Historical tracking:

  • Peak throughput times
  • Average vs peak ratios
  • Throughput per endpoint

Resource Usage

Monitor system resources:

# Memory usage (RSS in KB)
ps -o rss= -p "$(pgrep olla)"

# CPU usage (batch mode, single sample)
top -b -n 1 -p "$(pgrep olla)"

# Connection count
netstat -an | grep 40114 | wc -l

Alerting Strategy

Critical Alerts (Page immediately)

  • Service down (health check fails)
  • All endpoints unhealthy
  • Error rate > 5%
  • P99 latency > 10s

Warning Alerts (Notify team)

  • Single endpoint unhealthy
  • Error rate > 1%
  • P95 latency > 5s
  • Memory usage > 80%

Info Alerts (Log for review)

  • Circuit breaker trips
  • Rate limit violations
  • Configuration reloads
  • Model discovery failures

Log Aggregation

ELK Stack Integration

Ship logs to Elasticsearch:

# filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
    - decode_json_fields:
        fields: ["message"]
        target: "olla"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]

Useful Queries

Elasticsearch queries for analysis:

// Error rate by endpoint
{
  "query": {
    "bool": {
      "must": [
        {"term": {"level": "error"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "aggs": {
    "by_endpoint": {
      "terms": {"field": "endpoint.keyword"}
    }
  }
}

// Slow requests
{
  "query": {
    "range": {
      "duration_ms": {"gte": 1000}
    }
  }
}

Distributed Tracing

Request Correlation

Use request IDs for tracing:

# Request ID in headers
X-Olla-Request-ID: 550e8400-e29b-41d4-a716-446655440000

Track requests across:

  • Olla proxy
  • Backend services
  • Frontend applications
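
A sketch of correlating one request end-to-end: capture the ID Olla assigns from the response headers, then pull the matching log lines (the log path is illustrative):

# Grab the request ID assigned by Olla
RID=$(curl -s -o /dev/null -D - http://localhost:40114/olla/ollama/v1/models | \
  awk -F': ' 'tolower($1) == "x-olla-request-id" {print $2}' | tr -d '\r')

# Find every log line for that request
grep "$RID" /var/log/olla.log | jq .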

OpenTelemetry Integration

While not built-in, you can add tracing:

// Wrap handlers with OpenTelemetry
import "go.opentelemetry.io/otel"

tracer := otel.Tracer("olla")
ctx, span := tracer.Start(ctx, "proxy_request")
defer span.End()

Capacity Planning

Metrics for Planning

Track over time:

  1. Request Growth: Month-over-month increase
  2. Peak Traffic: Daily/weekly patterns
  3. Resource Usage: CPU/memory trends
  4. Response Times: Degradation patterns

Scaling Indicators

Scale when:

  • CPU > 70% sustained
  • Memory > 80% sustained
  • P95 latency increasing
  • Error rate increasing
  • Queue depth growing

Monitoring Tools

Command-Line Monitoring

Quick status checks:

# Watch status in real-time
watch -n 5 'curl -s http://localhost:40114/internal/status | jq .'

# Monitor logs
tail -f /var/log/olla.log | jq '.'

# Track specific errors (log levels are lowercase in the JSON output)
jq -r 'select(.level == "error") | .msg' /var/log/olla.log

Monitoring Scripts

Health check script:

#!/bin/bash
# check_olla.sh

STATUS=$(curl -s http://localhost:40114/internal/health | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
    echo "CRITICAL: Olla is unhealthy"
    exit 2
fi

ENDPOINTS=$(curl -s http://localhost:40114/internal/status | jq '[.endpoints[] | select(.status == "healthy")] | length')
TOTAL=$(curl -s http://localhost:40114/internal/status | jq '.endpoints | length')

if [ "$ENDPOINTS" -lt "$TOTAL" ]; then
    echo "WARNING: Only $ENDPOINTS/$TOTAL endpoints healthy"
    exit 1
fi

echo "OK: Olla healthy with $ENDPOINTS endpoints"
exit 0

Troubleshooting with Monitoring

High Latency Investigation

  1. Check P50 vs P99 spread
  2. Identify slow endpoints
  3. Review circuit breaker states
  4. Check backend health directly
  5. Analyse request patterns

Error Spike Investigation

  1. Check error types (4xx vs 5xx)
  2. Identify affected endpoints
  3. Review recent changes
  4. Check rate limit violations
  5. Analyse request logs

Monitoring Checklist

Production monitoring setup:

  • Health endpoint monitoring
  • Status endpoint collection
  • Log aggregation configured
  • Alert rules defined
  • Dashboard created
  • Request ID correlation
  • Error rate tracking
  • Latency percentiles
  • Resource monitoring
  • Circuit breaker alerts
  • Capacity planning metrics
  • Translator metrics tracking (passthrough/translation rates, fallback reasons)
  • X-Olla-Mode header monitoring for passthrough efficiency

Next Steps