Monitoring Best Practices¶
This guide covers monitoring and observability for Olla deployments.
Default Monitoring Configuration

```yaml
# Built-in endpoints (always enabled)
# /internal/health           - Basic health check
# /internal/status           - Detailed status
# /internal/status/endpoints - Endpoint details
# /internal/stats/models     - Model statistics

logging:
  level: "info"
  format: "json"
```

Key Features:

- Health endpoints enabled by default
- JSON logging for structured monitoring
- No external dependencies required

Environment Variables: OLLA_LOG_LEVEL, OLLA_LOG_FORMAT
Monitoring Overview¶
Effective monitoring helps you:
- Detect issues before users do
- Understand system performance
- Plan capacity and scaling
- Troubleshoot problems quickly
- Track SLA compliance
Built-in Monitoring¶
Health Endpoint¶
Basic health check:
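```bash
curl http://localhost:40114/internal/health
```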
Response:
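A minimal illustration of the response shape (the status field is what the health-check script later in this guide inspects; other fields may vary by version):

```json
{
  "status": "healthy"
}
```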
Use for:
- Load balancer health checks
- Kubernetes liveness probes
- Basic availability monitoring
Status Endpoint¶
Detailed system status:
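```bash
curl -s http://localhost:40114/internal/status | jq .
```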
Response includes:
- Endpoint health details
- Request statistics
- Model registry information
- Circuit breaker states
- Performance metrics
Key Metrics¶
Golden Signals¶
Monitor these four golden signals:
1. Latency¶
Track response time percentiles:
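One way to approximate these percentiles without extra tooling is to aggregate the duration_ms field from the structured logs described later in this guide. A minimal jq sketch, assuming JSON logs are written one object per line to /var/log/olla.log:

```bash
# Approximate P50/P95/P99 (in ms) from structured logs
jq -s '[.[].duration_ms | numbers] | sort |
  {p50: nth(length*50/100|floor), p95: nth(length*95/100|floor), p99: nth(length*99/100|floor)}' \
  /var/log/olla.log
```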
Key metrics:
- P50 (median): Typical request performance
- P95: Upper bound for 95% of requests
- P99: Worst-case tail latency
Alerting thresholds:
| Percentile | Good | Warning | Critical |
|---|---|---|---|
| P50 | <100ms | <500ms | >1s |
| P95 | <500ms | <2s | >5s |
| P99 | <2s | <5s | >10s |
2. Traffic¶
Monitor request rates:
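The status endpoint exposes a cumulative request counter (the total_requests field used by the exporter example later in this guide), so request rate can be approximated by sampling it twice:

```bash
# Approximate requests per second by sampling the cumulative counter twice
A=$(curl -s http://localhost:40114/internal/status | jq '.system.total_requests')
sleep 60
B=$(curl -s http://localhost:40114/internal/status | jq '.system.total_requests')
echo "scale=2; ($B - $A) / 60" | bc
```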
Track:
- Requests per second
- Request patterns over time
- Peak vs average traffic
3. Errors¶
Track error rates:
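A quick way to break errors down by status code is to filter the structured logs (the status field shown in the Logging section); a sketch assuming logs at /var/log/olla.log:

```bash
# Count 4xx/5xx responses by status code over the current log file
jq -r 'select(.status >= 400) | .status' /var/log/olla.log | sort | uniq -c | sort -rn
```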
Monitor:
- HTTP error codes (4xx, 5xx)
- Circuit breaker trips
- Timeout errors
- Connection failures
Alert on:
- Error rate > 1%
- 5xx errors > 0.1%
- Circuit breaker trips
4. Saturation¶
Monitor resource usage:
- CPU utilisation
- Memory usage
- Connection pool saturation
- Queue depths
Response Headers¶
Olla adds monitoring headers to responses:
```bash
curl -I http://localhost:40114/olla/ollama/v1/models

# Headers:
X-Olla-Endpoint: local-ollama
X-Olla-Model: llama3.2
X-Olla-Backend-Type: ollama
X-Olla-Request-ID: 550e8400-e29b-41d4-a716-446655440000
X-Olla-Response-Time: 125ms
```
Use these for:
- Request tracing
- Performance analysis
- Debugging routing decisions
Logging¶
Log Configuration¶
Configure appropriate logging:
```yaml
logging:
  level: "info"    # info for production, debug for troubleshooting
  format: "json"   # Structured logs for parsing
  output: "stdout" # Or a file path
```
Log Levels¶
| Level | Use Case | Volume |
|---|---|---|
| debug | Development/troubleshooting | Very high |
| info | Normal operations | Moderate |
| warn | Potential issues | Low |
| error | Actual problems | Very low |
Structured Logging¶
JSON format enables parsing:
```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "Request completed",
  "endpoint": "local-ollama",
  "method": "POST",
  "path": "/v1/chat/completions",
  "status": 200,
  "duration_ms": 125,
  "request_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
Prometheus Metrics¶
Exporting Metrics¶
While Olla doesn't have built-in Prometheus support, you can scrape the status endpoint:
```python
# prometheus_exporter.py
import time

import requests
from prometheus_client import Gauge, start_http_server

# Define metrics
requests_total = Gauge('olla_requests_total', 'Total requests')
errors_total = Gauge('olla_errors_total', 'Total errors')
endpoints_healthy = Gauge('olla_endpoints_healthy', 'Healthy endpoints')
response_time_p50 = Gauge('olla_response_time_p50', 'P50 response time')


def collect_metrics():
    while True:
        try:
            resp = requests.get('http://localhost:40114/internal/status', timeout=5)
            data = resp.json()

            requests_total.set(data['system']['total_requests'])
            errors_total.set(data['system']['total_failures'])
            endpoints_healthy.set(
                len([e for e in data['endpoints'] if e['status'] == 'healthy'])
            )
            # Note: percentile latencies are not available from the status endpoint
        except Exception:
            pass  # keep the exporter alive if Olla is temporarily unreachable

        time.sleep(15)


if __name__ == '__main__':
    start_http_server(8000)  # expose metrics on http://localhost:8000/metrics
    collect_metrics()
```
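A matching scrape job for this exporter might look like the following sketch (the job name is an arbitrary choice; port 8000 matches start_http_server above):

```yaml
# prometheus.yml (snippet)
scrape_configs:
  - job_name: 'olla'          # arbitrary job name
    scrape_interval: 15s      # matches the exporter's collection interval
    static_configs:
      - targets: ['localhost:8000']   # where start_http_server(8000) listens
```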
Grafana Dashboard¶
Key panels for Grafana:
- Request Rate: rate(olla_requests_total[5m])
- Error Rate: rate(olla_errors_total[5m])
- Latency: olla_response_time_p50, p95, p99
- Endpoint Health: olla_endpoints_healthy
- Success Rate: 1 - (rate(errors) / rate(requests))
Health Monitoring¶
Endpoint Health¶
Monitor individual endpoint health:
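```bash
curl -s http://localhost:40114/internal/status/endpoints | jq .
```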
Track:
- Health status per endpoint
- Last check time
- Failure counts
- Circuit breaker state
Circuit Breaker Monitoring¶
Circuit breaker state affects endpoint status. When tripped, endpoints show as unhealthy in the status response.
Alert when:
- Circuit opens (immediate)
- Circuit remains open > 5 minutes
- Multiple circuits open simultaneously
Performance Monitoring¶
Latency Tracking¶
Monitor latency at different levels:
- Olla Overhead: Time added by proxy
- Backend Latency: Time spent at backend
- Network Latency: Connection time
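Comparing the total time measured by the client with the time reported in the X-Olla-Response-Time header (see Response Headers above) gives a rough sense of how much time is spent outside Olla's own measurement; a sketch:

```bash
# Compare client-measured total time with the time Olla reports for the request
curl -s -o /dev/null -D - -w 'client total: %{time_total}s\n' \
  http://localhost:40114/olla/ollama/v1/models | grep -i -e 'x-olla-response-time' -e 'client total'
```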
Throughput Monitoring¶
Track requests per second and current concurrency:

```bash
# Active connections
curl -s http://localhost:40114/internal/status | jq '.system.active_connections'
```
Historical tracking:
- Peak throughput times
- Average vs peak ratios
- Throughput per endpoint
Resource Usage¶
Monitor system resources:
```bash
# Memory usage (RSS in KB, assuming a single olla process)
ps -o rss= -p "$(pgrep olla)"

# CPU usage (single batch-mode sample)
top -b -n 1 -p "$(pgrep olla)"

# Connection count
netstat -an | grep 40114 | wc -l
```
Alerting Strategy¶
Critical Alerts (Page immediately)¶
- Service down (health check fails)
- All endpoints unhealthy
- Error rate > 5%
- P99 latency > 10s
Warning Alerts (Notify team)¶
- Single endpoint unhealthy
- Error rate > 1%
- P95 latency > 5s
- Memory usage > 80%
Info Alerts (Log for review)¶
- Circuit breaker trips
- Rate limit violations
- Configuration reloads
- Model discovery failures
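If you run the Prometheus exporter sketched earlier, some of these conditions can be encoded as alert rules; an illustrative snippet (rule and alert names are arbitrary, metric names are those defined by the exporter):

```yaml
# alert-rules.yml (illustrative)
groups:
  - name: olla
    rules:
      - alert: OllaNoHealthyEndpoints
        expr: olla_endpoints_healthy == 0
        for: 1m
        labels:
          severity: critical
      - alert: OllaHighErrorRate
        expr: rate(olla_errors_total[5m]) / rate(olla_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
```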
Log Aggregation¶
ELK Stack Integration¶
Ship logs to Elasticsearch:
```yaml
# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: "olla"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
```
Useful Queries¶
Elasticsearch queries for analysis:
```json
// Error rate by endpoint
{
  "query": {
    "bool": {
      "must": [
        {"term": {"level": "error"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "aggs": {
    "by_endpoint": {
      "terms": {"field": "endpoint.keyword"}
    }
  }
}

// Slow requests
{
  "query": {
    "range": {
      "duration_ms": {"gte": 1000}
    }
  }
}
```
Distributed Tracing¶
Request Correlation¶
Use request IDs for tracing:
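For example, capture the ID from a response header and pull every matching log line (log path as configured in the Logging section):

```bash
# Grab the request ID from a response, then pull the matching log entries
REQUEST_ID=$(curl -si http://localhost:40114/olla/ollama/v1/models | \
  grep -i 'x-olla-request-id' | awk '{print $2}' | tr -d '\r')

grep "$REQUEST_ID" /var/log/olla.log | jq .
```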
Track requests across:
- Olla proxy
- Backend services
- Frontend applications
OpenTelemetry Integration¶
While not built-in, you can add tracing:
```go
// Wrap handlers with OpenTelemetry
import "go.opentelemetry.io/otel"

tracer := otel.Tracer("olla")
ctx, span := tracer.Start(ctx, "proxy_request")
defer span.End()
```
Capacity Planning¶
Metrics for Planning¶
Track over time:
- Request Growth: Month-over-month increase
- Peak Traffic: Daily/weekly patterns
- Resource Usage: CPU/memory trends
- Response Times: Degradation patterns
Scaling Indicators¶
Scale when:
- CPU > 70% sustained
- Memory > 80% sustained
- P95 latency increasing
- Error rate increasing
- Queue depth growing
Monitoring Tools¶
Command-Line Monitoring¶
Quick status checks:
```bash
# Watch status in real-time
watch -n 5 'curl -s http://localhost:40114/internal/status | jq .'

# Monitor logs
tail -f /var/log/olla.log | jq '.'

# Track error messages (structured logs use a lowercase "level" field)
jq -r 'select(.level == "error") | .msg' /var/log/olla.log
```
Monitoring Scripts¶
Health check script:
```bash
#!/bin/bash
# check_olla.sh

STATUS=$(curl -s http://localhost:40114/internal/health | jq -r '.status')

if [ "$STATUS" != "healthy" ]; then
    echo "CRITICAL: Olla is unhealthy"
    exit 2
fi

ENDPOINTS=$(curl -s http://localhost:40114/internal/status | jq '[.endpoints[] | select(.status == "healthy")] | length')
TOTAL=$(curl -s http://localhost:40114/internal/status | jq '.endpoints | length')

if [ "$ENDPOINTS" -lt "$TOTAL" ]; then
    echo "WARNING: Only $ENDPOINTS/$TOTAL endpoints healthy"
    exit 1
fi

echo "OK: Olla healthy with $ENDPOINTS endpoints"
exit 0
```
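The exit codes (0 = OK, 1 = warning, 2 = critical) follow the common Nagios plugin convention, so the script can be dropped into most check schedulers or run from cron.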
Troubleshooting with Monitoring¶
High Latency Investigation¶
- Check P50 vs P99 spread
- Identify slow endpoints
- Review circuit breaker states
- Check backend health directly
- Analyse request patterns
Error Spike Investigation¶
- Check error types (4xx vs 5xx)
- Identify affected endpoints
- Review recent changes
- Check rate limit violations
- Analyse request logs
Monitoring Checklist¶
Production monitoring setup:
- Health endpoint monitoring
- Status endpoint collection
- Log aggregation configured
- Alert rules defined
- Dashboard created
- Request ID correlation
- Error rate tracking
- Latency percentiles
- Resource monitoring
- Circuit breaker alerts
- Capacity planning metrics
Next Steps¶
- Security Best Practices - Security monitoring
- Performance Tuning - Performance metrics
- Configuration Reference - Monitoring configuration