llama.cpp API¶
Proxy endpoints for llama.cpp inference servers. llama.cpp is a high-performance C++ inference engine for GGUF models, offering both OpenAI-compatible and native endpoints.
Available through the /olla/llamacpp/ (default), /olla/llama-cpp/ (disabled), and /olla/llama_cpp/ (disabled) prefixes.
Key Features:
- OpenAI Compatibility: Full OpenAI API compatibility for drop-in replacements
- GGUF Format: Exclusive support for GGUF quantized models (Q2 to F32)
- Single Model Architecture: Dedicated resources for one model per instance
- CPU Inference: Full-featured inference without GPU requirements
- Code Infill: Fill-In-the-Middle (FIM) support for IDE integration
- Tokenization API: Direct access to model tokenizer
For integration guides and configuration examples, see the llama.cpp Integration Guide.
Compatibility with mainline llama.cpp
Primary development targets the mainline llama.cpp server and has been tested against forks such as ik_llama, but we may not support the wider range of forks yet.
Endpoints Overview¶
The following inference endpoints are available through the Olla proxy:
Method | URI | Description |
---|---|---|
GET | /olla/llamacpp/v1/models | List available models (OpenAI) |
POST | /olla/llamacpp/completion | Native completion (llamacpp format) |
POST | /olla/llamacpp/v1/completions | Text completion (OpenAI) |
POST | /olla/llamacpp/v1/chat/completions | Chat completion (OpenAI) |
POST | /olla/llamacpp/embedding | Native embedding (llamacpp format) |
POST | /olla/llamacpp/v1/embeddings | Generate embeddings (OpenAI) |
POST | /olla/llamacpp/tokenize | Tokenize text (llamacpp-specific) |
POST | /olla/llamacpp/detokenize | Detokenize tokens (llamacpp-specific) |
POST | /olla/llamacpp/infill | Code infill/FIM (llamacpp-specific) |
Base URL & Authentication¶
Base URL: http://localhost:40114/olla/llamacpp
Alternative: http://localhost:40114/olla/llama-cpp
Alternative: http://localhost:40114/olla/llama_cpp
Authentication: Not required (or API key if configured)
All three routing prefixes are functionally equivalent and route to the same llama.cpp endpoints; only /olla/llamacpp/ is enabled by default.
Model Management¶
GET /olla/llamacpp/v1/models¶
List models available on the llama.cpp server (OpenAI-compatible).
Note: llama.cpp typically serves a single model per instance, so the response contains one entry: the GGUF model loaded at server startup.
Request¶
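A plain GET with no request body is sufficient:
curl http://localhost:40114/olla/llamacpp/v1/models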
Response¶
{
"object": "list",
"data": [
{
"id": "llama-3.1-8b-instruct-q4_k_m.gguf",
"object": "model",
"created": 1704067200,
"owned_by": "meta-llama"
}
]
}
Text Generation¶
POST /olla/llamacpp/completion¶
Native llama.cpp completion endpoint.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"n_predict": 200,
"temperature": 0.7,
"top_k": 40,
"top_p": 0.9,
"repeat_penalty": 1.1,
"stream": false
}'
Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
prompt | string | required | Input text to complete |
n_predict | integer | 512 | Maximum tokens to generate |
temperature | float | 0.8 | Sampling temperature (0.0-2.0) |
top_k | integer | 40 | Top-k sampling |
top_p | float | 0.95 | Top-p (nucleus) sampling |
min_p | float | 0.05 | Minimum probability threshold |
repeat_penalty | float | 1.1 | Repetition penalty (1.0 = no penalty) |
repeat_last_n | integer | 64 | Last n tokens to penalize |
penalize_nl | boolean | true | Penalize newline tokens |
stop | array | [] | Stop sequences |
stream | boolean | false | Enable streaming response |
seed | integer | -1 | Random seed (-1 = random) |
grammar | string | "" | GBNF grammar for constrained output |
Response¶
{
"content": "The future of AI is incredibly promising and transformative. We're seeing rapid advances in natural language processing, computer vision, and autonomous systems. Machine learning models are becoming more capable, efficient, and accessible. Key trends include:\n\n1. Multimodal AI that can process text, images, audio, and video\n2. Smaller, more efficient models that run on edge devices\n3. Enhanced reasoning and problem-solving capabilities\n4. Better alignment with human values and safety\n5. Integration into everyday tools and workflows\n\nThese developments will revolutionise healthcare, education, scientific research, and countless other fields.",
"stop": true,
"generation_settings": {
"n_ctx": 4096,
"n_predict": 200,
"model": "llama-3.1-8b-instruct-q4_k_m.gguf",
"seed": 1234567890,
"temperature": 0.7,
"top_k": 40,
"top_p": 0.9,
"repeat_penalty": 1.1
},
"model": "llama-3.1-8b-instruct-q4_k_m.gguf",
"prompt": "The future of AI is",
"stopped_eos": true,
"stopped_word": false,
"stopped_limit": false,
"stopping_word": "",
"tokens_predicted": 142,
"tokens_evaluated": 5,
"truncated": false,
"timings": {
"prompt_n": 5,
"prompt_ms": 45.2,
"prompt_per_token_ms": 9.04,
"prompt_per_second": 110.6,
"predicted_n": 142,
"predicted_ms": 1850.5,
"predicted_per_token_ms": 13.03,
"predicted_per_second": 76.7
}
}
Streaming Response¶
When "stream": true
, responses are sent as Server-Sent Events (SSE):
data: {"content":"The","stop":false}
data: {"content":" future","stop":false}
data: {"content":" of","stop":false}
...
data: {"content":"","stop":true,"stopped_eos":true,"timings":{...}}
POST /olla/llamacpp/v1/completions¶
OpenAI-compatible text completion.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b-instruct-q4_k_m.gguf",
"prompt": "Explain GGUF quantization benefits:",
"max_tokens": 200,
"temperature": 0.7,
"top_p": 0.9,
"stream": false
}'
Response¶
{
"id": "cmpl-llamacpp-abc123",
"object": "text_completion",
"created": 1704067200,
"model": "llama-3.1-8b-instruct-q4_k_m.gguf",
"choices": [
{
"text": "\n\n1. **Reduced Memory Footprint**: GGUF quantization compresses model weights from 16-bit (F16) to 4-bit (Q4), reducing memory requirements by approximately 75%. This allows running larger models on consumer hardware.\n\n2. **Faster Inference**: Lower precision arithmetic operations are faster on CPUs and GPUs, improving inference speed by 2-3x compared to full precision.\n\n3. **Maintained Quality**: Q4_K_M quantization carefully preserves model accuracy, with minimal quality degradation compared to F16 models.\n\n4. **Format Standardisation**: GGUF provides a universal format for quantized models, ensuring compatibility across llama.cpp, Ollama, and other inference engines.\n\n5. **Flexible Quantization**: Multiple quantization levels (Q2 to Q8) allow trading memory/speed for quality based on deployment requirements.",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 145,
"total_tokens": 153
}
}
Streaming Response¶
When "stream": true
:
data: {"id":"cmpl-llamacpp-abc123","object":"text_completion","created":1704067200,"choices":[{"text":"\n\n","index":0,"logprobs":null,"finish_reason":null}],"model":"llama-3.1-8b-instruct-q4_k_m.gguf"}
data: {"id":"cmpl-llamacpp-abc123","object":"text_completion","created":1704067200,"choices":[{"text":"1","index":0,"logprobs":null,"finish_reason":null}],"model":"llama-3.1-8b-instruct-q4_k_m.gguf"}
...
data: {"id":"cmpl-llamacpp-abc123","object":"text_completion","created":1704067201,"choices":[{"text":"","index":0,"logprobs":null,"finish_reason":"stop"}],"model":"llama-3.1-8b-instruct-q4_k_m.gguf","usage":{"prompt_tokens":8,"completion_tokens":145,"total_tokens":153}}
data: [DONE]
POST /olla/llamacpp/v1/chat/completions¶
OpenAI-compatible chat completion with conversation history.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b-instruct-q4_k_m.gguf",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant specialising in efficient LLM deployment."
},
{
"role": "user",
"content": "What are the best practices for deploying llama.cpp in production?"
}
],
"temperature": 0.7,
"max_tokens": 300,
"stream": false
}'
Response¶
{
"id": "chatcmpl-llamacpp-xyz789",
"object": "chat.completion",
"created": 1704067200,
"model": "llama-3.1-8b-instruct-q4_k_m.gguf",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here are key best practices for deploying llama.cpp in production:\n\n**1. Model Selection & Quantization**\n- Use Q4_K_M or Q5_K_M quantization for optimal quality/performance balance\n- Match model size to available hardware (8B models for 8-16GB RAM)\n- Test quantization levels against your quality requirements\n\n**2. Slot Configuration**\n- Configure slots based on expected concurrency (typically 4-8 slots)\n- Monitor slot usage via /slots endpoint\n- Implement queue management for slot exhaustion scenarios\n\n**3. Resource Management**\n- Allocate sufficient context window size (n_ctx) for your use case\n- Enable GPU acceleration (CUDA/Metal) when available\n- Consider CPU-only deployment for edge/serverless environments\n\n**4. Monitoring & Observability**\n- Integrate Prometheus metrics endpoint for monitoring\n- Track TTFT, throughput, and slot utilisation\n- Set up alerts for slot exhaustion and high latency\n\n**5. High Availability**\n- Deploy multiple llama.cpp instances behind Olla proxy\n- Use health checks for automatic failover\n- Implement request retries with exponential backoff\n\n**6. Security**\n- Deploy behind reverse proxy with authentication\n- Bind to localhost for local-only access\n- Implement rate limiting to prevent abuse"
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 45,
"completion_tokens": 245,
"total_tokens": 290
}
}
Streaming Response¶
When "stream": true
:
data: {"id":"chatcmpl-llamacpp-xyz789","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b-instruct-q4_k_m.gguf","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-llamacpp-xyz789","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b-instruct-q4_k_m.gguf","choices":[{"index":0,"delta":{"content":"Here"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-llamacpp-xyz789","object":"chat.completion.chunk","created":1704067200,"model":"llama-3.1-8b-instruct-q4_k_m.gguf","choices":[{"index":0,"delta":{"content":" are"},"logprobs":null,"finish_reason":null}]}
...
data: {"id":"chatcmpl-llamacpp-xyz789","object":"chat.completion.chunk","created":1704067201,"model":"llama-3.1-8b-instruct-q4_k_m.gguf","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
data: [DONE]
Embeddings¶
POST /olla/llamacpp/embedding¶
Native llama.cpp embedding endpoint.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/embedding \
-H "Content-Type: application/json" \
-d '{
"content": "llama.cpp enables efficient LLM inference with GGUF quantization"
}'
Response¶
{
"embedding": [0.0234, -0.0567, 0.0891, -0.1203, 0.0456, ...],
"model": "nomic-embed-text-v1.5.Q4_K_M.gguf"
}
Note: Requires an embedding model (e.g., nomic-embed-text, bge-large).
POST /olla/llamacpp/v1/embeddings¶
OpenAI-compatible embeddings endpoint.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1.5.Q4_K_M.gguf",
"input": "llama.cpp enables efficient LLM inference with GGUF quantization",
"encoding_format": "float"
}'
Response¶
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0234, -0.0567, 0.0891, -0.1203, 0.0456, ...]
}
],
"model": "nomic-embed-text-v1.5.Q4_K_M.gguf",
"usage": {
"prompt_tokens": 12,
"total_tokens": 12
}
}
Batch Embeddings¶
The input parameter supports arrays for batch processing:
curl -X POST http://localhost:40114/olla/llamacpp/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1.5.Q4_K_M.gguf",
"input": [
"First document to embed",
"Second document to embed",
"Third document to embed"
]
}'
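Each input is returned as its own object in data, matched by index. A quick way to confirm the per-input results with jq (the reported vector length depends on the embedding model):
curl -s -X POST http://localhost:40114/olla/llamacpp/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text-v1.5.Q4_K_M.gguf",
    "input": ["First document to embed", "Second document to embed"]
  }' | jq '.data[] | {index, dimensions: (.embedding | length)}'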
Tokenization (llamacpp-specific)¶
POST /olla/llamacpp/tokenize¶
Encode text to token IDs using the model's tokenizer.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/tokenize \
-H "Content-Type: application/json" \
-d '{
"content": "Hello, world! This is a test."
}'
Response¶
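A representative response; the token IDs shown are the same ones used in the detokenize example below and will vary with the loaded model's tokenizer:
{
  "tokens": [9906, 11, 1917, 0, 1115, 374, 264, 1296, 13]
}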
Use Cases:
- Token counting for context management
- Custom prompt engineering and optimization
- Token-level analysis and debugging
- Billing calculations based on token usage
POST /olla/llamacpp/detokenize¶
Decode token IDs back to text.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/detokenize \
-H "Content-Type: application/json" \
-d '{
"tokens": [9906, 11, 1917, 0, 1115, 374, 264, 1296, 13]
}'
Response¶
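A representative response; the native detokenize endpoint returns the reconstructed text in a content field (leading whitespace may differ slightly depending on the tokenizer):
{
  "content": "Hello, world! This is a test."
}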
Code Infill (llamacpp-specific)¶
POST /olla/llamacpp/infill¶
Fill-In-the-Middle (FIM) code completion for IDE integration.
Request¶
curl -X POST http://localhost:40114/olla/llamacpp/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def fibonacci(n):\n \"\"\"Calculate fibonacci number\"\"\"\n if n <= 1:\n return n\n ",
"input_suffix": "\n return result",
"n_predict": 100,
"temperature": 0.2,
"stop": ["\n\n"]
}'
Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
input_prefix | string | required | Code before the cursor/insertion point |
input_suffix | string | "" | Code after the cursor/insertion point |
n_predict | integer | 512 | Maximum tokens to generate |
temperature | float | 0.8 | Sampling temperature (lower for code) |
top_k | integer | 40 | Top-k sampling |
top_p | float | 0.95 | Top-p sampling |
stop | array | [] | Stop sequences (e.g., ["\n\n"]) |
Response¶
{
"content": "else:\n return fibonacci(n-1) + fibonacci(n-2)",
"stop": true,
"model": "deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
"stopped_word": true,
"stopping_word": "\n\n",
"tokens_predicted": 24,
"tokens_evaluated": 45,
"truncated": false,
"timings": {
"prompt_n": 45,
"prompt_ms": 125.8,
"predicted_n": 24,
"predicted_ms": 310.5,
"predicted_per_token_ms": 12.94,
"predicted_per_second": 77.3
}
}
Supported Models:
- CodeLlama (code-instruct variants)
- StarCoder / StarCoder2
- DeepSeek-Coder
- WizardCoder
- Phind-CodeLlama
Use Cases:
- IDE code completion
- In-line code generation
- Code refactoring suggestions
- Docstring generation
- Test case generation
Response Headers¶
All responses from Olla include these headers:
- X-Olla-Backend-Type: llamacpp - Identifies the backend type
- X-Olla-Endpoint: <name> - Backend endpoint name (e.g., "llamacpp-server")
- X-Olla-Model: <model> - GGUF model used for the request
- X-Olla-Request-Id: <id> - Unique request identifier
- X-Olla-Response-Time: <ms> - Total processing time in milliseconds
- Via: 1.1 olla/<version> - Olla proxy version
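To inspect these headers from the command line, dump just the response headers with curl (header values will reflect your deployment):
# -D - writes response headers to stdout; the body is discarded
curl -s -o /dev/null -D - http://localhost:40114/olla/llamacpp/v1/models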
Configuration Example¶
endpoints:
- url: "http://192.168.0.100:8080"
name: "llamacpp-llama-8b"
type: "llamacpp"
priority: 95
# Profile handles health checks and model discovery
headers:
X-API-Key: "${LLAMACPP_API_KEY}"
Multi-Instance Setup¶
llama.cpp serves one model per instance. For multiple models, run multiple instances:
endpoints:
# Instance 1: Chat model
- url: "http://192.168.0.100:8080"
name: "llamacpp-chat"
type: "llamacpp"
priority: 90
# Instance 2: Code model
- url: "http://192.168.0.101:8080"
name: "llamacpp-code"
type: "llamacpp"
priority: 85
# Instance 3: Embedding model
- url: "http://192.168.0.102:8080"
name: "llamacpp-embed"
type: "llamacpp"
priority: 80
Olla will automatically route requests to the appropriate instance based on the model name.
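For example, a request naming the code model lands on the llamacpp-code instance (the model name below is illustrative; use whichever GGUF file that instance has loaded):
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    "messages": [
      {"role": "user", "content": "Write a function that reverses a string"}
    ]
  }'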
Error Responses¶
503 Service Unavailable (All Slots Full)¶
When all processing slots are busy:
{
"error": {
"message": "All slots are busy. Please try again later.",
"type": "server_error",
"code": "slots_exhausted"
}
}
404 Model Not Found¶
When requesting a non-existent model:
{
"error": {
"message": "Model not found: unknown-model.gguf",
"type": "invalid_request_error",
"code": "model_not_found"
}
}
400 Bad Request (Invalid Parameters)¶
When request parameters are invalid:
{
"error": {
"message": "Invalid parameter: temperature must be between 0.0 and 2.0",
"type": "invalid_request_error",
"code": "invalid_parameter"
}
}
Examples¶
Python with OpenAI SDK¶
from openai import OpenAI
# Configure OpenAI SDK to use Olla proxy
client = OpenAI(
base_url="http://localhost:40114/olla/llamacpp/v1",
api_key="not-needed"
)
# Chat completion
response = client.chat.completions.create(
model="llama-3.1-8b-instruct-q4_k_m.gguf",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain GGUF quantization"}
],
temperature=0.7,
max_tokens=200
)
print(response.choices[0].message.content)
# Streaming chat completion
stream = client.chat.completions.create(
model="llama-3.1-8b-instruct-q4_k_m.gguf",
messages=[
{"role": "user", "content": "Count to 10 slowly"}
],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
# Embeddings
embeddings_response = client.embeddings.create(
model="nomic-embed-text-v1.5.Q4_K_M.gguf",
input="Text to embed"
)
print(embeddings_response.data[0].embedding)
Code Infill for IDE Integration¶
# Python function completion
curl -X POST http://localhost:40114/olla/llamacpp/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def calculate_statistics(data: list) -> dict:\n \"\"\"Calculate mean, median, and mode\"\"\"\n ",
"input_suffix": "\n return stats",
"n_predict": 150,
"temperature": 0.2,
"stop": ["\n\n", "def "]
}'
# JavaScript function completion
curl -X POST http://localhost:40114/olla/llamacpp/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "async function fetchUserData(userId) {\n try {\n ",
"input_suffix": "\n } catch (error) {\n console.error(error);\n }\n}",
"n_predict": 100,
"temperature": 0.2
}'
Token Counting for Context Management¶
# Count tokens in prompt
PROMPT_TEXT="Explain GGUF quantization benefits"
TOKEN_RESPONSE=$(curl -s -X POST http://localhost:40114/olla/llamacpp/tokenize \
  -H "Content-Type: application/json" \
  -d "{\"content\": \"$PROMPT_TEXT\"}")
TOKEN_COUNT=$(echo "$TOKEN_RESPONSE" | jq '.tokens | length')
# Check against context window: leave room for the completion
MAX_CONTEXT=4096
MAX_COMPLETION=512
MAX_PROMPT=$((MAX_CONTEXT - MAX_COMPLETION))
if [ "$TOKEN_COUNT" -gt "$MAX_PROMPT" ]; then
  echo "Prompt too long: $TOKEN_COUNT tokens (max: $MAX_PROMPT)"
  exit 1
fi
echo "Prompt size: $TOKEN_COUNT tokens, available for completion: $((MAX_CONTEXT - TOKEN_COUNT))"
llamacpp-Specific Features¶
Slot Management¶
llama.cpp uses a slot-based architecture for concurrency control:
- Processing Slots: Fixed number of concurrent request handlers
- State Tracking: Real-time visibility into slot usage (via direct backend access)
- Queue Management: Requests wait when all slots are busy
- Capacity Planning: Monitor slot utilisation to scale infrastructure
Note: Slot monitoring is not available through Olla proxy endpoints. Access slot status directly from the llama.cpp backend at http://backend:8080/slots or use Olla's internal monitoring endpoints.
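A minimal check against the backend's native slots endpoint, assuming direct network access to the llama.cpp host (shown here as backend:8080); jq simply counts the configured slots:
curl -s http://backend:8080/slots | jq 'length'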
GGUF Quantization Levels¶
llama.cpp exclusively supports GGUF format with extensive quantization options:
Quantization | Bits per Weight (BPW) | Memory (vs F16) | Quality | Use Case |
---|---|---|---|---|
Q2_K | 2.63 | ~35% | Lower | Extreme compression |
Q3_K_M | 3.91 | ~45% | Moderate | Balanced small models |
Q4_K_M | 4.85 | ~50% | Good | Recommended default |
Q5_K_M | 5.69 | ~62% | High | Quality-focused |
Q6_K | 6.59 | ~75% | Very High | Near-original quality |
Q8_0 | 8.50 | ~87% | Excellent | High-fidelity |
F16 | 16.0 | 100% | Original | Baseline |
Recommendation: Q4_K_M provides the best quality/performance/memory balance for most use cases.
Grammar Support¶
llama.cpp supports GBNF (GGML BNF) grammars for constrained generation:
curl -X POST http://localhost:40114/olla/llamacpp/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Generate a JSON object with name and age:",
"n_predict": 100,
"temperature": 0.7,
"grammar": "root ::= \"{\" ws \"\\\"name\\\":\" ws string \",\" ws \"\\\"age\\\":\" ws number ws \"}\" ws\nstring ::= \"\\\"\" [^\"]* \"\\\"\"\nnumber ::= [0-9]+\nws ::= [ \\t\\n]*"
}'
This ensures output conforms to the specified grammar structure.
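For readability, the same grammar with the JSON string escaping removed:
root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws number ws "}" ws
string ::= "\"" [^"]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*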
Performance Characteristics¶
Memory Requirements (Q4_K_M Quantization)¶
Model Size | RAM (CPU) | VRAM (GPU) | Typical Slots |
---|---|---|---|
1B-3B | 2-4 GB | 2-4 GB | 8 |
7B-8B | 6-8 GB | 6-8 GB | 4 |
13B-14B | 10-16 GB | 10-16 GB | 2-4 |
30B-34B | 20-24 GB | 20-24 GB | 1-2 |
70B+ | 40-48 GB | 40-48 GB | 1 |
Hardware Backend Support¶
llama.cpp supports multiple hardware acceleration backends:
- CPU: Full functionality without GPU (AVX2, AVX512, NEON)
- CUDA: NVIDIA GPUs (compute capability 6.0+)
- Metal: Apple Silicon (M1/M2/M3/M4)
- Vulkan: Cross-platform GPU acceleration
- SYCL: Intel GPUs
- ROCm: AMD GPUs
Throughput Benchmarks¶
Typical performance on consumer hardware (Q4_K_M, batch size 1):
CPU-only (Ryzen 9 5950X):
- 7B model: 15-25 tokens/sec
- 13B model: 8-12 tokens/sec
GPU-accelerated (RTX 4090):
- 7B model: 80-120 tokens/sec
- 13B model: 50-80 tokens/sec
- 70B model: 15-25 tokens/sec
Apple Silicon (M3 Max):
- 7B model: 60-90 tokens/sec
- 13B model: 35-55 tokens/sec
Best Practices¶
Production Deployment¶
- Use Q4_K_M or Q5_K_M quantization for optimal quality/performance balance
- Configure slots based on expected concurrency (typically 4-8 slots; see the launch example after this list)
- Monitor performance via Olla's internal endpoints or direct backend access for capacity planning
- Deploy multiple instances for different model types (chat, code, embeddings)
- Enable GPU acceleration when available for higher throughput
- Set appropriate context window (n_ctx) based on use case requirements
- Implement health checks and automatic failover via Olla proxy
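As a reference point, a llama-server launch covering several of these settings; flag names are taken from recent llama.cpp builds and should be checked against llama-server --help for your version:
# 4 parallel slots, 4096-token total context (typically shared across slots),
# all layers offloaded to GPU, bound to localhost
llama-server -m llama-3.1-8b-instruct-q4_k_m.gguf \
  -c 4096 \
  -np 4 \
  -ngl 99 \
  --host 127.0.0.1 \
  --port 8080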
Slot Management¶
- Monitor slot exhaustion and scale when consistently at capacity
- Implement client-side retries with exponential backoff for 503 errors (see the sketch after this list)
- Use Olla load balancing to distribute load across multiple instances
- Set reasonable timeout values to prevent stuck slots
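A minimal client-side retry sketch for 503 responses, using the chat model from earlier examples; the attempt count and delays are illustrative and should be tuned to your traffic:
# Retry on 503 (all slots busy) with exponential backoff
URL="http://localhost:40114/olla/llamacpp/v1/chat/completions"
BODY='{"model": "llama-3.1-8b-instruct-q4_k_m.gguf", "messages": [{"role": "user", "content": "Hello"}]}'
DELAY=1
for ATTEMPT in 1 2 3 4 5; do
  STATUS=$(curl -s -o response.json -w "%{http_code}" -X POST "$URL" \
    -H "Content-Type: application/json" -d "$BODY")
  if [ "$STATUS" != "503" ]; then
    break
  fi
  echo "All slots busy (attempt $ATTEMPT), retrying in ${DELAY}s..." >&2
  sleep "$DELAY"
  DELAY=$((DELAY * 2))
done
cat response.json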
Security¶
- Bind to localhost (127.0.0.1) for local-only access
- Use reverse proxy (nginx/caddy) with authentication for external access
- Implement rate limiting at the reverse proxy level
- Monitor metrics for unusual usage patterns
- Restrict file system access if model loading is dynamic
Performance Optimization¶
- Choose appropriate quantization level based on quality requirements
- Enable Flash Attention if supported by hardware
- Tune slot count based on available memory and expected load
- Use batch processing for embeddings when possible
- Monitor TTFT and throughput metrics for optimization opportunities
Troubleshooting¶
All Slots Busy (503 Errors)¶
Symptoms: Requests return 503 "All slots are busy"
Solutions:
- Increase slot count in llama.cpp server configuration
- Deploy additional llama.cpp instances behind Olla
- Implement client-side retry logic with exponential backoff
- Monitor slot usage patterns and scale accordingly
High Memory Usage¶
Symptoms: OOM errors, system slowdown
Solutions:
- Use lower quantization level (Q4 instead of Q8)
- Reduce context window size (n_ctx)
- Decrease slot count
- Upgrade to system with more RAM
- Enable GPU offloading to move workload to VRAM
Slow Inference Speed¶
Symptoms: Low tokens/second, high latency
Solutions:
- Enable GPU acceleration (CUDA/Metal)
- Use lower quantization for faster inference (Q4 vs Q8)
- Reduce batch size if using continuous batching
- Check CPU/GPU utilisation and thermal throttling
- Upgrade hardware if consistently CPU/GPU-bound
Model Loading Failures¶
Symptoms: Server fails to start or load model
Solutions:
- Verify GGUF file integrity and format
- Ensure sufficient memory for model size
- Check file permissions and paths
- Validate quantization level is supported
- Review llama.cpp server logs for detailed errors