llama.cpp Integration¶
Home | github.com/ggml-org/llama.cpp, github.com/ikawrakow/ik_llama.cpp |
---|---|
Since | Olla v0.0.20 (previously since v0.0.1) |
Type | llamacpp (use in endpoint configuration) |
Profile | llamacpp.yaml (see latest) |
Features | |
Unsupported | |
Attributes | |
Prefixes | llamacpp |
Priority | 95 (high priority, between Ollama and LM Studio) |
Endpoints | See below |
Compatibility with mainline llama.cpp
Primary development targets the original llama.cpp server and has been tested against forks such as ik_llama.cpp; wider forks may not be fully supported yet.
Configuration¶
Basic Setup¶
Add llama.cpp to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95
        # Profile handles health checks and model discovery
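Once Olla is running, the backend is reachable through the llamacpp route prefix. A quick way to confirm the endpoint is registered (this is the same route used in the examples further down this page):
# List models via Olla to confirm the endpoint is wired up
curl http://localhost:40114/olla/llamacpp/v1/models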
Production Setup¶
Configure llama.cpp for production with proper timeouts:
discovery:
  static:
    endpoints:
      - url: "http://inference-server:8080"
        name: "llamacpp-prod"
        type: "llamacpp"
        priority: 95
        # Profile handles health checks and model discovery
proxy:
  engine: "olla"                # Use high-performance engine
  load_balancer: "round-robin"
Multiple Instances with Different Quantisations¶
Deploy multiple llama.cpp instances with different quantisation levels:
discovery:
  static:
    endpoints:
      # High quality Q8 instance
      - url: "http://gpu-server:8080"
        name: "llamacpp-q8-quality"
        type: "llamacpp"
        priority: 100
      # Balanced Q4 instance
      - url: "http://cpu-server:8081"
        name: "llamacpp-q4-balanced"
        type: "llamacpp"
        priority: 80
      # Fast Q2 instance for edge
      - url: "http://edge-device:8082"
        name: "llamacpp-q2-fast"
        type: "llamacpp"
        priority: 60
Endpoints Supported¶
The following 9 inference endpoints are proxied by the llama.cpp integration profile:
Note: Monitoring endpoints (/health, /props, /slots, /metrics) are not exposed through the Olla proxy as per architectural design. The /olla/* endpoints are strictly for inference requests. Health checks are handled internally by Olla, and monitoring should use Olla's /internal/* endpoints or direct backend access.
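For example, health and slot status can be read directly from the backend rather than through the proxy (host and port as in the Basic Setup above; /slots availability depends on the server's flags):
# Direct backend access for monitoring (not proxied by Olla)
curl http://localhost:8080/health
curl http://localhost:8080/slots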
Path | Description |
---|---|
/v1/models | List Models (OpenAI format) |
/completion | Native Completion Endpoint |
/v1/completions | Text Completions (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/embedding | Native Embedding Endpoint |
/v1/embeddings | Embeddings (OpenAI format) |
/tokenize | Encode Text to Tokens (llama.cpp-specific) |
/detokenize | Decode Tokens to Text (llama.cpp-specific) |
/infill | Code Infill/Completion (llama.cpp-specific, FIM) |
Usage Examples¶
Chat Completion¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-q4_k_m.gguf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"temperature": 0.7,
"max_tokens": 500
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b-instruct-q4_k_m.gguf",
"messages": [
{"role": "user", "content": "Write a story about a robot"}
],
"stream": true,
"temperature": 0.8
}'
Code Infill (FIM Support)¶
Code completion using Fill-In-the-Middle (llama.cpp-specific):
curl -X POST http://localhost:40114/olla/llamacpp/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def fibonacci(n):\n if n <= 1:\n return n\n ",
"input_suffix": "\n return result",
"temperature": 0.2,
"max_tokens": 100
}'
# Useful for IDE integrations like Continue.dev, Aider
Tokenisation¶
Encode and decode tokens using the model's tokeniser (llama.cpp-specific):
# Encode text to tokens
curl -X POST http://localhost:40114/olla/llamacpp/tokenize \
-H "Content-Type: application/json" \
-d '{
"content": "Hello, world!"
}'
# Response: {"tokens": [15496, 11, 1917, 0]}
# Decode tokens to text
curl -X POST http://localhost:40114/olla/llamacpp/detokenize \
-H "Content-Type: application/json" \
-d '{
"tokens": [15496, 11, 1917, 0]
}'
# Response: {"content": "Hello, world!"}
Embeddings¶
Generate embeddings for semantic search:
curl -X POST http://localhost:40114/olla/llamacpp/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1.5-q4_k_m.gguf",
"input": "The quick brown fox jumps over the lazy dog"
}'
List Models¶
# OpenAI format
curl http://localhost:40114/olla/llamacpp/v1/models
# Response typically shows single model (single-model architecture)
# {
# "object": "list",
# "data": [
# {
# "id": "llama-3.2-3b-instruct-q4_k_m.gguf",
# "object": "model",
# "created": 1704067200,
# "owned_by": "meta-llama"
# }
# ]
# }
llama.cpp Specifics¶
Single Model Architecture¶
llama.cpp serves one model per instance, loaded at startup:
- No Runtime Switching: Cannot change models without restart
- Model Discovery: Returns a single model in the /v1/models response
- Efficient Memory: All resources dedicated to one model
- Predictable Performance: No model switching overhead
This differs from Ollama (multi-model) and requires running multiple llama.cpp instances for multiple models.
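In practice that means one llama-server process per model, each on its own port, registered as separate Olla endpoints. A sketch using the model files referenced elsewhere on this page:
# Two instances, two models, two ports
./llama-server -m models/llama-3.2-3b-instruct-q4_k_m.gguf --port 8080 &
./llama-server -m models/mistral-7b-instruct-q4_k_m.gguf --port 8081 &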
Slot Management¶
llama.cpp uses slot-based concurrency for fine-grained control:
- Default Slots: 4 concurrent processing slots (configurable with --parallel)
- Explicit Control: Each slot handles one request at a time
- Queue Management: Additional requests queue when all slots are full
- Monitoring: Available via direct backend access at http://backend:8080/slots
- Capacity Planning: Adjust slots based on hardware and model size
Slot configuration example:
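(A sketch reusing the model path and flags shown elsewhere on this page; in recent llama.cpp server builds --ctx-size is shared across slots, so each slot sees roughly ctx_size / parallel tokens.)
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 16384 \
  --parallel 8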
Note: Slot monitoring is not available through Olla proxy paths. Access slot status directly from the llama.cpp backend or use Olla's internal monitoring endpoints.
CPU Inference Capabilities¶
llama.cpp is optimised for CPU-first deployment:
- No GPU Required: Full functionality on CPU-only systems
- ARM Support: Runs on Apple Silicon (M1/M2/M3), Raspberry Pi
- Edge Deployment: Suitable for IoT and embedded systems
- Portable: Pure C++ with minimal dependencies
CPU performance tips:
- Use Q4 quantisation for the best CPU performance/quality trade-off
- Allocate sufficient threads (--threads)
- Consider smaller models (3B-7B) for CPU deployment
Quantisation Options¶
llama.cpp provides extensive GGUF quantisation levels:
Quantisation | BPW (Bits Per Weight) | Memory vs F16 | Quality | Use Case |
---|---|---|---|---|
Q2_K | 2.63 | 35% | Low | Extreme compression, edge devices |
Q3_K_M | 3.91 | 45% | Moderate | Balanced compression |
Q4_K_M | 4.85 | 50% | Good | Recommended for most use cases |
Q5_K_M | 5.69 | 62.5% | High | Quality-focused deployments |
Q6_K | 6.59 | 75% | Very High | Near-original quality |
Q8_0 | 8.50 | 87.5% | Excellent | Production quality requirements |
F16 | 16 | 100% | Original | Baseline reference |
F32 | 32 | 200% | Perfect | Research, original weights |
Memory Requirements¶
Approximate memory requirements for different model sizes with Q4_K_M quantisation:
Model Size | Q4_K_M Memory | Q8_0 Memory | Recommended RAM | Max Slots (Typical) |
---|---|---|---|---|
1-3B | 2-3GB | 3-5GB | 4GB | 8 |
7-8B | 4-6GB | 8-12GB | 8GB | 4 |
13-14B | 8-10GB | 16-20GB | 16GB | 2 |
30-34B | 20GB | 40GB | 24GB | 2 |
70-72B | 40GB | 80GB | 48GB | 1 |
Note: Memory requirements increase with context length. Add ~1GB per 8K context tokens; for example, a 7-8B Q4_K_M model (~5GB) with a 32K context needs roughly 5GB + 4 × 1GB ≈ 9GB.
Starting llama.cpp Server¶
Basic Start¶
# Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Start server with model
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080
With Slots Configuration¶
# Configure 8 concurrent slots
./llama-server \
-m models/mistral-7b-instruct-q4_k_m.gguf \
--port 8080 \
--parallel 8 \
--threads 8
CPU-Only Example¶
Optimised for CPU inference:
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080 \
--threads 8 \
--ctx-size 8192 \
--parallel 4 \
--host 0.0.0.0
GPU Acceleration (CUDA)¶
For NVIDIA GPUs:
# Build with CUDA support
make GGML_CUDA=1
# Start with GPU offloading
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 35 \
--parallel 8
GPU Acceleration (Metal)¶
For Apple Silicon (M1/M2/M3):
# Build with Metal support (default on macOS)
make
# Start with GPU acceleration
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 35 \
--parallel 4
Docker Deployment¶
# Pull llama.cpp server image
docker pull ghcr.io/ggerganov/llama.cpp:server
# Run with model volume
docker run -d \
--name llamacpp \
-p 8080:8080 \
-v /path/to/models:/models \
ghcr.io/ggerganov/llama.cpp:server \
-m /models/llama-3.2-3b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--parallel 4
# With GPU support (NVIDIA)
docker run -d \
--gpus all \
--name llamacpp-gpu \
-p 8080:8080 \
-v /path/to/models:/models \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/mistral-7b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 35 \
--parallel 8
Profile Customisation¶
To customise llama.cpp behaviour, create config/profiles/llamacpp-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: llamacpp
version: "1.0"
# Add custom prefixes
routing:
  prefixes:
    - llamacpp
    - cpp    # Add custom prefix
    - local  # Add custom prefix
# Adjust for larger models and CPU inference
characteristics:
  timeout: 10m                  # Increase for large models
  max_concurrent_requests: 2    # Reduce for limited hardware
# Model capability detection
models:
  capability_patterns:
    code:
      - "*deepseek-coder*"
      - "*codellama*"
      - "*starcoder*"
      - "*phind-codellama*"
    embeddings:
      - "*embed*"
      - "*nomic*"
      - "*bge*"
  # Custom context patterns
  context_patterns:
    - pattern: "*-128k*"
      context: 131072
    - pattern: "*qwen2.5*"
      context: 32768
# Slot configuration
resources:
  slot_configuration:
    default_slots: 8       # Increase for more concurrency
    max_slots: 16
    slot_monitoring: true
  # Adjust concurrency for hardware
  concurrency_limits:
    - min_memory_gb: 20    # Large models
      max_concurrent: 1
    - min_memory_gb: 8     # Medium models
      max_concurrent: 4
    - min_memory_gb: 0     # Small models
      max_concurrent: 8
See Profile Configuration for complete customisation options.
Troubleshooting¶
Slot Exhaustion (504 Errors)¶
Issue: "all slots are busy" or 504 timeout errors
Solution:
- Increase parallel slots (see the sketch below)
- Add more llama.cpp instances and let Olla load balance across them
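A sketch of both fixes; the model path and second host name are illustrative, and the extra endpoint entry follows the same shape as the configuration examples above.
# Restart the backend with more slots
./llama-server \
  -m models/mistral-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --parallel 8
# Then register a second instance in Olla's discovery.static.endpoints list,
# e.g. url http://second-server:8080, type llamacpp, priority 95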
Model Loading Failures¶
Issue: Model fails to load or crashes at startup
Solution:
- Verify GGUF file integrity
- Check memory requirements against the table above
- Reduce context size (--ctx-size)
- Use a smaller quantisation (Q4 instead of Q8, as sketched below)
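A rough walkthrough of those steps, assuming the model paths used elsewhere on this page (the memory check is Linux-flavoured):
# 1. Confirm the GGUF file downloaded completely (size should match the source)
ls -lh models/llama-3.2-3b-instruct-q4_k_m.gguf
# 2. Check available memory against the requirements table above
free -h
# 3/4. Restart with a smaller context and a lighter quantisation
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 4096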
Memory Issues¶
Issue: Out of memory errors or system freezing
Solution:
- Use more aggressive quantisation (savings shown relative to F16, per the table above):
  - Q8 → Q5_K_M (37.5% memory saving)
  - Q5_K_M → Q4_K_M (50% memory saving)
  - Q4_K_M → Q3_K_M (55% memory saving)
- Reduce parallel slots (--parallel)
- Limit the context window (--ctx-size)
- Monitor memory usage (see the sketch below)
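One way to apply the last three items, using flags already shown on this page (the monitoring commands are Linux-specific):
# Restart with fewer slots and a smaller context window
./llama-server \
  -m models/mistral-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --parallel 2 \
  --ctx-size 4096
# Watch memory while the server runs
free -h
ps -o rss,cmd -C llama-server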
Connection Timeouts¶
Issue: Requests time out before completion
Solution:
- Increase the Olla proxy timeout
- Increase the llama.cpp timeout in the profile (characteristics.timeout in the customisation example above)
- Monitor performance via Olla's internal endpoints (see the sketch below)
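The exact internal routes are not listed on this page, so treat the paths below as assumptions and confirm them against the Olla monitoring documentation:
# Olla-side status (assumed /internal/ routes; not proxied to the backend)
curl http://localhost:40114/internal/health
curl http://localhost:40114/internal/status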
GGUF Format Incompatibility¶
Issue: "invalid model file" or version errors
Solution:
- Update llama.cpp to the latest version (see the sketch below)
- Re-download the model with a compatible GGUF version
- Verify the GGUF metadata
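A sketch of the update and inspection steps; the dump script's path and name vary between llama.cpp versions, so treat it as an assumption and check your checkout's gguf-py/scripts directory:
# Rebuild against the latest llama.cpp
git pull
make clean && make
# Inspect GGUF metadata
python3 gguf-py/scripts/gguf_dump.py models/llama-3.2-3b-instruct-q4_k_m.gguf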
Best Practices¶
1. Slot Configuration for Workload¶
Match slots to your workload pattern:
# High concurrency (many short requests)
resources:
  slot_configuration:
    default_slots: 8
    max_slots: 16
# Low concurrency (few long requests)
resources:
  slot_configuration:
    default_slots: 2
    max_slots: 4
2. Quantisation Selection Guide¶
Choose quantisation based on requirements:
Priority | Recommended Quantisation | Use Case |
---|---|---|
Quality First | Q8_0, Q6_K | Production, quality-critical |
Balanced | Q4_K_M | General purpose, recommended |
Speed/Memory | Q3_K_M, Q2_K | Edge devices, limited resources |
Research | F16, F32 | Benchmarking, development |
3. CPU vs GPU Deployment Decisions¶
Use CPU when:
- GPU not available (edge devices, workstations)
- Small models (1-7B with Q4 quantisation)
- Low concurrency requirements
- Cost-sensitive deployments
Use GPU when:
- GPU memory is available (8GB+)
- Large models (13B+)
- High throughput requirements
- Low latency is critical
4. Multiple Instance Patterns¶
Deploy multiple llama.cpp instances strategically:
# Pattern 1: Same model, different servers (load balancing)
endpoints:
  - url: "http://server1:8080"
    name: "llamacpp-1"
    priority: 100
  - url: "http://server2:8080"
    name: "llamacpp-2"
    priority: 100
# Pattern 2: Different models, different instances
endpoints:
  - url: "http://server1:8080"    # llama-3.2-3b-q4
    name: "llamacpp-small"
    priority: 90
  - url: "http://server2:8081"    # mistral-7b-q4
    name: "llamacpp-medium"
    priority: 95
# Pattern 3: Different quantisations, quality tiers
endpoints:
  - url: "http://server1:8080"    # Q8 high quality
    name: "llamacpp-quality"
    priority: 100
  - url: "http://server2:8080"    # Q4 balanced
    name: "llamacpp-balanced"
    priority: 80
5. Memory Management¶
Plan memory allocation carefully:
# Reserve memory headroom (20% buffer recommended)
# For 7B Q4 model requiring 5GB:
# System RAM needed = 5GB × 1.2 = 6GB minimum
# Monitor actual usage
./llama-server -m model.gguf --verbose
# Adjust slots based on memory:
# Total RAM / (Model Memory + Context Memory) = Max Slots
# 16GB / (5GB model + 1GB context) ≈ 2-3 safe slots
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:40114/olla/llamacpp/v1",
    api_key="not-needed"  # llama.cpp doesn't require API keys
)
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-q4_k_m.gguf",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
LangChain¶
from langchain.llms import OpenAI
llm = OpenAI(
    openai_api_base="http://localhost:40114/olla/llamacpp/v1",
    openai_api_key="not-needed",
    model_name="mistral-7b-instruct-q4_k_m.gguf"
)
response = llm("Explain machine learning")
print(response)
Continue.dev (Code Completion)¶
Configure Continue for IDE code completion:
{
  "models": [{
    "title": "llama.cpp via Olla",
    "provider": "openai",
    "model": "deepseek-coder-6.7b-instruct-q5_k_m.gguf",
    "apiBase": "http://localhost:40114/olla/llamacpp/v1",
    "useLegacyCompletionsEndpoint": false
  }],
  "tabAutocompleteModel": {
    "title": "llama.cpp Autocomplete",
    "provider": "openai",
    "model": "deepseek-coder-1.3b-base-q4_k_m.gguf",
    "apiBase": "http://localhost:40114/olla/llamacpp/v1"
  }
}
Aider (Pair Programming)¶
# Use llama.cpp with Aider for code assistance
aider \
--openai-api-base http://localhost:40114/olla/llamacpp/v1 \
--model deepseek-coder-6.7b-instruct-q5_k_m.gguf \
--no-auto-commits
# For code infill (FIM) with compatible models
aider \
--openai-api-base http://localhost:40114/olla/llamacpp/v1 \
--model codellama-13b-instruct-q4_k_m.gguf \
--edit-format whole
Next Steps¶
- Profile Configuration - Customise llama.cpp behaviour
- Model Unification - Understand model management across instances
- Load Balancing - Scale with multiple llama.cpp instances
- Monitoring - Set up slot and performance monitoring