llama.cpp Integration¶
Home | github.com/ggml-org/llama.cpp, github.com/ikawrakow/ik_llama.cpp |
---|---|
Since | Olla v0.0.20 (previously since v0.0.1) |
Type | llamacpp (use in endpoint configuration) |
Profile | llamacpp.yaml (see latest) |
Features | |
Unsupported | |
Attributes | |
Prefixes | llamacpp |
Priority | 95 (high priority, between Ollama and LM Studio) |
Endpoints | See below |
Compatibility with mainline llama.cpp
Primary development targets the original llama.cpp server and has been tested against forks such as ik_llama.cpp; wider forks may not be fully supported yet.
Configuration¶
Basic Setup¶
Add llama.cpp to your Olla configuration:
discovery:
  static:
    endpoints:
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95
        # Profile handles health checks and model discovery
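Once Olla is running, the backend is reachable through the llamacpp route prefix. A quick way to confirm the endpoint is registered (this is the same route used in the examples further down this page):
# List models via Olla to confirm the endpoint is wired up
curl http://localhost:40114/olla/llamacpp/v1/models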
Production Setup¶
Configure llama.cpp for production with proper timeouts:
discovery:
  static:
    endpoints:
      - url: "http://inference-server:8080"
        name: "llamacpp-prod"
        type: "llamacpp"
        priority: 95
        # Profile handles health checks and model discovery
proxy:
  engine: "olla"                # Use high-performance engine
  load_balancer: "round-robin"
Multiple Instances with Different Quantisations¶
Deploy multiple llama.cpp instances with different quantisation levels:
discovery:
  static:
    endpoints:
      # High quality Q8 instance
      - url: "http://gpu-server:8080"
        name: "llamacpp-q8-quality"
        type: "llamacpp"
        priority: 100
      # Balanced Q4 instance
      - url: "http://cpu-server:8081"
        name: "llamacpp-q4-balanced"
        type: "llamacpp"
        priority: 80
      # Fast Q2 instance for edge
      - url: "http://edge-device:8082"
        name: "llamacpp-q2-fast"
        type: "llamacpp"
        priority: 60
Endpoints Supported¶
The following 9 inference endpoints are proxied by the llama.cpp integration profile:
Note: Monitoring endpoints (/health, /props, /slots, /metrics) are not exposed through the Olla proxy as per architectural design. The /olla/* endpoints are strictly for inference requests. Health checks are handled internally by Olla, and monitoring should use Olla's /internal/* endpoints or direct backend access.
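For example, health and slot status can be read directly from the backend rather than through the proxy (host and port as in the Basic Setup above; /slots availability depends on the server's flags):
# Direct backend access for monitoring (not proxied by Olla)
curl http://localhost:8080/health
curl http://localhost:8080/slots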
Path | Description |
---|---|
/v1/models | List Models (OpenAI format) |
/completion | Native Completion Endpoint |
/v1/completions | Text Completions (OpenAI format) |
/v1/chat/completions | Chat Completions (OpenAI format) |
/embedding | Native Embedding Endpoint |
/v1/embeddings | Embeddings (OpenAI format) |
/tokenize | Encode Text to Tokens (llama.cpp-specific) |
/detokenize | Decode Tokens to Text (llama.cpp-specific) |
/infill | Code Infill/Completion (llama.cpp-specific, FIM) |
Usage Examples¶
Chat Completion¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-q4_k_m.gguf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"temperature": 0.7,
"max_tokens": 500
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b-instruct-q4_k_m.gguf",
"messages": [
{"role": "user", "content": "Write a story about a robot"}
],
"stream": true,
"temperature": 0.8
}'
Code Infill (FIM Support)¶
Code completion using Fill-In-the-Middle (llama.cpp-specific):
curl -X POST http://localhost:40114/olla/llamacpp/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def fibonacci(n):\n if n <= 1:\n return n\n ",
"input_suffix": "\n return result",
"temperature": 0.2,
"max_tokens": 100
}'
# Useful for IDE integrations like Continue.dev, Aider
Tokenisation¶
Encode and decode tokens using the model's tokeniser (llama.cpp-specific):
# Encode text to tokens
curl -X POST http://localhost:40114/olla/llamacpp/tokenize \
-H "Content-Type: application/json" \
-d '{
"content": "Hello, world!"
}'
# Response: {"tokens": [15496, 11, 1917, 0]}
# Decode tokens to text
curl -X POST http://localhost:40114/olla/llamacpp/detokenize \
-H "Content-Type: application/json" \
-d '{
"tokens": [15496, 11, 1917, 0]
}'
# Response: {"content": "Hello, world!"}
Embeddings¶
Generate embeddings for semantic search:
curl -X POST http://localhost:40114/olla/llamacpp/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1.5-q4_k_m.gguf",
"input": "The quick brown fox jumps over the lazy dog"
}'
List Models¶
# OpenAI format
curl http://localhost:40114/olla/llamacpp/v1/models
# Response typically shows single model (single-model architecture)
# {
# "object": "list",
# "data": [
# {
# "id": "llama-3.2-3b-instruct-q4_k_m.gguf",
# "object": "model",
# "created": 1704067200,
# "owned_by": "meta-llama"
# }
# ]
# }
llama.cpp Specifics¶
Single Model Architecture¶
llama.cpp serves one model per instance, loaded at startup:
- No Runtime Switching: Cannot change models without restart
- Model Discovery: Returns a single model in the /v1/models response
- Efficient Memory: All resources dedicated to one model
- Predictable Performance: No model switching overhead
This differs from Ollama (multi-model) and requires running multiple llama.cpp instances for multiple models.
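In practice that means one llama-server process per model, each on its own port, registered as separate Olla endpoints. A sketch using the model files referenced elsewhere on this page:
# Two instances, two models, two ports
./llama-server -m models/llama-3.2-3b-instruct-q4_k_m.gguf --port 8080 &
./llama-server -m models/mistral-7b-instruct-q4_k_m.gguf --port 8081 &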
Slot Management¶
llama.cpp uses slot-based concurrency for fine-grained control:
- Default Slots: 4 concurrent processing slots (configurable with --parallel)
- Explicit Control: Each slot handles one request at a time
- Queue Management: Additional requests queue when all slots are full
- Monitoring: Available via direct backend access at http://backend:8080/slots
- Capacity Planning: Adjust slots based on hardware and model size
Slot configuration example:
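(A sketch reusing the model path and flags shown elsewhere on this page; in recent llama.cpp server builds --ctx-size is shared across slots, so each slot sees roughly ctx_size / parallel tokens.)
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 16384 \
  --parallel 8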
Note: Slot monitoring is not available through Olla proxy paths. Access slot status directly from the llama.cpp backend or use Olla's internal monitoring endpoints.
CPU Inference Capabilities¶
llama.cpp is optimised for CPU-first deployment:
- No GPU Required: Full functionality on CPU-only systems
- ARM Support: Runs on Apple Silicon (M1/M2/M3), Raspberry Pi
- Edge Deployment: Suitable for IoT and embedded systems
- Portable: Pure C++ with minimal dependencies
CPU performance tips:
- Use Q4 quantisation for the best CPU performance/quality trade-off
- Allocate sufficient threads (--threads)
- Consider smaller models (3B-7B) for CPU deployment
Quantisation Options¶
llama.cpp provides extensive GGUF quantisation levels:
Quantisation | BPW (Bits Per Weight) | Memory vs F16 | Quality | Use Case |
---|---|---|---|---|
Q2_K | 2.63 | 35% | Low | Extreme compression, edge devices |
Q3_K_M | 3.91 | 45% | Moderate | Balanced compression |
Q4_K_M | 4.85 | 50% | Good | Recommended for most use cases |
Q5_K_M | 5.69 | 62.5% | High | Quality-focused deployments |
Q6_K | 6.59 | 75% | Very High | Near-original quality |
Q8_0 | 8.50 | 87.5% | Excellent | Production quality requirements |
F16 | 16 | 100% | Original | Baseline reference |
F32 | 32 | 200% | Perfect | Research, original weights |
Memory Requirements¶
Approximate memory requirements for different model sizes with Q4_K_M quantisation:
Model Size | Q4_K_M Memory | Q8_0 Memory | Recommended RAM | Max Slots (Typical) |
---|---|---|---|---|
1-3B | 2-3GB | 3-5GB | 4GB | 8 |
7-8B | 4-6GB | 8-12GB | 8GB | 4 |
13-14B | 8-10GB | 16-20GB | 16GB | 2 |
30-34B | 20GB | 40GB | 24GB | 2 |
70-72B | 40GB | 80GB | 48GB | 1 |
Note: Memory requirements increase with context length. Add ~1GB per 8K context tokens; for example, a 7-8B Q4_K_M model (~5GB) with a 32K context needs roughly 5GB + 4 × 1GB ≈ 9GB.
Starting llama.cpp Server¶
Basic Start¶
# Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Start server with model
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080
With Slots Configuration¶
# Configure 8 concurrent slots
./llama-server \
-m models/mistral-7b-instruct-q4_k_m.gguf \
--port 8080 \
--parallel 8 \
--threads 8
CPU-Only Example¶
Optimised for CPU inference:
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080 \
--threads 8 \
--ctx-size 8192 \
--parallel 4 \
--host 0.0.0.0
GPU Acceleration (CUDA)¶
For NVIDIA GPUs:
# Build with CUDA support
make GGML_CUDA=1
# Start with GPU offloading
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 35 \
--parallel 8
GPU Acceleration (Metal)¶
For Apple Silicon (M1/M2/M3):
# Build with Metal support (default on macOS)
make
# Start with GPU acceleration
./llama-server \
-m models/llama-3.2-3b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 35 \
--parallel 4
Docker Deployment¶
# Pull llama.cpp server image
docker pull ghcr.io/ggerganov/llama.cpp:server
# Run with model volume
docker run -d \
--name llamacpp \
-p 8080:8080 \
-v /path/to/models:/models \
ghcr.io/ggerganov/llama.cpp:server \
-m /models/llama-3.2-3b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--parallel 4
# With GPU support (NVIDIA)
docker run -d \
--gpus all \
--name llamacpp-gpu \
-p 8080:8080 \
-v /path/to/models:/models \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/mistral-7b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 35 \
--parallel 8
Profile Customisation¶
To customise llama.cpp behaviour, create config/profiles/llamacpp-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: llamacpp
version: "1.0"
# Add custom prefixes
routing:
  prefixes:
    - llamacpp
    - cpp    # Add custom prefix
    - local  # Add custom prefix
# Adjust for larger models and CPU inference
characteristics:
  timeout: 10m                  # Increase for large models
  max_concurrent_requests: 2    # Reduce for limited hardware
# Model capability detection
models:
  capability_patterns:
    code:
      - "*deepseek-coder*"
      - "*codellama*"
      - "*starcoder*"
      - "*phind-codellama*"
    embeddings:
      - "*embed*"
      - "*nomic*"
      - "*bge*"
  # Custom context patterns
  context_patterns:
    - pattern: "*-128k*"
      context: 131072
    - pattern: "*qwen2.5*"
      context: 32768
# Slot configuration
resources:
  slot_configuration:
    default_slots: 8       # Increase for more concurrency
    max_slots: 16
    slot_monitoring: true
  # Adjust concurrency for hardware
  concurrency_limits:
    - min_memory_gb: 20    # Large models
      max_concurrent: 1
    - min_memory_gb: 8     # Medium models
      max_concurrent: 4
    - min_memory_gb: 0     # Small models
      max_concurrent: 8
See Profile Configuration for complete customisation options.
Troubleshooting¶
Slot Exhaustion (504 Errors)¶
Issue: "all slots are busy" or 504 timeout errors
Solution:
- Increase parallel slots (see the sketch below)
- Add more llama.cpp instances and let Olla load balance across them
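A sketch of both fixes; the model path and second host name are illustrative, and the extra endpoint entry follows the same shape as the configuration examples above.
# Restart the backend with more slots
./llama-server \
  -m models/mistral-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --parallel 8
# Then register a second instance in Olla's discovery.static.endpoints list,
# e.g. url http://second-server:8080, type llamacpp, priority 95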
Model Loading Failures¶
Issue: Model fails to load or crashes at startup
Solution:
- Verify GGUF file integrity
- Check memory requirements against the table above
- Reduce context size (--ctx-size)
- Use a smaller quantisation (Q4 instead of Q8, as sketched below)
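A rough walkthrough of those steps, assuming the model paths used elsewhere on this page (the memory check is Linux-flavoured):
# 1. Confirm the GGUF file downloaded completely (size should match the source)
ls -lh models/llama-3.2-3b-instruct-q4_k_m.gguf
# 2. Check available memory against the requirements table above
free -h
# 3/4. Restart with a smaller context and a lighter quantisation
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 4096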
Memory Issues¶
Issue: Out of memory errors or system freezing
Solution:
- Use more aggressive quantisation (savings shown relative to F16, per the table above):
  - Q8 → Q5_K_M (37.5% memory saving)
  - Q5_K_M → Q4_K_M (50% memory saving)
  - Q4_K_M → Q3_K_M (55% memory saving)
- Reduce parallel slots (--parallel)
- Limit the context window (--ctx-size)
- Monitor memory usage (see the sketch below)
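One way to apply the last three items, using flags already shown on this page (the monitoring commands are Linux-specific):
# Restart with fewer slots and a smaller context window
./llama-server \
  -m models/mistral-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --parallel 2 \
  --ctx-size 4096
# Watch memory while the server runs
free -h
ps -o rss,cmd -C llama-server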
Connection Timeouts¶
Issue: Requests time out before completion
Solution:
- Increase the Olla proxy timeout
- Increase the llama.cpp timeout in the profile (characteristics.timeout in the customisation example above)
- Monitor performance via Olla's internal endpoints (see the sketch below)
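The exact internal routes are not listed on this page, so treat the paths below as assumptions and confirm them against the Olla monitoring documentation:
# Olla-side status (assumed /internal/ routes; not proxied to the backend)
curl http://localhost:40114/internal/health
curl http://localhost:40114/internal/status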
GGUF Format Incompatibility¶
Issue: "invalid model file" or version errors
Solution:
- Update llama.cpp to the latest version (see the sketch below)
- Re-download the model with a compatible GGUF version
- Verify the GGUF metadata
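A sketch of the update and inspection steps; the dump script's path and name vary between llama.cpp versions, so treat it as an assumption and check your checkout's gguf-py/scripts directory:
# Rebuild against the latest llama.cpp
git pull
make clean && make
# Inspect GGUF metadata
python3 gguf-py/scripts/gguf_dump.py models/llama-3.2-3b-instruct-q4_k_m.gguf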
Best Practices¶
1. Slot Configuration for Workload¶
Match slots to your workload pattern:
# High concurrency (many short requests)
resources:
  slot_configuration:
    default_slots: 8
    max_slots: 16
# Low concurrency (few long requests)
resources:
  slot_configuration:
    default_slots: 2
    max_slots: 4
2. Quantisation Selection Guide¶
Choose quantisation based on requirements:
Priority | Recommended Quantisation | Use Case |
---|---|---|
Quality First | Q8_0, Q6_K | Production, quality-critical |
Balanced | Q4_K_M | General purpose, recommended |
Speed/Memory | Q3_K_M, Q2_K | Edge devices, limited resources |
Research | F16, F32 | Benchmarking, development |
3. CPU vs GPU Deployment Decisions¶
Use CPU when:
- GPU not available (edge devices, workstations)
- Small models (1-7B with Q4 quantisation)
- Low concurrency requirements
- Cost-sensitive deployments
Use GPU when:
- GPU memory is available (8GB+)
- Large models (13B+)
- High throughput requirements
- Low latency is critical
4. Multiple Instance Patterns¶
Deploy multiple llama.cpp instances strategically:
# Pattern 1: Same model, different servers (load balancing)
endpoints:
  - url: "http://server1:8080"
    name: "llamacpp-1"
    priority: 100
  - url: "http://server2:8080"
    name: "llamacpp-2"
    priority: 100
# Pattern 2: Different models, different instances
endpoints:
  - url: "http://server1:8080"    # llama-3.2-3b-q4
    name: "llamacpp-small"
    priority: 90
  - url: "http://server2:8081"    # mistral-7b-q4
    name: "llamacpp-medium"
    priority: 95
# Pattern 3: Different quantisations, quality tiers
endpoints:
  - url: "http://server1:8080"    # Q8 high quality
    name: "llamacpp-quality"
    priority: 100
  - url: "http://server2:8080"    # Q4 balanced
    name: "llamacpp-balanced"
    priority: 80
5. Memory Management¶
Plan memory allocation carefully:
# Reserve memory headroom (20% buffer recommended)
# For 7B Q4 model requiring 5GB:
# System RAM needed = 5GB × 1.2 = 6GB minimum
# Monitor actual usage
./llama-server -m model.gguf --verbose
# Adjust slots based on memory:
# Total RAM / (Model Memory + Context Memory) = Max Slots
# 16GB / (5GB model + 1GB context) ≈ 2-3 safe slots
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:40114/olla/llamacpp/v1",
    api_key="not-needed"  # llama.cpp doesn't require API keys
)
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-q4_k_m.gguf",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
LangChain¶
from langchain.llms import OpenAI
llm = OpenAI(
    openai_api_base="http://localhost:40114/olla/llamacpp/v1",
    openai_api_key="not-needed",
    model_name="mistral-7b-instruct-q4_k_m.gguf"
)
response = llm("Explain machine learning")
print(response)
Continue.dev (Code Completion)¶
Configure Continue for IDE code completion:
{
  "models": [{
    "title": "llama.cpp via Olla",
    "provider": "openai",
    "model": "deepseek-coder-6.7b-instruct-q5_k_m.gguf",
    "apiBase": "http://localhost:40114/olla/llamacpp/v1",
    "useLegacyCompletionsEndpoint": false
  }],
  "tabAutocompleteModel": {
    "title": "llama.cpp Autocomplete",
    "provider": "openai",
    "model": "deepseek-coder-1.3b-base-q4_k_m.gguf",
    "apiBase": "http://localhost:40114/olla/llamacpp/v1"
  }
}
Aider (Pair Programming)¶
# Use llama.cpp with Aider for code assistance
aider \
--openai-api-base http://localhost:40114/olla/llamacpp/v1 \
--model deepseek-coder-6.7b-instruct-q5_k_m.gguf \
--no-auto-commits
# For code infill (FIM) with compatible models
aider \
--openai-api-base http://localhost:40114/olla/llamacpp/v1 \
--model codellama-13b-instruct-q4_k_m.gguf \
--edit-format whole
Next Steps¶
- Profile Configuration - Customise llama.cpp behaviour
- Model Unification - Understand model management across instances
- Load Balancing - Scale with multiple llama.cpp instances
- Monitoring - Set up slot and performance monitoring