
llama.cpp Integration

Home: github.com/ggml-org/llama.cpp, github.com/ikawrakow/ik_llama.cpp
Since: Olla v0.0.20 (previously since v0.0.1)
Type: llamacpp (use in endpoint configuration)
Profile: llamacpp.yaml (see latest)
Features:
  • Proxy Forwarding
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • Tokenisation API
  • Code Infill (FIM Support)
  • GGUF Exclusive Format
Unsupported:
  • Runtime Model Switching
  • Multi-Model Per Instance
  • Model Management (loading/unloading)
Attributes:
  • OpenAI Compatible
  • Single Model Architecture
  • Slot-Based Concurrency
  • CPU Inference Ready
  • GGUF Exclusive
  • Edge Device Friendly
  • Lightweight C++
Prefixes: llamacpp
Priority: 95 (high priority, between Ollama and LM Studio)
Endpoints: See below

Compatibility with mainline llama.cpp

Primary development was done against mainline llama.cpp, and the integration has been tested on forks such as ik_llama.cpp, but we may not support the wider ecosystem of forks yet.

Configuration

Basic Setup

Add llama.cpp to your Olla configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95
        # Profile handles health checks and model discovery

Production Setup

Configure llama.cpp for production with proper timeouts:

discovery:
  static:
    endpoints:
      - url: "http://inference-server:8080"
        name: "llamacpp-prod"
        type: "llamacpp"
        priority: 95
        # Profile handles health checks and model discovery

proxy:
  engine: "olla"  # Use high-performance engine
  load_balancer: "round-robin"

Multiple Instances with Different Quantisations

Deploy multiple llama.cpp instances with different quantisation levels:

discovery:
  static:
    endpoints:
      # High quality Q8 instance
      - url: "http://gpu-server:8080"
        name: "llamacpp-q8-quality"
        type: "llamacpp"
        priority: 100

      # Balanced Q4 instance
      - url: "http://cpu-server:8081"
        name: "llamacpp-q4-balanced"
        type: "llamacpp"
        priority: 80

      # Fast Q2 instance for edge
      - url: "http://edge-device:8082"
        name: "llamacpp-q2-fast"
        type: "llamacpp"
        priority: 60

Endpoints Supported

The following 9 inference endpoints are proxied by the llama.cpp integration profile:

Note: Monitoring endpoints (/health, /props, /slots, /metrics) are not exposed through the Olla proxy by design. The /olla/* endpoints are strictly for inference requests. Health checks are handled internally by Olla; for monitoring, use Olla's /internal/* endpoints or access the backend directly.

Path | Description
/v1/models | List Models (OpenAI format)
/completion | Native Completion Endpoint
/v1/completions | Text Completions (OpenAI format)
/v1/chat/completions | Chat Completions (OpenAI format)
/embedding | Native Embedding Endpoint
/v1/embeddings | Embeddings (OpenAI format)
/tokenize | Encode Text to Tokens (llama.cpp-specific)
/detokenize | Decode Tokens to Text (llama.cpp-specific)
/infill | Code Infill/Completion (llama.cpp-specific, FIM)

Usage Examples

Chat Completion

curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-q4_k_m.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct-q4_k_m.gguf",
    "messages": [
      {"role": "user", "content": "Write a story about a robot"}
    ],
    "stream": true,
    "temperature": 0.8
  }'

Code Infill (FIM Support)

Code completion using Fill-In-the-Middle (llama.cpp-specific):

curl -X POST http://localhost:40114/olla/llamacpp/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def fibonacci(n):\n    if n <= 1:\n        return n\n    ",
    "input_suffix": "\n    return result",
    "temperature": 0.2,
    "max_tokens": 100
  }'

# Useful for IDE integrations like Continue.dev, Aider

Tokenisation

Encode and decode tokens using the model's tokeniser (llama.cpp-specific):

# Encode text to tokens
curl -X POST http://localhost:40114/olla/llamacpp/tokenize \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Hello, world!"
  }'

# Response: {"tokens": [15496, 11, 1917, 0]}

# Decode tokens to text
curl -X POST http://localhost:40114/olla/llamacpp/detokenize \
  -H "Content-Type: application/json" \
  -d '{
    "tokens": [15496, 11, 1917, 0]
  }'

# Response: {"content": "Hello, world!"}
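
The same round trip can be scripted. Below is a minimal Python sketch using the requests library; it assumes Olla's default address used throughout this page (http://localhost:40114) and the llamacpp prefix.

import requests

OLLA = "http://localhost:40114/olla/llamacpp"  # adjust host/port for your Olla instance

# Encode text into model-specific token IDs
tokens = requests.post(f"{OLLA}/tokenize", json={"content": "Hello, world!"}).json()["tokens"]
print("tokens:", tokens)

# Decode the token IDs back to text; should round-trip to the original string
text = requests.post(f"{OLLA}/detokenize", json={"tokens": tokens}).json()["content"]
print("text:", text)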

Embeddings

Generate embeddings for semantic search:

curl -X POST http://localhost:40114/olla/llamacpp/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text-v1.5-q4_k_m.gguf",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

List Models

# OpenAI format
curl http://localhost:40114/olla/llamacpp/v1/models

# Response typically shows single model (single-model architecture)
# {
#   "object": "list",
#   "data": [
#     {
#       "id": "llama-3.2-3b-instruct-q4_k_m.gguf",
#       "object": "model",
#       "created": 1704067200,
#       "owned_by": "meta-llama"
#     }
#   ]
# }

llama.cpp Specifics

Single Model Architecture

llama.cpp serves one model per instance, loaded at startup:

  • No Runtime Switching: Cannot change models without restart
  • Model Discovery: Returns single model in /v1/models response
  • Efficient Memory: All resources dedicated to one model
  • Predictable Performance: No model switching overhead

This differs from Ollama (multi-model) and requires running multiple llama.cpp instances for multiple models.
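
Because each instance serves exactly one model, a quick way to see what is loaded where is to ask each backend for its model list. A small Python sketch, assuming the hypothetical gpu-server/cpu-server hosts from the multi-instance configuration above (swap in your own URLs):

import requests

# Hypothetical backend addresses from the earlier example; replace with your own
backends = [
    "http://gpu-server:8080",
    "http://cpu-server:8081",
]

for base in backends:
    resp = requests.get(f"{base}/v1/models", timeout=5)
    # Single-model architecture: each instance should report exactly one model
    models = [m["id"] for m in resp.json().get("data", [])]
    print(f"{base} -> {models}")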

Slot Management

llama.cpp uses slot-based concurrency for fine-grained control:

  • Default Slots: 4 concurrent processing slots (configurable with --parallel)
  • Explicit Control: Each slot handles one request at a time
  • Queue Management: Additional requests queue when slots full
  • Monitoring: Available via direct backend access at http://backend:8080/slots
  • Capacity Planning: Adjust slots based on hardware and model size

Slot configuration example:

# Start with 8 slots for higher concurrency
llama-server -m model.gguf --parallel 8 --port 8080

Note: Slot monitoring is not available through Olla proxy paths. Access slot status directly from the llama.cpp backend or use Olla's internal monitoring endpoints.
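
If you need slot visibility, poll the backend directly rather than going through Olla. A rough Python sketch follows; the /slots payload differs between llama.cpp versions (and some builds require the endpoint to be enabled explicitly), so treat the structure as illustrative:

import json
import requests

BACKEND = "http://backend:8080"  # direct llama.cpp backend, not the Olla proxy

resp = requests.get(f"{BACKEND}/slots", timeout=5)
slots = resp.json()

# One entry per processing slot; inspect the raw payload to see which
# fields your llama.cpp version exposes (state, task id, tokens, etc.)
print(f"configured slots: {len(slots)}")
print(json.dumps(slots, indent=2))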

CPU Inference Capabilities

llama.cpp is optimised for CPU-first deployment:

  • No GPU Required: Full functionality on CPU-only systems
  • ARM Support: Runs on Apple Silicon (M1/M2/M3), Raspberry Pi
  • Edge Deployment: Suitable for IoT and embedded systems
  • Portable: Pure C++ with minimal dependencies

CPU performance tips:

  • Use Q4 quantisation for the best CPU performance/quality trade-off
  • Allocate sufficient threads (--threads)
  • Consider smaller models (3B-7B) for CPU deployment

Quantisation Options

llama.cpp provides extensive GGUF quantisation levels:

Quantisation | BPW (Bits Per Weight) | Memory vs F16 | Quality | Use Case
Q2_K | 2.63 | 35% | Low | Extreme compression, edge devices
Q3_K_M | 3.91 | 45% | Moderate | Balanced compression
Q4_K_M | 4.85 | 50% | Good | Recommended for most use cases
Q5_K_M | 5.69 | 62.5% | High | Quality-focused deployments
Q6_K | 6.59 | 75% | Very High | Near-original quality
Q8_0 | 8.50 | 87.5% | Excellent | Production quality requirements
F16 | 16 | 100% | Original | Baseline reference
F32 | 32 | 200% | Perfect | Research, original weights

Memory Requirements

Approximate memory requirements for different model sizes with Q4_K_M quantisation:

Model Size | Q4_K_M Memory | Q8_0 Memory | Recommended RAM | Max Slots (Typical)
1-3B | 2-3GB | 3-5GB | 4GB | 8
7-8B | 4-6GB | 8-12GB | 8GB | 4
13-14B | 8-10GB | 16-20GB | 16GB | 2
30-34B | 20GB | 40GB | 24GB | 2
70-72B | 40GB | 80GB | 48GB | 1

Note: Memory requirements increase with context length. Add ~1GB per 8K context tokens.
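
As a rough planning aid, the table can be approximated from the figures quoted on this page: model weights at the quantisation's bytes per weight, plus about 1GB per 8K tokens of context. A back-of-the-envelope Python sketch (an approximation only; validate against real usage):

# Rough memory estimate: quantised weights + context (KV cache) overhead.
# Bytes per weight are derived from the BPW table above (bits / 8).
BYTES_PER_WEIGHT = {
    "Q4_K_M": 4.85 / 8,
    "Q5_K_M": 5.69 / 8,
    "Q8_0": 8.50 / 8,
    "F16": 16 / 8,
}

def estimate_memory_gb(params_billions: float, quant: str, ctx_tokens: int = 8192) -> float:
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    context_gb = ctx_tokens / 8192          # ~1GB per 8K context tokens
    return weights_gb + context_gb

# Example: 7B model at Q4_K_M with a 16K context window
print(f"~{estimate_memory_gb(7, 'Q4_K_M', 16384):.1f} GB")   # roughly 6 GB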

Starting llama.cpp Server

Basic Start

# Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Start server with model
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080

With Slots Configuration

# Configure 8 concurrent slots
./llama-server \
  -m models/mistral-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --parallel 8 \
  --threads 8

CPU-Only Example

Optimised for CPU inference:

./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --threads 8 \
  --ctx-size 8192 \
  --parallel 4 \
  --host 0.0.0.0

GPU Acceleration (CUDA)

For NVIDIA GPUs:

# Build with CUDA support
make GGML_CUDA=1

# Start with GPU offloading
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --n-gpu-layers 35 \
  --parallel 8

GPU Acceleration (Metal)

For Apple Silicon (M1/M2/M3):

# Build with Metal support (default on macOS)
make

# Start with GPU acceleration
./llama-server \
  -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --port 8080 \
  --n-gpu-layers 35 \
  --parallel 4

Docker Deployment

# Pull llama.cpp server image
docker pull ghcr.io/ggerganov/llama.cpp:server

# Run with model volume
docker run -d \
  --name llamacpp \
  -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama-3.2-3b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --parallel 4

# With GPU support (NVIDIA)
docker run -d \
  --gpus all \
  --name llamacpp-gpu \
  -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/mistral-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 35 \
  --parallel 8

Profile Customisation

To customise llama.cpp behaviour, create config/profiles/llamacpp-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: llamacpp
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - llamacpp
    - cpp       # Add custom prefix
    - local     # Add custom prefix

# Adjust for larger models and CPU inference
characteristics:
  timeout: 10m                  # Increase for large models
  max_concurrent_requests: 2    # Reduce for limited hardware

# Model capability detection
models:
  capability_patterns:
    code:
      - "*deepseek-coder*"
      - "*codellama*"
      - "*starcoder*"
      - "*phind-codellama*"
    embeddings:
      - "*embed*"
      - "*nomic*"
      - "*bge*"

# Custom context patterns
  context_patterns:
    - pattern: "*-128k*"
      context: 131072
    - pattern: "*qwen2.5*"
      context: 32768

# Slot configuration
resources:
  slot_configuration:
    default_slots: 8          # Increase for more concurrency
    max_slots: 16
    slot_monitoring: true

  # Adjust concurrency for hardware
  concurrency_limits:
    - min_memory_gb: 20       # Large models
      max_concurrent: 1
    - min_memory_gb: 8        # Medium models
      max_concurrent: 4
    - min_memory_gb: 0        # Small models
      max_concurrent: 8

See Profile Configuration for complete customisation options.

Troubleshooting

Slot Exhaustion (504 Errors)

Issue: "all slots are busy" or 504 timeout errors

Solution:

  1. Increase parallel slots:

    ./llama-server -m model.gguf --parallel 8
    

  2. Add more llama.cpp instances:

    endpoints:
      - url: "http://server1:8080"
        name: "llamacpp-1"
      - url: "http://server2:8080"
        name: "llamacpp-2"
    

Model Loading Failures

Issue: Model fails to load or crashes at startup

Solution:

  1. Verify GGUF file integrity:

    # Check file size and format
    file model.gguf
    

  2. Check memory requirements:

    # Estimate memory needed (Q4_K_M ≈ 0.5 bytes per parameter)
    # 7B model = 7 billion × 0.5 bytes = 3.5GB
    

  3. Reduce context size:

    ./llama-server -m model.gguf --ctx-size 4096  # Reduce from default
    

  4. Use smaller quantisation (Q4 instead of Q8):

    # Download Q4 variant instead of Q8
    

Memory Issues

Issue: Out of memory errors or system freezing

Solution:

  1. Use more aggressive quantisation:
     • Q8 → Q5_K_M (37.5% memory saving)
     • Q5_K_M → Q4_K_M (50% memory saving)
     • Q4_K_M → Q3_K_M (55% memory saving)

  2. Reduce parallel slots:

    ./llama-server -m model.gguf --parallel 1  # Single concurrent request
    

  3. Limit context window:

    ./llama-server -m model.gguf --ctx-size 2048  # Smaller context
    

  4. Monitor memory usage:

    # Linux
    watch -n 1 'free -h'
    
    # macOS
    vm_stat
    

Connection Timeouts

Issue: Requests timeout before completion

Solution:

  1. Increase Olla timeout:

    proxy:
      response_timeout: 600s  # 10 minutes
    

  2. Increase llama.cpp timeout in profile:

    characteristics:
      timeout: 10m
    

  3. Monitor performance via Olla internal endpoints:

    # Check Olla's internal status
    curl http://localhost:40114/internal/status
    

GGUF Format Incompatibility

Issue: "invalid model file" or version errors

Solution:

  1. Update llama.cpp to latest version:

    cd llama.cpp
    git pull
    make clean && make
    

  2. Re-download model with compatible GGUF version:

    # Check model compatibility with llama.cpp version
    

  3. Verify GGUF metadata:

    # Use llama.cpp tools to inspect GGUF file
    ./llama-cli --model model.gguf --verbose
    

Best Practices

1. Slot Configuration for Workload

Match slots to your workload pattern:

# High concurrency (many short requests)
resources:
  slot_configuration:
    default_slots: 8
    max_slots: 16

# Low concurrency (few long requests)
resources:
  slot_configuration:
    default_slots: 2
    max_slots: 4

2. Quantisation Selection Guide

Choose quantisation based on requirements:

Priority | Recommended Quantisation | Use Case
Quality First | Q8_0, Q6_K | Production, quality-critical
Balanced | Q4_K_M | General purpose, recommended
Speed/Memory | Q3_K_M, Q2_K | Edge devices, limited resources
Research | F16, F32 | Benchmarking, development

3. CPU vs GPU Deployment Decisions

Use CPU when:

  • GPU not available (edge devices, workstations)
  • Small models (1-7B with Q4 quantisation)
  • Low concurrency requirements
  • Cost-sensitive deployments

Use GPU when:

  • GPU memory available (8GB+)
  • Large models (13B+)
  • High throughput requirements
  • Low latency is critical

4. Multiple Instance Patterns

Deploy multiple llama.cpp instances strategically:

# Pattern 1: Same model, different servers (load balancing)
endpoints:
  - url: "http://server1:8080"
    name: "llamacpp-1"
    priority: 100
  - url: "http://server2:8080"
    name: "llamacpp-2"
    priority: 100

# Pattern 2: Different models, different instances
endpoints:
  - url: "http://server1:8080"  # llama-3.2-3b-q4
    name: "llamacpp-small"
    priority: 90
  - url: "http://server2:8081"  # mistral-7b-q4
    name: "llamacpp-medium"
    priority: 95

# Pattern 3: Different quantisations, quality tiers
endpoints:
  - url: "http://server1:8080"  # Q8 high quality
    name: "llamacpp-quality"
    priority: 100
  - url: "http://server2:8080"  # Q4 balanced
    name: "llamacpp-balanced"
    priority: 80

5. Memory Management

Plan memory allocation carefully:

# Reserve memory headroom (20% buffer recommended)
# For 7B Q4 model requiring 5GB:
# System RAM needed = 5GB × 1.2 = 6GB minimum

# Monitor actual usage
./llama-server -m model.gguf --verbose

# Adjust slots based on memory:
# Total RAM / (Model Memory + Context Memory) = Max Slots
# 16GB / (5GB model + 1GB context) ≈ 2-3 safe slots
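
The same arithmetic as a small Python helper; the 20% headroom and ~1GB-per-slot context figure are the rules of thumb quoted above, not hard limits:

def max_safe_slots(total_ram_gb: float, model_gb: float,
                   context_gb_per_slot: float = 1.0,
                   headroom: float = 0.2) -> int:
    # Reserve ~20% of RAM for the OS and spikes, then apply the rule of thumb:
    # usable RAM / (model memory + per-slot context memory)
    usable_gb = total_ram_gb * (1 - headroom)
    return max(int(usable_gb / (model_gb + context_gb_per_slot)), 1)

# Example from the comments above: 16GB RAM, 5GB Q4 model, ~1GB context per slot
print(max_safe_slots(16, 5))   # -> 2, consistent with the 2-3 safe slots estimate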

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/llamacpp/v1",
    api_key="not-needed"  # llama.cpp doesn't require API keys
)

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct-q4_k_m.gguf",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
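
Streaming works through the same client. A short continuation of the snippet above (the model filename is whatever your llama.cpp instance has loaded):

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-q4_k_m.gguf",
    messages=[{"role": "user", "content": "Write a haiku about llamas"}],
    stream=True,
)

for chunk in stream:
    # Each streamed chunk carries a content delta; print tokens as they arrive
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()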

LangChain

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:40114/olla/llamacpp/v1",
    openai_api_key="not-needed",
    model_name="mistral-7b-instruct-q4_k_m.gguf"
)

response = llm("Explain machine learning")
print(response)

Continue.dev (Code Completion)

Configure Continue for IDE code completion:

{
  "models": [{
    "title": "llama.cpp via Olla",
    "provider": "openai",
    "model": "deepseek-coder-6.7b-instruct-q5_k_m.gguf",
    "apiBase": "http://localhost:40114/olla/llamacpp/v1",
    "useLegacyCompletionsEndpoint": false
  }],
  "tabAutocompleteModel": {
    "title": "llama.cpp Autocomplete",
    "provider": "openai",
    "model": "deepseek-coder-1.3b-base-q4_k_m.gguf",
    "apiBase": "http://localhost:40114/olla/llamacpp/v1"
  }
}

Aider (Pair Programming)

# Use llama.cpp with Aider for code assistance
aider \
  --openai-api-base http://localhost:40114/olla/llamacpp/v1 \
  --model deepseek-coder-6.7b-instruct-q5_k_m.gguf \
  --no-auto-commits

# For code infill (FIM) with compatible models
aider \
  --openai-api-base http://localhost:40114/olla/llamacpp/v1 \
  --model codellama-13b-instruct-q4_k_m.gguf \
  --edit-format whole

Next Steps