Crush CLI Integration with Dual API Support¶
Crush CLI is a modern terminal AI assistant by Charmbracelet that natively supports both OpenAI and Anthropic APIs. Connect it to Olla to use local LLM infrastructure with seamless provider switching and no cloud API costs.
Set in Crush CLI (~/.config/crush/crush.json):
{
"providers": {
"olla-anthropic": {
"type": "anthropic",
"base_url": "http://localhost:40114/olla/anthropic/v1",
"api_key": "not-required",
"models": [
{ "id": "llama3.2:latest", "name": "Llama 3.2" }
]
},
"olla-openai": {
"type": "openai-compat",
"base_url": "http://localhost:40114/olla/openai/v1",
"api_key": "not-required",
"models": [
{ "id": "llama3.2:latest", "name": "Llama 3.2" }
]
}
},
"models": {
"large": { "model": "llama3.2:latest", "provider": "olla-openai" },
"small": { "model": "llama3.2:latest", "provider": "olla-openai" }
}
}
Custom providers must list their models
A custom provider (any provider not built into Crush's catwalk catalogue, which is the case for Olla) is silently skipped at load time unless it declares both base_url and a non-empty models array. Each entry needs at least an id that matches a model ID Olla returns from /olla/openai/v1/models. The top-level models.large / models.small keys only select a model; they do not register it.
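Because a skipped provider produces no error, it can help to check a config file before launching Crush. The function below is a minimal sketch, assuming jq is installed. The name check_crush_providers is hypothetical, and the check only mirrors the rule described above; it is not Crush's actual loader.

```shell
# check_crush_providers [FILE]
# Prints the name of every provider in FILE that lacks a base_url or has a
# missing/empty models array (the two conditions described above).
# Hypothetical helper; requires jq.
check_crush_providers() {
  jq -r '
    (.providers // {})
    | to_entries[]
    | select(
        ((.value.base_url // "") == "")
        or (((.value.models // []) | length) == 0)
      )
    | .key
  ' "${1:-$HOME/.config/crush/crush.json}"
}
```

Running check_crush_providers ~/.config/crush/crush.json prints nothing when every provider passes. Built-in providers that rely on Crush's catalogue would also be listed, so treat the output as a prompt to look, not as a verdict.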
What you get via Olla
- Dual API support in one proxy (both OpenAI and Anthropic endpoints)
- Switch between providers by updating the models key in crush.json
- Priority/least-connections load-balancing and health checks
- Streaming passthrough for both formats
- Unified /v1/models across all backends
- Seamless format translation (Anthropic-to-OpenAI or direct OpenAI)
Overview¶
| Project | Crush CLI (by Charmbracelet) |
|---|---|
| Integration Type | Terminal UI / CLI Assistant |
| Connection Method | Native dual API support (OpenAI + Anthropic) |
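| Features Supported (via Olla) | Dual API endpoints, load balancing, health checks, streaming passthrough, unified model listing, Anthropic-to-OpenAI translation |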
| Configuration | Configure providers in ~/.config/crush/crush.json with both API endpoints |
| Example | Complete working example available in examples/crush-vllm/ |
What is Crush CLI?¶
Crush CLI is a terminal AI assistant by Charmbracelet. Key features:
- Go-based: Fast, lightweight, single binary
- TUI Interface: Beautiful terminal user interface using Bubble Tea
- Dual API Support: Natively supports both OpenAI-compatible and Anthropic formats
- Modern Design: Charmbracelet's signature polished experience
Official Repository: https://github.com/charmbracelet/crush
By default, Crush CLI connects to cloud APIs. With Olla's dual endpoint support, you can redirect it to local models whilst maintaining full compatibility with both API formats.
Architecture¶
┌──────────────┐ ┌──────────┐ ┌─────────────────────┐
│ Crush CLI │ OpenAI API │ Olla │ OpenAI API │ Ollama :11434 │
│ (TUI) │─────────────────────▶ │ :40114 │─────────────────▶ └─────────────────────┘
│ │ /openai/v1/* │ │ /v1/* ┌─────────────────────┐
│ │ │ • Dual │─────────────────▶ │ LM Studio :1234 │
│ │ Anthropic API │ API │ └─────────────────────┘
│ │─────────────────────▶ │ • Load │ ┌─────────────────────┐
│ │ /anthropic/v1/* │ Balance│─────────────────▶ │ vLLM :8000 │
│ │◀───────────────────── │ • Health│ └─────────────────────┘
└──────────────┘ Both formats └──────────┘
│
├─ Direct OpenAI format passthrough
├─ Anthropic → OpenAI translation
└─ Routes to healthy backend
Prerequisites¶
Before starting, ensure you have:
- Crush CLI Installed
  - Download from GitHub Releases
  - Or build from source: go install github.com/charmbracelet/crush@latest
  - Verify: crush --version
- Olla Running
  - Installed and configured (see Installation Guide)
  - Both OpenAI and Anthropic endpoints enabled (default)
- At Least One Backend
  - Ollama, LM Studio, vLLM, llama.cpp, or any OpenAI-compatible endpoint
  - With at least one model loaded/available
- Docker & Docker Compose (for examples)
  - Required only if following Docker-based quick start
Quick Start (Docker Compose)¶
1. Create Project Directory¶
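The directory name is not shown in this step. The sketch below assumes crush-olla, which matches the crush-olla_default network name used in the troubleshooting section (Docker Compose derives that prefix from the directory name).

```shell
# Create a working directory for compose.yaml and olla.yaml.
# The name "crush-olla" is an assumption; any name works, but Docker Compose
# uses it as the prefix for the default network (crush-olla_default).
mkdir -p crush-olla
cd crush-olla
```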
2. Create Configuration Files¶
Create compose.yaml:
services:
  olla:
    image: ghcr.io/thushan/olla:latest
    container_name: olla
    restart: unless-stopped
    ports:
      - "40114:40114"
    volumes:
      - ./olla.yaml:/app/config.yaml:ro
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:40114/internal/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  ollama_data:
    driver: local
Create olla.yaml:
server:
  host: 0.0.0.0
  port: 40114
proxy:
  engine: olla # default; or: sherpa (simpler codebase, maintenance mode)
  load_balancer: priority # or: least-connections
  response_timeout: 1800s # 30 min for long generations
  read_timeout: 600s
# Anthropic translator enables /olla/anthropic/v1/*
# OpenAI is the native format, no translator needed
translators:
  anthropic:
    enabled: true
# Service discovery for backends
discovery:
  type: static
  static:
    endpoints:
      - url: http://ollama:11434
        name: local-ollama
        type: ollama
        priority: 100
        check_interval: 30s
        check_timeout: 5s
logging:
  level: info
# Optional: Streaming optimisation
# proxy:
#   profile: streaming
3. Start Services¶
Wait for services to be healthy:
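The commands for this step are not shown above. The following is a minimal sketch, assuming Docker Compose v2 (the docker compose subcommand) and curl. Both function names are hypothetical; wait_for_olla polls the same /internal/health endpoint that the compose healthcheck uses.

```shell
# wait_for_olla [URL] [ATTEMPTS] [DELAY_SECONDS]
# Polls Olla's health endpoint until it answers successfully or the attempts
# run out. Returns 0 once healthy, 1 otherwise.
wait_for_olla() {
  url="${1:-http://localhost:40114/internal/health}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fs -o /dev/null "$url"; then
      echo "Olla is healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Olla did not become healthy after $attempts attempts" >&2
  return 1
}

# start_stack: start the containers in the background, then wait for Olla.
# Run it from the directory that holds compose.yaml.
start_stack() {
  docker compose up -d && wait_for_olla
}
```

Calling start_stack from the project directory brings both containers up and returns once Olla responds. docker compose ps shows the health status of each container if the wait times out.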
4. Pull a Model (Ollama)¶
docker exec ollama ollama pull llama3.2:latest
# Or a coding-focused model:
docker exec ollama ollama pull qwen2.5-coder:32b
5. Verify Olla Setup¶
# Health check
curl http://localhost:40114/internal/health
# List models via OpenAI endpoint
curl http://localhost:40114/olla/openai/v1/models | jq
# List models via Anthropic endpoint
curl http://localhost:40114/olla/anthropic/v1/models | jq
# Test OpenAI format
curl -X POST http://localhost:40114/olla/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [{"role":"user","content":"Hello from Olla"}]
}' | jq
# Test Anthropic format
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "llama3.2:latest",
"max_tokens": 100,
"messages": [{"role":"user","content":"Hello from Olla"}]
}' | jq
6. Configure Crush CLI¶
Create or edit ~/.config/crush/crush.json:
- macOS/Linux: ~/.config/crush/crush.json
- Windows: %LOCALAPPDATA%\crush\crush.json
Crush also checks .crush.json or crush.json in the current project directory, which takes precedence.
{
"providers": {
"olla-openai": {
"type": "openai-compat",
"base_url": "http://localhost:40114/olla/openai/v1",
"api_key": "not-required",
"models": [
{ "id": "llama3.2:latest", "name": "Llama 3.2" }
]
},
"olla-anthropic": {
"type": "anthropic",
"base_url": "http://localhost:40114/olla/anthropic/v1",
"api_key": "not-required",
"models": [
{ "id": "llama3.2:latest", "name": "Llama 3.2" }
]
}
},
"models": {
"large": { "model": "llama3.2:latest", "provider": "olla-openai" },
"small": { "model": "llama3.2:latest", "provider": "olla-openai" }
}
}
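Because a project-level file takes precedence over the global one, it is easy to edit a config that Crush is not actually reading. The helper below is a minimal sketch for macOS/Linux that reports the first config file found. The function name is hypothetical, and the order between .crush.json and crush.json is an assumption, since the text above only says that project files win over the global file. It does not model any merging Crush may perform.

```shell
# find_crush_config [PROJECT_DIR]
# Prints the first config file that exists, checking the project directory
# before the global location. The relative order of .crush.json and
# crush.json is assumed, not confirmed.
find_crush_config() {
  dir="${1:-.}"
  for candidate in \
    "$dir/.crush.json" \
    "$dir/crush.json" \
    "$HOME/.config/crush/crush.json"
  do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "no Crush config found" >&2
  return 1
}
```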
7. Start Crush CLI¶
Run crush in a terminal (from your project directory, if you use a project-level config).
You can now:
- Start chatting with your local models
- Try prompts like:
  - "Write a Python function to calculate factorial"
  - "Explain this code: [paste code]"
  - "Help me debug this error: [paste error]"
Configuration Options¶
Crush CLI Configuration¶
Edit ~/.config/crush/crush.json to customise provider settings.
Basic Configuration:
{
"providers": {
"olla-openai": {
"type": "openai-compat",
"base_url": "http://localhost:40114/olla/openai/v1",
"api_key": "not-required",
"models": [
{ "id": "llama3.2:latest", "name": "Llama 3.2" }
]
},
"olla-anthropic": {
"type": "anthropic",
"base_url": "http://localhost:40114/olla/anthropic/v1",
"api_key": "not-required",
"models": [
{ "id": "llama3.2:latest", "name": "Llama 3.2" }
]
}
},
"models": {
"large": { "model": "llama3.2:latest", "provider": "olla-openai" },
"small": { "model": "llama3.2:latest", "provider": "olla-openai" }
}
}
Multiple Providers (with cloud fallback):
{
  "providers": {
    "local-openai": {
      "type": "openai-compat",
      "base_url": "http://localhost:40114/olla/openai/v1",
      "api_key": "not-required",
      "models": [
        { "id": "qwen2.5-coder:32b", "name": "Qwen 2.5 Coder 32B" }
      ]
    },
    "local-anthropic": {
      "type": "anthropic",
      "base_url": "http://localhost:40114/olla/anthropic/v1",
      "api_key": "not-required",
      "models": [
        { "id": "qwen2.5-coder:32b", "name": "Qwen 2.5 Coder 32B" }
      ]
    },
    "openai-cloud": {
      "type": "openai",
      "api_key": "sk-..."
    },
    "anthropic-cloud": {
      "type": "anthropic",
      "api_key": "sk-ant-..."
    }
  },
  "models": {
    "large": { "model": "qwen2.5-coder:32b", "provider": "local-openai" },
    "small": { "model": "qwen2.5-coder:32b", "provider": "local-openai" }
  }
}
Provider-Specific Routing:
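Use Crush's models.large / models.small keys to direct different task types to different models or providers. Every model referenced there must also be listed in its provider's models array:
{
  "providers": {
    "coding": {
      "type": "openai-compat",
      "base_url": "http://localhost:40114/olla/openai/v1",
      "api_key": "not-required",
      "models": [
        { "id": "qwen2.5-coder:32b", "name": "Qwen 2.5 Coder 32B" },
        { "id": "llama3.3:latest", "name": "Llama 3.3" }
      ]
    }
  },
  "models": {
    "large": { "model": "qwen2.5-coder:32b", "provider": "coding" },
    "small": { "model": "llama3.3:latest", "provider": "coding" }
  }
}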
Olla Configuration¶
Edit olla.yaml to customise backend behaviour:
Load Balancing Strategy:
- priority: Uses the highest-priority backend first (recommended for a local + fallback setup)
- round-robin: Distributes requests evenly across all backends
- least-connections: Routes to the backend with the fewest active requests
Timeout Configuration:
proxy:
  response_timeout: 1800s # Max time for response (30 minutes)
  read_timeout: 600s # Max time for reading response body
Streaming Optimisation: set profile: streaming under the proxy key in olla.yaml.
Multiple Backends:
discovery:
  static:
    endpoints:
      - url: http://ollama:11434
        name: local-ollama
        type: ollama
        priority: 100
      - url: http://lmstudio:1234
        name: lmstudio-gpu
        type: lmstudio
        priority: 90
      - url: http://vllm:8000
        name: vllm-cluster
        type: vllm
        priority: 80
Usage Examples¶
Basic Chat¶
Code Generation¶
# In Crush CLI
> Write a Python function that implements quicksort with type hints and docstrings
> Create a REST API endpoint in Go that handles user authentication with JWT
Code Explanation¶
# Paste code and ask for explanation
> Explain this code:
[paste complex code snippet]
> What design patterns are used in this TypeScript class?
[paste class definition]
Debugging Assistance¶
> I'm getting this error: TypeError: 'NoneType' object is not iterable
[paste relevant code]
> Help me debug this
> Why is my Go routine leaking?
[paste goroutine code]
Provider Switching¶
The active provider and model are set via the models key in crush.json. To switch which provider handles large/small tasks, update the models.large and models.small entries. You can also maintain separate project-level crush.json files in different project directories.
Switching Between APIs¶
Crush CLI's key advantage is native dual API support:
Why Use Multiple APIs?¶
OpenAI Format (default):
- Broader model compatibility
- Standard for most local backends
- Simpler request/response structure
Anthropic Format:
- Better structured responses with content blocks
- Native tool use format
- Cleaner streaming event structure
Switching Providers¶
Configure which provider handles each role via the top-level models key. Each model named here must also appear in the chosen provider's models array:
{
"models": {
"large": { "model": "llama3.3:latest", "provider": "olla-anthropic" },
"small": { "model": "qwen2.5-coder:7b", "provider": "olla-openai" }
}
}
For project-specific overrides, place a crush.json or .crush.json in the project root and it takes precedence over the global config.
See the Crush CLI repository for the full list of supported configuration options and CLI flags.
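Switching a role by hand means editing JSON each time. The helper below is a minimal sketch of doing the same edit from the command line, assuming jq is installed. The function name set_crush_role is hypothetical, and it only rewrites the models entry; it does not register the model with the provider.

```shell
# set_crush_role FILE ROLE PROVIDER MODEL
# Rewrites models.ROLE (large or small) in FILE to point at PROVIDER/MODEL.
# Hypothetical helper; requires jq. MODEL must already be listed in the
# provider's models array, because this function does not add it.
set_crush_role() {
  file="$1"; role="$2"; provider="$3"; model="$4"
  tmp="$(mktemp)" || return 1
  if jq --arg r "$role" --arg p "$provider" --arg m "$model" \
       '.models[$r] = {model: $m, provider: $p}' "$file" > "$tmp"; then
    # Copy the contents back so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

For example, set_crush_role ~/.config/crush/crush.json large olla-anthropic llama3.3:latest points the large role at the Anthropic-format endpoint. Restart Crush afterwards so it rereads the file.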
Docker Deployment (Production)¶
For production deployments, enhance security and reliability:
Enhanced compose.yaml¶
services:
  olla:
    image: ghcr.io/thushan/olla:latest
    container_name: olla
    restart: unless-stopped
    ports:
      - "40114:40114"
    volumes:
      - ./olla.yaml:/app/config.yaml:ro
      - ./logs:/app/logs
    environment:
      - OLLA_LOG_LEVEL=info
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:40114/internal/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    networks:
      - olla-network
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - olla-network
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  ollama_data:
    driver: local
networks:
  olla-network:
    driver: bridge
Production olla.yaml¶
server:
  host: 0.0.0.0
  port: 40114
  rate_limits:
    global_requests_per_minute: 100
    per_ip_requests_per_minute: 60
    burst_size: 20
proxy:
  engine: olla # Use high-performance engine
  load_balancer: least-connections
  response_timeout: 1800s
  read_timeout: 600s
  profile: streaming
# Anthropic translator enables /olla/anthropic/v1/*
translators:
  anthropic:
    enabled: true
discovery:
  type: static
  static:
    endpoints:
      - url: http://ollama:11434
        name: local-ollama
        type: ollama
        priority: 100
        check_interval: 30s
        check_timeout: 5s
logging:
  level: info
  format: json
Model Selection Tips¶
Recommended Models for Crush CLI¶
Code-Focused Models:
- qwen2.5-coder:32b - Excellent for code generation and understanding
- deepseek-coder-v2:latest - Strong multi-language support
- codellama:34b - Meta's specialised coding model
- phi3.5:latest - Efficient, good for quick tasks
General Purpose (Code + Chat):
- llama3.3:latest - Well-balanced, fast
- mistralai/magistral-small - Good reasoning abilities
- qwen3:32b - Strong multi-task performance
Performance vs Quality Trade-offs:
| Model Size | Response Time | Quality | Memory Required |
|---|---|---|---|
| 3-8B | Fast (< 2s) | Good | 4-8 GB |
| 13-20B | Medium (2-5s) | Better | 12-16 GB |
| 30-70B | Slow (5-15s) | Best | 24-64 GB |
Loading Models:
# Ollama
docker exec ollama ollama pull qwen2.5-coder:32b
# Check loaded models
docker exec ollama ollama list
# Remove unused models to save space
docker exec ollama ollama rm <model-name>
Troubleshooting¶
Crush CLI Can't Connect to Olla¶
Check configuration file:
Test Olla directly:
Check Crush CLI logs:
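The commands for these checks are not shown above. The function below is a minimal sketch covering the first two (config file and direct Olla probe), assuming jq and curl are installed; it does not cover Crush's own logs. The name check_crush_to_olla is hypothetical.

```shell
# check_crush_to_olla [CONFIG] [OLLA_URL]
# 1. Confirms the Crush config parses as JSON and prints each provider's
#    base_url so it can be compared with the Olla address.
# 2. Probes Olla's health endpoint from the same machine.
# Hypothetical helper; requires jq and curl.
check_crush_to_olla() {
  config="${1:-$HOME/.config/crush/crush.json}"
  olla="${2:-http://localhost:40114}"

  if ! jq empty "$config" 2>/dev/null; then
    echo "FAIL: $config is missing or not valid JSON"
    return 1
  fi
  jq -r '(.providers // {}) | to_entries[]
         | "\(.key): \(.value.base_url // "<no base_url>")"' "$config"

  if curl -fs -o /dev/null "$olla/internal/health"; then
    echo "OK: Olla answered at $olla"
  else
    echo "FAIL: no healthy response from $olla/internal/health"
    return 1
  fi
}
```

If the function reports OK but Crush still cannot connect, compare the printed base_url values with the Olla address character by character; a wrong port or a missing /olla/openai/v1 suffix is the usual difference.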
No Models Available¶
List models from Olla:
# OpenAI endpoint
curl http://localhost:40114/olla/openai/v1/models | jq
# Anthropic endpoint
curl http://localhost:40114/olla/anthropic/v1/models | jq
# Should show models from all backends
Check backend health:
Verify backend directly:
Pull a model if empty:
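The commands for these three checks are not shown above. The function below is a minimal sketch that chains them for the Docker Compose setup on this page, reusing the status endpoint and the docker exec ollama commands that appear elsewhere in this guide. The name diagnose_models is hypothetical, and the code assumes that ollama list prints only a header row when no models are installed.

```shell
# diagnose_models [OLLA_URL] [MODEL]
# Shows backend health as Olla sees it, lists the models inside the
# "ollama" container, and pulls MODEL only when that list is empty.
diagnose_models() {
  olla="${1:-http://localhost:40114}"
  model="${2:-llama3.2:latest}"

  echo "== Backend health as seen by Olla =="
  curl -fs "$olla/internal/status/endpoints" || echo "(status endpoint unreachable)"

  echo "== Models reported by the Ollama container =="
  listing="$(docker exec ollama ollama list)" || return 1
  echo "$listing"

  # Assumption: "ollama list" prints a header row, so a single line of
  # output means no models are installed.
  count="$(printf '%s\n' "$listing" | wc -l | tr -d ' ')"
  if [ "$count" -le 1 ]; then
    echo "== No models found, pulling $model =="
    docker exec ollama ollama pull "$model"
  fi
}
```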
Provider Not Working¶
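Verify providers configured: open crush.json and confirm that each entry under providers has a type, a base_url, and a non-empty models array.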
Check provider type:
For Olla integrations, the relevant type values are: openai-compat (custom OpenAI-compatible endpoints like Olla), openai (actual OpenAI API), and anthropic. Crush supports additional provider types (e.g. gemini, azure, vertexai); see Crush's provider docs for the full list.
{
"providers": {
"provider-name": {
"type": "openai-compat",
"base_url": "http://localhost:40114/olla/openai/v1",
"api_key": "..."
}
}
}
Test both endpoints:
# OpenAI
curl http://localhost:40114/olla/openai/v1/models
# Anthropic
curl http://localhost:40114/olla/anthropic/v1/models
Slow Responses¶
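Switch to high-performance proxy engine: make sure engine: olla is set under proxy in olla.yaml.
Use smaller, faster models: for example, a 3-8B model such as llama3.2:latest instead of a 30B+ model.
Increase timeout for large models: raise response_timeout under proxy in olla.yaml (the examples on this page use 1800s).
Check backend performance: query http://localhost:40114/internal/status/endpoints to compare backends.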
Connection Refused¶
From Crush CLI to Olla:
# Test from host
curl http://localhost:40114/internal/health
# If this works but Crush fails, check firewall
From Olla to Ollama (Docker):
# Test from Olla container
docker exec olla wget -q -O- http://ollama:11434/api/tags
# If this fails, check Docker network
docker network inspect crush-olla_default
Streaming Issues¶
Enable streaming profile: set profile: streaming under proxy in olla.yaml.
Check Crush CLI streaming support:
- Ensure you're using a recent version
- Streaming should work automatically with both OpenAI and Anthropic formats
Test streaming directly:
# OpenAI format
curl -N -X POST http://localhost:40114/olla/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:latest",
"messages": [{"role":"user","content":"Count to 5"}],
"stream": true
}'
# Anthropic format
curl -N -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "llama3.2:latest",
"max_tokens": 50,
"messages": [{"role":"user","content":"Count to 5"}],
"stream": true
}'
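When watching raw curl output it can be hard to tell whether chunks arrive incrementally or all at once. The filter below is a minimal sketch that prefixes each streamed line with the seconds elapsed since the first line; the name stamp_lines is hypothetical. It uses date +%s, so the resolution is one second and a short reply may show +0s on every line.

```shell
# stamp_lines: prefix every line read from stdin with the number of seconds
# elapsed since the first non-blank line arrived. Blank lines (the
# separators between server-sent events) are skipped.
stamp_lines() {
  cr="$(printf '\r')"
  start=""
  while IFS= read -r line; do
    line="${line%"$cr"}"          # drop a trailing carriage return, if any
    [ -n "$line" ] || continue
    now="$(date +%s)"
    [ -n "$start" ] || start="$now"
    printf '+%ss %s\n' "$((now - start))" "$line"
  done
}

# Usage (requires a running Olla; ask for a longer answer to see the spread):
#   curl -sN -X POST http://localhost:40114/olla/openai/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model":"llama3.2:latest","messages":[{"role":"user","content":"Count to 50"}],"stream":true}' \
#     | stamp_lines
```

If every line carries the same large offset after a long pause, the response is being buffered somewhere between the backend and the terminal.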
API Key Issues¶
Olla doesn't enforce API keys by default. If Crush CLI requires one:
{
"providers": {
"olla-openai": {
"type": "openai-compat",
"base_url": "http://localhost:40114/olla/openai/v1",
"api_key": "not-required-but-can-be-anything"
}
}
}
Any placeholder value will work.
Advanced Configuration¶
Using Non-Docker Backends¶
If your backends run outside Docker:
olla.yaml with host services:
discovery:
  static:
    endpoints:
      # Linux: Use host IP
      - url: http://192.168.1.100:11434
        name: ollama-workstation
        type: ollama
        priority: 100
      # macOS/Windows: Use host.docker.internal
      - url: http://host.docker.internal:11434
        name: ollama-local
        type: ollama
        priority: 100
Load Balancing Across Multiple GPUs¶
Set up multiple backend instances:
discovery:
  static:
    endpoints:
      - url: http://gpu1-ollama:11434
        name: gpu1
        type: ollama
        priority: 100
      - url: http://gpu2-ollama:11434
        name: gpu2
        type: ollama
        priority: 100
      - url: http://gpu3-vllm:8000
        name: gpu3-vllm
        type: vllm
        priority: 90
proxy:
  load_balancer: least-connections # Distribute load evenly
Multiple Crush Profiles¶
Crush checks .crush.json or crush.json in your current directory before falling back to the global config. You can maintain per-project configs that override models and providers:
my-project/crush.json (coding-focused):
{
"providers": {
"coding": {
"type": "openai-compat",
"base_url": "http://localhost:40114/olla/openai/v1",
"api_key": "not-required",
"models": [
{ "id": "qwen2.5-coder:32b", "name": "Qwen 2.5 Coder 32B" },
{ "id": "qwen2.5-coder:7b", "name": "Qwen 2.5 Coder 7B" }
]
}
},
"models": {
"large": { "model": "qwen2.5-coder:32b", "provider": "coding" },
"small": { "model": "qwen2.5-coder:7b", "provider": "coding" }
}
}
Integration with Development Tools¶
Shell alias for quick access:
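The original alias is not shown here. As a minimal sketch, the shell function below (a function rather than an alias, so it can take an argument) starts Crush from a given project directory so that a project-level crush.json or .crush.json there is picked up. The name crush_in is hypothetical.

```shell
# crush_in [DIR]: start Crush from DIR so that a project-level crush.json or
# .crush.json in that directory is used. The body runs in a subshell, so the
# caller's working directory is left unchanged.
crush_in() (
  cd "${1:-.}" || exit 1
  crush
)
```

Add the function to ~/.bashrc or ~/.zshrc, then run crush_in ~/code/my-project (a placeholder path).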
See the Crush CLI repository for the full list of supported CLI flags.
Monitoring and Observability¶
Check Olla metrics:
# Endpoint status
curl http://localhost:40114/internal/status/endpoints | jq
# Model statistics
curl http://localhost:40114/internal/status/models | jq
# Health
curl http://localhost:40114/internal/health
View logs:
# Olla logs
docker compose logs -f olla
# Ollama logs
docker compose logs -f ollama
# Filter for errors
docker compose logs olla | grep -i error
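Custom logging: adjust the logging section of olla.yaml, for example level: info and format: json as in the production config above.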
Best Practices¶
1. Model Management¶
- Start small: Test with smaller models (3-8B) before using larger ones
- Specialised models: Use code-specific models (e.g., qwen2.5-coder) for better results
- Clean up: Remove unused models to save disk space
- Version models: Use specific tags (:v1.2) rather than :latest for consistency
2. Performance Optimisation¶
- GPU acceleration: Use CUDA-enabled Ollama image for GPU support
- Resource limits: Set Docker memory/CPU limits to prevent host resource exhaustion
- Connection pooling: Use the olla proxy engine for better connection handling
- Streaming profile: Enable for real-time response feel
3. Development Workflow¶
- Local-first: Configure highest priority for local backends
- Fallback remotes: Add lower-priority remote endpoints for reliability
- Provider separation: Use the OpenAI-format endpoint for standard tasks and the Anthropic-format endpoint for structured outputs
- Configuration profiles: Create separate configs for different workflows
4. Security¶
- Network isolation: Use Docker networks to isolate services
- Rate limiting: Enable in production to prevent abuse
- No public exposure: Don't expose Olla directly to the internet without authentication
- API gateway: Use nginx/Traefik with auth for external access
5. Cost Efficiency¶
- Local models: Save on API costs whilst maintaining privacy
- Hybrid setup: Use local for dev/test, cloud for production if needed
- Model caching: Keep frequently used models loaded
- Resource sharing: One Olla instance can serve multiple developers
Next Steps¶
Related Documentation¶
- OpenAI Chat Completions API Reference - OpenAI API documentation
- Anthropic Messages API Reference - Anthropic API documentation
- API Translation Concept - How translation works
- Load Balancing - Understanding request distribution
- Model Routing - How models are selected
Integration Examples¶
- Crush CLI + vLLM Example - High-performance backend setup
- Claude Code + Ollama Example - Alternative CLI assistant
- OpenCode Integration - Predecessor to Crush CLI
Backend Guides¶
- Ollama Integration - Ollama-specific configuration
- LM Studio Integration - LM Studio setup
- vLLM Integration - High-performance inference
Advanced Topics¶
- Health Checking - Endpoint monitoring
- Circuit Breaking - Failure handling
- Provider Metrics - Performance metrics
Support¶
Community:
- GitHub Issues: https://github.com/thushan/olla/issues
- Discussions: https://github.com/thushan/olla/discussions
Common Resources:
- Crush CLI Repository
- Charmbracelet Projects
- Olla Project Home
Quick Help:
# Verify setup
curl http://localhost:40114/internal/health
curl http://localhost:40114/olla/openai/v1/models | jq
curl http://localhost:40114/olla/anthropic/v1/models | jq
# Test OpenAI format
curl -X POST http://localhost:40114/olla/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.2:latest","messages":[{"role":"user","content":"Hi"}]}' | jq
# Test Anthropic format
curl -X POST http://localhost:40114/olla/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{"model":"llama3.2:latest","max_tokens":50,"messages":[{"role":"user","content":"Hi"}]}' | jq
# Check logs
docker compose logs -f olla