LM Studio Integration¶
| Home | lmstudio.ai |
|---|---|
| Since | Olla v0.0.12 |
| Type | lm-studio (use in endpoint configuration) |
| Profile | lmstudio.yaml (see latest) |
| Features | |
| Unsupported | |
| Attributes | |
| Prefixes | lmstudio, lm-studio, lm_studio |
| Endpoints | See below |
Configuration¶
Basic Setup¶
Add LM Studio to your Olla configuration:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 90
        model_url: "/api/v0/models"
        health_check_url: "/v1/models"
        check_interval: 2s
        check_timeout: 1s
```
Multiple LM Studio Instances¶
Run multiple LM Studio servers on different ports:
```yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:1234"
        name: "lm-studio-1"
        type: "lm-studio"
        priority: 100
      - url: "http://localhost:1235"
        name: "lm-studio-2"
        type: "lm-studio"
        priority: 90
      - url: "http://192.168.1.10:1234"
        name: "lm-studio-remote"
        type: "lm-studio"
        priority: 50
```
Anthropic Messages API Support¶
LM Studio v0.4.1+ natively supports the Anthropic Messages API, allowing Olla to forward Anthropic-format requests directly without translation overhead (passthrough mode). This support was added primarily for Claude Code integration, which speaks the Anthropic API natively and therefore needs no translation middleware.
When Olla detects that an LM Studio endpoint supports the native Anthropic format (via the anthropic_support section in config/profiles/lmstudio.yaml), it bypasses the Anthropic-to-OpenAI translation pipeline and forwards requests directly to /v1/messages on the backend.
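For example, an Anthropic-format request can be sent straight through Olla. This is a hedged sketch, assuming the Messages route is exposed under the same provider prefix as the other examples on this page (/olla/lmstudio/v1/messages; check the Anthropic API Reference for the exact path) and that the model named below is loaded:

```bash
# Send a native Anthropic Messages request through Olla (passthrough mode)
curl -i -X POST http://localhost:40114/olla/lmstudio/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

When passthrough is active, the -i flag shows the X-Olla-Mode: passthrough response header described in the key details below.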
Profile configuration (from config/profiles/lmstudio.yaml):
```yaml
api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false
    min_version: "0.4.1"
```
Key details:
- Minimum LM Studio version: v0.4.1
- Token counting (/v1/messages/count_tokens): not supported
- Passthrough mode is automatic; no client-side configuration is needed
- Responses include the X-Olla-Mode: passthrough header when passthrough is active
- Falls back to translation mode if the passthrough conditions are not met
For more information, see API Translation and Anthropic API Reference.
Endpoints Supported¶
The following endpoints are supported by the LM Studio integration profile:
| Path | Description |
|---|---|
| /v1/models | List Models & Health Check |
| /v1/chat/completions | Chat Completions (OpenAI format) |
| /v1/completions | Text Completions (OpenAI format) |
| /v1/embeddings | Generate Embeddings |
| /api/v0/models | Legacy Models Endpoint |
Usage Examples¶
Chat Completion¶
```bash
curl -X POST http://localhost:40114/olla/lmstudio/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7
  }'
```
Streaming Response¶
```bash
curl -X POST http://localhost:40114/olla/lm-studio/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Write a short poem about coding"}
    ],
    "stream": true
  }'
```
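Generate Embeddings¶
The profile also routes /v1/embeddings (see the endpoints table above). A hedged example, assuming an embedding model is loaded in LM Studio; the model name below is illustrative:

```bash
curl -X POST http://localhost:40114/olla/lmstudio/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-nomic-embed-text-v1.5",
    "input": "Olla routes requests to local LLM backends"
  }'
```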
List Available Models¶
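List the models Olla exposes for this endpoint through the proxy, using the same port and prefix as the examples above:

```bash
curl http://localhost:40114/olla/lmstudio/v1/models
```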
LM Studio Specifics¶
Model Loading Behaviour¶
LM Studio differs from other backends:
- Preloaded Models: Models must be loaded in LM Studio before use
- Single Concurrency: Only one request processed at a time
- Fast Response: No model loading delay during requests
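To check what is actually loaded, you can query LM Studio's native REST API directly. A hedged sketch, assuming the /api/v0/models endpoint (the same one Olla uses for model discovery) reports a per-model load state:

```bash
# Ask LM Studio (not Olla) which models it knows about and their load state
# (field names id/state are assumed from LM Studio's v0 REST API)
curl -s http://localhost:1234/api/v0/models | jq '.data[] | {id, state}'
```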
Resource Configuration¶
The LM Studio profile includes optimised resource settings:
```yaml
characteristics:
  timeout: 3m
  max_concurrent_requests: 1  # LM Studio handles one at a time
  streaming_support: true
```
Memory Requirements¶
LM Studio uses quantised models with reduced memory requirements:
| Model Size | Memory Required | Recommended |
|---|---|---|
| 70B | 42GB | 52GB |
| 34B | 20GB | 25GB |
| 13B | 8GB | 10GB |
| 7B | 5GB | 6GB |
| 3B | 2GB | 3GB |
Profile Customisation¶
To customise LM Studio behaviour, create config/profiles/lmstudio-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
```yaml
name: lm-studio
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - lmstudio
    - lm-studio
    - lm_studio
    - studio  # Add custom prefix

# Adjust timeouts for slower hardware
characteristics:
  timeout: 5m  # Increase from 3m

# Modify resource limits
resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 1  # Always single-threaded
```
See Profile Configuration for complete customisation options.
Troubleshooting¶
Models Not Appearing¶
Issue: Models don't show in Olla's model list
Solution:
1. Ensure models are loaded in the LM Studio UI
2. Check that LM Studio is running on the configured port
3. Verify with: curl http://localhost:1234/v1/models
Request Timeout¶
Issue: Requests time out on large models
Solution: Increase the timeout in the profile:
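For example, in config/profiles/lmstudio-custom.yaml:

```yaml
characteristics:
  timeout: 5m  # Increase from the default 3m
```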
Connection Refused¶
Issue: Cannot connect to LM Studio
Solution:
- Verify LM Studio is running
- Check "Enable CORS" in LM Studio settings
- Ensure firewall allows the port
- Test the direct connection: curl http://localhost:1234/v1/models
Single Request Limitation¶
Issue: Concurrent requests fail
Solution: LM Studio processes one request at a time. Use priority load balancing to route overflow to other endpoints:
```yaml
proxy:
  load_balancer: "priority"

discovery:
  static:
    endpoints:
      - url: "http://localhost:1234"
        name: "lm-studio"
        type: "lm-studio"
        priority: 100
      - url: "http://localhost:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 50  # Fallback for concurrent requests
```
Best Practices¶
1. Use for Interactive Sessions¶
LM Studio excels at:
- Development and testing
- Interactive chat sessions
- Quick model switching via UI
2. Configure Appropriate Timeouts¶
```yaml
proxy:
  response_timeout: 600s  # 10 minutes for long generations
  read_timeout: 300s      # 5 minutes read timeout
```
3. Monitor Memory Usage¶
LM Studio shows real-time memory usage in its UI. Monitor this to:
- Prevent out-of-memory errors
- Choose appropriate model sizes
- Optimise quantisation levels
4. Combine with Other Backends¶
Use LM Studio for development and Ollama/vLLM for production:
```yaml
discovery:
  static:
    endpoints:
      # Development - high priority
      - url: "http://localhost:1234"
        name: "lm-studio-dev"
        type: "lm-studio"
        priority: 100
      # Production - lower priority
      - url: "http://localhost:11434"
        name: "ollama-prod"
        type: "ollama"
        priority: 50
```
Integration with Tools¶
OpenAI SDK¶
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmstudio/v1",
    api_key="not-needed"  # LM Studio doesn't require API keys
)

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
```
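Streaming works through the same client via the SDK's stream=True option; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmstudio/v1",
    api_key="not-needed",
)

# Stream tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Write a short poem about coding"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```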
LangChain¶
```python
from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:40114/olla/lm-studio/v1",
    openai_api_key="not-needed",
    model_name="mistral-7b-instruct"
)
```
Continue.dev¶
Configure Continue to use Olla with LM Studio:
```json
{
  "models": [{
    "title": "LM Studio via Olla",
    "provider": "openai",
    "model": "llama-3.2-3b-instruct",
    "apiBase": "http://localhost:40114/olla/lmstudio/v1"
  }]
}
```
Next Steps¶
- Profile Configuration - Customise LM Studio behaviour
- Model Unification - Understand model management
- Load Balancing - Configure multi-backend setups