LM Studio Integration¶
Home | lmstudio.ai |
---|---|
Type | lm-studio (use in endpoint configuration) |
Profile | lmstudio.yaml (see latest) |
Prefixes | lmstudio, lm-studio, lm_studio |
Endpoints | See below |
Configuration¶
Basic Setup¶
Add LM Studio to your Olla configuration:
discovery:
static:
endpoints:
- url: "http://localhost:1234"
name: "local-lm-studio"
type: "lm-studio"
priority: 90
model_url: "/api/v0/models"
health_check_url: "/v1/models"
check_interval: 2s
check_timeout: 1s
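Olla's health checks poll the health_check_url above (/v1/models) on the backend. A quick manual check that LM Studio is reachable on the configured port, assuming the default localhost:1234:
curl http://localhost:1234/v1/models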
Multiple LM Studio Instances¶
Run multiple LM Studio servers on different ports:
discovery:
static:
endpoints:
- url: "http://localhost:1234"
name: "lm-studio-1"
type: "lm-studio"
priority: 100
- url: "http://localhost:1235"
name: "lm-studio-2"
type: "lm-studio"
priority: 90
- url: "http://192.168.1.10:1234"
name: "lm-studio-remote"
type: "lm-studio"
priority: 50
Endpoints Supported¶
The following endpoints are supported by the LM Studio integration profile:
Path | Description |
---|---|
/v1/models | List Models & Health Check |
/v1/chat/completions | Chat Completions (OpenAI format) |
/v1/completions | Text Completions (OpenAI format) |
/v1/embeddings | Generate Embeddings |
/api/v0/models | Legacy Models Endpoint |
Usage Examples¶
Chat Completion¶
curl -X POST http://localhost:40114/olla/lmstudio/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7
}'
Streaming Response¶
curl -X POST http://localhost:40114/olla/lm-studio/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b-instruct",
"messages": [
{"role": "user", "content": "Write a short poem about coding"}
],
"stream": true
}'
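Generate Embeddings¶
The /v1/embeddings route listed in the endpoint table accepts the OpenAI embeddings format. A minimal sketch; "your-embedding-model" is a placeholder for whichever embedding model you have loaded in LM Studio:
curl -X POST http://localhost:40114/olla/lmstudio/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": "Olla routes this request to LM Studio"
  }'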
List Available Models¶
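List the models LM Studio exposes, routed through the Olla proxy:
curl http://localhost:40114/olla/lmstudio/v1/models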
LM Studio Specifics¶
Model Loading Behaviour¶
LM Studio differs from other backends:
- Preloaded Models: Models must be loaded in LM Studio before use (you can check what is loaded, as shown below)
- Single Concurrency: Only one request processed at a time
- Fast Response: No model loading delay during requests
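To confirm which models are actually loaded before sending traffic, query LM Studio directly. Its native /api/v0/models listing (the model_url used for discovery in the configuration above) reports more detail than the OpenAI-compatible route; the exact fields depend on your LM Studio version:
curl http://localhost:1234/api/v0/models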
Resource Configuration¶
The LM Studio profile includes optimised resource settings:
characteristics:
timeout: 3m
max_concurrent_requests: 1 # LM Studio handles one at a time
streaming_support: true
Memory Requirements¶
LM Studio uses quantised models with reduced memory requirements:
Model Size | Memory Required | Recommended |
---|---|---|
70B | 42GB | 52GB |
34B | 20GB | 25GB |
13B | 8GB | 10GB |
7B | 5GB | 6GB |
3B | 2GB | 3GB |
Profile Customisation¶
To customise LM Studio behaviour, create config/profiles/lmstudio-custom.yaml. See Profile Configuration for detailed explanations of each section.
Example Customisation¶
name: lm-studio
version: "1.0"
# Add custom prefixes
routing:
prefixes:
- lmstudio
- lm-studio
- lm_studio
- studio # Add custom prefix
# Adjust timeouts for slower hardware
characteristics:
timeout: 5m # Increase from 3m
# Modify resource limits
resources:
concurrency_limits:
- min_memory_gb: 0
max_concurrent: 1 # Always single-threaded
See Profile Configuration for complete customisation options.
Troubleshooting¶
Models Not Appearing¶
Issue: Models don't show in Olla's model list
Solution:
1. Ensure models are loaded in the LM Studio UI
2. Check LM Studio is running on the configured port
3. Verify with: curl http://localhost:1234/v1/models
Request Timeout¶
Issue: Requests time out on large models
Solution: Increase the timeout in a custom profile (config/profiles/lmstudio-custom.yaml), for example:
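characteristics:
  timeout: 5m  # raise from the 3m default; go higher for very large models or slow hardware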
Connection Refused¶
Issue: Cannot connect to LM Studio
Solution:
- Verify LM Studio is running
- Check "Enable CORS" in LM Studio settings
- Ensure firewall allows the port
- Test direct connection:
curl http://localhost:1234/v1/models
Single Request Limitation¶
Issue: Concurrent requests fail
Solution: LM Studio processes one request at a time. Use priority load balancing to route overflow to other endpoints:
proxy:
load_balancer: "priority"
discovery:
static:
endpoints:
- url: "http://localhost:1234"
name: "lm-studio"
type: "lm-studio"
priority: 100
- url: "http://localhost:11434"
name: "ollama-backup"
type: "ollama"
priority: 50 # Fallback for concurrent requests
Best Practices¶
1. Use for Interactive Sessions¶
LM Studio excels at:
- Development and testing
- Interactive chat sessions
- Quick model switching via UI
2. Configure Appropriate Timeouts¶
proxy:
response_timeout: 600s # 10 minutes for long generations
read_timeout: 300s # 5 minutes read timeout
3. Monitor Memory Usage¶
LM Studio shows real-time memory usage in its UI. Monitor this to:
- Prevent out-of-memory errors
- Choose appropriate model sizes
- Optimise quantisation levels
4. Combine with Other Backends¶
Use LM Studio for development and Ollama/vLLM for production:
discovery:
static:
endpoints:
# Development - high priority
- url: "http://localhost:1234"
name: "lm-studio-dev"
type: "lm-studio"
priority: 100
# Production - lower priority
- url: "http://localhost:11434"
name: "ollama-prod"
type: "ollama"
priority: 50
Integration with Tools¶
OpenAI SDK¶
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:40114/olla/lmstudio/v1",
api_key="not-needed" # LM Studio doesn't require API keys
)
response = client.chat.completions.create(
model="llama-3.2-3b-instruct",
messages=[
{"role": "user", "content": "Hello!"}
]
)
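Streaming works through the same client; a minimal sketch reusing the client above (the model must already be loaded in LM Studio):
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Write a short poem about coding"}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental content delta; print tokens as they arrive
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)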
LangChain¶
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_base="http://localhost:40114/olla/lm-studio/v1",
openai_api_key="not-needed",
model_name="mistral-7b-instruct"
)
Continue.dev¶
Configure Continue to use Olla with LM Studio:
{
"models": [{
"title": "LM Studio via Olla",
"provider": "openai",
"model": "llama-3.2-3b-instruct",
"apiBase": "http://localhost:40114/olla/lmstudio/v1"
}]
}
Next Steps¶
- Profile Configuration - Customise LM Studio behaviour
- Model Unification - Understand model management
- Load Balancing - Configure multi-backend setups