LM Studio Integration

Home lmstudio.ai
Since Olla v0.0.12
Type lm-studio (use in endpoint configuration)
Profile lmstudio.yaml (see latest)
Features
  • Proxy Forwarding
  • Health Check (native)
  • Model Unification
  • Model Detection & Normalisation
  • OpenAI API Compatibility
  • Native Anthropic Messages API (v0.4.1+)
Unsupported
  • Model Management (loading/unloading)
  • Instance Management
  • Model Download
Attributes
  • OpenAI Compatible
  • Single Model Concurrency
  • Preloaded Models
Prefixes
  • /lmstudio
  • /lm-studio
  • /lm_studio
(see Routing Prefixes)
Endpoints See below

Configuration

Basic Setup

Add LM Studio to your Olla configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 90
        model_url: "/api/v0/models"
        health_check_url: "/v1/models"
        check_interval: 2s
        check_timeout: 1s
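
Once the endpoint is registered, you can confirm that Olla is proxying it by listing models through the LM Studio prefix. A minimal Python sketch using the requests library, assuming Olla is listening on port 40114 as in the examples on this page and that at least one model is loaded in LM Studio:

import requests

# List the models Olla exposes for LM Studio endpoints via the /olla/lmstudio prefix.
response = requests.get("http://localhost:40114/olla/lmstudio/v1/models", timeout=5)
response.raise_for_status()

# The response follows the OpenAI-style list format: {"object": "list", "data": [...]}
for model in response.json().get("data", []):
    print(model["id"])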

Multiple LM Studio Instances

Run multiple LM Studio servers on different ports:

discovery:
  static:
    endpoints:
      - url: "http://localhost:1234"
        name: "lm-studio-1"
        type: "lm-studio"
        priority: 100

      - url: "http://localhost:1235"
        name: "lm-studio-2"
        type: "lm-studio"
        priority: 90

      - url: "http://192.168.1.10:1234"
        name: "lm-studio-remote"
        type: "lm-studio"
        priority: 50

Anthropic Messages API Support

LM Studio v0.4.1+ natively supports the Anthropic Messages API, so Olla can forward Anthropic-format requests directly without translation overhead (passthrough mode). This was added specifically for Claude Code integration, which can now use LM Studio through Olla without any translation middleware.

When Olla detects that an LM Studio endpoint supports native Anthropic format (via the anthropic_support section in config/profiles/lmstudio.yaml), it will bypass the Anthropic-to-OpenAI translation pipeline and forward requests directly to /v1/messages on the backend.

Profile configuration (from config/profiles/lmstudio.yaml):

api:
  anthropic_support:
    enabled: true
    messages_path: /v1/messages
    token_count: false
    min_version: "0.4.1"

Key details:

  • Minimum LM Studio version: v0.4.1
  • Token counting (/v1/messages/count_tokens): Not supported
  • Passthrough mode is automatic; no client-side configuration is needed
  • Responses include the X-Olla-Mode: passthrough header when passthrough is active
  • Falls back to translation mode if passthrough conditions are not met

For more information, see API Translation and Anthropic API Reference.
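
As a quick check of passthrough mode, you can send an Anthropic-format request through Olla and inspect the response headers. A minimal Python sketch; the /olla/lmstudio/v1/messages path and the model name are assumptions for illustration, so adjust them to match your setup:

import requests

# Anthropic Messages API request forwarded through Olla's LM Studio prefix.
response = requests.post(
    "http://localhost:40114/olla/lmstudio/v1/messages",
    json={
        "model": "llama-3.2-3b-instruct",   # use a model that is loaded in LM Studio
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
response.raise_for_status()

# When passthrough is active, Olla adds the X-Olla-Mode: passthrough header.
print("Mode:", response.headers.get("X-Olla-Mode"))

# Anthropic-format responses carry a list of content blocks.
print(response.json()["content"][0]["text"])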

Endpoints Supported

The following endpoints are supported by the LM Studio integration profile:

Path                    Description
/v1/models              List Models & Health Check
/v1/chat/completions    Chat Completions (OpenAI format)
/v1/completions         Text Completions (OpenAI format)
/v1/embeddings          Generate Embeddings
/api/v0/models          Legacy Models Endpoint

Usage Examples

Chat Completion

curl -X POST http://localhost:40114/olla/lmstudio/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7
  }'

Streaming Response

curl -X POST http://localhost:40114/olla/lm-studio/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Write a short poem about coding"}
    ],
    "stream": true
  }'

List Available Models

curl http://localhost:40114/olla/lm_studio/v1/models

LM Studio Specifics

Model Loading Behaviour

LM Studio differs from other backends:

  • Preloaded Models: Models must be loaded in LM Studio before use
  • Single Concurrency: Only one request processed at a time
  • Fast Response: No model loading delay during requests

Resource Configuration

The LM Studio profile includes optimised resource settings:

characteristics:
  timeout: 3m
  max_concurrent_requests: 1  # LM Studio handles one at a time
  streaming_support: true

Memory Requirements

LM Studio uses quantised models with reduced memory requirements:

Model Size    Memory Required    Recommended Memory
70B           42GB               52GB
34B           20GB               25GB
13B           8GB                10GB
7B            5GB                6GB
3B            2GB                3GB

Profile Customisation

To customise LM Studio behaviour, create config/profiles/lmstudio-custom.yaml. See Profile Configuration for detailed explanations of each section.

Example Customisation

name: lm-studio
version: "1.0"

# Add custom prefixes
routing:
  prefixes:
    - lmstudio
    - lm-studio
    - lm_studio
    - studio      # Add custom prefix

# Adjust timeouts for slower hardware
characteristics:
  timeout: 5m     # Increase from 3m

# Modify resource limits
resources:
  concurrency_limits:
    - min_memory_gb: 0
      max_concurrent: 1  # Always single-threaded

See Profile Configuration for complete customisation options.

Troubleshooting

Models Not Appearing

Issue: Models don't show in Olla's model list

Solution:

  1. Ensure models are loaded in the LM Studio UI
  2. Check LM Studio is running on the configured port
  3. Verify with: curl http://localhost:1234/v1/models
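
If the direct check works but models still don't appear through Olla, compare the two model lists side by side. A small diagnostic sketch in Python, assuming the defaults used elsewhere on this page (LM Studio on port 1234, Olla on port 40114):

import requests

def model_ids(url):
    """Return the set of model IDs from an OpenAI-style /v1/models response."""
    data = requests.get(url, timeout=5).json()
    return {m["id"] for m in data.get("data", [])}

direct = model_ids("http://localhost:1234/v1/models")                    # LM Studio directly
via_olla = model_ids("http://localhost:40114/olla/lmstudio/v1/models")   # through Olla

print("Loaded in LM Studio:", sorted(direct))
print("Visible via Olla:   ", sorted(via_olla))
print("Missing via Olla:   ", sorted(direct - via_olla))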

Request Timeout

Issue: Requests timeout on large models

Solution: Increase timeout in profile:

characteristics:
  timeout: 10m  # Increase for large models

Connection Refused

Issue: Cannot connect to LM Studio

Solution:

  1. Verify LM Studio is running
  2. Check "Enable CORS" in LM Studio settings
  3. Ensure firewall allows the port
  4. Test direct connection: curl http://localhost:1234/v1/models

Single Request Limitation

Issue: Concurrent requests fail

Solution: LM Studio processes one request at a time. Use priority load balancing to route overflow to other endpoints:

proxy:
  load_balancer: "priority"

discovery:
  static:
    endpoints:
      - url: "http://localhost:1234"
        name: "lm-studio"
        type: "lm-studio"
        priority: 100

      - url: "http://localhost:11434"
        name: "ollama-backup"
        type: "ollama"
        priority: 50  # Fallback for concurrent requests
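
To see why the fallback matters, you can fire two requests at the LM Studio prefix at once: because LM Studio serves one request at a time, the second request's wall time includes the first one's generation. A rough Python sketch; the model name is an assumption, so use one that is actually loaded:

import threading
import time

import requests

URL = "http://localhost:40114/olla/lmstudio/v1/chat/completions"
PAYLOAD = {
    "model": "llama-3.2-3b-instruct",  # assumed model name
    "messages": [{"role": "user", "content": "Count to ten."}],
}

def send(label):
    start = time.time()
    requests.post(URL, json=PAYLOAD, timeout=300)
    print(f"{label} finished after {time.time() - start:.1f}s")

# Two concurrent requests: expect the second to take roughly twice as long.
threads = [threading.Thread(target=send, args=(f"request-{i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

The priority configuration above is intended to absorb this kind of overflow on the lower-priority backend rather than queueing everything behind LM Studio.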

Best Practices

1. Use for Interactive Sessions

LM Studio excels at:

  • Development and testing
  • Interactive chat sessions
  • Quick model switching via UI

2. Configure Appropriate Timeouts

proxy:
  response_timeout: 600s  # 10 minutes for long generations
  read_timeout: 300s      # 5 minutes read timeout

3. Monitor Memory Usage

LM Studio shows real-time memory usage in its UI. Monitor this to:

  • Prevent out-of-memory errors
  • Choose appropriate model sizes
  • Optimise quantisation levels

4. Combine with Other Backends

Use LM Studio for development and Ollama/vLLM for production:

discovery:
  static:
    endpoints:
      # Development - high priority
      - url: "http://localhost:1234"
        name: "lm-studio-dev"
        type: "lm-studio"
        priority: 100

      # Production - lower priority
      - url: "http://localhost:11434"
        name: "ollama-prod"
        type: "ollama"
        priority: 50

Integration with Tools

OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmstudio/v1",
    api_key="not-needed"  # LM Studio doesn't require API keys
)

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
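
The same setup also supports streaming, mirroring the earlier curl streaming example. A short sketch with the standard OpenAI SDK streaming interface:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:40114/olla/lmstudio/v1",
    api_key="not-needed",
)

# Stream tokens as they are generated.
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Write a short poem about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()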

LangChain

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:40114/olla/lm-studio/v1",
    openai_api_key="not-needed",
    model_name="mistral-7b-instruct"
)

Continue.dev

Configure Continue to use Olla with LM Studio:

{
  "models": [{
    "title": "LM Studio via Olla",
    "provider": "openai",
    "model": "llama-3.2-3b-instruct",
    "apiBase": "http://localhost:40114/olla/lmstudio/v1"
  }]
}

Next Steps