OpenAI API¶

Proxy endpoints for OpenAI and OpenAI-compatible services, available under the `/olla/openai/` and `/olla/openai-compatible/` prefixes.

Endpoints Overview¶

Method	URI	Description
GET	`/olla/openai/v1/models`	List available models
POST	`/olla/openai/v1/chat/completions`	Chat completion
POST	`/olla/openai/v1/completions`	Text completion
POST	`/olla/openai/v1/embeddings`	Generate embeddings
POST	`/olla/openai/v1/images/generations`	Generate images
POST	`/olla/openai/v1/audio/transcriptions`	Transcribe audio
POST	`/olla/openai/v1/audio/translations`	Translate audio
POST	`/olla/openai/v1/moderations`	Content moderation

GET /olla/openai/v1/models¶

List all available models from OpenAI or compatible endpoints.

Request¶

curl -X GET http://localhost:40114/olla/openai/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"

Response¶

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4-turbo-preview",
      "object": "model",
      "created": 1705334400,
      "owned_by": "openai",
      "permission": [],
      "root": "gpt-4-turbo-preview",
      "parent": null
    },
    {
      "id": "gpt-3.5-turbo",
      "object": "model",
      "created": 1677649963,
      "owned_by": "openai",
      "permission": [],
      "root": "gpt-3.5-turbo",
      "parent": null
    }
  ]
}
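
As a minimal sketch of consuming this response from client code, the snippet below pulls the model IDs out of a response body. The helper name `model_ids` and the abbreviated sample payload are illustrative assumptions, not part of Olla.

```python
import json
from typing import List


def model_ids(models_response: str) -> List[str]:
    """Return the model IDs found in a /v1/models response body."""
    payload = json.loads(models_response)
    # A missing "data" key is treated as an empty list.
    return [model["id"] for model in payload.get("data", [])]


# Abbreviated copy of the response shown above.
sample_models = """
{
  "object": "list",
  "data": [
    {"id": "gpt-4-turbo-preview", "object": "model", "owned_by": "openai"},
    {"id": "gpt-3.5-turbo", "object": "model", "owned_by": "openai"}
  ]
}
"""

print(model_ids(sample_models))
```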

POST /olla/openai/v1/chat/completions¶

Create a chat completion with GPT models.

Request¶

curl -X POST http://localhost:40114/olla/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 300,
    "stream": false
  }'

Response¶

{
  "id": "chatcmpl-8q9ABC123",
  "object": "chat.completion",
  "created": 1705334400,
  "model": "gpt-3.5-turbo-0125",
  "system_fingerprint": "fp_abc123",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing is a revolutionary type of computing that uses quantum bits or 'qubits' instead of traditional bits. While classical bits can only be 0 or 1, qubits can exist in multiple states simultaneously through a property called superposition. This allows quantum computers to process many calculations at once, potentially solving certain complex problems much faster than traditional computers. They're particularly promising for tasks like drug discovery, cryptography, and optimizing complex systems."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 85,
    "total_tokens": 112
  }
}
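
The same request can be issued from Python using only the standard library. This is a sketch under assumptions: the base URL reuses the host and port from the curl examples, and the helper names (`build_chat_request`, `first_reply`, `chat`) are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: same host and port as the curl examples in this document.
OLLA_BASE_URL = "http://localhost:40114/olla/openai/v1"


def build_chat_request(api_key: str, model: str,
                       messages: List[Dict[str, str]],
                       **options: Any) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    body = {"model": model, "messages": messages, **options}
    return urllib.request.Request(
        f"{OLLA_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def first_reply(response_body: Dict[str, Any]) -> str:
    """Return the assistant text of the first choice."""
    return response_body["choices"][0]["message"]["content"]


def chat(api_key: str, model: str, messages: List[Dict[str, str]],
         **options: Any) -> str:
    """Send the request and return the assistant reply (needs a running proxy)."""
    request = build_chat_request(api_key, model, messages, **options)
    with urllib.request.urlopen(request) as response:
        return first_reply(json.load(response))
```

Calling `chat("YOUR_API_KEY", "gpt-3.5-turbo", [{"role": "user", "content": "Hello"}], max_tokens=300)` would send the request; building the request separately keeps the payload easy to inspect without network access.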

Streaming Response¶

When `"stream": true` is set, the response is delivered as server-sent events, one `data:` line per chunk:

data: {"id":"chatcmpl-8q9ABC123","object":"chat.completion.chunk","created":1705334400,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_abc123","choices":[{"index":0,"delta":{"role":"assistant","content":"Quantum"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-8q9ABC123","object":"chat.completion.chunk","created":1705334400,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_abc123","choices":[{"index":0,"delta":{"content":" computing"},"logprobs":null,"finish_reason":null}]}

...

data: {"id":"chatcmpl-8q9ABC123","object":"chat.completion.chunk","created":1705334400,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_abc123","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
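
A client has to reassemble the text from the `delta` fragments and stop at the `[DONE]` sentinel. The following is a minimal sketch of that loop; the function name `iter_content_deltas` is hypothetical, and the sample lines are shortened versions of the chunks above.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a chat completion event stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and non-data fields are skipped
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


sample_stream = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Quantum"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_content_deltas(sample_stream)))  # prints "Quantum computing"
```

When reading from a live HTTP response, each line would first be decoded from bytes (for example with `line.decode("utf-8")`) before being passed to the parser.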

Advanced Features¶

Function Calling¶

curl -X POST http://localhost:40114/olla/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in London?"
      }
    ],
    "functions": [
      {
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city name"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "function_call": "auto"
  }'
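
OpenAI has deprecated the `functions` and `function_call` fields in favor of `tools` and `tool_choice`. The sketch below converts a legacy payload such as the one above into the newer shape; the helper name `functions_to_tools` is hypothetical, and whether a given OpenAI-compatible backend accepts either form depends on that backend.

```python
from typing import Any, Dict


def functions_to_tools(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a chat request with functions/function_call rewritten."""
    converted = {k: v for k, v in payload.items()
                 if k not in ("functions", "function_call")}
    if "functions" in payload:
        # Each function definition is wrapped in a tool of type "function".
        converted["tools"] = [
            {"type": "function", "function": fn} for fn in payload["functions"]
        ]
    call = payload.get("function_call")
    if isinstance(call, str):
        converted["tool_choice"] = call  # "auto" and "none" carry over unchanged
    elif isinstance(call, dict):
        converted["tool_choice"] = {
            "type": "function",
            "function": {"name": call["name"]},
        }
    return converted


legacy_request = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "What is the weather in London?"}],
    "functions": [{
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city name"}
            },
            "required": ["location"],
        },
    }],
    "function_call": "auto",
}

tools_request = functions_to_tools(legacy_request)
```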

JSON Mode¶

curl -X POST http://localhost:40114/olla/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "List 3 programming languages with their year of creation as a JSON object"
      }
    ],
    "response_format": { "type": "json_object" }
  }'
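
In JSON mode the assistant message content is a JSON string that still has to be parsed by the client. The sketch below shows one way to do that; the helper name `parse_json_reply` and the sample response are illustrative assumptions. It rejects replies cut off by the token limit, since a truncated reply may not be valid JSON.

```python
import json
from typing import Any, Dict


def parse_json_reply(response_body: Dict[str, Any]) -> Any:
    """Parse the first choice of a JSON-mode chat completion."""
    choice = response_body["choices"][0]
    if choice.get("finish_reason") == "length":
        # The reply hit max_tokens, so the JSON document may be incomplete.
        raise ValueError("reply was truncated; JSON may be incomplete")
    return json.loads(choice["message"]["content"])


# Hypothetical response body, shaped like the chat completion response above.
sample_json_response = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": '{"languages": [{"name": "C", "year": 1972}]}',
        },
        "finish_reason": "stop",
    }]
}

parsed = parse_json_reply(sample_json_response)
print(parsed["languages"][0]["name"])
```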

POST /olla/openai/v1/completions¶

Legacy text completion endpoint.

Request¶

curl -X POST http://localhost:40114/olla/openai/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Write a haiku about programming:",
    "max_tokens": 50,
    "temperature": 0.9
  }'

Response¶

{
  "id": "cmpl-8q9XYZ789",
  "object": "text_completion",
  "created": 1705334400,
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {
      "text": "\n\nCode flows like water\nLogic builds the foundation\nBugs hide in shadows",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 17,
    "total_tokens": 24
  }
}

POST /olla/openai/v1/embeddings¶

Generate embeddings for text input.

Request¶

curl -X POST http://localhost:40114/olla/openai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "text-embedding-3-small",
    "input": "The quick brown fox jumps over the lazy dog",
    "encoding_format": "float"
  }'

Response¶

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0156, 0.0891, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
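
Embeddings are typically compared with cosine similarity. The following sketch extracts the vectors from a response and compares two of them; the function names and the tiny two-dimensional sample vectors are illustrative assumptions (real embeddings have many more dimensions).

```python
import math
from typing import Any, Dict, List, Sequence


def embedding_vectors(response_body: Dict[str, Any]) -> List[List[float]]:
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical response with two inputs, deliberately listed out of order.
sample_embeddings = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
}

first, second = embedding_vectors(sample_embeddings)
print(cosine_similarity(first, second))  # orthogonal vectors give 0.0
```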

POST /olla/openai/v1/images/generations¶

Generate images using DALL-E (if configured).

Request¶

curl -X POST http://localhost:40114/olla/openai/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "dall-e-3",
    "prompt": "A serene landscape with mountains and a lake at sunset",
    "n": 1,
    "size": "1024x1024",
    "quality": "standard"
  }'

Response¶

{
  "created": 1705334400,
  "data": [
    {
      "url": "https://...",
      "revised_prompt": "A tranquil scene featuring majestic mountains reflected in a calm lake during a vibrant sunset, with warm orange and pink hues painting the sky"
    }
  ]
}

Request Parameters¶

Common Parameters¶

Parameter	Type	Default	Description
`model`	string	required	Model ID to use
`temperature`	float	1.0	Sampling temperature (0.0-2.0)
`top_p`	float	1.0	Nucleus sampling
`n`	integer	1	Number of completions
`stream`	boolean	false	Stream response
`stop`	string/array	null	Stop sequences
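`max_tokens`	integer	inf	Maximum tokens to generate (bounded by the model's context length)
`presence_penalty`	float	0	Penalize tokens that have already appeared, encouraging new topics (-2.0 to 2.0)
`frequency_penalty`	float	0	Penalize tokens in proportion to how often they have appeared (-2.0 to 2.0)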
`logit_bias`	object	null	Token bias adjustments
`user`	string	null	End-user identifier
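
The documented ranges can be checked on the client before a request is sent. This is a minimal sketch: the helper name `validate_sampling_params` is hypothetical, and it covers only the ranges listed in the table above.

```python
from typing import Any, Dict, List

# Ranges copied from the Common Parameters table.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def validate_sampling_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if "model" not in params:
        problems.append("model is required")
    for name, (low, high) in PARAM_RANGES.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems


print(validate_sampling_params({"model": "gpt-3.5-turbo", "temperature": 2.5}))
```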

Chat-Specific Parameters¶

Parameter	Type	Description
`messages`	array	Conversation messages
`functions`	array	Available functions for calling (deprecated by OpenAI in favor of `tools`)
`function_call`	string/object	Function calling behavior (deprecated by OpenAI in favor of `tool_choice`)
`response_format`	object	Output format (text/json_object)
`seed`	integer	Reproducibility seed
`tools`	array	Tools the model may call, such as functions
`tool_choice`	string/object	Tool selection behavior

Authentication¶

The `Authorization` header is forwarded to the backend. API keys can be configured per endpoint:

endpoints:
  - url: "https://api.openai.com"
    name: "openai-production"
    type: "openai"
    headers:
      Authorization: "Bearer ${OPENAI_API_KEY}"

Rate Limits¶

OpenAI rate limits are enforced by the backend service. Olla adds its own configurable limits:

Default: 200 requests per minute per endpoint
Configurable per-endpoint limits
Burst handling for traffic spikes
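
To illustrate how a per-minute limit and burst handling can interact, here is a token-bucket sketch. It is a generic illustration only, not Olla's actual implementation; the class name, the `burst` parameter, and its value are assumptions. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Generic token bucket: steady refill rate plus a bounded burst."""

    def __init__(self, rate_per_minute: float, burst: int, now: float = 0.0):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)        # most requests allowed back to back
        self.tokens = float(burst)          # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# 200 requests per minute (the documented default) with a hypothetical burst of 5.
bucket = TokenBucket(rate_per_minute=200, burst=5)
burst_results = [bucket.allow(0.0) for _ in range(6)]
print(burst_results)      # five requests pass, the sixth is rejected
print(bucket.allow(1.0))  # one second later the bucket has refilled enough
```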

Error Handling¶

OpenAI errors are forwarded with additional context:

{
  "error": {
    "message": "Invalid API key provided",
    "type": "invalid_request_error",
    "param": null,
    "code": "invalid_api_key"
  },
  "olla_context": {
    "endpoint": "openai-production",
    "request_id": "req_abc123",
    "timestamp": "2024-01-15T10:30:00Z"
  }
}
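
Client code can read the upstream error and the Olla context from the same body. The sketch below assumes the structure shown above; the `ProxiedError` type and `parse_error` helper are hypothetical names.

```python
import json
from typing import NamedTuple, Optional


class ProxiedError(NamedTuple):
    message: str
    code: Optional[str]
    endpoint: Optional[str]
    request_id: Optional[str]


def parse_error(body: str) -> ProxiedError:
    """Combine the upstream error with the olla_context block, if present."""
    payload = json.loads(body)
    error = payload.get("error", {})
    context = payload.get("olla_context", {})
    return ProxiedError(
        message=error.get("message", ""),
        code=error.get("code"),
        endpoint=context.get("endpoint"),
        request_id=context.get("request_id"),
    )


sample_error = """
{
  "error": {
    "message": "Invalid API key provided",
    "type": "invalid_request_error",
    "param": null,
    "code": "invalid_api_key"
  },
  "olla_context": {
    "endpoint": "openai-production",
    "request_id": "req_abc123",
    "timestamp": "2024-01-15T10:30:00Z"
  }
}
"""

err = parse_error(sample_error)
print(f"{err.code} from {err.endpoint} (request {err.request_id}): {err.message}")
```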

Response Headers¶

All responses include standard Olla headers:

X-Olla-Endpoint - Backend endpoint name
X-Olla-Model - Model used
X-Olla-Backend-Type - Always "openai" for these endpoints
X-Olla-Response-Time - Total processing time
X-Olla-Request-ID - Request tracking ID
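
These headers can be collected for logging or debugging. The sketch below filters them out of any header mapping; the helper name `olla_headers` is hypothetical and the sample values are placeholders. Header names are matched case-insensitively, as HTTP requires.

```python
from typing import Dict, Mapping


def olla_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return only the X-Olla-* headers, keyed by lower-cased name."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith("x-olla-")
    }


# Placeholder values for illustration.
sample_headers = {
    "Content-Type": "application/json",
    "X-Olla-Endpoint": "openai-production",
    "X-Olla-Backend-Type": "openai",
    "x-olla-request-id": "req_abc123",
}

print(olla_headers(sample_headers))
```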