Architecture¶

Olla follows Hexagonal Architecture (Ports & Adapters) principles, ensuring clean separation of concerns, testability, and maintainability.

High-Level Architecture¶

graph TB
    subgraph "External"
        Client[Clients]
        Ollama[Ollama Nodes]
        LMStudio[LM Studio]
        OpenAI[OpenAI APIs]
    end

    subgraph "Olla Proxy"
        subgraph "Application Layer"
            Handler[HTTP Handlers]
            Router[Request Router]
        end

        subgraph "Core Domain"
            LB[Load Balancer]
            HC[Health Checker]
            Registry[Model Registry]
            Stats[Statistics]
        end

        subgraph "Adapters"
            ProxySherpa[Sherpa Engine]
            ProxyOlla[Olla Engine]
            Discovery[Service Discovery]
            Security[Rate Limiter]
        end
    end

    Client --> Handler
    Handler --> Router
    Router --> LB
    LB --> ProxySherpa
    LB --> ProxyOlla
    ProxySherpa --> Ollama
    ProxyOlla --> LMStudio
    ProxySherpa --> OpenAI
    HC --> Ollama
    HC --> LMStudio
    HC --> OpenAI
    Discovery --> Registry
    Security --> Handler

Hexagonal Architecture Implementation¶

Layer Structure¶

┌─────────────────────────────────────────────────────────┐
│                    External Clients                      │
│         (CLI, API Clients, OpenWebUI, Continue)         │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│                   Application Layer                      │
│                  (HTTP Handlers, Routes)                 │
│                   internal/app/handlers                  │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│                      Core Domain                         │
│              (Business Logic, Entities, Ports)           │
│                     internal/core                        │
└──────┬──────────────────────────────────────┬───────────┘
       │                                      │
┌──────▼────────────┐                ┌───────▼───────────┐
│   Adapter Layer   │                │   Adapter Layer   │
│  (Proxy Engines)  │                │ (Load Balancers)  │
│ internal/adapter  │                │ internal/adapter  │
└───────────────────┘                └───────────────────┘
       │                                      │
┌──────▼────────────────────────────────────▼────────────┐
│                 External Systems                        │
│        (Ollama, LM Studio, vLLM, OpenAI API)          │
└─────────────────────────────────────────────────────────┘

Key Principles¶

Dependencies point inward: Core has no dependencies on outer layers
Ports define contracts: Interfaces in core, implementations in adapters
Domain isolation: Business logic independent of infrastructure
Testability: Each layer can be tested in isolation
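
The principles above can be illustrated with a minimal sketch. It is not Olla's actual code: `EndpointSource`, `PickEndpoint` and `StaticSource` are hypothetical names, and `Endpoint` is a trimmed-down stand-in for the domain entity. The core declares the port and the rule, the adapter implements the port, and nothing in the core refers to the adapter.

```go
package main

import (
	"errors"
	"fmt"
)

// --- core: owns the port and the business rule, imports no adapters ---

// Endpoint is a trimmed-down stand-in for the domain entity.
type Endpoint struct {
	Name     string
	Priority int
	Healthy  bool
}

// EndpointSource is a hypothetical port: the core states what it needs.
type EndpointSource interface {
	Endpoints() []Endpoint
}

var ErrNoHealthyEndpoints = errors.New("no healthy endpoints available")

// PickEndpoint is core logic written purely against the port.
// It returns the healthy endpoint with the highest priority.
func PickEndpoint(src EndpointSource) (Endpoint, error) {
	var best Endpoint
	found := false
	for _, e := range src.Endpoints() {
		if e.Healthy && (!found || e.Priority > best.Priority) {
			best, found = e, true
		}
	}
	if !found {
		return Endpoint{}, ErrNoHealthyEndpoints
	}
	return best, nil
}

// --- adapter: implements the port, depends on core types only ---

// StaticSource is a hypothetical adapter backed by a fixed list.
type StaticSource struct{ list []Endpoint }

func (s StaticSource) Endpoints() []Endpoint { return s.list }

// Compile-time check that the adapter satisfies the port.
var _ EndpointSource = StaticSource{}

func main() {
	src := StaticSource{list: []Endpoint{
		{Name: "local", Priority: 100, Healthy: false},
		{Name: "gpu-box", Priority: 50, Healthy: true},
	}}
	e, err := PickEndpoint(src)
	fmt.Println(e.Name, err) // gpu-box <nil>
}
```

Because `PickEndpoint` only sees the interface, a test can pass an in-memory source like `StaticSource` and never touch the network.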

Core Components¶

Application Layer (`/internal/app/`)¶

The application layer handles HTTP requests and coordinates between components:

HTTP Handlers: Process incoming requests and format responses
Request Router: Routes requests to appropriate endpoints based on models and availability
Middleware: Security, logging, and request validation
Service Manager: Manages service lifecycle with dependency injection

Core Domain (`/internal/core/`)¶

Contains the business logic and domain models.

Domain Entities¶

// internal/core/domain/endpoint.go
type Endpoint struct {
    URL            string
    Name           string
    Type           EndpointType  // ollama, lm-studio, vllm, openai
    Priority       int
    Health         HealthStatus
    CircuitBreaker *CircuitBreaker
    Models         []Model
}

// internal/core/domain/model.go
type Model struct {
    ID           string
    Name         string
    Family       string      // llama, mistral, etc
    Size         int64       // Model size in bytes
    Capabilities []string    // vision, embeddings, code
    Context      int         // Context window size
    Endpoints    []string    // Available on these endpoints
}

// internal/core/domain/routing.go
type RoutingDecision struct {
    Endpoint     *Endpoint
    Model        string
    Strategy     string      // How the decision was made
    Alternatives []Endpoint  // Fallback options
}

Port Interfaces¶

Ports define contracts between layers:

// internal/core/ports/proxy.go
type ProxyService interface {
    ProxyRequest(ctx context.Context, w http.ResponseWriter, 
                 r *http.Request, stats *RequestStats, 
                 logger StyledLogger) error
    GetStats(ctx context.Context) (ProxyStats, error)
    UpdateConfig(configuration ProxyConfiguration)
}

// internal/core/ports/discovery.go
type DiscoveryService interface {
    GetHealthyEndpoints(ctx context.Context) ([]*domain.Endpoint, error)
    RefreshEndpoints(ctx context.Context) error
    RegisterEndpoint(endpoint *domain.Endpoint) error
    UnregisterEndpoint(name string) error
}

// internal/core/ports/balancer.go
type EndpointSelector interface {
    Select(ctx context.Context, endpoints []*domain.Endpoint, 
           model string) (*domain.Endpoint, error)
    UpdateMetrics(endpoint string, latency time.Duration, success bool)
    GetType() string
}

// internal/core/ports/health.go
type HealthChecker interface {
    CheckEndpoint(ctx context.Context, endpoint *domain.Endpoint) error
    StartMonitoring(ctx context.Context) error
    GetStatus() map[string]HealthStatus
}

Adapter Layer (`/internal/adapter/`)¶

Infrastructure implementations of the core ports.

Proxy Engines (`/internal/adapter/proxy/`)¶

Two implementations with different trade-offs:

Sherpa Engine - Simple and maintainable:

type SherpaProxy struct {
    client     *http.Client         // Shared HTTP client
    bufferPool *pool.Pool[*[]byte]
    config     *Configuration
}

// Simple implementation with shared transport
func (s *SherpaProxy) ProxyRequest(ctx context.Context,
    w http.ResponseWriter, r *http.Request,
    stats *RequestStats, logger StyledLogger) error {
    req := s.createBackendRequest(r)
    resp, err := s.client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Stream response with new signature
    // Returns:
    //   - bytesWritten: total bytes successfully written to client
    //   - lastChunk: final bytes of response (up to 8KB) for metrics extraction
    //   - err: streaming error if any
    buffer := make([]byte, 32*1024)
    bytesWritten, lastChunk, err := s.streamResponse(ctx, ctx, w, resp, buffer, logger)
    if err != nil {
        return fmt.Errorf("streaming failed after %d bytes: %w", bytesWritten, err)
    }

    // lastChunk contains the tail of the response (for extracting provider metrics)
    // This avoids buffering the entire response while still capturing completion stats
    if len(lastChunk) > 0 {
        // Extract provider metrics from the last chunk of response
        s.extractMetrics(lastChunk, stats)
    }

    return nil
}

Olla Engine - High-performance:

type OllaProxy struct {
    pools      map[string]*ConnectionPool  // Per-endpoint pools
    bufferPool *pool.Pool[*[]byte]
    config     *Configuration
}

// Advanced implementation with connection pooling
func (o *OllaProxy) ProxyRequest(ctx context.Context,
    w http.ResponseWriter, r *http.Request,
    stats *RequestStats, logger StyledLogger) error {
    endpoint := o.selectEndpoint()
    connPool := o.getPool(endpoint) // not named "pool", which would shadow the pool package

    resp, err := connPool.RoundTrip(r)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Get buffer from pool for zero-allocation streaming
    buffer := o.bufferPool.Get()
    defer o.bufferPool.Put(buffer)

    // Stream with optimized backpressure handling
    // Returns: bytes written, last chunk for metrics, error
    bytesWritten, lastChunk, err := o.streamResponse(
        r.Context(),           // client context
        resp.Request.Context(), // upstream context
        w, resp, *buffer, logger)

    if err != nil && !errors.Is(err, context.Canceled) {
        return fmt.Errorf("stream failed: %w", err)
    }

    // Extract metrics from last chunk (Olla buffers only final bytes)
    if len(lastChunk) > 0 {
        o.extractProviderMetrics(lastChunk, endpoint, stats)
    }

    stats.TotalBytes = bytesWritten
    return nil
}

See Proxy Engines for a detailed comparison.

Load Balancers (`/internal/adapter/balancer/`)¶

Three strategies available:

// Priority balancer - selects highest priority
type PriorityBalancer struct {
    mu sync.RWMutex
}

func (p *PriorityBalancer) Select(ctx context.Context, 
    endpoints []*domain.Endpoint, model string) (*domain.Endpoint, error) {

    healthy := filterHealthy(endpoints)
    if len(healthy) == 0 {
        return nil, ErrNoHealthyEndpoints
    }

    // Sort by priority (highest first)
    sort.Slice(healthy, func(i, j int) bool {
        return healthy[i].Priority > healthy[j].Priority
    })

    return healthy[0], nil
}

Priority: Select highest priority available endpoint
Round Robin: Cycle through available endpoints
Least Connections: Route to endpoint with fewest active connections
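
Only the priority strategy is shown above. As a rough sketch of how the other two might look, the block below uses a simplified `Endpoint` with a hypothetical `Active` counter and a reduced `Select` signature (no context or model argument); the real balancers implement the `EndpointSelector` port and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Endpoint is a simplified stand-in for domain.Endpoint.
type Endpoint struct {
	Name   string
	Active atomic.Int64 // in-flight requests (hypothetical field)
}

var ErrNoEndpoints = errors.New("no endpoints available")

// RoundRobin cycles through endpoints using an atomic counter,
// so concurrent callers never need a mutex.
type RoundRobin struct{ next atomic.Uint64 }

func (rr *RoundRobin) Select(eps []*Endpoint) (*Endpoint, error) {
	if len(eps) == 0 {
		return nil, ErrNoEndpoints
	}
	n := rr.next.Add(1) - 1 // first call yields index 0
	return eps[n%uint64(len(eps))], nil
}

// LeastConnections picks the endpoint with the fewest in-flight requests.
// Ties go to the earliest endpoint in the slice.
type LeastConnections struct{}

func (LeastConnections) Select(eps []*Endpoint) (*Endpoint, error) {
	if len(eps) == 0 {
		return nil, ErrNoEndpoints
	}
	best := eps[0]
	for _, e := range eps[1:] {
		if e.Active.Load() < best.Active.Load() {
			best = e
		}
	}
	return best, nil
}

func main() {
	a, b := &Endpoint{Name: "a"}, &Endpoint{Name: "b"}
	eps := []*Endpoint{a, b}

	rr := &RoundRobin{}
	for i := 0; i < 3; i++ {
		e, _ := rr.Select(eps)
		fmt.Println("round robin:", e.Name)
	}

	a.Active.Add(2) // pretend endpoint a is busy
	lc := LeastConnections{}
	e, _ := lc.Select(eps)
	fmt.Println("least connections:", e.Name)
}
```

A caller of the least-connections strategy would increment `Active` before proxying and decrement it when the response finishes; that bookkeeping is omitted here.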

Health Checking (`/internal/adapter/health/`)¶

Periodic health checks with configurable intervals
Circuit breaker pattern for failing endpoints
Automatic recovery detection
Health status caching
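
The circuit breaker and recovery behaviour listed above can be sketched as follows. This is an illustrative minimal version, not Olla's `CircuitBreaker`: the threshold and cooldown parameters, and the method names `Allow` and `Record`, are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// CircuitBreaker is a minimal sketch: it opens after `threshold`
// consecutive failures and lets trial requests through again once
// `cooldown` has elapsed since it opened.
type CircuitBreaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int
	open      bool
	openedAt  time.Time
}

func NewCircuitBreaker(threshold int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{threshold: threshold, cooldown: cooldown}
}

// Allow reports whether a request may be sent to the endpoint right now.
func (cb *CircuitBreaker) Allow() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if !cb.open {
		return true
	}
	// Half-open: permit trial requests once the cooldown has elapsed.
	return time.Since(cb.openedAt) >= cb.cooldown
}

// Record feeds the outcome of a request or health check back in.
func (cb *CircuitBreaker) Record(success bool) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if success {
		// Recovery detected: close the breaker and forget old failures.
		cb.failures, cb.open = 0, false
		return
	}
	cb.failures++
	if cb.failures >= cb.threshold {
		// (Re)open and restart the cooldown window.
		cb.open, cb.openedAt = true, time.Now()
	}
}

func main() {
	cb := NewCircuitBreaker(3, 30*time.Second)
	for i := 0; i < 3; i++ {
		cb.Record(false)
	}
	fmt.Println("allowed after 3 failures:", cb.Allow())

	cb.Record(true) // e.g. a later health check succeeds
	fmt.Println("allowed after recovery:", cb.Allow())
}
```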

Service Discovery (`/internal/adapter/discovery/`)¶

Static: Configuration-based endpoint discovery
Dynamic: Future support for service discovery systems
Model discovery and registry updates

Security (`/internal/adapter/security/`)¶

Rate limiting per IP and globally
Request size validation
Header validation
Trusted proxy support
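
Per-IP rate limiting is commonly built on a token bucket. The sketch below shows one way to do that with the standard library only; `IPLimiter` and its parameters are hypothetical and are not taken from Olla's security adapter. A global limit could reuse the same bucket logic with a single shared key.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket tracks the remaining tokens for a single client IP.
type bucket struct {
	tokens float64
	last   time.Time
}

// IPLimiter is a hypothetical per-IP token bucket. Buckets are never
// evicted here; a long-running proxy would need to expire idle entries.
type IPLimiter struct {
	mu        sync.Mutex
	perSecond float64 // refill rate in tokens per second
	burst     float64 // bucket capacity
	buckets   map[string]*bucket
}

func NewIPLimiter(perSecond float64, burst int) *IPLimiter {
	return &IPLimiter{
		perSecond: perSecond,
		burst:     float64(burst),
		buckets:   make(map[string]*bucket),
	}
}

// Allow consumes one token for ip and reports whether the request may proceed.
func (l *IPLimiter) Allow(ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	b, ok := l.buckets[ip]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[ip] = b
	}
	// Refill in proportion to the time elapsed since the last request.
	b.tokens += now.Sub(b.last).Seconds() * l.perSecond
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now

	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	limiter := NewIPLimiter(0, 2) // no refill, so the burst is the whole budget
	for i := 0; i < 3; i++ {
		fmt.Println("203.0.113.7:", limiter.Allow("203.0.113.7"))
	}
	fmt.Println("198.51.100.4:", limiter.Allow("198.51.100.4"))
}
```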

Statistics (`/internal/adapter/stats/`)¶

Lock-free atomic counters for performance:

type StatsCollector struct {
    // Using xsync for lock-free operations
    endpoints *xsync.Map[string, *endpointStats]
    total     *xsync.Counter
}

type endpointStats struct {
    requests   int64  // atomic
    errors     int64  // atomic
    totalTime  int64  // atomic nanoseconds
    lastError  int64  // atomic unix timestamp
}

func (s *StatsCollector) RecordRequest(endpoint string, duration time.Duration, err error) {
    // Lock-free increment
    s.total.Add(1)

    // Get or create endpoint stats
    stats, _ := s.endpoints.LoadOrStore(endpoint, &endpointStats{})

    // Atomic updates
    atomic.AddInt64(&stats.requests, 1)
    atomic.AddInt64(&stats.totalTime, int64(duration))

    if err != nil {
        atomic.AddInt64(&stats.errors, 1)
        atomic.StoreInt64(&stats.lastError, time.Now().Unix())
    }
}

Request Flow¶

sequenceDiagram
    participant C as Client
    participant H as Handler
    participant S as Security
    participant R as Router
    participant LB as Load Balancer
    participant HC as Health Check
    participant P as Proxy Engine
    participant E as Endpoint

    C->>H: HTTP Request
    H->>S: Validate Request
    S->>H: Validation Result
    H->>R: Route Request
    R->>LB: Select Endpoint
    LB->>HC: Check Endpoint Health
    HC->>LB: Health Status
    LB->>R: Selected Endpoint
    R->>P: Proxy Request
    P->>E: Forward Request
    E->>P: Response
    P->>R: Proxy Response
    R->>H: Formatted Response
    H->>C: HTTP Response

Request Processing Pipeline¶

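// Simplified request flow

// requestIDKey is an unexported key type, which avoids collisions with
// context values set by other packages.
type requestIDKey struct{}

func (h *ProxyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // 1. Extract request metadata
    ctx := context.WithValue(r.Context(), requestIDKey{}, generateID())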

    // 2. Apply security policies
    if err := h.security.ValidateRequest(r); err != nil {
        http.Error(w, "Forbidden", http.StatusForbidden)
        return
    }

    // 3. Select endpoint
    endpoint, err := h.router.Route(ctx, r)
    if err != nil {
        http.Error(w, "No endpoints available", http.StatusServiceUnavailable)
        return
    }

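    // 4. Proxy request
    stats := &RequestStats{StartTime: time.Now()}
    err = h.proxy.ProxyRequest(ctx, w, r, stats, h.logger)

    // 5. Record metrics, matching RecordRequest(endpoint, duration, err);
    //    endpoint.Name assumes Route returns a *domain.Endpoint
    h.stats.RecordRequest(endpoint.Name, time.Since(stats.StartTime), err)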
}

Service Lifecycle¶

Services follow a managed lifecycle with dependency injection:

// internal/app/app.go
type ManagedService interface {
    Name() string
    Dependencies() []string
    Start(ctx context.Context) error
    Stop(ctx context.Context) error
}

// Service Manager uses Kahn's algorithm for topological sorting
type ServiceManager struct {
    services  map[string]ManagedService
    order     []string  // Startup order
    mu        sync.RWMutex
}

The service manager:

Resolves dependencies using topological sorting
Starts services in dependency order
Stops services in reverse order
Handles graceful shutdown
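
The dependency resolution step can be sketched with Kahn's algorithm as below. This is a standalone illustration under assumed names (`StartupOrder`, and a plain map in place of `ManagedService.Dependencies()`), not the service manager's actual code. Ready services are processed in sorted order so the result is deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// StartupOrder returns service names ordered so that every service appears
// after its dependencies. deps maps a service to the services it needs.
func StartupOrder(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int, len(deps)) // unmet dependencies per service
	dependents := make(map[string][]string)     // reverse edges
	for svc, needs := range deps {
		indegree[svc] = len(needs)
		for _, d := range needs {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("service %q depends on unknown service %q", svc, d)
			}
			dependents[d] = append(dependents[d], svc)
		}
	}

	// Start with every service that has no dependencies.
	var ready []string
	for svc, n := range indegree {
		if n == 0 {
			ready = append(ready, svc)
		}
	}
	sort.Strings(ready)

	order := make([]string, 0, len(deps))
	for len(ready) > 0 {
		svc := ready[0]
		ready = ready[1:]
		order = append(order, svc)

		next := dependents[svc]
		sort.Strings(next)
		for _, d := range next {
			indegree[d]--
			if indegree[d] == 0 {
				ready = append(ready, d)
			}
		}
	}

	// Anything left unprocessed is part of a cycle.
	if len(order) != len(deps) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := StartupOrder(map[string][]string{
		"config":    nil,
		"discovery": {"config"},
		"health":    {"discovery"},
		"proxy":     {"discovery", "health"},
		"http":      {"proxy"},
	})
	fmt.Println(order, err)
}
```

Shutdown would walk the returned slice from the last element to the first, so a service is stopped before anything it depends on.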

Concurrency Model¶

Olla uses Go's goroutine-based concurrency:

Request Handling: Each request runs in its own goroutine
Health Checking: Background goroutines monitor endpoint health
Statistics: Lock-free atomic operations for high-performance metrics
Connection Pooling: Shared connection pools across goroutines
Circuit Breakers: Thread-safe state management

Connection Pool Management¶

type ConnectionPool struct {
    endpoint  string
    available chan net.Conn
    factory   ConnectionFactory

    // Metrics
    created   int64  // atomic
    active    int64  // atomic
    destroyed int64  // atomic
}

func (p *ConnectionPool) Get(ctx context.Context) (net.Conn, error) {
    select {
    case conn := <-p.available:
        if p.isHealthy(conn) {
            atomic.AddInt64(&p.active, 1)
            return conn, nil
        }
        p.destroy(conn)
        return p.create(ctx)

    case <-ctx.Done():
        return nil, ctx.Err()

    default:
        return p.create(ctx)
    }
}

Memory Optimisation¶

Object Pooling¶

Reducing GC pressure through object reuse:

// Generic pool implementation
type Pool[T any] struct {
    pool sync.Pool
    new  func() T
    reset func(T)
}

// Buffer pool for streaming
var bufferPool = &Pool[*[]byte]{
    new: func() *[]byte {
        buf := make([]byte, 8192)
        return &buf
    },
    reset: func(buf *[]byte) {
        // Clear sensitive data
        clear(*buf)
    },
}
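
The snippet above shows the pool's fields but not how values are taken and returned. One possible shape for those methods is sketched below; `NewPool`, `Get` and `Put` are assumed names for illustration, and the constructor wires the factory into `sync.Pool.New` rather than using a struct literal.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool is a typed wrapper around sync.Pool.
type Pool[T any] struct {
	pool  sync.Pool
	reset func(T)
}

// NewPool builds a pool whose values are created by newFn and cleaned
// by reset (which may be nil) before they are handed back for reuse.
func NewPool[T any](newFn func() T, reset func(T)) *Pool[T] {
	p := &Pool[T]{reset: reset}
	p.pool.New = func() any { return newFn() }
	return p
}

// Get returns a pooled value, or a fresh one if the pool is empty.
func (p *Pool[T]) Get() T {
	return p.pool.Get().(T)
}

// Put resets v and makes it available for reuse.
func (p *Pool[T]) Put(v T) {
	if p.reset != nil {
		p.reset(v)
	}
	p.pool.Put(v)
}

func main() {
	bufs := NewPool(
		func() *[]byte {
			b := make([]byte, 8192)
			return &b
		},
		func(b *[]byte) { clear(*b) }, // clear requires Go 1.21+
	)

	buf := bufs.Get()
	copy(*buf, "secret")
	bufs.Put(buf) // zeroed before going back into the pool

	again := bufs.Get()
	fmt.Println(len(*again), (*again)[0]) // 8192 0
}
```

Storing `*[]byte` rather than `[]byte` avoids allocating a new slice header each time the value is boxed into the pool's `any`.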

Memory Layout Optimisation¶

// Optimised struct layout for cache efficiency
type Endpoint struct {
    // Hot path fields (frequently accessed)
    Health    int32   // 4 bytes, atomic access
    Priority  int32   // 4 bytes, fits in same cache line

    // Warm path fields
    URL       string  // 16 bytes (string header)
    Name      string  // 16 bytes

    // Cold path fields (rarely accessed)
    Type      EndpointType
    Models    []Model
    CreatedAt time.Time
}
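
Field order also affects struct size, because the compiler inserts padding to satisfy alignment. The small experiment below makes that visible with `unsafe.Sizeof`; the two struct types are invented for the demonstration and are unrelated to Olla's `Endpoint`.

```go
package main

import (
	"fmt"
	"unsafe"
)

// padded interleaves small and large fields, forcing padding after each bool.
type padded struct {
	a bool
	b int64
	c bool
}

// packed holds the same fields, largest first, so the bools share a word.
type packed struct {
	b int64
	a bool
	c bool
}

// sizes reports the in-memory size of each layout on the current platform.
func sizes() (paddedSize, packedSize uintptr) {
	return unsafe.Sizeof(padded{}), unsafe.Sizeof(packed{})
}

func main() {
	p, q := sizes()
	// On common 64-bit platforms this prints 24 and 16.
	fmt.Println("padded:", p, "packed:", q)
}
```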

Error Handling¶

Structured error handling throughout the system:

// Domain errors
var (
    ErrNoHealthyEndpoints = errors.New("no healthy endpoints available")
    ErrCircuitOpen        = errors.New("circuit breaker is open")
    ErrModelNotFound      = errors.New("model not found")
    ErrTimeout            = errors.New("request timeout")
)

// Error wrapping for context
type ProxyError struct {
    Op       string    // Operation that failed
    Endpoint string    // Which endpoint
    Err      error     // Underlying error
    Time     time.Time // When it occurred
}

func (e *ProxyError) Error() string {
    return fmt.Sprintf("%s failed for %s: %v", e.Op, e.Endpoint, e.Err)
}

Graceful Degradation: Continue serving from healthy endpoints
Circuit Breakers: Automatically isolate failing endpoints
Retry Logic: Configurable retry strategies with backoff
Error Propagation: Structured error responses to clients
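
For callers to match the sentinel errors through a `ProxyError`, the wrapper needs an `Unwrap` method. The sketch below adds one and pairs it with a simple exponential backoff loop; `Unwrap` and `Retry` are additions made for illustration, and the choice to treat an open circuit as non-retryable is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrCircuitOpen = errors.New("circuit breaker is open")
	ErrTimeout     = errors.New("request timeout")
)

type ProxyError struct {
	Op       string
	Endpoint string
	Err      error
	Time     time.Time
}

func (e *ProxyError) Error() string {
	return fmt.Sprintf("%s failed for %s: %v", e.Op, e.Endpoint, e.Err)
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *ProxyError) Unwrap() error { return e.Err }

// Retry calls fn up to attempts times, doubling the delay after each
// failure. An open circuit is returned immediately without retrying.
func Retry(attempts int, base time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if errors.Is(err, ErrCircuitOpen) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(base << i) // base, 2*base, 4*base, ...
		}
	}
	return err
}

func main() {
	err := Retry(3, 10*time.Millisecond, func() error {
		return &ProxyError{
			Op: "proxy", Endpoint: "local-ollama",
			Err: ErrCircuitOpen, Time: time.Now(),
		}
	})

	var pe *ProxyError
	if errors.As(err, &pe) {
		fmt.Println(pe.Endpoint, errors.Is(err, ErrCircuitOpen)) // local-ollama true
	}
}
```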

Configuration Architecture¶

Configuration flows through the system using dependency injection:

graph LR
    Config[config.yaml] --> App[Application]
    App --> Handlers[HTTP Handlers]
    App --> Proxy[Proxy Engines]
    App --> LB[Load Balancer]
    App --> HC[Health Checker]
    App --> Disc[Discovery]
    App --> Sec[Security]

Observability¶

Built-in observability features:

Request Tracing: Unique request IDs and correlation
Metrics: Performance and health metrics
Logging: Structured JSON logging
Health Endpoints: /internal/health and /internal/status
Response Headers: Debugging information in HTTP headers

Testing Architecture¶

Comprehensive testing strategy:

Unit Tests: Test individual components in isolation
Integration Tests: Full request flow testing
Benchmark Tests: Performance testing of critical paths
Contract Tests: Ensure adapter implementations meet port contracts

Contract Testing¶

Ensuring adapters meet port contracts:

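// Shared test suite for proxy implementations.
// Named RunProxyContract rather than TestXxx: go test rejects Test functions
// whose signature is not func(t *testing.T).
func RunProxyContract(t *testing.T, factory ProxyFactory) {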
    tests := []struct {
        name string
        test func(t *testing.T, proxy ports.ProxyService)
    }{
        {"handles successful request", testSuccessfulRequest},
        {"handles streaming response", testStreamingResponse},
        {"handles connection failure", testConnectionFailure},
        {"respects timeout", testTimeout},
        {"preserves headers", testHeaderPreservation},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            proxy := factory.Create()
            tt.test(t, proxy)
        })
    }
}

Performance Considerations¶

Critical Path Optimisations¶

Endpoint Selection: O(1) for priority, O(n) worst case
Health Checks: Cached with TTL, async updates
Statistics: Lock-free atomic operations
Connection Pooling: Pre-warmed connections
Buffer Management: Object pooling to reduce allocations

Security Considerations¶

Rate Limiting: Protect against abuse and DoS
Request Validation: Size limits and content validation
Header Sanitisation: Clean and validate HTTP headers
Circuit Breakers: Protect downstream services
Trusted Proxies: Secure proxy header handling
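
Trusted proxy handling matters because per-IP rate limiting is only as reliable as the client address it keys on. A minimal sketch of one approach follows: `X-Forwarded-For` is honoured only when the direct peer falls inside a configured range. `TrustedProxies` and `ClientIP` are hypothetical names, and Olla's own header handling may differ.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// TrustedProxies holds the CIDR ranges whose forwarding headers are believed.
type TrustedProxies struct{ prefixes []netip.Prefix }

func NewTrustedProxies(cidrs ...string) (*TrustedProxies, error) {
	tp := &TrustedProxies{}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("invalid trusted proxy range %q: %w", c, err)
		}
		tp.prefixes = append(tp.prefixes, p)
	}
	return tp, nil
}

func (tp *TrustedProxies) contains(ip netip.Addr) bool {
	for _, p := range tp.prefixes {
		if p.Contains(ip) {
			return true
		}
	}
	return false
}

// ClientIP resolves the caller's address from the TCP peer (remoteAddr, in
// host:port form) and the X-Forwarded-For header value.
func (tp *TrustedProxies) ClientIP(remoteAddr, forwardedFor string) string {
	peer, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return remoteAddr // unparseable peer: fall back to the raw value
	}
	ip := peer.Addr().Unmap()
	if !tp.contains(ip) {
		return ip.String() // untrusted peer: ignore the header entirely
	}
	// Walk right to left, skipping hops that are themselves trusted proxies.
	hops := strings.Split(forwardedFor, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		hop, err := netip.ParseAddr(strings.TrimSpace(hops[i]))
		if err != nil {
			break // malformed entry: stop trusting the rest of the chain
		}
		if hop = hop.Unmap(); !tp.contains(hop) {
			return hop.String()
		}
	}
	return ip.String()
}

func main() {
	tp, err := NewTrustedProxies("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(tp.ClientIP("10.0.0.5:41000", "203.0.113.7, 10.0.0.9")) // 203.0.113.7
	fmt.Println(tp.ClientIP("198.51.100.2:5555", "203.0.113.7"))        // 198.51.100.2
}
```

In the second call the peer is not a trusted proxy, so the header is treated as client-supplied data and ignored.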

Next Steps¶

Review Technical Patterns for implementation patterns
See Circuit Breaker for resilience patterns
Check Testing Guide for testing strategies
Explore Benchmarking for performance testing