Architecture

Ollama is designed to run large language models locally with low resource overhead. It uses a client-server model: a long-running Ollama daemon manages model loading, inference, and resource allocation, and clients send requests to that daemon.
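To make the client-server split concrete, here is a minimal client sketch in Python. It assumes the daemon is listening on Ollama's default local address (http://localhost:11434) and uses the `/api/generate` endpoint with a non-streaming request. The helper names `build_generate_request` and `generate` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: the daemon is running locally on Ollama's default port.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str,
             base_url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send the request to the daemon and return the generated text."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # In a non-streaming reply, the generated text is in the "response" field.
    return body["response"]
```

With the daemon running and a model already installed, calling `generate("llama3.2", "Why is the sky blue?")` would return the completion as a string. The model name here is only an example. The client holds no model state of its own: loading and inference happen inside the daemon.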

The core architecture has three parts: a model management layer that handles downloading, caching, and versioning of language models; an inference engine that runs on CPU or with GPU acceleration; and a REST API server that exposes HTTP endpoints for model interaction. Quantization reduces the memory footprint of model weights, usually at a small cost in output quality.
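The effect of quantization on memory can be estimated with simple arithmetic: weight storage is roughly the parameter count times the bits per weight, divided by eight. The sketch below is a back-of-the-envelope calculation, not Ollama's own accounting. The function names are hypothetical, and the estimate ignores the KV cache, activations, and the per-block scale factors that real quantized formats store (which push the effective bits per weight slightly above the nominal value).

```python
def estimate_weight_bytes(num_params: int, bits_per_weight: float) -> int:
    """Approximate bytes needed to store the weights alone."""
    return int(num_params * bits_per_weight / 8)


def format_gib(num_bytes: int) -> str:
    """Render a byte count in GiB with one decimal place."""
    return f"{num_bytes / 2**30:.1f} GiB"


# Illustrative comparison for a hypothetical 7-billion-parameter model.
PARAMS = 7_000_000_000
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {format_gib(estimate_weight_bytes(PARAMS, bits))}")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage to a quarter, which is often the difference between a model fitting in consumer RAM or VRAM and not fitting at all.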

Key architectural components include:

  • Model registry for tracking available and installed models

  • Memory-mapped file access for efficient model loading

  • Request queue management for concurrent inference

  • Built-in model quantization and optimization

  • Cross-platform compatibility layer
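As an illustration of why memory-mapped loading is efficient, the sketch below maps a file read-only and reads a slice from it with Python's standard `mmap` module. This is a generic demonstration of the technique, not Ollama's loader. The function name `read_slice` is hypothetical, and the file is treated as opaque bytes rather than any particular model format.

```python
import mmap


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset` from a memory-mapped file.

    The operating system pages data in on demand, so only the touched
    region has to be read from disk, however large the file is.
    Note: mapping an empty file raises ValueError.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]
```

Because mapped pages are backed by the file itself, the operating system can share them between processes and evict them under memory pressure without writing anything back to disk, which suits large read-only weight files.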

The architecture supports both CPU-only and GPU-accelerated inference. Available hardware is detected automatically, and supported GPU backends include NVIDIA CUDA, AMD ROCm, and Apple Metal.