Architecture
Ollama uses a lightweight client-server architecture designed to run large language models locally with low resource overhead. A long-running Ollama daemon acts as the server and manages model loading, inference, and resource allocation, while clients send it requests.
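The client side of this model can be sketched in a few lines of Python. This is a minimal illustration, not Ollama's own client code: it assumes the daemon is listening on its default local address (http://localhost:11434) and uses the /api/generate endpoint. The helper names build_generate_request and generate, and the model name used in the usage note, are hypothetical.

```python
import json
import urllib.request

# Assumption: the daemon listens on Ollama's default local address.
DEFAULT_BASE_URL = "http://localhost:11434"


def build_generate_request(model, prompt, base_url=DEFAULT_BASE_URL):
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of streamed chunks
    }
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, base_url=DEFAULT_BASE_URL, timeout=120):
    """Send the request to a running daemon and return the generated text."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Calling generate("llama3", "Why is the sky blue?") only works when a daemon is running and that model has already been pulled; "llama3" is just an example name. Building the request separately from sending it keeps the client logic inspectable without a live server.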
The core architecture consists of three parts: a model management layer that handles downloading, caching, and versioning of language models; an inference engine that runs on the CPU and can use GPU acceleration; and a REST API server that provides standardized endpoints for model interaction. The system uses quantization to reduce the memory footprint while preserving most of the model's output quality.
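To make the effect of quantization on memory concrete, the weights alone occupy roughly (number of parameters × bits per weight) / 8 bytes. The sketch below applies that arithmetic. It is a back-of-the-envelope estimate under stated assumptions: it ignores the per-block scaling data that real quantization formats store, as well as the context cache and runtime buffers, and the function names are illustrative.

```python
def estimate_weight_bytes(num_parameters, bits_per_weight):
    """Rough size of the weights alone: parameters * bits / 8."""
    return num_parameters * bits_per_weight / 8


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    params = 7_000_000_000  # a hypothetical 7-billion-parameter model
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = to_gb(estimate_weight_bytes(params, bits))
        print(f"{label}: about {size:.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the estimate for a 7-billion-parameter model from about 14 GB to about 3.5 GB, which is the difference between needing a workstation GPU and fitting in ordinary laptop memory.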
Key architectural components include:
- Model registry for tracking available and installed models
- Memory-mapped file system for efficient model loading
- Request queue management for concurrent inference
- Built-in model quantization and optimization
- Cross-platform compatibility layer
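The idea behind memory-mapped loading from the list above can be illustrated with Python's standard mmap module. This is a generic sketch of the technique, not Ollama's loader: a small temporary file stands in for a model file, and the helper names are hypothetical. Mapping the file lets the operating system page in only the regions that are actually touched, instead of copying the whole file into memory up front.

```python
import mmap
import os
import tempfile


def make_demo_file(size_bytes):
    """Create a stand-in for a model file: a temp file with a repeating 0..255 pattern."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size_bytes)))
    return path


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes starting at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; pages are loaded lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


if __name__ == "__main__":
    path = make_demo_file(4096)
    try:
        chunk = read_slice_via_mmap(path, 256, 4)
        print(list(chunk))
    finally:
        os.remove(path)
```

For multi-gigabyte weight files the same pattern means startup cost is dominated by the pages actually read, and the operating system can share those pages across processes that map the same file.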
The architecture supports both CPU-only and GPU-accelerated inference. It automatically detects and uses the available hardware, including NVIDIA CUDA, AMD ROCm, and Apple Metal.
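The general shape of such a hardware check can be sketched as a fallback chain. The heuristic below is purely illustrative and is an assumption, not a description of how Ollama performs detection: it treats macOS as Metal-capable and otherwise looks for the vendor command-line tools nvidia-smi and rocm-smi on the PATH, falling back to CPU. The function names are hypothetical.

```python
import platform
import shutil


def choose_backend(system, has_tool):
    """Pick a backend name from the OS name and a tool-lookup callable.

    Illustrative heuristic only; real detection queries drivers and devices.
    """
    if system == "Darwin":
        return "metal"
    if has_tool("nvidia-smi"):
        return "cuda"
    if has_tool("rocm-smi"):
        return "rocm"
    return "cpu"


def detect_backend():
    """Apply the heuristic to the current machine."""
    return choose_backend(
        platform.system(),
        lambda tool: shutil.which(tool) is not None,
    )


if __name__ == "__main__":
    print(detect_backend())
```

Passing the operating system name and the lookup function as arguments keeps the decision logic separate from the machine it runs on, so each branch can be exercised without the corresponding hardware.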