
Three-Tier Storage Architecture for Fast LLM Inference in the Cloud

Introduction and Motivation

Large Language Model (LLM) inference workloads deal with extremely large model files (often many gigabytes) that must be loaded quickly and repeatedly across distributed GPU instances. Traditional single-tier storage (either solely local disk or only remote cloud storage) cannot meet the throughput and latency demands of serving these models at scale. For example, if thousands of inference nodes each pull a 20–100 GB model from a cloud object store, network bandwidth and storage I/O quickly become bottlenecks. To tackle this, modern AI infrastructure employs a three-tiered storage architecture optimized for both performance and cost:

  1. Hot Tier – Local NVMe with POSIX Interface: Each compute node has a fast NVMe SSD (or NVMe-based instance storage) mounted as a POSIX filesystem. This provides “hot” data access with extremely high throughput (on the order of ~2.5 GB/s or more per node) and low latency, ensuring model files can be read at near in-memory speeds.
  2. Warm Tier – Intra-Cluster File Sharing: On the cluster level, nodes share data with each other. Each machine exposes a service (or participates in a distributed filesystem) so that peers can retrieve model files from one another over the high-speed network. This “warm” layer means if one node has a model cached on its NVMe, other nodes can fetch it at ~2 GB/s over the LAN instead of all hitting the cloud storage.
  3. Cold Tier – Cloud Object Storage: A durable object store (like AWS S3, Azure Blob, GCS, etc.) serves as the “cold” storage for all model and data artifacts. It provides virtually infinite capacity and durability, but with higher latency (hundreds of milliseconds) and lower per-connection throughput. The cold tier is the source of truth from which data is pulled into the faster tiers on demand, and it’s used as a fallback for cache misses or to populate the upper tiers. Even here, multithreaded downloads with parallel ranged pulls can still achieve around 1 GB/s.

By combining these layers, we can dramatically accelerate model loading while minimizing repeated transfers from slow storage. The local NVMe tier handles the bulk of reads at maximal speed, the cluster-sharing tier prevents duplicate downloads across nodes, and the cloud object store ensures data persistence and easy synchronization. This architecture is increasingly considered a best practice for LLM serving in cloud environments, where we aim to achieve on the order of 2.5 GB/s read throughput from local disk and ~1.5 GB/s from peer nodes, versus ~1 GB/s directly from remote storage. In the sections below, we break down each tier, discuss design considerations, and survey technologies and products implementing such multi-tier storage systems.

Tier 1: Hot Storage – Local NVMe with POSIX File Interface

What it is: The hot tier is typically a local NVMe SSD on each GPU server, formatted with a POSIX filesystem (e.g. ext4 or XFS) so that applications can read and write normally. This tier acts as a cache for the most actively accessed data (model weights, vocabulary files, etc.) and delivers data directly over the PCIe bus to the CPU/GPU with minimal overhead. Many cloud GPU instance types come with high-performance ephemeral NVMe storage, or support attaching NVMe-based volumes, which can be leveraged for this purpose.

Performance: NVMe SSDs provide very high sequential throughput and low latency. In practice, a single NVMe drive can stream data at multiple gigabytes per second. For example, reading a '.safetensors' model file from the page cache on NVMe yielded about 2.5 GB/s of read throughput, which meant a 512 MB model file loaded in only ~0.2 s from the local cache (versus ~1 s originally). Modern NVMe drives (PCIe 4.0/5.0) can sustain 5–7 GB/s in sequential reads, so a target of 2–2.5 GB/s per model file is quite attainable with proper tuning. The local NVMe tier also offers latency in the hundreds of microseconds, which is only marginally higher than system memory and dramatically lower than network or cloud storage latencies (which are milliseconds or more).

POSIX interface: Exposing this cache as a standard filesystem is important for compatibility – LLM inference frameworks (TensorFlow, PyTorch, Hugging Face, etc.) can memory-map or read model files without code changes. A POSIX-like interface with full support for file operations and mmap ensures that even libraries that demand a filesystem (for example, torch.load of a model checkpoint, or mmap of a Safetensors file) will work seamlessly. In fact, the Safetensors format specifically benefits from mmap and random-access reads, which perform best when the file is on a local SSD or in memory rather than on a high-latency remote store.
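For illustration, here is a minimal sketch of reading weights from the local NVMe cache via mmap, assuming the safetensors and torch packages are installed; the cache path is hypothetical:

```python
from safetensors import safe_open

CACHE_PATH = "/nvme/model-cache/llama-70b/model.safetensors"  # hypothetical cache path

# safe_open memory-maps the file, so tensors are paged in lazily from the
# local NVMe (or served straight from the OS page cache if it is already warm).
with safe_open(CACHE_PATH, framework="pt", device="cpu") as f:
    tensors = {name: f.get_tensor(name) for name in f.keys()}
```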

Usage: In practice, when deploying an LLM service, one would stage the model files onto each node’s NVMe if possible (either ahead of time or on first use). This might involve copying from the warm tier or cold tier to the local disk. Once on NVMe, all subsequent reads hit the local filesystem cache. It’s common to use techniques like read-ahead and prefetching to optimize this tier. For instance, one can sequentially read or mmap the model file at startup to load it entirely into the OS page cache, so that future accesses are served from memory or disk with minimal waiting, as in the sketch below. Ensuring the NVMe is formatted with an appropriate filesystem (e.g. ext4 or XFS) and mounted with sensible options also helps sustain full throughput.
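As a concrete (Linux-only) example of such a warm-up, the sketch below hints the kernel with posix_fadvise and then streams the file in large chunks to pull it into the page cache; the path and chunk size are assumptions:

```python
import os

MODEL_PATH = "/nvme/model-cache/llama-70b/model.safetensors"  # hypothetical path
CHUNK = 64 * 1024 * 1024  # 64 MiB reads keep the NVMe queue busy

fd = os.open(MODEL_PATH, os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    # Tell the kernel the whole file will be needed soon (Linux only).
    os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
    # Sequentially touch every byte so later reads or mmap faults hit memory.
    while os.read(fd, CHUNK):
        pass
finally:
    os.close(fd)
```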

Tier 2: Warm Storage – Cluster-Wide Sharing of Model Data

What it is: The warm tier consists of storage within the cluster (or availability zone) that enables nodes to retrieve data from peers or a central cache at LAN speeds. In effect, it creates a distributed cache of “warm” data that is not necessarily on the local disk, but is much closer and faster than the remote cloud object store. There are a couple of design patterns for this tier:

Peer-to-Peer File Serving: Each node that downloads a model to its NVMe can act as a file server for others. This could be as simple as running an NFS or HTTP server on each machine’s cache directory, or a more sophisticated P2P system. When a node needs a model it doesn’t have locally, it first checks whether any peer already has it (via a lookup service or distributed hash table) and then pulls the file from that peer over the network. Because intra-cloud networks are high-bandwidth (10 Gbps ~ 1.25 GB/s, 25 Gbps, or more) and low-latency, this yields much faster throughput than fetching from a distant blob store. It also avoids redundant traffic: only one node pulls from the cold tier, and the rest piggy-back off that copy. This approach is akin to BitTorrent-style distribution or the image caches used in large Kubernetes clusters.
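As a rough illustration of that peer-first lookup, here is a minimal sketch; the registry endpoint, peer HTTP port, cache directory, and S3 bucket are all hypothetical, and the S3 fallback assumes boto3:

```python
import os
import shutil
import urllib.request

import boto3

CACHE_DIR = "/nvme/model-cache"                       # hypothetical local cache
REGISTRY_URL = "http://cache-registry.internal:8080"  # hypothetical lookup service
BUCKET = "my-model-bucket"                            # hypothetical S3 bucket


def fetch_model(name: str) -> str:
    local_path = os.path.join(CACHE_DIR, name)
    if os.path.exists(local_path):            # hot tier: already on local NVMe
        return local_path

    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    try:                                      # warm tier: ask the registry for a peer
        with urllib.request.urlopen(f"{REGISTRY_URL}/locate/{name}", timeout=2) as r:
            peer = r.read().decode().strip()
        with urllib.request.urlopen(f"http://{peer}:8000/{name}") as src, \
                open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst, 16 * 1024 * 1024)
    except Exception:                         # cold tier: fall back to the object store
        boto3.client("s3").download_file(BUCKET, name, local_path)
    return local_path
```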

Performance: The warm tier should deliver on the order of gigabytes per second of throughput over the network; a well-designed intra-cluster sharing layer can approach the line rate of the NICs. A system that lets data cached on one node’s NVMe be read by other nodes has, in a production scenario, achieved ~1.5 GB/s throughput for warm reads of a 100 GB model file (far exceeding what an HDFS-based or earlier solution could do). In theory, with enough peers serving concurrently, such a system can deliver data to the local cluster at up to 10 GB/s in aggregate.

Tier 3: Cold Storage – Cloud Blob/Object Stores

What it is: The cold tier is typically a cloud object storage service or distributed blob store that holds the persistent copies of all your models, checkpoints, and datasets. Examples include Amazon S3, Google Cloud Storage, Azure Blob Storage, or on-premise object stores like Ceph RADOS Gateway, MinIO, or Cloudian. This layer is cold in the sense that data here is not immediately accessed for every inference; it is pulled rarely – only when a cache miss occurs or when initially loading a new model into the system. Cold storage is optimized for durability, scalability, and cost rather than low latency. It can store petabytes of data with 11 nines of durability, but individual read requests might have 50–200 ms latency and limited single-stream throughput (often 100–200 MB/s per thread to S3 is common, though many threads or ranged GETs can scale this up).

Role in the architecture: The object store is the single source of truth. All data originates here (either uploaded from training jobs or published by model developers) and the higher tiers are caches of subsets of this data. The cold tier ensures that if a node cache is lost (say an instance is terminated) or if the warm layer does not have a particular model cached, there is a reliable copy to pull from. In cloud deployments, using a managed object store is extremely convenient due to its durability and global accessibility – for example, you might have trained a model and saved it to an S3 bucket, and your inference cluster (possibly in another region or cloud) can fetch it from that bucket when needed. Cold storage also often ends up serving as the archive for rarely used models – older versions can be kept in S3 and not occupy expensive NVMe space until needed.

Performance characteristics: Cold storage is the slowest of the tiers, but it can be parallelized. Cloud object stores have high aggregate bandwidth (since they are distributed systems) – AWS S3, for instance, can deliver on the order of a gigabyte per second to a single node when large or many concurrent requests are issued, and far more in aggregate across a cluster. Thus, best practices for this tier include using concurrent connections and range requests to speed up transfers.
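For instance, a minimal sketch of concurrent ranged GETs with boto3 might look like the following; the bucket, key, and destination path are hypothetical, and production setups may prefer boto3's TransferConfig, s5cmd, or the AWS CRT client instead:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-model-bucket"                              # hypothetical bucket
KEY = "llama-70b/model.safetensors"                     # hypothetical object key
DEST = "/nvme/model-cache/llama-70b/model.safetensors"  # hypothetical cache path
PART = 64 * 1024 * 1024  # 64 MiB ranges; larger parts amortize per-request latency


def get_range(offset: int, size: int) -> bytes:
    # Each worker issues an HTTP range request for its slice of the object.
    resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range=f"bytes={offset}-{offset + size - 1}")
    return resp["Body"].read()


total = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
os.makedirs(os.path.dirname(DEST), exist_ok=True)
with ThreadPoolExecutor(max_workers=16) as pool:
    # map() preserves order, so the downloaded parts can be written sequentially.
    parts = pool.map(lambda off: get_range(off, min(PART, total - off)),
                     range(0, total, PART))
    with open(DEST, "wb") as out:
        for chunk in parts:
            out.write(chunk)
```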

Cost considerations: Cloud object stores charge not only for storage capacity but also for data egress and API requests. Pulling a 50 GB model to 100 nodes could incur significant egress fees if those nodes are in a different region or if the data is not cached. Thus, minimizing direct cold reads isn’t just about speed – it’s also about cost savings. Every cache hit on the warm tier saves potentially dozens of S3 GET requests and dozens of gigabytes of data transfer from S3.

Best Practices for Implementation

Designing and operating a three-tier storage stack for LLM inference requires careful attention to a few best practices to get the most benefit:

  • Prefetch and Warm-Up: Proactively warm up caches for known workloads. If you know a particular model will be deployed to N nodes, have one node fetch it from cold storage ahead of time, then use the warm layer to distribute it. Some systems let you trigger an explicit prefetch that populates the cache on all nodes. Even without special commands, you can simply start one instance of the model, let it pull from S3 (filling the warm cache), then start the rest, which will hit the cache.
  • High-Bandwidth Networking: Ensure the cluster network can handle the load. Intra-cluster transfers will spike when a new model is broadcast. If each of 8 machines needs to pull a 50 GB model at ~1 GB/s, that’s 400 GB of data moving at an aggregate ~8 GB/s (roughly 64 Gbps). Use at least 10 GbE, and preferably 25 GbE or higher for serious LLM clusters.
  • Concurrent I/O and Async Loading: LLM frameworks often allow loading in parallel with other work. You might spawn multiple threads to read different parts of the model file from disk or from peers concurrently (taking advantage of multiple NVMe queues or TCP connections). Many caching systems do this automatically, but if rolling your own, don’t rely on a single-threaded copy for large files. Also, use large read sizes or enable read-ahead (e.g., Linux’s readahead setting for the block device or file) to fully utilize disk bandwidth.
  • Monitor and Tune Cache Hit Rates: Measure how often your inference nodes are still hitting the cold tier. If you see frequent accesses to S3 for the same file, that indicates the warm tier might be under-sized or misconfigured. The goal is that after an initial flurry, steady-state reads come from local NVMe or, at worst, from another node’s NVMe. Track a cache hit ratio metric, and check S3 access logs or cloud monitoring to see whether multiple GETs are occurring for the same object. Aim for a near-100% cache hit rate on repeated model loads; if you fall short, you may need to increase the cache size or revise the eviction policy.
  • Cache Eviction and Space Management: Decide how much of each node’s NVMe to devote to caching and how to evict. If you have a limited SSD size and many models, use LRU or LFU eviction so that the least-used models get removed when space is needed for new ones (see the sketch below). Some systems do this automatically. Ensure that eviction doesn’t touch critical data – e.g., if the same NVMe holds both the model cache and the OS or scratch space, partition it logically. It’s often wise to dedicate a separate mount or directory solely for the model cache so you can manage its size. Also consider compressing cold copies (some systems can compress data when tiering to object storage to save space).
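A minimal sketch of such LRU eviction over a dedicated cache directory is shown below; the mount point and byte budget are assumptions, and it uses file access times as the recency signal (a noatime mount would need a separate last-used record):

```python
import os

CACHE_DIR = "/nvme/model-cache"   # hypothetical dedicated cache mount
CACHE_BUDGET = 1_500 * 1024**3    # e.g. reserve ~1.5 TB for model files


def evict_lru(cache_dir: str = CACHE_DIR, budget: int = CACHE_BUDGET) -> None:
    files = []
    for root, _, names in os.walk(cache_dir):
        for name in names:
            path = os.path.join(root, name)
            st = os.stat(path)
            # atime approximates "last used"; requires a mount that updates atime.
            files.append((st.st_atime, st.st_size, path))

    total = sum(size for _, size, _ in files)
    # Remove the least recently used files until the cache fits within the budget.
    for _, size, path in sorted(files):
        if total <= budget:
            break
        os.remove(path)
        total -= size
```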

Conclusion

Deploying large-scale LLM inference in the cloud demands a storage approach that can keep hundreds of GPUs fed with data without undue delay. A three-tier storage architecture – combining a fast local NVMe layer, a cluster-wide sharing/caching layer, and a scalable cloud object store – has emerged as a best practice to achieve this. This design takes advantage of the strengths of each layer: the speed of local SSDs, the high bandwidth of LAN, and the durability and elasticity of cloud storage. By using techniques like caching, prefetching, and distributed file serving, it’s possible to approach hardware limits (gigabytes per second throughput) for model loading, thus minimizing GPU idle time waiting for weights to load.

For organizations building their LLM infrastructure, the path to a unified storage stack usually involves some trade-off between convenience, cost, and control. Managed services can offload the heavy lifting but might be less flexible in fine-tuning. Open-source gives full control but requires operational expertise. The good news is that the patterns are well-established – even if one starts simple (say, an NFS server caching S3), it’s straightforward to evolve toward more distributed and scalable solutions as needed. Adopting this layered approach is key to avoiding storage becoming the bottleneck in AI deployments. When done right, your GPUs spend more time inferencing and less time waiting on data, and your cloud costs for bandwidth are kept in check by the efficiency of caching.