
Container Image Acceleration and Smart Caching

Most of us are familiar with Docker and deploying containers for microservices, yet scaling container deployment for machine learning presents unique challenges. Before exploring container acceleration, let's examine the hurdles involved in deploying containers at scale.

What constitutes a container image?

  • Filesystem Layers: A container image is constructed from a series of read-only layers stacked on top of one another. Each layer captures changes such as adding files or modifying configurations (the sketch after this list shows how to inspect these layers for a local image).
  • Manifest: This is a JSON document that outlines the image, detailing its layers, configuration, and additional metadata.
  • Large Model Weights: Machine learning models often have weights ranging from 20GB to 100GB, which complicates the process of loading the model into GPU memory.
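
To make the layer list concrete, here is a minimal sketch using the Docker SDK for Python (docker-py); the image tag is only an example, and the exact fields in the inspect output can vary with your runtime.

# pip install docker
import docker

client = docker.from_env()

# Pull a small example image and inspect it locally.
image = client.images.pull("python:3.11-slim")

# RootFS.Layers lists the content-addressed digests of the read-only layers.
for digest in image.attrs["RootFS"]["Layers"]:
    print("layer:", digest)

# The manifest (layers + config + metadata) can be viewed from the CLI with:
#   docker manifest inspect python:3.11-slim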

When deploying containers at scale, the complexity increases because each service typically relies on a custom container image. Managing these images throughout their lifecycle—from building and downloading to loading—introduces significant overhead during service startup.

"Pulling image is one of the time-consuming steps in the container lifecycle. Research shows that time to take for pull operation accounts for 76% of container startup time[FAST ‘16]"

Consider the time required to boot up an ML model—with an average size of 20GB—using a custom runtime and model weights. Now, imagine a truly multi-tenant system hosting over 1,000 models and 1,000 containers. How can we minimize the boot-up time for any given service in such an environment?

Metric                | Both Together | Container Image Optimised | Weights Optimised | Default
Scheduling and Mounts | 3 sec         | 3 sec                     | 3 sec             | 3 sec
Image Pull            | 5 sec         | 5 sec                     | 120 sec           | 120 sec
Container Start       | 5 sec         | 5 sec                     | 20 sec            | 20 sec
Model Weights Load    | 5 sec         | 60 sec                    | 5 sec             | 60 sec
Total                 | 18 sec        | 73 sec                    | 148 sec           | 203 sec
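
The totals are simply the sum of the four phases; a quick sketch with the numbers copied from the table makes the comparison easy to reproduce.

# Startup phases (seconds) per setup, copied from the table above.
setups = {
    "Both Together":             {"schedule": 3, "pull": 5,   "start": 5,  "weights": 5},
    "Container Image Optimised": {"schedule": 3, "pull": 5,   "start": 5,  "weights": 60},
    "Weights Optimised":         {"schedule": 3, "pull": 120, "start": 20, "weights": 5},
    "Default":                   {"schedule": 3, "pull": 120, "start": 20, "weights": 60},
}

for name, phases in setups.items():
    print(f"{name}: {sum(phases.values())} sec")
# Both Together: 18 sec, Container Image Optimised: 73 sec,
# Weights Optimised: 148 sec, Default: 203 sec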

Let's go through all four setups and how you can implement each of the optimizations:

Default:

Most open-source frameworks—such as MLFlow, KServe, and Seldon.io—follow a common deployment model where each model is deployed as an individual service. As traffic for a particular model increases, the service scales according to its deployment configuration. This process involves retrieving the container image, booting the container, and then loading the model weights.

Weights Optimized:

To achieve this, you must implement two high-level controllers. First, a Router Controller manages the API request routing data for each model service. Next, a Model Weights Lifecycle Manager, combined with a node labeling strategy, orchestrates the lifecycle of model weights. This setup ensures that pods are scheduled only on nodes that already have the necessary weights. For new model deployments, a cron job is triggered to download the model onto all nodes labeled for that specific model, with the number of labeled nodes and overall cluster configuration being dynamically determined based on traffic metrics.
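
Here is a rough sketch of the node-labeling side using the official Kubernetes Python client; the model name, label key, and the labeling logic are assumptions for illustration, not any specific framework's API.

# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

MODEL = "llama-7b"                          # hypothetical model name
LABEL_KEY = f"models.example.com/{MODEL}"   # hypothetical label key

def label_nodes_with_weights(node_names):
    """Mark nodes that already hold this model's weights so the scheduler
    (via nodeSelector) only places the model's pods on them."""
    for name in node_names:
        body = {"metadata": {"labels": {LABEL_KEY: "cached"}}}
        v1.patch_node(name, body)

# The model's Deployment then pins pods to labeled nodes by setting
#   spec.template.spec.nodeSelector = {LABEL_KEY: "cached"}
# and the cron job downloads the weights before the label is applied,
# so scheduling never races the download.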

Container Image Acceleration:

To accomplish this, we need to break down the Docker image by extracting its individual layers into files. For instance, imagine you have multiple Docker images that share the same CUDA and Python versions but differ in their specific layers.

Image 1:

RUN apt-get update && apt-get install -y python3-pip && pip3 install transformers

Image 2:

RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install transformers

Although both images end up with the same filesystem content, they have different layers, which prevents caching and reuse between them.

To enable reuse, we have to deconstruct the concept of a layer and reimagine a layer as a group of files. Using an open-source framework such as Nydus, we can instruct the container runtime not to use traditional layers but a different snapshotter that reconstructs the layers from file metadata. With this, multiple container images can share the same underlying files. Now assume two models each need a different runtime, but the runtimes share 80% of their files: instead of re-downloading the entire image, we only need to download the 20% of files required to run the new container.
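
The intuition behind file-level (rather than layer-level) reuse can be sketched with plain content hashing; this is not how Nydus is implemented internally, just an illustration of why two runtimes that share most of their files only need a small incremental download.

import hashlib
from pathlib import Path

def file_digests(rootfs: Path) -> dict[str, str]:
    """Map every file in an unpacked image rootfs to a content hash."""
    digests = {}
    for f in rootfs.rglob("*"):
        if f.is_file():
            digests[str(f.relative_to(rootfs))] = hashlib.sha256(f.read_bytes()).hexdigest()
    return digests

def files_to_fetch(cached: dict[str, str], wanted: dict[str, str]) -> set[str]:
    """Only files whose content hash is not already cached need to be pulled."""
    cached_hashes = set(cached.values())
    return {path for path, digest in wanted.items() if digest not in cached_hashes}

# If runtime A and runtime B share ~80% of their file contents, files_to_fetch()
# returns only the remaining ~20%, no matter how their layers were built.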

Furthermore, we can track the utilization of these images and keep the common files cached across the nodes. This helps reduce the pull time for every image.
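
A sketch of choosing which files to keep warm on every node, weighting file hashes by how often the images that contain them are pulled; the image names, pull counts, and file hashes below are made up for illustration.

from collections import Counter

def hot_files(images: dict[str, set[str]], pulls: dict[str, int], top_n: int = 3) -> list[str]:
    """Score each file hash by the pull frequency of the images containing it,
    then keep the highest-scoring files cached on every node."""
    score = Counter()
    for image, file_hashes in images.items():
        for digest in file_hashes:
            score[digest] += pulls.get(image, 0)
    return [digest for digest, _ in score.most_common(top_n)]

# Hypothetical data: two runtimes sharing the CUDA/Python files "a" and "b".
images = {"runtime-a": {"a", "b", "c"}, "runtime-b": {"a", "b", "d"}}
pulls = {"runtime-a": 120, "runtime-b": 80}
print(hot_files(images, pulls))  # ['a', 'b', ...] -> pre-cache these on all nodes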

But what happens on a cache miss? The container would have to go back to a central registry to pull the files that are not cached, and if many new instances start at once, the central registry may throttle. To solve this, we can add P2P acceleration: a node first asks adjacent nodes for the missing files and falls back to the central registry only when no peer has them.
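
A minimal sketch of that lookup order; the two fetch callables are placeholders you would back with your actual transport, and real P2P distributors (Dragonfly, for example) handle peer discovery, chunking, and throttling for you.

from typing import Callable, Optional

def fetch_file(
    file_hash: str,
    peers: list[str],
    fetch_from_peer: Callable[[str, str], Optional[bytes]],
    fetch_from_registry: Callable[[str], bytes],
) -> bytes:
    """Ask adjacent nodes first; fall back to the central registry only when
    no peer has the file, so the registry never becomes a hot spot."""
    for peer in peers:
        data = fetch_from_peer(peer, file_hash)
        if data is not None:
            return data
    return fetch_from_registry(file_hash)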

Combining Container Image Acceleration and Weights Caching:

To scale ML services effectively, one of the key ingredients is traffic data. For this, we can add a plugin in Istio that collects usage data, plus a separate controller that takes care of caching and distribution. The traffic data can then be used to split the GPU cluster into shards such that each shard gets a mixed bag of model utilization. Having the nodes pre-build the cache for model weights and container images keeps p50 and p90 model load times up to 10x faster than a typical setup. Even in the worst case it is still 2-3x faster, because container image acceleration means we never have to deal with pulling large layered container images.
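
A rough sketch of the sharding idea, assuming you already export per-model request rates from the Istio telemetry plugin; the greedy balancing and the example numbers are illustrative, not a production scheduler.

import heapq

def shard_models(model_qps: dict[str, float], num_shards: int) -> list[list[str]]:
    """Greedily spread models across GPU shards so each shard gets a mixed bag
    of hot and cold models (roughly balanced total traffic)."""
    shards = [(0.0, i, []) for i in range(num_shards)]
    heapq.heapify(shards)
    for model, qps in sorted(model_qps.items(), key=lambda kv: -kv[1]):
        load, idx, members = heapq.heappop(shards)   # least-loaded shard
        members.append(model)
        heapq.heappush(shards, (load + qps, idx, members))
    return [members for _, _, members in sorted(shards, key=lambda s: s[1])]

# Each shard then pre-pulls the container files and model weights for its
# members, keeping p50/p90 load times close to the "Both Together" row above.
print(shard_models({"m1": 900, "m2": 500, "m3": 400, "m4": 100}, num_shards=2))
# e.g. [['m1', 'm4'], ['m2', 'm3']]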