Technical Deep Dive

Cold Start Engineering

From 26-minute boots to 7-second snaps — how memory snapshots, JIT caching, and volume symlinks unlock scale-to-zero GPU economics.

Systems & Infrastructure·5 min read·Production Verified

TL;DR Summary

  • Highest-Leverage Optimization: GPU memory snapshots deliver 5× to 24× cold start reductions across diffusers, llama.cpp, and SGLang, more than every other change combined.
  • In-process Loading Mandatory: Capturing full GPU state with CRIU requires in-process model loading; child processes (e.g. subprocess.Popen) are invisible to the snapshot system.
  • SGLang & llama-server Hacks: SGLang requires --enable-memory-saver to offload the KV cache to CPU during capture. llama-server requires --no-mmap for checkpointing compatibility.
  • Warmup Requests: Running warmup queries before snapshotting forces CUDA JIT kernel compilation, keeping it out of the cold start path.

GPU memory snapshots are the single most impactful optimization, outweighing all other cold start improvements combined. Across the four deployments measured here, which span four inference engines and four model architectures, snapshots deliver 5× to 24× cold start reductions.

Key Metrics

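| Model | Engine | Before | After | Improvement |
| --- | --- | --- | --- | --- |
| FLUX.2-klein-9B | diffusers | 48s | ~7s | 6.9× |
| GLM-4.7-Flash GGUF | llama-cpp-python | 110-168s | 2-7s | 24× |
| Gemma 4 26B GGUF | llama-server | 60-120s | 5-15s | 5-10× |
| Qwen3.5-35B-A3B FP8 | SGLang | 2-5 min | 12-20s | ~10× |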
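
The Improvement column is the ratio of cold start time before to cold start time after. A minimal sketch of that arithmetic, using figures from the table; the function name `speedup` is illustrative, and pairing the slowest GLM "before" with the slowest "after" is an assumption about how the 24× figure was derived:

```python
def speedup(before_s: float, after_s: float) -> float:
    """Return how many times shorter the cold start became."""
    if after_s <= 0:
        raise ValueError("after_s must be positive")
    return before_s / after_s


# FLUX.2-klein-9B: 48s before, ~7s after.
flux = speedup(48, 7)

# GLM-4.7-Flash GGUF: slowest "before" (168s) against slowest "after" (7s).
# Assumption: this pairing is what the 24x figure refers to.
glm = speedup(168, 7)

print(f"FLUX: {flux:.1f}x, GLM: {glm:.0f}x")  # prints "FLUX: 6.9x, GLM: 24x"
```

Because both columns are ranges for most rows, the ratio depends on which endpoints are compared, which is why several rows report a range or an approximate value.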

The Discovery

The baseline cold start for FLUX.2-klein-9B was 48.3s. The CPU→GPU transfer of ~18 GB of bf16 weights dominated, accounting for 35.0s of overhead. Enabling GPU memory snapshots captured the full CUDA state, so containers restored in ~3s and that overhead dropped from 35.0s to 3.1s (about 11× lower).

Marginal improvements (pre-importing libraries, reducing wait_ms, baking weights into the image) together saved another few seconds, but snapshots accounted for more than 90% of the improvement.

Architectural Insight: Subprocess Kills Snapshots

The GLM-4.7-Flash GGUF deployment initially used subprocess.Popen to run the llama.cpp server. Cold starts took 110-168s because Modal's GPU snapshot system (CRIU) could not capture the child process's GPU state.

Switching to in-process Llama(...) loading lets Modal capture the entire Python process, including its CUDA context. That change alone cut cold starts to 13-33s. Further optimizations (runtime image, pre-built wheels, single warmup) brought them down to 2-7s.
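
As a rough illustration of what "in-process" means here, the model is constructed by the same Python process that serves requests, rather than by a child started with subprocess.Popen. The sketch below uses the `Llama` class from llama-cpp-python; the function name `load_in_process` is hypothetical, and how the call is wired into the Modal lifecycle is not shown in the original and is left out:

```python
def load_in_process(model_path: str):
    """Load a GGUF model inside the current Python process.

    The import is deferred so this module can still be imported on a
    machine where llama-cpp-python is not installed.
    """
    from llama_cpp import Llama  # third-party: llama-cpp-python

    # n_gpu_layers=-1 asks llama.cpp to offload every layer to the GPU.
    # The resulting CUDA context belongs to this process, not to a child.
    return Llama(model_path=model_path, n_gpu_layers=-1)
```

The returned object would then be kept on the serving class and called directly for inference, so there is no separate server process holding the GPU state.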

SGLang Memory-Saver Hooks

For SGLang deployments (Qwen3.5-35B), the snapshot path uses SGLang's built-in hooks in three steps. First, the server calls /release_memory_occupation to offload the KV cache to CPU. Next, Modal captures the full container state, including GPU memory. Finally, on restore, the server calls /resume_memory_occupation.

This requires the --enable-memory-saver and --enable-weights-cpu-backup flags. Setting TORCHINDUCTOR_COMPILE_THREADS=1 additionally prevents OOM during snapshot creation, which was caused by the TorchInductor thread pool competing for VRAM.
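
The flags, environment variable, and hook endpoints described above can be sketched as follows. This is an illustration under assumptions, not the deployment's actual code: the helper names, the model path, the port, and the use of a POST with an empty JSON body are all assumed.

```python
import json
import os
import sys
import urllib.request


def sglang_launch(model_path: str, port: int = 30000):
    """Return (argv, env) for an SGLang server with the memory-saver flags."""
    argv = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--enable-memory-saver",
        "--enable-weights-cpu-backup",
    ]
    # Limit TorchInductor to one compile thread to avoid OOM during capture.
    env = dict(os.environ, TORCHINDUCTOR_COMPILE_THREADS="1")
    return argv, env


def call_hook(base_url: str, hook: str, timeout_s: float = 120.0) -> int:
    """POST an empty JSON body to a memory hook and return the HTTP status.

    Assumption: the hook accepts POST with an empty JSON object.
    """
    req = urllib.request.Request(
        f"{base_url}/{hook}",
        data=json.dumps({}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


# Hypothetical model path, for illustration only.
argv, env = sglang_launch("/models/qwen")
```

With a server running, `call_hook(base_url, "release_memory_occupation")` would run just before the snapshot is taken and `call_hook(base_url, "resume_memory_occupation")` just after restore.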

llama-server with --no-mmap

For Gemma 4, llama-server runs as a subprocess. Unlike the GLM-4.7-Flash case, snapshots work here, which is attributed to CRIU capturing the entire process tree. The critical requirement is --no-mmap: memory-mapped model files prevent CRIU from properly checkpointing the GPU memory state.
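
A minimal sketch of a llama-server command line with memory mapping disabled. The helper name, model path, port, and layer count are assumptions for illustration; only --no-mmap comes from the deployment described above.

```python
def llama_server_argv(model_path: str, port: int = 8080) -> list[str]:
    """Build a llama-server command line that loads the model without mmap."""
    return [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "--n-gpu-layers", "99",  # assumed: offload as many layers as possible
        "--no-mmap",             # read weights into memory instead of mapping the file
    ]


# Hypothetical model path, for illustration only.
cmd = llama_server_argv("/models/gemma.gguf")
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.Popen without a shell.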

Key Learnings

  1. GPU snapshots are the single highest-leverage optimization: across all deployments, enabling them had more impact than every other change, which was marginal by comparison.
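  2. In-process loading was required for snapshot capture in the GLM-4.7-Flash deployment: the GPU state of a child process started via subprocess.Popen was not captured. The Gemma 4 llama-server deployment, which also runs as a subprocess, snapshotted successfully with --no-mmap.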
  3. Warmup before snapshot: CUDA kernel JIT compilation happens on first inference. Run warmup requests with varied sequence lengths before the snapshot is taken.
  4. Snapshot rebuild is a one-time cost per deploy: the first request after a deploy takes 60-190s while the snapshot is rebuilt. Subsequent requests use the cached snapshot.
  5. scaledown_window is the first line of defence: keeping containers alive for 5 minutes after the last request avoids most cold starts entirely, at a cost of ~$0.002/min for an idle L40S.
  6. Runtime images beat devel images: using nvidia/cuda:12.4.1-runtime-ubuntu22.04 instead of the devel variant saves ~1.5 GB and eliminates build toolchain dependencies.
  7. --no-mmap is required for CRIU compatibility with llama.cpp-based deployments.
  8. Baking model weights into the image is a tradeoff: beneficial under ~15-20 GB, counter-productive above ~25 GB.
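
The warmup step from learning 3 can be sketched as a loop over a few prompt lengths, run before the snapshot is taken. The names `warm_up` and `fake_generate`, the specific lengths, and the word-based prompt construction are illustrative assumptions; a real deployment would pass its own inference call.

```python
from typing import Callable, Iterable


def warm_up(
    generate: Callable[[str, int], object],
    prompt_lengths: Iterable[int] = (8, 128, 1024),
    max_tokens: int = 8,
) -> int:
    """Call `generate` once per prompt length; return the number of calls made."""
    calls = 0
    for n in prompt_lengths:
        # Roughly n words; the actual token count depends on the tokenizer.
        prompt = " ".join(["hello"] * n)
        generate(prompt, max_tokens)
        calls += 1
    return calls


# Stand-in for a real model call, used only to show the control flow.
seen = []


def fake_generate(prompt: str, max_tokens: int) -> str:
    seen.append(len(prompt.split()))
    return ""


count = warm_up(fake_generate)
print(count, seen)  # prints "3 [8, 128, 1024]"
```

Varying the length matters because different input shapes can trigger different kernels on first use, so a single short prompt may leave some compilation for the first real request.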

Error Catalog

| Error | Root Cause | Resolution |
| --- | --- | --- |
| SIGSEGV on restore | GPU snapshot CUDA handle incompatibility | Updated Modal API; no recurrence |
| Child process GPU state not captured | subprocess.Popen invisible to CRIU | Rewrote to in-process model loading |
| OOM during snapshot creation | TorchInductor thread pool competing for VRAM | Set TORCHINDUCTOR_COMPILE_THREADS=1 |
| libgomp.so.1 not found | Runtime image missing OpenMP | Added apt_install("libgomp1") |
| SSL certificate verification failure | Local testing without valid certificates | Created _make_ssl_ctx() for local dev |

Source: §1 (FLUX.2-klein-9B), §3 (GLM-4.7-Flash GGUF), §5 (Gemma 4 26B GGUF), §7 (Qwen3.5-35B-A3B FP8).
