✨TL;DR Summary
- Highest-Leverage Optimization: GPU memory snapshots are the most impactful optimization, delivering 5× to 24× cold start reductions across diffusers, llama.cpp, and SGLang.
- In-process Loading Mandatory: Capturing full GPU state with CRIU requires in-process model loading; child processes (e.g. subprocess.Popen) are invisible to the snapshot system.
- SGLang & llama-server Hacks: SGLang requires --enable-memory-saver to offload KV cache to CPU during capture. llama-server requires --no-mmap for checkpointing compatibility.
- Warmup Requests: Running warmup queries before snapshotting forces CUDA JIT kernel compilation, keeping it out of the cold start path.
GPU memory snapshots are the single most impactful optimization, outweighing all other cold start improvements combined. Across the four deployments measured (covering diffusers, llama.cpp, and SGLang), snapshots deliver 5× to 24× cold start reductions.
Key Metrics
| Model | Engine | Before | After | Improvement |
|---|---|---|---|---|
| FLUX.2-klein-9B | diffusers | 48s | ~7s | 6.9× |
| GLM-4.7-Flash GGUF | llama-cpp-python | 110-168s | 2-7s | 24× |
| Gemma 4 26B GGUF | llama-server | 60-120s | 5-15s | 5-10× |
| Qwen3.5-35B-A3B FP8 | SGLang | 2-5 min | 12-20s | ~10× |
The Discovery
The baseline cold start for FLUX.2-klein-9B was 48.3s. The CPU→GPU transfer of ~18 GB of bf16 weights dominated, accounting for 35s of overhead. Enabling GPU memory snapshots captured the full CUDA state: containers restored in ~3s, dropping that overhead from 35.0s to 3.1s (11× less).
Marginal improvements (pre-importing libraries, reducing wait_ms, baking weights into the image) together saved a few more seconds, but the snapshots did 90%+ of the work.
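The ratios quoted for FLUX.2-klein-9B can be checked with a few lines of arithmetic. This is only a sanity check on the numbers above; the helper name `speedup` is illustrative, and the end-to-end figure uses the rounded table values (48s and ~7s).

```python
def speedup(before_s: float, after_s: float) -> float:
    """Return how many times shorter `after_s` is than `before_s`."""
    if after_s <= 0:
        raise ValueError("after_s must be positive")
    return before_s / after_s


# Figures quoted above for FLUX.2-klein-9B.
transfer_speedup = speedup(35.0, 3.1)    # weight-loading overhead only
cold_start_speedup = speedup(48.0, 7.0)  # end-to-end, rounded table values

print(f"overhead: {transfer_speedup:.1f}x, cold start: {cold_start_speedup:.1f}x")
# prints "overhead: 11.3x, cold start: 6.9x"
```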
Architectural Insight: Subprocess Kills Snapshots
The GLM-4.7-Flash GGUF deployment initially used subprocess.Popen to run the llama.cpp server. Cold starts took 110-168s because Modal's GPU snapshot system (CRIU) could not capture the child process's GPU state.
Switching to in-process Llama(...) loading, so that Modal captures the entire Python process including its CUDA context, cut cold starts to 13-33s. Further optimizations (runtime image, pre-built wheels, single warmup) brought them down to 2-7s.
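The distinction can be made concrete with a standard-library sketch. It does not touch Modal, CRIU, or a GPU; it only shows that state created through subprocess.Popen lives under a different PID than the Python process that loads a model in-process. Both function names are hypothetical stand-ins for the real loaders.

```python
import os
import subprocess
import sys


def load_in_process() -> int:
    """Stand-in for in-process loading such as Llama(...).

    Whatever is loaded here lives in the calling Python process.
    """
    return os.getpid()


def load_in_child() -> int:
    """Stand-in for launching a server with subprocess.Popen.

    Whatever the child loads lives under the child's PID, not ours.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate(timeout=30)
    return int(out.strip())


if __name__ == "__main__":
    print("this process:", os.getpid())
    print("in-process load owner:", load_in_process())
    print("subprocess load owner:", load_in_child())
```

In the GLM-4.7-Flash case, the model's CUDA context belonged to the second kind of owner until the deployment was rewritten to load in-process.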
SGLang Memory-Saver Hooks
For SGLang deployments (Qwen3.5-35B), the snapshot path uses SGLang's built-in hooks: the server calls /release_memory_occupation to offload KV cache to CPU, Modal captures the full container state including GPU memory, and on restore the server calls /resume_memory_occupation.
This requires --enable-memory-saver and --enable-weights-cpu-backup flags plus TORCHINDUCTOR_COMPILE_THREADS=1 to prevent OOM during snapshot creation.
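A minimal sketch of how these pieces might be assembled, using only the standard library. The two flags, the environment variable, and the two endpoint paths come from the text above. Everything else is an assumption made for illustration: the helper names, the model path, the port, the `python -m sglang.launch_server --model-path` invocation, and the use of POST with an empty JSON body for the hooks.

```python
import os
import urllib.request
from typing import Dict, List, Optional


def sglang_launch_args(model_path: str, port: int = 30000) -> List[str]:
    """Build a launch command with the snapshot-related flags enabled."""
    return [
        "python", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,
        "--port", str(port),                     # assumed port
        "--enable-memory-saver",
        "--enable-weights-cpu-backup",
    ]


def sglang_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and limit TorchInductor to one compile thread."""
    env = dict(os.environ if base is None else base)
    env["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
    return env


def hook_request(base_url: str, path: str) -> urllib.request.Request:
    """Build the HTTP request for a memory hook (POST is an assumption)."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_hook(base_url: str, path: str, timeout_s: float = 60.0) -> int:
    """Send the hook request and return the HTTP status code."""
    with urllib.request.urlopen(hook_request(base_url, path), timeout=timeout_s) as resp:
        return resp.status


# Intended order around a snapshot, per the description above:
#   call_hook(url, "/release_memory_occupation")   # before capture
#   ... container snapshot is taken ...
#   call_hook(url, "/resume_memory_occupation")    # after restore
```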
llama-server with --no-mmap
For Gemma 4 running llama-server (subprocess), snapshots work because CRIU captures the entire process tree. The critical requirement is --no-mmap: memory-mapped model files prevent CRIU from properly checkpointing the GPU memory state.
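As a sketch, the llama-server command line can be built so that --no-mmap is always included. The model path, host, port, and GPU layer count below are placeholders rather than values from the Gemma 4 deployment, and the helper name is hypothetical.

```python
import shlex
from typing import List


def llama_server_args(
    model_path: str,
    host: str = "0.0.0.0",
    port: int = 8080,
    n_gpu_layers: int = 99,
) -> List[str]:
    """Build a llama-server command that never memory-maps the model file."""
    return [
        "llama-server",
        "-m", model_path,
        "--host", host,
        "--port", str(port),
        "--n-gpu-layers", str(n_gpu_layers),  # placeholder: offload all layers
        "--no-mmap",                          # required for checkpointing, per the text
    ]


if __name__ == "__main__":
    # Print a shell-quoted version for inspection before launching.
    print(shlex.join(llama_server_args("/models/example.gguf")))
```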
Key Learnings
- GPU snapshots are the single highest-leverage optimization: across all deployments, enabling them was the most impactful action, and everything else was marginal by comparison.
- In-process loading is mandatory for snapshot capture: child processes started via subprocess are invisible to CRIU.
- Warmup before snapshot: CUDA kernel JIT compilation happens on first inference. Run warmup requests with varied sequence lengths before the snapshot is taken.
- Snapshot rebuild is a one-time cost per deploy: first request after deploy takes 60-190s to rebuild. Subsequent requests use the cached snapshot.
- scaledown_window is the first line of defence: keeping containers alive for 5 minutes after last request avoids most cold starts entirely. Costs ~$0.002/min for idle L40S.
- Runtime images beat devel images: using nvidia/cuda:12.4.1-runtime-ubuntu22.04 instead of the devel variant saves ~1.5 GB and eliminates build toolchain dependencies.
- --no-mmap is required for CRIU compatibility with llama.cpp-based deployments.
- Baking model weights into the image is a tradeoff: beneficial under ~15-20 GB, counter-productive above ~25 GB.
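The warmup learning above can be sketched as a small engine-agnostic helper that runs before the snapshot is taken. The `generate` callable, the prompt lengths, and the filler prompt are all assumptions for illustration; repeating a word only approximates a token count, so the lengths are rough targets.

```python
from typing import Callable, Iterable, List


def warm_up(
    generate: Callable[[str, int], str],
    prompt_word_counts: Iterable[int] = (16, 128, 1024),
    max_new_tokens: int = 8,
) -> List[int]:
    """Run one short generation per prompt length and return the lengths used.

    `generate(prompt, max_new_tokens)` is a hypothetical wrapper around the
    engine's inference call. Varying the prompt length is meant to trigger
    first-inference kernel compilation for more than one input shape.
    """
    used = []
    for n_words in prompt_word_counts:
        prompt = " ".join(["warmup"] * n_words)  # roughly n_words tokens
        generate(prompt, max_new_tokens)
        used.append(n_words)
    return used
```

Calling `warm_up(...)` at the end of model loading, before the snapshot point, keeps that compilation out of the restored container's first request.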
Error Catalog
| Error | Root Cause | Resolution |
|---|---|---|
| SIGSEGV on restore | GPU snapshot CUDA handle incompatibility | Updated Modal API; no recurrence |
| Child process GPU state not captured | subprocess.Popen invisible to CRIU | Rewrote to in-process model loading |
| OOM during snapshot creation | TorchInductor thread pool competing for VRAM | Set TORCHINDUCTOR_COMPILE_THREADS=1 |
| libgomp.so.1 not found | Runtime image missing OpenMP | Added apt_install("libgomp1") |
| SSL certificate verification failure | Local testing without valid certificates | Created _make_ssl_ctx() for local dev |
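
The body of _make_ssl_ctx() is not shown in the source, so the following is only a guess at what a local-development SSL helper could look like. The function name `make_local_ssl_context` and both parameters are hypothetical. Verification stays on by default, and the `insecure` switch is meant for throwaway local testing only.

```python
import ssl
from typing import Optional


def make_local_ssl_context(
    cafile: Optional[str] = None,
    insecure: bool = False,
) -> ssl.SSLContext:
    """Build an SSL context for local testing.

    cafile:   optional path to a CA bundle to trust (e.g. a local dev CA).
    insecure: if True, skip certificate and hostname checks entirely.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    if insecure:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

Passing a CA bundle through `cafile` is usually preferable to `insecure=True`, since it keeps verification enabled while still accepting locally issued certificates.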
Source: §1 (FLUX.2-klein-9B), §3 (GLM-4.7-Flash GGUF), §5 (Gemma 4 26B GGUF), §7 (Qwen3.5-35B-A3B FP8).