Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, and llama.cpp for LLM serving; LangGraph and MCP for agentic orchestration; AWS EKS, KubeRay, and Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications, including OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

What is the most effective way to reduce GPU cold start times for LLM inference?

GPU memory snapshots (via CRIU) are the single highest-leverage optimization. Across diffusers, llama.cpp, and SGLang deployments, snapshots deliver 5× to 24× cold start reductions. In-process model loading is required when child processes started via subprocess.Popen create CUDA contexts invisible to the parent snapshot.

How do GPU memory snapshots work for LLM cold start reduction?

GPU memory snapshots capture the full CUDA state of a running container using CRIU. On restore, the container resumes from snapshot instead of reloading weights from disk. For FLUX.2-klein-9B on Modal L40S GPUs, CPU-to-GPU transfer dropped from 35s to 3s, achieving 6.9× total cold start reduction (48s to 7s).

How do you cache FlashInfer JIT kernels to avoid long boot times on large models?

For Qwen3.5-397B on Modal 4× B200 GPUs, FlashInfer JIT compilation added significant cold-start overhead. Caching compiled kernels to a persistent volume and symlinking at boot reduced cold start from 26 minutes to 7 minutes, unlocking scale-to-zero and saving ~74% on GPU costs.
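
The symlink step described above can be sketched roughly as follows. This is a minimal illustration, not the deployment's actual code: the function name `link_cache_to_volume` is hypothetical, and the paths in the usage note (a `~/.cache/flashinfer` cache directory and a `/vol/flashinfer-cache` mount point) are assumptions to check against your FlashInfer version and volume setup.

```python
import shutil
from pathlib import Path


def link_cache_to_volume(cache_dir: Path, volume_dir: Path) -> Path:
    """Make cache_dir a symlink to volume_dir so compiled kernels persist.

    Hypothetical helper: run it at container boot, before the first kernel
    is compiled, so later boots find the kernels already on the volume.
    """
    volume_dir.mkdir(parents=True, exist_ok=True)

    if cache_dir.is_symlink():
        if cache_dir.resolve() == volume_dir.resolve():
            return cache_dir  # already linked to the volume
        cache_dir.unlink()  # points somewhere else: replace the link
    elif cache_dir.exists():
        # Preserve kernels compiled before linking by moving them over.
        for item in cache_dir.iterdir():
            target = volume_dir / item.name
            if not target.exists():
                shutil.move(str(item), str(target))
        shutil.rmtree(cache_dir)

    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    cache_dir.symlink_to(volume_dir, target_is_directory=True)
    return cache_dir
```

A boot hook might call `link_cache_to_volume(Path.home() / ".cache" / "flashinfer", Path("/vol/flashinfer-cache"))` before starting the inference server, with both paths adjusted to the actual environment.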

When can subprocess-based model servers still use GPU snapshots?

When CRIU captures the entire process tree (e.g. llama-server as subprocess), snapshots can work if --no-mmap is set so memory-mapped weights do not block checkpointing. Popen-launched children whose CUDA context is invisible to the parent process cannot be snapshotted; use in-process loading instead.

Cold Start Engineering: 26m to 7m GPU Boot Optimization

How do you reduce GPU cold start times for LLM inference in production?

GPU memory snapshots (CRIU) are the highest-leverage optimization, delivering 5×–24× cold start reductions across diffusers, llama.cpp, and SGLang. In-process model loading is mandatory when child CUDA contexts are invisible to snapshots; llama-server subprocess trees can work with --no-mmap. Verified on Modal L40S and B200 GPUs, Feb–May 2026.

TL;DR

Highest-Leverage Optimization: GPU memory snapshots are the most impactful optimization, delivering 5× to 24× cold start reductions across diffusers, llama.cpp, and SGLang.
In-process Loading Mandatory: Capturing full GPU state with CRIU requires in-process model loading when a child process started via subprocess.Popen holds a CUDA context the snapshot system cannot see.
SGLang & llama-server Flags: SGLang requires --enable-memory-saver to offload KV cache to CPU during capture. llama-server requires --no-mmap for checkpointing compatibility.
Warmup Requests: Running warmup queries before snapshotting forces CUDA JIT kernel compilation, keeping it out of the cold start path.

GPU memory snapshots are the single most impactful optimization, outweighing all other cold start improvements combined. Across the four engine and model combinations measured, snapshots deliver 5× to 24× cold start reductions.

How much can GPU memory snapshots improve cold start times?

Across four engine and model combinations, snapshots deliver 5× to 24× cold start reductions, from 48s to ~7s on FLUX.2 and from 110–168s to 2–7s on GLM-4.7-Flash GGUF.

Model	Engine	Before	After	Improvement
FLUX.2-klein-9B	diffusers	48s	~7s	6.9×
GLM-4.7-Flash GGUF	llama-cpp-python	110-168s	2-7s	24×
Gemma 4 26B GGUF	llama-server	60-120s	5-15s	5-10×
Qwen3.5-35B-A3B FP8	SGLang	2-5 min	12-20s	~10×

What dominates cold start time before snapshots?

CPU→GPU weight transfer dominates: 35s of the 48.3s FLUX.2-klein-9B baseline went to moving ~18 GB of bf16 weights onto the GPU. Enabling GPU memory snapshots captured the full CUDA state, so containers restored in ~3s and transfer overhead dropped from 35.0s to 3.1s (about 11× less).

Marginal improvements (pre-importing libraries, reducing wait_ms, baking weights into image) combined saved another few seconds, but the snapshots did 90%+ of the work.
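
As a quick check on the arithmetic, the FLUX.2 ratios can be recomputed from the timings quoted above. The only assumption is that the "~7s" figure is treated as exactly 7.0s.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of old to new duration, e.g. 2.0 means twice as fast."""
    return before_s / after_s


# Timings quoted in the text for FLUX.2-klein-9B (the ~7s total is approximate).
total = speedup(48.3, 7.0)        # end-to-end cold start
transfer = speedup(35.0, 3.1)     # CPU→GPU transfer phase only
transfer_share = 35.0 / 48.3      # fraction of the baseline spent on transfer

print(f"total: {total:.1f}x, transfer: {transfer:.1f}x, share: {transfer_share:.0%}")
# prints "total: 6.9x, transfer: 11.3x, share: 72%"
```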

Why do subprocess-based model servers break GPU snapshots?

CRIU cannot capture child-process CUDA contexts started via subprocess.Popen, so in-process model loading is mandatory for snapshot capture in that setup. The GLM-4.7-Flash GGUF deployment initially used subprocess.Popen to run the llama.cpp server, and cold starts took 110-168s because Modal's GPU snapshot system (CRIU) could not capture the child process's GPU state.

Switching to in-process Llama(...) loading, so that Modal captures the entire Python process including its CUDA context, cut cold starts to 13-33s. Further optimizations (runtime image, pre-built wheels, single warmup) brought them down to 2-7s.
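
A minimal sketch of the in-process pattern, assuming llama-cpp-python's `Llama` class. The `InProcessModel` wrapper and the injectable `loader` argument are hypothetical names used for illustration. Passing `use_mmap=False` is also an assumption here, mirroring the --no-mmap requirement noted for llama.cpp-based deployments.

```python
from typing import Any, Callable, Optional


def _default_loader(model_path: str) -> Any:
    # Imported lazily so this module imports even without llama-cpp-python.
    from llama_cpp import Llama

    # n_gpu_layers=-1 offloads all layers to the GPU; use_mmap=False reads the
    # weights into process memory instead of memory-mapping the file.
    return Llama(model_path=model_path, n_gpu_layers=-1, use_mmap=False)


class InProcessModel:
    """Holds the model inside the current Python process (no subprocess)."""

    def __init__(self, model_path: str,
                 loader: Optional[Callable[[str], Any]] = None) -> None:
        self.model_path = model_path
        self._loader = loader or _default_loader
        self._model: Any = None

    def load(self) -> Any:
        """Load once; intended to run before the snapshot is taken."""
        if self._model is None:
            self._model = self._loader(self.model_path)
        return self._model
```

Because the model object lives in the same process that the platform checkpoints, its CUDA context is part of the captured state. The `loader` hook exists only so the wrapper can be exercised without a GPU.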

How do SGLang memory-saver hooks enable GPU snapshots?

Call /release_memory_occupation before capture and /resume_memory_occupation after restore, with --enable-memory-saver, --enable-weights-cpu-backup, and TORCHINDUCTOR_COMPILE_THREADS=1. For SGLang deployments (Qwen3.5-35B), the snapshot path uses SGLang's built-in hooks: the server calls /release_memory_occupation to offload KV cache to CPU, Modal captures the full container state including GPU memory, and on restore the server calls /resume_memory_occupation.

This requires --enable-memory-saver and --enable-weights-cpu-backup flags plus TORCHINDUCTOR_COMPILE_THREADS=1 to prevent OOM during snapshot creation.
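
The release and resume calls could be wired up along these lines. This is a sketch under assumptions: the helper names, the default port, and the empty JSON request body are illustrative and not taken from the deployment. The flags and endpoint paths are the ones named above.

```python
import json
import os
import urllib.request

# Must be in the environment before the SGLang server process is started.
os.environ.setdefault("TORCHINDUCTOR_COMPILE_THREADS", "1")


def sglang_launch_args(model_path: str, port: int = 30000) -> list[str]:
    """Command line for an SGLang server with the snapshot-related flags."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--enable-memory-saver",
        "--enable-weights-cpu-backup",
    ]


def _post(base_url: str, path: str, timeout_s: float = 300.0) -> int:
    """POST an empty JSON object and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def before_snapshot(base_url: str) -> int:
    """Ask the server to release GPU memory ahead of the capture."""
    return _post(base_url, "/release_memory_occupation")


def after_restore(base_url: str) -> int:
    """Ask the server to reclaim GPU memory once the container is restored."""
    return _post(base_url, "/resume_memory_occupation")
```

In this layout `before_snapshot` would run at the end of the snapshot-time setup hook and `after_restore` at the start of the post-restore hook.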

When can llama-server subprocess trees still use GPU snapshots?

When CRIU captures the full process tree, snapshots work if --no-mmap is set so memory-mapped weights do not block checkpointing. For Gemma 4 running llama-server (subprocess), snapshots work because CRIU captures the entire process tree. The critical requirement is --no-mmap: memory-mapped model files prevent CRIU from properly checkpointing the GPU memory state.
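
For the subprocess case, a launch helper might look like the sketch below. The function names, host, port, and GPU layer count are illustrative assumptions. The one detail taken from the text is that --no-mmap is always passed.

```python
import subprocess


def llama_server_args(model_path: str, host: str = "127.0.0.1",
                      port: int = 8080, n_gpu_layers: int = 999) -> list[str]:
    """Build a llama-server command line that avoids memory-mapped weights."""
    return [
        "llama-server",
        "--model", model_path,
        "--host", host,
        "--port", str(port),
        "--n-gpu-layers", str(n_gpu_layers),  # large value: offload all layers
        "--no-mmap",  # load weights into memory instead of mmap-ing the file
    ]


def start_llama_server(model_path: str) -> subprocess.Popen:
    """Start llama-server as a child of the process being checkpointed."""
    return subprocess.Popen(llama_server_args(model_path))
```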

What are the key cold start optimization lessons?

GPU snapshots are the highest-leverage optimization; everything else is marginal. Warmup before snapshot, in-process loading, and scaledown_window are the supporting tactics.

GPU snapshots are the single highest-leverage optimization: enabling them was the most impactful action across all deployments, and everything else is marginal.
In-process loading is mandatory for snapshot capture when the model runs in a child process whose CUDA context CRIU cannot see, as with the subprocess.Popen case above.
Warmup before snapshot: CUDA kernel JIT compilation happens on first inference. Run warmup requests with varied sequence lengths before the snapshot is taken.
Snapshot rebuild is a one-time cost per deploy: first request after deploy takes 60-190s to rebuild. Subsequent requests use the cached snapshot.
scaledown_window is the first line of defence: keeping containers alive for 5 minutes after last request avoids most cold starts entirely. Costs ~$0.002/min for idle L40S.
Runtime images beat devel images: using nvidia/cuda:12.4.1-runtime-ubuntu22.04 instead of the devel variant saves ~1.5 GB and eliminates build toolchain dependencies.
--no-mmap is required for CRIU compatibility with llama.cpp-based deployments.
Baking model weights into the image is a tradeoff: beneficial under ~15-20 GB, counter-productive above ~25 GB.
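
The "warmup before snapshot" lesson can be sketched as a small helper that runs before the snapshot is taken. The `generate` callable, the prompt lengths, and the filler prompt are all hypothetical. Word counts only approximate token counts, so the lengths should be tuned to the sequence lengths seen in production.

```python
from typing import Callable, Iterable


def warmup(generate: Callable[[str, int], object],
           prompt_lengths: Iterable[int] = (16, 256, 2048),
           max_tokens: int = 8) -> int:
    """Run one short generation per prompt length and return the run count.

    `generate(prompt, max_tokens)` stands in for whatever inference call the
    engine exposes. Varying the prompt length is meant to trigger kernel
    compilation for different shapes before the snapshot, not after restore.
    """
    runs = 0
    for n_words in prompt_lengths:
        prompt = ("hello " * n_words).strip()  # n_words whitespace-separated words
        generate(prompt, max_tokens)
        runs += 1
    return runs
```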

What errors block GPU snapshot deployments?

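SIGSEGV on restore, invisible child CUDA contexts, and OOM during snapshot creation are the most common snapshot blockers, alongside a missing OpenMP library in runtime images and SSL failures in local testing. The table below lists the fix applied for each; mmap-incompatible weight loading is handled with --no-mmap.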

Error	Root Cause	Resolution
SIGSEGV on restore	GPU snapshot CUDA handle incompatibility	Updated Modal API; no recurrence
Child process GPU state not captured	`subprocess.Popen` invisible to CRIU	Rewrote to in-process model loading
OOM during snapshot creation	TorchInductor thread pool competing for VRAM	Set `TORCHINDUCTOR_COMPILE_THREADS=1`
`libgomp.so.1` not found	Runtime image missing OpenMP	Added `apt_install("libgomp1")`
SSL certificate verification failure	Local testing without valid certificates	Created `_make_ssl_ctx()` for local dev
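
The table names a `_make_ssl_ctx()` helper without showing it. One plausible shape for such a helper is sketched below, as a guess rather than the author's implementation: the `local_dev` parameter is an assumption, and verification is only relaxed when it is explicitly set.

```python
import ssl


def make_ssl_ctx(local_dev: bool = False) -> ssl.SSLContext:
    """Return a TLS context; skip verification only for local development."""
    ctx = ssl.create_default_context()
    if local_dev:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The resulting context can be passed to `urllib.request.urlopen(url, context=ctx)`. The default path keeps full certificate verification, so the relaxed mode is limited to local testing.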

Source: §1 (FLUX.2-klein-9B), §3 (GLM-4.7-Flash GGUF), §5 (Gemma 4 26B GGUF), §7 (Qwen3.5-35B-A3B FP8).

