Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India, with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, and llama.cpp for LLM serving; LangGraph and MCP for agentic orchestration; AWS EKS, KubeRay, and Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.
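
The symlink approach can be sketched in a few lines. This is a minimal illustration, not the production code: the helper name `link_cache_to_volume` and the paths in the usage note are hypothetical, and the real cache location depends on the library and version in use.

```python
import shutil
from pathlib import Path


def link_cache_to_volume(cache_dir: Path, volume_dir: Path) -> Path:
    """Point a JIT cache directory at persistent storage via a symlink.

    Hypothetical helper: kernels compiled on one boot land in volume_dir,
    so later containers that mount the same volume can reuse them.
    """
    volume_dir.mkdir(parents=True, exist_ok=True)

    if cache_dir.is_symlink():
        if cache_dir.resolve() == volume_dir.resolve():
            return cache_dir  # already linked, nothing to do
        cache_dir.unlink()  # stale link pointing somewhere else
    elif cache_dir.exists():
        # Keep kernels that were compiled before the link existed.
        for item in cache_dir.iterdir():
            target = volume_dir / item.name
            if not target.exists():
                shutil.move(str(item), str(target))
        shutil.rmtree(cache_dir)

    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    cache_dir.symlink_to(volume_dir, target_is_directory=True)
    return cache_dir
```

A container would call this at startup, before the serving engine imports anything that triggers compilation, for example `link_cache_to_volume(Path.home() / ".cache" / "flashinfer", Path("/vol/jit/flashinfer"))`. Both paths are placeholders for illustration.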

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

How much do GPU memory snapshots improve FLUX.2 cold starts?

Snapshots reduced restore time from 35s to ~3s for weight transfer alone, yielding 6.9× total cold start improvement (48s to ~7s) for FLUX.2-klein-9B on Modal L40S after warmup-before-snapshot.

Should you bake FLUX weights into the container image?

Baking weights into the image helps under ~15–20 GB (faster snapshot restore, no network fetch) but is counter-productive above ~25 GB due to image size and build time. Use volume mounts or snapshots instead.

FLUX.2 Image Generation at Scale: 48s to 7s

How do you reduce FLUX.2 cold start time on serverless GPUs?

Enable GPU memory snapshots to capture full CUDA state after pipeline warmup. FLUX.2-klein-9B on Modal L40S dropped from 48s to ~7s (6.9×). Async GPU transfers and pre-imported libraries provide marginal gains; snapshots deliver about 90% of the improvement.

✨TL;DR Summary

Serverless Diffusion Bottleneck: The baseline cold start for loading the ~18 GB FLUX.2-klein-9B model was ~50 seconds, which ruled out real-time serverless deployment.
Memory Snapshot Acceleration: Using Modal's GPU memory snapshots to capture the fully initialized VRAM and system memory state cut cold starts 4.2×, from 48.3s to 11.4s.
Async VRAM Transfers: Replacing HuggingFace's synchronous, blocking device transfers with non-blocking transfers on a custom PyTorch CUDA stream further reduced cold starts to 7.2s.
Batching Strategy: We abandoned automatic batching (which introduced a 500ms latency tax and VRAM OOMs) in favor of single-request concurrency for stable, progressive generation.

Deploying large diffusion models like FLUX.2-klein-9B (9B parameters) on serverless GPUs for real-time applications requires a different set of optimization strategies than LLMs. Rather than token-by-token streaming, we are dealing with high-bandwidth image tensors, large weight transfers, and multi-second blocks of diffusion steps.

How much can FLUX.2 cold starts be reduced in production?

GPU snapshots plus async VRAM transfers cut cold starts from 48.3s to 7.2s (6.7×) on Modal L40S. Snapshots alone account for a 4.2× reduction (48.3s to 11.4s).

Stage	Architecture Detail	Cold Latency	Warm Latency	Cold-Start Speedup
Baseline	Unoptimized HuggingFace Pipeline	48.3s	11.6s	1.0× (baseline)
Memory Snapshot	CRIU GPU state restoration	11.4s	6.2s	4.2×
Full Pipeline	GPU Snapshot + Async VRAM transfers	7.2s	4.1s	6.7×
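
The speedup figures follow directly from the cold latencies in the table. The short check below recomputes them and also works out how much of the total saving the snapshot stage contributes on its own; the function names are illustrative only.

```python
# Cold-start latencies in seconds, copied from the table above.
COLD_S = {"baseline": 48.3, "snapshot": 11.4, "full": 7.2}


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster after_s is than before_s."""
    return before_s / after_s


def share_of_saving(before_s: float, mid_s: float, after_s: float) -> float:
    """Fraction of the total saving (before -> after) already reached at mid."""
    return (before_s - mid_s) / (before_s - after_s)


snapshot_x = speedup(COLD_S["baseline"], COLD_S["snapshot"])
full_x = speedup(COLD_S["baseline"], COLD_S["full"])
snapshot_share = share_of_saving(
    COLD_S["baseline"], COLD_S["snapshot"], COLD_S["full"]
)

print(f"snapshot only: {snapshot_x:.1f}x")                        # 4.2x
print(f"full pipeline: {full_x:.1f}x")                            # 6.7x
print(f"share of saving from snapshots: {snapshot_share:.0%}")    # 90%
```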

What causes 48-second cold starts for FLUX.2 on serverless GPUs?

On startup, the ~18 GB of bf16 weights for the FLUX.2 pipeline are loaded from disk into system memory and then transferred to the GPU (L40S). The CPU-to-GPU transfer dominates, taking ~35s of the 48.3s baseline; library import, pipeline init, and scheduler setup account for the remainder. Since users expect images in under 5 seconds, scale-to-zero serverless was unusable at that latency.
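
A breakdown like this comes from timing each startup phase separately. The sketch below shows one way to collect such numbers; `PhaseTimer` is a hypothetical helper, not part of any library, and the `time.sleep` calls stand in for the real import, load, and transfer work.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named startup phase."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def report(self):
        """Map each phase to (seconds, share of total time)."""
        total = sum(self.phases.values())
        return {
            name: (secs, secs / total if total else 0.0)
            for name, secs in self.phases.items()
        }


timer = PhaseTimer()
with timer.phase("imports"):
    time.sleep(0.01)  # stand-in for importing torch and diffusers
with timer.phase("load_to_cpu"):
    time.sleep(0.02)  # stand-in for reading weights from disk
with timer.phase("cpu_to_gpu"):
    time.sleep(0.03)  # stand-in for the device transfer

for name, (secs, share) in timer.report().items():
    print(f"{name}: {secs:.3f}s ({share:.0%})")
```

When timing real GPU work, call `torch.cuda.synchronize()` before leaving the `with` block, since CUDA operations can return to the CPU before they finish.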

GPU Memory Snapshots (CRIU)

By using Modal's GPU memory snapshots, we capture a physical image of the GPU VRAM and system memory after the model is loaded and fully initialized. When a new container is spawned, it bypasses the entire loading process, restoring directly from the memory snapshot in 3.1 seconds.

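# Pre-loading hook in Modal class
import modal

app = modal.App("flux-snapshot")  # app name is illustrative

# GPU snapshots are an experimental option layered on Modal memory snapshots
@app.cls(
    gpu="L40S",
    enable_memory_snapshot=True,
    experimental_options={"enable_gpu_snapshot": True},
)
class FluxModel:
    @modal.enter(snap=True)
    def enter(self):
        # Runs once while the snapshot is created; restored containers skip it
        from diffusers import FluxPipeline
        import torch

        # Load the pipeline and move it to the GPU so the weights are captured
        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-schnell",
            torch_dtype=torch.bfloat16,
        ).to("cuda")
        # Warm up CUDA kernels and memory paths before the snapshot is taken
        _ = self.pipe("warmup prompt", num_inference_steps=4)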

How do async CUDA transfers reduce FLUX.2 warm-path latency?

Replace synchronous HuggingFace device transfers with non-blocking transfers on a PyTorch CUDA stream so the CPU can keep working while weights move to VRAM. We rewrote the default HuggingFace device transfer code this way and saved ~4.5 seconds on the warm path. Instead of synchronous blocking transfers during inference, we implemented a custom non-blocking transfer mechanism using PyTorch streams:

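# Custom async CUDA transfer stream (runs inside a method of the model class)
import torch

cuda_stream = torch.cuda.Stream()
with torch.cuda.stream(cuda_stream):
    # Non-blocking copies let the CPU keep working (e.g. tokenizing the prompt)
    # while weights move to VRAM; they only overlap if the source is pinned memory
    self.transformer.to(device="cuda", non_blocking=True)
    self.text_encoder.to(device="cuda", non_blocking=True)
# Make the default stream wait for the copies before the first forward pass
torch.cuda.current_stream().wait_stream(cuda_stream)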

What batching strategy works best for serverless FLUX.2?

Single-request concurrency (@modal.concurrent) wins for real-time generation. Automatic batching adds a 500ms latency tax and triggers VRAM OOMs at larger batch sizes. To handle production load from REimagineHome.AI, we evaluated three strategies for serving concurrent requests:

1. Automatic Batching (`@modal.batched`): REVERTED

Modal intercepts requests and waits up to 500ms to group them. This introduced a mandatory **500ms latency tax** on all requests. If 20 concurrent requests arrived, it spun up 8 containers (leaving 5 idle due to batch over-provisioning) and crashed on batch sizes larger than 12 due to VRAM OOM.

2. Single Request Concurrency (`@modal.concurrent`): ADOPTED

No wait-time tax. Each container handles a single image generation request. Autoscaling is predictable and stable, and clients receive a progressive update for each image immediately.

3. Explicit Batch API (`generate_batch`): HIGH THROUGHPUT

For background processing pipelines where clients can group requests, we exposed a batch generation endpoint. Running batch generation on the GPU is highly efficient, cutting processing time to **3.22 seconds per image** at batch=10 compared to 4.40 seconds for single images.
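
The trade-off between the three options can be made concrete with a small model. The sketch below is a simplified wait-window batcher, not Modal's implementation: `BatchWindow`, `queue_delays_ms`, and `batch_wall_time_s` are hypothetical names, and the batch calculation assumes the 3.22s figure is an amortized per-image time with all images returned together.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BatchWindow:
    wait_ms: float       # how long the batcher holds the oldest queued request
    max_batch_size: int  # a full batch is dispatched without waiting further


def queue_delays_ms(arrivals_ms: List[float], window: BatchWindow) -> List[float]:
    """Queueing delay per request, in arrival order, before its batch starts."""
    delays: List[float] = []
    pending: List[float] = []
    for t in sorted(arrivals_ms):
        if pending and t >= pending[0] + window.wait_ms:
            # The window expired before this arrival, so flush at window end.
            flush_at = pending[0] + window.wait_ms
            delays.extend(flush_at - p for p in pending)
            pending = []
        pending.append(t)
        if len(pending) == window.max_batch_size:
            # Batch is full: dispatch immediately.
            delays.extend(t - p for p in pending)
            pending = []
    if pending:
        flush_at = pending[0] + window.wait_ms
        delays.extend(flush_at - p for p in pending)
    return delays


def batch_wall_time_s(per_image_s: float, batch_size: int) -> float:
    """Time until a whole batch is done, given amortized per-image time."""
    return per_image_s * batch_size


window = BatchWindow(wait_ms=500, max_batch_size=12)
print(queue_delays_ms([0], window))       # [500]
print(queue_delays_ms([0, 100], window))  # [500, 400]
print(round(batch_wall_time_s(3.22, 10), 2))  # 32.2
```

Under this model a lone request always pays the full 500ms window, which is the latency tax described above. The explicit batch endpoint is cheaper per image (3.22s vs 4.40s), but a caller waits roughly 32s for a batch of 10, which suits background jobs better than interactive use.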

What compilation anti-patterns slow down FLUX.2 inference?

torch.compile on step-distilled models and multiresolution warmup before snapshotting add 45–50s of JIT time with no inference payoff. CUDA kernels compile dynamically at runtime anyway.

torch.compile on distilled models: FLUX.2-Schnell/Klein is step-distilled (4 inference steps). Running torch.compile() requires up to 45 seconds of JIT time during inference. Because there are only 4 steps, the compiled kernel execution path has no amortization window, making it slower and unpredictable.
Multiresolution Warmup: Warming up the pipeline with multiple resolutions (e.g., 512px, 768px, 1024px) before snapshotting added 50 seconds to compilation, but did not speed up inference because CUDA kernels compile dynamically at runtime for variable-resolution inputs anyway.
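
The amortization argument can be checked with a break-even estimate. This is a sketch under stated assumptions: the 45s compile time and 4 steps come from the text above, while the 50ms saved per step is a made-up figure for illustration, not a measurement.

```python
import math


def break_even_requests(
    compile_ms: int, steps_per_image: int, saved_ms_per_step: int
) -> int:
    """Images a container must serve before compile time is paid back."""
    if saved_ms_per_step <= 0:
        raise ValueError("compilation must save time per step to break even")
    saved_per_image_ms = steps_per_image * saved_ms_per_step
    return math.ceil(compile_ms / saved_per_image_ms)


# 4-step distilled model, hypothetical 50 ms saved per step
print(break_even_requests(45_000, 4, 50))   # 225
# Same hypothetical saving on a 50-step model
print(break_even_requests(45_000, 50, 50))  # 18
```

With scale-to-zero, a container may be recycled long before it serves a few hundred images, so the compile cost is rarely recovered on a 4-step model.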

What are the key lessons for scaling FLUX.2 image generation?

Use shared volume mounts for model weights, keep text encoders on GPU, and prefer snapshots over baking weights into container images above ~25 GB.

Use shared storage for model weights: Mounting model cache folders on a shared modal.Volume ensures that when snapshots are updated, containers do not download multi-gigabyte models from HuggingFace Hub again.
Avoid text encoders on CPU: Moving the T5-XXL text encoder to the CPU to save VRAM causes a severe execution bottleneck. The L40S GPU has more than enough memory (48GB) to house both the text encoder and the transformer. Keep them both on-device.
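
The size thresholds for baking weights into the image can be written as a small decision helper. This is only a rule-of-thumb sketch: `weight_delivery` and its return labels are hypothetical, and the cutoffs are the approximate figures quoted in this article, so anything between them should be measured.

```python
def weight_delivery(
    size_gb: float,
    bake_below_gb: float = 15.0,
    avoid_bake_above_gb: float = 25.0,
) -> str:
    """Suggest how to ship model weights to a serverless container."""
    if size_gb < bake_below_gb:
        return "bake-into-image"
    if size_gb > avoid_bake_above_gb:
        return "volume-or-snapshot"
    return "measure-both"  # gray zone between the two thresholds


print(weight_delivery(8))   # bake-into-image
print(weight_delivery(18))  # measure-both
print(weight_delivery(40))  # volume-or-snapshot
```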

Source: §1 (FLUX.2-klein-9B).

Related deep dives

Cold Start Engineering

26m → 7m

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.

Hardening a $50/hr GPU Cluster

754B on 8× B200

Building crash monitors, DeepGEMM/FlashInfer kernel caches, and automated container recycling to keep 754B MoE and 397B FP8 clusters alive on Modal 8× and 4× B200 at $50/hr and $25/hr.

Serving Engine Internals

3 Engines · 10 Models

Source-level comparisons of vLLM, SGLang, and llama.cpp at production scale: quantization tradeoffs, memory behavior, cold starts, and when each engine wins on AWS EKS and Modal.