Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India, with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, and llama.cpp for LLM serving; LangGraph and MCP for agentic orchestration; AWS EKS, KubeRay, and Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

How do you cache DeepGEMM and FlashInfer JIT kernels on Modal?

Mount persistent modal.Volume caches for compiled kernel artifacts and point DeepGEMM/FlashInfer cache directories at those volumes. On 8× B200 clusters this reduced JIT compilation overhead from ~15 minutes to near-zero on warm boots.

How do you handle silent subprocess death in GPU inference containers?

Run an active Python watchdog that monitors child process health, pipes stdout/stderr, and calls os._exit(1) on the parent container when a child dies, forcing Modal to recycle the container instead of serving from a zombie state.

Hardening a $50/hr GPU Cluster: 754B MoE on 8x B200

How do you keep a $50/hr 8× B200 GPU cluster reliable for 754B MoE inference?

Map JIT compilers (DeepGEMM, FlashInfer) to persistent Modal volumes, use asymmetric boot patterns for TP=8 Triton file-lock races, and run a Python watchdog that terminates the parent container on child failure. These techniques cut idle costs from ~$18K/month to ~$4.5K/month on 4× B200 clusters with scale-to-zero.

✨TL;DR Summary

JIT Compilation Caching: Reduced JIT compilation overhead from 15 minutes to near zero on warm boots of 8x B200 clusters by mapping dynamic kernel compilers (DeepGEMM, FlashInfer) to persistent modal.Volume caches.
Race Condition Resolution: Bypassed a critical TP=8 Triton compilation file-lock crash by implementing a "Crash-on-First-Boot, Succeed-on-Second" asymmetric startup pattern.
Zombie Container Mitigation: Prevented silent subprocess deaths by building an active Python watchdog daemon that monitors child process health, pipes stdout/stderr, and terminates the parent container immediately upon failure via os._exit(1).
Scale-to-Zero Economics: Optimizations enabled scale-to-zero autoscaling, reducing idle costs from $18,000/month to $4,500/month on 4x B200 clusters.

Deploying massive Mixture-of-Experts (MoE) models such as GLM-5.1 (754B total, 67B active) on an 8× B200 cluster, or Qwen3.5-397B (17B active) on 4× B200 GPUs, pushes serverless GPU infrastructure close to its limits. At $50/hr and $25/hr respectively, every startup delay, silent container crash, and compilation bottleneck translates directly into wasted spend.

What hardware topology is needed for 754B MoE on B200 clusters?

GLM-5.1-Open FP8 needs 8× B200 at TP=8 (~$50/hr); Qwen3.5-397B FP8 fits on 4× B200 at TP=4 (~$25/hr). The table below is the production topology we hardened on Modal.

Model	Params	GPUs	Tensor Parallel (TP)	FP8 Weights / Total VRAM	Cluster Cost
GLM-5.1-Open FP8	754B (67B active)	8× B200 (180GB)	8 (Node-wide)	~754 GB / 1,440 GB	~$50/hr
Qwen3.5-397B FP8	397B (17B active)	4× B200 (180GB)	4	~397 GB / 720 GB	~$25/hr
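
As a rough sanity check on the Qwen3.5 row, FP8 stores one byte per parameter, so the weight footprint in GB is approximately the parameter count in billions. The sketch below assumes that approximation and ignores KV cache, activations, and runtime overhead; the function names are illustrative and not part of any library.

```python
def fp8_weight_gb(params_billion: float) -> float:
    """Approximate FP8 weight footprint: one byte per parameter."""
    return params_billion * 1.0  # 1e9 params * 1 byte = 1 GB (decimal)


def headroom_gb(params_billion: float, gpus: int, gb_per_gpu: float) -> float:
    """Cluster memory left after weights; KV cache and activations must fit here."""
    return gpus * gb_per_gpu - fp8_weight_gb(params_billion)


# Qwen3.5-397B FP8 on 4x B200 (180 GB each), as listed in the table
print(fp8_weight_gb(397))        # 397.0
print(headroom_gb(397, 4, 180))  # 323.0
```

The remaining headroom is what the serving engine can spend on KV cache and batch size, which is why the tensor-parallel degree is chosen with more than just the weights in mind.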

What causes 15-minute cold starts on B200 GPU clusters?

MoE architectures compile custom kernels (DeepGEMM, FlashInfer) just in time on first boot. Engines such as DeepGEMM (for GLM-5.1) and FlashInfer (for Qwen3.5) build FP8 matrix-multiplication kernels specialized to the execution shapes they encounter, so a fresh container has nothing compiled until the engine starts.

This JIT phase took 12 to 15 minutes. At $50/hr, a 15-minute warm-up costs $12.50 in idle compute on every cold start.
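
The idle cost of a warm-up is simply the hourly rate multiplied by the warm-up time. A minimal sketch of that arithmetic, using the rates and durations quoted above (the function name is illustrative):

```python
def warmup_cost_usd(hourly_rate_usd: float, warmup_minutes: float) -> float:
    """Compute spend on a cluster that is billed but not yet serving traffic."""
    return hourly_rate_usd * warmup_minutes / 60.0


# 8x B200 at $50/hr with a 12 to 15 minute JIT phase
print(warmup_cost_usd(50, 15))  # 12.5
print(warmup_cost_usd(50, 12))  # 10.0
```

With scale-to-zero, this cost is paid on every cold start rather than once per deployment, which is what makes the JIT phase worth eliminating.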

The Fix: Caching JIT Kernels on Persistent Volumes

Instead of compiling kernels on every cold start, we redirected all dynamic kernel compilers to dump their output onto a persistent modal.Volume. This required mapping the environment variables of Triton, PyTorch Inductor, DeepGEMM, and FlashInfer to a mounted cache folder:

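# Redirect JIT caches to persistent Modal volume.
# Variable names vary by library release; confirm them for the installed versions
# (for example, DeepGEMM documents DG_JIT_CACHE_DIR and FlashInfer reads FLASHINFER_WORKSPACE_BASE).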
ENV TRITON_CACHE_DIR="/model-cache/.triton"
ENV TORCHINDUCTOR_CACHE_DIR="/model-cache/.inductor"
ENV FLASHINFER_WORKSPACE_DIR="/model-cache/.flashinfer"
ENV DEEPGEMM_CACHE_DIR="/model-cache/.deepgemm"

With this volume mounted and volume.reload() called on container startup to pick up the latest committed files, subsequent container boots load the precompiled `.so` binaries from the cache instead of recompiling them. JIT overhead on warm boots dropped from 15 minutes to near zero.
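
The same redirection can be done from Python at container startup instead of in the image definition. The following is a minimal sketch, assuming the variable names shown above are the ones your installed library versions read; the helper name and the subdirectory layout are illustrative. A temporary directory stands in for the mounted volume so the snippet runs anywhere.

```python
import os
import tempfile
from pathlib import Path

# Variable names copied from the snippet above; verify them against the
# library versions you actually run.
CACHE_SUBDIRS = {
    "TRITON_CACHE_DIR": ".triton",
    "TORCHINDUCTOR_CACHE_DIR": ".inductor",
    "FLASHINFER_WORKSPACE_DIR": ".flashinfer",
    "DEEPGEMM_CACHE_DIR": ".deepgemm",
}


def configure_jit_caches(cache_root: str) -> dict[str, str]:
    """Create one cache directory per compiler under cache_root and export its path."""
    exported = {}
    for var, subdir in CACHE_SUBDIRS.items():
        path = Path(cache_root) / subdir
        path.mkdir(parents=True, exist_ok=True)
        os.environ[var] = str(path)
        exported[var] = str(path)
    return exported


if __name__ == "__main__":
    # In production cache_root would be the volume mount point (e.g. /model-cache)
    with tempfile.TemporaryDirectory() as root:
        for var, path in configure_jit_caches(root).items():
            print(f"{var}={path}")
```

These variables need to be set before the serving engine is launched, since a child process inherits the environment at spawn time and the compilers typically read their cache location once.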

How do you fix TP=8 DeepGEMM startup race conditions?

Use a crash-on-first-boot, succeed-on-second pattern: seed caches on TP=1, then let TP=8 containers read pre-written JIT artifacts from a persistent volume. Launching SGLang or vLLM with multi-GPU tensor parallelism (TP=8) during the GLM-5.1 compilation phase triggers a race condition: all 8 GPU workers try to write to the Triton/DeepGEMM JIT cache folder at the same time, their file locks collide, and the engine dies with a fatal cudaErrorIllegalAddress or an NCCL timeout error.

The Solution: "Crash-on-First-Boot, Succeed-on-Second" Pattern

Since compiling TP=8 concurrently was unstable, we implemented a robust two-phase fallback:

Phase 1 (Offline Compilation): Run a single-GPU compilation script (TP=1) on a cheaper L4/L40S GPU to seed the matrix shape cache.
Phase 2 (Asymmetric Setup): On the main TP=8 cluster, the volume is loaded. If a file-lock crash occurs, the container terminates immediately. Modal automatically respawns the container, which reads the now-written cache files from the volume without trying to compile them, succeeding in 2-7 seconds.
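
One way to make the two phases explicit is a marker file on the volume that records whether the cache has been seeded. This is a sketch under assumptions: the marker name, the helper names, and the compile callback are hypothetical, and the source does not describe a marker file (it relies on Modal restarting the crashed container).

```python
import os
import tempfile
from pathlib import Path
from typing import Callable

SENTINEL = ".jit_cache_seeded"  # hypothetical marker file name


def cache_is_seeded(cache_root: str) -> bool:
    return (Path(cache_root) / SENTINEL).exists()


def seed_cache(cache_root: str, compile_fn: Callable[[str], None]) -> None:
    """Phase 1: run the single-process compilation, then mark the cache as seeded."""
    compile_fn(cache_root)
    tmp = Path(cache_root) / (SENTINEL + ".tmp")
    tmp.write_text("ok")
    # Rename last so a half-finished seed run never looks complete
    os.replace(tmp, Path(cache_root) / SENTINEL)


def choose_boot_mode(cache_root: str) -> str:
    """Phase 2 decision: only start the TP=8 engine against a seeded cache."""
    return "launch_tp8" if cache_is_seeded(cache_root) else "seed_tp1"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(choose_boot_mode(root))  # seed_tp1
        # Stand-in for the real TP=1 compilation script
        seed_cache(root, lambda d: Path(d, "kernel.so").write_bytes(b""))
        print(choose_boot_mode(root))  # launch_tp8
```

Writing the marker only after compilation finishes means an interrupted seed run is retried rather than trusted.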

How do you prevent silent subprocess death in GPU inference containers?

Run an active Python watchdog that streams child logs and calls os._exit(1) on the parent when the serving subprocess dies, forcing Modal to recycle zombie containers. Modal orchestrates containers via a Python entry point. Because SGLang or vLLM must be started as a child process (to manage C++ libraries and multi-GPU IPC), standard Python crash handlers cannot catch their errors.

If SGLang crashed mid-request due to a CUDA Out-of-Memory (OOM) or NCCL sync failure, the child process died, but the Modal parent container remained running. Client requests continued routing to the container and received an endless stream of 502 Bad Gateway errors, because the connection to the dead serving process was refused.

The Fix: Active Heartbeat & Log Streaming Daemon

We built an active watchdog wrapper in Python that monitors child process health, pipes stdout/stderr to the main Modal logger, and terminates the container on any failure:

import subprocess
import threading
import sys
import os
import time

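def stream_logs(pipe):
    # Forward each line of child output to this process's stdout as it arrives
    for line in iter(pipe.readline, b''):
        # errors="replace" keeps a stray non-UTF-8 byte from killing the log thread
        sys.stdout.write(line.decode(errors="replace"))
        sys.stdout.flush()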

# Start the serving engine process
cmd = ["python", "-m", "sglang.launch_server", "--model-path", "/model-cache/GLM-5.1-FP8", "--tp", "8"]
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# Thread to stream logs in real-time to Modal dashboard
threading.Thread(target=stream_logs, args=(process.stdout,), daemon=True).start()

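# Monitor thread: polls the child and kills the container once it has exited
def monitor_engine():
    while True:
        if process.poll() is not None:
            # flush=True matters here: os._exit() skips the normal stdout flush
            print(f"CRITICAL: Serving process exited with code {process.returncode}", flush=True)
            # Force exit the parent container immediately so Modal recycles the node
            os._exit(1)
        time.sleep(10)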

threading.Thread(target=monitor_engine, daemon=True).start()
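
The watchdog logic can be exercised locally without a GPU by making the failure action injectable. The sketch below is a variation on the code above, not the production wrapper: `watch` and `on_death` are illustrative names, and a short-lived Python child stands in for the serving engine.

```python
import subprocess
import sys
import threading
import time
from typing import Callable


def watch(process: subprocess.Popen, on_death: Callable[[int], None],
          interval_s: float = 0.1) -> threading.Thread:
    """Poll the child and invoke on_death(returncode) once when it exits."""
    def _loop() -> None:
        while process.poll() is None:
            time.sleep(interval_s)
        on_death(process.returncode)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Stand-in for the serving engine: a child that fails with exit code 3
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    seen = []
    watch(child, seen.append).join(timeout=10)
    print(seen)  # [3]
```

In the container, `on_death` would be a callback that logs the exit code and calls os._exit(1), as in the original monitor thread.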

Why do weight-loading watchdogs kill large MoE deployments?

Default 60-second SGLang watchdogs terminate workers during 5–7 minute weight loads. Set --watchdog-timeout 1200 for 754B FP8 models. Loading 754B parameters from a persistent volume takes between 5 and 7 minutes; even in FP8, at one byte per parameter, that is roughly 754 GB of weights. By default, SGLang's internal watchdog kills the worker processes if they do not communicate with the master node within 60 seconds.

We configured --watchdog-timeout 1200 (20 minutes) to prevent the coordinator from prematurely killing workers during the long weights-loading phase.
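
A small guard when assembling the launch command can catch a timeout that is shorter than the expected weight-loading time. This sketch uses only the flags that appear in this article (--model-path, --tp, --watchdog-timeout); the helper name and the `expected_load_s` parameter are illustrative.

```python
def build_launch_cmd(model_path: str, tp: int, watchdog_timeout_s: int,
                     expected_load_s: int) -> list[str]:
    """Build the SGLang launch command, refusing timeouts shorter than the load time."""
    if watchdog_timeout_s <= expected_load_s:
        raise ValueError(
            f"watchdog timeout {watchdog_timeout_s}s must exceed "
            f"expected weight load time {expected_load_s}s"
        )
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--watchdog-timeout", str(watchdog_timeout_s),
    ]


# 7-minute worst-case load with the 1200-second timeout used above
cmd = build_launch_cmd("/model-cache/GLM-5.1-FP8", 8, 1200, 7 * 60)
print(" ".join(cmd))
```

A 1200-second timeout leaves a wide margin over a 7-minute load, which absorbs slower volume reads without letting a truly hung worker live indefinitely.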

What scale-to-zero savings are achievable on B200 clusters?

With kernel caching and reliable recycling, scale-to-zero on 4× B200 cut idle costs from ~$18,000/month to ~$4,500/month while preserving sub-minute recovery. Running 4× B200 GPUs continuously costs $18,000/month. By implementing GPU snapshots and caching, we reduced the warm-up latency to the point where we could use Modal's auto-scaling with min_containers=0.

Metric	Continuous (24/7)	Scale-to-Zero (6 hrs active/day)	Savings
Hourly rate	$25.00	$25.00	-
Monthly hours	720 hours	180 hours	540 hours (75%)
Monthly cost	$18,000	$4,500	$13,500 (75%)
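
The figures in the table follow from a 30-day month at $25/hr. A minimal sketch of the calculation (the function name is illustrative, and billed cold-start time is not included):

```python
def monthly_cost_usd(hourly_rate_usd: float, active_hours_per_day: float,
                     days: int = 30) -> float:
    """Monthly GPU spend for a cluster that is billed only while active."""
    return hourly_rate_usd * active_hours_per_day * days


always_on = monthly_cost_usd(25, 24)
scale_to_zero = monthly_cost_usd(25, 6)
savings = always_on - scale_to_zero

print(always_on)                  # 18000
print(scale_to_zero)              # 4500
print(savings)                    # 13500
print(100 * savings / always_on)  # 75.0
```

Real savings depend on how many cold starts occur per day, since each one adds billed warm-up minutes on top of the active hours.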

What are the key lessons for hardening production GPU clusters?

Stream subprocess logs, use os._exit(1) on child failure, and isolate JIT caches on persistent volumes. Never redirect serving engine output to /dev/null.

Subprocess logs are vital: Never redirect child process outputs to /dev/null. Pipe them to a stream logger to identify GPU errors immediately.
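Watch your exit codes: Use os._exit(1) to kill the parent process immediately on subprocess failure. sys.exit() only raises SystemExit, which ends just the calling thread when invoked from a monitor thread, and in the main thread it runs cleanup handlers that can hang indefinitely if GPUs are in an error state.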
Isolate caching: Always direct compiler cache folders to a persistent directory outside the ephemeral container root to guarantee warm-starts.

What errors appear most often on hardened B200 clusters?

TP=8 JIT file-lock races, NCCL timeouts during init, and watchdog kills during long weight loads are the top production failure modes. Each has a documented mitigation below.

Upstream Bug / Crash	Severity	Trigger Condition	Mitigation
`cudaErrorIllegalAddress`	Critical	Concurrent TP=8 Triton JIT compile writes	Single-GPU pre-compilation seed run
NCCL connection timeout	Critical	GPU node initialization delays	Set `NCCL_DEBUG=INFO` and check routing
Triton JIT compilation failure	High	Lock file corruption on concurrent writes	Added read-only mounts after compilation
`watchdog process terminated`	High	Model weight loading exceeding 60 seconds	Added `--watchdog-timeout 1200`

Source: §6 (GLM-5.1-FP8), §9 (Qwen3.5-397B-A17B).
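
The table can be turned into a simple log triage helper that maps a failure signature to its mitigation. This is a sketch under assumptions: the regular expressions are guesses at how these errors appear in logs (only `cudaErrorIllegalAddress` and `watchdog process terminated` are taken verbatim from the table), and the helper name is illustrative.

```python
import re
from typing import Optional

# (pattern, mitigation) pairs; patterns are assumed log signatures
MITIGATIONS = [
    (re.compile(r"cudaErrorIllegalAddress"),
     "Seed the JIT cache with a single-GPU pre-compilation run"),
    (re.compile(r"NCCL.*timeout", re.IGNORECASE),
     "Set NCCL_DEBUG=INFO and check routing"),
    (re.compile(r"watchdog process terminated"),
     "Raise --watchdog-timeout (1200 in this deployment)"),
]


def triage(log_line: str) -> Optional[str]:
    """Return the mitigation for the first matching signature, or None."""
    for pattern, mitigation in MITIGATIONS:
        if pattern.search(log_line):
            return mitigation
    return None


print(triage("RuntimeError: CUDA error: cudaErrorIllegalAddress"))
print(triage("server ready"))  # None
```

Hooking a helper like this into the log-streaming thread would let the watchdog print the suggested mitigation next to the crash message.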

Related deep dives

Cold Start Engineering

26m → 7m

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.

Serving Engine Internals

3 Engines · 10 Models

Source-level comparisons of vLLM, SGLang, and llama.cpp at production scale: quantization tradeoffs, memory behavior, cold starts, and when each engine wins on AWS EKS and Modal.

Systems Debugging

36 bugs · 7.5× regression

Flash attention investigation (7.5× regression), 36 bugs in GLM-5.1 code review, concurrency cliff discovery, and GPU snapshot SIGSEGV debugging on production Modal clusters.