Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India, with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, llama.cpp for LLM serving; LangGraph, MCP for agentic orchestration; AWS EKS, KubeRay, Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.
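
The symlink approach can be sketched in a few lines. This is a minimal illustration rather than the production setup: the helper name `link_cache_to_volume` is hypothetical, and the real cache and volume paths (for example, wherever FlashInfer writes its compiled kernels and wherever the persistent volume is mounted) are assumptions that depend on the deployment.

```python
import shutil
from pathlib import Path


def link_cache_to_volume(cache_dir: Path, volume_dir: Path) -> Path:
    """Replace cache_dir with a symlink to volume_dir (hypothetical helper).

    Anything the JIT compiler later writes under cache_dir then lands on
    the persistent volume and survives container restarts.
    """
    volume_dir.mkdir(parents=True, exist_ok=True)

    if cache_dir.is_symlink():
        if cache_dir.resolve() == volume_dir.resolve():
            return cache_dir  # already linked, nothing to do
        cache_dir.unlink()  # stale link pointing somewhere else
    elif cache_dir.exists():
        # Keep kernels compiled before linking: move them onto the volume.
        for entry in cache_dir.iterdir():
            target = volume_dir / entry.name
            if not target.exists():
                shutil.move(str(entry), str(target))
        shutil.rmtree(cache_dir)

    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    cache_dir.symlink_to(volume_dir, target_is_directory=True)
    return cache_dir
```

Calling this once at container start, before the serving engine imports its kernels, is the assumed usage. The call is idempotent, so it is safe to run on every boot.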

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

What causes SIGSEGV during GPU snapshot restore?

Common causes: memory-mapped model files (--mmap) incompatible with CRIU checkpointing, CUDA contexts in child processes invisible to parent snapshots, and race conditions during concurrent JIT compilation. Use --no-mmap, in-process loading, and TORCHINDUCTOR_COMPILE_THREADS=1.
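
As a rough sketch of how two of those mitigations might be applied when launching a server process: the helper names below are hypothetical, and the argument list is only a stand-in for whatever command the deployment actually runs. Only `--no-mmap` and `TORCHINDUCTOR_COMPILE_THREADS=1` come from the text above.

```python
import os


def snapshot_safe_env(base_env=None):
    """Return a copy of the environment with single-threaded Inductor compilation."""
    env = dict(os.environ if base_env is None else base_env)
    # Avoid concurrent JIT compilation racing with the snapshot.
    env["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
    return env


def snapshot_safe_args(args):
    """Return server args with memory-mapped loading disabled exactly once."""
    cleaned = [a for a in args if a != "--mmap"]
    if "--no-mmap" not in cleaned:
        cleaned.append("--no-mmap")
    return cleaned


# Hypothetical command line, for illustration only.
print(snapshot_safe_args(["server", "--model", "model.gguf", "--mmap"]))
print(snapshot_safe_env({})["TORCHINDUCTOR_COMPILE_THREADS"])
```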

Systems Debugging: GPU Snapshots, Flash Attention, MoE Code Review

How do you debug GPU snapshot failures and attention kernel regressions?

Isolate regressions with bisected builds and concurrency sweeps. A Flash Attention backport caused a 7.5× slowdown because prerequisite kernel patches were missing. GPU snapshot SIGSEGVs trace to mmap and process-tree visibility. A systematic MoE code review found 36 bugs before the production deploy.

TL;DR

Flash Attention Regression: Investigated a massive latency scaling issue (6.7s to 600s+) in GLM-4.7-Flash. Discovered that llama.cpp was silently falling back to O(n²) attention because the model's GQA ratio (20) was not divisible by 16.
GLM-5.1 Code Review: Identified and fixed 36 critical infrastructure bugs in a production SGLang deployment, including missing watchdog timeouts, uncaptured subprocess stdout, and race conditions.
Concurrency Cliff Discovery: Profiled llama.cpp throughput and discovered a catastrophic performance collapse (0 tokens/s) at 4+ concurrent requests due to head-of-line blocking in its single-threaded architecture.

Debugging production ML systems requires going deep: into CUDA kernels, framework internals, subprocess management, and upstream source code. These investigations spanned multiple stacks and uncovered both critical bugs and fundamental architecture issues.

Why did a Flash Attention backport cause a 7.5× regression in llama.cpp?

PR #18953 relaxes the GQA constraint but depends on prerequisite kernel changes. Backporting it without them caused ABI incompatibilities and a 7.5× regression, on top of the original problem: a silent O(n²) fallback because the GQA ratio of 20 is not divisible by 16.

GLM-4.7-Flash GGUF on L40S showed suspicious behavior: VRAM usage stayed constant at 24.54 GiB regardless of context length, and latency grew from 6.7s at 1K tokens to 600s+ at 16K (timeout). Setting flash_attn=True had no measurable effect.
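
One way to quantify that latency curve is the slope between the two measurements on a log-log plot: a slope near 1 means linear scaling, near 2 means quadratic. The sketch below uses only the two data points quoted above and assumes 1K and 16K mean 1,000 and 16,000 tokens. Because the 16K run timed out, 600s is a lower bound and so is the resulting exponent.

```python
import math


def scaling_exponent(n1: float, t1: float, n2: float, t2: float) -> float:
    """Slope of latency vs. context length between two points on a log-log plot."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# 6.7s at 1K tokens, 600s+ (timeout) at 16K tokens.
k = scaling_exponent(1_000, 6.7, 16_000, 600.0)
print(f"empirical exponent >= {k:.2f}")
```

The exponent comes out at roughly 1.6 as a floor, clearly superlinear, which is consistent with an attention path that is not behaving as expected.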

Root cause: llama-cpp-python v0.3.16 ships vendored llama.cpp from August 2025. The CUDA flash attention kernel only supports GQA ratios divisible by 16. GLM-4.7-Flash's GQA ratio is 20. When unsupported, llama.cpp silently falls back to O(n²) standard attention with no warning.

The fix exists upstream: PR #18953 relaxes the constraint to GQA ratios divisible by 4. However, backporting it without the prerequisite kernel changes caused ABI incompatibilities and a 7.5× regression.
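
The divisibility rule is easy to check ahead of time. The sketch below is not llama.cpp code: `check_flash_attn` is a hypothetical pre-flight helper that encodes the two constraints described above (divisible by 16 in the vendored build, divisible by 4 after PR #18953) and warns loudly instead of falling back silently.

```python
import warnings


def check_flash_attn(gqa_ratio: int, required_divisor: int) -> bool:
    """Return True if the GQA ratio satisfies the kernel's divisibility rule.

    Emits a RuntimeWarning when it does not, so the fallback is never silent.
    """
    if gqa_ratio % required_divisor != 0:
        warnings.warn(
            f"GQA ratio {gqa_ratio} is not divisible by {required_divisor}; "
            "expect a fallback to standard attention",
            RuntimeWarning,
        )
        return False
    return True


# GQA ratio 20, as reported for GLM-4.7-Flash above.
print(check_flash_attn(20, 16))  # vendored constraint
print(check_flash_attn(20, 4))   # relaxed constraint from PR #18953
```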

How do you find bugs before deploying 754B MoE models to production?

Run a systematic code review against upstream serving engine sources, combined with staged deploys and crash monitors. The GLM-5.1 FP8 review, a two-pass review of a production SGLang deployment on Modal, surfaced 36 issues before live traffic.

Critical Bugs (First Pass)

| Bug | Severity | Fix |
| --- | --- | --- |
| `serve()` / `startup()` race condition | Critical | Start subprocess inside `serve()` |
| No subprocess stdout/stderr capture | Critical | `stdout=PIPE, stderr=STDOUT` with daemon thread |
| `region=REGION` silently ignored | Medium | Not valid on `@app.cls` |
| Missing `--watchdog-timeout` | Critical | Set to 1200 (20 min) for long loads |
| No subprocess crash detection | Critical | Background monitor with `os._exit(1)` |
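
The stdout/stderr capture fix can be sketched with the standard library alone. This is an illustrative outline under assumptions, not the deployment's code: `start_with_log_streaming` is a hypothetical name and `sink` stands in for whatever logger the service uses. It opens the pipe with `text=True, bufsize=1`, since the Python documentation states that line buffering is only usable in text mode.

```python
import subprocess
import threading


def start_with_log_streaming(cmd, sink=print):
    """Start cmd and forward its merged stdout/stderr to sink, line by line."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same pipe
        text=True,                 # text mode is required for line buffering
        bufsize=1,                 # line-buffered
    )

    def pump():
        # Iterating the pipe yields lines as the child flushes them.
        for line in proc.stdout:
            sink(line.rstrip("\n"))

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    return proc, reader
```

The reader is a daemon thread so it never keeps the container alive on its own. Note that the child process may still buffer its own output, so some engines also need an unbuffered mode enabled on their side.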

Second-Pass Bugs

| Bug | Severity | Fix |
| --- | --- | --- |
| `bufsize=1` in binary mode (1-byte buffer) | Critical | `text=True, bufsize=1` |
| Monitor thread races with startup | Critical | Add `_startup_complete` gate |
| Missing `dg_volume.reload()` in compile | High | Add reload before compilation |
| 10-15 min silent compilation | High | Stream output instead of `capture_output=True` |
| No per-request token cap | Medium | Add `--max-total-tokens 65536` |
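
A gated crash monitor might look like the following. The exact semantics of the `_startup_complete` gate are not spelled out above, so this sketch assumes the simplest reading: the monitor stays idle until startup signals completion, and only then treats a subprocess exit as a crash. `start_crash_monitor` and its parameters are hypothetical. The `on_crash` hook defaults to `os._exit(1)` as in the first-pass fix and is injectable so the logic can be exercised without killing the interpreter.

```python
import os
import threading
import time


def start_crash_monitor(proc, startup_complete, on_crash=None, poll_s=1.0):
    """Watch proc in a daemon thread once startup_complete is set.

    proc is a subprocess.Popen; startup_complete is a threading.Event.
    """
    if on_crash is None:
        # Hard-exit so the platform recycles the container.
        def on_crash(returncode):
            os._exit(1)

    def watch():
        startup_complete.wait()        # gate: do nothing during startup
        while proc.poll() is None:     # process still running
            time.sleep(poll_s)
        on_crash(proc.returncode)

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread
```

With this shape, a failure during startup is left to the startup code path to report, and the monitor cannot exit the container while weights are still loading.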

Why does llama.cpp throughput collapse at 4+ concurrent requests?

llama-cpp-python inference is strictly single-threaded. At N≥4 concurrent requests, queue depth causes head-of-line blocking that cascades into timeouts and 0 aggregate tok/s:

| Concurrent N | Wall Time | Agg tok/s |
| --- | --- | --- |
| 1 | 2.34s | 54.8 |
| 2 | 2.27s | 56.4 |
| 4 | 24.82s | 0.0 |
| 8 | 54.80s | 0.0 |

Root cause: Inference is strictly single-threaded. At N≥4, queue depth causes head-of-line blocking that cascades into timeouts.
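
A concurrency sweep of this kind can be approximated with a small harness. The sketch below is an assumption about the shape of such a test, not the tool used for the numbers above: `request_fn` is a hypothetical callable that sends one request and returns the number of generated tokens, raising `TimeoutError` when the request times out. Timed-out requests contribute zero tokens, which is how an aggregate of 0.0 tok/s can arise.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_level(request_fn, n):
    """Fire n concurrent requests; return (wall_seconds, aggregate_tok_per_s)."""
    def one(_):
        try:
            return request_fn()
        except TimeoutError:
            return 0  # a timed-out request produced no usable tokens

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        tokens = sum(pool.map(one, range(n)))
    wall = time.perf_counter() - start
    return wall, (tokens / wall if wall > 0 else 0.0)


def sweep(request_fn, levels=(1, 2, 4, 8)):
    """Run each concurrency level in turn and collect the results."""
    return {n: run_level(request_fn, n) for n in levels}
```

Sweeping a few levels around the expected production concurrency, rather than only testing N=1, is what exposes a cliff like the one in the table.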

What are the key lessons for debugging production ML systems?

Silent optimization fallbacks, missing subprocess log streaming, and untested concurrency cliffs are the highest-risk failure modes. Always verify kernels are active and test at realistic load.

Silent failures are the worst kind: flash attention silently fell back to O(n²). Always verify optimizations are active.
Read framework source code: the GQA ratio constraint was not documented. Found by reading CUDA kernel dispatch logic.
Source patches are valid but risky: ABI incompatibilities can make backports worse than the original problem.
Subprocess log streaming is non-negotiable: zero visibility into 5-7 minute weight loading is unacceptable.
Use text=True with bufsize=1: bufsize=1 in binary mode becomes a catastrophically slow 1-byte buffer.
Concurrency testing must cover realistic loads: the cliff at N≥4 only appeared under realistic load testing.

Source: §3 (Flash Attention Investigation, Concurrency Cliff), §4 (GLM-5.1 FP8 Code Review). Upstream issues: llama.cpp PR #18953.

Frequently Asked Questions

Why did a Flash Attention backport cause a 7.5× regression in llama.cpp?

PR #18953 relaxes GQA constraints but depends on prerequisite kernel changes. Backporting without those changes caused ABI incompatibilities and a 7.5× slowdown. The fix requires a full upstream merge or waiting for the dependent patches.

What causes SIGSEGV during GPU snapshot restore?

Common causes: memory-mapped model files (--mmap) incompatible with CRIU checkpointing, CUDA contexts in child processes invisible to parent snapshots, and race conditions during concurrent JIT compilation. Use --no-mmap, in-process loading, and TORCHINDUCTOR_COMPILE_THREADS=1.

How do you find bugs before deploying 754B MoE models to production?

Run systematic code review against upstream serving engine sources, concurrency cliff testing at multiple request rates, and staged deploys with crash monitors. GLM-5.1 FP8 review surfaced 36 issues before live traffic.

Related deep dives

Hardening a $50/hr GPU Cluster

754B on 8× B200

Building crash monitors, DeepGEMM/FlashInfer kernel caches, and automated container recycling to keep 754B MoE and 397B FP8 clusters alive on Modal 8× and 4× B200 at $50/hr and $25/hr.

Cold Start Engineering

26m → 7m

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.

Serving Engine Internals

3 Engines · 10 Models

Source-level comparisons of vLLM, SGLang, and llama.cpp at production scale: quantization tradeoffs, memory behavior, cold starts, and when each engine wins on AWS EKS and Modal.