Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India, with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, llama.cpp for LLM serving; LangGraph, MCP for agentic orchestration; AWS EKS, KubeRay, Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.
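The symlink approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production code: the function name `link_cache_to_volume` and both directory paths are hypothetical, and the actual JIT cache location depends on the engine version and environment.

```python
import os
import shutil
from pathlib import Path


def link_cache_to_volume(cache_dir: str, volume_dir: str) -> Path:
    """Point an ephemeral cache directory at a persistent volume via a symlink.

    Both paths are illustrative assumptions; substitute the JIT cache
    directory your engine actually uses and your volume's mount point.
    """
    cache = Path(cache_dir)
    volume = Path(volume_dir)
    volume.mkdir(parents=True, exist_ok=True)

    if cache.is_symlink():
        # Already linked (for example on a warm restart): keep it if correct.
        if cache.resolve() == volume.resolve():
            return cache
        cache.unlink()
    elif cache.exists():
        # First boot: keep any kernels compiled so far, then drop the local dir.
        for entry in cache.iterdir():
            target = volume / entry.name
            if not target.exists():
                shutil.move(str(entry), str(target))
        shutil.rmtree(cache)

    cache.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(volume.resolve(), cache, target_is_directory=True)
    return cache
```

Calling this before the engine starts means compiled kernels are written through the symlink onto the volume, so a later container boot finds them instead of recompiling.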

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

What is the difference between vLLM and SGLang for production serving?

vLLM provides mature continuous batching and KubeRay integration for dense and MoE models on EKS. SGLang offers built-in memory-saver hooks (/release_memory_occupation) critical for GPU snapshot workflows on Modal. Choose vLLM for DocumentAI-scale EKS pipelines; SGLang when snapshot-based scale-to-zero is required.

How do serving engines compare on cold start time?

Without snapshots: all engines suffer multi-minute boots on large models due to weight load and JIT. With GPU memory snapshots: 5×–24× improvements across engines; SGLang needs --enable-memory-saver, llama.cpp needs in-process or --no-mmap subprocess trees.

vLLM vs SGLang vs llama.cpp: Serving Engine Comparison at Scale

When should you use vLLM vs SGLang vs llama.cpp in production?

vLLM excels for high-throughput FP8/BF16 serving on multi-GPU EKS with KubeRay. SGLang wins for MoE models with memory-saver snapshot hooks. llama.cpp is the only practical choice for GGUF on heterogeneous or edge hardware. Each engine has distinct cold-start and memory tradeoffs documented across 10+ production models.

✨TL;DR Summary

Engine Comparison: Evaluated vLLM, SGLang, and llama.cpp across 10 production model deployments, analyzing quantization tradeoffs, memory leaks, and parallelization capabilities.
vLLM Snapshot Bug: Discovered that vLLM's default CUDA Graph capture caused infinite hangs upon container snapshot restoration. Fixed by forcing eager execution (enforce_eager=True).
SGLang Memory Leak: Identified and mitigated a slow VRAM leak in SGLang's FlashInfer allocation during tensor parallel sharding by capping prefill sizes and forcing PyTorch caching allocator garbage collection.
llama.cpp Limitations: Built a custom source-compiled Docker pipeline for llama.cpp to support new GGUF architectures, but determined it is unsuitable for high-concurrency production due to its lack of dynamic continuous batching.

Running LLMs in production requires choosing the right runtime engine. Across 10 model deployments, we evaluated the three dominant engines: vLLM, SGLang, and llama.cpp. We discovered deep architectural differences, memory leaks, and engine-specific quirks that impact scalability.

What is the difference between vLLM, SGLang, and llama.cpp?

vLLM excels at multi-GPU continuous batching, SGLang wins on structured JSON and RadixAttention, and llama.cpp is the only practical choice for GGUF on heterogeneous hardware.

| Feature / Metric | vLLM (v0.7.2) | SGLang (v0.4.3) | llama.cpp (b8765) |
| --- | --- | --- | --- |
| Primary Backend | PyTorch / Triton / vLLM Kernels | PyTorch / FlashInfer / Triton | Pure C / C++ (GGML) |
| Quantization Support | FP8, AWQ, GPTQ, INT4 | FP8, AWQ, GPTQ | GGUF (Q4_K_M, IQ4_NL, etc.) |
| Attention Optimization | PagedAttention, FlashAttention | RadixAttention, FlashInfer | Ring Buffer, FlashAttention |
| KV Cache Reuse | Prefix Caching (static) | Radix Cache (dynamic LRU) | Smart KV Cache shift |
| Multi-GPU Parallelism | TP + PP (highly stable) | TP (stable), PP (experimental) | Row/Column splitting (single node) |

When should you use vLLM in production?

Choose vLLM for high-throughput FP8/BF16 serving on multi-GPU EKS with KubeRay, but disable CUDA graph capture (enforce_eager=True) when using GPU snapshots. vLLM is the most mature engine for high-throughput batching. However, it exhibits critical gotchas when combined with serverless GPU checkpointing (CRIU).

The Snapshot Hang Bug

During deployment of GLM-4.7-Flash-FP8, the container would hang indefinitely upon snapshot restore. The root cause was vLLM's default **CUDA Graph capture**. At startup, vLLM captures CUDA graphs for shapes up to max_num_seqs=16.

When Modal captures a GPU snapshot, the captured CUDA graphs refer to memory addresses that become invalid upon container restoration. The fix was forcing eager execution or running eager warmups:

```python
# vLLM initialization with eager-mode execution guards
from vllm import LLM

llm = LLM(
    # Example model ID; the deployment described above served GLM-4.7-Flash-FP8.
    model="THUDM/glm-4-9b-chat",
    quantization="fp8",
    enforce_eager=True,  # Disable CUDA graph capture so snapshot (CRIU) restores are safe
    max_model_len=8192,
)
```

When should you use SGLang instead of vLLM?

SGLang wins when snapshot-based scale-to-zero and structured JSON routing matter. Watch for FlashInfer VRAM creep under concurrent streaming on large MoE models. SGLang excels at complex, multi-turn prompts and structured JSON outputs thanks to **RadixAttention**, which keeps KV cache entries in a radix tree with LRU eviction so that requests sharing a prefix can reuse them.
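To make the prefix-reuse idea concrete, here is a toy sketch. It is not SGLang's implementation: it uses a plain token-level trie rather than a compressed radix tree, stores no KV tensors, and the class names (`Node`, `TokenTrieCache`) and the `max_nodes` budget are hypothetical. It only shows how a shared prefix is matched and how the least recently used leaf is evicted.

```python
import itertools
from typing import Dict, Iterator, List, Optional


class Node:
    def __init__(self, parent: Optional["Node"] = None, token: Optional[int] = None) -> None:
        self.parent = parent
        self.token = token
        self.children: Dict[int, "Node"] = {}
        self.last_used = 0


class TokenTrieCache:
    """Toy prefix cache: one trie node per token, LRU eviction of leaves."""

    def __init__(self, max_nodes: int) -> None:
        self.root = Node()
        self.max_nodes = max_nodes
        self.size = 0
        self._clock = itertools.count(1)  # Logical time for LRU ordering

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched, now = self.root, 0, next(self._clock)
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            child.last_used = now
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Cache a token sequence, evicting old leaves if over budget."""
        node, now = self.root, next(self._clock)
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = Node(node, tok)
                node.children[tok] = child
                self.size += 1
            child.last_used = now
            node = child
        while self.size > self.max_nodes:
            self._evict_lru_leaf()

    def _iter_nodes(self) -> Iterator[Node]:
        stack = list(self.root.children.values())
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.children.values())

    def _evict_lru_leaf(self) -> None:
        leaves = [n for n in self._iter_nodes() if not n.children]
        victim = min(leaves, key=lambda n: n.last_used)
        del victim.parent.children[victim.token]
        self.size -= 1
```

In a multi-turn chat, each new request repeats the conversation so far, so `match_prefix` would report most of the prompt as already cached and only the new suffix would need prefill.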

The FlashInfer Memory Leak

During stress testing of the MiniMax-M2.7 (456B MoE) model on a 4× B200 cluster, we detected a slow but steady VRAM creep under concurrent streaming requests. A memory profiler showed that SGLang's FlashInfer attention buffer allocation was not freeing intermediate PyTorch tensor handles while the model was sharded across tensor parallel ranks.

We mitigated this by forcing garbage collection and flushing the PyTorch caching allocator after initialization, and by passing SGLang's --chunked-prefill-size 4096 to cap the size of each prefill chunk.
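The garbage collection and allocator flush can be sketched as below. This is a minimal sketch, assuming it is called once after the engine has finished loading and sharding weights. The helper name `release_unused_memory` is hypothetical, and where exactly to hook it into SGLang's startup is not specified here.

```python
import gc

try:
    import torch  # Optional so the sketch also runs on machines without PyTorch
except ImportError:
    torch = None


def release_unused_memory() -> int:
    """Run Python GC, then return cached but unused CUDA blocks to the driver.

    Returns the number of unreachable objects the garbage collector found.
    """
    # Drop unreachable Python objects first so the tensors they hold are freed.
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Release blocks the caching allocator holds but no live tensor uses.
        torch.cuda.empty_cache()
    return collected
```

Note that `torch.cuda.empty_cache()` cannot reclaim memory still referenced by live tensors, so this only helps once the leaked handles are actually unreachable.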

When is llama.cpp the right production serving engine?

Use llama.cpp for GGUF on L4, MacBook, or edge hardware, but not for high-concurrency production, where it lacks dynamic continuous batching. When serving GGUF models (like Gemma-4-26B-A4B-it), llama.cpp is the only one of the three engines capable of running across heterogeneous hardware (like MacBooks or cheap L4 GPUs).

Source-Compilation Pipeline

To support the latest Gemma-4 architectures before they landed in stable package releases, we built a custom multi-stage Docker build that compiles llama.cpp directly from source:

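```dockerfile
# Multi-stage CUDA builder
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y git cmake build-essential libcurl4-openssl-dev
# Link statically (BUILD_SHARED_LIBS=OFF) so the server binary does not need libllama/libggml .so files
RUN git clone --recursive https://github.com/ggerganov/llama.cpp.git && \
    cd llama.cpp && \
    cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF && \
    cmake --build build --config Release -j"$(nproc)" --target llama-server

# Runtime image: copy only the compiled server binary
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y libgomp1 libcurl4 && rm -rf /var/lib/apt/lists/*
COPY --from=builder /llama.cpp/build/bin/llama-server /usr/local/bin/llama-server
```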

We then wrap the llama-server binary in a Python subprocess, performing HTTP-level health checks before exposing it to the gateway.
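A wrapper of this kind might look like the following. It is a hedged sketch rather than the actual production wrapper: the function names, the default port, and the timeouts are assumptions. The `--model`, `--host`, and `--port` flags and the `/health` endpoint are part of llama-server.

```python
import subprocess
import time
import urllib.request
from typing import List


def wait_until_healthy(url: str, timeout_s: float = 300.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP errors such as
            # a 503 returned while the model is still loading.
            pass
        time.sleep(interval_s)
    return False


def start_llama_server(model_path: str, port: int = 8080) -> subprocess.Popen:
    """Launch llama-server and block until its /health endpoint reports ready."""
    cmd: List[str] = [
        "llama-server",
        "--model", model_path,
        "--host", "127.0.0.1",
        "--port", str(port),
    ]
    proc = subprocess.Popen(cmd)
    if not wait_until_healthy(f"http://127.0.0.1:{port}/health"):
        proc.terminate()
        raise RuntimeError("llama-server did not become healthy in time")
    return proc
```

Only after `start_llama_server` returns would the process be registered with the gateway. A fuller version would also check `proc.poll()` inside the loop to fail fast if the server exits during startup.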

How do quantization formats compare for production inference?

BF16 is the quality baseline; FP8 halves disk size with <0.05% perplexity loss on SGLang; GGUF Q4_K_M trades ~0.85% quality for 21 GB footprints on llama.cpp.

| Quantization | Size on Disk | Inference Speed | Perplexity Loss | Best Engine |
| --- | --- | --- | --- | --- |
| Unquantized (BF16) | ~74 GB | 18 tok/s (H100) | 0.00% (Baseline) | vLLM |
| FP8 (Dynamic) | ~37 GB | 35 tok/s (L40S) | < 0.05% | SGLang |
| GGUF (Q4_K_M) | ~21 GB | 24 tok/s (L4) | ~0.85% | llama.cpp |
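As a rough sanity check on the sizes above, the effective bits per weight can be estimated from disk size alone. This is a back-of-the-envelope sketch that assumes the BF16 checkpoint stores exactly 16 bits per weight and ignores metadata and any tensors kept at higher precision. The helper name is hypothetical.

```python
def effective_bits_per_weight(size_gb: float, baseline_bf16_gb: float) -> float:
    """Estimate bits per weight from disk size, assuming BF16 is 16 bits per weight."""
    return 16.0 * size_gb / baseline_bf16_gb


# Disk sizes taken from the table above.
sizes_gb = {"BF16": 74.0, "FP8 (Dynamic)": 37.0, "GGUF (Q4_K_M)": 21.0}

for name, size in sizes_gb.items():
    bits = effective_bits_per_weight(size, sizes_gb["BF16"])
    ratio = sizes_gb["BF16"] / size
    print(f"{name}: ~{bits:.1f} bits/weight, {ratio:.2f}x smaller than BF16")
```

Under these assumptions the Q4_K_M file works out to roughly 4.5 bits per weight, which fits a mixed 4-bit scheme that keeps some tensors at higher precision.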

What is the cheapest way to serve a 70B+ LLM on AWS?

Self-hosted EKS + KubeRay + vLLM on Spot GPUs with Karpenter/KEDA. DocumentAI achieved $0.025/document vs $0.466 on Bedrock (18.6× reduction) at 50K documents/day using Qwen3.6-35B FP8.
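The arithmetic behind the 18.6× figure, and what it implies per day at the stated volume, can be reproduced directly. The per-document prices and daily volume come from the text above. The daily totals are derived here and assume a flat per-document cost.

```python
SELF_HOSTED_PER_DOC = 0.025  # USD per document, self-hosted on EKS
BEDROCK_PER_DOC = 0.466      # USD per document on AWS Bedrock
DOCS_PER_DAY = 50_000


def daily_cost(per_doc_usd: float, docs: int) -> float:
    """Daily spend assuming a flat per-document cost."""
    return per_doc_usd * docs


reduction = BEDROCK_PER_DOC / SELF_HOSTED_PER_DOC
self_hosted_daily = daily_cost(SELF_HOSTED_PER_DOC, DOCS_PER_DAY)
bedrock_daily = daily_cost(BEDROCK_PER_DOC, DOCS_PER_DAY)

print(f"Cost reduction: {reduction:.1f}x")            # 18.6x
print(f"Self-hosted: ${self_hosted_daily:,.0f}/day")  # $1,250/day
print(f"Bedrock:     ${bedrock_daily:,.0f}/day")      # $23,300/day
print(f"Difference:  ${bedrock_daily - self_hosted_daily:,.0f}/day")
```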

What are the key lessons for choosing a serving engine?

Match the engine to workload shape: SGLang for structured JSON and snapshot hooks, vLLM for multi-node throughput, llama.cpp only for GGUF on edge or heterogeneous hardware.

SGLang wins for structured routing: SGLang compiles the output JSON schema directly into a regex guide, running up to 5× faster than vLLM's out-of-line JSON parsers.
vLLM is superior for multi-node setups: When scaling beyond a single node, vLLM's Ray integration makes Tensor Parallelism across machines seamless.
Never use llama.cpp for high concurrency: llama.cpp lacks dynamic continuous batching; it queues concurrent requests synchronously, resulting in high Time-to-First-Token (TTFT) when loaded.

Source: §2 (GLM-4.7-Flash-FP8), §5 (Gemma-4-26B-it), §6 (MiniMax-M2.7).

Frequently Asked Questions

When is llama.cpp the right production serving engine?: llama.cpp is optimal for GGUF-quantized models on heterogeneous hardware (L4, MacBook, edge) where vLLM/SGLang overhead is unnecessary. It requires custom Docker builds for bleeding-edge architectures (e.g. Gemma-4) and --no-mmap for CRIU snapshot compatibility.
What is the cheapest way to serve a 70B+ LLM on AWS?: Self-hosted EKS + KubeRay + vLLM on Spot GPUs with Karpenter/KEDA autoscaling. DocumentAI achieved $0.025/document vs $0.466 on Bedrock (18.6× reduction) at 50K documents/day using Qwen3.6-35B FP8.

Related deep dives

Cold Start Engineering

26m → 7m

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.

Hardening a $50/hr GPU Cluster

754B on 8× B200

Building crash monitors, DeepGEMM/FlashInfer kernel caches, and automated container recycling to keep 754B MoE and 397B FP8 clusters alive on Modal 8× and 4× B200 at $50/hr and $25/hr.

Building an Evaluation Harness

210 Scenarios

Designing a deterministic test suite with regex, code execution, and LLM-as-judge to verify GGUF quantization quality across 210 scenarios in reasoning, coding, and multimodal tasks.