Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India, with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, and llama.cpp for LLM serving; LangGraph and MCP for agentic orchestration; AWS EKS, KubeRay, and Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

Offline WebGPU Inference: Gemma-4 and Qwen3 in the Browser

How do you run LLMs entirely offline in the browser with WebGPU?

Use Transformers.js with the WebGPU backend (10–15× faster than WASM), a Web Worker for streaming off the React main thread, and two-layer caching: a Service Worker for the app shell and the Cache API for multi-GB ONNX weights. The production PWA at gemma4.quantml.org achieves 30–60 tok/s on desktop and 15–25 tok/s on mobile.

TL;DR

WebGPU vs WASM: WebGPU is 10-15× faster, achieving 30-60 tokens/s on desktop and 15-25 tokens/s on mobile.
Zero Latency Streaming: Implemented Web Workers and postMessage to bypass React main-thread blocking, enabling true progressive streaming.
Two-Layer Caching: Used a custom Service Worker for the app shell and the Browser Cache API for ONNX model files (~3.4 GB) to eliminate redundant network requests.
Multimodal Support: Gemma 4 E2B, which alternates local sliding-window and global attention, accepts text, image (downscaled canvas), and audio (16 kHz PCM, log-mel spectrogram) inputs entirely in the browser.
Serverless Limits Avoided: Used Next.js static exports (output: 'export') to bypass the Vercel 250MB serverless function limit caused by native ONNX binaries.

Running LLMs directly in the browser, fully offline with no server and no API calls, requires solving hard problems around WebGPU compatibility, streaming architecture, PWA caching, and mobile constraints. I have built two production PWAs that do this: Qwen3.5-0.8B (text + image) and Gemma 4 E2B (text + image + audio).

What production browser LLM deployments run entirely offline?

Two live PWAs, Qwen3.5-0.8B (~850 MB, text + image) and Gemma 4 E2B (~3.4 GB, text + image + audio), run with zero server calls using WebGPU and two-layer PWA caching.

Model	Size	Modalities	Backend	Live
Qwen3.5-0.8B ONNX (q4)	~850 MB	Text + Image	WebGPU / WASM	qwen.quantml.org
Gemma 4 E2B ONNX (q4f16)	~3.4 GB	Text + Image + Audio	WebGPU	gemma4.quantml.org

How is offline browser LLM inference architected?

A Next.js static export serves the app, Transformers.js runs in a Web Worker on ONNX Runtime Web and streams tokens via postMessage, and caching is split into two layers: Service Worker and Cache API. Both apps share this core architecture: a Next.js 15 static export (no server needed) with a dedicated Web Worker running Transformers.js on ONNX Runtime Web. The main thread communicates with the worker via postMessage, and the service worker handles app-shell caching separately from model file caching.

Main Thread (React 19)           Web Worker                 Cache Layer
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ AppShell → ChatInput │    │ Transformers.js  │    │ Service Worker   │
│ → MessageBubble      │◄──►│ ONNX Runtime Web │    │ (app shell: HTML,│
│ → ModelStatus        │    │ WebGPU or WASM   │    │  JS, CSS, icons) │
│                      │    │ TextStreamer     │    │                  │
│ useInferenceEngine   │    │ → postMessage    │    │ Transformers.js  │
│ → TransformersEngine │    │   per token      │    │ Cache API        │
└──────────────────────┘    └──────────────────┘    │ (model files)    │
                                                     └──────────────────┘

How much faster is WebGPU than WASM for browser LLM inference?

WebGPU is 10–15× faster than WASM: 30–60 tok/s on desktop and 15–25 tok/s on mobile versus single-digit tok/s on WASM for sub-4B models. The gap applies to models over 100M parameters. WebGPU does, however, require a secure context (HTTPS or localhost), Chrome 113+ on desktop or Chrome 121+ on Android 12+ with a supported GPU, and an adapter with a sufficient maximum buffer size.

The detection waterfall checks: secure context → navigator.gpu existence → adapter availability → maxBufferSize ≥ 256 MB → shader-f16 support. If WebGPU is selected but fails at runtime, the engine automatically retries with WASM. Network errors are re-thrown since they would fail on WASM too.
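The detection waterfall above could be sketched roughly as follows. This is a minimal illustration, not the apps' actual code: the names `detectBackend`, `EnvLike`, `GpuLike`, and `AdapterLike` are hypothetical, and treating a missing `shader-f16` feature as a reported flag rather than a hard failure is an assumption. Only the order of the checks and the 256 MB threshold come from the text.

```typescript
// Sketch only: type and function names here are hypothetical.
type Backend = "webgpu" | "wasm";

interface DetectionResult {
  backend: Backend;
  reason: string;
  shaderF16: boolean;
}

// Structural stand-ins for GPUAdapter / navigator.gpu so the sketch
// compiles without @webgpu/types and can be exercised outside a browser.
interface AdapterLike {
  limits: { maxBufferSize: number };
  features: { has(name: string): boolean };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}
interface EnvLike {
  isSecureContext: boolean;
  gpu?: GpuLike;
}

const MIN_BUFFER_BYTES = 256 * 1024 * 1024; // 256 MB threshold from the text

async function detectBackend(env: EnvLike): Promise<DetectionResult> {
  const useWasm = (reason: string): DetectionResult => ({
    backend: "wasm",
    reason,
    shaderF16: false,
  });

  // 1. Secure context (HTTPS or localhost).
  if (!env.isSecureContext) return useWasm("insecure context");
  // 2. navigator.gpu must exist.
  if (!env.gpu) return useWasm("navigator.gpu missing");

  // 3. An adapter must be available.
  let adapter: AdapterLike | null = null;
  try {
    adapter = await env.gpu.requestAdapter();
  } catch {
    return useWasm("requestAdapter() failed");
  }
  if (!adapter) return useWasm("no adapter");

  // 4. The adapter must allow large enough buffers.
  if (adapter.limits.maxBufferSize < MIN_BUFFER_BYTES) {
    return useWasm("maxBufferSize < 256 MB");
  }

  // 5. Assumption: missing shader-f16 is reported, not treated as fatal.
  return {
    backend: "webgpu",
    reason: "all checks passed",
    shaderF16: adapter.features.has("shader-f16"),
  };
}
```

In a browser, the environment object would be built from `globalThis.isSecureContext` and `navigator.gpu`. Passing it in as a parameter keeps the checks testable without a GPU.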

Metric	WebGPU (Desktop)	WebGPU (Pixel 9a)	WASM (Pixel 9a)
TTFT	1-3s	3-8s	10-30s
Decode speed	30-60 tok/s	15-25 tok/s	2-8 tok/s
Runtime memory	~1-2 GB	~1-2 GB	~0.8-1.5 GB

Why does WASM block progressive token streaming in React?

WASM session.run() blocks the main thread without yielding, so React batches all setState calls until generation finishes. Only a Web Worker delivers true progressive streaming. Transformers.js's model.generate() runs an internal autoregressive loop. With WASM, session.run() executes synchronously on the calling thread, and each await resolves as a microtask without yielding to the macrotask queue. React's setMessages() calls are batched and never flushed until the entire loop finishes. The result: all text appears at once after a long delay.

Solution: Web Worker. Each token is sent from the worker via postMessage({ type: 'token', token }). Every postMessage delivery is a separate macrotask on the main thread's event loop, so React processes each setMessages() update in its own render cycle. The result is true progressive streaming on both WebGPU and WASM.
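The message protocol between worker and main thread could look something like the sketch below. Only the `{ type: 'token', token }` shape comes from the text. The `done` and `error` messages and the names `createTokenEmitter` and `reduceStream` are hypothetical additions for illustration.

```typescript
// Messages sent from the worker to the main thread.
// Only the "token" shape is from the article; "done" and "error" are assumed.
type WorkerMessage =
  | { type: "token"; token: string }
  | { type: "done" }
  | { type: "error"; message: string };

// Worker side: wraps whatever function posts to the main thread.
// In a real worker this would be: createTokenEmitter((m) => self.postMessage(m)),
// with onToken passed as the streamer's per-token callback.
function createTokenEmitter(post: (msg: WorkerMessage) => void) {
  return {
    onToken: (token: string): void => post({ type: "token", token }),
    onDone: (): void => post({ type: "done" }),
    onError: (e: unknown): void =>
      post({ type: "error", message: e instanceof Error ? e.message : String(e) }),
  };
}

// Main-thread side: a pure reducer that a React component could call from
// worker.onmessage, e.g. setStream((s) => reduceStream(s, event.data)).
interface StreamState {
  text: string;
  phase: "streaming" | "done" | "error";
  error?: string;
}

function reduceStream(state: StreamState, msg: WorkerMessage): StreamState {
  switch (msg.type) {
    case "token":
      return { ...state, text: state.text + msg.token };
    case "done":
      return { ...state, phase: "done" };
    case "error":
      return { ...state, phase: "error", error: msg.message };
  }
}
```

Because each message arrives as its own task on the main thread, every reducer call can trigger its own render instead of being batched with the rest of the generation.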

How do you cache multi-gigabyte ONNX models in a PWA?

Use a Service Worker for the app shell and the Transformers.js Cache API cache for ONNX weights. Both caches persist across sessions, and the Service Worker must skip HuggingFace URLs so model files are not cached twice.

Layer 1: Service Worker (app shell). Hand-written sw.js, not Workbox. Caches all static assets (HTML, JS, CSS) using a cache-first strategy. Navigation uses network-first with a cache fallback. Explicitly skips HuggingFace URLs, .onnx, and .onnx_data since model files are handled by Layer 2.

Layer 2: Transformers.js internal cache (model files). Transformers.js uses the browser Cache API directly for ONNX model files. First load downloads ~850 MB (Qwen3.5) or ~3.4 GB (Gemma 4 E2B). Subsequent loads: instant from cache, no network needed. These caches persist across sessions and survive service worker updates.
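The Layer 1 routing rules can be sketched as a small pure function that a service worker's fetch handler would consult. This is an illustration under assumptions: `chooseStrategy` and `Strategy` are hypothetical names, the hostname list is a guess at which HuggingFace hosts matter, and bypassing non-GET requests is an added assumption.

```typescript
type Strategy = "bypass" | "network-first" | "cache-first";

// Minimal view of a fetch Request; mode is "navigate" for page loads.
interface RequestInfoLike {
  url: string;
  method: string;
  mode: string;
}

function chooseStrategy(req: RequestInfoLike): Strategy {
  // Assumption: only GET requests are candidates for caching.
  if (req.method !== "GET") return "bypass";

  const url = new URL(req.url);

  // Assumed host list; the text only says "HuggingFace URLs".
  const isHuggingFace =
    url.hostname === "huggingface.co" || url.hostname.endsWith(".huggingface.co");

  // Model weight files are left to the Transformers.js cache (Layer 2).
  const isModelFile =
    url.pathname.endsWith(".onnx") || url.pathname.endsWith(".onnx_data");

  if (isHuggingFace || isModelFile) return "bypass";

  // Page navigations: network-first with cache fallback.
  if (req.mode === "navigate") return "network-first";

  // Everything else (JS, CSS, icons): cache-first.
  return "cache-first";
}
```

In sw.js, a `"bypass"` result would mean returning from the fetch handler without calling `event.respondWith()`, so the browser handles the request normally.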

Can you run multimodal models in the browser?

Yes. Gemma 4 E2B runs text, image (downscaled canvas), and audio (16 kHz PCM log-mel spectrogram) entirely client-side via WebGPU with alternating local/global attention. Gemma 4 E2B is a 2.3B effective parameter model supporting text, image, and audio input in the browser. It uses an alternating local sliding-window (512 tokens) and global full-context attention pattern, with a 128K token context window.

Audio Processing Pipeline

MediaRecorder captures WebM/Opus via getUserMedia
FileReader.readAsDataURL converts the blob to a data URL
AudioContext({ sampleRate: 16000 }) + decodeAudioData converts to 16kHz PCM
Stereo-to-mono downmix with sqrt(2)/2 scaling
Float32Array passed to the Gemma4 processor
Gemma4AudioFeatureExtractor computes log-mel spectrogram internally
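The stereo-to-mono step in the list above can be illustrated with a short helper. `downmixToMono` is a hypothetical name. The inputs are assumed to be the two channel arrays obtained from a decoded AudioBuffer (for example via `getChannelData(0)` and `getChannelData(1)`).

```typescript
// Mix two channels into one, scaling the sum by sqrt(2)/2.
function downmixToMono(left: Float32Array, right: Float32Array): Float32Array {
  const length = Math.min(left.length, right.length);
  const mono = new Float32Array(length);
  const scale = Math.SQRT1_2; // sqrt(2)/2, about 0.7071

  for (let i = 0; i < length; i++) {
    mono[i] = scale * (left[i] + right[i]);
  }
  return mono;
}
```

If the decoded audio has only one channel, that channel can be passed to the processor directly and no downmix is needed.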

Image Processing Pipeline

User selects image via file input
createImageBitmap decodes it
Canvas downscales to max 1024×1024 preserving aspect ratio
canvas.toDataURL('image/jpeg', 0.85) compresses
RawImage.read(dataUrl) converts to model format
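The downscale step only needs the target canvas dimensions. A minimal sketch of that calculation follows. `fitWithin` is a hypothetical helper name, and rounding to whole pixels is an assumption.

```typescript
interface Size {
  width: number;
  height: number;
}

// Compute canvas dimensions that fit inside maxSide x maxSide while
// preserving aspect ratio. Images already small enough are not upscaled.
function fitWithin(width: number, height: number, maxSide = 1024): Size {
  if (width <= 0 || height <= 0) {
    throw new RangeError("width and height must be positive");
  }
  const scale = Math.min(1, maxSide / width, maxSide / height);
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

The resulting size would be assigned to the canvas before calling `drawImage` with the decoded bitmap, so the JPEG produced by `toDataURL` is already at the reduced resolution.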

How do you track progress for multi-gigabyte model downloads?

Use a shared tracker with a progress watermark, smoothed speed, and ETA. Throttle React updates to one every 150 ms to avoid flooding the main thread during 3.4 GB downloads. The Transformers.js progress_callback fires events per file, but event.total is frequently 0 (the server does not send Content-Length for partial responses) and multiple files download concurrently. The solution: a single shared tracker with a progress watermark that prevents the percentage from going backward, a smoothed speed calculation, and an ETA display.

Three phases shown in the UI: Loading processor (spinner, small files), Downloading model (progress ring with bytes, speed, ETA), and Compiling model (spinner, WebGPU shader compilation).
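One way to sketch the shared tracker is shown below. The class and field names are hypothetical, the exponential moving average is just one possible smoothing method, and `expectedTotalBytes` is an assumed fallback for when per-file totals are reported as 0. The current time is passed in explicitly so the logic can be exercised without real timers.

```typescript
interface FileProgress {
  file: string;
  loaded: number;
  total: number; // may be 0 when the server omits Content-Length
}

interface ProgressSnapshot {
  percent: number;
  bytesLoaded: number;
  speedBps: number;
  etaSeconds: number | null;
}

class DownloadTracker {
  private files = new Map<string, { loaded: number; total: number }>();
  private watermark = 0;
  private speedBps = 0;
  private lastBytes = 0;
  private lastTimeMs: number | null = null;
  private expectedTotalBytes: number;
  private alpha: number;

  constructor(expectedTotalBytes: number, alpha = 0.2) {
    this.expectedTotalBytes = expectedTotalBytes; // assumed fallback total
    this.alpha = alpha; // smoothing factor for the moving average
  }

  update(ev: FileProgress, nowMs: number): ProgressSnapshot {
    const prev = this.files.get(ev.file);
    this.files.set(ev.file, {
      loaded: Math.max(prev?.loaded ?? 0, ev.loaded),
      total: ev.total > 0 ? ev.total : prev?.total ?? 0,
    });

    // Aggregate across all concurrently downloading files.
    let loaded = 0;
    let knownTotal = 0;
    for (const f of this.files.values()) {
      loaded += f.loaded;
      knownTotal += f.total;
    }
    const total = Math.max(this.expectedTotalBytes, knownTotal, loaded);

    // Watermark: the displayed percentage never goes backward.
    const rawPercent = total > 0 ? (100 * loaded) / total : 0;
    this.watermark = Math.max(this.watermark, Math.min(100, rawPercent));

    // Smoothed speed via an exponential moving average.
    if (this.lastTimeMs !== null && nowMs > this.lastTimeMs) {
      const instant = ((loaded - this.lastBytes) * 1000) / (nowMs - this.lastTimeMs);
      this.speedBps =
        this.speedBps === 0 ? instant : this.alpha * instant + (1 - this.alpha) * this.speedBps;
    }
    this.lastBytes = loaded;
    this.lastTimeMs = nowMs;

    return {
      percent: this.watermark,
      bytesLoaded: loaded,
      speedBps: this.speedBps,
      etaSeconds: this.speedBps > 0 ? (total - loaded) / this.speedBps : null,
    };
  }
}
```

A caller would feed every progress event into `update()` with `performance.now()` and push the returned snapshot into React state on the throttled 150 ms schedule.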

How do you prevent race conditions in browser inference engines?

Guard initialize(), generate(), and dispose() with promises, an epoch counter for stale callbacks, and disposeTensors() in finally blocks to prevent GPU memory leaks. The engine has three async operations that can race: initialize(), generate(), and dispose(). Guards prevent concurrent calls via init/generate promises, an epoch counter that increments on each init/dispose (stale callbacks bail), and a disposed flag checked at every async boundary.

GPU memory leaks are prevented by always calling disposeTensors(inputs) in a finally block after every generate() call, and disposing the old model before loading a new one.
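The epoch counter and the finally-block cleanup can be sketched together as below. `EpochGuard` and `runGuarded` are hypothetical names. The `run` and `disposeTensors` callbacks stand in for the real model call and tensor cleanup, whose exact signatures are not given in the text.

```typescript
// Tracks which init/dispose "generation" is current so stale work can bail.
class EpochGuard {
  private epoch = 0;
  private disposed = true;

  // Called at the start of initialize(); returns a token for this epoch.
  begin(): number {
    this.disposed = false;
    this.epoch += 1;
    return this.epoch;
  }

  // Called by dispose(); invalidates every outstanding token.
  dispose(): void {
    this.disposed = true;
    this.epoch += 1;
  }

  isCurrent(token: number): boolean {
    return !this.disposed && token === this.epoch;
  }
}

// Runs one generation. Returns null if the engine was disposed or
// re-initialized in the meantime. Inputs are always released.
async function runGuarded<I, O>(
  guard: EpochGuard,
  token: number,
  inputs: I,
  run: (inputs: I) => Promise<O>,
  disposeTensors: (inputs: I) => void,
): Promise<O | null> {
  try {
    if (!guard.isCurrent(token)) return null; // check before starting
    const output = await run(inputs);
    return guard.isCurrent(token) ? output : null; // check after the await
  } finally {
    disposeTensors(inputs); // cleanup on success, failure, or stale result
  }
}
```

A stale result is dropped instead of being written into UI state, and the cleanup callback still runs exactly once per call.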

How do you deploy browser LLMs on Vercel without hitting the 250MB limit?

Use output: 'export' for a fully static site and webpack aliases to exclude onnxruntime-node (~355 MB). Inference is 100% client-side, so no serverless functions are needed. @huggingface/transformers pulls in onnxruntime-node (~355MB of native binaries) as a dependency. Even though inference runs entirely client-side, Next.js traces it into the serverless function bundle, exceeding Vercel's 250MB limit.

Solution: output: 'export' in next.config.ts. This produces a fully static site (HTML + JS + CSS) with zero serverless functions. Since the app is 100% client-side, no server is needed. The static export is combined with webpack aliases (sharp$: false, onnxruntime-node$: false) that prevent bundling server-only packages.
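A next.config.ts along these lines would express both settings. This is a minimal sketch assuming a webpack-based Next.js 15 build; any other options the real projects use are not shown.

```typescript
// next.config.ts (sketch; other project-specific options omitted)
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  // Emit a fully static site: no serverless functions are generated.
  output: "export",

  webpack: (config) => {
    // Keep server-only native packages out of the bundle.
    config.resolve.alias = {
      ...config.resolve.alias,
      "sharp$": false,
      "onnxruntime-node$": false,
    };
    return config;
  },
};

export default nextConfig;
```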

What mobile constraints block WebGPU LLM inference?

HTTPS, Chrome 121+ on Android 12+, hardware acceleration enabled, and sufficient device memory are all required. Each failure mode needs explicit detection and WASM fallback.

Constraint	Impact	Mitigation
HTTPS mandatory	WebGPU silently disabled on HTTP	Serve over HTTPS; localhost is a secure context
Android WebGPU requires Chrome 121+ on Android 12+	~30% of Android users cannot use WebGPU	Auto-fallback to WASM with mobile-specific diagnostics
Google Advanced Protection disables WebGPU	Entirely blocks `navigator.gpu`	Detection in capability check; suggest WASM fallback
Hardware acceleration must be ON	Chrome setting often disabled for battery	Check `chrome://gpu` diagnostic
Device memory < 4 GB	Model may OOM or run very slowly	Show warning banner via `navigator.deviceMemory`
Battery drain	Inference drains mobile battery significantly	Warning banner when on mobile

What are the most common browser AI pitfalls?

HTTPS for WebGPU, Web Workers for WASM streaming, no double-caching of model files, and Turbopack incompatibility with Web Workers are the top production blockers.

HTTPS is non-negotiable for WebGPU. Serving over plain HTTP makes navigator.gpu undefined. Always check isSecureContext first.
Turbopack breaks Web Workers. new Worker(new URL('./file.ts', import.meta.url)) only works with webpack. The dev script must use next dev, not next dev --turbopack.
React batching kills streaming on WASM. When model.generate() blocks the main thread, React accumulates setState calls but never re-renders. Only a Web Worker fixes this: no amount of flushSync or requestAnimationFrame helps.
Do not double-cache model files. Transformers.js manages its own model cache in the browser Cache API. Skip HuggingFace URLs in the service worker to avoid double-caching ~850 MB of data.
Service Worker range requests fail. Large model files use HTTP range requests. The SW must bypass these URLs since Cache.put() throws TypeError on 206 Partial Content responses.
Progress callback flooding. Transformers.js fires download events very frequently. Throttle React state updates to at most one per 150 ms, with a trailing timer so the final value is not dropped, to prevent "Maximum update depth exceeded" errors.
beforeinstallprompt fires before React mounts. Capture it globally in a vanilla JS file (pwa.js) loaded via <script defer>, not in useEffect.
Disable thinking mode on small models. Qwen3.5's chain-of-thought reasoning wastes tokens on a 0.8B model. Pass enable_thinking: false explicitly.
Context windowing is essential. With 2048 max tokens and potentially long conversations, implement a sliding window (20 messages) with summarization of older messages.
Audio must be 16kHz. The feature extractor's sampling_rate config requires 16kHz. Create AudioContext with { sampleRate: 16000 } for automatic resampling via decodeAudioData.
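The context-windowing item in the list above can be illustrated with a small helper. `buildContext` and `ChatMessage` are hypothetical names, and the `summarize` callback is a placeholder: the text does not say how older messages are summarized, so that part is left to the caller.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep the most recent `windowSize` messages verbatim and replace
// everything older with a single summary message.
function buildContext(
  history: ChatMessage[],
  summarize: (older: ChatMessage[]) => string,
  windowSize = 20,
): ChatMessage[] {
  if (history.length <= windowSize) return history.slice();

  const older = history.slice(0, history.length - windowSize);
  const recent = history.slice(history.length - windowSize);

  const summary: ChatMessage = {
    role: "system",
    content: `Summary of earlier conversation: ${summarize(older)}`,
  };
  return [summary, ...recent];
}
```

The 20-message window only bounds the message count. Staying under the 2048-token limit would additionally need a token-level check, which is not shown here.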

What are the key lessons for production browser AI?

WebGPU plus Web Workers plus two-layer caching makes real-time offline LLM inference feasible. Static export avoids serverless limits entirely.

WebGPU is transformative for browser AI: 10-15× faster than WASM, making real-time LLM inference feasible on consumer hardware.
Web Workers are the only real solution for streaming: the main thread cannot yield during WASM inference, no matter what React trick you use.
Two-layer caching is required: service worker for app shell, Cache API for model files. Do not mix them.
HTTPS is the #1 gotcha: the default Transformers.js error message is misleading. Always check isSecureContext first.
Static export solves deployment: output: 'export' avoids the 250MB serverless function limit entirely.
Mobile WebGPU is real but fragile: Android 12+, Chrome 121+, hardware acceleration ON, no Advanced Protection. Each constraint needs explicit detection and user-friendly diagnostics.
Progress UX matters for 3.4 GB downloads: byte-level progress with smoothed speed and ETA prevents users from thinking the app is broken.

Frequently Asked Questions

How much faster is WebGPU than WASM for browser LLM inference?

WebGPU achieves 10–15× higher throughput than WASM backends: approximately 30–60 tokens/s on desktop and 15–25 tokens/s on mobile for sub-4B models, versus single-digit tokens/s on WASM.

How do you cache multi-gigabyte ONNX models in a PWA?

Use a custom Service Worker for the app shell (HTML/JS/CSS) and Transformers.js internal Cache API for ONNX weight files (~850 MB for Qwen3.5-0.8B, ~3.4 GB for Gemma 4 E2B). Caches persist across sessions and survive service worker updates.

Can you run multimodal models in the browser?

Yes. Gemma 4 E2B supports text, image (downscaled canvas), and audio (16 kHz PCM log-mel spectrogram) inputs via WebGPU with alternating local/global attention, achieving multimodal inference entirely client-side.

Related deep dives

Building an Evaluation Harness

210 Scenarios

Designing a deterministic test suite with regex, code execution, and LLM-as-judge to verify GGUF quantization quality across 210 scenarios in reasoning, coding, and multimodal tasks.

Serving Engine Internals

3 Engines · 10 Models

Source-level comparisons of vLLM, SGLang, and llama.cpp at production scale: quantization tradeoffs, memory behavior, cold starts, and when each engine wins on AWS EKS and Modal.

Cold Start Engineering

26m → 7m

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.