Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India, with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, llama.cpp for LLM serving; LangGraph, MCP for agentic orchestration; AWS EKS, KubeRay, Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.
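
A minimal sketch of the symlink idea, assuming a persistent volume is already mounted inside the container. The function name `link_cache_to_volume` and both directory arguments are hypothetical; the source does not give the actual FlashInfer cache location or the Modal volume mount point, so they are left as parameters:

```python
import shutil
from pathlib import Path

def link_cache_to_volume(cache_dir: str, volume_dir: str) -> Path:
    """Replace cache_dir with a symlink to volume_dir so compiled
    kernels written there survive container restarts."""
    cache = Path(cache_dir)
    target = Path(volume_dir)
    target.mkdir(parents=True, exist_ok=True)

    if cache.is_symlink():
        if cache.resolve() == target.resolve():
            return cache  # already linked on a previous boot
        cache.unlink()
    elif cache.exists():
        # Keep anything compiled before linking, then drop the local directory.
        for item in cache.iterdir():
            dest = target / item.name
            if not dest.exists():
                shutil.move(str(item), str(dest))
        shutil.rmtree(cache)

    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.symlink_to(target, target_is_directory=True)
    return cache
```

Calling this once at container start, before the serving engine imports its kernels, means the first boot pays the compilation cost and later boots read the cached artifacts from the volume.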

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

Building an LLM Evaluation Harness: 210 Deterministic Scenarios

How do you verify GGUF quantization quality without manual testing?

Use a deterministic harness with 210 scenarios combining regex checks, sandboxed code execution, and LLM-as-judge scoring. This catches regressions introduced when quantizing models to GGUF; in validated runs, the quantized model retained ~93% accuracy.

✨TL;DR Summary

Triple-Layer Evaluation: Built a harness running 210 scenarios scoring GGUF models via Regex Extraction, Sandboxed Subprocess Code Execution, and async LLM-as-a-Judge (Claude).
Custom Logits Processor: Designed a 3-State Finite State Machine (FSM) Logits Processor that hard-caps the "thinking" token budget by setting every logit except the closing </think> token to -inf.
Latency & Accuracy: Hard-capping reasoning tokens at 50 reduced execution latency by up to 90% (logic puzzles dropped from 3.8s to 1.1s, about 71%) while retaining 96.2% accuracy on deterministic logic puzzles.

Quantizing a model (such as GLM-4.7-Flash to GGUF) reduces its memory footprint but can degrade reasoning, syntax handling, and function calling. To verify that quantization had not damaged the capabilities our deployment depends on, we built a custom, deterministic evaluation harness running 210 specific test scenarios.

How do you verify GGUF quantization quality without manual testing?

Use a deterministic harness with regex checks, sandboxed code execution, and LLM-as-judge scoring across 210 scenarios, enough to catch quantization regressions at scale. The harness tests the model across logic, coding, and reasoning categories. To ensure evaluation is fast and automated, we designed three scoring mechanisms:

Evaluation Method	Test Target	Implementation	Metric Tracked
Regex Extraction	Structured JSON & Math	Strict regex pattern matching	Format compliance
Code Execution	Algorithm & Syntax correctness	Subprocess shell execution in a sandbox	Run-time compile and output match
LLM-as-a-Judge	Open-ended responses	Asynchronous grading via Claude-3.5-Sonnet	Semantic accuracy
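
The regex layer in the first row can be sketched as follows. This is an illustration under assumptions, not the harness's actual patterns: the helper names, the "Answer: <number>" convention, and the required-key check are all hypothetical.

```python
import json
import re

# Grabs everything from the first "{" to the last "}" in the output.
JSON_BLOCK = re.compile(r"\{.*\}", re.DOTALL)
# Assumed convention: the model states its result as "Answer: <number>".
FINAL_NUMBER = re.compile(r"(?i)answer\s*[:=]\s*(-?\d+(?:\.\d+)?)")

def score_json_format(output: str, required_keys: set) -> bool:
    """Pass only if the output contains parseable JSON with all required keys."""
    match = JSON_BLOCK.search(output)
    if match is None:
        return False
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload.keys())

def score_math_answer(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass only if the last stated answer matches the expected value."""
    matches = FINAL_NUMBER.findall(output)
    if not matches:
        return False
    return abs(float(matches[-1]) - expected) <= tol
```

Both checks are pure functions of the model output, so repeated runs on the same output always produce the same score.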

How does sandboxed code execution catch quantization regressions?

Parse generated code blocks, run them in a timeout-restricted subprocess, and fail the scenario on non-zero exit codes or timeouts. For coding tests, the model is asked to write an algorithm. The harness parses the generated code block, writes it to a temporary file, and runs it in a subprocess with a strict wall-clock timeout:

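import os
import subprocess
import sys
import tempfile

def run_sandboxed_code(code_str: str, timeout_sec: float = 2.0) -> bool:
    """Return True if the generated code exits with status 0 within the timeout."""
    # delete=False so the child process can open the file after we close it.
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
        f.write(code_str.encode("utf-8"))
        f_name = f.name

    try:
        # Run with the current interpreter. The only limit enforced here
        # is the wall-clock timeout.
        result = subprocess.run(
            [sys.executable, f_name],
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        # Always remove the temporary file, even on timeout.
        os.unlink(f_name)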
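
The parsing step and the "output match" metric from the table are not shown in the snippet above. A minimal sketch of both, assuming the model wraps its answer in a Markdown-style fenced block; `extract_code_block` and `passes_with_output` are hypothetical helper names, not part of the original harness:

```python
import re
import subprocess
import sys
from typing import Optional

FENCE_MARK = "`" * 3  # three backticks, built indirectly to keep this listing readable
FENCE = re.compile(
    FENCE_MARK + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE_MARK,
    re.DOTALL | re.IGNORECASE,
)

def extract_code_block(model_output: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if there is none."""
    match = FENCE.search(model_output)
    if match is None:
        return None
    return match.group(1).strip("\n") + "\n"

def passes_with_output(code: str, expected_stdout: str, timeout_sec: float = 2.0) -> bool:
    """Pass only if the code exits cleanly and prints the expected text."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()
```

As with the function above, a timeout-limited subprocess does not isolate the filesystem or network, so untrusted generations would need an additional isolation layer such as a container.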

How do you cap internal reasoning tokens during evaluation?

Use a 3-state logits processor FSM that forces the closing </think> token once a token budget is exceeded, cutting latency up to 90% with minimal accuracy loss. During evaluation, we observed that GLM-4.7-Flash-GGUF spent up to 80% of its generation time on internal reasoning (wrapped inside <think>...</think> blocks) before writing a simple answer. For latency-sensitive APIs, this is unacceptable.

Since llama.cpp had no native parameter to stop thinking at a precise token threshold, we wrote a custom 3-State Logits Processor inside our Python serving handler.

The 3-State Finite State Machine (FSM)

The logits processor intercepts the logits produced by each forward pass, before the next token is sampled:

COUNTING: The processor counts decoding steps. While the model is still inside the thinking channel, each step increments the token count.
FORCING: Once the token budget (e.g., 50 tokens) is hit, the processor overrides the model's logits, setting the logit of the closing tag </think> to 0.0 and every other token to -inf. After softmax the closing tag has probability 1, forcing the model to exit its thinking loop.
DONE: Once the closing tag is emitted, the processor steps out of the way, returning the model to regular generation.

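import torch

class ThinkingBudgetLogitsProcessor:
    """Forces the closing </think> token once the thinking budget is spent.

    Assumes 1-D input_ids and scores (one sequence at a time).
    """

    def __init__(self, budget: int, think_end_token_id: int):
        self.budget = budget
        self.think_end_id = think_end_token_id
        self.tokens_seen = 0
        self.state = "COUNTING"  # COUNTING -> FORCING -> DONE

    def __call__(self, input_ids, scores):
        if self.state == "DONE":
            return scores

        # If the model closed its thinking block on its own, stop intervening.
        if len(input_ids) > 0 and int(input_ids[-1]) == self.think_end_id:
            self.state = "DONE"
            return scores

        # One call per decoding step, so this counts thinking tokens.
        self.tokens_seen += 1

        if self.state == "COUNTING" and self.tokens_seen >= self.budget:
            self.state = "FORCING"

        if self.state == "FORCING":
            # Mask every token except the end-of-thought token.
            new_scores = torch.full_like(scores, float("-inf"))
            new_scores[self.think_end_id] = 0.0
            self.state = "DONE"
            return new_scores

        return scores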
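
Why the FORCING step leaves the sampler no choice: softmax maps a logit of -inf to probability 0, so the single finite logit takes all of the probability mass. A dependency-free sketch with a hypothetical 4-token vocabulary (the logit values are made up for illustration, and token index 2 stands in for </think>):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def force_token(logits, token_id):
    """Mask every logit to -inf except the one we want to force."""
    forced = [float("-inf")] * len(logits)
    forced[token_id] = 0.0
    return forced

logits = [2.3, -1.0, 0.7, 5.1]  # made-up model logits
probs = softmax(force_token(logits, token_id=2))
print(probs)  # [0.0, 0.0, 1.0, 0.0]
```

Because the masked tokens have exactly zero probability, the result is the same under greedy decoding and under sampling at any temperature.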

How much accuracy is retained with a capped thinking budget?

With a 50-token thinking cap, logic puzzles retained 96.2% accuracy at 1.1s, versus 100% at 3.8s with unlimited thinking. Formatting tasks, run with thinking disabled, stayed above 98%. In the table below, the cap cut latency by about 71% for both logic puzzles and coding algorithms while costing at most 3.8 points of accuracy; reductions of up to 90% were observed on short, deterministic queries.

Test Class	Thinking Budget	Execution Latency	Accuracy Score
Logic Puzzles	Unlimited	3.8s	100.0%
Logic Puzzles	50 Tokens	1.1s	96.2%
Coding Algorithms	Unlimited	4.5s	82.4%
Coding Algorithms	50 Tokens	1.3s	81.9%
Formatting / JSON	0 (Disabled)	0.4s	98.1%
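
The trade-off in the table can be checked with a few lines of arithmetic. The numbers below are copied from the rows above; the `summarize` helper is only an illustration:

```python
# (unlimited latency s, capped latency s, unlimited accuracy %, capped accuracy %)
rows = {
    "Logic Puzzles": (3.8, 1.1, 100.0, 96.2),
    "Coding Algorithms": (4.5, 1.3, 82.4, 81.9),
}

def summarize(unlimited_s, capped_s, unlimited_acc, capped_acc):
    """Return (latency reduction in %, accuracy drop in points)."""
    latency_reduction = (unlimited_s - capped_s) / unlimited_s * 100
    accuracy_drop = unlimited_acc - capped_acc
    return round(latency_reduction, 1), round(accuracy_drop, 1)

summary = {name: summarize(*values) for name, values in rows.items()}
print(summary)
# {'Logic Puzzles': (71.1, 3.8), 'Coding Algorithms': (71.1, 0.5)}
```

Both task classes see roughly the same relative latency reduction, but the accuracy cost is much smaller for coding than for logic puzzles.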

What makes an LLM evaluation harness production-ready?

A production-ready harness combines deterministic sandboxes for code, regex checks for structured output, dynamic thinking budgets, and chat-template verification for GGUF special tokens.

Deterministic sandboxes catch syntax errors: Code execution checks found that some quantized models generated invalidly indented Python blocks, which we attribute to quantization noise affecting whitespace tokens.
Reasoning tokens are not always necessary: Capping the thinking budget dynamically allows us to trade compute for speed on a per-request basis.
Chat templates must be verified: GGUF models are highly sensitive to missing special tokens like <|im_start|>. The harness tests verified these were correctly formatted.
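
A minimal sketch of the template check described in the last point, assuming the harness can obtain the fully rendered prompt as a string. The function name `find_template_problems` is hypothetical, and the `<|im_end|>` token in the usage example is an assumption based on the ChatML convention; substitute whatever special tokens the model's own template defines:

```python
def find_template_problems(rendered_prompt, required_tokens, must_end_with):
    """Return a list of human-readable problems; an empty list means the prompt passed."""
    problems = [
        f"missing special token: {token}"
        for token in required_tokens
        if token not in rendered_prompt
    ]
    # The prompt should end by opening the assistant turn, ignoring trailing whitespace.
    if not rendered_prompt.rstrip().endswith(must_end_with.rstrip()):
        problems.append(f"prompt does not end with: {must_end_with!r}")
    return problems

good = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
bad = "user\nHi\nassistant\n"
tokens = ["<|im_start|>", "<|im_end|>"]

print(find_template_problems(good, tokens, "<|im_start|>assistant\n"))  # []
print(len(find_template_problems(bad, tokens, "<|im_start|>assistant\n")))  # 3
```

Running a check like this before scoring separates template failures from genuine quality regressions.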

Source: §3 (GLM-4.7-Flash-GGUF).

Frequently Asked Questions

How many test scenarios should an LLM evaluation harness include?: A production harness should cover diverse failure modes: this implementation uses 210 deterministic scenarios across reasoning, coding, instruction-following, and multimodal tasks, enough to catch quantization regressions without exhaustive manual review.
What evaluation methods work best for quantized GGUF models?: Combine regex-based output checks for structured tasks, sandboxed code execution for programming scenarios, and LLM-as-judge for open-ended quality. Use deterministic methods first; judge models only where necessary to control cost and variance.
How much accuracy is retained after GGUF quantization?: With careful quantization (q4_K_M and similar), validated harness runs show approximately 93% accuracy retained versus FP16 baselines on the 210-scenario suite, with failures clustered in specific task types identifiable via per-category reporting.
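
Per-category reporting of the kind mentioned in the last answer can be sketched as below. The result format (a list of category/pass pairs), the helper names, the 5-point threshold, and the sample data are all hypothetical:

```python
from collections import defaultdict

def per_category_accuracy(results):
    """results: iterable of (category, passed) pairs. Returns {category: accuracy}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}

def flag_regressions(baseline, quantized, max_drop=0.05):
    """List categories where the quantized model lost more than max_drop accuracy."""
    return sorted(
        category
        for category in baseline
        if baseline[category] - quantized.get(category, 0.0) > max_drop
    )

# Hypothetical runs: the quantized model fails one extra coding scenario.
fp16 = per_category_accuracy([("coding", True), ("coding", True), ("logic", True)])
gguf = per_category_accuracy([("coding", True), ("coding", False), ("logic", True)])
print(flag_regressions(fp16, gguf))  # ['coding']
```

Comparing categories rather than a single aggregate score makes it visible when a small overall drop is concentrated in one task type.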

Related deep dives

Serving Engine Internals

3 Engines · 10 Models

Source-level comparisons of vLLM, SGLang, and llama.cpp at production scale: quantization tradeoffs, memory behavior, cold starts, and when each engine wins on AWS EKS and Modal.

Cold Start Engineering

26m → 7m

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.

Browser AI: Offline WebGPU

12× vs WASM

Deploying Gemma-4 and Qwen3.5 entirely client-side with WebGPU at 10–15× faster than WASM, with two-layer PWA caching and multimodal inference at zero server cost.