Yuvraj Garg is an AI Systems and Infrastructure Architect based in Bengaluru, India with 5 years of experience leading ML engineering at Styldod. He specializes in production GPU serving pipelines, multi-agent orchestration with LangGraph and MCP, cold start optimization, and cost reduction for large-scale LLM and vision model deployments.

What is Yuvraj Garg's biggest technical achievement?

Yuvraj architected REimagineHome.AI, scaling it from 0 to 2.1M+ users with 30M+ designs generated, while cutting hosting costs by 65%+ through self-hosted LLMs and VLMs on AWS EKS. He also built a DocumentAI pipeline achieving 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) processing 50K documents per day.

What is Yuvraj Garg's expertise in GPU infrastructure?

Yuvraj specializes in GPU cold start optimization (achieving 6.9x faster startup using memory snapshots on L40S and B200 GPUs), serving massive MoE models like GLM-5.1 (754B) on 8x B200 clusters, and eliminating JIT compilation overhead (DeepGEMM, FlashInfer) via persistent volume caches. He uses vLLM, SGLang, and llama.cpp in production.

What tech stack does Yuvraj Garg use?

Yuvraj's core stack includes vLLM, SGLang, llama.cpp for LLM serving; LangGraph, MCP for agentic orchestration; AWS EKS, KubeRay, Karpenter for GPU cluster management; PyTorch for model work; FastAPI for backend services; and Next.js/React for frontend. He also holds Red Hat certifications in OpenShift (EX280) and Ansible (EX407).

Is Yuvraj Garg available for hire?

Yes. Yuvraj Garg is open to senior and staff Machine Learning Engineer roles covering GPU serving, cold-start optimization, multi-agent frameworks (LangGraph/MCP), and system optimization. He is based in Bengaluru, India and is available for hybrid, remote, or relocation roles, as well as contract consulting. Contact: yuvraj97.ml@gmail.com

What is REimagineHome.AI?

REimagineHome.AI is an agentic virtual interior staging platform built and architected by Yuvraj Garg at Styldod. It uses a LangGraph + MCP multi-agent pipeline (planning, execution, and quality review agents) orchestrating 20+ tools. The platform scaled from 0 to 2.1M+ users generating 30M+ designs, with hosting costs cut by 65%+ through self-hosted LLMs and VLMs on AWS EKS.

How did Yuvraj Garg reduce LLM cold start time from 26 minutes to 7 minutes?

For the Qwen3.5-397B model on a 4x B200 Modal cluster, Yuvraj reduced cold start from 26 minutes to 7 minutes by caching FlashInfer JIT kernels via persistent volume symlinks. This eliminated repetitive JIT compilation on every container boot and unlocked scale-to-zero economics, saving 74% on GPU costs.

What certifications does Yuvraj Garg hold?

Yuvraj holds a MITx MicroMasters in Statistics and Data Science (Statistics, Probability, Machine Learning, Data Analysis), a DeepLearning.AI Deep Learning Specialization, and three Red Hat certifications: EX280 (OpenShift Specialist), EX407 (Ansible Specialist), and EX200 (RHCSA System Administrator).

ML Systems Architect · Bengaluru, India · Remote · Open to relocation

Hi, I'm Yuvraj Garg

AI Systems & GPU Infrastructure Engineer. Production ML, vLLM, KubeRay, LangGraph

I engineer production-scale AI infrastructure, low-latency GPU serving pipelines, and robust multi-agent orchestration systems.

I own the translation layer from research to codebases people pay for, driving model cost and cold starts down, debugging GPU memory limits when clusters break, and ensuring 99.9% reliability.


Proven Expertise: Styldod · 5y Lead · MITx MicroMasters · Red Hat Certified × 3 · 4 Products Shipped Solo

6.9×
Cold Start Reduction
FLUX.2-klein-9B GPU memory snapshots
18.6×
Cost Optimization
DocumentAI cluster vs AWS Bedrock
24×
LLM Boot Improvement
110s → 2-7s on Modal GPU layers
30+
Models In Production
Image, VLM, and agentic microservices

Business Impact & Achievements

Engineering metrics verified in production.

Instead of abstract scores, I benchmark real-world parameters: hosting costs, cold start seconds, parallel worker execution, and custom inference compilation.

Cost & Scale

~70% saved

In-house Model Hosting

Self-hosted 30+ models at Styldod rather than utilizing external APIs, handling millions of monthly inference queries at 1/3 the cost.

Styldod Core Infra

18.6× cheaper

vLLM on EKS Cluster

Built a self-hosted pipeline (EKS + KubeRay + vLLM) for DocumentAI, processing 50K docs/day at $0.025/doc vs $0.466 on AWS Bedrock.

AWS EKS / vLLM Deployment · 2026
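
The 18.6× figure follows from the two per-document prices quoted above. A minimal sketch of that arithmetic, assuming a steady 50K documents per day; the function and constant names are illustrative only and are not taken from the production pipeline:

```python
# Per-document prices and daily volume as quoted in the text above.
SELF_HOSTED_USD_PER_DOC = 0.025
BEDROCK_USD_PER_DOC = 0.466
DOCS_PER_DAY = 50_000


def cost_ratio(baseline: float, optimized: float) -> float:
    """Return how many times cheaper the optimized price is than the baseline."""
    return baseline / optimized


def daily_cost(usd_per_doc: float, docs_per_day: int) -> float:
    """Return the daily spend for a given per-document price and volume."""
    return usd_per_doc * docs_per_day


ratio = cost_ratio(BEDROCK_USD_PER_DOC, SELF_HOSTED_USD_PER_DOC)
print(f"{ratio:.1f}x cheaper per document")
print(f"self-hosted: ${daily_cost(SELF_HOSTED_USD_PER_DOC, DOCS_PER_DAY):,.0f}/day")
print(f"Bedrock:     ${daily_cost(BEDROCK_USD_PER_DOC, DOCS_PER_DAY):,.0f}/day")
```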

40%+ saved

Serverless GPU Orchestration

Eliminated always-on GPU standby instances by migrating image pipelines to AWS EKS. Used scale-to-zero during idle hours to cut vendor costs by 40%+.

Vendor Migration · 2025

Speed & Optimization

48s → 7s

FLUX.2 Memory Snapshots

Achieved 6.9× faster startup for image generation models on L40S GPUs using serverless runtime memory checkpointing.

Modal GPU Snapshots · Feb 2026

5m → 1.5m

DreamFurniture Inference

Architected a 6-model 2D→3D pipeline combining depth maps and layout alignment, using parallel workers and shared memory GLB transfers to cut latency by 70%.

Styldod 3D R&D · 2024

26m → 7m Boot

Qwen3.5-397B on Modal

Reduced cold start from 26m to 7m by caching FlashInfer JIT kernels via persistent volume symlinks, unlocking scale-to-zero to save 74% on 4× B200 GPUs.

4× B200 Cluster · May 2026
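
The symlink technique described above can be sketched as follows. This is a generic illustration, not the production code: the idea is that the directory where a JIT compiler writes its kernels is replaced by a symlink into a persistent volume, so a fresh container finds the kernels already compiled. The helper name and the example paths in the comments are assumptions; the actual cache location depends on the library version and configuration.

```python
import shutil
from pathlib import Path


def link_cache_to_volume(local_cache: Path, volume_cache: Path) -> Path:
    """Replace local_cache with a symlink to volume_cache.

    Anything already compiled into local_cache is moved onto the volume first,
    so no work is lost on the boot that creates the link.
    """
    volume_cache.mkdir(parents=True, exist_ok=True)

    if local_cache.is_symlink():
        if local_cache.resolve() == volume_cache.resolve():
            return local_cache  # already linked on a previous boot
        local_cache.unlink()  # stale link pointing somewhere else
    elif local_cache.exists():
        # Preserve kernels compiled before the link existed.
        for item in local_cache.iterdir():
            target = volume_cache / item.name
            if not target.exists():
                shutil.move(str(item), str(target))
        shutil.rmtree(local_cache)

    local_cache.parent.mkdir(parents=True, exist_ok=True)
    local_cache.symlink_to(volume_cache, target_is_directory=True)
    return local_cache


# Hypothetical usage at container start, before the serving engine is imported:
#   link_cache_to_volume(Path.home() / ".cache" / "flashinfer",
#                        Path("/vol/jit-cache/flashinfer"))
# Both paths are assumptions; check where your library actually writes kernels.
```

Running the helper before the engine starts matters: if the engine creates the cache directory first, the first boot still compiles everything, and only later boots benefit.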

Systems Engineering

754B MoE (TP=8)

GLM-5.1-FP8 on 8× B200

Hardened a $50/hr serving cluster by building background crash monitors, log streams, and automated container recycling. Cached DeepGEMM kernels to bypass 15-minute boot JIT compilation.

8× B200 cluster · 2026
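
A crash monitor with automated recycling, as described above, can be approximated by a small watchdog loop. The sketch below is a simplified assumption of how such a monitor might look: `is_healthy` and `recycle` are hypothetical callbacks (for example, an HTTP probe against the serving endpoint and a container restart), and the thresholds are placeholders rather than the values used in production.

```python
import time
from typing import Callable, Optional


def watchdog(
    is_healthy: Callable[[], bool],
    recycle: Callable[[], None],
    max_failures: int = 3,
    interval_s: float = 10.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_healthy and call recycle after max_failures consecutive failures.

    Returns the number of recycles performed. max_checks bounds the loop for
    testing; leave it as None to run until the process is stopped.
    """
    failures = 0
    recycles = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = is_healthy()
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        if ok:
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                recycle()
                recycles += 1
                failures = 0  # give the fresh container a clean slate
        sleep(interval_s)
    return recycles
```

Requiring several consecutive failures before recycling avoids restarting an expensive cluster because of a single slow health probe.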

90% Latency Cut

Custom Logits Processor

Developed a custom token budget controller in llama-cpp-python to dynamically terminate internal reasoning chains once a limit is met, cutting latency and token costs by up to 90%.

llama-cpp-python PR · 2026
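
A token budget controller of this kind can be sketched as a logits processor: a callable that receives the token IDs so far plus the next-token scores, and returns modified scores. The version below is an illustrative assumption, not the code from the PR. The token IDs for the reasoning start and end markers are hypothetical and model-specific, and the class is written against plain sequences so it runs without any dependency.

```python
from typing import MutableSequence, Sequence

NEG_INF = float("-inf")


class ThinkingBudgetProcessor:
    """Force the end-of-reasoning token once the reasoning span hits a budget."""

    def __init__(self, think_start_id: int, think_end_id: int, budget: int):
        self.think_start_id = think_start_id  # hypothetical, model-specific ID
        self.think_end_id = think_end_id      # hypothetical, model-specific ID
        self.budget = budget

    def __call__(
        self, input_ids: Sequence[int], scores: MutableSequence[float]
    ) -> MutableSequence[float]:
        ids = list(input_ids)
        if self.think_start_id not in ids:
            return scores  # the model has not started reasoning
        # Position of the most recent reasoning-start marker.
        start = len(ids) - 1 - ids[::-1].index(self.think_start_id)
        reasoning = ids[start + 1:]
        if self.think_end_id in reasoning:
            return scores  # reasoning already closed, leave the answer alone
        if len(reasoning) >= self.budget:
            # Mask every token except the end marker so it is the only choice.
            scores[:] = [NEG_INF] * len(scores)
            scores[self.think_end_id] = 0.0
        return scores
```

In llama-cpp-python, a callable with this shape can be wrapped in `LogitsProcessorList([...])` and passed through the `logits_processor` argument of a completion call, where `input_ids` and `scores` arrive as NumPy arrays; the slice assignment above works for those as well as for plain lists.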

93% Eval Accuracy

GGUF Quality Benchmark

Developed an automated test suite running 210 deterministic evaluations (via code execution and regex) to verify that GGUF quantization retained 93% overall accuracy.

Model Evaluation Harness · Feb 2026
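
The two deterministic check types mentioned above, regex matching and code execution, can be sketched as small checker factories plus a scoring loop. This is a minimal illustration under assumptions: the `Case` structure, the helper names, and the sample cases are hypothetical, and real use would run model-generated code in a sandbox rather than with a bare `exec`.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    name: str
    check: Callable[[str], bool]  # returns True when the model output passes


def regex_check(pattern: str) -> Callable[[str], bool]:
    """Pass when the pattern is found anywhere in the model output."""
    compiled = re.compile(pattern)
    return lambda output: compiled.search(output) is not None


def code_check(test_code: str) -> Callable[[str], bool]:
    """Pass when the model's code runs and the test assertions hold."""
    def check(output: str) -> bool:
        namespace: dict = {}
        try:
            exec(output, namespace)     # NOTE: sandbox this in real use
            exec(test_code, namespace)  # assertions against the defined names
        except Exception:
            return False
        return True
    return check


def run_suite(cases: List[Case], outputs: Dict[str, str]) -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case.check(outputs.get(case.name, "")))
    return passed / len(cases)
```

Because every check is a pure function of the output string, the same outputs always produce the same score, which is what makes a before-and-after quantization comparison meaningful.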

Experience

Professional history & engineering approach.

Lead Machine Learning Engineer

Styldod

Visual automation and 3D orchestration platform for real estate

Oct 2021 to May 2026

Bengaluru, India · 5 years

Led the AI research and engineering team on computer vision, multi-cluster distributed GPU training, distributed GPU inference, 3D scene reconstruction, and agentic workflows.
Architected REimagineHome.AI: Built a LangGraph + MCP multi-agent virtual staging platform (planning, execution, quality review) and scaled it from 0 to 2.1M+ users (30M+ designs); slashed hosting costs by >65% via self-hosted LLMs/VLMs on AWS EKS.
Architected 3D scene reconstruction platform (DreamFurniture): Researched 15+ SOTA models to design and build a 6-model 2D→3D pipeline (integrating camera calibration, depth estimation, and layout alignment), reducing processing latency to ~1.5 min.
Drove 70% hosting cost reductions: Replaced external API reliance by self-hosting models for image generation, segmentation, enhancement, classification, depth estimation, layout detection, and camera calibration on AWS.
Serverless GPU Orchestration: Eliminated always-on GPU standby instances by moving the inference pipelines to AWS EKS, saving 40%+ in vendor costs via scale-to-zero.
Boosted organic search traffic: Engineered automated content engines, elevating non-blog landing pages from page ~40 to the top 10 and securing an average rank of 4th for generated blog posts.

Computer Vision · PyTorch · LangGraph · Model Context Protocol (MCP) · vLLM · SGLang · KubeRay · AWS EKS · Karpenter · Ray Serve · FastAPI · Docker · 3D Scene Reconstruction · Point Cloud Alignment · Camera Calibration · Depth Estimation · Image Segmentation · Image Enhancement · VLMs · WebSockets · Redis

Research Engineer

RAx Labs

Academic research search & summarization engine

Jul 2021 to Oct 2021

Gandhinagar, India

Built NLP extraction pipelines: Generated summaries of newly published research papers and broadcast updates to active users via microservices.
Configured Docker & AWS ECS: Deployed backend instances with Datadog, Prometheus, and Grafana monitoring stacks.
Optimized Elasticsearch databases: Sped up document searches across indexes under high concurrency.

NLP · Docker · AWS ECS · Elasticsearch · Prometheus · Grafana · Python

Core Engineering Values

Plan Before Coding

Draft specifications, dependencies, and baseline benchmarks first. Compare models on price, latency, and quality, never assumptions.

Manage Compute Wisely

Profile cold starts and measure hosting dollars. Actively migrating workloads off unoptimized vendors saves significant capital.

Ship Proofs Rapidly

Build operational end-to-end prototypes immediately (e.g. 12 days for budget-tracked chatbots) to validate architecture before scaling.

Own the Full Stack

Solve blockages wherever they occur, writing Streamlit/React tools, configuring API Gateways, or optimizing CDN caching to deliver results.

Projects

Systems built for scale and active usage.

Production-ready tools, self-hosted orchestrations, and cloud stacks serving real requests, not academic homework.

Personal Project

2022 to 2025

ScaleWaveAI

AIaaS · Cloud Private GPU Clusters · B2B

B2B AIaaS orchestration platform providing private, isolated GPU clusters with scale-to-zero economics to eliminate idle compute costs. Features one-click model deployment, Stripe/Razorpay transactional billing, and data compliance isolation for enterprise tenants, capable of serving high-volume vision and training workloads.

AIaaS · KubeRay · AWS EKS · Karpenter · Ray Serve · SaaS Billing · Stripe & Razorpay · Multi-tenant Isolation · Distributed Training · B2B

Personal Project

2025 to 2026

KyoudaiClub

SaaS · LLM Roleplay Chat & PVP Quizzes · B2C

Built a gamified anime roleplay & trivia platform. Powered by a FastAPI + Beanie ODM (MongoDB) backend. Uses LangGraph to orchestrate character chat, PvP duels, and context compression. Features a WebSocket PvP engine, AniList watch-history integration, and dynamic GCS-hosted avatar generation.

FastAPI · Beanie ODM · MongoDB · LangGraph · LangChain · WebSockets · Self-Hosted LLMs · Google Cloud Storage · Next.js · B2C

Personal Project

2021 to Present

QuantML

Learn ML in the browser · Deploy Guides

Interactive ML education platform. Simulates PCA and backprop training (Word2Vec, RNN), with the math rendered in KaTeX. Features serverless playbooks and fully offline client-side WebGPU inference that runs LLMs inside browser workers, with PWA IndexedDB offline queueing and alternate markdown endpoints for LLM crawler search indexing.

WebGPU · Edge AI · llama.cpp · vLLM · SGLang · Next.js · MongoDB · KaTeX · PWA · IndexedDB · LLM Crawler Routing

Personal Project

Early 2026

Overheard

Live Audience WebSocket Chat

Built a public real-time digital fishbowl chat platform with a custom Node.js Socket.io server. Features spectator modes, a live hype meter, and a 4-stage moderation cascade combining browser-side ML (cached toxic-bert Web Workers) for zero-latency pre-flight blocks with server-side regex sanitization and async LLM quarantine verification.

Socket.io · Node.js · Web Workers · Edge ML (toxic-bert) · Firebase Auth · Web Push (VAPID) · Framer Motion · PWA · Real-Time Systems

Client Project · Enterprise

2026

DocumentAI

Hybrid Document Parsing Infrastructure

Built a hybrid platform that processes 1,000-page PDFs in under 5 minutes, using Step Functions Distributed Map chunking to bypass 128K token limits. Deployed KubeRay + vLLM (Qwen3.6-35B-A3B FP8) on EKS, autoscaling Spot GPUs with Karpenter and KEDA. Achieved 18.6x cost savings ($0.025/doc vs $0.466 on AWS Bedrock) at a capacity of 50K documents/day.

Architecture documented. Complete report available for review.

vLLM · AWS EKS · KubeRay · Karpenter · KEDA · Step Functions · Distributed Map · Hybrid Routing · AWS CDK · vLLM Metrics · PII Redaction · MLOps
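
The chunking step that feeds a Distributed Map can be sketched as splitting a long document into page ranges, each small enough to stay under the model's context limit, with one map item per range. The helper below is a minimal illustration; the function name and the chunk size of 40 pages are assumptions, not values from the production system.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or pages_per_chunk <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk must be > 0")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


# Hypothetical chunk size: a 1,000-page PDF becomes 25 independent work items.
chunks = page_chunks(1000, 40)
print(len(chunks), chunks[0], chunks[-1])
```

Each range can then be processed in parallel and the per-chunk results merged in page order, which is what keeps total latency close to the time for a single chunk rather than the whole document.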

Styldod (Work)

2022 to 2026

REimagineHome.AI

Agentic Interior Design Platform

Architected and led the ML lifecycle of a virtual staging platform from 0 to 2.1M+ users (30M+ designs, 70% download rate). Built a LangGraph + MCP multi-agent workflow (planning, execution, and quality review agents) orchestrating 20+ tools. Cut hosting costs by >65% by migrating from managed APIs to self-hosted LLMs/VLMs.

Agentic AI · Computer Vision · LLM · VLM · Memory Management · MCP · 3D Vision · LangGraph · EKS · KubeRay · FastAPI · ECS · Lambda · GPU Snapshotting · FSx for Lustre · B2C
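
The planning, execution, and quality review split described above can be illustrated with a dependency-free control loop. This is deliberately plain Python rather than LangGraph or MCP code, and every name in it (the tool names, the planner rule, the review rule) is hypothetical; it only shows the shape of the loop, in which a failed review triggers another attempt.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def plan(request: str) -> List[str]:
    """Hypothetical planner: map a request to an ordered list of tool names."""
    steps = ["segment_room"]
    if "furniture" in request:
        steps.append("place_furniture")
    steps.append("render")
    return steps


def execute(steps: List[str], tools: Dict[str, Tool], request: str) -> List[str]:
    """Run each planned tool in order and collect its result."""
    return [tools[step](request) for step in steps]


def review(results: List[str]) -> bool:
    """Hypothetical quality gate: reject empty results or reported errors."""
    return all(r and not r.startswith("error") for r in results)


def run_pipeline(request: str, tools: Dict[str, Tool], max_attempts: int = 2) -> dict:
    """Plan, execute, and review, retrying until approved or attempts run out."""
    results: List[str] = []
    for attempt in range(1, max_attempts + 1):
        results = execute(plan(request), tools, request)
        if review(results):
            return {"approved": True, "attempts": attempt, "results": results}
    return {"approved": False, "attempts": max_attempts, "results": results}
```

In a graph framework, the same structure becomes three nodes with a conditional edge from review back to planning, and the tools become remote calls rather than local functions.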

Deep Dives

Case studies & source audit trails.

Comprehensive writeups with optimization logs, profiling, and direct documentation from production environments.

Hardening a $50/hr GPU Cluster

754B on 8× B200 · Reliability

Building crash monitors, DeepGEMM/FlashInfer kernel caches, and automated container recycling to keep 754B MoE and 397B FP8 clusters alive on Modal 8× and 4× B200 at $50/hr and $25/hr.


Cold Start Engineering

26m → 7m · Inference

How GPU memory snapshots, CRIU checkpointing, JIT kernel caching, and volume symlinks cut cold starts from 26 minutes to 7 minutes on Modal B200 clusters. Production-verified across vLLM, SGLang, and llama.cpp.


Building an Evaluation Harness

210 Scenarios · Quality

Designing a deterministic test suite with regex, code execution, and LLM-as-judge to verify GGUF quantization quality across 210 scenarios in reasoning, coding, and multimodal tasks.


Serving Engine Internals

3 Engines · 10 Models · Inference

Source-level comparisons of vLLM, SGLang, and llama.cpp at production scale: quantization tradeoffs, memory behavior, cold starts, and when each engine wins on AWS EKS and Modal.


Browser AI: Offline WebGPU

12× vs WASM · Edge AI

Deploying Gemma-4 and Qwen3.5 entirely client-side with WebGPU, running 10–15× faster than WASM, with two-layer PWA caching and multimodal inference at zero server cost.


Image Generation at Scale

48s → 7s · FLUX.2 · Diffusion

Memory snapshots, async GPU transfers, and pipeline pre-loading for production FLUX.2 image generation on serverless L40S GPUs, with 6.9× cold start reduction verified in production.


Systems Debugging

36 bugs · 7.5× regression · Debugging

Flash attention investigation (7.5× regression), 36 bugs in GLM-5.1 code review, concurrency cliff discovery, and GPU snapshot SIGSEGV debugging on production Modal clusters.


Skills & Credentials

Tooling, certifications, & education.

Technical Competencies

AI & ML

PyTorch · vLLM · SGLang · llama.cpp · Ray Serve · Ray Train · Ray Data · LangGraph · LangChain · LiteLLM · MCP · Transformers · Transformers.js · LLM · VLM · Machine Learning · Computer Vision · Natural Language Processing

• Diffusion Models • Distributed Training • Distributed Inference • LLM/VLM evaluation framework • Langsmith • Langfuse • Multi-modal Models • Model Quantization • 3D Reconstruction • Image Segmentation • Depth Estimation • Camera Calibration • PCA Projections • Backpropagation Profiling • Speculative Decoding • KV Cache Control

Infrastructure & MLOps

AWS EKS · KubeRay · AWS Karpenter · Docker · FastAPI · Redis · Lambda/SQS/S3 · Kubernetes · Prometheus · Grafana · Datadog · OpenTelemetry · AWS ECS

• GPU Snapshots (CRIU) • Cold Start Optimization • DeepGEMM Compilation • System Health Monitoring • AWS CloudWatch • AWS X-Ray • AWS FSx for Lustre • Scale-to-Zero clustering • IndexedDB browser cache • PWA Service Workers • Offline Sync Queues

Full Stack & Core Eng

Python · TypeScript · Next.js · React · WebGPU/WASM · WebSockets · REST APIs · Git · MongoDB Atlas · Mongoose ORM · Stripe API · Razorpay API · Firebase Admin SDK

• High-Concurrency Broadcast • Linux Admin • Ansible Automations • MDX System Layouts • KaTeX (LaTeX rendering) • NextAuth.js (OAuth) • ACID Database Transactions • Exponential Backoff Retries • High-Precision Decimal Billing • Web Push Notifications • IP Rate Limiting

Leadership & Strategy

Team Leadership (R&D) · Product Architecture · Compute Cost Modeling · Technical Writing · B2B Product Management · Client Engagement

• Growth Engineering (SEO) • Vendor Benchmarking • Project Scoping • Agile Execution • AI Engine Optimization (AEO) • RBAC (Access Control)

Education & Credentials

MITx MicroMasters

2021 - 2023

Statistics and Data Science

DeepLearning.AI

Specialization

Deep Learning Specialization

Red Hat Linux

ID: 180-142-705

Enterprise Systems & Automation

CS Foundations

Coursera

Algorithms & Data Structures

Work Focus & Engagements

I thrive on small, high-autonomy teams shipping products where speed and cost metrics guide architecture decisions.

Open to senior/staff MLE roles covering GPU serving, cold-starts, multi-agent frameworks (LangGraph/MCP), and system optimization.

Based in Bengaluru, India. Open to hybrid, remote, or relocation roles. Available for contract consulting and full-time positions.

Let's Build Together

Review my infrastructure reports or query availability via direct mail.

yuvraj97.ml@gmail.com · Connect on LinkedIn · Explore my Code · @yuvraj97_ml on X