ML Systems Architect
Bengaluru, India · Remote · Open to relocation

Hi, I'm Yuvraj Garg

I engineer production-scale AI infrastructure, low-latency GPU serving pipelines, and robust multi-agent orchestration systems.

I own the path from research to production code that customers pay for: driving down model cost and cold starts, debugging GPU memory limits when clusters break, and maintaining 99.9% reliability.

Proven Expertise: Styldod · 5y Lead · MITx MicroMasters · Red Hat Certified × 3 · 4 Products Shipped Solo
6.9×
Cold Start Reduction
FLUX.2-klein-9B GPU memory snapshots
18.6×
Cost Optimization
DocumentAI cluster vs AWS Bedrock
24×
LLM Boot Improvement
110s → 2-7s on Modal GPU layers
30+
Models In Production
Image, VLM, and agentic microservices
Business Impact & Achievements

Engineering metrics verified in production.

Instead of abstract scores, I benchmark real-world parameters: hosting costs, cold start seconds, parallel worker execution, and custom inference compilation.

Cost & Scale

~70% saved
In-house Model Hosting

Self-hosted 30+ models at Styldod rather than relying on external APIs, handling millions of monthly inference queries at roughly one-third of the cost.

Styldod Core Infra
18.6× cheaper
vLLM on EKS Cluster

Built a self-hosted pipeline (EKS + KubeRay + vLLM) for DocumentAI, processing 50K docs/day at $0.025/doc vs $0.466 on AWS Bedrock.

AWS EKS / vLLM Deployment · 2026
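The headline ratio follows directly from the two per-document prices quoted in this card. A quick check, using `Decimal` to avoid float rounding; the function name is illustrative, and the daily totals assume the full 50K docs/day volume is actually processed:

```python
from decimal import Decimal


def cost_comparison(self_hosted_per_doc: str, managed_per_doc: str, docs_per_day: int):
    """Return (cost ratio, daily self-hosted spend, daily savings) as Decimals."""
    self_hosted = Decimal(self_hosted_per_doc)
    managed = Decimal(managed_per_doc)
    ratio = managed / self_hosted
    daily_spend = self_hosted * docs_per_day
    daily_savings = (managed - self_hosted) * docs_per_day
    return ratio, daily_spend, daily_savings


# Prices and volume are the figures quoted in the card above.
ratio, daily_spend, daily_savings = cost_comparison("0.025", "0.466", 50_000)
print(f"{ratio:.1f}x cheaper, ${daily_spend:,.0f}/day self-hosted, ${daily_savings:,.0f}/day saved")
```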
40%+ saved
Serverless GPU Orchestration

Eliminated always-on GPU standby instances by migrating image pipelines to AWS EKS. Used scale-to-zero during idle hours to cut vendor costs by 40%+.

Vendor Migration · 2025

Speed & Optimization

48s → 7s
FLUX.2 Memory Snapshots

Achieved 6.9× faster startup for image generation models on L40S GPUs using serverless runtime memory checkpointing.

Modal GPU Snapshots · Feb 2026
5m → 1.5m
DreamFurniture Inference

Architected a 6-model 2D→3D pipeline combining depth maps and layout alignment, using parallel workers and shared memory GLB transfers to cut latency by 70%.

Styldod 3D R&D · 2024
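Passing a mesh between workers through shared memory avoids serializing it over a pipe or writing it to disk. The sketch below shows the general technique with Python's standard `multiprocessing.shared_memory`; it is not the pipeline's actual code, and the function names and the fake GLB payload are illustrative assumptions.

```python
from multiprocessing import shared_memory


def publish_blob(payload: bytes) -> tuple[str, int]:
    """Copy a non-empty payload into a new shared-memory segment; return (name, size)."""
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    name = segment.name
    # Drop this handle. On POSIX systems the segment persists until unlink().
    segment.close()
    return name, len(payload)


def consume_blob(name: str, size: int) -> bytes:
    """Attach to a named segment, copy the payload out, then free the segment."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        # The segment may be larger than requested, so slice to the real size.
        return bytes(segment.buf[:size])
    finally:
        segment.close()
        segment.unlink()


# Stand-in for a real GLB file: the 4-byte glTF magic followed by padding.
fake_glb = b"glTF" + b"\x00" * 60
segment_name, segment_size = publish_blob(fake_glb)
received = consume_blob(segment_name, segment_size)
```

Only the segment name and size need to travel between workers (for example over a queue), which keeps the message small regardless of mesh size.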
26m → 7m Boot
Qwen3.5-397B on Modal

Reduced cold start from 26m to 7m by caching FlashInfer JIT kernels via persistent volume symlinks, unlocking scale-to-zero to save 74% on 4× B200 GPUs.

4× B200 Cluster · May 2026
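The symlink idea can be sketched in a few lines: point the directory where JIT-compiled kernels are written at a persistent volume, so a fresh container finds them already built. This is an illustration under assumptions, not the deployment's actual code. The function name and both paths in the usage comment are hypothetical, and the location where FlashInfer writes its JIT cache should be confirmed for the installed version.

```python
import shutil
from pathlib import Path


def link_cache_to_volume(cache_dir: Path, volume_dir: Path) -> None:
    """Make cache_dir a symlink into volume_dir so compiled artifacts persist."""
    volume_dir.mkdir(parents=True, exist_ok=True)
    if cache_dir.is_symlink():
        if cache_dir.resolve() == volume_dir.resolve():
            return  # Already linked to the right place.
        cache_dir.unlink()  # Stale link pointing elsewhere.
    elif cache_dir.exists():
        # Seed the volume with anything compiled locally, then drop the local copy.
        shutil.copytree(cache_dir, volume_dir, dirs_exist_ok=True)
        shutil.rmtree(cache_dir)
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    cache_dir.symlink_to(volume_dir, target_is_directory=True)


# Hypothetical usage at container start, before the model server launches:
# link_cache_to_volume(Path.home() / ".cache" / "flashinfer",
#                      Path("/vol/jit-cache/flashinfer"))
```

Running this before the server starts means the first boot pays the compilation cost once and later boots reuse the cached kernels.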

Systems Engineering

754B MoE (TP=8)
GLM-5.1-FP8 on 8× B200

Hardened a $50/hr serving cluster by building background crash monitors, log streams, and automated container recycling. Cached DeepGEMM kernels to bypass 15-minute boot JIT compilation.

8× B200 cluster · 2026
90% Latency Cut
Custom Logits Processor

Developed a custom token budget controller in llama-cpp-python to dynamically terminate internal reasoning chains once a limit is met, cutting latency and token costs by up to 90%.

llama-cpp-python PR · 2026
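A token budget controller of this kind can be sketched as a logits processor: once the reasoning span exceeds its budget, every logit except the end-of-reasoning token is masked out. The version below is dependency-free and is not the code from the PR. It assumes the processor is a callable of the form `(input_ids, scores) -> scores`, and the class name and token ids are hypothetical.

```python
import math


class ThinkingBudgetProcessor:
    """Force the end-of-reasoning token once the reasoning budget is spent."""

    def __init__(self, think_open_id: int, think_close_id: int, max_thinking_tokens: int):
        self.think_open_id = think_open_id
        self.think_close_id = think_close_id
        self.max_thinking_tokens = max_thinking_tokens

    def __call__(self, input_ids, scores):
        tokens = [int(token) for token in input_ids]
        if self.think_open_id not in tokens:
            return scores  # No reasoning span has been opened.
        # Position of the most recent opening delimiter.
        open_index = len(tokens) - 1 - tokens[::-1].index(self.think_open_id)
        reasoning = tokens[open_index + 1:]
        if self.think_close_id in reasoning:
            return scores  # Reasoning already closed; leave the answer alone.
        if len(reasoning) < self.max_thinking_tokens:
            return scores  # Still within budget.
        # Budget exhausted: mask everything except the closing delimiter.
        for token_id in range(len(scores)):
            if token_id != self.think_close_id:
                scores[token_id] = -math.inf
        return scores
```

In llama-cpp-python a callable like this would typically be wrapped in a `LogitsProcessorList` and passed as the `logits_processor` argument; that wiring is omitted here and should be checked against the installed version.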
93% Eval Accuracy
GGUF Quality Benchmark

Developed an automated test suite running 210 deterministic evaluations (via code execution and regex) to verify that GGUF quantization retained 93% overall accuracy.

Model Evaluation Harness · Feb 2026
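Deterministic checks of the two kinds mentioned above (regex matching and code execution) can be sketched as follows. This is a minimal illustration, not the actual harness: all names are hypothetical, and `exec` on model output should only ever run inside a sandbox.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EvalCase:
    name: str
    check: Callable[[str], bool]


def regex_check(pattern: str) -> Callable[[str], bool]:
    """Pass when the model output contains a match for the pattern."""
    compiled = re.compile(pattern)
    return lambda output: compiled.search(output) is not None


def code_check(function_name: str, args: tuple, expected: Any) -> Callable[[str], bool]:
    """Pass when the generated code defines a function returning the expected value."""
    def check(output: str) -> bool:
        namespace: dict = {}
        try:
            exec(output, namespace)  # Untrusted code: sandbox this in real use.
            return namespace[function_name](*args) == expected
        except Exception:
            return False
    return check


def accuracy(cases: list[EvalCase], outputs: dict[str, str]) -> float:
    """Fraction of cases whose check passes; missing outputs count as failures."""
    passed = sum(case.check(outputs.get(case.name, "")) for case in cases)
    return passed / len(cases)
```

Running the same case list against the full-precision and quantized models gives two accuracy figures that can be compared directly, since every check is deterministic.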
Experience

Professional history & engineering approach.

Lead Machine Learning Engineer

Styldod

Visual automation and 3D orchestration platform for real estate

Oct 2021 to May 2026
Bengaluru, India · 5 years
  • Led the AI research and engineering team on computer vision, multi-cluster distributed GPU training, distributed GPU inference, 3D scene reconstruction, and agentic workflows.
  • Architected REimagineHome.AI: Built a LangGraph + MCP multi-agent virtual staging platform (planning, execution, quality review) and scaled it from 0 to 2.1M+ users (30M+ designs); slashed hosting costs by >65% via self-hosted LLMs/VLMs on AWS EKS.
  • Architected 3D scene reconstruction platform (DreamFurniture): Researched 15+ SOTA models to design and build a 6-model 2D→3D pipeline (integrating camera calibration, depth estimation, and layout alignment), reducing processing latency to ~1.5 min.
  • Drove 70% hosting cost reductions: Replaced external API reliance by self-hosting models for image generation, segmentation, enhancement, classification, depth estimation, layout detection, and camera calibration on AWS.
  • Serverless GPU Orchestration: Eliminated always-on GPU standby instances by moving the inference pipelines to AWS EKS, saving 40%+ in vendor costs via scale-to-zero.
  • Boosted organic search traffic: Engineered automated content engines, elevating non-blog landing pages from page ~40 to the top 10 and securing an average rank of 4th for generated blog posts.
Computer Vision · PyTorch · LangGraph · Model Context Protocol (MCP) · vLLM · SGLang · KubeRay · AWS EKS · Karpenter · Ray Serve · FastAPI · Docker · 3D Scene Reconstruction · Point Cloud Alignment · Camera Calibration · Depth Estimation · Image Segmentation · Image Enhancement · VLMs · WebSockets · Redis

Research Engineer

RAx Labs

Academic research search & summarization engine

Jul 2021 to Oct 2021
Gandhinagar, India
  • Built NLP extraction pipelines: Generated summaries of newly published research papers and broadcast updates to active users via microservices.
  • Configured Docker & AWS ECS: Deployed backend instances with Datadog, Prometheus, and Grafana monitoring stacks.
  • Optimized Elasticsearch databases: Sped up document searches across indexes under high concurrency.
NLP · Docker · AWS ECS · Elasticsearch · Prometheus · Grafana · Python

Core Engineering Values

Plan Before Coding

Draft specifications, dependencies, and baseline benchmarks first. Compare models on price, latency, and quality, never on assumptions.

Manage Compute Wisely

Profile cold starts and track hosting spend in dollars. Migrating workloads off unoptimized vendors frees up significant budget.

Ship Proofs Rapidly

Build operational end-to-end prototypes immediately (e.g. 12 days for budget-tracked chatbots) to validate architecture before scaling.

Own the Full Stack

Solve blockers wherever they occur: writing Streamlit/React tools, configuring API Gateways, or optimizing CDN caching to deliver results.

Projects

Systems built for scale and active usage.

Production-ready tools, self-hosted orchestrations, and cloud stacks serving real requests, not academic homework.

Personal Project
2022 to 2025

ScaleWaveAI

AIaaS · Cloud Private GPU Clusters · B2B

B2B AIaaS orchestration platform providing private, isolated GPU clusters with scale-to-zero economics to eliminate idle compute costs. Features one-click model deployment, Stripe/Razorpay transactional billing, and data compliance isolation for enterprise tenants, capable of serving high-volume vision and training workloads.

AIaaS · KubeRay · AWS EKS · Karpenter · Ray Serve · SaaS Billing · Stripe & Razorpay · Multi-tenant Isolation · Distributed Training · B2B
Personal Project
2025 to 2026

KyoudaiClub

SaaS · LLM Roleplay Chat & PvP Quizzes · B2C

Built a gamified anime roleplay & trivia platform. Powered by a FastAPI + Beanie ODM (MongoDB) backend. Uses LangGraph to orchestrate character chat, PvP duels, and context compression. Features a WebSocket PvP engine, AniList watch-history integration, and dynamic GCS-hosted avatar generation.

FastAPI · Beanie ODM · MongoDB · LangGraph · LangChain · WebSockets · Self-Hosted LLMs · Google Cloud Storage · Next.js · B2C
Personal Project
2021 to Present

QuantML

Learn ML in the browser · Deploy Guides

Interactive ML education platform. Simulates PCA and backprop training (Word2Vec, RNN) with KaTeX-rendered math. Features serverless playbooks and fully offline client-side WebGPU inference that runs LLMs inside browser workers, with PWA IndexedDB offline queueing and alternate markdown endpoints for LLM crawler search indexing.

WebGPU · Edge AI · llama.cpp · vLLM · SGLang · Next.js · MongoDB · KaTeX · PWA · IndexedDB · LLM Crawler Routing
Personal Project
Early 2026

Overheard

Live Audience WebSocket Chat

Built a public real-time digital fishbowl chat platform on a custom Node.js Socket.io server. Features spectator modes, a live hype meter, and a 4-stage moderation cascade that combines browser-side ML (cached toxic-bert Web Workers) for pre-flight blocks without a server round trip, server-side regex sanitization, and async LLM quarantine verification.

Socket.io · Node.js · Web Workers · Edge ML (toxic-bert) · Firebase Auth · Web Push (VAPID) · Framer Motion · PWA · Real-Time Systems
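The server-side half of a moderation cascade like this one can be sketched as a chain of cheap synchronous stages followed by an asynchronous verifier for anything flagged. The sketch is illustrative only: the browser-side stage is omitted, and the pattern, stage names, and stub verifier are hypothetical stand-ins rather than the platform's rules.

```python
import asyncio
import re
from typing import Awaitable, Callable

# Hypothetical rule: links or a character repeated 10+ times look suspicious.
SUSPICIOUS = re.compile(r"https?://|(.)\1{9,}")


def regex_stage(message: str) -> str:
    """Cheap synchronous stage: returns 'allow' or 'quarantine'."""
    return "quarantine" if SUSPICIOUS.search(message) else "allow"


async def stub_verifier(message: str) -> bool:
    """Stand-in for an LLM call: approve anything that carries no link."""
    await asyncio.sleep(0)  # Simulates awaiting a remote model.
    return "http" not in message


async def moderate(
    message: str,
    stages: list[Callable[[str], str]],
    verify: Callable[[str], Awaitable[bool]],
) -> str:
    """Run stages in order; the first non-'allow' verdict decides the path."""
    for stage in stages:
        verdict = stage(message)
        if verdict == "block":
            return "block"
        if verdict == "quarantine":
            approved = await verify(message)
            return "allow" if approved else "block"
    return "allow"
```

Clean messages never touch the verifier, so the expensive model call is paid only for the small share of traffic that a cheap stage flags.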
Client Project · Enterprise
2026

DocumentAI

Hybrid Document Parsing Infrastructure

Built a hybrid platform that processes 1,000-page PDFs in under 5 minutes, using Step Functions Distributed Map chunking to work around 128K-token limits. Deployed KubeRay + vLLM (Qwen3.6-35B-A3B FP8) on EKS, autoscaling Spot GPUs with Karpenter and KEDA. Achieved 18.6× cost savings ($0.025/doc vs $0.465 on AWS Bedrock) at a capacity of 50K documents/day.

Architecture documented. Complete report available for review.

vLLM · AWS EKS · KubeRay · Karpenter · KEDA · Step Functions · Distributed Map · Hybrid Routing · AWS CDK · vLLM Metrics · PII Redaction · MLOps
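The chunking step can be illustrated with a small helper that turns a page count into the list of page ranges a fan-out state would iterate over. This is a sketch under assumptions, not the platform's code: the function names, the tokens-per-page estimate, and the reserved prompt budget are hypothetical, and 128K is treated as 128,000 tokens.

```python
def pages_per_chunk_for_budget(
    token_limit: int, tokens_per_page: int, reserved_tokens: int = 0
) -> int:
    """Largest page count per chunk that stays under the token limit (at least 1)."""
    if tokens_per_page <= 0:
        raise ValueError("tokens_per_page must be positive")
    return max(1, (token_limit - reserved_tokens) // tokens_per_page)


def chunk_pages(total_pages: int, pages_per_chunk: int) -> list[dict]:
    """Split pages 1..total_pages into inclusive ranges of at most pages_per_chunk."""
    if total_pages < 0 or pages_per_chunk <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk > 0")
    return [
        {"start_page": start, "end_page": min(start + pages_per_chunk - 1, total_pages)}
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


# Assumed figures: ~800 tokens per page, 8,000 tokens reserved for prompt and output.
per_chunk = pages_per_chunk_for_budget(128_000, tokens_per_page=800, reserved_tokens=8_000)
chunks = chunk_pages(1_000, per_chunk)
```

Each range can then be processed independently and in parallel, which is what lets a long document finish in minutes instead of being limited by a single model call.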
Styldod (Work)
2022 to 2026

REimagineHome.AI

Agentic Interior Design Platform

Architected and led the ML lifecycle of a virtual staging platform from 0 to 2.1M+ users (30M+ designs, 70% download rate). Built a LangGraph + MCP multi-agent workflow (planning, execution, and quality review agents) orchestrating 20+ tools. Cut hosting costs by >65% by migrating from managed APIs to self-hosted LLMs/VLMs.

Agentic AI · Computer Vision · LLM · VLM · Memory Management · MCP · 3D Vision · LangGraph · EKS · KubeRay · FastAPI · ECS · Lambda · GPU Snapshotting · FSx for Lustre · B2C
Skills & Credentials

Tooling, certifications, & education.

Technical Competencies

AI & ML

PyTorch · vLLM · SGLang · llama.cpp · Ray Serve · Ray Train · Ray Data · LangGraph · LangChain · LiteLLM · MCP · Transformers · Transformers.js · LLM · VLM · Machine Learning · Computer Vision · Natural Language Processing
Diffusion Models · Distributed Training · Distributed Inference · LLM/VLM evaluation framework · LangSmith · Langfuse · Multi-modal Models · Model Quantization · 3D Reconstruction · Image Segmentation · Depth Estimation · Camera Calibration · PCA Projections · Backpropagation Profiling · Speculative Decoding · KV Cache Control

Infrastructure & MLOps

AWS EKS · KubeRay · AWS Karpenter · Docker · FastAPI · Redis · Lambda/SQS/S3 · Kubernetes · Prometheus · Grafana · Datadog · OpenTelemetry · AWS ECS
GPU Snapshots (CRIU) · Cold Start Optimization · DeepGEMM Compilation · System Health Monitoring · AWS CloudWatch · AWS X-Ray · AWS FSx for Lustre · Scale-to-Zero clustering · IndexedDB browser cache · PWA Service Workers · Offline Sync Queues

Full Stack & Core Eng

Python · TypeScript · Next.js · React · WebGPU/WASM · WebSockets · REST APIs · Git · MongoDB Atlas · Mongoose ORM · Stripe API · Razorpay API · Firebase Admin SDK
High-Concurrency Broadcast · Linux Admin · Ansible Automations · MDX System Layouts · KaTeX (LaTeX rendering) · NextAuth.js (OAuth) · ACID Database Transactions · Exponential Backoff Retries · High-Precision Decimal Billing · Web Push Notifications · IP Rate Limiting

Leadership & Strategy

Team Leadership (R&D) · Product Architecture · Compute Cost Modeling · Technical Writing · B2B Product Management · Client Engagement
Growth Engineering (SEO) · Vendor Benchmarking · Project Scoping · Agile Execution · AI Engine Optimization (AEO) · RBAC (Access Control)

Education & Credentials

Work Focus & Engagements

I thrive on small, high-autonomy teams shipping products where speed and cost metrics guide architecture decisions.

Open to senior/staff MLE roles covering GPU serving, cold starts, multi-agent frameworks (LangGraph/MCP), and system optimization.

Based in Bengaluru, India. Open to hybrid, remote, or relocation roles. Available for contract consulting and full-time positions.

Let's Build Together

Review my infrastructure reports or ask about availability by email.
