Technical Deep Dive

Browser AI: Offline WebGPU

Deploying Gemma 4 and Qwen3.5 entirely client-side: zero server costs, no network round-trips during inference, fully offline.

Systems & Infrastructure·7 min read·Production Verified

TL;DR Summary

  • WebGPU vs WASM: WebGPU is 10-15× faster than WASM, reaching 30-60 tokens/s on desktop and 15-25 tokens/s on mobile (Pixel 9a).
  • Zero Latency Streaming: Moved inference into a Web Worker that sends each token via postMessage, so React's main thread is never blocked and tokens render progressively.
  • Two-Layer Caching: Used a custom Service Worker for the app shell and the Browser Cache API for ONNX model files (~850 MB for Qwen3.5, ~3.4 GB for Gemma 4 E2B) to eliminate redundant network requests.
  • Multimodal Support: Gemma 4 E2B, which alternates local sliding-window and global attention, accepts text, image (downscaled via canvas), and audio (16kHz PCM, converted to a log-mel spectrogram) inputs.
  • Serverless Limits Avoided: Used Next.js static exports (output: 'export') to bypass the Vercel 250MB serverless function limit caused by native ONNX binaries.

Running LLMs directly in the browser, with no server, no API calls, and full offline support, requires solving hard problems around WebGPU compatibility, streaming architecture, PWA caching, and mobile constraints. I have built two production PWAs that do this: Qwen3.5-0.8B and Gemma 4 E2B (text + image + audio).

Deployments

| Model | Size | Modalities | Backend | Live |
|---|---|---|---|---|
| Qwen3.5-0.8B ONNX (q4) | ~850 MB | Text + Image | WebGPU / WASM | qwen.quantml.org |
| Gemma 4 E2B ONNX (q4f16) | ~3.4 GB | Text + Image + Audio | WebGPU | gemma4.quantml.org |

Architecture

Both apps use the same core architecture: a Next.js 15 static export (no server needed) with a dedicated Web Worker running Transformers.js on ONNX Runtime Web. The main thread communicates with the worker via postMessage, and the service worker handles app-shell caching separately from model file caching.

Main Thread (React 19)           Web Worker                 Cache Layer
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ AppShell → ChatInput │    │ Transformers.js  │    │ Service Worker   │
│ → MessageBubble      │◄──►│ ONNX Runtime Web │    │ (app shell: HTML,│
│ → ModelStatus        │    │ WebGPU or WASM   │    │  JS, CSS, icons) │
│                      │    │ TextStreamer     │    │                  │
│ useInferenceEngine   │    │ → postMessage    │    │ Transformers.js  │
│ → TransformersEngine │    │   per token      │    │ Cache API        │
└──────────────────────┘    └──────────────────┘    │ (model files)    │
                                                     └──────────────────┘

WebGPU vs WASM: The Critical Backend Decision

WebGPU is 10-15× faster than WASM for models over 100M parameters. But it requires a secure context (HTTPS or localhost), Chrome 113+ on desktop or Chrome 121+ on Android 12+ with a supported GPU, and an adapter whose maximum buffer size is large enough for the model.

The detection waterfall checks: secure context → navigator.gpu existence → adapter availability → maxBufferSize ≥ 256 MB → shader-f16 support. If WebGPU is selected but fails at runtime, the engine automatically retries with WASM. Network errors are re-thrown since they would fail on WASM too.
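A minimal sketch of that waterfall, not the production code. It assumes every check is a hard requirement (the first failing check selects WASM); the type names `AdapterLike`, `GpuLike`, and `Decision`, and the functions `decideBackend` and `detectBackend`, are illustrative. The structural types stand in for the real WebGPU types so the decision logic can be exercised outside a browser.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface AdapterLike {
  limits: { maxBufferSize: number };
  features: { has(name: string): boolean };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

type Backend = "webgpu" | "wasm";
interface Decision {
  backend: Backend;
  reason: string;
}

const MIN_BUFFER_BYTES = 256 * 1024 * 1024; // 256 MB threshold from the text

// Pure decision step, ordered like the waterfall: the first failing check wins.
function decideBackend(
  secure: boolean,
  hasGpu: boolean,
  adapter: AdapterLike | null
): Decision {
  if (!secure) return { backend: "wasm", reason: "not a secure context" };
  if (!hasGpu) return { backend: "wasm", reason: "navigator.gpu is missing" };
  if (!adapter) return { backend: "wasm", reason: "no GPU adapter available" };
  if (adapter.limits.maxBufferSize < MIN_BUFFER_BYTES) {
    return { backend: "wasm", reason: "maxBufferSize below 256 MB" };
  }
  if (!adapter.features.has("shader-f16")) {
    return { backend: "wasm", reason: "shader-f16 unsupported" };
  }
  return { backend: "webgpu", reason: "all checks passed" };
}

// Async wrapper: only requests an adapter when the cheap checks pass.
async function detectBackend(secure: boolean, gpu?: GpuLike): Promise<Decision> {
  let adapter: AdapterLike | null = null;
  if (secure && gpu) {
    try {
      adapter = await gpu.requestAdapter();
    } catch {
      adapter = null; // treat a rejected request as "no adapter"
    }
  }
  return decideBackend(secure, gpu !== undefined, adapter);
}
```

In a browser this would be called with the page's isSecureContext flag and navigator.gpu. Returning a reason string alongside the backend makes it easy to surface diagnostics when the fallback is taken.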

| Metric | WebGPU (Desktop) | WebGPU (Pixel 9a) | WASM (Pixel 9a) |
|---|---|---|---|
| TTFT | 1-3s | 3-8s | 10-30s |
| Decode speed | 30-60 tok/s | 15-25 tok/s | 2-8 tok/s |
| Runtime memory | ~1-2 GB | ~1-2 GB | ~0.8-1.5 GB |

The Streaming Problem

Transformers.js's model.generate() uses an internal autoregressive loop. With WASM, session.run() executes synchronously on the calling thread: the await resolves as a microtask without yielding to the macrotask queue. React's setMessages() calls batch but never flush until the entire loop finishes. Result: all text appears at once after a long delay.

Solution: Web Worker. Each token is sent from the worker via postMessage({ type: 'token', token }). Every postMessage delivery is a separate macrotask on the main thread's event loop. React processes each setMessages() update in its own render cycle. True progressive streaming on both WebGPU and WASM.
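The main-thread half of this pattern can be sketched as a small reducer. Only the `{ type: 'token', token }` message shape comes from the text above; the `done` and `error` messages, the `ChatMessage` type, and the `applyWorkerMessage` name are assumptions made for illustration.

```typescript
// Message protocol between worker and main thread. Only the "token" shape is
// from the article; "done" and "error" are assumed for illustration.
type WorkerMessage =
  | { type: "token"; token: string }
  | { type: "done" }
  | { type: "error"; message: string };

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
  streaming: boolean;
}

// Pure reducer: returns a new array so React sees a changed reference.
function applyWorkerMessage(
  messages: ChatMessage[],
  msg: WorkerMessage
): ChatMessage[] {
  const last = messages[messages.length - 1];
  // Tokens only ever extend the trailing assistant message.
  if (!last || last.role !== "assistant") return messages;
  const head = messages.slice(0, -1);
  switch (msg.type) {
    case "token":
      return [...head, { ...last, content: last.content + msg.token }];
    case "done":
      return [...head, { ...last, streaming: false }];
    case "error":
      return [
        ...head,
        { ...last, content: last.content + `\n[error: ${msg.message}]`, streaming: false },
      ];
  }
}

// Worker side (not executed here): forward each decoded chunk as it arrives.
//   self.postMessage({ type: "token", token });
// Main thread (not executed here): one state update per delivered message.
//   worker.onmessage = (e) => setMessages((prev) => applyWorkerMessage(prev, e.data));
```

Keeping the reducer pure means the streaming logic can be tested without a worker or a React tree.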

Two-Layer Caching Strategy

Layer 1: Service Worker (app shell). Hand-written sw.js, not Workbox. Caches all static assets (HTML, JS, CSS) using cache-first strategy. Navigation uses network-first with cache fallback. Explicitly skips HuggingFace URLs, .onnx, and .onnx_data since model files are handled by Layer 2.
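The routing rules in Layer 1 can be expressed as one pure function that the fetch handler consults. This is a sketch under assumptions, not the actual sw.js: the `FetchSummary` shape, the `chooseStrategy` name, the `hf.co` host, and the decision to bypass non-GET and range requests are illustrative choices.

```typescript
type Strategy = "bypass" | "network-first" | "cache-first";

// Plain summary of a fetch event's request, so the rules run outside a browser.
interface FetchSummary {
  url: string;
  method: string;
  mode: string; // Request.mode; "navigate" for page loads
  hasRangeHeader: boolean;
}

const MODEL_HOST_SUFFIXES = ["huggingface.co", "hf.co"]; // assumed host list
const MODEL_EXTENSIONS = [".onnx", ".onnx_data"];

function chooseStrategy(req: FetchSummary): Strategy {
  const url = new URL(req.url);
  const isModelHost = MODEL_HOST_SUFFIXES.some(
    (h) => url.hostname === h || url.hostname.endsWith("." + h)
  );
  const isModelFile = MODEL_EXTENSIONS.some((ext) => url.pathname.endsWith(ext));

  // Model traffic belongs to Layer 2; non-GET and range requests are not
  // handled by this cache either, so all of them go straight to the network.
  if (req.method !== "GET" || req.hasRangeHeader || isModelHost || isModelFile) {
    return "bypass";
  }
  if (req.mode === "navigate") return "network-first";
  return "cache-first";
}
```

Inside the service worker, a "bypass" result would simply mean returning from the fetch handler without calling event.respondWith, which lets the browser handle the request normally.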

Layer 2: Transformers.js internal cache (model files). Transformers.js uses the browser Cache API directly for ONNX model files. First load downloads ~850 MB (Qwen3.5) or ~3.4 GB (Gemma 4 E2B). Subsequent loads: instant from cache, no network needed. These caches persist across sessions and survive service worker updates.

Gemma 4 E2B: Multimodal in the Browser

Gemma 4 E2B is a 2.3B effective parameter model supporting text, image, and audio input in the browser. It uses an alternating local sliding-window (512 tokens) and global full-context attention pattern, with a 128K token context window.

Audio Processing Pipeline

  1. MediaRecorder captures WebM/Opus via getUserMedia
  2. FileReader.readAsDataURL converts the blob to a data URL
  3. AudioContext({ sampleRate: 16000 }) + decodeAudioData converts to 16kHz PCM
  4. Stereo-to-mono downmix with sqrt(2)/2 scaling
  5. Float32Array passed to the Gemma4 processor
  6. Gemma4AudioFeatureExtractor computes log-mel spectrogram internally
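Step 4 is the only part of this pipeline that is plain arithmetic, so it is easy to sketch in isolation. The `downmixToMono` name is illustrative, and ignoring channels beyond the first two is an assumption.

```typescript
// Downmix decoded PCM channels to a single mono channel.
// For stereo, each sample is (sqrt(2) / 2) * (left + right).
// Channels beyond the first two are ignored in this sketch.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) throw new Error("no audio channels");
  if (channels.length === 1) return channels[0]; // already mono

  const left = channels[0];
  const right = channels[1];
  const n = Math.min(left.length, right.length);
  const mono = new Float32Array(n);
  const scale = Math.SQRT2 / 2;
  for (let i = 0; i < n; i++) {
    mono[i] = scale * (left[i] + right[i]);
  }
  return mono;
}

// Browser usage (not executed here), with `buffer` being the AudioBuffer
// returned by decodeAudioData on a 16 kHz AudioContext:
//   const channels = [];
//   for (let c = 0; c < buffer.numberOfChannels; c++) {
//     channels.push(buffer.getChannelData(c));
//   }
//   const pcm = downmixToMono(channels);
```

The sqrt(2)/2 factor, rather than a plain 1/2 average, compensates for the loudness lost when two uncorrelated channels are averaged.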

Image Processing Pipeline

  1. User selects image via file input
  2. createImageBitmap decodes it
  3. Canvas downscales to max 1024×1024 preserving aspect ratio
  4. canvas.toDataURL('image/jpeg', 0.85) compresses
  5. RawImage.read(dataUrl) converts to model format
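The sizing rule in step 3 can be sketched as a pure helper. The `fitWithin` name and the choice never to upscale small images are assumptions.

```typescript
interface Size {
  width: number;
  height: number;
}

// Largest size that fits inside maxSide × maxSide while keeping the aspect
// ratio. Images that already fit are returned unchanged (no upscaling).
function fitWithin(src: Size, maxSide = 1024): Size {
  const scale = Math.min(1, maxSide / src.width, maxSide / src.height);
  return {
    width: Math.max(1, Math.round(src.width * scale)),
    height: Math.max(1, Math.round(src.height * scale)),
  };
}

// Browser usage (not executed here), with `bitmap` from createImageBitmap:
//   const { width, height } = fitWithin(bitmap);
//   canvas.width = width;
//   canvas.height = height;
//   canvas.getContext("2d").drawImage(bitmap, 0, 0, width, height);
//   const dataUrl = canvas.toDataURL("image/jpeg", 0.85);
```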

Progress Tracking for Large Downloads

The Transformers.js progress_callback fires events per file, but event.total is frequently 0 (the server does not send Content-Length for partial responses) and multiple files download concurrently. The solution: a single shared tracker with a high-water mark that prevents the percentage from going backward, a smoothed speed calculation, and an ETA display.
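A minimal sketch of such a tracker follows. It is not the production implementation: the `DownloadTracker` class, the exponential moving average used for smoothing, and the `expectedBytes` fallback for files that report a total of 0 are all assumptions. Timestamps are passed in explicitly so the behavior is deterministic.

```typescript
interface FileProgress {
  file: string;
  loaded: number;
  total: number; // may be 0 when the server omits Content-Length
}
interface Snapshot {
  percent: number;
  bytesPerSec: number;
  etaSec: number | null;
}

class DownloadTracker {
  private loaded = new Map<string, number>();
  private totals = new Map<string, number>();
  private highWater = 0;
  private speed = 0;
  private lastMs: number | null = null;
  private lastBytes = 0;
  private expectedBytes: number;
  private alpha: number;

  // expectedBytes: rough total used until real totals are known (assumed).
  // alpha: smoothing factor for the moving average of download speed.
  constructor(expectedBytes: number, alpha = 0.2) {
    this.expectedBytes = expectedBytes;
    this.alpha = alpha;
  }

  update(ev: FileProgress, nowMs: number): Snapshot {
    const prev = this.loaded.get(ev.file) ?? 0;
    this.loaded.set(ev.file, Math.max(prev, ev.loaded));
    if (ev.total > 0) this.totals.set(ev.file, ev.total);

    let bytes = 0;
    this.loaded.forEach((v) => (bytes += v));
    let known = 0;
    this.totals.forEach((v) => (known += v));
    const total = Math.max(known, this.expectedBytes, bytes);

    if (this.lastMs !== null && nowMs > this.lastMs) {
      const instant = (bytes - this.lastBytes) / ((nowMs - this.lastMs) / 1000);
      this.speed =
        this.speed === 0 ? instant : this.alpha * instant + (1 - this.alpha) * this.speed;
    }
    this.lastMs = nowMs;
    this.lastBytes = bytes;

    // High-water mark: the displayed percentage never decreases, even when a
    // newly discovered file enlarges the denominator.
    const raw = total > 0 ? (bytes / total) * 100 : 0;
    this.highWater = Math.max(this.highWater, Math.min(100, raw));
    const etaSec = this.speed > 0 ? (total - bytes) / this.speed : null;
    return { percent: this.highWater, bytesPerSec: this.speed, etaSec };
  }
}
```

All files feed one tracker instance, so concurrent downloads share a single percentage, speed, and ETA.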

Three phases shown in the UI: Loading processor (spinner, small files), Downloading model (progress ring with bytes, speed, ETA), and Compiling model (spinner, WebGPU shader compilation).

Engine Lifecycle & Race Conditions

The engine has three async operations that can race: initialize(), generate(), and dispose(). Guards prevent concurrent calls via init/generate promises, an epoch counter that increments on each init/dispose (stale callbacks bail), and a disposed flag checked at every async boundary.
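The epoch and disposed-flag guards can be sketched as follows. `EngineLifecycle` and `initialize` are illustrative names, and the `load` and `release` callbacks are hypothetical stand-ins for model loading and disposal.

```typescript
class EngineLifecycle {
  private epoch = 0;
  private disposed = false;

  // Start a new init; any callback holding an older token becomes stale.
  beginInit(): number {
    this.disposed = false;
    this.epoch += 1;
    return this.epoch;
  }

  dispose(): void {
    this.disposed = true;
    this.epoch += 1;
  }

  // Check after every await: false means a newer init or a dispose happened.
  isCurrent(token: number): boolean {
    return !this.disposed && token === this.epoch;
  }
}

// Hypothetical loader and releaser, used only to show where the check goes.
async function initialize<T>(
  life: EngineLifecycle,
  load: () => Promise<T>,
  release: (model: T) => void
): Promise<T | null> {
  const token = life.beginInit();
  const model = await load();
  if (!life.isCurrent(token)) {
    release(model); // stale result: free it instead of publishing it
    return null;
  }
  return model;
}
```

The token comparison is what lets a slow initialize() notice that dispose() or a second initialize() ran while it was awaiting the download.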

GPU memory leaks are prevented by always calling disposeTensors(inputs) in a finally block after every generate() call, and disposing the old model before loading a new one.

Deployment: Solving the 250MB Serverless Limit

@huggingface/transformers pulls in onnxruntime-node (~355MB of native binaries) as a dependency. Even though inference runs entirely client-side, Next.js traces it into the serverless function bundle, exceeding Vercel's 250MB limit.

Solution: output: 'export' in next.config.ts. This produces a fully static site (HTML + JS + CSS) with zero serverless functions. Since the app is 100% client-side, no server is needed. Combined with webpack aliases (sharp$: false, onnxruntime-node$: false) to prevent bundling server-only packages.
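Put together, the configuration described above would look roughly like this. Only the `output` setting and the two aliases come from the text; the rest is the usual shape of a next.config.ts and may differ from the actual project file.

```typescript
// next.config.ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  // Emit static HTML/JS/CSS only; no serverless functions are generated.
  output: "export",
  webpack: (config) => {
    // Stub out server-only packages so they are never bundled.
    config.resolve.alias = {
      ...config.resolve.alias,
      sharp$: false,
      "onnxruntime-node$": false,
    };
    return config;
  },
};

export default nextConfig;
```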

Mobile-Specific Gotchas

| Constraint | Impact | Mitigation |
|---|---|---|
| HTTPS mandatory | WebGPU silently disabled on HTTP | Serve over HTTPS; localhost is a secure context |
| Android WebGPU requires Chrome 121+ on Android 12+ | ~30% of Android users cannot use WebGPU | Auto-fallback to WASM with mobile-specific diagnostics |
| Google Advanced Protection disables WebGPU | Entirely blocks navigator.gpu | Detection in capability check; suggest WASM fallback |
| Hardware acceleration must be ON | Chrome setting often disabled for battery | Check chrome://gpu diagnostic |
| Device memory < 4 GB | Model may OOM or run very slowly | Show warning banner via navigator.deviceMemory |
| Battery drain | Inference drains mobile battery significantly | Warning banner when on mobile |

Key Gotchas and Lessons

  1. HTTPS is non-negotiable for WebGPU. Serving over plain HTTP makes navigator.gpu undefined. Always check isSecureContext first.
  2. Turbopack breaks Web Workers. new Worker(new URL('./file.ts', import.meta.url)) only works with webpack. The dev script must use next dev, not next dev --turbopack.
  3. React batches kill streaming on WASM. When model.generate() blocks the main thread, React accumulates setState calls but never re-renders. Only a Web Worker fixes this: no amount of flushSync or requestAnimationFrame helps.
  4. Do not double-cache model files. Transformers.js manages its own model cache in the browser Cache API. Skip HuggingFace URLs in the service worker to avoid storing the model twice (~850 MB for Qwen3.5, ~3.4 GB for Gemma 4 E2B).
  5. Service Worker range requests fail. Large model files use HTTP range requests. The SW must bypass these URLs since Cache.put() throws TypeError on 206 Partial Content responses.
  6. Progress callback flooding. Transformers.js fires download events very frequently. Throttle React state updates to 150ms with a trailing timer to prevent "Maximum update depth exceeded."
  7. beforeinstallprompt fires before React mounts. Capture it globally in a vanilla JS file (pwa.js) loaded via <script defer>, not in useEffect.
  8. Disable thinking mode on small models. Qwen3.5's chain-of-thought reasoning wastes tokens on a 0.8B model. Pass enable_thinking: false explicitly.
  9. Context windowing is essential. With 2048 max tokens and potentially long conversations, implement a sliding window (20 messages) with summarization of older messages.
  10. Audio must be 16kHz. The feature extractor's sampling_rate config requires 16kHz. Create AudioContext with { sampleRate: 16000 } for automatic resampling via decodeAudioData.
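As an illustration of item 9, a sliding window with summarization might look like the sketch below. The `windowHistory` name, the `Turn` type, the use of a system message to carry the summary, and the `summarize` callback are all assumptions; how the summary is actually produced is not specified here.

```typescript
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep the most recent `maxMessages` turns verbatim and collapse everything
// older into one summary turn. `summarize` is a hypothetical callback.
function windowHistory(
  history: Turn[],
  summarize: (older: Turn[]) => string,
  maxMessages = 20
): Turn[] {
  if (history.length <= maxMessages) return history;

  const cut = history.length - maxMessages;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  const summary: Turn = {
    role: "system",
    content: `Summary of earlier conversation: ${summarize(older)}`,
  };
  return [summary, ...recent];
}
```

A message-count window is only a proxy for the token budget, so a real implementation would likely also check the tokenized length before generating.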

Key Learnings

  1. WebGPU is transformative for browser AI: 10-15× faster than WASM, making real-time LLM inference feasible on consumer hardware.
  2. Web Workers are the only real solution for streaming: the main thread cannot yield during WASM inference, no matter what React trick you use.
  3. Two-layer caching is required: service worker for app shell, Cache API for model files. Do not mix them.
  4. HTTPS is the #1 gotcha: the default Transformers.js error message is misleading. Always check isSecureContext first.
  5. Static export solves deployment: output: 'export' avoids the 250MB serverless function limit entirely.
  6. Mobile WebGPU is real but fragile: Android 12+, Chrome 121+, hardware acceleration ON, no Advanced Protection. Each constraint needs explicit detection and user-friendly diagnostics.
  7. Progress UX matters for 3.4 GB downloads: byte-level progress with smoothed speed and ETA prevents users from thinking the app is broken.
