✨ TL;DR Summary
- WebGPU vs WASM: WebGPU is 10-15× faster than WASM, achieving 30-60 tokens/s on desktop and 15-25 tokens/s on mobile.
- Zero Latency Streaming: Implemented Web Workers and postMessage to bypass React main-thread blocking, enabling true progressive streaming.
- Two-Layer Caching: Used a custom Service Worker for the app shell and the Browser Cache API for ONNX model files (up to ~3.4 GB) to eliminate redundant network requests.
- Multimodal Support: Gemma 4 E2B (alternating local sliding-window and global attention) accepts text, image (downscaled canvas), and audio (16kHz PCM log-mel spectrogram) inputs.
- Serverless Limits Avoided: Used Next.js static exports (output: 'export') to bypass the Vercel 250MB serverless function limit caused by native ONNX binaries.
Running LLMs directly in the browser, with no server, no API calls, and full offline support, requires solving hard problems around WebGPU compatibility, streaming architecture, PWA caching, and mobile constraints. I have built two production PWAs that do this: one runs Qwen3.5-0.8B and the other runs Gemma 4 E2B (text + image + audio).
Deployments
| Model | Size | Modalities | Backend | Live |
|---|---|---|---|---|
| Qwen3.5-0.8B ONNX (q4) | ~850 MB | Text + Image | WebGPU / WASM | qwen.quantml.org |
| Gemma 4 E2B ONNX (q4f16) | ~3.4 GB | Text + Image + Audio | WebGPU | gemma4.quantml.org |
Architecture
Both apps use the same core architecture: a Next.js 15 static export (no server needed) with a dedicated Web Worker running Transformers.js on ONNX Runtime Web. The main thread communicates with the worker via postMessage, and the service worker handles app-shell caching separately from model file caching.
Main Thread (React 19)      Web Worker              Cache Layer
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ AppShell → ChatInput │    │ Transformers.js  │    │ Service Worker   │
│   → MessageBubble    │◄──►│ ONNX Runtime Web │    │ (app shell: HTML,│
│   → ModelStatus      │    │ WebGPU or WASM   │    │ JS, CSS, icons)  │
│                      │    │ TextStreamer     │    │                  │
│ useInferenceEngine   │    │  → postMessage   │    │ Transformers.js  │
│ → TransformersEngine │    │    per token     │    │ Cache API        │
└──────────────────────┘    └──────────────────┘    │ (model files)    │
                                                    └──────────────────┘
WebGPU vs WASM: The Critical Backend Decision
WebGPU is 10-15× faster than WASM for models over 100M parameters, but it has prerequisites: a secure context (HTTPS or localhost), Chrome 113+ on desktop or Chrome 121+ on Android 12+ with a supported GPU, and an adapter whose maximum buffer size is large enough for the model.
The detection waterfall checks: secure context → navigator.gpu existence → adapter availability → maxBufferSize ≥ 256 MB → shader-f16 support. If WebGPU is selected but fails at runtime, the engine automatically retries with WASM. Network errors are re-thrown since they would fail on WASM too.
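The waterfall can be sketched roughly as follows. This is a minimal illustration, not the apps' actual code: the function name, the reason strings, and the structural `AdapterLike`/`BrowserEnvLike` types are hypothetical stand-ins for the real WebGPU objects, and reporting shader-f16 as a flag (rather than failing on it) is an assumption.

```typescript
// Structural stand-ins for the WebGPU types, so the sketch also runs outside a browser.
interface AdapterLike {
  limits: { maxBufferSize: number };
  features: { has(feature: string): boolean };
}
interface BrowserEnvLike {
  isSecureContext: boolean;
  gpu?: { requestAdapter(): Promise<AdapterLike | null> };
}

type BackendChoice =
  | { backend: "webgpu"; shaderF16: boolean }
  | { backend: "wasm"; reason: string };

const MIN_BUFFER_BYTES = 256 * 1024 * 1024; // the 256 MB threshold from the text

async function detectBackend(env: BrowserEnvLike): Promise<BackendChoice> {
  // 1. WebGPU is only exposed in secure contexts.
  if (!env.isSecureContext) return { backend: "wasm", reason: "insecure-context" };
  // 2. navigator.gpu must exist.
  if (!env.gpu) return { backend: "wasm", reason: "navigator.gpu-missing" };

  // 3. An adapter must be available.
  let adapter: AdapterLike | null = null;
  try {
    adapter = await env.gpu.requestAdapter();
  } catch {
    adapter = null; // treat a rejected request like a missing adapter
  }
  if (!adapter) return { backend: "wasm", reason: "no-adapter" };

  // 4. The adapter must allow large enough buffers.
  if (adapter.limits.maxBufferSize < MIN_BUFFER_BYTES) {
    return { backend: "wasm", reason: "max-buffer-too-small" };
  }
  // 5. Assumption: f16 support is reported to the caller instead of forcing WASM.
  return { backend: "webgpu", shaderF16: adapter.features.has("shader-f16") };
}

// Assumed browser wiring:
//   detectBackend({ isSecureContext: self.isSecureContext, gpu: (navigator as any).gpu })
```

Passing the environment in as an argument keeps the checks testable without a GPU; the reason string can feed the mobile diagnostics shown to the user.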
| Metric | WebGPU (Desktop) | WebGPU (Pixel 9a) | WASM (Pixel 9a) |
|---|---|---|---|
| TTFT | 1-3s | 3-8s | 10-30s |
| Decode speed | 30-60 tok/s | 15-25 tok/s | 2-8 tok/s |
| Runtime memory | ~1-2 GB | ~1-2 GB | ~0.8-1.5 GB |
The Streaming Problem
Transformers.js's model.generate() uses an internal autoregressive loop. With WASM, session.run() executes synchronously on the calling thread: the await resolves as a microtask without yielding to the macrotask queue. React's setMessages() calls batch but never flush until the entire loop finishes. Result: all text appears at once after a long delay.
Solution: Web Worker. Each token is sent from the worker via postMessage({ type: 'token', token }). Every postMessage delivery is a separate macrotask on the main thread's event loop. React processes each setMessages() update in its own render cycle. True progressive streaming on both WebGPU and WASM.
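A sketch of the message flow described above, under stated assumptions: the `done` and `error` message shapes, the `createTokenForwarder` helper, and the `applyEngineMessage` reducer are hypothetical names for illustration. Only the `{ type: 'token', token }` shape comes from the text.

```typescript
type EngineMessage =
  | { type: "token"; token: string }
  | { type: "done" }
  | { type: "error"; message: string };

// Worker side: `post` stands in for the worker's postMessage. A streamer
// callback would call onText once per decoded chunk.
function createTokenForwarder(post: (msg: EngineMessage) => void) {
  return {
    onText: (text: string): void => post({ type: "token", token: text }),
    onDone: (): void => post({ type: "done" }),
  };
}

// Main-thread side: a pure reducer, so each delivered message maps to
// exactly one state update (and therefore one render).
interface StreamState {
  text: string;
  finished: boolean;
  error: string | null;
}

function applyEngineMessage(state: StreamState, msg: EngineMessage): StreamState {
  switch (msg.type) {
    case "token":
      return { ...state, text: state.text + msg.token };
    case "done":
      return { ...state, finished: true };
    case "error":
      return { ...state, finished: true, error: msg.message };
  }
}

// Assumed wiring:
//   worker:      const fwd = createTokenForwarder((m) => self.postMessage(m));
//   main thread: worker.onmessage = (e) => setState((s) => applyEngineMessage(s, e.data));
```

Because every `onmessage` delivery is its own macrotask, React gets a chance to render between tokens regardless of how the worker's inference loop is scheduled.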
Two-Layer Caching Strategy
Layer 1: Service Worker (app shell). Hand-written sw.js, not Workbox. Caches all static assets (HTML, JS, CSS) using cache-first strategy. Navigation uses network-first with cache fallback. Explicitly skips HuggingFace URLs, .onnx, and .onnx_data since model files are handled by Layer 2.
Layer 2: Transformers.js internal cache (model files). Transformers.js uses the browser Cache API directly for ONNX model files. First load downloads ~850 MB (Qwen3.5) or ~3.4 GB (Gemma 4 E2B). Subsequent loads: instant from cache, no network needed. These caches persist across sessions and survive service worker updates.
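The Layer 1 routing decision can be expressed as a small pure function. This is a sketch only: the `chooseStrategy` name, the `RequestSummary` shape, and the exact hostname patterns are assumptions, and the Range-header bypass reflects the 206 issue noted later in this article.

```typescript
type FetchStrategy = "bypass" | "network-first" | "cache-first";

interface RequestSummary {
  url: string;
  method: string;
  mode: string; // "navigate" for page loads
  hasRangeHeader: boolean;
}

function chooseStrategy(req: RequestSummary): FetchStrategy {
  // Non-GET and range requests are never cached by the service worker.
  if (req.method !== "GET" || req.hasRangeHeader) return "bypass";

  const { hostname, pathname } = new URL(req.url);
  // Assumption: these hostname patterns cover the model hosts in use.
  const isModelHost =
    hostname === "huggingface.co" ||
    hostname.endsWith(".huggingface.co") ||
    hostname.endsWith(".hf.co");
  const isModelFile = pathname.endsWith(".onnx") || pathname.endsWith(".onnx_data");
  if (isModelHost || isModelFile) return "bypass"; // Layer 2 owns these

  return req.mode === "navigate" ? "network-first" : "cache-first";
}

// Assumed wiring inside sw.js:
//   self.addEventListener("fetch", (event) => {
//     const r = event.request;
//     const strategy = chooseStrategy({
//       url: r.url, method: r.method, mode: r.mode,
//       hasRangeHeader: r.headers.has("range"),
//     });
//     if (strategy === "bypass") return; // let the browser handle it
//     ...
//   });
```

Returning from the fetch handler without calling `respondWith` hands the request back to the browser, which is what keeps the two cache layers separate.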
Gemma 4 E2B: Multimodal in the Browser
Gemma 4 E2B is a 2.3B effective parameter model supporting text, image, and audio input in the browser. It uses an alternating local sliding-window (512 tokens) and global full-context attention pattern, with a 128K token context window.
Audio Processing Pipeline
- MediaRecorder captures WebM/Opus via getUserMedia
- FileReader.readAsDataURL converts the blob to a data URL
- AudioContext({ sampleRate: 16000 }) + decodeAudioData converts to 16kHz PCM
- Stereo-to-mono downmix with sqrt(2)/2 scaling
- Float32Array passed to the Gemma4 processor
- Gemma4AudioFeatureExtractor computes log-mel spectrogram internally
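The downmix step in the pipeline above could look like this. It is a minimal sketch: `downmixToMono` and `toMono` are hypothetical helper names, and `AudioBufferLike` is a structural stand-in for the Web Audio `AudioBuffer` so the code also runs outside a browser.

```typescript
const DOWNMIX_SCALE = Math.SQRT2 / 2; // the sqrt(2)/2 factor from the pipeline

// Mixes two channels into one: mono[i] = (left[i] + right[i]) * sqrt(2)/2.
function downmixToMono(left: Float32Array, right: Float32Array): Float32Array {
  const length = Math.min(left.length, right.length);
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    mono[i] = (left[i] + right[i]) * DOWNMIX_SCALE;
  }
  return mono;
}

// Structural stand-in for the parts of AudioBuffer used here.
interface AudioBufferLike {
  numberOfChannels: number;
  getChannelData(channel: number): Float32Array;
}

// Returns mono PCM: passes single-channel audio through, downmixes stereo.
function toMono(buffer: AudioBufferLike): Float32Array {
  if (buffer.numberOfChannels < 2) return buffer.getChannelData(0);
  return downmixToMono(buffer.getChannelData(0), buffer.getChannelData(1));
}
```

The result of `toMono(await audioContext.decodeAudioData(arrayBuffer))` is the `Float32Array` that would then be handed to the processor.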
Image Processing Pipeline
- User selects image via file input
- createImageBitmap decodes it
- Canvas downscales to max 1024×1024 preserving aspect ratio
- canvas.toDataURL('image/jpeg', 0.85) compresses
- RawImage.read(dataUrl) converts to model format
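The only arithmetic in this pipeline is the aspect-preserving downscale. A small sketch of that calculation follows; `fitWithin` is a hypothetical helper name, and never upscaling smaller images is an assumption.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scales (width, height) so the longer side is at most maxSide,
// preserving aspect ratio and never upscaling.
function fitWithin(width: number, height: number, maxSide = 1024): Size {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

// Assumed canvas usage:
//   const bitmap = await createImageBitmap(file);
//   const { width, height } = fitWithin(bitmap.width, bitmap.height);
//   canvas.width = width; canvas.height = height;
//   canvas.getContext("2d")!.drawImage(bitmap, 0, 0, width, height);
//   const dataUrl = canvas.toDataURL("image/jpeg", 0.85);
```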
Progress Tracking for Large Downloads
Transformers.js progress_callback fires events per file, but event.total is frequently 0 (the server does not send Content-Length for partial responses), and multiple files download concurrently. The solution is a single shared tracker with a high-water mark that prevents the percentage from going backward, a smoothed speed calculation, and an ETA display.
Three phases shown in the UI: Loading processor (spinner, small files), Downloading model (progress ring with bytes, speed, ETA), and Compiling model (spinner, WebGPU shader compilation).
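One way to build such a tracker is sketched below. It is illustrative only: the class name, the `fallbackTotalBytes` parameter (a known approximate model size used when `total` is 0), and exponential smoothing with `alpha = 0.2` are assumptions, not details from the apps.

```typescript
interface ProgressSnapshot {
  percent: number; // never decreases
  bytesPerSec: number; // exponentially smoothed
  etaSec: number | null; // null until a speed estimate exists
}

function sumValues(map: Map<string, number>): number {
  let total = 0;
  map.forEach((v) => { total += v; });
  return total;
}

class SharedDownloadTracker {
  private readonly loadedByFile = new Map<string, number>();
  private readonly totalByFile = new Map<string, number>();
  private readonly fallbackTotal: number;
  private readonly alpha: number;
  private highWater = 0;
  private speed = 0;
  private lastBytes = 0;
  private lastTimeMs: number | null = null;

  constructor(fallbackTotalBytes: number, alpha = 0.2) {
    this.fallbackTotal = fallbackTotalBytes;
    this.alpha = alpha;
  }

  // Call once per progress event; `total` may be 0 when Content-Length is missing.
  update(file: string, loaded: number, total: number, nowMs: number): ProgressSnapshot {
    this.loadedByFile.set(file, loaded);
    if (total > 0) this.totalByFile.set(file, total);

    const loadedSum = sumValues(this.loadedByFile);
    const denominator = Math.max(this.fallbackTotal, sumValues(this.totalByFile), loadedSum);

    // High-water mark: the reported percentage can only move forward.
    const raw = denominator > 0 ? (loadedSum / denominator) * 100 : 0;
    this.highWater = Math.max(this.highWater, raw);

    if (this.lastTimeMs !== null && nowMs > this.lastTimeMs) {
      const seconds = (nowMs - this.lastTimeMs) / 1000;
      const instant = Math.max(0, loadedSum - this.lastBytes) / seconds;
      this.speed = this.speed === 0 ? instant : this.alpha * instant + (1 - this.alpha) * this.speed;
    }
    this.lastBytes = loadedSum;
    this.lastTimeMs = nowMs;

    const etaSec = this.speed > 0 ? (denominator - loadedSum) / this.speed : null;
    return { percent: this.highWater, bytesPerSec: this.speed, etaSec };
  }
}
```

A single instance would be shared by every file's progress events, with `performance.now()` or `Date.now()` supplied as `nowMs`.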
Engine Lifecycle & Race Conditions
The engine has three async operations that can race: initialize(), generate(), and dispose(). Guards prevent concurrent calls via init/generate promises, an epoch counter that increments on each init/dispose (stale callbacks bail), and a disposed flag checked at every async boundary.
GPU memory leaks are prevented by always calling disposeTensors(inputs) in a finally block after every generate() call, and disposing the old model before loading a new one.
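The epoch-counter idea can be illustrated with a small guard. This is a sketch under assumptions: `EpochGuard` and `initializeGuarded` are hypothetical names, and the real engine also tracks in-flight init/generate promises, which are omitted here.

```typescript
class EpochGuard {
  private epoch = 0;
  private disposed = false;

  // Starts a new init cycle and returns its token.
  begin(): number {
    this.disposed = false;
    this.epoch += 1;
    return this.epoch;
  }

  // Invalidates every token handed out so far.
  dispose(): void {
    this.disposed = true;
    this.epoch += 1;
  }

  // True only if no later begin()/dispose() has happened since `token` was issued.
  isCurrent(token: number): boolean {
    return !this.disposed && token === this.epoch;
  }
}

// Loads a resource, then re-checks the epoch at the async boundary.
// A stale load releases what it created and reports null.
async function initializeGuarded<T>(
  guard: EpochGuard,
  load: () => Promise<T>,
  release: (resource: T) => void,
): Promise<T | null> {
  const token = guard.begin();
  const resource = await load();
  if (!guard.isCurrent(token)) {
    release(resource); // superseded while loading: free it instead of leaking
    return null;
  }
  return resource;
}
```

The same `isCurrent(token)` check would be repeated after each `await` inside generation, so callbacks from a disposed engine bail out instead of touching freed state.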
Deployment: Solving the 250MB Serverless Limit
@huggingface/transformers pulls in onnxruntime-node (~355MB of native binaries) as a dependency. Even though inference runs entirely client-side, Next.js traces it into the serverless function bundle, exceeding Vercel's 250MB limit.
Solution: output: 'export' in next.config.ts. This produces a fully static site (HTML + JS + CSS) with zero serverless functions. Since the app is 100% client-side, no server is needed. The static export is combined with webpack aliases (sharp$: false, onnxruntime-node$: false) that keep server-only packages out of the client bundle.
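Put together, a next.config.ts along these lines would express both settings. This is a minimal sketch based only on the options named above; a real project may need further options that are not covered here.

```typescript
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  // Emit a fully static site: no serverless functions are produced.
  output: "export",
  webpack: (config) => {
    // Resolve server-only packages to nothing so they never reach the client bundle.
    config.resolve.alias = {
      ...config.resolve.alias,
      sharp$: false,
      "onnxruntime-node$": false,
    };
    return config;
  },
};

export default nextConfig;
```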
Mobile-Specific Gotchas
| Constraint | Impact | Mitigation |
|---|---|---|
| HTTPS mandatory | WebGPU silently disabled on HTTP | Serve over HTTPS; localhost is a secure context |
| Android WebGPU requires Chrome 121+ on Android 12+ | ~30% of Android users cannot use WebGPU | Auto-fallback to WASM with mobile-specific diagnostics |
| Google Advanced Protection disables WebGPU | Entirely blocks navigator.gpu | Detection in capability check; suggest WASM fallback |
| Hardware acceleration must be ON | Chrome setting often disabled for battery | Check chrome://gpu diagnostic |
| Device memory < 4 GB | Model may OOM or run very slowly | Show warning banner via navigator.deviceMemory |
| Battery drain | Inference drains mobile battery significantly | Warning banner when on mobile |
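The last two rows of the table reduce to a simple check. A sketch follows, where `collectWarnings`, `DeviceHints`, and the warning identifiers are hypothetical; note that `navigator.deviceMemory` is only exposed by Chromium-based browsers and reports coarse values, so it may be undefined.

```typescript
interface DeviceHints {
  deviceMemoryGiB?: number; // navigator.deviceMemory, when available
  isMobile: boolean;
}

type WarningId = "low-memory" | "battery-drain";

// Returns the banners to show for the current device.
function collectWarnings(hints: DeviceHints): WarningId[] {
  const warnings: WarningId[] = [];
  // Unknown memory is not treated as low memory.
  if (hints.deviceMemoryGiB !== undefined && hints.deviceMemoryGiB < 4) {
    warnings.push("low-memory");
  }
  if (hints.isMobile) warnings.push("battery-drain");
  return warnings;
}

// Assumed browser wiring:
//   collectWarnings({
//     deviceMemoryGiB: (navigator as any).deviceMemory,
//     isMobile: /Android|iPhone|iPad/i.test(navigator.userAgent),
//   });
```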
Key Gotchas and Lessons
- HTTPS is non-negotiable for WebGPU. Serving over plain HTTP makes navigator.gpu undefined. Always check isSecureContext first.
- Turbopack breaks Web Workers. new Worker(new URL('./file.ts', import.meta.url)) only works with webpack. Dev script must use next dev, not next dev --turbopack.
- React batches kill streaming on WASM. When model.generate() blocks the main thread, React accumulates setState calls but never re-renders. Only a Web Worker fixes this: no amount of flushSync or requestAnimationFrame helps.
- Do not double-cache model files. Transformers.js manages its own model cache in the browser Cache API. Skip HuggingFace URLs in the service worker to avoid double-caching ~850 MB of data.
- Service Worker range requests fail. Large model files use HTTP range requests. The SW must bypass these URLs since Cache.put() throws TypeError on 206 Partial Content responses.
- Progress callback flooding. Transformers.js fires download events very frequently. Throttle React state updates to 150ms with a trailing timer to prevent "Maximum update depth exceeded."
- beforeinstallprompt fires before React mounts. Capture it globally in a vanilla JS file (pwa.js) loaded via <script defer>, not in useEffect.
- Disable thinking mode on small models. Qwen3.5's chain-of-thought reasoning wastes tokens on a 0.8B model. Pass enable_thinking: false explicitly.
- Context windowing is essential. With 2048 max tokens and potentially long conversations, implement a sliding window (20 messages) with summarization of older messages.
- Audio must be 16kHz. The feature extractor's sampling_rate config requires 16kHz. Create AudioContext with { sampleRate: 16000 } for automatic resampling via decodeAudioData.
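The progress-throttling item above can be sketched as a throttle with a trailing timer. This is one possible shape, not the apps' code: `createTrailingThrottle` is a hypothetical helper, and the injectable `now` and `schedule` parameters exist only to make the timing testable.

```typescript
// Emits at most once per interval. The latest value seen during a quiet
// period is delivered by a trailing timer, so the final update is never lost.
function createTrailingThrottle<T>(
  emit: (value: T) => void,
  intervalMs = 150,
  now: () => number = () => Date.now(),
  schedule: (fn: () => void, ms: number) => unknown = (fn, ms) => setTimeout(fn, ms),
): (value: T) => void {
  let lastEmit = -Infinity;
  let pending: { value: T } | null = null;
  let timerArmed = false;

  return (value: T): void => {
    const elapsed = now() - lastEmit;
    if (elapsed >= intervalMs && !timerArmed) {
      lastEmit = now();
      emit(value); // leading edge: deliver immediately
      return;
    }
    pending = { value }; // keep only the most recent value
    if (!timerArmed) {
      timerArmed = true;
      schedule(() => {
        timerArmed = false;
        lastEmit = now();
        if (pending) {
          const latest = pending;
          pending = null;
          emit(latest.value); // trailing edge
        }
      }, intervalMs - elapsed);
    }
  };
}

// Assumed usage: const setProgressThrottled = createTrailingThrottle(setProgress);
```

Wrapping the React state setter this way bounds re-renders to roughly one per 150 ms while still showing the last progress value.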
Key Learnings
- WebGPU is transformative for browser AI: 10-15× faster than WASM, making real-time LLM inference feasible on consumer hardware.
- Web Workers are the only real solution for streaming: the main thread cannot yield during WASM inference, no matter what React trick you use.
- Two-layer caching is required: service worker for app shell, Cache API for model files. Do not mix them.
- HTTPS is the #1 gotcha: the default Transformers.js error message is misleading. Always check isSecureContext first.
- Static export solves deployment: output: 'export' avoids the 250MB serverless function limit entirely.
- Mobile WebGPU is real but fragile: Android 12+, Chrome 121+, hardware acceleration ON, no Advanced Protection. Each constraint needs explicit detection and user-friendly diagnostics.
- Progress UX matters for 3.4 GB downloads: byte-level progress with smoothed speed and ETA prevents users from thinking the app is broken.