✨TL;DR Summary
- Flash Attention Regression: Investigated a massive latency scaling issue (6.7s to 600s+) in GLM-4.7-Flash. Discovered that llama.cpp was silently falling back to O(n²) attention because the model's GQA ratio (20) was not divisible by 16.
- GLM-5.1 Code Review: Identified and fixed 36 infrastructure issues in a production SGLang deployment, including critical bugs such as missing watchdog timeouts, uncaptured subprocess stdout, and race conditions.
- Concurrency Cliff Discovery: Profiled llama.cpp throughput and discovered a catastrophic performance collapse (0 tokens/s) at 4+ concurrent requests due to head-of-line blocking in its single-threaded architecture.
Debugging production ML systems requires going deep: into CUDA kernels, framework internals, subprocess management, and upstream source code. These investigations spanned multiple stacks and uncovered both critical bugs and fundamental architecture issues.
Flash Attention Investigation
GLM-4.7-Flash GGUF on an L40S showed suspicious behavior: VRAM usage stayed constant at 24.54 GiB regardless of context length, and latency grew from 6.7s at 1K tokens to 600s+ at 16K (timeout). Setting flash_attn=True had no measurable effect.
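One quick way to sanity-check whether latency is growing faster than linearly is a two-point power-law fit over the measurements above. The sketch below is only a rough estimate: the function name is illustrative, "1K" and "16K" are taken as 1024 and 16384 tokens, and because the 16K run hit the 600s timeout, the result is a lower bound on the real exponent.

```python
import math


def scaling_exponent(n1: int, t1: float, n2: int, t2: float) -> float:
    """Return k such that t2 / t1 == (n2 / n1) ** k (two-point power-law fit)."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Measurements quoted above. 600 s is a timeout, not a completed run,
# so the true exponent is at least as large as this estimate.
k = scaling_exponent(1024, 6.7, 16384, 600.0)
print(f"empirical scaling exponent >= {k:.2f}")
```

Linear scaling would give an exponent near 1. The estimate here comes out at roughly 1.6 as a lower bound, which is consistent with attention cost growing superlinearly once fixed overheads are included.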
Root cause: llama-cpp-python v0.3.16 ships vendored llama.cpp from August 2025. The CUDA flash attention kernel only supports GQA ratios divisible by 16. GLM-4.7-Flash's GQA ratio is 20. When unsupported, llama.cpp silently falls back to O(n²) standard attention with no warning.
The fix exists upstream: PR #18953 relaxes the constraint to GQA ratios divisible by 4. However, backporting it caused ABI incompatibilities, and applying it without the prerequisite kernel changes produced a 7.5× regression.
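Since the fallback is silent, a preflight check in the deployment code can at least make the condition visible. The following is a minimal sketch that mirrors the divisibility rule as described in this investigation; `check_flash_attn` is a hypothetical helper, not part of llama.cpp or llama-cpp-python, and the divisors 16 and 4 are the values reported above rather than values read from a library API.

```python
import warnings


def check_flash_attn(gqa_ratio: int, required_divisor: int) -> bool:
    """Warn loudly if the GQA ratio fails the divisibility rule described above.

    Hypothetical helper: it mirrors the constraint found in the investigation
    and does not query the actual kernel dispatch logic.
    """
    ok = gqa_ratio % required_divisor == 0
    if not ok:
        warnings.warn(
            f"GQA ratio {gqa_ratio} is not divisible by {required_divisor}; "
            "flash attention may silently fall back to standard attention",
            RuntimeWarning,
        )
    return ok


GLM_GQA_RATIO = 20  # value reported in the investigation

print(check_flash_attn(GLM_GQA_RATIO, 16))  # vendored build: False, with a warning
print(check_flash_attn(GLM_GQA_RATIO, 4))   # after the upstream PR: True
```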
GLM-5.1 FP8 Code Review: 36 Issues
Two-pass review of a production SGLang deployment on Modal.
Critical Bugs (First Pass)
| Bug | Severity | Fix |
|---|---|---|
| serve() / startup() race condition | Critical | Start subprocess inside serve() |
| No subprocess stdout/stderr capture | Critical | stdout=PIPE, stderr=STDOUT with daemon thread |
| region=REGION silently ignored | Medium | Not valid on @app.cls |
| Missing --watchdog-timeout | Critical | Set to 1200 (20 min) for long loads |
| No subprocess crash detection | Critical | Background monitor with os._exit(1) |
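
The stdout/stderr capture fix from the table can be sketched as follows. This is an illustrative pattern using only the standard library, not the deployment's actual code: `start_with_streamed_logs` and its `sink` parameter are assumed names, and the child command in the demo is a stand-in for the real server process. Text mode is used because the `subprocess` documentation states that line buffering (`bufsize=1`) is only usable with `text=True`.

```python
import subprocess
import sys
import threading


def start_with_streamed_logs(cmd, sink=print):
    """Start cmd and forward each output line to sink from a daemon thread."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same pipe
        text=True,                 # text mode, needed for line buffering
        bufsize=1,                 # line-buffered on the parent side
    )

    def pump():
        # Iterating the pipe yields lines as they arrive and ends at EOF.
        for line in proc.stdout:
            sink(line.rstrip("\n"))

    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    return proc, thread


if __name__ == "__main__":
    # Stand-in child process; a real deployment would launch the server here.
    child = [sys.executable, "-c", "print('loading weights'); print('ready')"]
    proc, thread = start_with_streamed_logs(child)
    proc.wait()
    thread.join(timeout=5)
```

The thread is a daemon so that it never blocks interpreter shutdown. Note that the child may still buffer its own output when writing to a pipe, so long-running children often also need unbuffered output on their side.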
Second-Pass Bugs
| Bug | Severity | Fix |
|---|---|---|
| bufsize=1 in binary mode (1-byte buffer) | Critical | text=True, bufsize=1 |
| Monitor thread races with startup | Critical | Add _startup_complete gate |
| Missing dg_volume.reload() in compile | High | Add reload before compilation |
| 10-15 min silent compilation | High | Stream output instead of capture_output=True |
| No per-request token cap | Medium | Add --max-total-tokens 65536 |
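
The crash monitor and its startup gate can be combined along the lines below. This is a sketch under assumptions: `start_monitor` and the `on_crash` callback are illustrative names, and a `threading.Event` stands in for the `_startup_complete` gate. The default callback calls `os._exit(1)` as in the first-pass fix; it is injectable here so the behavior can be exercised without killing the interpreter.

```python
import os
import subprocess
import sys
import threading


def start_monitor(proc, startup_complete, on_crash=lambda rc: os._exit(1)):
    """Invoke on_crash(returncode) if proc exits, but only after startup is done."""

    def run():
        # Gate: stay idle until startup code signals that it has finished,
        # so the monitor cannot race with startup's own error handling.
        startup_complete.wait()
        returncode = proc.wait()  # blocks until the subprocess exits
        on_crash(returncode)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Stand-in child that exits immediately with a non-zero status.
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    gate = threading.Event()
    monitor = start_monitor(child, gate, on_crash=lambda rc: print("exited with", rc))
    gate.set()  # startup finished; the monitor may now act
    monitor.join(timeout=5)
```

In this design, failures during startup remain the responsibility of the startup path itself; the monitor only takes over once the gate is set.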
Concurrency Cliff Discovery
llama-cpp-python throughput collapses at N≥4 concurrent requests:
| Concurrent N | Wall Time | Agg tok/s |
|---|---|---|
| 1 | 2.34s | 54.8 |
| 2 | 2.27s | 56.4 |
| 4 | 24.82s | 0.0 |
| 8 | 54.80s | 0.0 |
Root cause: Inference is strictly single-threaded. At N≥4, queue depth causes head-of-line blocking that cascades into timeouts.
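A sweep like the one in the table can be produced with a small harness such as the sketch below. It is not the harness used for these measurements: `run_load_test` and `request_fn` are hypothetical names, where `request_fn` is assumed to perform one completion and return the number of tokens generated, raising on error or timeout. Failed requests count as zero tokens, which is one way an aggregate of 0.0 tok/s can arise when every request in a batch times out.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, n):
    """Fire n concurrent calls to request_fn; return (wall_seconds, aggregate tok/s)."""

    def safe_call(_):
        try:
            return request_fn()
        except Exception:
            return 0  # failed or timed-out requests contribute no tokens

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        total_tokens = sum(pool.map(safe_call, range(n)))
    wall = time.perf_counter() - start
    rate = total_tokens / wall if wall > 0 else 0.0
    return wall, rate


if __name__ == "__main__":
    def fake_request():
        # Stand-in for a real completion call against the server.
        time.sleep(0.01)
        return 128

    for n in (1, 2, 4, 8):
        wall, rate = run_load_test(fake_request, n)
        print(f"N={n}: wall={wall:.2f}s, agg tok/s={rate:.1f}")
```

Sweeping N past the expected production concurrency, rather than stopping at 1 or 2, is what exposes a cliff like the one at N≥4.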
Key Learnings
- Silent failures are the worst kind: flash attention silently fell back to O(n²). Always verify optimizations are active.
- Read framework source code: the GQA ratio constraint was not documented. Found by reading CUDA kernel dispatch logic.
- Source patches are valid but risky: ABI incompatibilities can make backports worse than the original problem.
- Subprocess log streaming is non-negotiable: zero visibility into 5-7 minute weight loading is unacceptable.
- Use text=True with bufsize=1: bufsize=1 in binary mode becomes a catastrophically slow 1-byte buffer.
- Concurrency testing must cover realistic loads: the cliff at N≥4 only appeared under realistic load testing.
Source: §3 (Flash Attention Investigation, Concurrency Cliff), §4 (GLM-5.1 FP8 Code Review). Upstream issues: llama.cpp PR #18953.