✨TL;DR Summary
- Engine Comparison: Evaluated vLLM, SGLang, and llama.cpp across 10 production model deployments, analyzing quantization tradeoffs, memory leaks, and parallelization capabilities.
- vLLM Snapshot Bug: Discovered that vLLM's default CUDA Graph capture caused infinite hangs upon container snapshot restoration. Fixed by forcing eager execution (enforce_eager=True).
- SGLang Memory Leak: Identified and mitigated a slow VRAM leak in SGLang's FlashInfer allocation during tensor parallel sharding by capping prefill sizes and forcing PyTorch caching allocator garbage collection.
- llama.cpp Limitations: Built a custom source-compiled Docker pipeline for llama.cpp to support new GGUF architectures, but determined it is unsuitable for high-concurrency production due to its lack of dynamic continuous batching.
Running LLMs in production requires choosing the right runtime engine. Across 10 model deployments, we evaluated the three dominant engines: vLLM, SGLang, and llama.cpp. We discovered deep architectural differences, memory leaks, and engine-specific quirks that impact scalability.
Engine Capability Map
| Feature / Metric | vLLM (v0.7.2) | SGLang (v0.4.3) | llama.cpp (b8765) |
|---|---|---|---|
| Primary Backend | PyTorch / Triton / vLLM Kernels | PyTorch / FlashInfer / Triton | Pure C / C++ (GGML) |
| Quantization Support | FP8, AWQ, GPTQ, INT4 | FP8, AWQ, GPTQ | GGUF (Q4_K_M, IQ4_NL, etc.) |
| Attention Optimization | PagedAttention, FlashAttention | RadixAttention, FlashInfer | Ring Buffer, FlashAttention |
| KV Cache Reuse | Prefix Caching (static) | Radix Cache (dynamic LRU) | Smart KV Cache shift |
| Multi-GPU Parallelism | TP + PP (highly stable) | TP (stable), PP (experimental) | Row/Column splitting (single node) |
vLLM: The Industry Standard
vLLM is the most mature engine for high-throughput batching. However, it has serious pitfalls when combined with serverless GPU checkpointing (CRIU).
The Snapshot Hang Bug
During deployment of GLM-4.7-Flash-FP8, the container would hang indefinitely upon snapshot restore. The root cause was vLLM's default **CUDA Graph capture**. At startup, vLLM captures CUDA graphs for shapes up to max_num_seqs=16.
When Modal captures a GPU snapshot, the captured CUDA graphs refer to memory addresses that become invalid upon container restoration. The fix was forcing eager execution or running eager warmups:
# vLLM initialization with eager-mode execution
from vllm import LLM, SamplingParams
llm = LLM(
    model="THUDM/glm-4-9b-chat",
    quantization="fp8",
    enforce_eager=True,  # Disable graph capture to allow safe CRIU restores
    max_model_len=8192,
)
SGLang: The Low-Latency Contender
SGLang excels at complex, multi-turn prompts and structured JSON outputs thanks to **RadixAttention**, which keeps KV cache entries in a radix tree with LRU eviction so that requests sharing a prompt prefix can reuse the cached state.
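To make the prefix-reuse idea concrete, here is a minimal sketch of the lookup side only. It is not SGLang's implementation: it uses a plain token-level trie rather than a compressed radix tree, stores no KV tensors, and omits LRU eviction entirely. The class name `PrefixCache` and its methods are hypothetical.

```python
class PrefixCache:
    """Toy token-level trie recording which prompt prefixes have been seen."""

    def __init__(self):
        # Each node is a dict mapping a token id to its child node.
        self.root = {}

    def insert(self, tokens):
        """Record a full token sequence so later prompts can match its prefixes."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])            # first turn of a conversation
reused = cache.match_len([1, 2, 3, 9])  # second prompt shares three tokens
```

In a real engine, the matched length determines how many tokens can skip prefill; only the unmatched suffix needs fresh attention computation.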
The Flash-Attention Memory Leak
During stress testing of the MiniMax-M2.7 (456B MoE) model on a 4× B200 cluster, we detected a slow but steady VRAM creep under concurrent streaming requests. Running a deep memory profiler revealed that SGLang's flash-attention buffer allocation was not freeing intermediate PyTorch tensor handles during model sharding across tensor parallel ranks.
We mitigated this in two ways: by forcing garbage collection and flushing the PyTorch caching allocator after initialization, and by passing --chunked-prefill-size 4096 to SGLang to cap the maximum context allocation block.
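The collect-then-flush step can be sketched as follows. This is an assumption about what such a hook might look like, not the code used in the deployment: the helper name `release_cached_memory` is hypothetical, and where it gets called relative to SGLang's startup is not specified in the write-up. `gc.collect()` and `torch.cuda.empty_cache()` are the standard Python and PyTorch calls.

```python
import gc


def release_cached_memory():
    """Drop unreachable Python objects, then flush PyTorch's CUDA cache.

    Returns (objects_collected, cuda_cache_flushed).
    """
    # Collect first so tensors kept alive only by reference cycles are freed
    # and their blocks go back to the caching allocator.
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected, False  # PyTorch not installed; nothing to flush
    if torch.cuda.is_available():
        # Return unused cached blocks from the allocator to the driver.
        torch.cuda.empty_cache()
        return collected, True
    return collected, False


collected, flushed = release_cached_memory()
```

The order matters: `empty_cache()` can only release blocks that no live tensor still occupies, so running the garbage collector beforehand gives it more to release.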
llama.cpp: Edge & CPU Portability
When serving GGUF models (like Gemma-4-26B-A4B-it), llama.cpp is the only one of the three engines capable of running across heterogeneous hardware (like MacBooks or cheap L4 GPUs).
Source-Compilation Pipeline
To support the latest Gemma-4 architectures before they landed in stable package releases, we built a custom multi-stage Docker build that compiles llama.cpp directly from source:
# Multi-stage CUDA builder
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y git cmake build-essential
RUN git clone --recursive https://github.com/ggerganov/llama.cpp.git && \
    cd llama.cpp && mkdir build && cd build && \
    cmake -DGGML_CUDA=ON .. && make -j$(nproc) llama-server
# Runtime image copy
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
COPY --from=builder /llama.cpp/build/bin/llama-server /usr/local/bin/llama-server
We then wrap the llama-server binary in a Python subprocess, performing HTTP-level health checks before exposing it to the gateway.
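A minimal sketch of such a wrapper is shown below, using only the standard library. The function names, port, and timeouts are illustrative assumptions rather than the deployment's actual code. It relies on llama-server's `/health` endpoint and its `-m`, `--host`, and `--port` flags.

```python
import subprocess
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout_s=60.0, interval_s=0.5):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, or a non-2xx status while the model loads.
            pass
        time.sleep(interval_s)
    return False


def start_llama_server(model_path, port=8080):
    """Launch llama-server and return the process once it reports healthy."""
    proc = subprocess.Popen(
        ["llama-server", "-m", model_path,
         "--host", "127.0.0.1", "--port", str(port)]
    )
    if not wait_for_health(f"http://127.0.0.1:{port}/health"):
        proc.terminate()
        raise RuntimeError("llama-server did not become healthy in time")
    return proc
```

Polling over HTTP rather than checking that the process is alive matters here because the process starts well before the model finishes loading; only a successful health response indicates it can accept traffic.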
Quantization Performance Breakdown
| Quantization | Size on Disk | Inference Speed | Perplexity Loss | Best Engine |
|---|---|---|---|---|
| Unquantized (BF16) | ~74 GB | 18 tok/s (H100) | 0.00% (Baseline) | vLLM |
| FP8 (Dynamic) | ~37 GB | 35 tok/s (L40S) | < 0.05% | SGLang |
| GGUF (Q4_K_M) | ~21 GB | 24 tok/s (L4) | ~0.85% | llama.cpp |
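As a rough sanity check on the sizes in the table, the bits stored per weight can be backed out from the on-disk figures. The sketch below assumes 1 GB = 10^9 bytes, that BF16 stores exactly 2 bytes per weight, and that the files contain nothing but weights; the helper name is hypothetical and the results are approximations derived from the rounded table values, not measurements.

```python
def implied_bits_per_weight(size_gb, n_params):
    """Average bits per parameter implied by a file of `size_gb` gigabytes."""
    return size_gb * 1e9 * 8 / n_params


# BF16 uses 2 bytes per weight, so ~74 GB implies roughly 37e9 parameters.
n_params = 74 * 1e9 / 2

fp8_bits = implied_bits_per_weight(37, n_params)   # FP8 row
q4km_bits = implied_bits_per_weight(21, n_params)  # GGUF Q4_K_M row

print(f"FP8:    {fp8_bits:.2f} bits/weight")
print(f"Q4_K_M: {q4km_bits:.2f} bits/weight")
```

Under these assumptions the FP8 row works out to 8 bits per weight, and the Q4_K_M row to about 4.5, which is above a flat 4 bits because mixed-precision GGUF formats keep some tensors and per-block scales at higher precision.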
Key Learnings
- SGLang wins for structured routing: SGLang compiles the output JSON schema directly into a regex guide, running up to 5× faster than vLLM's out-of-line JSON parsers.
- vLLM is superior for multi-node setups: When scaling beyond a single node, vLLM's Ray integration makes tensor parallelism across machines seamless.
- Never use llama.cpp for high concurrency: llama.cpp lacks dynamic continuous batching; it queues concurrent requests synchronously, resulting in high Time-to-First-Token (TTFT) under load.
Source: §2 (GLM-4.7-Flash-FP8), §5 (Gemma-4-26B-it), §6 (MiniMax-M2.7).