Technical Deep Dive

Serving Engine Internals

Source-level comparisons of vLLM, SGLang, and llama.cpp — quantization tradeoffs, memory leaks, and when each engine wins.

Systems & Infrastructure · 6 min read · Production Verified

TL;DR Summary

  • Engine Comparison: Evaluated vLLM, SGLang, and llama.cpp across 10 production model deployments, analyzing quantization tradeoffs, memory leaks, and parallelization capabilities.
  • vLLM Snapshot Bug: Discovered that vLLM's default CUDA Graph capture caused infinite hangs upon container snapshot restoration. Fixed by forcing eager execution (enforce_eager=True).
  • SGLang Memory Leak: Identified and mitigated a slow VRAM leak in SGLang's FlashInfer allocation during tensor parallel sharding by capping prefill sizes and forcing PyTorch caching allocator garbage collection.
  • llama.cpp Limitations: Built a custom source-compiled Docker pipeline for llama.cpp to support new GGUF architectures, but determined it is unsuitable for high-concurrency production due to its lack of dynamic continuous batching.

Running LLMs in production requires choosing the right runtime engine. Across 10 model deployments, we evaluated the three dominant engines: vLLM, SGLang, and llama.cpp. We discovered deep architectural differences, memory leaks, and engine-specific quirks that impact scalability.

Engine Capability Map

| Feature / Metric | vLLM (v0.7.2) | SGLang (v0.4.3) | llama.cpp (b8765) |
| --- | --- | --- | --- |
| Primary Backend | PyTorch / Triton / vLLM Kernels | PyTorch / FlashInfer / Triton | Pure C / C++ (GGML) |
| Quantization Support | FP8, AWQ, GPTQ, INT4 | FP8, AWQ, GPTQ | GGUF (Q4_K_M, IQ4_NL, etc.) |
| Attention Optimization | PagedAttention, FlashAttention | RadixAttention, FlashInfer | Ring Buffer, FlashAttention |
| KV Cache Reuse | Prefix Caching (static) | Radix Cache (dynamic LRU) | Smart KV Cache shift |
| Multi-GPU Parallelism | TP + PP (highly stable) | TP (stable), PP (experimental) | Row/Column splitting (single node) |

vLLM: The Industry Standard

vLLM is the most mature engine for high-throughput batching. However, it exhibits critical gotchas when combined with serverless GPU checkpointing (CRIU).

The Snapshot Hang Bug

During deployment of GLM-4.7-Flash-FP8, the container would hang indefinitely upon snapshot restore. The root cause was vLLM's default **CUDA Graph capture**. At startup, vLLM captures CUDA graphs for shapes up to max_num_seqs=16.

When Modal captures a GPU snapshot, the captured CUDA graphs refer to memory addresses that become invalid upon container restoration. The fix was forcing eager execution or running eager warmups:

# vLLM initialization with CUDA graph capture disabled
from vllm import LLM

llm = LLM(
    model="THUDM/glm-4-9b-chat",  # example model ID; substitute the FP8 checkpoint you deploy
    quantization="fp8",
    enforce_eager=True,  # skip CUDA graph capture so snapshot restores do not hang
    max_model_len=8192,
)

SGLang: The Low-Latency Contender

SGLang excels at complex, multi-turn prompts and structured JSON outputs thanks to **RadixAttention**, which stores KV-cache entries in a radix tree with LRU eviction so that requests sharing a prefix can reuse them.
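To make the prefix-reuse idea concrete, here is a toy sketch of a prefix cache with LRU eviction. It is not SGLang's implementation: it uses one trie node per token instead of a compressed radix tree, it stores no actual KV tensors, and every name in it (`Node`, `ToyPrefixCache`, `match_prefix`) is hypothetical.

```python
import itertools


class Node:
    """One cached token; a real engine would attach that token's KV block here."""

    def __init__(self, parent=None, token=None):
        self.parent = parent
        self.token = token
        self.children = {}
        self.last_used = 0


class ToyPrefixCache:
    """Token-level trie with least-recently-used leaf eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # maximum number of cached tokens
        self.root = Node()
        self.size = 0
        self._clock = itertools.count(1)  # logical time for recency tracking

    def match_prefix(self, tokens) -> int:
        """Return how many leading tokens are cached, refreshing their recency."""
        now = next(self._clock)
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            child.last_used = now
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens) -> None:
        """Cache a token sequence, then evict old leaves if over capacity."""
        now = next(self._clock)
        node = self.root
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = Node(node, tok)
                node.children[tok] = child
                self.size += 1
            child.last_used = now
            node = child
        while self.size > self.capacity:
            self._evict_lru_leaf()

    def _leaves(self):
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node is not self.root and not node.children:
                yield node
            stack.extend(node.children.values())

    def _evict_lru_leaf(self) -> None:
        # Only leaves are evicted, so shared prefixes outlive their rarely used tails.
        victim = min(self._leaves(), key=lambda n: n.last_used)
        del victim.parent.children[victim.token]
        self.size -= 1
```

A new request calls `match_prefix` first and only needs to prefill the tokens after the matched length, which is where the multi-turn latency win comes from.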

The FlashInfer Memory Leak

During stress testing of the MiniMax-M2.7 (456B MoE) model on a 4× B200 cluster, we detected a slow but steady VRAM creep under concurrent streaming requests. Memory profiling showed that SGLang's FlashInfer attention-buffer allocation was not freeing intermediate PyTorch tensor handles while the model was sharded across tensor-parallel ranks.

We mitigated this by forcing garbage collection and flushing the PyTorch caching allocator after initialization, and by setting SGLang's --chunked-prefill-size 4096 to cap the size of each prefill allocation.
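The garbage-collection step can be sketched roughly as follows. `gc.collect()` and `torch.cuda.empty_cache()` are real calls, but the helper name `flush_gpu_memory` is hypothetical, and where it should be hooked into SGLang's tensor-parallel worker processes is an assumption the text above does not spell out.

```python
import gc


def flush_gpu_memory():
    """Drop unreachable Python objects, then release cached CUDA blocks.

    Returns (objects_collected, cuda_cache_flushed).
    """
    # Collect first, so tensors kept alive only by reference cycles are freed
    # before the allocator is asked to give memory back.
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected, False  # no PyTorch in this process, nothing to flush

    if torch.cuda.is_available():
        # Returns unused cached blocks to the driver; tensors that are still
        # referenced are not affected.
        torch.cuda.empty_cache()
        return collected, True
    return collected, False
```

Because `empty_cache()` cannot free memory that is still referenced, this treats the symptom only; the prefill cap is what bounds how large each allocation can grow.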

llama.cpp: Edge & CPU Portability

When serving GGUF models (like Gemma-4-26B-A4B-it), llama.cpp is the only engine of the three that runs across heterogeneous hardware (such as MacBooks or inexpensive L4 GPUs).

Source-Compilation Pipeline

To support the latest Gemma-4 architectures before they landed in stable package releases, we built a custom multi-stage Docker build that compiles llama.cpp directly from source:

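# Stage 1: compile llama-server with CUDA support
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y git cmake build-essential
# BUILD_SHARED_LIBS=OFF links libllama/ggml statically, so the binary can be copied on its own
RUN git clone --recursive https://github.com/ggerganov/llama.cpp.git && \
    cd llama.cpp && \
    cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF && \
    cmake --build build --target llama-server -j"$(nproc)"

# Stage 2: slim runtime image that carries only the server binary
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
# libgomp1 provides the OpenMP runtime that ggml's CPU backend links against
RUN apt-get update && apt-get install -y libgomp1 && rm -rf /var/lib/apt/lists/*
COPY --from=builder /llama.cpp/build/bin/llama-server /usr/local/bin/llama-server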

We then wrap the llama-server binary in a Python subprocess, performing HTTP-level health checks before exposing it to the gateway.
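A minimal sketch of such a wrapper is shown below. It relies on llama-server's `/health` endpoint and its `-m`, `--host`, and `--port` flags; the function names, the default port, and the timeout values are illustrative assumptions, not the pipeline's actual code.

```python
import subprocess
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if GET `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while the model loads.
        return False


def wait_until_healthy(probe, deadline_s: float = 300.0,
                       interval_s: float = 1.0, proc=None) -> bool:
    """Poll `probe()` until it succeeds, the deadline passes, or `proc` exits."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if proc is not None and proc.poll() is not None:
            return False  # server process died during startup
        if probe():
            return True
        time.sleep(interval_s)
    return False


def launch_llama_server(model_path: str, port: int = 8080) -> subprocess.Popen:
    """Start llama-server and return the process once it reports healthy."""
    cmd = ["llama-server", "-m", model_path,
           "--host", "127.0.0.1", "--port", str(port)]
    proc = subprocess.Popen(cmd)
    url = f"http://127.0.0.1:{port}/health"
    if not wait_until_healthy(lambda: is_healthy(url), proc=proc):
        proc.terminate()
        raise RuntimeError("llama-server did not become healthy in time")
    return proc
```

Checking `proc.poll()` inside the loop means a crash during model load fails fast instead of waiting out the full deadline.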

Quantization Performance Breakdown

| Quantization | Size on Disk | Inference Speed | Perplexity Loss | Best Engine |
| --- | --- | --- | --- | --- |
| Unquantized (BF16) | ~74 GB | 18 tok/s (H100) | 0.00% (Baseline) | vLLM |
| FP8 (Dynamic) | ~37 GB | 35 tok/s (L40S) | < 0.05% | SGLang |
| GGUF (Q4_K_M) | ~21 GB | 24 tok/s (L4) | ~0.85% | llama.cpp |
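As a sanity check on the size column, the figures can be converted into implied bits per weight. The sketch below assumes 1 GB = 10^9 bytes, assumes the on-disk size is dominated by weights, and infers the parameter count from the BF16 row (16 bits per weight); the function names are illustrative.

```python
def params_from_size(size_gb: float, bits_per_weight: float) -> float:
    """Estimate the parameter count from a checkpoint size and a known bit width."""
    return size_gb * 1e9 * 8 / bits_per_weight


def implied_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average number of bits stored per parameter for a given checkpoint size."""
    return size_gb * 1e9 * 8 / n_params


# BF16 stores 16 bits per weight, so ~74 GB implies roughly 37B parameters.
n_params = params_from_size(74, 16)

for name, size_gb in [("BF16", 74), ("FP8", 37), ("GGUF Q4_K_M", 21)]:
    bits = implied_bits_per_weight(size_gb, n_params)
    print(f"{name}: {bits:.2f} bits/weight")
# BF16: 16.00 bits/weight
# FP8: 8.00 bits/weight
# GGUF Q4_K_M: 4.54 bits/weight
```

The Q4_K_M result lands above 4 bits because that format stores scale metadata alongside the 4-bit values and keeps some tensors at higher precision.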

Key Learnings

  1. SGLang wins for structured routing: SGLang compiles the output JSON schema directly into a regex guide, running up to 5× faster than vLLM's Out-of-Line JSON parsers.
  2. vLLM is superior for multi-node setups: When scaling beyond a single node, vLLM's Ray integration makes Tensor Parallelism across machines seamless.
  3. Never use llama.cpp for high concurrency: llama.cpp lacks dynamic continuous batching; it queues concurrent requests synchronously, resulting in high Time-to-First-Token (TTFT) when loaded.
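To illustrate the first learning, the schema-to-regex step can be sketched for a deliberately tiny case: a flat object with scalar fields, emitted as compact JSON in a fixed key order. This is a toy under those assumptions, not SGLang's compiler; `TYPE_PATTERNS` and `schema_to_regex` are hypothetical names, and a real engine goes further by turning the pattern into a state machine that masks invalid tokens during decoding.

```python
import json
import re

# Regex fragments for a few JSON scalar types (strings without escapes, for brevity).
TYPE_PATTERNS = {
    "string": r'"[^"\\]*"',
    "integer": r"-?(?:0|[1-9][0-9]*)",
    "boolean": r"(?:true|false)",
}


def schema_to_regex(schema: dict) -> str:
    """Build a regex for compact JSON objects with the schema's properties, in order."""
    parts = []
    for name, spec in schema["properties"].items():
        key = re.escape(json.dumps(name))  # the quoted key, e.g. "route"
        parts.append(f"{key}:{TYPE_PATTERNS[spec['type']]}")
    return r"\{" + ",".join(parts) + r"\}"


routing_schema = {
    "type": "object",
    "properties": {
        "route": {"type": "string"},
        "priority": {"type": "integer"},
    },
}
routing_pattern = re.compile(schema_to_regex(routing_schema))
```

With this pattern, `{"route":"billing","priority":2}` matches in full, while an output whose `priority` is a string does not; constraining generation to the pattern means such invalid outputs are never produced, so nothing has to be parsed and retried afterwards.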

Source: §2 (GLM-4.7-Flash-FP8), §5 (Gemma-4-26B-it), §6 (MiniMax-M2.7).
