✨TL;DR Summary
- Triple-Layer Evaluation: Built a harness running 210 scenarios scoring GGUF models via Regex Extraction, Sandboxed Subprocess Code Execution, and async LLM-as-a-Judge (Claude).
- Custom Logits Processor: Designed a 3-State Finite State Machine (FSM) Logits Processor that hard-caps the "thinking" token budget by setting the logit of every token except the closing tag to `-inf`.
- Latency & Accuracy: Hard-capping reasoning tokens at 50 cut execution latency by roughly 71% (e.g., 3.8s down to 1.1s) while retaining 96.2% accuracy on deterministic logic puzzles.
Quantizing a model (like GLM-4.7-Flash to GGUF) reduces memory footprints but can degrade reasoning, syntax parsing, and function-calling abilities. To verify that quantization did not destroy our deployment's capabilities, we built a custom, deterministic evaluation harness running 210 specific test scenarios.
The Evaluation Architecture
The harness tests the model across logic, coding, and reasoning categories. To ensure evaluation is fast and automated, we designed three scoring mechanisms:
| Evaluation Method | Test Target | Implementation | Metric Tracked |
|---|---|---|---|
| Regex Extraction | Structured JSON & Math | Strict regex pattern matching | Format compliance |
| Code Execution | Algorithm & Syntax correctness | Subprocess execution in a sandbox | Successful run and output match |
| LLM-as-a-Judge | Open-ended responses | Asynchronous grading via Claude-3.5-Sonnet | Semantic accuracy |
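As a rough illustration of the regex-extraction scorer, the sketch below pulls a JSON object or a final numeric answer out of a raw completion and checks it for format compliance. The helper names (`extract_json`, `extract_final_number`, `score_json`) and the `Answer: <number>` convention are assumptions made for this example; the harness's actual patterns are not shown in this post.

```python
import json
import re

# Hypothetical patterns, chosen for illustration only.
JSON_BLOCK = re.compile(r"\{.*\}", re.DOTALL)  # first "{" through last "}"
FINAL_NUMBER = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)")


def extract_json(completion: str):
    """Parse the span from the first '{' to the last '}' as JSON, or return None."""
    match = JSON_BLOCK.search(completion)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


def extract_final_number(completion: str):
    """Return the number after the last 'Answer:' marker, or None if absent."""
    matches = FINAL_NUMBER.findall(completion)
    return float(matches[-1]) if matches else None


def score_json(completion: str, required_keys: set) -> bool:
    """Format compliance: the output parses as a JSON object with all required keys."""
    parsed = extract_json(completion)
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())
```

Taking the last `Answer:` match rather than the first is a deliberate choice here: a model that revises its answer mid-response is scored on what it finally committed to.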
The Code Execution Sandbox
For coding tests, the model is asked to write an algorithm. The harness parses the generated code block, writes it to a temporary file, and runs it in a restricted subprocess with a strict wall-clock timeout:
```python
import os
import subprocess
import sys
import tempfile


def run_sandboxed_code(code_str: str, timeout_sec: float = 2.0) -> bool:
    """Run generated code in a child process; True means it exited with status 0."""
    # delete=False lets us close the file here and have the child process reopen it.
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
        f.write(code_str.encode("utf-8"))
        f_name = f.name

    try:
        # The wall-clock timeout is the only resource cap applied in this snippet.
        result = subprocess.run(
            [sys.executable, f_name],
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(f_name)
```

The Logits Processor: Controlling the "Thinking" Budget
During evaluation, we observed that GLM-4.7-Flash-GGUF spent up to 80% of its generation time on internal reasoning (wrapped inside <think>...</think> blocks) before writing a simple answer. For latency-sensitive APIs, this is unacceptable.
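One way to put a number on that overhead is to compare the size of the `<think>` block with the whole completion. The sketch below is an assumption about how such a measurement could be made, not the harness's own code: `thinking_share` is a hypothetical helper, and it counts whitespace-separated words as a crude stand-in for tokens, so it approximates token share rather than wall-clock time.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def thinking_share(completion: str) -> float:
    """Fraction of whitespace-separated words that sit inside <think>...</think>."""
    # Drop the tags themselves so they are not counted as words.
    visible = completion.replace("<think>", " ").replace("</think>", " ")
    total = len(visible.split())
    if total == 0:
        return 0.0
    thinking = sum(len(block.split()) for block in THINK_BLOCK.findall(completion))
    return thinking / total
```

For a real measurement the word count would be swapped for the model tokenizer's token count, since word and token boundaries can differ substantially.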
Since llama.cpp had no native parameter to stop thinking at a precise token threshold, we wrote a custom 3-State Logits Processor inside our Python serving handler.
The 3-State Finite State Machine (FSM)
The logits processor runs on the logits produced by each forward pass, before the next token is sampled, and moves through three states:
- COUNTING: The processor counts generated tokens; each one that belongs to the thinking channel increments the count.
- FORCING: Once the token budget (e.g., 50 tokens) is hit, the processor overrides the model's logits, setting the logit of the closing tag `</think>` to `0.0` and every other token's logit to `-inf`. After softmax the closing tag has probability 1, which forces the model out of its thinking loop.
- DONE: Once the closing tag is emitted, the processor steps out of the way and returns the logits unchanged, so regular generation resumes.
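```python
import numpy as np


class ThinkingBudgetLogitsProcessor:
    """Cap the number of tokens generated inside <think>...</think>.

    Assumes the llama-cpp-python convention: 1-D NumPy arrays for
    input_ids (tokens so far) and scores (one logit per vocabulary entry).
    """

    def __init__(self, budget: int, think_end_token_id: int):
        self.budget = budget
        self.think_end_id = think_end_token_id
        self.tokens_seen = 0
        self.state = "COUNTING"  # COUNTING -> FORCING -> DONE

    def __call__(self, input_ids, scores):
        # The block is closed once </think> appears, whether the model
        # emitted it on its own or we forced it on the previous step.
        if len(input_ids) > 0 and input_ids[-1] == self.think_end_id:
            self.state = "DONE"
        if self.state == "DONE":
            return scores

        self.tokens_seen += 1
        if self.state == "COUNTING" and self.tokens_seen >= self.budget:
            self.state = "FORCING"

        if self.state == "FORCING":
            # Mask every token except the end-of-thought token.
            new_scores = np.full_like(scores, -np.inf)
            new_scores[self.think_end_id] = 0.0
            return new_scores
        return scores
```

Results & Impact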
Injecting the logits processor cut latency by roughly 71% on short, deterministic queries (3.8s to 1.1s on logic puzzles, 4.5s to 1.3s on coding) at a cost of 3.8 points of accuracy on logic puzzles and 0.5 points on coding.
| Test Class | Thinking Budget | Execution Latency | Accuracy Score |
|---|---|---|---|
| Logic Puzzles | Unlimited | 3.8s | 100.0% |
| Logic Puzzles | 50 Tokens | 1.1s | 96.2% |
| Coding Algorithms | Unlimited | 4.5s | 82.4% |
| Coding Algorithms | 50 Tokens | 1.3s | 81.9% |
| Formatting / JSON | 0 (Disabled) | 0.4s | 98.1% |
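The relative savings follow directly from the table. The short calculation below recomputes the latency reduction and the accuracy change for the two capped classes; the figures are copied from the table, and the `summarize` helper is only illustrative.

```python
# (latency in seconds, accuracy in percent), copied from the results table.
RESULTS = {
    "Logic Puzzles": {"unlimited": (3.8, 100.0), "capped": (1.1, 96.2)},
    "Coding Algorithms": {"unlimited": (4.5, 82.4), "capped": (1.3, 81.9)},
}


def summarize(results: dict) -> dict:
    """Return latency reduction (%) and accuracy change (points) per test class."""
    summary = {}
    for name, runs in results.items():
        base_latency, base_acc = runs["unlimited"]
        capped_latency, capped_acc = runs["capped"]
        reduction = 100 * (base_latency - capped_latency) / base_latency
        summary[name] = {
            "latency_reduction_pct": round(reduction, 1),
            "accuracy_change_pts": round(capped_acc - base_acc, 1),
        }
    return summary


for name, row in summarize(RESULTS).items():
    print(name, row)
```

Both classes come out at about a 71% latency reduction, with accuracy changes of -3.8 points for logic puzzles and -0.5 points for coding.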
Key Learnings
- Deterministic sandboxes catch syntax errors: Code execution checks found that some quantized variants generated Python with invalid indentation, which we attribute to quantization noise affecting whitespace tokens.
- Reasoning tokens are not always necessary: Capping the thinking budget dynamically lets us trade a small amount of accuracy for lower latency on a per-request basis.
- Chat templates must be verified: GGUF models are highly sensitive to missing special tokens like `<|im_start|>`. The harness tests verified these were correctly formatted.
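A template check of this kind can be as simple as asserting that the rendered prompt contains every required special token and that turns open and close consistently. The sketch below assumes a ChatML-style layout built around `<|im_start|>` and `<|im_end|>`; `render_chatml` is a hypothetical stand-in for the real template, and the token set may differ from the one the deployed model expects.

```python
def render_chatml(messages: list) -> str:
    """Render messages in a ChatML-style layout (illustrative, not the real template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # open turn for the model to complete
    return "".join(parts)


def missing_special_tokens(prompt: str, required: list) -> list:
    """Return the required special tokens that do not appear in the rendered prompt."""
    return [token for token in required if token not in prompt]


def turns_are_balanced(prompt: str) -> bool:
    """Every finished turn is closed; only the trailing generation prompt stays open."""
    return prompt.count("<|im_start|>") == prompt.count("<|im_end|>") + 1
```

Running both checks on the exact string sent to the model, rather than on the message list, is what catches a template that silently drops a token.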
Source: §3 (GLM-4.7-Flash-GGUF).