Technical Deep Dive

Building an Evaluation Harness

Designing a deterministic test suite with regex, code execution, and LLM-as-judge to verify quantized model quality.

Systems & Infrastructure·5 min read·Production Verified

TL;DR Summary

  • Triple-Layer Evaluation: Built a harness running 210 scenarios scoring GGUF models via Regex Extraction, Sandboxed Subprocess Code Execution, and async LLM-as-a-Judge (Claude).
  • Custom Logits Processor: Designed a 3-State Finite State Machine (FSM) Logits Processor that caps the "thinking" token budget by forcing the closing </think> token: its logit stays finite while every other logit is set to -inf.
  • Latency & Accuracy: Hard-capping reasoning tokens at 50 reduced execution latency by roughly 71% (3.8s down to 1.1s) while retaining 96.2% accuracy on deterministic logic puzzles.

Quantizing a model (such as GLM-4.7-Flash to GGUF) reduces its memory footprint but can degrade reasoning, syntax parsing, and function-calling abilities. To verify that quantization had not degraded our deployment's capabilities, we built a custom, deterministic evaluation harness running 210 specific test scenarios.

The Evaluation Architecture

The harness tests the model across logic, coding, and reasoning categories. To ensure evaluation is fast and automated, we designed three scoring mechanisms:

| Evaluation Method | Test Target | Implementation | Metric Tracked |
|---|---|---|---|
| Regex Extraction | Structured JSON & Math | Strict regex pattern matching | Format compliance |
| Code Execution | Algorithm & Syntax correctness | Subprocess shell execution in a sandbox | Run-time compile and output match |
| LLM-as-a-Judge | Open-ended responses | Asynchronous grading via Claude-3.5-Sonnet | Semantic accuracy |
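
To make the regex layer concrete, here is a minimal sketch of how a format-compliance check might look. The `Answer:` convention, both patterns, and the function names are assumptions made for illustration; the harness's actual patterns are not shown in this article.

```python
import json
import re

# Hypothetical convention: math responses end with a line like "Answer: 42".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)
# Hypothetical convention: structured output is the outermost {...} span.
JSON_RE = re.compile(r"\{.*\}", re.DOTALL)


def extract_answer(response: str):
    """Return the last numeric answer as a float, or None if the format is missing."""
    matches = ANSWER_RE.findall(response)
    return float(matches[-1]) if matches else None


def json_is_compliant(response: str, required_keys: set) -> bool:
    """True if the response contains a JSON object holding every required key."""
    match = JSON_RE.search(response)
    if match is None:
        return False
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload)


print(extract_answer("Let me think.\nAnswer: 42"))
print(json_is_compliant('Result: {"name": "add", "args": {}}', {"name", "args"}))
```

Because both checks are pure string operations, they return the same result on every run, which is what keeps this layer deterministic.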

The Code Execution Sandbox

For coding tests, the model is asked to write an algorithm. The harness parses the generated code block, writes it to a temporary file, and runs it in a separate subprocess under a strict wall-clock timeout. A run passes only if the process exits with code 0 before the timeout expires:

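import os
import subprocess
import sys
import tempfile

def run_sandboxed_code(code_str: str, timeout_sec: float = 2.0) -> bool:
    """Run generated code in a child interpreter; True if it exits cleanly in time."""
    # delete=False keeps the file on disk after the `with` block closes it
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
        f.write(code_str.encode("utf-8"))
        f_name = f.name

    try:
        # Run with the current interpreter; `timeout` is a wall-clock limit
        result = subprocess.run(
            [sys.executable, f_name],
            capture_output=True,
            text=True,
            timeout=timeout_sec
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        # Always remove the temporary file, even on timeout
        os.unlink(f_name)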
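
The function above checks only the exit code. As a hedged sketch of how the "output match" metric and a CPU-time cap could be layered on top, the variant below compares stdout against an expected string and, on POSIX systems, applies `RLIMIT_CPU` to the child process. The name `run_and_match` and its parameters are hypothetical, and this is resource limiting rather than a security boundary: it does not restrict file or network access.

```python
import os
import subprocess
import sys
import tempfile

try:
    import resource  # POSIX only; not available on Windows
except ImportError:
    resource = None


def _limit_cpu(seconds: int):
    """Build a hook that caps CPU time inside the child process."""
    def hook():
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, seconds))
    return hook


def run_and_match(code_str: str, expected_stdout: str,
                  timeout_sec: float = 2.0, cpu_sec: int = 1) -> bool:
    """Run code in a child interpreter and compare its stdout to the expected text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False,
                                     encoding="utf-8") as f:
        f.write(code_str)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env vars and user site
            capture_output=True,
            text=True,
            timeout=timeout_sec,           # wall-clock limit
            preexec_fn=_limit_cpu(cpu_sec) if resource else None,  # CPU-time limit
        )
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


print(run_and_match("print(1 + 1)", "2"))
```

A wall-clock timeout catches code that sleeps or blocks, while the CPU limit catches busy loops even if the host is under load, so the two limits complement each other.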

The Logits Processor: Controlling the "Thinking" Budget

During evaluation, we observed that GLM-4.7-Flash-GGUF spent up to 80% of its generation time on internal reasoning (wrapped inside <think>...</think> blocks) before writing a simple answer. For latency-sensitive APIs, this is unacceptable.

Since llama.cpp had no native parameter to stop thinking at a precise token threshold, we wrote a custom 3-State Logits Processor inside our Python serving handler.

The 3-State Finite State Machine (FSM)

The logits processor is called at every decoding step with the logits produced by the model's forward pass, before the next token is sampled:

  1. COUNTING: The processor counts each token generated while the model is inside its thinking block.
  2. FORCING: Once the token budget (e.g., 50 tokens) is hit, the processor overrides the model's logits, setting the logit of the closing tag </think> to 0.0 and every other logit to -inf. After softmax the closing tag has probability 1, which forces the model to exit its thinking loop.
  3. DONE: Once the closing tag is emitted, the processor steps out of the way, returning the model to regular generation.
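
import torch

class ThinkingBudgetLogitsProcessor:
    """Caps thinking tokens, then forces the end-of-thought token."""

    def __init__(self, budget: int, think_end_token_id: int):
        self.budget = budget
        self.think_end_id = think_end_token_id
        self.tokens_seen = 0
        self.state = "COUNTING"  # COUNTING -> FORCING -> DONE

    def __call__(self, input_ids, scores):
        if self.state == "DONE":
            return scores

        # If the model already closed its thinking block, stop intervening.
        # This check assumes one sequence per call (batch size 1).
        prev = torch.as_tensor(input_ids).reshape(-1)
        if prev.numel() > 0 and int(prev[-1]) == self.think_end_id:
            self.state = "DONE"
            return scores

        self.tokens_seen += 1

        if self.state == "COUNTING" and self.tokens_seen >= self.budget:
            self.state = "FORCING"

        if self.state == "FORCING":
            # Force the end-of-thought token: logit 0.0 for </think>, -inf elsewhere
            new_scores = torch.full_like(scores, float("-inf"))
            # Index the last (vocabulary) axis so 1-D and batched scores both work
            new_scores[..., self.think_end_id] = 0.0
            self.state = "DONE"
            return new_scores

        return scores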
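
To see why the FORCING branch leaves the sampler no choice, it helps to run the masked logits through a softmax by hand. The sketch below uses plain Python rather than torch, and the four-token vocabulary and the helper names are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def force_token(logits, token_id):
    """Mask every logit to -inf except the forced token, which is set to 0.0."""
    forced = [float("-inf")] * len(logits)
    forced[token_id] = 0.0
    return forced


logits = [2.0, -1.0, 0.5, 3.0]           # toy vocabulary of four tokens
probs = softmax(force_token(logits, 2))  # pretend token 2 is </think>
print(probs)  # [0.0, 0.0, 1.0, 0.0]
```

The value 0.0 is arbitrary: any finite logit gives the same result, because it is the only term that survives the exponentiation. This holds for greedy decoding and for temperature or top-p sampling alike.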

Results & Impact

Injecting the logits processor with a 50-token budget cut latency by roughly 71% (3.8s to 1.1s on logic puzzles, 4.5s to 1.3s on coding algorithms) at a cost of 3.8 and 0.5 percentage points of accuracy, respectively.

| Test Class | Thinking Budget | Execution Latency | Accuracy Score |
|---|---|---|---|
| Logic Puzzles | Unlimited | 3.8s | 100.0% |
| Logic Puzzles | 50 Tokens | 1.1s | 96.2% |
| Coding Algorithms | Unlimited | 4.5s | 82.4% |
| Coding Algorithms | 50 Tokens | 1.3s | 81.9% |
| Formatting / JSON | 0 (Disabled) | 0.4s | 98.1% |
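
The relative latency change and the accuracy cost can be read directly off the table rows. This short calculation uses only the numbers above; the variable and function names are illustrative.

```python
# (test class, unlimited latency s, capped latency s, unlimited accuracy %, capped accuracy %)
rows = [
    ("Logic Puzzles", 3.8, 1.1, 100.0, 96.2),
    ("Coding Algorithms", 4.5, 1.3, 82.4, 81.9),
]


def latency_reduction_pct(before: float, after: float) -> float:
    """Relative latency reduction, as a percentage rounded to one decimal."""
    return round((before - after) / before * 100, 1)


summary = {
    name: (latency_reduction_pct(lat_before, lat_after), round(acc_before - acc_after, 1))
    for name, lat_before, lat_after, acc_before, acc_after in rows
}

for name, (reduction, acc_drop) in summary.items():
    print(f"{name}: latency -{reduction}%, accuracy -{acc_drop} points")
# Logic Puzzles: latency -71.1%, accuracy -3.8 points
# Coding Algorithms: latency -71.1%, accuracy -0.5 points
```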

Key Learnings

  1. Deterministic sandboxes catch syntax errors: Code execution checks found that some quantized variants generated invalid Python indentation, which we attribute to quantization noise affecting whitespace tokens.
  2. Reasoning tokens are not always necessary: Capping the thinking budget dynamically allows us to trade reasoning depth for latency on a per-request basis.
  3. Chat templates must be verified: GGUF models are highly sensitive to missing special tokens like <|im_start|>. The harness tests verified these were correctly formatted.
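
As an illustration of the per-request trade-off in the second learning, a serving handler could pick a thinking budget from the request's task type. The budgets below mirror the results table, while the task labels, the `override` parameter, and the function name are assumptions rather than the deployed API.

```python
from typing import Optional

# Budgets mirror the results table; the task labels are hypothetical.
THINKING_BUDGETS = {"logic": 50, "coding": 50, "json": 0}


def thinking_budget(task_type: str, override: Optional[int] = None) -> Optional[int]:
    """Return the thinking-token cap for a request; None means unlimited."""
    if override is not None:
        if override < 0:
            raise ValueError("thinking budget must be non-negative")
        return override
    # Unknown task types fall back to unlimited thinking.
    return THINKING_BUDGETS.get(task_type)


print(thinking_budget("json"))                # 0
print(thinking_budget("logic"))               # 50
print(thinking_budget("chat"))                # None
print(thinking_budget("logic", override=10))  # 10
```

A handler built this way would create a fresh `ThinkingBudgetLogitsProcessor` per request with the returned budget, since the processor keeps per-generation state.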

Source: §3 (GLM-4.7-Flash-GGUF).
