🧠 Strategy Overview
10 Discovery Strategies
Delta-scan and full-scan modes, SARIF-guided refinement, call-path targeting, and input synthesis blended with sanitizers and coverage to produce fast, verifiable POVs.
13 Patching Strategies
From minimal, path-aware guards to structural refactors and XPatch (patching without a POV), all gated by compile, test, and POV-negation checks.
Multi-Model Orchestration
Routing and fallback across Anthropic, OpenAI, and Google models, with quotas, backoff, and success-rate tracking to avoid cascade failures.
🔍 Discovery Strategies (10)
Discovery aims to produce a robust proof-of-vulnerability (POV). Strategies combine static signals (SARIF, call graphs, reachability) with dynamic feedback (sanitizers, coverage) to steer LLMs toward executable triggers.
1) Delta-Scan (Patch Diff Focus)
Prioritize files and functions touched by recent changes. Parse diffs, map to call paths, and have the LLM hypothesize likely CWE classes and inputs that traverse the modified path.
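As a minimal sketch of the diff-to-target step (the function name and diff content below are illustrative, not from our pipeline), git's default hunk headers already carry the enclosing function name, which is often enough to seed call-path analysis:

```python
import re

def touched_functions(diff_text):
    """Map a unified diff to (file, hunk-context) pairs; the text after
    the second @@ is the enclosing declaration in git's default output."""
    targets = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif line.startswith("@@") and current_file:
            m = re.match(r"@@ [^@]+ @@\s*(.*)", line)
            targets.append((current_file, m.group(1) if m else ""))
    return targets

diff = """\
+++ b/src/png.c
@@ -10,6 +10,7 @@ int png_read_chunk(struct reader *r)
+    len = read_u32(r);
"""
print(touched_functions(diff))
# [('src/png.c', 'int png_read_chunk(struct reader *r)')]
```

Each (file, function) pair then becomes the focus of a prompt asking the LLM to hypothesize CWE classes and inputs reaching that code.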
2) Full-Scan (Hotspot Ranking)
Rank all files using heuristics (unsafe APIs, complexity, historical bug density). Use LLM to draft targeted test harnesses per hotspot with auto-build/run loops.
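A toy version of such a ranking heuristic might look like the following (the weights, API list, and `hotspot_score` name are illustrative assumptions, not our tuned values):

```python
import re

UNSAFE_APIS = ("strcpy", "sprintf", "memcpy", "alloca", "gets")

def hotspot_score(source, churn=0):
    """Cheap heuristic: unsafe-API hits, a branch-count complexity proxy,
    and recent churn (e.g., commit count) combined into one sortable score."""
    unsafe = sum(source.count(api) for api in UNSAFE_APIS)
    branches = len(re.findall(r"\b(if|for|while|case)\b", source))
    return 5 * unsafe + branches + 2 * churn

files = {
    "png.c": "void f(char *d, char *s) { if (s) strcpy(d, s); }",
    "util.c": "int add(int a, int b) { return a + b; }",
}
ranked = sorted(files, key=lambda f: hotspot_score(files[f]), reverse=True)
print(ranked)  # ['png.c', 'util.c']
```

Only the top-ranked hotspots get harness drafts, which keeps the LLM budget concentrated where bugs are likeliest.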
3) SARIF-Guided Refinement
Ingest SARIF from static analyzers. For each finding, ask the LLM to convert the warning into a runnable POV with concrete inputs, then validate under ASAN/UBSAN.
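Flattening SARIF into prompt-ready tuples is straightforward with the standard 2.1.0 object model; a minimal sketch (error handling omitted, single-location findings assumed):

```python
import json

def sarif_findings(sarif_text):
    """Flatten SARIF 2.1.0 results into (ruleId, file, line, message)
    tuples ready to embed in an LLM prompt."""
    doc = json.loads(sarif_text)
    out = []
    for run in doc.get("runs", []):
        for res in run.get("results", []):
            loc = res["locations"][0]["physicalLocation"]
            out.append((
                res.get("ruleId"),
                loc["artifactLocation"]["uri"],
                loc["region"]["startLine"],
                res["message"]["text"],
            ))
    return out

sarif = json.dumps({"runs": [{"results": [{
    "ruleId": "cpp/unbounded-write",
    "message": {"text": "memcpy with unchecked length"},
    "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "src/png.c"},
        "region": {"startLine": 42}}}],
}]}]})
print(sarif_findings(sarif))
```

Each tuple anchors a prompt of the form "produce a concrete input that makes line 42 of src/png.c fault under ASAN".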
4) Call-Path Targeting
Generate candidate input shapes that traverse specific call sequences to the sink. LLM reasons about required invariants and state to reach the vulnerable site.
5) Taint-Spot Exploration
Surface user-controlled data flows (CLI args, HTTP params, file parsers). LLM proposes minimally valid inputs that survive parsing and reach memory-unsafe operations.
6) Sanitizer-Driven Generalization
Cluster sanitizer crashes and let the LLM generalize a stable repro from noisy stack traces. Convert flaky crashes into deterministic POVs.
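One simple clustering key, sketched below, hashes the top in-app frames of each report while ignoring addresses and sanitizer runtime frames (the `crash_signature` helper and the sample traces are illustrative):

```python
import hashlib
import re

def crash_signature(stack_trace, top_n=3):
    """Bucket sanitizer reports by their top in-app frames, ignoring
    addresses and ASan runtime frames, so variants of the same bug
    land in one cluster."""
    frames = re.findall(r"#\d+ 0x[0-9a-f]+ in (\S+)", stack_trace)
    frames = [f for f in frames if not f.startswith("__asan")][:top_n]
    return hashlib.sha1("|".join(frames).encode()).hexdigest()[:12]

a = "#0 0xdeadbeef in __asan_memcpy\n#1 0x1234 in png_read_chunk\n#2 0x5678 in png_decode"
b = "#0 0xfeedface in __asan_memcpy\n#1 0x9999 in png_read_chunk\n#2 0x7777 in png_decode"
print(crash_signature(a) == crash_signature(b))  # True: same bucket
```

Within each bucket, the LLM sees a handful of representative traces and is asked for the shared root cause plus one stable input.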
7) Grammar-Guided Input Synthesis
Ask the LLM to emit a minimal grammar/schema for inputs (e.g., PNG, JSON). Mutate within the grammar to preserve reachability while stressing edge counts.
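To make the idea concrete, here is a toy grammar for a length-prefixed record; mutations stay inside the grammar so parsers accept the input and reachability is preserved (the format itself is invented for illustration):

```python
import random

def gen_record(rng, payload_len=None):
    """Emit one grammar-valid record: magic byte, length, payload."""
    n = payload_len if payload_len is not None else rng.randint(0, 255)
    payload = bytes(rng.randrange(256) for _ in range(n))
    return bytes([0x7F, n]) + payload

rng = random.Random(0)
corpus = [gen_record(rng) for _ in range(3)]
# Stress boundary lengths explicitly while keeping valid structure:
edges = [gen_record(rng, payload_len=k) for k in (0, 1, 255)]
print(all(r[0] == 0x7F and r[1] == len(r) - 2 for r in corpus + edges))  # True
```

The same pattern scales up: the LLM emits the generator for PNG or JSON, and the harness mutates only the free variables the grammar exposes.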
8) Coverage-Loop Refinement
Instrument harnesses, report missed branches back to the LLM, and request inputs that flip specific predicates or increase rare-edge hit counts.
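The loop reduces to "run, diff against full branch set, ask for the missing predicate"; a self-contained toy (the branch-recording `toy_program` stands in for an instrumented harness):

```python
def run_with_coverage(inputs, program):
    """Execute a branch-recording program over a corpus; return the
    union of covered branch IDs."""
    covered = set()
    for x in inputs:
        covered |= program(x)
    return covered

def toy_program(x):
    hit = {"entry"}
    if x > 10:
        hit.add("gt10")
        if x % 2 == 0:
            hit.add("gt10_even")
    return hit

ALL_BRANCHES = {"entry", "gt10", "gt10_even"}
corpus = [3, 11]
missed = ALL_BRANCHES - run_with_coverage(corpus, toy_program)
print(missed)  # {'gt10_even'} -> ask the model for an even input > 10
corpus.append(12)  # model-proposed input flipping the missed predicate
print(ALL_BRANCHES - run_with_coverage(corpus, toy_program))  # set()
```

In practice the "missed" set is a list of source locations and predicates, which gives the LLM a far tighter target than raw fuzzing feedback.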
9) Exception-Mining (Java)
Exploit language-level stack traces and messages to shortcut to faulting APIs. Request inputs that transform a handled exception into a crash or integrity violation.
10) Pattern Replay
Leverage a library of historical bug patterns (e.g., off-by-one in image decoders). LLM adapts known triggers to the current codebase with type- and path-aware tweaks.
Design Notes
Discovery treats model output as untrusted. Every candidate is compiled and executed under sanitizers; only deterministic repros graduate to POVs. Details in our technical report.
🩹 Patching Strategies (13)
Patches must compile, negate the POV, and preserve functionality. We bias toward minimal, auditable diffs unless the LLM justifies a larger refactor. Each strategy runs through the same validation gates.
1) Minimal Guard
Add precondition checks (bounds, null, state) at the faulting site with early returns or error codes.
// Before
memcpy(dst, src, len);
// After
if (len > dst_size) return ERR_INVALID_SIZE;
memcpy(dst, src, len);
2) Path-Aware Fix
Harden only the failing path: guard specific states along the call-chain that lead to the sink; avoid broad behavioral changes.
3) Size-Checked Copy
Replace unsafe copies with bounded variants (`strncpy`, `memcpy_s`) or explicit length checks with clear error handling.
4) Input Validation
Enforce strict parsing and reject malformed structures early (magic bytes, lengths, indices, state machines).
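A sketch of this reject-early shape for a PNG-like header (the `validate_png_header` helper is illustrative; a real patch would live in the target's own language and parser):

```python
import struct

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def validate_png_header(data):
    """Reject malformed input before any decoding: check magic bytes,
    then the first chunk's declared length against what is present."""
    if len(data) < 16 or not data.startswith(PNG_MAGIC):
        return False
    (length,) = struct.unpack(">I", data[8:12])
    if length > 0x7FFFFFFF:  # PNG spec caps chunk length at 2^31 - 1
        return False
    # 8 = signature, 12 = length/type/CRC fields of the first chunk
    return len(data) >= 8 + 12 + length

good = PNG_MAGIC + struct.pack(">I", 0) + b"IHDR" + b"\x00" * 4
print(validate_png_header(good))        # True
print(validate_png_header(b"MZ junk"))  # False
```

Validating lengths and indices at the boundary means downstream code can assume well-formed structure instead of re-checking everywhere.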
5) Signedness & Overflow
Normalize types, add overflow checks on arithmetic, and clamp to safe ranges before allocations or indexing.
6) Resource Safety
Fix leaks and double-frees by clarifying lifetime rules; prefer RAII/`defer`-like scopes where available.
7) Concurrency Guard
Introduce minimal synchronization (atomic flags, fine-grained locks) to eliminate races causing memory corruption or TOCTOU.
8) Defensive Defaults
On parser or API failure, return safe defaults rather than partially initialized structures.
9) API-Level Replacement
Swap to safer APIs (e.g., `snprintf` over `sprintf`), or centralize validation in a wrapper used across call sites.
10) State Machine Tightening
For complex formats, enforce valid transitions and terminal states to prevent invalid memory access.
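A minimal sketch of an explicit transition table for a toy chunked format (states and events invented for illustration); anything outside the table is rejected instead of silently proceeding with stale state:

```python
TRANSITIONS = {
    "start":  {"header"},
    "header": {"data", "end"},
    "data":   {"data", "end"},
    "end":    set(),
}

def check_sequence(events):
    """Walk the parser's event stream; raise on any illegal transition
    and require a proper terminal state."""
    state = "start"
    for ev in events:
        if ev not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {ev}")
        state = ev
    return state == "end"

print(check_sequence(["header", "data", "data", "end"]))  # True
try:
    check_sequence(["data"])  # data before header: rejected
except ValueError as e:
    print(e)
```

Making the legal transitions a data structure also makes the patch auditable: a reviewer can compare the table against the format's specification directly.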
11) Spec-Conformant Refactor
Where minimal guards aren't enough, perform small refactors that align with spec rules while preserving public APIs.
12) Regression-Aware Patch
Augment patches with new unit tests derived from the POV and near-miss inputs to prevent reintroduction.
13) XPatch (No-POV Fix)
When a POV cannot be produced, synthesize a patch from high-confidence static findings plus local invariants; validate by negative testing and coverage invariants.
// Example (Java): safer length check
public byte[] read(byte[] buf, int len) {
    if (buf == null || len < 0 || len > buf.length) {
        throw new IllegalArgumentException("invalid length");
    }
    // ... existing logic ...
}
🏗️ Orchestration & Validation Gates
LLM Router
class LLMRouter:
    MODELS = ["claude", "gpt", "gemini"]

    async def call(self, prompt, validate):
        for name in self.MODELS:
            try:
                out = await call_model(name, prompt)
                if validate(out):
                    return out
            except (RateLimit, Overload):
                await backoff()
                continue
        raise RuntimeError("All models failed")
Per-model quotas, exponential backoff, and success-rate telemetry avoid stampede failures and control cost.
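The `backoff()` awaited above is left abstract; one common choice, sketched here under the illustrative name `backoff_delay`, is full-jitter exponential backoff:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter exponential backoff: the ceiling grows 1s, 2s, 4s, ...
    up to `cap`, with uniform jitter to de-synchronize retrying workers."""
    return rng() * min(cap, base * (2 ** attempt))

# With jitter pinned to 1.0, the ceilings are visible directly:
delays = [backoff_delay(a, rng=lambda: 1.0) for a in range(8)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Jitter matters under shared provider quotas: without it, every worker that hit the same rate limit retries at the same instant and trips it again.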
Validation Gates
- Compile under sanitizers; reject non-deterministic crashes
- POV must be deterministic (N ≥ 3 runs)
- Patch must negate POV and pass regression
- Cost/latency budgets enforced per strategy
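The determinism gate reduces to a small check, sketched here with a stub in place of "execute harness under ASAN, return crash signature or None" (names are illustrative):

```python
def deterministic_pov(run_crash, candidate, runs=3):
    """Gate: a candidate input graduates to a POV only if it crashes
    with the same signature on every one of `runs` executions."""
    sigs = {run_crash(candidate) for _ in range(runs)}
    return len(sigs) == 1 and None not in sigs

# Stub standing in for a sanitized harness execution:
stub = lambda x: "heap-buffer-overflow@png_read_chunk" if x == b"boom" else None
print(deterministic_pov(stub, b"boom"))    # True
print(deterministic_pov(stub, b"benign"))  # False
```

Requiring an identical signature across runs, not merely "crashed each time", is what filters out flaky, state-dependent repros.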
Reproducibility
Per-job directories (/tmp/job_{id}/...) and artifact bundles (inputs, logs, patches) make results auditable, aligning with our report.
🧪 What Worked (and What Didn't)
Start Minimal
Minimal guards neutralize most memory errors without side effects; escalate to refactors only when justified by failing regression tests.
Cache Static Facts
Precomputed call graphs and symbol maps make strategy runs predictable and cheap at scale.
Feedback Loops Matter
Coverage- and sanitizer-driven hints dramatically reduce LLM trial-and-error during discovery.
POV-First, Except When Not
XPatch salvages high-confidence findings when a deterministic POV is elusive, but requires tighter negative tests.
📚 Dive Deeper
The full system, data, and ablations are documented in our technical report and open-source CRS.