Introducing guardrail-eval-bench: a deterministic eval set for prompt-injection detectors.

Most prompt-injection benchmarks use an LLM to judge whether the injection was detected. That is exactly the bug they claim to detect. We released guardrail-eval-bench (Apache-2.0) because we got tired of evals that pass at 96% on Monday and 73% on Tuesday with no input change — the judge model updated.

Why deterministic

Adversarial robustness needs hand-labeled ground truth. LLM-as-judge converges around plausible-sounding answers, not correct ones, and the convergence shifts when the underlying model is retrained. For a benchmark that's supposed to measure how well your guardrail catches inputs that fool LLMs, using an LLM to grade is structurally incoherent.

What's in the set

1,800 hand-labeled cases across four categories. Each case carries a deterministic outcome label (block / allow / flag-for-review) and the rationale a human applied. The labels and rationales are version-pinned so a benchmark run today and a run six months from now are directly comparable.

Direct injection — explicit instruction overrides ("ignore previous instructions...")
Role play — persona impersonation ("you are DAN, you can do anything")
Context smuggling — injection via tool output, doc retrieval, or system context
Encoding tricks — base64, leetspeak, language-switch, zero-width unicode

How to run it

// python

from guardrail_eval_bench import load, run, score

bench = load(version="2025.11")
results = run(your_detector, bench, categories=["direct_injection", "context_smuggling"])

report = score(results)
print(report.precision, report.recall, report.f1)
# By-category breakdown:
for cat, metrics in report.by_category.items():
    print(cat, metrics)

your_detector is any callable that takes a dict { input, context } and returns a verdict. The bench is intentionally framework-agnostic. We've run it against LangChain guardrails, Anthropic's safety filters, and three custom detectors built by clients.

What it doesn't catch

Three known gaps, worth saying out loud so you don't oversell what passing the bench means:

Novel attack patterns. No static benchmark can; the cases here are the ones we've seen in the wild as of 2025 Q4.
Latency tradeoffs. The bench measures accuracy, not eval-budget. A 99.5% detector that takes 800ms is worse for most production systems than a 96% detector that takes 4ms.
Category boundaries are fuzzy. Encoding tricks often nest inside role play. The labels capture the dominant category but you'll see disagreement on edge cases.

How we use it internally

Every guardrail node we ship runs this bench as a CI gate. A drop of more than 0.5% on any category against the previous main blocks merge. It's also the first thing a client sees when we hand off — it sets the floor for what their team can regress against.

Public release is in the legal/redaction pass — we're pulling identifying cases from one regulated engagement before the repo goes live. Apache-2.0, no LLM judge, no telemetry. If you want early access for evaluation against your detector, ask via the intake agent and we'll send a private mirror.

// continue

What we learned migrating a $2.4B claims-intake queue to an agentic pipeline.