A benchmark that measures how efficiently AI models find bugs in reduction rules from the problem-reductions library (290+ rules).
Live leaderboard: https://ferrari-72.github.io/problem-reductions-benchmark/
A reduction rule maps problem A → problem B. A bug is a round-trip failure:
A →(reduce)→ B →(solve)→ s →(extract)→ A'
If A' is an invalid solution to A, the rule has a bug. The AI finds these by constructing counterexample certificates — JSON objects that describe the violation and the evidence.
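The round-trip check can be illustrated on a toy reduction. The sketch below is not taken from the problem-reductions library and does not use `pred`; it assumes a textbook Independent Set → Vertex Cover reduction, and every function name in it is hypothetical. It shows how a faulty extraction step surfaces as an invalid A'.

```python
from itertools import combinations


def is_vertex_cover(edges, cover):
    # Every edge must have at least one endpoint in the cover.
    return all(u in cover or v in cover for u, v in edges)


def is_independent_set(edges, chosen):
    # No edge may have both endpoints in the chosen set.
    return not any(u in chosen and v in chosen for u, v in edges)


def solve_min_vertex_cover(n, edges):
    # Brute-force solver for the target problem (fine for tiny graphs only).
    for size in range(n + 1):
        for cand in combinations(range(n), size):
            if is_vertex_cover(edges, set(cand)):
                return set(cand)
    return set(range(n))  # the full vertex set is always a cover


def extract_correct(n, cover):
    # The complement of a vertex cover is an independent set.
    return set(range(n)) - cover


def extract_buggy(n, cover):
    # Hypothetical bug: forgets to take the complement.
    return set(cover)


def round_trip_ok(n, edges, extract):
    # A -> (reduce) -> B: the graph is reused unchanged in this toy reduction.
    target = (n, edges)
    # B -> (solve) -> s
    cover = solve_min_vertex_cover(*target)
    # s -> (extract) -> A'
    candidate = extract(n, cover)
    # A' must be a valid solution to the source problem A.
    return is_independent_set(edges, candidate)


triangle_edges = [(0, 1), (1, 2), (0, 2)]
print(round_trip_ok(3, triangle_edges, extract_correct))  # True
print(round_trip_ok(3, triangle_edges, extract_buggy))    # False
```

On the triangle graph the buggy extraction returns the cover itself, which contains both endpoints of an edge, so the round trip fails. A counterexample certificate would record an instance like this one as evidence.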
Three violation types:
| Type | Meaning |
|---|---|
| `unsound_extraction` | A valid target solution extracts back to an invalid source solution |
| `incomplete_reduction` | Source has a solution but the reduced target has none |
| `suboptimal_extraction` | Extracted source solution is not optimal; a better one exists |
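
As a rough illustration of how the three types differ, the following sketch maps the outcome of one round trip to a violation type. The function and its parameters are hypothetical and are not part of `pred` or the benchmark code; the order of the checks is also an assumption.

```python
from typing import Optional


def classify_violation(
    source_feasible: bool,
    target_feasible: bool,
    extracted_valid: Optional[bool] = None,
    extracted_value: Optional[float] = None,
    best_value: Optional[float] = None,
    maximize: bool = True,
) -> Optional[str]:
    """Return a violation type name, or None if no violation is detected."""
    # Source has a solution but the reduced target has none.
    if source_feasible and not target_feasible:
        return "incomplete_reduction"
    # A valid target solution extracted back to an invalid source solution.
    if target_feasible and extracted_valid is False:
        return "unsound_extraction"
    # The extracted solution is valid, but a strictly better one exists.
    if extracted_valid and extracted_value is not None and best_value is not None:
        if maximize:
            worse = extracted_value < best_value
        else:
            worse = extracted_value > best_value
        if worse:
            return "suboptimal_extraction"
    return None


print(classify_violation(True, False))
print(classify_violation(True, True, extracted_valid=False))
print(classify_violation(True, True, True, extracted_value=2, best_value=3))
```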
Every certificate is independently re-verified by pred — the AI's claim is never trusted directly.
Primary metric: bugs/Ktok (bugs found per 1000 tokens) — model-agnostic, works regardless of pricing. Secondary metric: bugs/$ (bugs found per USD spent) — practical cost-efficiency ranking.
Implement the `AgentRunner` interface in `benchmark/runner.py`:

    from benchmark.runner import AgentRunner


    class MyRunner(AgentRunner):
        def run(self, ctx, model: str, rule_name: str, per_rule_budget: float) -> dict:
            # Run the model, return a certificate if a bug is found
            return {
                "rule": rule_name,
                "result": "bug_found",  # or "no_certificate" | "rejected" | "error:..."
                "cost": 0.05,           # USD spent
                "tokens_k": 12.3,       # tokens used (thousands)
                "certificate": {...},   # required when result == "bug_found"
            }

Then pass it to `Scheduler` in `benchmark/scheduler.py`. See `MiniSweRunner` for a full example.
Results must be written to results/{safe_model}.json following benchmark/results.schema.json.
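Before handing a runner to the scheduler, it can help to sanity-check the dict that `run` returns. The helper below is a hypothetical sketch that only mirrors the comments in the `MyRunner` example; `benchmark/results.schema.json` remains the authoritative definition, and treating all four keys as required is an assumption.

```python
ALLOWED_RESULTS = {"bug_found", "no_certificate", "rejected"}


def check_runner_result(result: dict) -> list[str]:
    """Return a list of problems found in a runner result (empty if none)."""
    problems = []
    # Keys shown in the MyRunner example; assumed here to be required.
    for key in ("rule", "result", "cost", "tokens_k"):
        if key not in result:
            problems.append(f"missing key: {key}")
    status = str(result.get("result", ""))
    # "error:..." results carry a free-form suffix after the prefix.
    if status not in ALLOWED_RESULTS and not status.startswith("error:"):
        problems.append(f"unknown result value: {status!r}")
    # A certificate must accompany every claimed bug.
    if status == "bug_found" and not result.get("certificate"):
        problems.append("certificate is required when result == 'bug_found'")
    return problems


print(check_runner_result(
    {"rule": "example_rule", "result": "bug_found", "cost": 0.05, "tokens_k": 12.3}
))
```

The rule name `example_rule` is a placeholder, not a real rule from the library.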
Requirements:

- `pred` binary in PATH (pinned commit `aa2d1a1` of problem-reductions)
- Python 3.11+ with dependencies: `pip install -r benchmark/requirements.txt`
- An API key for your model

    # Set env vars
    export ANTHROPIC_API_KEY=sk-...
    export REPO_DIR=/path/to/problem-reductions  # must be at pinned commit

    # Run a small session (2 rules, $2 budget)
    make demo

    # Rebuild the leaderboard index
    make build-index

    # Run all unit tests (no API key needed)
    make test-unit

    # Validate results files against schema
    make validate-results

Key make targets:
| Target | Description |
|---|---|
| `make test-unit` | All unit tests, no API key needed |
| `make verify-calibration` | Test verifier against 3 known fixtures |
| `make verify-judgment` | Robust equality + novelty filter tests |
| `make validate-results` | Schema-check all `results/*.json` |
| `make build-index` | Rebuild `results/index.json` |
| `make demo` | Run a tiny real session + rebuild index |
| Metric | Formula | When to use |
|---|---|---|
| bugs/Ktok | bugs ÷ tokens(K) | Primary ranking: fair across all models regardless of price |
| bugs/$ | bugs ÷ USD spent | Secondary ranking: practical cost-efficiency |
Use bugs/Ktok when comparing models with different pricing tiers. Use bugs/$ when optimizing for budget.
Models that don't publish pricing can still compete on bugs/Ktok.
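Both metrics are simple ratios. The sketch below computes them for one model; the function name, the returned keys, and the handling of zero token counts or missing cost are assumptions for illustration, not the benchmark's own implementation.

```python
from typing import Optional


def efficiency(bugs: int, tokens_k: float, cost_usd: Optional[float] = None) -> dict:
    """Compute bugs/Ktok and, when a cost is known, bugs/$."""
    if tokens_k <= 0:
        # Assumption: a run that used no tokens has no meaningful ratio.
        raise ValueError("tokens_k must be positive")
    per_ktok = bugs / tokens_k
    # Models without published pricing have no bugs/$ value.
    per_usd = bugs / cost_usd if cost_usd else None
    return {"bugs_per_ktok": per_ktok, "bugs_per_usd": per_usd}


# Illustrative numbers only: 3 bugs found using 150K tokens at a cost of $1.50.
print(efficiency(3, 150.0, 1.5))
# A model with no published pricing is still ranked on bugs/Ktok.
print(efficiency(3, 150.0))
```

With these made-up inputs the model scores 0.02 bugs/Ktok and 2.0 bugs/$; in the second call only the primary metric is available.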