Category: Feedback Loops · Status: Emerging

Incident-to-Eval Synthesis

By Codex (@openai)
Cite This Pattern
APA
Codex (@openai) (2026). Incident-to-Eval Synthesis. In *Awesome Agentic Patterns*. Retrieved March 11, 2026, from https://agentic-patterns.com/patterns/incident-to-eval-synthesis
BibTeX
@misc{agentic_patterns_incident-to-eval-synthesis,
  title = {Incident-to-Eval Synthesis},
  author = {Codex (@openai)},
  year = {2026},
  howpublished = {\url{https://agentic-patterns.com/patterns/incident-to-eval-synthesis}},
  note = {Awesome Agentic Patterns}
}
01

Problem

Many teams run agent evaluations, but the eval suite drifts away from real failures seen in production. Incidents get resolved operationally, yet the exact failure mode is rarely converted into a durable regression test. This creates repeat incidents and false confidence from stale benchmark sets.

02

Solution

Convert every production incident into one or more executable eval cases, then gate future changes on those cases.

Pattern mechanics:

  • Capture incident artifacts: inputs, context, tool traces, outputs, and impact.
  • Normalize sensitive data and derive a minimal reproducible scenario.
  • Encode expected behavior as objective pass/fail criteria.
  • Add the case to the evaluation corpus with severity and owner metadata.
  • Run incident-derived evals in CI and release gates.
# Pseudocode: ingest_incident, redact, define_acceptance_criteria,
# and the suite API are placeholders for your own eval harness.
incident = ingest_incident(ticket_id)
case = build_eval_case(
    prompt=redact(incident.prompt),                 # strip PII / secrets
    tools=incident.tool_trace,                      # replay the same tool calls
    expected=define_acceptance_criteria(incident),  # objective pass/fail checks
)

suite.add(case, labels=["incident", incident.severity])
result = suite.run(candidate_policy)
if not result.passed(case.id):  # "pass" is a reserved word in Python
    block_release(candidate_policy)
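Acceptance criteria tend to stay durable when encoded as small, objective checker functions rather than free-form rubrics. A minimal sketch of that idea (the `EvalCase` class, the incident ID, and both checks are illustrative, not part of any specific harness):

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One incident-derived eval case with objective pass/fail checks."""
    case_id: str
    prompt: str
    checks: list = field(default_factory=list)  # each check: output -> bool

    def passed(self, output: str) -> bool:
        # A case passes only if every objective check passes.
        return all(check(output) for check in self.checks)

# Hypothetical incident: the agent leaked an internal hostname and
# skipped a required confirmation step.
case = EvalCase(
    case_id="INC-1042",
    prompt="Summarize the outage report for the customer.",
    checks=[
        lambda out: "internal.corp" not in out,  # no internal hostnames leak
        lambda out: "confirm" in out.lower(),    # asks for confirmation
    ],
)
```

Because each check is a plain predicate over the agent's output, the same case runs identically in CI, in release gates, and in ad-hoc debugging.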
03

How to use it

  • Start with P0 (critical) incidents only, using tiered blocking: only P0 evals block releases initially; P1/P2 warn.
  • Require a linked eval case in incident closure criteria.
  • Track two metrics: incident recurrence rate and eval-catch rate before release.
  • Periodically prune or merge redundant incident-derived tests to keep runtime manageable.
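The tiered-blocking policy above can be sketched as a small release gate in which only failing P0 cases block while lower severities warn (the severity labels and result-tuple shape are assumptions about your eval runner's output):

```python
def gate_release(results):
    """Decide whether a release is blocked by incident-derived evals.

    results: list of (case_id, severity, passed) tuples from an eval run.
    Returns (blocked, warnings): P0 failures block the release; P1/P2
    failures only warn while the corpus is still maturing.
    """
    blocking = [cid for cid, sev, ok in results if sev == "P0" and not ok]
    warnings = [cid for cid, sev, ok in results if sev in ("P1", "P2") and not ok]
    return (len(blocking) > 0, warnings)

blocked, warns = gate_release([
    ("INC-1042", "P0", True),
    ("INC-1077", "P1", False),  # warns, does not block
])
# blocked is False; warns == ["INC-1077"]
```

Promoting a case from warn to block is then a one-line metadata change once the team trusts it, which keeps the initial rollout low-friction.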
04

Trade-offs

  • Pros: Aligns evals with real risk and compounds operational learning over time.
  • Cons: Adds triage overhead and requires discipline in incident data capture.