Last updated: 2026-05-11
Eval Before Autonomy
A release pattern where every new autonomous action is gated by traces and a focused eval set.
When to use it
- The agent can write, purchase, message, deploy, or mutate state.
- Prompt/model changes could alter tool behavior.
- The team needs a reviewable safety story.
Avoid when
- The feature is read-only and low risk.
- No one owns the eval dataset after launch.
Implementation notes
- Turn real failures into eval cases.
- Track cost and latency with quality.
- Require traces for every autonomous tool call.