Blog

/failure-recovery-auditability March 25, 2026

Failure, Recovery, and Auditability

Designing agent workflows to fail safely and recover cleanly.

Reliable systems are not systems that never fail. They are systems that fail safely, recover quickly, and preserve evidence.

Failure taxonomy

Constraint violation
Intent mismatch
Environment/runtime error
Unknown/ambiguous outcome

Each class should have a predefined response.

Recovery loop

Detect and classify
Record deviation
Choose disposition (fix / accept / defer)
Reconcile before merge

Audit requirements

Every exception decision is recorded
Every bypass has explicit signal
Every deferred issue has a destination and owner

Anti-patterns

Silent retries without evidence
Merge-first, reconcile-later culture
Untracked emergency edits

References

AWS Prescriptive Guidance, Lifecycle management
https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-serverless/prompt-agent-and-model.html
Inkeep, Why Agents Fail
https://inkeep.com/blog/context-engineering-why-agents-fail