Failure, Recovery, and Auditability

Reliable systems are not systems that never fail. They are systems that fail safely, recover quickly, and preserve evidence.

Failure taxonomy

Constraint violation
Intent mismatch
Environment/runtime error
Unknown/ambiguous outcome

Each class should have a predefined response.
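
One way to make "predefined" concrete is a lookup table keyed by failure class. The sketch below is illustrative only: the four class names come from the taxonomy above, but the response strings and the names FailureClass, PREDEFINED_RESPONSE, and response_for are assumptions, not something the source prescribes.

```python
from enum import Enum


class FailureClass(Enum):
    CONSTRAINT_VIOLATION = "constraint_violation"
    INTENT_MISMATCH = "intent_mismatch"
    ENVIRONMENT_ERROR = "environment_error"
    UNKNOWN_OUTCOME = "unknown_outcome"


# Hypothetical responses: the taxonomy names the classes, not what to do.
PREDEFINED_RESPONSE = {
    FailureClass.CONSTRAINT_VIOLATION: "block the change and fix it before continuing",
    FailureClass.INTENT_MISMATCH: "pause and confirm intent with the requester",
    FailureClass.ENVIRONMENT_ERROR: "retry with a recorded attempt, then escalate",
    FailureClass.UNKNOWN_OUTCOME: "stop, preserve evidence, and ask a human",
}

# A class without a response fails loudly at import time, not mid-incident.
assert set(PREDEFINED_RESPONSE) == set(FailureClass)


def response_for(failure: FailureClass) -> str:
    """Return the response that was decided ahead of time for this class."""
    return PREDEFINED_RESPONSE[failure]
```

Keeping the table next to the enum means adding a fifth class without deciding its response is caught immediately.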

Recovery loop

Detect and classify
Record deviation
Choose disposition (fix / accept / defer)
Reconcile before merge
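
The loop above can be sketched as a small gate that refuses to merge while any recorded deviation lacks a disposition. This is a minimal sketch under assumptions: Deviation, ChangeSet, and their methods are hypothetical names, and "merge" is reduced to a boolean check.

```python
from dataclasses import dataclass, field
from typing import Optional

DISPOSITIONS = ("fix", "accept", "defer")


@dataclass
class Deviation:
    summary: str
    failure_class: str                 # e.g. "constraint_violation"
    disposition: Optional[str] = None  # unset until someone decides


@dataclass
class ChangeSet:
    deviations: list = field(default_factory=list)

    def record(self, summary: str, failure_class: str) -> Deviation:
        """Detect and classify, then record the deviation."""
        deviation = Deviation(summary, failure_class)
        self.deviations.append(deviation)
        return deviation

    def dispose(self, deviation: Deviation, disposition: str) -> None:
        """Choose a disposition: fix, accept, or defer."""
        if disposition not in DISPOSITIONS:
            raise ValueError(f"unknown disposition: {disposition!r}")
        deviation.disposition = disposition

    def can_merge(self) -> bool:
        """Reconcile before merge: every deviation must have a disposition."""
        return all(d.disposition is not None for d in self.deviations)
```

For example, recording a deviation makes can_merge() return False until dispose() is called on it with one of the three dispositions.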

Audit requirements

Every exception decision is recorded
Every bypass has an explicit signal
Every deferred issue has a destination and owner
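
These requirements can be checked mechanically if each decision is stored as a structured record. The following sketch assumes a hypothetical AuditEntry shape; the field names and the audit_problems helper are illustrative, not a format the source defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    decision: str                      # "fix", "accept", or "defer"
    reason: str                        # why the exception was granted
    bypass: bool = False               # explicit signal that a check was skipped
    destination: Optional[str] = None  # where a deferred issue is tracked
    owner: Optional[str] = None        # who is responsible for a deferred issue


def audit_problems(entry: AuditEntry) -> list:
    """Return the audit requirements this entry fails to meet (empty if none)."""
    problems = []
    if not entry.reason.strip():
        problems.append("decision recorded without a reason")
    if entry.decision == "defer" and not (entry.destination and entry.owner):
        problems.append("deferred issue needs a destination and an owner")
    return problems
```

Making bypass a required-to-read field, instead of inferring it from a missing check result, is what turns a bypass into an explicit signal.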

Anti-patterns

Silent retries without evidence
Merge-first, reconcile-later culture
Untracked emergency edits
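
As a contrast to the first anti-pattern, a retry helper can leave one evidence record per attempt. This is a sketch, not a recommended retry policy: retry_with_evidence and its parameters are hypothetical, and the evidence is kept in an in-memory list where a real system would write to durable storage.

```python
import time


def retry_with_evidence(operation, attempts=3, delay_seconds=0.0, evidence=None):
    """Retry `operation`, appending one evidence record per attempt.

    Pass your own `evidence` list to keep the records even when the
    final attempt raises.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    evidence = [] if evidence is None else evidence
    for attempt in range(1, attempts + 1):
        try:
            result = operation()
        except Exception as exc:  # broad on purpose: every failure is evidence
            evidence.append({"attempt": attempt, "ok": False, "error": repr(exc)})
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
        else:
            evidence.append({"attempt": attempt, "ok": True, "error": None})
            return result, evidence
```

The retry still happens, but it is no longer silent: a later reader can see how many attempts were needed and why the earlier ones failed.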

References

AWS Prescriptive Guidance, Lifecycle management
https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-serverless/prompt-agent-and-model.html
Inkeep, Why Agents Fail
https://inkeep.com/blog/context-engineering-why-agents-fail

Architectural Principles

Great systems are not defined by their ability to avoid failure, but by their ability to survive it. Designing for resilience means accepting failure as an inevitability and building a system that can recover with grace and precision.

Assume Failure. The first step in building a resilient system is to assume it will fail. A network connection will drop, a service will time out, or the machine will restart. A workflow that succeeds only if a single process runs uninterrupted will eventually fail. The most important question is not “what if it fails?” but “when it fails, what happens next?”
Externalize State for Auditability. An agent’s workspace should be treated like an airplane’s “black box” flight recorder. Every significant decision, tool call, and output must be externalized to a durable artifact (e.g., a file or a log entry). When a failure occurs, that leaves a durable, append-only record of what happened, which is what makes precise debugging and recovery possible.
Recovery as Replay. When state is externalized, recovery is not a matter of guesswork. A new process can inspect the last successfully written artifact and resume the workflow from that exact point. The system is not trying to reconstruct a lost “thought process”; it is simply continuing a sequence of state transformations from the last known-good state.
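
The last two principles can be sketched together: each finished step is appended to a log file, and a restarted process skips whatever the log already contains. This is a minimal sketch under assumptions. The JSON Lines layout, the file path, and the names append_record, completed_steps, and run_workflow are all hypothetical, and it assumes step outputs are JSON-serializable and steps are safe to re-run if they crashed before being logged.

```python
import json
import os


def append_record(log_path, record):
    """Append one JSON line and force it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def completed_steps(log_path):
    """Read the log and return the names of steps that finished."""
    if not os.path.exists(log_path):
        return set()
    done = set()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break  # a torn final line means that write never finished
            done.add(record["step"])
    return done


def run_workflow(steps, log_path):
    """Run (name, function) pairs in order, skipping steps already logged."""
    done = completed_steps(log_path)
    executed = []
    for name, func in steps:
        if name in done:
            continue  # replay: the last known-good state already covers this
        output = func()
        append_record(log_path, {"step": name, "output": output})
        executed.append(name)
    return executed
```

Calling run_workflow a second time with the same log path resumes after the last logged step, so a crash between steps costs at most the one step that was in flight.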