An agent is tasked with planning a marketing campaign. It starts by analyzing last quarter’s performance data. Then, it moves on to researching competitor strategies. Midway through, the user asks for a quick summary of the campaign’s primary goal. The agent, its context window now saturated with competitor ad copy and performance metrics, has forgotten the initial objective. It provides a generic, unhelpful answer.
This is a failure of memory, but not in the human sense. The agent didn’t truly “forget.” The information was simply pushed out of its context window by newer, more immediate data. The problem wasn’t the size of the memory, but its architecture. It was a flat, undifferentiated mess.
To build agents that can reason over long-running, complex tasks, we need to move beyond the single, monolithic context window. We need a stratified memory system, one that distinguishes between “hot” memory for immediate context and “cold” memory for foundational knowledge.
Modern language models operate on a fixed or sliding context window. This is the “hot” memory—the set of information immediately available to the model for generating its next response. Everything inside this window is treated with roughly equal importance.
This creates several problems for agentic workflows:

- Instruction dilution: as new observations accumulate, the original instructions end up thousands of tokens away from the current reasoning step and exert less and less influence on it.
- Context thrashing: foundational facts, like the campaign’s primary goal, get pushed out by newer, more immediate data.
- Cost and latency: growing the window makes every call slower and more expensive without fixing either problem.

A flat memory architecture forces us into an impossible trade-off: a small window that forgets too quickly, or a massive window that’s slow, expensive, and still susceptible to instruction dilution.
A common approach to this problem is the “scratchpad” or “thought” pattern, where the model maintains a running log of its reasoning and actions. As the task progresses, new information is appended to the scratchpad, and the full log is fed back into the next prompt.
While this can be effective for short-lived tasks, it inevitably collapses under its own weight. The scratchpad grows linearly with the complexity of the task. Soon, it’s too large to fit in the context window, or so large that the original instructions are thousands of tokens away from the current reasoning step.
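A minimal sketch makes the failure mode concrete. The `Scratchpad` class below is illustrative, not a real library; it shows how the prompt grows linearly with every step, pushing the original instructions further from the current reasoning.

```python
class Scratchpad:
    """Illustrative scratchpad: the prompt is rebuilt from the full log each turn."""

    def __init__(self, instructions: str):
        self.instructions = instructions
        self.entries: list[str] = []

    def append(self, entry: str) -> None:
        self.entries.append(entry)

    def build_prompt(self) -> str:
        # Instructions first, then every step so far -- nothing is ever dropped.
        return "\n".join([self.instructions, *self.entries])


pad = Scratchpad("Goal: plan the Q3 marketing campaign.")
for step in range(100):
    pad.append(f"Step {step}: noted competitor ad copy and metrics ...")

prompt = pad.build_prompt()
# After 100 steps, the goal sits 100 entries away from the latest reasoning.
```

Every turn pays the cost of re-reading the entire history, and the most important line in the prompt is the one furthest from the model’s attention.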
We try to solve this with summarization loops, asking the model to periodically condense its scratchpad. But this introduces its own problems. The summarization itself is a lossy process. The model might discard a detail that seems unimportant at the time but becomes critical later. We are asking the model to perform its own memory management, a task it is not well-suited for.
A better approach is to design a memory system that mirrors how humans reason. We have a “working memory” (hot) for the task at hand and a “long-term memory” (cold) for foundational knowledge, past experiences, and core principles.
In an AI agent system, this translates to:

- “Hot” memory: the model’s context window, a small working set holding only what the current step needs.
- “Cold” memory: an external, durable store (files, a searchable log, perhaps a vector database) holding the mission, plans, past decisions, and retrieved facts.
This isn’t just about storing data; it’s about a process of stratification. The agent harness—the code that orchestrates the model—is responsible for managing the flow of information between these two layers.
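The two layers can be sketched as a simple data structure. `StratifiedMemory`, `promote`, and `demote` are hypothetical names for illustration; the point is that the harness, not the model, moves facts between the layers.

```python
from dataclasses import dataclass, field


@dataclass
class StratifiedMemory:
    # "Hot": travels in every prompt. "Cold": external, searchable store.
    hot: list[str] = field(default_factory=list)
    cold: dict[str, str] = field(default_factory=dict)

    def demote(self, key: str, fact: str) -> None:
        # Harness moves a fact out of the context window into long-term storage.
        self.cold[key] = fact

    def promote(self, key: str) -> None:
        # Harness injects a stored fact back into the context window when relevant.
        self.hot.append(self.cold[key])


mem = StratifiedMemory(hot=["Mission: launch the fall campaign."])
mem.demote("q2_metrics", "Q2 click-through averaged 1.8% across channels.")
mem.promote("q2_metrics")
```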
The workflow looks like this:
- `/mission.md`: The core, high-level objective.
- `/plan.md`: The agent’s current high-level plan.
- `/workspace/`: A directory for files the agent creates and modifies.
- `/memory/`: A searchable log of key decisions, observations, and retrieved facts, perhaps using a vector database for semantic search.
- The agent can call a `retrieve_from_memory(query)` tool to pull knowledge from cold storage. This gives the model direct control.
- The agent can call a `commit_to_memory(data, summary)` tool to explicitly save a piece of information and its significance.

By stratifying memory, we move from the anti-pattern of a single, ever-growing scratchpad to a more robust and scalable memory architecture. The agent’s context window (hot memory) stays small and focused on the immediate task, preventing context thrashing and instruction dilution. The harness, acting as a memory controller, ensures that relevant long-term knowledge from cold storage is available when needed.
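The two memory tools can be sketched against a file-backed log. The tool names come from the workflow above; the JSONL layout and substring matching are assumptions of this sketch, standing in for the vector-database search a real system would use.

```python
import json
import tempfile
from pathlib import Path

# A fresh temp directory stands in for the agent's /memory/ log.
MEMORY_FILE = Path(tempfile.mkdtemp()) / "log.jsonl"


def commit_to_memory(data: str, summary: str) -> None:
    """Explicitly save a piece of information along with its significance."""
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps({"summary": summary, "data": data}) + "\n")


def retrieve_from_memory(query: str) -> list[str]:
    """Return stored facts whose summary matches the query."""
    if not MEMORY_FILE.exists():
        return []
    entries = [json.loads(line) for line in MEMORY_FILE.read_text().splitlines()]
    return [e["data"] for e in entries if query.lower() in e["summary"].lower()]


commit_to_memory("Competitor X allocates 60% of ad spend to video.",
                 "competitor budget split")
hits = retrieve_from_memory("competitor budget")
```

Because the log lives on disk rather than in the prompt, committed facts survive context-window eviction and can be recalled many steps later.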
A robustly stratified memory architecture is not an ad-hoc solution but the expression of a disciplined engineering philosophy. Three core principles underpin this pattern:
Minimum Effective Context. The primary objective is not to maximize the context an agent can access, but to minimize it to the essential information required for the current task. This prevents instruction dilution and reduces the cognitive load on the model, leading to more reliable outcomes.
Stateless Execution with Externalized State. Agent operations must be treated as stateless transformations. All durable state—including plans, intermediate results, and consolidated knowledge—must be externalized to an artifact-first repository (e.g., a version-controlled filesystem). This ensures full reproducibility, auditability, and resilience against system failure.
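A sketch of this principle, under the artifact layout described above (`mission.md`, `plan.md`); the temp directory and `run_step` helper are illustrative. Each step reads durable state from disk, acts, and writes the result back, so a crash between steps loses nothing.

```python
import tempfile
from pathlib import Path

# Illustrative externalized-state workspace; a real harness might use
# a version-controlled directory instead of a temp dir.
workspace = Path(tempfile.mkdtemp())


def run_step(workspace: Path, new_plan: str) -> str:
    """A stateless transformation: all inputs and outputs live on disk."""
    mission = (workspace / "mission.md").read_text()  # state reloaded fresh
    (workspace / "plan.md").write_text(new_plan)      # result externalized
    return mission


(workspace / "mission.md").write_text("Launch the fall campaign.")
mission = run_step(workspace, "1. Analyze Q2 data\n2. Research competitors")
# If the process crashed here, mission.md and plan.md fully describe the state.
```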
System-Managed Context Retrieval. The agent itself should not be responsible for searching and retrieving from its own long-term memory. This responsibility lies with the surrounding harness. The harness must act as a context provisioner, retrieving relevant knowledge from “cold” storage and injecting it into the “hot” context window on a just-in-time basis, guided by the agent’s current task and role.
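A minimal sketch of just-in-time provisioning. `provision_context` and the keyword-overlap scoring are assumptions of this example, standing in for the embedding-similarity retrieval a production harness would use; the shape of the flow matches the principle above.

```python
def provision_context(task: str, cold_store: dict[str, str], budget: int = 2) -> str:
    """Harness-side retrieval: rank cold facts by relevance to the current task
    and inject only the top few into the hot context window."""
    task_words = set(task.lower().split())
    ranked = sorted(
        cold_store.items(),
        key=lambda kv: len(task_words & set(kv[0].lower().split())),
        reverse=True,
    )
    injected = [fact for _, fact in ranked[:budget]]
    return "\n".join(["# Task", task, "# Relevant knowledge", *injected])


cold = {
    "q2 performance metrics": "Q2 click-through averaged 1.8%.",
    "competitor video budget": "Competitor X puts 60% of spend into video.",
    "brand style guide": "Headlines use sentence case.",
}
prompt = provision_context("summarize q2 performance metrics", cold)
```

The model never searches its own memory; it simply receives a prompt in which the harness has already placed the knowledge the current task calls for.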