Hot vs Cold Memory Stratification: A Practical Memory Model for Agent Workflows

An agent is tasked with planning a marketing campaign. It starts by analyzing last quarter’s performance data. Then, it moves on to researching competitor strategies. Midway through, the user asks for a quick summary of the campaign’s primary goal. The agent, its context window now saturated with competitor ad copy and performance metrics, has forgotten the initial objective. It provides a generic, unhelpful answer.

This is a failure of memory, but not in the human sense. The agent didn’t truly “forget.” The information was simply pushed out of its context window by newer, more immediate data. The problem wasn’t the size of the memory, but its architecture. It was a flat, undifferentiated mess.

To build agents that can reason over long-running, complex tasks, we need to move beyond the single, monolithic context window. We need a stratified memory system, one that distinguishes between “hot” memory for immediate context and “cold” memory for foundational knowledge.

The Tyranny of the Flat Context Window

Modern language models operate on a fixed or sliding context window. This is the “hot” memory—the set of information immediately available to the model for generating its next response. Everything inside this window is treated with roughly equal importance.

This creates several problems for agentic workflows:

Context Thrashing: As new information (tool outputs, user messages, web search results) pours in, older but still relevant information is pushed out. The agent loses track of its own plans, the user’s original intent, and critical background context.
Instruction Dilution: Core instructions, goals, and constraints provided at the start of a task become “distant” in the context window. Their influence on the model’s behavior weakens with every new piece of data, leading to the kind of instruction-compliance failures we discussed previously.
High Cost and Latency: Constantly feeding large amounts of foundational context into the prompt for every single step of a workflow is expensive and slow. We pay a token tax to remind the agent of things it should already “know.”

A flat memory architecture forces us into an impossible trade-off: a small window that forgets too quickly, or a massive window that’s slow, expensive, and still susceptible to instruction dilution.

Anti-Pattern: The Ever-Growing Scratchpad

A common approach to this problem is the “scratchpad” or “thought” pattern, where the model maintains a running log of its reasoning and actions. As the task progresses, this scratchpad is appended with new information and fed back into the next prompt.

While this can be effective for short-lived tasks, it inevitably collapses under its own weight. The scratchpad grows linearly with the complexity of the task. Soon, it’s too large to fit in the context window, or so large that the original instructions are thousands of tokens away from the current reasoning step.

We try to solve this with summarization loops, asking the model to periodically condense its scratchpad. But this introduces its own problems. The summarization itself is a lossy process. The model might discard a detail that seems unimportant at the time but becomes critical later. We are asking the model to perform its own memory management, a task it is not well-suited for.

Pattern: Hot and Cold Memory Stratification

A better approach is to design a memory system that mirrors how humans reason. We have a “working memory” (hot) for the task at hand and a “long-term memory” (cold) for foundational knowledge, past experiences, and core principles.

In an AI agent system, this translates to:

Hot Memory (The Context Window): This is the dynamic, fast-access memory for the immediate task. It contains the user’s latest message, the output of the last few tool calls, the current plan, and short-term reasoning. It is expected to be volatile and change with every step.
Cold Memory (External Storage): This is a persistent, structured, and searchable repository for information that needs to survive beyond the current context window. It is the agent’s long-term memory.

This isn’t just about storing data; it’s about a process of stratification. The agent harness—the code that orchestrates the model—is responsible for managing the flow of information between these two layers.

The workflow looks like this:

Initialization: At the start of a task, the harness retrieves relevant information from cold memory. This might include the overall mission, user preferences, API documentation for available tools, and the outcomes of similar past tasks. This retrieved information forms the initial hot context.
Execution Loop: During the task, the agent uses its hot memory to reason and act. It calls tools, processes results, and formulates its next step.
Memory Management: The harness observes the execution.
- Hot to Cold: As important conclusions are reached or durable artifacts are created (e.g., a completed piece of code, a final report), the harness writes this information to cold memory. This is the process of consolidation. For example, after successfully writing a function, the harness might save the function’s code and a brief description of what it does to a file in a “completed work” directory.
- Cold to Hot: When the agent appears to be stuck, asking a question that its cold memory should contain the answer to, or beginning a new phase of the task, the harness retrieves the relevant context from cold storage and injects it into the hot context window. This is the process of retrieval.

Practical Guidance: Implementing a Stratified Memory

Choose Your Cold Storage: Cold storage can be as simple as a structured set of files and folders. For example:
- /mission.md: The core, high-level objective.
- /plan.md: The agent’s current high-level plan.
- /workspace/: A directory for files the agent creates and modifies.
- /memory/: A searchable log of key decisions, observations, and retrieved facts, perhaps using a vector database for semantic search.
Design the Retrieval Trigger: How does the harness know when to fetch information from cold storage?
- Explicit Retrieval: The agent can explicitly request information by calling a retrieve_from_memory(query) tool. This gives the model direct control.
- Heuristic Retrieval: The harness can use heuristics. For example, if the model’s output contains phrases like “I don’t know” or “I need to find out about X,” the harness can automatically perform a search for “X” in the cold memory.
- Just-in-Time Context: Before executing a complex tool, the harness can retrieve that tool’s documentation and examples from cold storage and place them in the prompt, giving the model a refresher right when it needs it.
Design the Consolidation Process: How does information move from hot to cold?
- Artifact-Driven: The primary method should be through the creation of artifacts. When the agent writes a file, that file is the memory. The act of saving it is the act of consolidation.
- Explicit Commits: The agent can use a commit_to_memory(data, summary) tool to explicitly save a piece of information and its significance.
- Automated Summarization: After a certain number of steps, the harness can use a separate, dedicated model instance to summarize the recent activity and save that summary to the memory log.

Conclusion: From Scratchpad to Architecture

By stratifying memory, we move from the anti-pattern of a single, ever-growing scratchpad to a more robust and scalable memory architecture. The agent’s context window (hot memory) stays small and focused on the immediate task, preventing context thrashing and instruction dilution. The harness, acting as a memory controller, ensures that relevant long-term knowledge from cold storage is available when needed.

This is not just about building agents that can handle longer tasks. It’s about building agents that can learn, adapt, and build upon their own past work. A flat memory system creates amnesiacs, doomed to repeat their mistakes. A stratified memory system is the first step toward creating agents with continuity, persistence, and a genuine capacity for complex, long-term reasoning.

Architectural Principles

A robustly stratified memory architecture is not an ad-hoc solution but the expression of a disciplined engineering philosophy. Three core principles underpin this pattern:

Minimum Effective Context. The primary objective is not to maximize the context an agent can access, but to minimize it to the essential information required for the current task. This prevents instruction dilution and reduces the cognitive load on the model, leading to more reliable outcomes.
Stateless Execution with Externalized State. Agent operations must be treated as stateless transformations. All durable state—including plans, intermediate results, and consolidated knowledge—must be externalized to an artifact-first repository (e.g., a version-controlled filesystem). This ensures full reproducibility, auditability, and resilience against system failure.
System-Managed Context Retrieval. The agent itself should not be responsible for searching and retrieving from its own long-term memory. This responsibility lies with the surrounding harness. The harness must act as a context provisioner, retrieving relevant knowledge from “cold” storage and injecting it into the “hot” context window on a just-in-time basis, guided by the agent’s current task and role.