Problem
Individual task execution transcripts contain valuable learnings, but:
- Too specific: "Make this button pink" isn't useful as general guidance
- Unknown relevance: Hard to predict which learnings apply to future tasks
- Scattered knowledge: Insights buried across hundreds of conversation logs
- Abstraction challenge: Difficult to know the right level of generality
Simply memorizing everything creates noise; ignoring everything loses valuable patterns.
Solution
Implement a two-tier memory system:
- Task diaries: Agent writes structured logs for each task (what it tried, what failed, why)
- Synthesis agents: Periodically review multiple task logs to extract reusable patterns
The synthesis step identifies recurring themes across logs, surfacing insights that aren't obvious from any single execution. This approach is validated by academic research: Reflexion (NeurIPS 2023) achieved 91% pass@1 on HumanEval using episodic memory with self-reflection, and Stanford's Generative Agents paper demonstrates "reflection" mechanisms that synthesize higher-level insights from multiple memories.
Example diary entry format:
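The source doesn't prescribe a schema; one plausible entry, following the fields Cat Wu mentions (what it tried, why it didn't work), might look like this (all section names are illustrative):

```markdown
## Task: migrate settings form to new validation library
### What I tried
- Swapped inline validators for the schema-based API.
- Ran the existing form tests unchanged.
### What failed and why
- Async validators silently dropped errors: the new API expects
  awaited results, but the old tests stubbed them synchronously.
### Lesson (candidate for synthesis)
- When migrating validators, audit for async/sync mismatches first.
```

Keeping a "lesson" field separate from the narrative gives the synthesis step a concrete unit to count across diaries.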
How to use it
Implementation approach:
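In practice the synthesis step would be an LLM agent reading diaries; the sketch below replaces that with simple frequency counting to show the promotion logic (the `lessons` field and `min_support` threshold are assumptions, not part of the source):

```python
from collections import Counter

def synthesize(diaries, min_support=3):
    """Promote lessons that recur across multiple task diaries.

    Each diary is a dict with a 'lessons' list (this schema is an
    assumption; the source doesn't specify one). A lesson counts at
    most once per diary, so support reflects independent tasks.
    """
    counts = Counter(
        lesson
        for diary in diaries
        for lesson in set(diary.get("lessons", []))
    )
    # Only keep lessons backed by multiple occurrences --
    # the "evidence-based" filter from the trade-offs list.
    return [lesson for lesson, n in counts.items() if n >= min_support]

diaries = [
    {"task": "add auth header", "lessons": ["refresh test fixtures first"]},
    {"task": "bump SDK", "lessons": ["refresh test fixtures first"]},
    {"task": "fix retry bug", "lessons": ["refresh test fixtures first",
                                          "make this button pink"]},
]
print(synthesize(diaries, min_support=3))
```

Note how "make this button pink" never reaches the threshold: one-off instructions stay in the diary tier and never get promoted, which is exactly the behavior Boris Cherny describes.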
Trade-offs
Pros:
- Pattern detection: Finds recurring issues humans might miss
- Right abstraction level: Synthesis across multiple tasks reveals what's general
- Automatic knowledge extraction: Doesn't rely on humans remembering to document
- Evolving memory: System learns and improves over time
- Evidence-based: Patterns backed by multiple occurrences, not speculation
Cons:
- Storage overhead: Must persist all task logs
- Synthesis complexity: Requires sophisticated agents to extract good patterns
- False patterns: May identify coincidental correlations
- Maintenance burden: Synthesized rules need periodic review
- Privacy concerns: Logs may contain sensitive information
- Token costs: Synthesis over many logs is expensive
- Cold start problem: Too little data early on for reliable pattern extraction
Open questions:
- How many occurrences validate a pattern?
- How to prune outdated or wrong patterns?
- What's the right synthesis frequency?
- How to handle conflicting patterns across logs?
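The first two open questions can be made concrete with a small pruning rule; this is one illustrative answer (a fixed support threshold plus an age cutoff), not a recommendation from the source:

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    text: str
    support: int         # number of diaries that exhibited it
    last_seen_cycle: int # synthesis cycle in which it was last confirmed

def prune(patterns, current_cycle, max_age=5, min_support=3):
    """Drop synthesized patterns that are stale or under-supported.

    A pattern survives only if enough independent tasks back it AND
    it has been re-observed within the last `max_age` cycles.
    """
    return [
        p for p in patterns
        if p.support >= min_support
        and current_cycle - p.last_seen_cycle <= max_age
    ]

pats = [
    Pattern("refresh test fixtures before running tests", 7, 12),
    Pattern("make all buttons pink", 1, 3),       # too specific, low support
    Pattern("old lint-rule workaround", 6, 2),    # well supported but stale
]
print([p.text for p in prune(pats, current_cycle=12)])
```

The age cutoff addresses pruning of outdated rules, and the support threshold addresses validation; both values would need tuning against real diary volume.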
References
- Cat Wu: "Some people at Anthropic, for every task they do, tell Claude Code to write a diary entry in a specific format: What did it try? Why didn't it work? And then they even have these agents that look over the past memory and synthesize it into observations."
- Boris Cherny: "Synthesizing the memory from a lot of logs is a way to find these patterns more consistently... If I say make the button pink, I don't want you to remember to make all buttons pink in the future."
- AI & I Podcast: How to Use Claude Code Like the People Who Built It
- Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023) - episodic memory with self-reflection achieving 91% pass@1 on HumanEval
- Park et al. Generative Agents: Interactive Simulacra of Human Behavior (Stanford 2023) - reflection synthesis from multiple memories