Category: Learning & Adaptation · Status: Proposed

Memory Reinforcement Learning (MemRL)

MemRL lets a frozen LLM self-improve at runtime by attaching learned utility scores to episodic memories: retrieval first filters candidates by semantic similarity, then re-ranks them by how well they have actually worked, so the agent reuses strategies that succeed rather than ones that merely look relevant.

By Nikola Balic (@nibzard)


Problem

LLMs struggle with runtime self-evolution due to the stability-plasticity dilemma:

  • Fine-tuning: Computationally expensive and prone to catastrophic forgetting
  • RAG/memory systems: Retrieve by semantic similarity alone, which pulls in noise along with signal
  • No utility learning: Can't distinguish high-value strategies from semantically similar but ineffective ones

Standard retrieval assumes "similar implies useful," but that's often wrong. A semantically relevant past solution might actually be a bad approach for the current task.


Solution

MemRL transfers reinforcement learning from parameter space to context space: instead of updating model weights, it learns utility scores on episodic memories. The LLM stays frozen; only memory utilities evolve.

Core idea: Instead of just retrieving by similarity, rank memories by how well they've worked in the past.

Memory triplet structure:

  • Intent: What the user asked for (embedded)
  • Experience: What the agent tried (solution trace)
  • Utility: How well it worked (learned score, updated over time)
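
One way to make the triplet concrete is a small typed record. This is a minimal sketch only: the field names mirror the bullets above, and the 0.5 starting utility matches the how-to section below (which uses plain dicts with the same fields):

    from dataclasses import dataclass

    @dataclass
    class Memory:
        intent: list[float]   # embedding of what the user asked for
        experience: str       # the solution trace the agent tried
        utility: float = 0.5  # learned usefulness score, updated after each outcome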

Two-phase retrieval:

  1. Phase A - Semantic filter: Find semantically similar memories
  2. Phase B - Utility ranking: Re-rank by learned utility scores

This filters out "distractor" memories that look relevant but historically lead to poor outcomes.
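
A rough sketch of the two phases, assuming dict-shaped memories as in the how-to below, cosine similarity over plain embedding vectors, and an illustrative threshold; none of these specifics are prescribed by the pattern itself:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_vec, memory_bank, sim_threshold=0.7, k=3):
        # Phase A: semantic filter - keep only memories whose intent looks relevant
        candidates = [m for m in memory_bank
                      if cosine(query_vec, m["intent"]) >= sim_threshold]
        # Phase B: utility ranking - prefer memories that have actually worked
        return sorted(candidates, key=lambda m: m["utility"], reverse=True)[:k]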

graph LR
    A[Query] --> B[Find Similar Memories]
    B --> C[Rank by Utility Scores]
    C --> D[Use Top Memories]
    D --> E[Get Result]
    E --> F[Update Utilities]
    F --> G[Store New Experience]
    style C fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style F fill:#e3f2fd,stroke:#1976d2,stroke-width:2px

How to use it

Basic implementation:

  1. Store experiences with utility scores

    memory_bank.append({
        "intent": embed(query),
        "experience": solution_trace,
        "utility": 0.5  # initial score, learned over time
    })
    
  2. Retrieve with utility ranking

    # First: filter by similarity
    candidates = similar_memories(query, threshold=0.7)
    
    # Then: re-rank by utility
    ranked = sorted(candidates, key=lambda m: m["utility"], reverse=True)
    context = ranked[:k]
    
  3. Update utilities based on outcomes

    reward = 1 if success else 0
    for mem in context:
        # nudge each reused memory's utility toward the observed reward
        mem["utility"] += learning_rate * (reward - mem["utility"])
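
The update is an exponential moving average, so a memory's utility drifts toward its recent success rate. Worked numbers as an illustration (a learning_rate of 0.1 is an arbitrary choice, not a recommended value):

    utility = 0.5
    utility += 0.1 * (1 - utility)   # one success:  0.5  -> 0.55
    utility += 0.1 * (0 - utility)   # one failure:  0.55 -> 0.495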
    

Why this works:

  • Successful memories get higher scores, retrieved more often
  • Failed memories get downranked, even if semantically similar
  • Frozen LLM stays stable; only memory utilities evolve
  • Agent self-improves through runtime experience
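
Putting the three steps together, here is one episode of the loop as a self-contained sketch. The `fake_embed` stub, the similarity threshold, and the hard-coded success flag are placeholders standing in for a real embedding model and a real task-verification signal; only the store/retrieve/update structure is the pattern itself:

    import numpy as np

    def fake_embed(text: str, dim: int = 16) -> np.ndarray:
        # Stand-in for a real embedding model: pseudo-random unit vector per text
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        v = rng.normal(size=dim)
        return v / np.linalg.norm(v)

    memory_bank: list[dict] = []

    def retrieve(query_vec, threshold=0.3, k=2):
        # Phase A (similarity filter), then Phase B (utility ranking)
        candidates = [m for m in memory_bank
                      if float(query_vec @ m["intent"]) >= threshold]
        return sorted(candidates, key=lambda m: m["utility"], reverse=True)[:k]

    def update(used, success, learning_rate=0.1):
        reward = 1.0 if success else 0.0
        for mem in used:
            mem["utility"] += learning_rate * (reward - mem["utility"])

    # --- one episode ---
    query = "parse a CSV file and sum a column"
    context = retrieve(fake_embed(query))   # memories offered to the frozen LLM
    success = True                          # would come from task verification
    update(context, success)                # reinforce or downrank what was reused
    memory_bank.append({                    # store the new experience
        "intent": fake_embed(query),
        "experience": "used csv.DictReader and summed the column",
        "utility": 0.5,
    })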

Trade-offs

Pros:

  • No catastrophic forgetting (frozen LLM)
  • Self-improves from experience
  • Filters out "look-alike" bad solutions
  • No retraining needed

Cons:

  • Need reliable success/failure signals
  • Memory overhead grows over time
  • Cold start: needs episodes to learn
  • More complex than basic RAG

When to use:

  • Multi-step tasks with clear success signals
  • Reusable problem-solving patterns
  • Can't afford fine-tuning

When NOT to use:

  • Single-turn queries
  • No clear reward signals
  • Highly diverse tasks (no patterns)