Explicit Posterior-Sampling Planner

Problem

Heuristic planning loops often over-exploit the first plausible strategy and under-explore alternatives. In uncertain environments, this drives repeated dead ends, unstable learning, and high token/API spend with little information gain.

Solution

Embed a fully specified RL algorithm—Posterior Sampling for Reinforcement Learning (PSRL)—inside the LLM's reasoning:

Maintain a Bayesian posterior over task models.
Sample a model, compute an optimal plan/policy, execute, observe reward, update posterior.
Express each step in natural language so the core LLM can carry it out with tool calls.

The planner becomes an explicit exploration policy instead of an improvised chain of thoughts. By repeatedly sampling from the posterior, the agent balances exploration and exploitation with a principled uncertainty model rather than ad-hoc retries. This is Thompson sampling generalized to multi-state MDPs, with near-optimal regret bounds of O(√T).

How to use it

Wrap PSRL in a reusable prompt template or controller skeleton with explicit state variables (posterior, reward, horizon). Start in bounded environments with measurable reward signals and instrument posterior updates for debugging. For text-based environments, design a state abstraction (e.g., semantic hashing or embedding-based clustering) to map unstructured context to discrete MDP states.

Trade-offs

Pros: More sample-efficient exploration and better decision consistency under uncertainty.
Cons: Higher implementation complexity, sensitive reward design, additional compute overhead, and requires careful state abstraction for text environments.

Best for: Small-to-medium state spaces (<10k states) where sample efficiency matters and reward signals are informative.

Production status: While Thompson sampling is widely deployed for bandit problems (Netflix, Amazon, Spotify), PSRL embedded in LLM reasoning remains emerging with no verified production implementations.

References

Arumugam & Griffiths, Toward Efficient Exploration by LLM Agents (2025)
Strens, A Bayesian Framework for Reinforcement Learning (ICML 2000)
Osband et al., More Efficient Reinforcement Learning via Posterior Sampling (NeurIPS 2013)
Primary source: https://arxiv.org/abs/2504.20997