Problem
Recursive delegation (parent agent → sub-agents → sub-sub-agents) decomposes big tasks, but has several failure modes:
- A single weak sub-agent result can poison the parent's next steps (wrong assumption, missed file, bad patch)
- Errors compound up the tree: "one bad leaf" can derail the whole rollout
- Pure recursion underuses parallelism when a node is uncertain: you want multiple shots right where the ambiguity is
Meanwhile, "best-of-N" parallel attempts help reliability, but without structure they waste compute by repeatedly solving the same problem instead of decomposing it. This pattern instead applies parallelism only where uncertainty exists, at the subtask level, while maintaining structured decomposition.
Solution
At each node in a recursive agent tree, run best-of-N for the current subtask before expanding further. This combines the structured decomposition of recursive delegation with the reliability of self-consistency sampling:
- Decompose: Parent turns task into sub-tasks (like normal recursive delegation)
- Parallel candidates per subtask: For each subtask, spawn K candidate workers in isolated sandboxes (K=2-5 typical)
- Score candidates: Use a judge that combines:
  - Automated signals (tests, lint, exit code, diff size, runtime)
  - LLM-as-judge rubric (correctness, adherence to constraints, simplicity)
- Select + promote: Pick the top candidate as the "canonical" result for that subtask
- Escalate uncertainty: If the judge confidence is low (or candidates disagree), either:
  - Increase K for that subtask, or
  - Spawn a focused "investigator" sub-agent to gather missing facts, then re-run selection
- Aggregate upward: Parent synthesizes selected results and continues recursion
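The per-node loop above (spawn K candidates, judge, promote, escalate) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `worker`, `judge`, and the escalation policy (one retry with a larger K) are all stand-in assumptions; a real system would run workers in isolated sandboxes and judge with tests plus an LLM rubric.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    output: str
    score: float  # combined judge score in [0, 1]


def judge(output: str) -> float:
    # Placeholder judge: a real one would combine automated signals
    # (tests, lint, diff size) with an LLM rubric. This toy version
    # simply rewards shorter outputs, a stand-in for "minimal diff".
    return 1.0 / (1.0 + len(output))


def best_of_n(subtask: str, worker, k: int = 2, confidence_floor: float = 0.5):
    """Run k candidate workers for one subtask and promote the best.

    `worker(subtask, seed)` is a hypothetical callable standing in for a
    sandboxed sub-agent; each seed represents one isolated attempt.
    """
    candidates = [
        Candidate(out, judge(out))
        for seed in range(k)
        for out in [worker(subtask, seed)]
    ]
    best = max(candidates, key=lambda c: c.score)
    if best.score < confidence_floor:
        # Escalate: confidence is low, so re-run once with a larger K
        # (floor of 0.0 guarantees the retry terminates).
        return best_of_n(subtask, worker, k=k + 3, confidence_floor=0.0)
    return best
```

In a real orchestrator the parent would call `best_of_n` once per subtask and feed the promoted outputs into the next level of recursion.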
How to use it
Best for tasks where:
- Subtasks are shardable, but each shard can be tricky (ambiguous API use, repo-specific conventions)
- You can score outputs cheaply (unit tests, type checks, lint, golden files)
- "One wrong move" is costly (migration diffs, security-sensitive changes, large refactors)
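Cheap scoring is what makes the pattern economical. One way to fold the automated signals mentioned above into a single candidate score is sketched below; the weights, the hard gate on tests, and the `max_diff` cutoff are illustrative assumptions, not part of the pattern.

```python
def score_patch(tests_passed: bool, lint_errors: int,
                diff_lines: int, max_diff: int = 200) -> float:
    """Combine cheap automated signals into one score in [0, 1].

    Illustrative weighting only: failing tests disqualify outright,
    lint errors and large diffs apply graded penalties.
    """
    if not tests_passed:
        return 0.0  # hard gate: a failing candidate is never promoted
    lint_penalty = min(lint_errors * 0.1, 0.5)          # cap lint impact
    size_penalty = min(diff_lines / max_diff, 1.0) * 0.3  # prefer small diffs
    return max(0.0, 1.0 - lint_penalty - size_penalty)
```

An LLM-as-judge score can then be blended with this objective score, with the objective part acting as a veto on clearly broken candidates.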
Practical defaults:
- Start with K=2 for most subtasks
- Increase to K=5 only on "high uncertainty" nodes (low judge confidence, conflicting outputs, failing tests)
- Keep the rubric explicit: "must pass tests; minimal diff; no new dependencies; follow style guide"
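The defaults above (start at K=2, escalate to K=5 only on high-uncertainty nodes) amount to a small scheduling rule. A sketch under assumed thresholds, which are tuning knobs rather than fixed values:

```python
def next_k(current_k: int, judge_confidence: float,
           candidates_agree: bool, k_max: int = 5) -> int:
    """Pick K for the next round of a subtask (illustrative thresholds).

    Start cheap and escalate only on uncertainty signals: low judge
    confidence or disagreement among candidates.
    """
    if judge_confidence >= 0.8 and candidates_agree:
        return current_k  # no sign of trouble: keep compute minimal
    if judge_confidence < 0.5 or not candidates_agree:
        return min(current_k + 2, k_max)  # strong uncertainty: jump toward K=5
    return min(current_k + 1, k_max)      # mild uncertainty: step up gently
```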
Trade-offs
Pros:
- Much more robust than single-shot recursion: local uncertainty gets extra shots
- Compute is targeted: you spend K where it matters, not globally
- Works naturally with sandboxed execution and patch-based workflows
Cons:
- More orchestration complexity (judge, scoring, confidence thresholds)
- Higher cost/latency if you overuse K
- Judge quality becomes a bottleneck; add objective checks whenever possible
References
- Self-Consistency (Wang et al. 2022): Foundation for best-of-N sampling via majority voting
- Recursive Language Models (arXiv 2512.24601, 2025): Recursion as inference-time scaling
- Tree-of-Thoughts (Yao et al. 2023): Tree-based reasoning with evaluation mechanisms
- Labruno (GitHub): Parallel sandboxes + LLM judge selects best implementation
- Daytona RLM Guide: Recursive delegation with sandboxed execution
- Related patterns: Sub-Agent Spawning, Swarm Migration Pattern, Self-Critique / Evaluator loops