Problem
Recursive delegation (parent agent → sub-agents → sub-sub-agents) decomposes big tasks, but has several failure modes:
- A single weak sub-agent result can poison the parent's next steps (wrong assumption, missed file, bad patch)
- Errors compound up the tree: "one bad leaf" can derail the whole rollout
- Pure recursion underuses parallelism when a node is uncertain: you want multiple shots right where the ambiguity is
Meanwhile, "best-of-N" parallel attempts help reliability, but without structure they waste compute by repeatedly solving the same problem instead of decomposing it. This pattern instead applies parallelism only where uncertainty exists, at the subtask level, while maintaining structured decomposition.
Solution
At each node in a recursive agent tree, run best-of-N for the current subtask before expanding further. This combines the structured decomposition of recursive delegation with the reliability of self-consistency sampling:
- Decompose: Parent turns task into sub-tasks (like normal recursive delegation)
- Parallel candidates per subtask: For each subtask, spawn K candidate workers in isolated sandboxes (K=2-5 typical)
- Score candidates: Use a judge that combines:
  - Automated signals (tests, lint, exit code, diff size, runtime)
  - LLM-as-judge rubric (correctness, adherence to constraints, simplicity)
- Select + promote: Pick the top candidate as the "canonical" result for that subtask
- Escalate uncertainty: If the judge confidence is low (or candidates disagree), either:
  - Increase K for that subtask, or
  - Spawn a focused "investigator" sub-agent to gather missing facts, then re-run selection
- Aggregate upward: Parent synthesizes selected results and continues recursion
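The per-node loop above (spawn K candidates, judge, promote, escalate) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `worker`, `judge`, and the escalation policy (one retry with a larger K) are all stand-in assumptions; a real system would run workers in isolated sandboxes and judge with tests plus an LLM rubric.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    output: str
    score: float  # combined judge score in [0, 1]


def judge(output: str) -> float:
    # Placeholder judge: a real one would combine automated signals
    # (tests, lint, diff size) with an LLM rubric. This toy version
    # simply rewards shorter outputs, a stand-in for "minimal diff".
    return 1.0 / (1.0 + len(output))


def best_of_n(subtask: str, worker, k: int = 2, confidence_floor: float = 0.5):
    """Run k candidate workers for one subtask and promote the best.

    `worker(subtask, seed)` is a hypothetical callable standing in for a
    sandboxed sub-agent; each seed represents one isolated attempt.
    """
    candidates = [
        Candidate(out, judge(out))
        for seed in range(k)
        for out in [worker(subtask, seed)]
    ]
    best = max(candidates, key=lambda c: c.score)
    if best.score < confidence_floor:
        # Escalate: confidence is low, so re-run once with a larger K
        # (floor of 0.0 guarantees the retry terminates).
        return best_of_n(subtask, worker, k=k + 3, confidence_floor=0.0)
    return best
```

In a real orchestrator the parent would call `best_of_n` once per subtask and feed the promoted outputs into the next level of recursion.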
How to use it
Best for tasks where:
- Subtasks are shardable, but each shard can be tricky (ambiguous API use, repo-specific conventions)
- You can score outputs cheaply (unit tests, type checks, lint, golden files)
- "One wrong move" is costly (migration diffs, security-sensitive changes, large refactors)
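Cheap scoring is what makes the pattern economical. One way to fold the automated signals mentioned above into a single candidate score is sketched below; the weights, the hard gate on tests, and the `max_diff` cutoff are illustrative assumptions, not part of the pattern.

```python
def score_patch(tests_passed: bool, lint_errors: int,
                diff_lines: int, max_diff: int = 200) -> float:
    """Combine cheap automated signals into one score in [0, 1].

    Illustrative weighting only: failing tests disqualify outright,
    lint errors and large diffs apply graded penalties.
    """
    if not tests_passed:
        return 0.0  # hard gate: a failing candidate is never promoted
    lint_penalty = min(lint_errors * 0.1, 0.5)          # cap lint impact
    size_penalty = min(diff_lines / max_diff, 1.0) * 0.3  # prefer small diffs
    return max(0.0, 1.0 - lint_penalty - size_penalty)
```

An LLM-as-judge score can then be blended with this objective score, with the objective part acting as a veto on clearly broken candidates.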
Practical defaults:
- Start with K=2 for most subtasks
- Increase to K=5 only on "high uncertainty" nodes (low judge confidence, conflicting outputs, failing tests)
- Keep the rubric explicit: "must pass tests; minimal diff; no new dependencies; follow style guide"
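The defaults above (start at K=2, escalate to K=5 only on high-uncertainty nodes) amount to a small scheduling rule. A sketch under assumed thresholds, which are tuning knobs rather than fixed values:

```python
def next_k(current_k: int, judge_confidence: float,
           candidates_agree: bool, k_max: int = 5) -> int:
    """Pick K for the next round of a subtask (illustrative thresholds).

    Start cheap and escalate only on uncertainty signals: low judge
    confidence or disagreement among candidates.
    """
    if judge_confidence >= 0.8 and candidates_agree:
        return current_k  # no sign of trouble: keep compute minimal
    if judge_confidence < 0.5 or not candidates_agree:
        return min(current_k + 2, k_max)  # strong uncertainty: jump toward K=5
    return min(current_k + 1, k_max)      # mild uncertainty: step up gently
```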
Trade-offs
Pros:
- Much more robust than single-shot recursion: local uncertainty gets extra shots
- Compute is targeted: you spend K where it matters, not globally
- Works naturally with sandboxed execution and patch-based workflows
Cons:
- More orchestration complexity (judge, scoring, confidence thresholds)
- Higher cost/latency if you overuse K
- Judge quality becomes a bottleneck; add objective checks whenever possible
References
- Self-Consistency (Wang et al. 2022): Foundation for best-of-N sampling via majority voting
- Recursive Language Models (arXiv 2512.24601, 2025): Recursion as inference-time scaling
- Tree-of-Thoughts (Yao et al. 2023): Tree-based reasoning with evaluation mechanisms
- Labruno (GitHub): Parallel sandboxes + LLM judge selects best implementation
- Daytona RLM Guide: Recursive delegation with sandboxed execution
- Related patterns: Sub-Agent Spawning, Swarm Migration Pattern, Self-Critique / Evaluator loops