Problem
Parallel sandboxes are intoxicating: you can spawn 10... 100... 1000 runs. But several things break quickly:
- Diminishing returns: After some N, you're mostly paying for redundant failures or near-duplicate solutions
- Prompt fragility: If the prompt is underspecified, scaling N just scales errors (lots of sandboxes fail fast)
- Resource risk: Unbounded fan-out can overwhelm budgets, rate limits, or queues
- Oscillation risk: Poorly tuned thresholds can cause scale-up/scale-down thrashing as the controller oscillates between decisions
Static "N=10 always" policies don't adapt to task difficulty, model variance, or observed failure rates. Most implementations use static caps rather than true signal-driven adaptation.
Solution
Add a controller that adapts fan-out in real time based on observed signals from early runs.
Core loop:
1. Start small: Launch a small batch (e.g., N=3-5) in parallel
2. Early signal sampling: As soon as the first X runs finish (or after T seconds), compute:
   - success rate (exit code / test pass)
   - diversity score (are solutions meaningfully different?)
   - judge confidence / winner margin
   - error clustering (same error everywhere vs. varied errors)
3. Decide next action:
   - Scale up if: success rate is good but quality variance is high (you want a better winner)
   - Stop early if: judge is confident + tests pass + solutions converge
   - Refine prompt / spec if: error clustering is high (everyone fails the same way)
   - Switch strategy if: repeated failure suggests decomposition is needed (spawn an investigative sub-agent)
4. Budget guardrails: Enforce max sandboxes, max runtime, and "no-progress" stop conditions
5. Hysteresis for stability: Use different thresholds for scale-up vs. stop (e.g., scale up if confidence < 0.65, stop only if > 0.75) to prevent oscillation
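The signal-sampling step can be sketched as follows. This is a minimal illustration, not a reference implementation: the RunResult record, the normalized error-signature field, and the choice of metrics (Jaccard token overlap for diversity, most-common-signature share for error clustering) are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical per-sandbox result record (illustrative assumption).
@dataclass
class RunResult:
    passed: bool           # did the objective check (tests / exit code) succeed?
    error_signature: str   # normalized first error line; "" if passed
    solution_tokens: set   # token set of the produced solution/diff

def jaccard(a: set, b: set) -> float:
    """Similarity of two solutions' token sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def sample_signals(runs: list[RunResult]) -> dict:
    """Compute early signals from finished runs: success rate,
    error clustering, and solution diversity."""
    success_rate = sum(r.passed for r in runs) / len(runs)

    # Error clustering: share of failing runs with the most common signature.
    failures = [r.error_signature for r in runs if not r.passed]
    if failures:
        top_count = Counter(failures).most_common(1)[0][1]
        error_clustering = top_count / len(failures)
    else:
        error_clustering = 0.0

    # Diversity: 1 - mean pairwise similarity among successful solutions.
    winners = [r.solution_tokens for r in runs if r.passed]
    if len(winners) >= 2:
        sims = [jaccard(a, b) for i, a in enumerate(winners)
                for b in winners[i + 1:]]
        diversity = 1.0 - sum(sims) / len(sims)
    else:
        diversity = 0.0

    return {"success_rate": success_rate,
            "error_clustering": error_clustering,
            "diversity": diversity}
```

Judge confidence would come from whatever ranking model scores the candidates; it is deliberately omitted here since it is external to the sandbox results.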
How to use it
Use when:
- You're doing "best-of-N codegen + execution" in sandboxes
- You have cheap objective checks (unit tests, static analysis, schema validation)
- Latency and cost matter: you want the minimum N that achieves reliability
Concrete heuristics (example):
- Start N=3
- If >=2 succeed but disagree and judge confidence < 0.65 -> add +3 more
- If 0 succeed and the top error signature covers >70% of runs -> run a "spec clarifier" step, then restart
- Hysteresis: Stop only if confidence > 0.75 (higher threshold than scale-up) to prevent thrash
- Hard cap: N_max (e.g., 50), runtime cap, and "two refinement attempts then decompose"
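The heuristics above can be expressed as a small decision function. The thresholds (0.65 scale-up, 0.75 stop, 70% error clustering, N_max=50, batches of +3) come directly from the bullets; the function name and signature are illustrative assumptions.

```python
# Thresholds from the heuristics above; the gap between SCALE_UP_CONF
# and STOP_CONF is the hysteresis band that prevents thrashing.
SCALE_UP_CONF = 0.65     # scale up while judge confidence is below this
STOP_CONF = 0.75         # stop only once confidence exceeds this
ERROR_CLUSTER_MAX = 0.70
N_MAX = 50
BATCH = 3

def decide(n_launched: int, n_succeeded: int,
           judge_confidence: float, error_clustering: float) -> str:
    """Return the next action: 'refine_spec', 'stop', 'scale_up', or 'wait'."""
    if n_succeeded == 0 and error_clustering > ERROR_CLUSTER_MAX:
        # Everyone fails the same way: the spec is the bottleneck, not N.
        return "refine_spec"
    if judge_confidence > STOP_CONF:
        return "stop"
    if (n_succeeded >= 2 and judge_confidence < SCALE_UP_CONF
            and n_launched + BATCH <= N_MAX):
        return "scale_up"   # launch BATCH more runs
    return "wait"           # inside the hysteresis band, or at the cap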
Trade-offs
Pros:
- Prevents "scale errors" when prompts are bad
- Lowers spend by stopping early when a clear winner appears
- Makes sandbox swarms production-safe via budgets and no-progress stopping
Cons:
- Requires instrumentation (collecting failure signatures, confidence, diversity)
- Needs careful defaults and hysteresis to avoid oscillation (scale up/down thrash)
- Bad scoring functions can cause premature stopping
- Few verified implementations; most systems use static caps instead of true signal-driven adaptation
References
- Labruno: Scaling number of parallel sandboxes + judging winners (video) — Note: uses a static MAX_SANDBOXES rather than true signal-driven adaptation
- Labruno (GitHub) — Parallel execution with post-hoc judging, not adaptive fan-out
- OpenClaw Orchestrator — Closest verified implementation; LLM decides next steps based on accumulated results
- Related patterns: Swarm Migration Pattern (batch tuning, resource caps), Sub-Agent Spawning (switch to decomposition when needed)