Problem
During reinforcement learning training with tool-using agents, multiple rollouts execute simultaneously and may call destructive or stateful tools. This challenge is well-established in distributed RL research—A3C (Mnih et al., 2016) and PPO (Schulman et al., 2017) both rely on parallel isolated environment instances for stable gradient estimation.
- Cross-contamination: One rollout's actions affect another rollout's environment
- Destructive commands: Agent might run rm -rf, corrupting shared state
- State leakage: File system changes persist across rollouts, creating inconsistent training data
- Reward corruption: If rollout B sees rollout A's side effects, reward signals become meaningless
- Debugging nightmares: Non-deterministic failures due to race conditions
Cognition faced this when training Devin's file planning agent: the agent had access to a shell tool that could run arbitrary commands like grep, find, or even rm. Running 32 parallel rollouts on shared infrastructure would cause chaos.
Solution
Spin up an isolated virtual machine (or container) for each RL rollout, ensuring complete environment isolation.
Architecture:
- Rollout ID Tracking: OpenAI's Agent RFT platform assigns unique IDs to each rollout
- VM/Container Mapping: Your infrastructure maps rollout ID → dedicated VM
- Clean State: Each VM starts fresh with identical filesystem, packages, and configuration
- Cleanup: VMs are destroyed after rollout completes (success or failure)
Key Components:
- VM Provisioning: Fast VM creation (typically cloud instances or containers)
- Bursty Scaling: Handle 100s-500s of simultaneous VM requests at training step boundaries
- State Isolation: No shared filesystems or databases between VMs
- Timeout Handling: VMs auto-destroy after timeout to prevent resource leaks
```python
# Infrastructure setup (Cognition's approach)
from modal import App, Image, method

# Base VM image with all dependencies
base_image = (
    Image.debian_slim()
    .apt_install("git", "build-essential")
    .pip_install("pandas", "numpy", "openai")
    .copy_local_dir("./corpus", "/workspace/corpus")  # Training data
)

app = App("agent-rft-tool-server")

@app.cls(
    image=base_image,
    cpu=2,
    memory=4096,
    timeout=600,  # 10 min per rollout max
)
class IsolatedToolExecutor:
    """
    Each instance gets its own isolated VM,
    spun up per-rollout during RL training.
    """

    def __init__(self):
        """Initialize fresh state for this rollout"""
        self.rollout_id = None
        self.workspace = "/workspace"
        self.history = []

    @method()
    def initialize_rollout(self, rollout_id: str):
        """
        Called first when a rollout starts.
        Sets up isolated state for this specific rollout.
        """
        self.rollout_id = rollout_id
        print(f"[{rollout_id}] Initialized isolated VM")

        # Create isolated working directory
        import os
        self.work_dir = f"{self.workspace}/rollout_{rollout_id}"
        os.makedirs(self.work_dir, exist_ok=True)
        return {"status": "ready", "rollout_id": rollout_id}

    @method()
    def execute_shell(self, rollout_id: str, command: str):
        """
        Execute a shell command in the isolated environment.
        Safe because this VM is dedicated to this rollout.
        """
        if rollout_id != self.rollout_id:
            raise ValueError(f"Rollout ID mismatch: {rollout_id} != {self.rollout_id}")

        import subprocess
        print(f"[{rollout_id}] Executing: {command}")

        # Even destructive commands are safe in an isolated VM
        result = subprocess.run(
            command,
            shell=True,
            cwd=self.work_dir,
            capture_output=True,
            text=True,
            timeout=60,
        )
        self.history.append({
            "command": command,
            "returncode": result.returncode,
            "stdout": result.stdout[:1000],  # Limit output
            "stderr": result.stderr[:1000],
        })
        return {
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
        }

    @method()
    def read_file(self, rollout_id: str, filepath: str):
        """Read a file from the corpus or workspace"""
        if rollout_id != self.rollout_id:
            raise ValueError("Rollout ID mismatch")

        # Files are isolated to this VM
        full_path = f"{self.workspace}/{filepath}"
        try:
            with open(full_path, "r") as f:
                content = f.read()
            return {"content": content, "error": None}
        except Exception as e:
            return {"content": None, "error": str(e)}

    @method()
    def search_corpus(self, rollout_id: str, query: str):
        """Semantic search over documents"""
        if rollout_id != self.rollout_id:
            raise ValueError("Rollout ID mismatch")

        # Search implementation here...
        # Corpus is read-only, copied into the VM at startup
        return {"results": [...]}

    @method()
    def cleanup(self, rollout_id: str):
        """
        Optional cleanup (Modal handles VM destruction automatically)
        """
        print(f"[{rollout_id}] Rollout complete, VM will be destroyed")
        return {"history": self.history}

# Tool endpoint configuration for OpenAI Agent RFT
tools_config = [
    {
        "name": "shell",
        "url": "https://your-app.modal.run/execute_shell",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},
    },
    {
        "name": "read_file",
        "url": "https://your-app.modal.run/read_file",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},
    },
    {
        "name": "search",
        "url": "https://your-app.modal.run/search_corpus",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},
    },
]
```
Request Flow: Agent RFT sends each tool call with its rollout ID → your endpoint routes it to that rollout's dedicated VM → the result returns to the trainer.
How to use it
Phase 1: Infrastructure Setup
Choose your isolation technology:
- Modal/E2B: MicroVMs with ~1s startup, Firecracker isolation (recommended for agent training)
- AWS Lambda/Cloud Functions: Serverless functions with container isolation (easiest)
- Docker: Containers per rollout (good balance)
- Kubernetes Jobs: K8s pods per rollout (production-grade)
- Cloud VMs: EC2/GCP instances per rollout (maximum isolation, 30-120s startup)
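Whichever backend you pick, the core mechanic is the same: a registry that maps each rollout ID to a dedicated, throwaway environment. A minimal Docker-based sketch (the DockerRolloutManager class, image name, and resource limits are illustrative assumptions, not part of any platform's API):

```python
import shlex

class DockerRolloutManager:
    """Maps each rollout ID to a dedicated container (hypothetical sketch)."""

    def __init__(self, image: str = "agent-tools:latest"):
        self.image = image
        self.active: dict[str, str] = {}  # rollout_id -> container name

    def provision_cmd(self, rollout_id: str) -> str:
        """Build a `docker run` command for a fresh, isolated container."""
        name = f"rollout-{rollout_id}"
        self.active[rollout_id] = name
        # --rm ensures the container (and its filesystem) vanishes on exit;
        # memory/cpu limits keep one runaway rollout from starving the rest
        return (
            f"docker run -d --rm --name {shlex.quote(name)} "
            f"--memory 4g --cpus 2 {shlex.quote(self.image)}"
        )

    def teardown_cmd(self, rollout_id: str) -> str:
        """Build the command that destroys a rollout's container."""
        name = self.active.pop(rollout_id)
        return f"docker kill {shlex.quote(name)}"

mgr = DockerRolloutManager()
cmd = mgr.provision_cmd("abc123")
```

The same shape works for any backend: swap the command builders for EC2, Kubernetes Jobs, or Modal API calls, and keep the rollout-ID → environment registry unchanged.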
Phase 2: Implement Rollout ID Tracking
```python
# All tool endpoints must accept and validate rollout_id
@app.post("/tool/{tool_name}")
async def execute_tool(tool_name: str, rollout_id: str, params: dict):
    # Get or create the isolated environment for this rollout
    vm = get_or_create_vm(rollout_id)
    # Execute in the isolated context
    result = vm.execute(tool_name, params)
    return result
```
Phase 3: Handle Bursty Traffic
Agent RFT sends traffic in bursts:
- Training step boundary: 100-500 simultaneous rollout requests
- Tool call latency: Brief pauses while agent thinks
- Cleanup phase: Mass VM destruction
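One way to absorb a training-step burst without rejecting requests is to cap concurrent provisioning with a semaphore and queue the overflow. A minimal asyncio sketch (BurstProvisioner is a hypothetical name; the sleep stands in for the real VM-creation call):

```python
import asyncio

class BurstProvisioner:
    """Caps concurrent VM provisioning during training-step bursts (sketch)."""

    def __init__(self, max_concurrent: int = 500):
        self.sem = asyncio.Semaphore(max_concurrent)

    async def provision(self, rollout_id: str) -> str:
        # Queue excess requests instead of failing them at the burst peak
        async with self.sem:
            await asyncio.sleep(0)  # stand-in for the real VM-creation call
            return f"vm-{rollout_id}"

async def handle_training_step(n_rollouts: int) -> list[str]:
    prov = BurstProvisioner(max_concurrent=50)
    # A training-step boundary fires all rollout requests at once
    return await asyncio.gather(*(prov.provision(str(i)) for i in range(n_rollouts)))

vms = asyncio.run(handle_training_step(200))
```

Queuing instead of rejecting matters here: a rejected provisioning request becomes a zero-reward rollout, which is exactly the infrastructure-induced failure mode the monitoring phase below warns about.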
Configure auto-scaling:
```python
# Modal example
@app.cls(
    image=base_image,
    concurrency_limit=500,       # Max concurrent VMs
    container_idle_timeout=60,   # Cleanup after 1 min idle
)
```
Phase 4: Monitor Infrastructure
Critical metrics:
- VM provisioning time: Should be <5 seconds (alert if >10s)
- Infrastructure error rate: Target <1%, alert if >5% (higher rates cause training collapse)
- Rollout timeout rate: Target <0.1%, alert if >1%
- Resource leaks: Active rollout count should match expected; alert if +50 over baseline
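The resource-leak metric above can be computed from a simple registry of active rollouts. A sketch (RolloutLeakMonitor is a hypothetical name; the threshold mirrors the +50-over-baseline alert rule):

```python
class RolloutLeakMonitor:
    """Flags leaked VMs by comparing active rollouts to the expected count (sketch)."""

    def __init__(self, leak_threshold: int = 50):
        self.active: set[str] = set()
        self.leak_threshold = leak_threshold

    def on_start(self, rollout_id: str) -> None:
        self.active.add(rollout_id)

    def on_finish(self, rollout_id: str) -> None:
        self.active.discard(rollout_id)  # discard: cleanup may fire more than once

    def leaked(self, expected_active: int) -> bool:
        # Alert when the active count exceeds the expected baseline by the threshold
        return len(self.active) - expected_active > self.leak_threshold

monitor = RolloutLeakMonitor()
for i in range(300):
    monitor.on_start(f"rollout-{i}")
for i in range(240):
    monitor.on_finish(f"rollout-{i}")
# 60 rollouts were never cleaned up; against an expected baseline of 0,
# that exceeds the +50 threshold and should page someone
```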
Sam's advice from Cognition:
"Sometimes like let's say there's infrastructure error and the VMs fail... that does lead to the training kind of collapsing because even the model might have done something good, it got a zero reward."
Set up monitoring:
```python
import logging

logger = logging.getLogger("rollout-infra")

@method()
def execute_tool(self, rollout_id: str, tool: str, params: dict):
    try:
        result = self._execute(tool, params)
        # Log success
        logger.info(f"rollout={rollout_id} tool={tool} status=success")
        return result
    except Exception as e:
        # Log failure - critical for debugging training collapse
        logger.error(f"rollout={rollout_id} tool={tool} status=error error={e}")
        # Return the error to the model (don't give zero reward for infra issues)
        return {
            "error": "Infrastructure error, please retry",
            "retryable": True,
        }
```
Trade-offs
Pros:
- Complete isolation: No cross-contamination between rollouts
- Safety: Destructive commands can't affect other rollouts or host system
- Determinism: Consistent environment for reliable reward signals
- Production parity: Can use exact same environment as production
Cons:
- Cost: 100s of VMs running simultaneously can be expensive
- Provisioning time: VM startup adds latency (containers are faster)
- Complexity: Requires robust infrastructure and monitoring
- Scaling limits: Cloud provider quotas may limit concurrent VMs
- Failure modes: Infrastructure issues can cause training collapse
References
- OpenAI Build Hour: Agent RFT - Cognition Case Study (November 2025)
- Mnih et al. (2016). "Asynchronous Methods for Deep Reinforcement Learning". arXiv:1602.01783 — A3C foundation for parallel isolated environments
- Schulman et al. (2017). "Proximal Policy Optimization Algorithms". arXiv:1707.06347 — PPO parallel rollout collection
- Liang et al. (2018). "RLlib: Abstractions for Distributed Reinforcement Learning". arXiv:1712.09381 — actor-based isolation for RL
- Modal Documentation
- E2B Documentation — Firecracker microVMs for agent sandboxes
- Related patterns: Agent Reinforcement Fine-Tuning, Adaptive Sandbox Fanout Controller, Sandboxed Tool Authorization, Egress Lockdown, Virtual Machine Operator Agent