Problem
During reinforcement learning training with tool-using agents, multiple rollouts execute simultaneously and may call destructive or stateful tools. This challenge is well-established in distributed RL research—A3C (Mnih et al., 2016) and PPO (Schulman et al., 2017) both rely on parallel isolated environment instances for stable gradient estimation.
- Cross-contamination: One rollout's actions affect another rollout's environment
- Destructive commands: Agent might run rm -rf, corrupting shared state
- State leakage: File system changes persist across rollouts, creating inconsistent training data
- Reward corruption: If rollout B sees rollout A's side effects, reward signals become meaningless
- Debugging nightmares: Non-deterministic failures due to race conditions
Cognition faced this when training Devin's file planning agent: the agent had access to a shell tool that could run arbitrary commands like grep, find, or even rm. Running 32 parallel rollouts on shared infrastructure would cause chaos.
Solution
Spin up an isolated virtual machine (or container) for each RL rollout, ensuring complete environment isolation.
Architecture:
- Rollout ID Tracking: OpenAI's Agent RFT platform assigns unique IDs to each rollout
- VM/Container Mapping: Your infrastructure maps rollout ID → dedicated VM
- Clean State: Each VM starts fresh with identical filesystem, packages, and configuration
- Cleanup: VMs are destroyed after rollout completes (success or failure)
Key Components:
- VM Provisioning: Fast VM creation (typically cloud instances or containers)
- Bursty Scaling: Handle 100s-500s of simultaneous VM requests at training step boundaries
- State Isolation: No shared filesystems or databases between VMs
- Timeout Handling: VMs auto-destroy after timeout to prevent resource leaks
```python
# Infrastructure setup (Cognition's approach)
from modal import App, Image, method

# Base VM image with all dependencies
base_image = (
    Image.debian_slim()
    .apt_install("git", "build-essential")
    .pip_install("pandas", "numpy", "openai")
    .copy_local_dir("./corpus", "/workspace/corpus")  # Training data
)

app = App("agent-rft-tool-server")

@app.cls(
    image=base_image,
    cpu=2,
    memory=4096,
    timeout=600,  # 10 min per rollout max
)
class IsolatedToolExecutor:
    """
    Each instance gets its own isolated VM,
    spun up per-rollout during RL training.
    """

    def __init__(self):
        """Initialize fresh state for this rollout"""
        self.rollout_id = None
        self.workspace = "/workspace"
        self.history = []

    @method()
    def initialize_rollout(self, rollout_id: str):
        """
        Called first when a rollout starts.
        Sets up isolated state for this specific rollout.
        """
        self.rollout_id = rollout_id
        print(f"[{rollout_id}] Initialized isolated VM")

        # Create isolated working directory
        import os
        self.work_dir = f"{self.workspace}/rollout_{rollout_id}"
        os.makedirs(self.work_dir, exist_ok=True)
        return {"status": "ready", "rollout_id": rollout_id}

    @method()
    def execute_shell(self, rollout_id: str, command: str):
        """
        Execute a shell command in the isolated environment.
        Safe because this VM is dedicated to this rollout.
        """
        if rollout_id != self.rollout_id:
            raise ValueError(f"Rollout ID mismatch: {rollout_id} != {self.rollout_id}")

        import subprocess
        print(f"[{rollout_id}] Executing: {command}")

        # Even destructive commands are safe in an isolated VM
        result = subprocess.run(
            command,
            shell=True,
            cwd=self.work_dir,
            capture_output=True,
            text=True,
            timeout=60,
        )
        self.history.append({
            "command": command,
            "returncode": result.returncode,
            "stdout": result.stdout[:1000],  # Limit output
            "stderr": result.stderr[:1000],
        })
        return {
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
        }

    @method()
    def read_file(self, rollout_id: str, filepath: str):
        """Read a file from the corpus or workspace"""
        if rollout_id != self.rollout_id:
            raise ValueError("Rollout ID mismatch")

        # Files are isolated to this VM
        full_path = f"{self.workspace}/{filepath}"
        try:
            with open(full_path, "r") as f:
                content = f.read()
            return {"content": content, "error": None}
        except Exception as e:
            return {"content": None, "error": str(e)}

    @method()
    def search_corpus(self, rollout_id: str, query: str):
        """Semantic search over documents"""
        if rollout_id != self.rollout_id:
            raise ValueError("Rollout ID mismatch")

        # Search implementation here...
        # Corpus is read-only, copied into the VM at startup
        return {"results": [...]}

    @method()
    def cleanup(self, rollout_id: str):
        """
        Optional cleanup (Modal handles VM destruction automatically)
        """
        print(f"[{rollout_id}] Rollout complete, VM will be destroyed")
        return {"history": self.history}

# Tool endpoint configuration for OpenAI Agent RFT
tools_config = [
    {
        "name": "shell",
        "url": "https://your-app.modal.run/execute_shell",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},
    },
    {
        "name": "read_file",
        "url": "https://your-app.modal.run/read_file",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},
    },
    {
        "name": "search",
        "url": "https://your-app.modal.run/search_corpus",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"},
    },
]
```
Request Flow: Agent RFT sends each tool call with its rollout ID → your endpoint routes it to that rollout's dedicated VM → the result returns to the trainer.
How to use it
Phase 1: Infrastructure Setup
Choose your isolation technology:
- Modal/E2B: MicroVMs with ~1s startup, Firecracker isolation (recommended for agent training)
- AWS Lambda/Cloud Functions: Serverless functions with container isolation (easiest)
- Docker: Containers per rollout (good balance)
- Kubernetes Jobs: K8s pods per rollout (production-grade)
- Cloud VMs: EC2/GCP instances per rollout (maximum isolation, 30-120s startup)
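Whichever backend you pick, the core mechanic is the same: a registry that maps each rollout ID to a dedicated, throwaway environment. A minimal Docker-based sketch (the DockerRolloutManager class, image name, and resource limits are illustrative assumptions, not part of any platform's API):

```python
import shlex

class DockerRolloutManager:
    """Maps each rollout ID to a dedicated container (hypothetical sketch)."""

    def __init__(self, image: str = "agent-tools:latest"):
        self.image = image
        self.active: dict[str, str] = {}  # rollout_id -> container name

    def provision_cmd(self, rollout_id: str) -> str:
        """Build a `docker run` command for a fresh, isolated container."""
        name = f"rollout-{rollout_id}"
        self.active[rollout_id] = name
        # --rm ensures the container (and its filesystem) vanishes on exit;
        # memory/cpu limits keep one runaway rollout from starving the rest
        return (
            f"docker run -d --rm --name {shlex.quote(name)} "
            f"--memory 4g --cpus 2 {shlex.quote(self.image)}"
        )

    def teardown_cmd(self, rollout_id: str) -> str:
        """Build the command that destroys a rollout's container."""
        name = self.active.pop(rollout_id)
        return f"docker kill {shlex.quote(name)}"

mgr = DockerRolloutManager()
cmd = mgr.provision_cmd("abc123")
```

The same shape works for any backend: swap the command builders for EC2, Kubernetes Jobs, or Modal API calls, and keep the rollout-ID → environment registry unchanged.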
Phase 2: Implement Rollout ID Tracking
```python
# All tool endpoints must accept and validate rollout_id
@app.post("/tool/{tool_name}")
async def execute_tool(tool_name: str, rollout_id: str, params: dict):
    # Get or create the isolated environment for this rollout
    vm = get_or_create_vm(rollout_id)
    # Execute in the isolated context
    result = vm.execute(tool_name, params)
    return result
```
Phase 3: Handle Bursty Traffic
Agent RFT sends traffic in bursts:
- Training step boundary: 100-500 simultaneous rollout requests
- Tool call latency: Brief pauses while agent thinks
- Cleanup phase: Mass VM destruction
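One way to absorb a training-step burst without rejecting requests is to cap concurrent provisioning with a semaphore and queue the overflow. A minimal asyncio sketch (BurstProvisioner is a hypothetical name; the sleep stands in for the real VM-creation call):

```python
import asyncio

class BurstProvisioner:
    """Caps concurrent VM provisioning during training-step bursts (sketch)."""

    def __init__(self, max_concurrent: int = 500):
        self.sem = asyncio.Semaphore(max_concurrent)

    async def provision(self, rollout_id: str) -> str:
        # Queue excess requests instead of failing them at the burst peak
        async with self.sem:
            await asyncio.sleep(0)  # stand-in for the real VM-creation call
            return f"vm-{rollout_id}"

async def handle_training_step(n_rollouts: int) -> list[str]:
    prov = BurstProvisioner(max_concurrent=50)
    # A training-step boundary fires all rollout requests at once
    return await asyncio.gather(*(prov.provision(str(i)) for i in range(n_rollouts)))

vms = asyncio.run(handle_training_step(200))
```

Queuing instead of rejecting matters here: a rejected provisioning request becomes a zero-reward rollout, which is exactly the infrastructure-induced failure mode the monitoring phase below warns about.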
Configure auto-scaling:
```python
# Modal example
@app.cls(
    image=base_image,
    concurrency_limit=500,       # Max concurrent VMs
    container_idle_timeout=60,   # Cleanup after 1 min idle
)
```
Phase 4: Monitor Infrastructure
Critical metrics:
- VM provisioning time: Should be <5 seconds (alert if >10s)
- Infrastructure error rate: Target <1%, alert if >5% (higher rates cause training collapse)
- Rollout timeout rate: Target <0.1%, alert if >1%
- Resource leaks: Active rollout count should match expected; alert if +50 over baseline
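The resource-leak metric above can be computed from a simple registry of active rollouts. A sketch (RolloutLeakMonitor is a hypothetical name; the threshold mirrors the +50-over-baseline alert rule):

```python
class RolloutLeakMonitor:
    """Flags leaked VMs by comparing active rollouts to the expected count (sketch)."""

    def __init__(self, leak_threshold: int = 50):
        self.active: set[str] = set()
        self.leak_threshold = leak_threshold

    def on_start(self, rollout_id: str) -> None:
        self.active.add(rollout_id)

    def on_finish(self, rollout_id: str) -> None:
        self.active.discard(rollout_id)  # discard: cleanup may fire more than once

    def leaked(self, expected_active: int) -> bool:
        # Alert when the active count exceeds the expected baseline by the threshold
        return len(self.active) - expected_active > self.leak_threshold

monitor = RolloutLeakMonitor()
for i in range(300):
    monitor.on_start(f"rollout-{i}")
for i in range(240):
    monitor.on_finish(f"rollout-{i}")
# 60 rollouts were never cleaned up; against an expected baseline of 0,
# that exceeds the +50 threshold and should page someone
```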
Sam's advice from Cognition:
"Sometimes like let's say there's infrastructure error and the VMs fail... that does lead to the training kind of collapsing because even the model might have done something good, it got a zero reward."
Set up monitoring:
```python
import logging

logger = logging.getLogger("rollout-infra")

@method()
def execute_tool(self, rollout_id: str, tool: str, params: dict):
    try:
        result = self._execute(tool, params)
        # Log success
        logger.info(f"rollout={rollout_id} tool={tool} status=success")
        return result
    except Exception as e:
        # Log failure - critical for debugging training collapse
        logger.error(f"rollout={rollout_id} tool={tool} status=error error={e}")
        # Return the error to the model (don't give zero reward for infra issues)
        return {
            "error": "Infrastructure error, please retry",
            "retryable": True,
        }
```
Trade-offs
Pros:
- Complete isolation: No cross-contamination between rollouts
- Safety: Destructive commands can't affect other rollouts or host system
- Determinism: Consistent environment for reliable reward signals
- Production parity: Can use exact same environment as production
Cons:
- Cost: 100s of VMs running simultaneously can be expensive
- Provisioning time: VM startup adds latency (containers are faster)
- Complexity: Requires robust infrastructure and monitoring
- Scaling limits: Cloud provider quotas may limit concurrent VMs
- Failure modes: Infrastructure issues can cause training collapse
References
- OpenAI Build Hour: Agent RFT - Cognition Case Study (November 2025)
- Mnih et al. (2016). "Asynchronous Methods for Deep Reinforcement Learning". arXiv:1602.01783 — A3C foundation for parallel isolated environments
- Schulman et al. (2017). "Proximal Policy Optimization Algorithms". arXiv:1707.06347 — PPO parallel rollout collection
- Liang et al. (2018). "RLlib: Abstractions for Distributed Reinforcement Learning". arXiv:1712.09381 — actor-based isolation for RL
- Modal Documentation
- E2B Documentation — Firecracker microVMs for agent sandboxes
- Related patterns: Agent Reinforcement Fine-Tuning, Adaptive Sandbox Fanout Controller, Sandboxed Tool Authorization, Egress Lockdown, Virtual Machine Operator Agent