Agent Circuit Breaker

Problem

Agents that use external tools — APIs, databases, web scrapers, code executors — face a common failure mode: a tool endpoint becomes degraded or unavailable, and the agent keeps calling it, burning tokens on retries that will never succeed.

This creates three cascading problems:

Token waste: Each failed tool call costs input/output tokens, and the agent often generates lengthy retry reasoning
Latency amplification: Sequential retries on a dead endpoint add seconds or minutes with no progress
Cascading failure: If one tool is down (e.g., a search API), the agent may stall entirely instead of using alternative approaches

Tool-level failures call for a different strategy than model-level failover (switching between GPT-4 and Claude when one provider is down): the agent needs to learn, mid-session, that a specific tool is broken and stop using it.

Solution

Apply the classic Circuit Breaker pattern from distributed systems to agent tool invocations. The circuit breaker wraps each tool call and counts failures, transitioning between three states:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure_count >= threshold
    Open --> HalfOpen: cooldown_elapsed
    HalfOpen --> Closed: probe_succeeds
    HalfOpen --> Open: probe_fails

States:

State	Behavior
Closed	Tool calls pass through normally. Failures are counted.
Open	Tool calls are blocked immediately — returns a cached error or fallback. No actual call is made.
Half-Open	One probe call is allowed through. If it succeeds, reset to Closed. If it fails, return to Open.

Core mechanism:

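import time


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""


class AgentCircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=60):
        self.state = "closed"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.opened_at = None

    def call(self, tool_fn, *args, **kwargs):
        if self.state == "open":
            # monotonic() is unaffected by wall-clock adjustments
            elapsed = time.monotonic() - self.opened_at
            if elapsed >= self.cooldown:
                self.state = "half_open"  # Cooldown elapsed: allow one probe
            else:
                remaining = self.cooldown - elapsed
                raise CircuitOpenError(
                    f"Circuit open: tool disabled for another {remaining:.0f}s"
                )

        try:
            result = tool_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failed probe reopens at once; otherwise open at the threshold.
            if self.state == "half_open" or self.failure_count >= self.threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise

        # Success: close the circuit and clear the consecutive-failure count.
        self.state = "closed"
        self.failure_count = 0
        return result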

Agent-specific adaptations beyond traditional circuit breakers:

Token-aware thresholds: Open the circuit after N tokens wasted, not just N failures
Fallback routing: When a circuit opens, inform the agent's system prompt so it chooses alternative tools
Per-tool granularity: Each tool (search API, code executor, database) gets its own circuit breaker
Session-scoped state: Circuit state resets between agent sessions (unlike persistent microservice breakers)
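
The token-aware threshold above could look roughly like the following. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name `TokenAwareBreaker`, the `token_budget` parameter, and the `record_failure(tokens_spent)` method are hypothetical, and it assumes the caller can measure how many tokens each failed call consumed (including any retry reasoning).

```python
import time


class TokenAwareBreaker:
    """Opens once a budget of wasted tokens is used up (hypothetical sketch)."""

    def __init__(self, token_budget=2000, cooldown_seconds=60, clock=time.monotonic):
        self.token_budget = token_budget  # assumed: tokens we tolerate wasting
        self.cooldown = cooldown_seconds
        self.clock = clock                # injectable so tests can fake time
        self.wasted_tokens = 0
        self.opened_at = None             # None means the circuit is closed

    def allow_call(self):
        """Return True if the tool may be called right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a probe through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record_failure(self, tokens_spent):
        """Charge the tokens a failed call consumed against the budget."""
        self.wasted_tokens += tokens_spent
        if self.wasted_tokens >= self.token_budget:
            # Also re-arms the cooldown when a half-open probe fails.
            self.opened_at = self.clock()

    def record_success(self):
        """A successful call closes the circuit and clears the tally."""
        self.wasted_tokens = 0
        self.opened_at = None
```

Because the tally is measured in tokens rather than calls, one expensive failure (a long scraped page followed by lengthy retry reasoning) can open the circuit as quickly as several cheap ones. The sketch assumes a sequential agent loop; concurrent callers would need a lock and an explicit single-probe guard.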

How to use it

When to apply:

Agent uses 3+ external tools that can independently fail
Tools have variable reliability (APIs with rate limits, web scrapers, third-party services)
Agent sessions are long enough that a tool may recover mid-session

Implementation steps:

1. Wrap each tool in its own circuit breaker instance.
2. Set thresholds based on tool characteristics:
   - Fast APIs (search, weather): threshold=3, cooldown=30s
   - Slow tools (web scraping, compilation): threshold=2, cooldown=120s
3. Define fallback behavior when circuits open:
   - Return cached/stale results if available
   - Route to an alternative tool (e.g., switch search providers)
   - Inform the agent that the tool is unavailable so it can adjust its plan
4. Log circuit state changes for observability.
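
The fallback options listed above can be tried in order, as in the sketch below. It is illustrative only: `call_with_fallback`, the result dictionary shape, and the plain-dict cache are hypothetical choices, and the `CircuitOpenError` defined here is a stand-in for whatever error your breaker raises while open.

```python
import logging

logger = logging.getLogger("circuit_breaker")


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises while its circuit is open."""


def call_with_fallback(tool_name, guarded_call, cache, cache_key, alternative=None):
    """Try the guarded tool; if its circuit is open, fall back in order.

    guarded_call: zero-argument callable that goes through the breaker.
    cache:        dict of earlier successful results (possibly stale).
    alternative:  optional zero-argument callable for a substitute tool.
    """
    try:
        result = guarded_call()
        cache[cache_key] = result  # remember the result for later fallbacks
        return {"status": "ok", "result": result}
    except CircuitOpenError:
        logger.warning("circuit open for %s; using fallback", tool_name)

    if cache_key in cache:
        return {"status": "stale", "result": cache[cache_key]}
    if alternative is not None:
        return {"status": "alternative", "result": alternative()}
    # Nothing to fall back on: tell the agent so it can adjust its plan.
    return {
        "status": "unavailable",
        "result": f"Tool '{tool_name}' is temporarily unavailable.",
    }
```

Returning an explicit `status` lets the agent distinguish fresh results from stale or substituted ones. Only the open-circuit error is caught here; ordinary tool failures still propagate so the breaker's own counting is unaffected.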

Integration with agent loops:

# In your agent's tool execution layer.
# `tools` is assumed to be your existing registry of tool name -> callable.
breakers = {
    "web_search": AgentCircuitBreaker(failure_threshold=3, cooldown_seconds=60),
    "code_exec": AgentCircuitBreaker(failure_threshold=2, cooldown_seconds=120),
    "database":  AgentCircuitBreaker(failure_threshold=3, cooldown_seconds=30),
}

def execute_tool(tool_name, *args, **kwargs):
    tool_fn = tools[tool_name]
    breaker = breakers.get(tool_name)
    if breaker is None:
        return tool_fn(*args, **kwargs)  # No breaker configured: call directly
    return breaker.call(tool_fn, *args, **kwargs)
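
To connect the breakers to fallback routing, the agent loop can summarize open circuits in a note added to the system prompt before each model turn. The helper below is a hypothetical sketch: the name `tool_status_note` and the wording of the note are assumptions, and it takes a plain mapping of tool name to state string so that it does not depend on any particular breaker class.

```python
def tool_status_note(breaker_states):
    """Build a short prompt note listing tools whose circuit is open.

    breaker_states: mapping of tool name -> "closed" | "open" | "half_open",
    for example {name: b.state for name, b in breakers.items()}.
    Returns an empty string when every tool is usable.
    """
    down = sorted(name for name, state in breaker_states.items() if state == "open")
    if not down:
        return ""
    return (
        "The following tools are temporarily unavailable: "
        + ", ".join(down)
        + ". Do not call them; use alternative tools or adjust the plan."
    )
```

Half-open tools are deliberately left out of the note, so the agent remains free to issue the single probe call that can close the circuit again.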

Trade-offs

Pros:

Prevents token waste from futile retries on broken tools
Enables graceful degradation — agent continues working with available tools
Self-healing: half-open probes restore tools automatically when they recover
Simple to implement (~50 lines of core logic)
Session-scoped state avoids the state management complexity of persistent breakers

Cons:

Adds a layer of indirection around tool calls
Threshold tuning requires understanding each tool's failure characteristics
May mask intermittent errors that would naturally resolve with a single retry
Agent must be prompted to handle CircuitOpenError gracefully (fallback awareness)
Not useful for agents with only 1-2 highly reliable tools

References

Martin Fowler: CircuitBreaker — canonical pattern description
Release It! (Michael Nygard, 2007) — original production pattern
Netflix Hystrix — production implementation at scale
Resilience4j — lightweight Java implementation
Related: Failover-Aware Model Fallback — handles model-provider failures (complementary)
Related: Action Caching & Replay — cached results can serve as circuit-open fallbacks
