Circuit breakers prevent cascading failures by stopping calls to a failing service. When too many calls fail, the circuit “opens” and blocks further calls until the service recovers.

The Circuit Breaker Pattern

Basic Usage

from splinter.control import CircuitBreaker, CircuitBreakerConfig
from splinter.exceptions import CircuitOpenError

breaker = CircuitBreaker(
    breaker_id="openai",
    config=CircuitBreakerConfig(
        failure_threshold=5,   # Open after 5 failures
        timeout_seconds=60,    # Stay open for 60s
        half_open_max=1,       # Allow 1 test request
    )
)

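# Use the breaker (run this inside an async function).
# call_openai() and OpenAIError are placeholders for your own client call and its error type.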
try:
    breaker.check()  # Raises if circuit is open
    result = await call_openai()
    breaker.record_success()
except OpenAIError:
    breaker.record_failure()
except CircuitOpenError:
    print("Circuit is open, skipping call")
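
The check / record_success / record_failure sequence above tends to get repeated at every call site, so it can help to fold it into one helper. The following is a minimal sketch, not part of splinter: the name call_with_breaker and the failure_types parameter are hypothetical, and the helper assumes only the three breaker methods shown in the example above.

```python
import asyncio


async def call_with_breaker(breaker, func, *args, failure_types=(Exception,), **kwargs):
    """Run an async callable behind a breaker (hypothetical helper).

    `breaker` is any object with check(), record_success() and
    record_failure(), as used in the basic example.
    """
    breaker.check()  # Raises if the circuit is open, so the call is skipped.
    try:
        result = await func(*args, **kwargs)
    except failure_types:
        # Only the listed error types count against the breaker.
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

A call site would then read something like result = await call_with_breaker(breaker, call_openai, failure_types=(OpenAIError,)). Because check() runs outside the try block, an open-circuit error is never counted as a failure of the service itself.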

Circuit States

State      Behavior
CLOSED     Normal operation. Calls allowed. Failures counted.
OPEN       All calls blocked. Waiting for timeout.
HALF_OPEN  Limited calls allowed. Testing if service recovered.
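
To make the transitions between these three states concrete, here is a small self-contained state machine. It is an illustrative sketch only and not splinter's implementation: the class name SketchBreaker, the allow() method, and the injectable clock are hypothetical, and it assumes that failures are counted consecutively (a success resets the count) and that any failure while HALF_OPEN reopens the circuit.

```python
import time


class SketchBreaker:
    """Illustrative three-state breaker; not splinter's implementation."""

    def __init__(self, failure_threshold=5, timeout_seconds=60,
                 half_open_max=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.half_open_max = half_open_max
        self.clock = clock  # Injectable so tests do not need to sleep.
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None
        self.trial_calls = 0

    def allow(self):
        """Return True if a call may proceed right now."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.timeout_seconds:
                return False  # Still waiting for the timeout.
            self.state = "HALF_OPEN"  # Timeout elapsed: start probing.
            self.trial_calls = 0
        if self.state == "HALF_OPEN":
            if self.trial_calls >= self.half_open_max:
                return False  # Trial budget already used.
            self.trial_calls += 1
        return True

    def record_success(self):
        # Assumption: any success, including a successful probe, closes the circuit.
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

The three constructor arguments mirror failure_threshold, timeout_seconds, and half_open_max from the configuration shown earlier, which is why the OPEN state has no exit other than the timeout.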

Circuit Breaker Registry

Manage multiple circuit breakers:
from splinter.control import CircuitBreakerConfig, CircuitBreakerRegistry

registry = CircuitBreakerRegistry()

# Register breakers for each provider
registry.register("openai", CircuitBreakerConfig(
    failure_threshold=5,
    timeout_seconds=60,
))

registry.register("anthropic", CircuitBreakerConfig(
    failure_threshold=3,
    timeout_seconds=30,
))

# Check state
print(registry.get_state("openai"))  # CLOSED, OPEN, or HALF_OPEN

Global Stop (Emergency)

Stop everything immediately:
# Something is very wrong - stop all calls
registry.trip_all("Emergency: API returning bad data")

# All circuits are now OPEN
# No calls will go through

# Later, reset when issue is resolved
registry.reset_all()

Per-Agent Breakers

Different agents can have different breakers:
registry = CircuitBreakerRegistry()

# Researcher is less critical - more lenient
registry.register("researcher", CircuitBreakerConfig(
    failure_threshold=10,
    timeout_seconds=30,
))

# Writer is critical - fail fast
registry.register("writer", CircuitBreakerConfig(
    failure_threshold=2,
    timeout_seconds=120,
))

Monitoring Circuits

# Get all circuit states
states = registry.get_all_states()
# {"openai": "CLOSED", "anthropic": "OPEN", ...}

# Get detailed info
info = registry.get_info("openai")
# {
#   "state": "CLOSED",
#   "failure_count": 2,
#   "last_failure": "2024-01-15T10:30:00Z",
#   "success_count": 150,
# }

# Listen for state changes
registry.on_state_change("openai", lambda old, new:
    print(f"OpenAI circuit: {old} -> {new}")
)
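
In production a print call is usually replaced by structured logging. The sketch below builds a callback with the same (old, new) signature as the listener above. The function name make_state_logger and the logger name are hypothetical, and the comparison assumes states compare equal to the plain strings shown in the get_all_states() example; if the library passes enum values, the check would need adjusting.

```python
import logging

logger = logging.getLogger("circuit_breakers")


def make_state_logger(breaker_id):
    """Return an (old, new) callback that logs transitions for one breaker."""
    def on_change(old, new):
        # Assumption: states stringify to "CLOSED", "OPEN" or "HALF_OPEN".
        # Opening is the transition worth alerting on; the rest are informational.
        level = logging.WARNING if str(new) == "OPEN" else logging.INFO
        logger.log(level, "circuit %s: %s -> %s", breaker_id, old, new)
    return on_change
```

It would be registered the same way as the lambda, for example registry.on_state_change("openai", make_state_logger("openai")).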

Handling Circuit Open

import asyncio

from splinter.exceptions import CircuitOpenError

async def run_agent_safely():
    try:
        return await run_agent()
    except CircuitOpenError as e:
        print(f"Circuit {e.breaker_id} is open")
        print(f"Opened at: {e.opened_at}")
        print(f"Retry after: {e.retry_after:.1f}s")

        # Pick ONE of the options below.

        # 1. Wait and retry
        # await asyncio.sleep(e.retry_after)
        # return await run_agent()

        # 2. Use fallback
        # return await fallback_operation()

        # 3. Fail gracefully
        return {"error": "Service temporarily unavailable"}
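
The options above can also be combined: wait when the retry delay is short, otherwise go straight to a fallback. The following sketch shows one way to do that. The function run_with_recovery and its max_wait parameter are hypothetical, and the CircuitOpenError class here is a local stand-in (carrying only breaker_id and retry_after) so the example runs without splinter installed; in real code you would import the library's exception instead.

```python
import asyncio


class CircuitOpenError(Exception):
    """Local stand-in for splinter.exceptions.CircuitOpenError."""

    def __init__(self, breaker_id, retry_after):
        super().__init__(f"circuit {breaker_id} is open")
        self.breaker_id = breaker_id
        self.retry_after = retry_after


async def run_with_recovery(primary, fallback, max_wait=5.0):
    """Try primary; on an open circuit, wait once if the delay is short, else fall back."""
    try:
        return await primary()
    except CircuitOpenError as e:
        if e.retry_after <= max_wait:
            await asyncio.sleep(e.retry_after)
            try:
                return await primary()
            except CircuitOpenError:
                pass  # Still open after waiting: fall through to the fallback.
        return await fallback()
```

Capping the wait with max_wait keeps a long timeout_seconds on the breaker from turning into a long stall for the caller.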

Integration with Workflow

from splinter.workflow import Workflow
from splinter.control import CircuitBreakerConfig, CircuitBreakerRegistry

workflow = Workflow(workflow_id="pipeline")

registry = CircuitBreakerRegistry()
registry.register("*", CircuitBreakerConfig(
    failure_threshold=5,
    timeout_seconds=60,
))

workflow.set_circuit_breaker_registry(registry)

Best Practices

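Use breakers for every external call, not just LLM providers: any external service can fail.
Tune timeout_seconds with care. Too short, and the breaker probes a service that has not yet recovered and reopens immediately; too long, and calls stay blocked after the service is healthy again. 30-60 seconds is a good starting point.
If circuits open frequently, investigate the root cause.
Decide in advance what happens when a circuit opens: wait and retry, use a fallback, or fail gracefully.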