Architecture · 8 min read · April 12, 2026

LLM Failover & High Availability: Building Resilient AI Applications

When OpenAI goes down, does your app go down too? This architecture guide covers circuit breakers, fallback chains, and multi-provider resilience for production AI.


NeuralRouting Team

April 12, 2026


On March 7, 2026, OpenAI experienced a 4-hour outage that affected thousands of production applications. Companies running single-provider setups lost revenue, SLA credits, and user trust. The ones running multi-provider gateways? Their users never noticed.

This guide covers the architecture patterns for building resilient AI applications that survive provider outages.


The Single Point of Failure Problem

Most AI applications look like this:

Your App → OpenAI API → Response

When OpenAI goes down:

Your App → OpenAI API → 503 → Your App Crashes

OpenAI's historical uptime is approximately 99.7%, which sounds good until you calculate: 0.3% downtime = 26 hours/year. For a production app handling thousands of requests per hour, that's significant.
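The arithmetic behind that figure, as a quick sanity check:

```python
# Downtime implied by 99.7% uptime over one year
hours_per_year = 365 * 24                    # 8760 hours
downtime_hours = hours_per_year * (1 - 0.997)
print(round(downtime_hours, 1))              # ~26.3 hours/year
```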


Pattern 1: Simple Fallback Chain

The most basic resilience pattern:

# Providers are tried in order. call_provider and the exception types
# (Timeout, APIError, RateLimitError, AllProvidersDown) are app-level
# wrappers around each vendor's SDK, not library built-ins.
FALLBACK_CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "groq", "model": "llama-3.1-70b-versatile"},
]

async def resilient_call(prompt: str) -> str:
    for provider in FALLBACK_CHAIN:
        try:
            response = await call_provider(
                provider["provider"],
                provider["model"],
                prompt,
                timeout=10.0,
            )
            return response
        except (Timeout, APIError, RateLimitError):
            continue  # fall through to the next provider in the chain
    raise AllProvidersDown("No available providers")

Pros: simple to implement, and it handles basic outages. Cons: each failed attempt burns its full timeout before moving on. With a 10-second timeout, three consecutive failures add 30 seconds of latency before the user sees anything.


Pattern 2: Circuit Breaker

Circuit breakers prevent cascading failures by short-circuiting requests to known-failing providers:

CLOSED (healthy) → errors exceed threshold → OPEN (failing)
OPEN → after cooldown period → HALF-OPEN (testing)
HALF-OPEN → test succeeds → CLOSED
HALF-OPEN → test fails → OPEN

When a provider is OPEN, requests skip it entirely — no timeout wait. This reduces failover latency from 10+ seconds to milliseconds.

Thresholds we recommend:

  • Failure threshold: 5 failures in 60 seconds → OPEN
  • Cooldown period: 30 seconds before HALF-OPEN
  • Success threshold: 3 consecutive successes → CLOSED
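A minimal sketch of this state machine, using the thresholds above. The class and method names here are illustrative, not a specific library's API:

```python
import time

class CircuitBreaker:
    """Per-provider breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, window=60.0,
                 cooldown=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.window = window              # seconds over which failures count
        self.cooldown = cooldown          # seconds in OPEN before HALF-OPEN
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = []                # timestamps of recent failures
        self.successes = 0                # consecutive successes in HALF-OPEN
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "HALF-OPEN"  # let one test request through
                self.successes = 0
                return True
            return False                  # short-circuit: skip this provider
        return True

    def record_success(self) -> None:
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
                self.failures = []
        elif self.state == "CLOSED":
            self.failures = []

    def record_failure(self) -> None:
        now = time.monotonic()
        if self.state == "HALF-OPEN":
            self._open(now)               # test failed: back to OPEN
            return
        # Keep only failures inside the sliding window
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now
```

In the fallback loop from Pattern 1, you would call `allow_request()` before trying a provider and `record_success()`/`record_failure()` after, so a provider that just failed five times is skipped instantly instead of costing another 10-second timeout.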

Pattern 3: Health-Check Monitoring

Instead of waiting for user requests to discover failures, proactive health checks detect issues before they impact users:

import asyncio

async def health_check_loop():
    # `providers` is an app-level list of client wrappers exposing an
    # OpenAI-style interface plus `test_model` and `status` attributes.
    while True:
        for provider in providers:
            try:
                # Lightweight test call: 5 output tokens, short timeout
                await provider.chat.completions.create(
                    model=provider.test_model,
                    messages=[{"role": "user", "content": "test"}],
                    max_tokens=5,
                    timeout=5.0,
                )
                provider.status = "healthy"
            except Exception:
                provider.status = "degraded"
        await asyncio.sleep(30)  # poll interval

Combined with circuit breakers, this gives you near-instant failover.


Pattern 4: Latency-Based Routing

Not all failures are binary. A provider might be "up" but responding 5x slower than normal. Latency-based routing detects degradation:

  • Track P50 and P95 latency per provider over a 5-minute window
  • If P95 exceeds 2x the historical average, deprioritize that provider
  • Route to the provider with the lowest current P50

This catches partial outages and rate limiting that don't trigger error-based circuit breakers.
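The steps above can be sketched as a small latency tracker. Everything here, including the window size, class name, and 2x threshold, is illustrative; percentiles are taken over a sliding window of recent samples rather than a literal 5-minute clock:

```python
import statistics
from collections import deque

WINDOW_SIZE = 200  # roughly a 5-minute window at moderate traffic

class LatencyTracker:
    def __init__(self):
        self.samples = {}    # provider -> recent latencies (seconds)
        self.baseline = {}   # provider -> historical average P95 (seconds)

    def record(self, provider: str, latency: float) -> None:
        self.samples.setdefault(
            provider, deque(maxlen=WINDOW_SIZE)
        ).append(latency)

    def percentile(self, provider: str, q: float) -> float:
        data = sorted(self.samples[provider])
        return data[min(len(data) - 1, int(q * len(data)))]

    def pick(self) -> str:
        # Deprioritize providers whose current P95 exceeds 2x their
        # historical average; providers without a baseline stay eligible.
        healthy = [
            name for name in self.samples
            if name not in self.baseline
            or self.percentile(name, 0.95) <= 2 * self.baseline[name]
        ]
        # Route to the healthy provider with the lowest current P50
        return min(healthy, key=lambda name: self.percentile(name, 0.50))
```

You would call `record()` after every completed request (including slow successes, which error-based breakers never see) and `pick()` when choosing a provider for the next one.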


How NeuralRouting Handles Failover

NeuralRouting combines all four patterns:

  1. Multi-provider support: OpenAI + Groq (Anthropic and Mistral coming soon)
  2. Automatic fallback: If the primary provider fails, requests route to the next available
  3. Circuit breaker: Failing providers are temporarily bypassed
  4. Zero configuration: Failover is built into the routing layer — you don't configure anything

The result: 99.99%+ effective uptime even when individual providers experience outages.
