LLM Failover & High Availability: Building Resilient AI Applications
On March 7, 2026, OpenAI experienced a 4-hour outage that affected thousands of production applications. Companies running single-provider setups lost revenue, SLA credits, and user trust. The ones running multi-provider gateways? Their users never noticed.
This guide covers the architecture patterns for building resilient AI applications that survive provider outages.
The Single Point of Failure Problem
Most AI applications look like this:
Your App → OpenAI API → Response
When OpenAI goes down:
Your App → OpenAI API → 503 → Your App Crashes
OpenAI's historical uptime is roughly 99.7%, which sounds good until you do the math: 0.3% downtime works out to about 26 hours per year. For a production app handling thousands of requests per hour, that's significant.
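A quick back-of-the-envelope sketch of why multi-provider setups help. It assumes provider outages are statistically independent, which real incidents (shared cloud regions, correlated traffic spikes) only approximate:

```python
# Effective uptime when failing over across independent providers.
# Assumption: outages are uncorrelated -- a simplification.

HOURS_PER_YEAR = 24 * 365

def combined_uptime(*uptimes: float) -> float:
    """Every provider must be down simultaneously for a total outage."""
    downtime = 1.0
    for u in uptimes:
        downtime *= (1.0 - u)
    return 1.0 - downtime

single = 0.997
print(f"1 provider:  {single:.4%}, ~{(1 - single) * HOURS_PER_YEAR:.0f} h/yr down")
duo = combined_uptime(0.997, 0.997)
print(f"2 providers: {duo:.4%}, ~{(1 - duo) * HOURS_PER_YEAR:.2f} h/yr down")
```

Two independent 99.7% providers compound to roughly 99.999% effective uptime: downtime drops from about a day per year to a few minutes.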
Pattern 1: Simple Fallback Chain
The most basic resilience pattern:
```python
FALLBACK_CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "groq", "model": "llama-3.1-70b-versatile"},
]

class AllProvidersDown(Exception):
    """Raised when every provider in the chain has failed."""

async def resilient_call(prompt: str) -> str:
    for provider in FALLBACK_CHAIN:
        try:
            response = await call_provider(
                provider["provider"],
                provider["model"],
                prompt,
                timeout=10.0,
            )
            return response
        except (Timeout, APIError, RateLimitError):
            continue  # this provider failed; try the next one in the chain
    raise AllProvidersDown("No available providers")
```
Pros: Simple to implement, handles basic outages. Cons: Each failed attempt burns its full timeout before the next one starts, so three 10-second timeouts mean 30 seconds of added latency.
Pattern 2: Circuit Breaker
Circuit breakers prevent cascading failures by short-circuiting requests to known-failing providers:
CLOSED (healthy) → errors exceed threshold → OPEN (failing)
OPEN → after cooldown period → HALF-OPEN (testing)
HALF-OPEN → test succeeds → CLOSED
HALF-OPEN → test fails → OPEN
When a provider is OPEN, requests skip it entirely — no timeout wait. This reduces failover latency from 10+ seconds to milliseconds.
Thresholds we recommend:
- Failure threshold: 5 failures in 60 seconds → OPEN
- Cooldown period: 30 seconds before HALF-OPEN
- Success threshold: 3 consecutive successes → CLOSED
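The state machine and thresholds above can be sketched in a small class. This is an illustrative implementation, not any particular library's API; the class and method names are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker using the thresholds suggested above.
    Illustrative sketch -- names and structure are assumptions."""

    def __init__(self, failure_threshold=5, window=60.0,
                 cooldown=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.window = window              # seconds over which failures count
        self.cooldown = cooldown          # seconds before a HALF-OPEN probe
        self.success_threshold = success_threshold
        self.failures: list[float] = []   # timestamps of recent failures
        self.successes = 0                # consecutive successes in HALF-OPEN
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "half-open"  # let one test request through
                self.successes = 0
                return True
            return False                  # skip this provider: no timeout wait
        return True

    def record_success(self):
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures.clear()
        else:
            self.failures.clear()

    def record_failure(self):
        now = time.monotonic()
        if self.state == "half-open":
            self._trip(now)               # failed probe: back to OPEN
            return
        # Count only failures inside the rolling window
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now: float):
        self.state = "open"
        self.opened_at = now
```

The fallback chain from Pattern 1 would call `allow_request()` before each attempt and `record_success()`/`record_failure()` after, so an OPEN provider is skipped in microseconds instead of waited on for its full timeout.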
Pattern 3: Health-Check Monitoring
Instead of waiting for user requests to discover failures, proactive health checks detect issues before they impact users:
```python
import asyncio

async def health_check_loop():
    while True:
        for provider in providers:
            try:
                # Lightweight test call
                await provider.completions.create(
                    model=provider.test_model,
                    messages=[{"role": "user", "content": "test"}],
                    max_tokens=5,
                    timeout=5.0,
                )
                provider.status = "healthy"
            except Exception:
                provider.status = "degraded"
        await asyncio.sleep(30)
```
Combined with circuit breakers, this gives you near-instant failover.
Pattern 4: Latency-Based Routing
Not all failures are binary. A provider might be "up" but responding 5x slower than normal. Latency-based routing detects degradation:
- Track P50 and P95 latency per provider over a 5-minute window
- If P95 exceeds 2x the historical average, deprioritize that provider
- Route to the provider with the lowest current P50
This catches partial outages and rate limiting that don't trigger error-based circuit breakers.
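The three rules above can be sketched with a rolling-window tracker per provider. `LatencyTracker` and `pick_provider` are hypothetical helpers, and the percentile here is a simple sorted-sample estimate rather than a production-grade one:

```python
import time

class LatencyTracker:
    """Rolling-window latency samples for one provider (illustrative sketch)."""

    def __init__(self, window: float = 300.0):  # 5-minute window
        self.window = window
        self.samples: list[tuple[float, float]] = []  # (timestamp, latency)

    def record(self, latency: float):
        now = time.monotonic()
        self.samples.append((now, latency))
        cutoff = now - self.window
        # Drop samples that have aged out of the window
        self.samples = [(t, l) for t, l in self.samples if t >= cutoff]

    def percentile(self, p: float) -> float:
        values = sorted(l for _, l in self.samples)
        if not values:
            return float("inf")  # no data: treat as worst-case
        idx = min(int(p / 100 * len(values)), len(values) - 1)
        return values[idx]

def pick_provider(trackers: dict[str, LatencyTracker],
                  baselines: dict[str, float]) -> str:
    """Route to the lowest-P50 provider whose P95 is within 2x its baseline."""
    healthy = [
        name for name, t in trackers.items()
        if t.percentile(95) <= 2 * baselines[name]
    ]
    candidates = healthy or list(trackers)  # if all are degraded, consider all
    return min(candidates, key=lambda n: trackers[n].percentile(50))
```

A degraded provider is never hard-excluded here, only deprioritized: if every provider exceeds its latency threshold, the router still picks the fastest one rather than failing the request.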
How NeuralRouting Handles Failover
NeuralRouting combines all four patterns:
- Multi-provider support: OpenAI + Groq (Anthropic and Mistral coming soon)
- Automatic fallback: If the primary provider fails, requests route to the next available
- Circuit breaker: Failing providers are temporarily bypassed
- Zero configuration: Failover is built into the routing layer — you don't configure anything
The result: 99.99%+ effective uptime even when individual providers experience outages.