LLM load balancing and failover: how to keep your AI running when providers go down
OpenAI went down 3 times last quarter. If your app depends on a single LLM provider, those outages are your outages. Here's how to build multi-provider failover that switches automatically.
NeuralRouting Team
April 25, 2026
On March 14, 2026, OpenAI had a partial outage that lasted about 4 hours. If your app was wired directly to the OpenAI API with no fallback, your LLM features were down for 4 hours. Your users saw errors, your support queue filled up, and there was nothing you could do except wait.
This happens more often than most teams plan for. OpenAI, Anthropic, and Google all have periodic degraded performance, rate limit spikes, and full outages. If you depend on a single provider, their reliability is your ceiling.
The fix is multi-provider failover: when one provider goes down or slows down, requests automatically reroute to another.
The three failure modes
Not all provider failures look the same, and each one requires a different detection and response strategy.
Full outage. The API returns 500/503 errors or doesn't respond. Every request fails, and detection is straightforward.
Partial degradation. The API responds slowly. Time-to-first-token goes from 300ms to 4 seconds. Or responses get worse: more refusals, truncations. These requests technically succeed, so detection is harder.
Rate limiting. You hit the provider's rate limit and start getting 429 errors. This often happens during traffic spikes or when another customer on your shared tier is consuming capacity. The fix is not to retry against the same provider, it is to route the overflow to a different one.
The obvious first fix is sequential fallback: try OpenAI, fall back to Anthropic if OpenAI fails, and fall back to Groq if Anthropic also fails. It works. But it has problems.
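A minimal sketch of that fallback chain. The provider callables are passed in priority order; the `call_openai`-style wrappers named in the docstring are hypothetical stand-ins for each vendor's SDK call, not real functions:

```python
def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each provider callable in priority order; raise only if all fail.

    `providers` is a list of functions like call_openai(prompt) -> str,
    hypothetical wrappers around each vendor's SDK. Any exception
    (timeout, 5xx, 429) counts as a failure and we move to the next.
    """
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("All providers failed") from last_error
```

In practice each wrapper would also enforce a per-request timeout, since a hung connection is the slowest failure mode to detect.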
What the simple approach gets wrong
It is slow to detect degradation. The timeout has to elapse before failover triggers. If your timeout is 10 seconds and OpenAI is responding slowly but not timing out, every request eats that latency before falling over.
It retries unnecessarily. If OpenAI is fully down, every request still tries OpenAI first, fails, and then falls over. You are adding the timeout delay to every request instead of just routing around the outage.
It does not balance load. All traffic goes to the primary provider until it fails. You are not spreading requests across providers to avoid hitting rate limits, and you are not using the cheaper provider when it can handle the request.
Model behavior varies by provider. GPT-4o, Claude Sonnet, and Llama 3.1 70B produce different outputs for the same prompt. If you depend on specific formats or behaviors, failover to a different model can break things.
Better approach: health-checked routing
Instead of failing over per-request, maintain a health status for each provider and route based on current health.
```python
from dataclasses import dataclass
from time import time

@dataclass
class ProviderHealth:
    name: str
    healthy: bool = True
    last_check: float = 0
    consecutive_failures: int = 0
    avg_latency_ms: float = 0

    def mark_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= 3:
            self.healthy = False
            self.last_check = time()

    def mark_success(self, latency_ms: float):
        self.consecutive_failures = 0
        self.healthy = True
        # exponentially weighted moving average of latency
        self.avg_latency_ms = (self.avg_latency_ms * 0.9) + (latency_ms * 0.1)

    def should_retry(self) -> bool:
        """Check unhealthy providers every 30 seconds."""
        if self.healthy:
            return True
        return time() - self.last_check > 30
```
Now you route to the first healthy provider, skip providers that are down, and periodically re-check unhealthy providers to see if they have recovered. The user never waits for a timeout on a dead provider.
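On top of that health state, the routing loop is short. A sketch, assuming each provider is a `(health, call_fn)` pair where `health` exposes the methods from `ProviderHealth` above and `call_fn(prompt)` performs the actual API request:

```python
from time import perf_counter

def route_request(prompt: str, providers: list) -> str:
    """Send `prompt` to the first provider worth trying.

    `providers` is a priority-ordered list of (health, call_fn) pairs,
    where `health` exposes should_retry / mark_success / mark_failure
    and `call_fn(prompt)` makes the actual API request.
    """
    last_error = None
    for health, call_fn in providers:
        if not health.should_retry():
            continue  # known-dead provider: skip without waiting on a timeout
        start = perf_counter()
        try:
            result = call_fn(prompt)
        except Exception as exc:
            health.mark_failure()
            last_error = exc
            continue
        health.mark_success((perf_counter() - start) * 1000)
        return result
    raise RuntimeError("No healthy providers") from last_error
```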
Load balancing strategies
Once you have multiple providers, you also have a load balancing decision. Not all requests need to go to the same provider.
Round-robin. Distribute requests evenly across healthy providers. Simple, but ignores the fact that different providers have different strengths and costs.
Cost-weighted. Route to the cheapest healthy provider first. If DeepSeek is up and the request is simple, send it to DeepSeek at $0.28/M output tokens instead of GPT-4o at $10/M.
Latency-weighted. Route to the fastest healthy provider. If Groq is consistently returning responses in 200ms while OpenAI averages 800ms, Groq handles latency-sensitive requests.
Capability-weighted. Route based on what the request needs. Complex reasoning goes to GPT-4o or Claude. Simple classification goes to Llama on Groq. This is model routing, which overlaps with load balancing when you have multiple providers for similar capability tiers.
The most effective strategy combines these: route to the cheapest model that can handle the task on the fastest healthy provider. This is where load balancing, failover, and model routing converge into a single routing decision.
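One way those signals combine is a filter-then-sort over healthy providers: drop anything that cannot handle the task, then take the cheapest, breaking ties toward lower latency. A sketch with illustrative fields and prices; the dict shape and the `reasoning` capability flag are assumptions for this example, not any provider's API:

```python
def pick_by_score(providers: list, needs_reasoning: bool) -> dict:
    """Pick the cheapest healthy provider that meets the capability bar,
    breaking ties toward lower average latency.

    Each provider is a dict with keys: name, healthy, cost_per_m_out,
    avg_latency_ms, reasoning. All values are illustrative.
    """
    candidates = [
        p for p in providers
        if p["healthy"] and (p["reasoning"] or not needs_reasoning)
    ]
    if not candidates:
        raise RuntimeError("No healthy provider meets the requirements")
    return min(candidates, key=lambda p: (p["cost_per_m_out"], p["avg_latency_ms"]))
```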
The rate limit problem
Rate limits are the most common reason for LLM request failures in production, more common than actual outages.
Every provider has rate limits: requests per minute, tokens per minute, and sometimes requests per day. When you hit them, requests fail with 429 errors until the window resets.
The problem is that rate limits are per-organization. If you have multiple applications or teams using the same API key, one team's batch job can burn through the rate limit and cause another team's real-time chatbot to start failing.
Solutions:
Separate API keys per application. Each app gets its own rate limit allocation. This prevents one workload from starving another.
Request queuing. Buffer requests and release them at a rate just below the limit. Adds latency but prevents 429 errors.
Overflow routing. When you approach the rate limit on one provider, start routing overflow to a secondary provider. This is the cleanest solution: your primary provider handles traffic up to its limit, and the overflow goes somewhere else without any request failures.
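Overflow routing needs a running estimate of how close you are to the primary's limit. A minimal sliding-window sketch; the one-minute window and the 90% threshold are assumptions you would tune per provider:

```python
from collections import deque
from time import time

class RateWindow:
    """Track requests in the last 60 seconds against a per-minute limit."""

    def __init__(self, limit_per_min: int, threshold: float = 0.9):
        self.limit = limit_per_min
        self.threshold = threshold
        self.stamps = deque()

    def record(self, now=None):
        """Log one request (a timestamp can be injected for testing)."""
        self.stamps.append(time() if now is None else now)

    def near_limit(self, now=None) -> bool:
        """True once usage in the window crosses threshold * limit."""
        now = time() if now is None else now
        while self.stamps and now - self.stamps[0] > 60:
            self.stamps.popleft()  # drop requests outside the window
        return len(self.stamps) >= self.limit * self.threshold

# Route to the secondary before the primary starts returning 429s:
# provider = secondary if window.near_limit() else primary
```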
Gateway vs DIY
Build it yourself if you have one model from one provider and just want basic retry logic. The code above takes an afternoon.
Use a gateway if you have multiple models across multiple providers and need health checking, rate limit management, cost optimization, and quality-aware routing. The infrastructure to do this properly requires:
Health checking with circuit breakers for each provider
Rate limit tracking and predictive overflow
Latency monitoring and routing optimization
Model-aware routing (which provider serves which model)
Response normalization (different providers return different formats)
Cost tracking per request, per provider, per model
Quality validation to ensure failover models produce equivalent output
NeuralRouting does all of this at the gateway level. Multi-provider failover triggers before the timeout, not after it. If OpenAI is degraded, requests route to the next capable provider automatically. If you hit a rate limit, overflow goes to a secondary provider. Your application code makes one API call. The routing layer handles the rest.
What your users experience
Without failover: OpenAI goes down, your LLM features show error messages for hours, users lose trust.
With basic failover: OpenAI goes down, your app retries on another provider after a timeout, users see a slow response once, then things recover.
With health-checked routing: OpenAI goes down, the router already knows, requests never go there, users notice nothing. Not even a slow response.
The difference between these experiences is the difference between an LLM feature that feels like a demo and one that feels like production infrastructure.
Start simple, then upgrade
If you are running one model on one provider today, add a secondary provider as a failover. Just the basic try/catch version. It takes an afternoon and it will save you the next time OpenAI has a bad day.
If you are already using multiple models and providers, and the routing, health checking, and cost management is becoming its own project, that is the point where a gateway pays for itself.