LLM load balancing and failover: how to keep your AI running when providers go down
OpenAI went down 3 times last quarter. If your app depends on a single LLM provider, those outages are your outages. Here's how to build multi-provider failover that switches automatically.
NeuralRouting Team
April 25, 2026
On March 14, 2026, OpenAI had a partial outage that lasted about 4 hours. If your app was wired directly to the OpenAI API with no fallback, your LLM features were down for 4 hours. Your users saw errors, your support queue filled up, and there was nothing you could do except wait.
This happens more often than most teams plan for. OpenAI, Anthropic, and Google all have periodic degraded performance, rate limit spikes, and full outages. If you depend on a single provider, their reliability is your ceiling.
The fix is multi-provider failover: when one provider goes down or slows down, requests automatically reroute to another.
The three failure modes
Not all provider failures look the same, and each one requires a different detection and response strategy.
Full outage. The API returns 500/503 errors or doesn't respond. Every request fails, and detection is straightforward.
Partial degradation. The API responds slowly. Time-to-first-token goes from 300ms to 4 seconds. Or responses get worse: more refusals, truncations. These requests technically succeed, so detection is harder.
Rate limiting. You hit the provider's rate limit and start getting 429 errors. This often happens during traffic spikes or when another customer on your shared tier is consuming capacity. The fix is not to retry against the same provider, it is to route the overflow to a different one.
The obvious first fix is sequential fallback: try OpenAI, fall back to Anthropic if OpenAI fails, and fall back to Groq if Anthropic also fails. It works. But it has problems.
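A minimal sketch of that fallback chain. The provider callables are passed in priority order; the `call_openai`-style wrappers named in the docstring are hypothetical stand-ins for each vendor's SDK call, not real functions:

```python
def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each provider callable in priority order; raise only if all fail.

    `providers` is a list of functions like call_openai(prompt) -> str,
    hypothetical wrappers around each vendor's SDK. Any exception
    (timeout, 5xx, 429) counts as a failure and we move to the next.
    """
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("All providers failed") from last_error
```

In practice each wrapper would also enforce a per-request timeout, since a hung connection is the slowest failure mode to detect.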
What the simple approach gets wrong
It is slow to detect degradation. The timeout has to elapse before failover triggers. If your timeout is 10 seconds and OpenAI is responding slowly but not timing out, every request eats that latency before falling over.
It retries unnecessarily. If OpenAI is fully down, every request still tries OpenAI first, fails, and then falls over. You are adding the timeout delay to every request instead of just routing around the outage.
It does not balance load. All traffic goes to the primary provider until it fails. You are not spreading requests across providers to avoid hitting rate limits, and you are not using the cheaper provider when it can handle the request.
Model behavior varies by provider. GPT-4o, Claude Sonnet, and Llama 3.1 70B produce different outputs for the same prompt. If you depend on specific formats or behaviors, failover to a different model can break things.
Better approach: health-checked routing
Instead of failing over per-request, maintain a health status for each provider and route based on current health.
```python
from dataclasses import dataclass
from time import time

@dataclass
class ProviderHealth:
    name: str
    healthy: bool = True
    last_check: float = 0
    consecutive_failures: int = 0
    avg_latency_ms: float = 0

    def mark_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= 3:
            self.healthy = False
            self.last_check = time()

    def mark_success(self, latency_ms: float):
        self.consecutive_failures = 0
        self.healthy = True
        # exponentially weighted moving average of latency
        self.avg_latency_ms = (self.avg_latency_ms * 0.9) + (latency_ms * 0.1)

    def should_retry(self) -> bool:
        """Check unhealthy providers every 30 seconds."""
        if self.healthy:
            return True
        return time() - self.last_check > 30
```
Now you route to the first healthy provider, skip providers that are down, and periodically re-check unhealthy providers to see if they have recovered. The user never waits for a timeout on a dead provider.
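On top of that health state, the routing loop is short. A sketch, assuming each provider is a `(health, call_fn)` pair where `health` exposes the methods from `ProviderHealth` above and `call_fn(prompt)` performs the actual API request:

```python
from time import perf_counter

def route_request(prompt: str, providers: list) -> str:
    """Send `prompt` to the first provider worth trying.

    `providers` is a priority-ordered list of (health, call_fn) pairs,
    where `health` exposes should_retry / mark_success / mark_failure
    and `call_fn(prompt)` makes the actual API request.
    """
    last_error = None
    for health, call_fn in providers:
        if not health.should_retry():
            continue  # known-dead provider: skip without waiting on a timeout
        start = perf_counter()
        try:
            result = call_fn(prompt)
        except Exception as exc:
            health.mark_failure()
            last_error = exc
            continue
        health.mark_success((perf_counter() - start) * 1000)
        return result
    raise RuntimeError("No healthy providers") from last_error
```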
Load balancing strategies
Once you have multiple providers, you also have a load balancing decision. Not all requests need to go to the same provider.
Round-robin. Distribute requests evenly across healthy providers. Simple, but ignores the fact that different providers have different strengths and costs.
Cost-weighted. Route to the cheapest healthy provider first. If DeepSeek is up and the request is simple, send it to DeepSeek at $0.28/M output tokens instead of GPT-4o at $10/M.
Latency-weighted. Route to the fastest healthy provider. If Groq is consistently returning responses in 200ms while OpenAI averages 800ms, Groq handles latency-sensitive requests.
Capability-weighted. Route based on what the request needs. Complex reasoning goes to GPT-4o or Claude. Simple classification goes to Llama on Groq. This is model routing, which overlaps with load balancing when you have multiple providers for similar capability tiers.
The most effective strategy combines these: route to the cheapest model that can handle the task on the fastest healthy provider. This is where load balancing, failover, and model routing converge into a single routing decision.
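One way those signals combine is a filter-then-sort over healthy providers: drop anything that cannot handle the task, then take the cheapest, breaking ties toward lower latency. A sketch with illustrative fields and prices; the dict shape and the `reasoning` capability flag are assumptions for this example, not any provider's API:

```python
def pick_by_score(providers: list, needs_reasoning: bool) -> dict:
    """Pick the cheapest healthy provider that meets the capability bar,
    breaking ties toward lower average latency.

    Each provider is a dict with keys: name, healthy, cost_per_m_out,
    avg_latency_ms, reasoning. All values are illustrative.
    """
    candidates = [
        p for p in providers
        if p["healthy"] and (p["reasoning"] or not needs_reasoning)
    ]
    if not candidates:
        raise RuntimeError("No healthy provider meets the requirements")
    return min(candidates, key=lambda p: (p["cost_per_m_out"], p["avg_latency_ms"]))
```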
The rate limit problem
Rate limits are the most common reason for LLM request failures in production, more common than actual outages.
Every provider has rate limits: requests per minute, tokens per minute, and sometimes requests per day. When you hit them, requests fail with 429 errors until the window resets.
The problem is that rate limits are per-organization. If you have multiple applications or teams using the same API key, one team's batch job can burn through the rate limit and cause another team's real-time chatbot to start failing.
Solutions:
Separate API keys per application. Each app gets its own rate limit allocation. This prevents one workload from starving another.
Request queuing. Buffer requests and release them at a rate just below the limit. Adds latency but prevents 429 errors.
Overflow routing. When you approach the rate limit on one provider, start routing overflow to a secondary provider. This is the cleanest solution: your primary provider handles traffic up to its limit, and the overflow goes somewhere else without any request failures.
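Overflow routing needs a running estimate of how close you are to the primary's limit. A minimal sliding-window sketch; the one-minute window and the 90% threshold are assumptions you would tune per provider:

```python
from collections import deque
from time import time

class RateWindow:
    """Track requests in the last 60 seconds against a per-minute limit."""

    def __init__(self, limit_per_min: int, threshold: float = 0.9):
        self.limit = limit_per_min
        self.threshold = threshold
        self.stamps = deque()

    def record(self, now=None):
        """Log one request (a timestamp can be injected for testing)."""
        self.stamps.append(time() if now is None else now)

    def near_limit(self, now=None) -> bool:
        """True once usage in the window crosses threshold * limit."""
        now = time() if now is None else now
        while self.stamps and now - self.stamps[0] > 60:
            self.stamps.popleft()  # drop requests outside the window
        return len(self.stamps) >= self.limit * self.threshold

# Route to the secondary before the primary starts returning 429s:
# provider = secondary if window.near_limit() else primary
```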
Gateway vs DIY
Build it yourself if you have one model from one provider and just want basic retry logic. The code above takes an afternoon.
Use a gateway if you have multiple models across multiple providers and need health checking, rate limit management, cost optimization, and quality-aware routing. The infrastructure to do this properly requires:
Health checking with circuit breakers for each provider
Rate limit tracking and predictive overflow
Latency monitoring and routing optimization
Model-aware routing (which provider serves which model)
Response normalization (different providers return different formats)
Cost tracking per request, per provider, per model
Quality validation to ensure failover models produce equivalent output
NeuralRouting does all of this at the gateway level. Multi-provider failover triggers before the timeout, not after it. If OpenAI is degraded, requests route to the next capable provider automatically. If you hit a rate limit, overflow goes to a secondary provider. Your application code makes one API call. The routing layer handles the rest.
What your users experience
Without failover: OpenAI goes down, your LLM features show error messages for hours, users lose trust.
With basic failover: OpenAI goes down, your app retries on another provider after a timeout, users see a slow response once, then things recover.
With health-checked routing: OpenAI goes down, the router already knows, requests never go there, users notice nothing. Not even a slow response.
The difference between these experiences is the difference between an LLM feature that feels like a demo and one that feels like production infrastructure.
Start simple, then upgrade
If you are running one model on one provider today, add a secondary provider as a failover. Just the basic try/catch version. It takes an afternoon and it will save you the next time OpenAI has a bad day.
If you are already using multiple models and providers, and the routing, health checking, and cost management is becoming its own project, that is the point where a gateway pays for itself.