Route smarter. Spend less. Scale faster.

Cut your AI costs from your first request with our intelligent multi-provider gateway.

1. Request

Unified API

2. Optimize

Cost/Latency

3. Route

Best Provider

4. Fallback

99.9%+ Uptime

Why NeuralRouting?

Legacy Approach

Single provider, fixed high costs, and a single point of failure. If OpenAI goes down, your business stops.

NeuralRouting

Multi-cloud resilience with real-time cost optimization. We pick the best model for every single prompt.

"You don't pick the model. The best model is picked for you."

Zero config. Real-time optimization per request.

Drop-in Integration

Replace your OpenAI baseURL and start saving. No refactoring needed.

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: "https://api.neuralrouting.io/v1",
  apiKey: "nr_live_your_api_key"
});

const response = await client.chat.completions.create({
  model: "neural-router-v2",
  messages: [{ role: "user", content: "Analyze this data" }]
});

Example Response

{
  "status": "success",
  "model_used": "claude-3.5-sonnet",
  "output": { "ai_answer": "..." },
  "business_metrics": {
    "cost_usd": 0.0020,
    "estimated_gpt4_cost": 0.0052,
    "savings_percentage": 61.5
  }
}
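The savings_percentage field can be reproduced client-side from the two cost fields. A minimal sketch (the helper name is mine, not part of the API):

```python
def savings_percentage(cost_usd: float, estimated_gpt4_cost: float) -> float:
    """Percent saved versus the estimated GPT-4 cost for the same request."""
    return round((1 - cost_usd / estimated_gpt4_cost) * 100, 1)

# Values from the example response above
print(savings_percentage(0.0020, 0.0052))  # 61.5
```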

Streaming Responses

Use /v1/dispatch/stream to receive tokens as they are generated — ideal for chat UIs and real-time applications. Returns standard SSE (Server-Sent Events) in OpenAI-compatible format.

JavaScript / TypeScript

const res = await fetch("https://api.neuralrouting.io/v1/dispatch/stream", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-KEY": "nr_live_your_api_key",
  },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Explain streaming in one paragraph." }],
    session_id: "my-session-01",
  }),
});

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // SSE lines can be split across reads, so buffer the partial tail
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();

  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const raw = line.slice(6).trim();
    if (raw === "[DONE]") {
      finished = true; // exit the outer read loop as well
      break;
    }

    const chunk = JSON.parse(raw);

    // Standard token chunks
    if (chunk.object === "chat.completion.chunk") {
      const token = chunk.choices?.[0]?.delta?.content;
      if (token) process.stdout.write(token);
    }

    // NeuralRouting billing event (last event before [DONE])
    if (chunk.object === "nr.billing") {
      console.log("Model:", chunk.model_used);
      console.log("Cost: $" + chunk.financials.billed_price.toFixed(6));
    }
  }
}

SSE Event Types

| Event | Meaning |
| --- | --- |
| `chat.completion.chunk` | Token delta — same format as OpenAI streaming |
| `nr.billing` | Final event with model, cost, savings, and token usage |
| `[DONE]` | Stream complete |
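For non-JavaScript clients, the per-line handling can be sketched in Python (event shapes follow the SSE event types above; the helper name is illustrative and no network call is shown):

```python
import json

def parse_sse_line(line: str):
    """Classify one SSE line as ('token', text), ('billing', event),
    ('done', None), or (None, ...) for keep-alives and unknown events."""
    if not line.startswith("data: "):
        return None, None
    raw = line[len("data: "):].strip()
    if raw == "[DONE]":
        return "done", None
    chunk = json.loads(raw)
    if chunk.get("object") == "chat.completion.chunk":
        return "token", chunk["choices"][0]["delta"].get("content")
    if chunk.get("object") == "nr.billing":
        return "billing", chunk
    return None, chunk

kind, text = parse_sse_line(
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hi"}}]}'
)
print(kind, text)  # token Hi
```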

AI Agent Integration

NeuralRouting exposes a fully OpenAI-compatible /v1/chat/completions endpoint — including tool/function calling. Point any agent framework at NeuralRouting and every step in your loop gets routed to the cheapest model that can handle it.

Complex reasoning steps → GPT-4o. Simple decisions and summaries → Llama (budget). Automatically, per request.

LangChain / Python

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate

# Just change the base_url — everything else stays the same
llm = ChatOpenAI(
    base_url="https://api.neuralrouting.io/v1",
    api_key="nr_live_your_api_key",
    model="neural-optimizer",
)

# Your tools, prompts, and agent logic — unchanged
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Research and summarize AI pricing trends"})

OpenAI SDK (any framework)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.neuralrouting.io/v1",
    api_key="nr_live_your_api_key",
)

# Full tool/function calling support
response = client.chat.completions.create(
    model="neural-optimizer",
    messages=[{"role": "user", "content": "What is 15% of 2400?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a math expression",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"]
            }
        }
    }],
    tool_choice="auto",
)

Why this matters for agents

Agent loops are expensive

A typical research agent makes 10–30 LLM calls per task. Sending all of them to GPT-4o is wasteful.

Not every step needs GPT-4o

Tool selection, intermediate summaries, and simple decisions route to budget models automatically.

10x cost reduction

An agent that costs $0.50/task with GPT-4o can drop to $0.05 with intelligent per-step routing.
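The order-of-magnitude figure follows from the credit prices listed in the model table on this page (GPT-4o: 100 cr/1K tokens, Llama 3.1 8B: 1 cr/1K tokens); the step counts below are illustrative:

```python
# Credit prices from the model table: GPT-4o = 100 cr/1K tokens, Llama 3.1 8B = 1 cr/1K.
# Assume a 20-step agent task at ~1K tokens per step (illustrative numbers).
steps, tokens_k = 20, 1
all_premium = steps * tokens_k * 100                       # every step on GPT-4o
routed = 2 * tokens_k * 100 + 18 * tokens_k * 1            # only 2 hard steps stay premium
print(f"{all_premium} cr -> {routed} cr ({all_premium / routed:.1f}x cheaper)")
```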

Scale with confidence

70%
Reduce Costs

Instantly offload simple tasks to economical models.

99.9%+
Increase Uptime

Achieve total resilience without multi-provider complexity.

0ms
Deploy Globally

Zero infrastructure setup. Global availability from day one.

Financial Control Center

  • Pinpoint cost leaks by filtering by endpoint, user, or feature.
  • Detect expensive requests that don't require high-tier models.
  • Real-time audit logs for every cent spent and saved.

Semantic Cache

Every prompt that passes through NeuralRouting is embedded and stored. Future requests that are semantically similar (not just identical) hit the cache and are returned instantly — no model call, no credit deduction.

Level 1 — Exact match

SHA-256 hash lookup. Zero cost, < 1ms.

Level 2 — Semantic match

Cosine similarity via pgvector. Threshold: 0.92.

Cache miss

Routes normally, stores result async. No latency added.
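The three tiers can be sketched with a toy in-memory cache (a simplified stand-in for the real SHA-256 + pgvector pipeline; the embedding function is supplied by the caller, and 0.92 matches the documented default threshold):

```python
import hashlib
import math

THRESHOLD = 0.92  # CACHE_SIMILARITY_THRESHOLD default

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm  # toy version: assumes non-zero vectors

class SemanticCache:
    def __init__(self, embed):
        self.embed = embed   # prompt -> vector; the embedding model is an assumption
        self.exact = {}      # sha256(prompt) -> response
        self.vectors = []    # (vector, response) pairs

    def get(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact:                    # Level 1: exact hash lookup
            return {"cache_hit": True, "cache_exact": True,
                    "response": self.exact[key]}
        v = self.embed(prompt)
        for vec, resp in self.vectors:           # Level 2: cosine similarity
            sim = cosine(v, vec)
            if sim >= THRESHOLD:
                return {"cache_hit": True, "cache_exact": False,
                        "cache_similarity": round(sim, 4), "response": resp}
        return None                              # miss: route normally, store async

    def put(self, prompt, response):
        self.exact[hashlib.sha256(prompt.encode()).hexdigest()] = response
        self.vectors.append((self.embed(prompt), response))
```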

When a cache hit occurs, the response includes two extra fields:

{
  "status": "success",
  "model_used": "claude-3.5-sonnet",
  "output": { "ai_answer": "..." },
  "cache_hit": true,           // ← served from semantic cache
  "cache_exact": false,        // ← exact hash (true) or semantic match (false)
  "cache_similarity": 0.9541,  // ← cosine similarity score
  "business_metrics": { ... }
}

Configure via env: SEMANTIC_CACHE_ENABLED, CACHE_SIMILARITY_THRESHOLD (default: 0.92), CACHE_TTL_DAYS (default: 7).

Prompt Injection Shield

Every request is scanned by a real-time heuristic engine before any model call or credit deduction. It detects and blocks prompt injection attempts, jailbreaks, DAN patterns, and system-prompt extraction — in under 1ms with no LLM calls.

CRITICAL (blocked)

DAN jailbreaks, ignore-all-instructions, token smuggling, bypass-safety patterns

HIGH (blocked)

System-tag injection ([INST], <system>, [[SYSTEM]]), role override, prompt extraction requests

MEDIUM (flagged)

Compound ignore-above constructs, base64 encoded payloads
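A toy version of the heuristic scan (the patterns, scores, and rule set here are illustrative; the production engine's rules are not published, and 0.85 matches the documented SHIELD_BLOCK_THRESHOLD default):

```python
import re

# Illustrative patterns only; the real shield uses a larger, unpublished rule set.
RULES = [
    (re.compile(r"ignore (all|previous|above).{0,20}instructions", re.I), "CRITICAL", 0.95),
    (re.compile(r"\bDAN\b|do anything now", re.I), "CRITICAL", 0.95),
    (re.compile(r"\[INST\]|<system>|\[\[SYSTEM\]\]", re.I), "HIGH", 0.90),
    (re.compile(r"ignore .{0,40}above", re.I), "MEDIUM", 0.60),
]
BLOCK_THRESHOLD = 0.85  # SHIELD_BLOCK_THRESHOLD default

def scan(prompt: str) -> dict:
    """Return the highest-risk category matched and whether the request blocks."""
    category, score = "NONE", 0.0
    for pattern, cat, s in RULES:
        if pattern.search(prompt) and s > score:
            category, score = cat, s
    return {"category": category, "risk_score": score,
            "blocked": score >= BLOCK_THRESHOLD}
```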

Blocked requests return HTTP 403:

HTTP 403 Forbidden

{
  "error": "Request blocked by NeuralRouting Security Shield",
  "category": "CRITICAL",
  "risk_score": 0.95
}

All blocked requests are logged in your security audit trail, accessible from the dashboard. Configure threshold via SHIELD_BLOCK_THRESHOLD (default: 0.85).

User Attribution

Pass a user field in your request body to tag requests by end-user. This unlocks per-user cost breakdowns, savings attribution, and budget enforcement in the dashboard.

{
  "messages": [{ "role": "user", "content": "..." }],
  "user": "end-user-id-or-email",   // ← attribution tag
  "session_id": "optional-session"
}

View per-user spend, request counts, and savings at neuralrouting.io/attribution.
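Building the attributed body is straightforward; a hypothetical helper (the function name is mine) producing the shape shown above:

```python
import json

def attributed_body(messages, end_user, session_id=None):
    """Build a request body carrying the attribution fields from the example above."""
    body = {"messages": messages, "user": end_user}
    if session_id:
        body["session_id"] = session_id
    return body

body = attributed_body([{"role": "user", "content": "hi"}], "customer-42", "sess-1")
print(json.dumps(body))
```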

Routing Modes

Control how the LLM router selects models. Pass X-Routing-Mode header or use the dashboard default.

| Mode | Strategy | Cost | Speed | Quality | Plans |
| --- | --- | --- | --- | --- | --- |
| auto | Classifier analyzes complexity → optimal model | ★★★★★ | ★★★★ | ★★★★★ | All |
| cost | Always economy tier (Llama 3.1 8B) | ★★★★★ | ★★★★★ | ★★★ | Starter+ |
| quality | Always premium tier (GPT-4o) | ★ | ★★ | ★★★★★ | Growth+ |
| speed | Fastest model via Groq | ★★★★ | ★★★★★ | ★★★★ | Growth+ |

In auto mode, the zero-cost local classifier detects task type and complexity. Requests with complexity > 6 or high-risk content auto-escalate to premium. The Confidence Matrix further adjusts based on historical quality data.
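The escalation rule can be sketched as a stub (the classifier's internals are not published; the cost/quality model names follow this page's model list, and the speed target is a placeholder):

```python
def pick_tier(complexity: int, high_risk: bool, mode: str = "auto") -> str:
    """Map a routing mode plus classifier output to a model tier."""
    if mode == "cost":
        return "llama-3.1-8b"      # always economy tier
    if mode == "quality":
        return "gpt-4o"            # always premium tier
    if mode == "speed":
        return "groq-fastest"      # placeholder name for the fastest provider
    # auto: escalate hard or risky requests to premium (complexity > 6 rule)
    if complexity > 6 or high_risk:
        return "gpt-4o"
    return "llama-3.1-8b"

print(pick_tier(complexity=8, high_risk=False))  # gpt-4o
print(pick_tier(complexity=3, high_risk=False))  # llama-3.1-8b
```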

Rate Limits & Models

Rate Limits by Plan

| Plan | Rate limit | Credits |
| --- | --- | --- |
| Free | 3 req/min | 5K |
| Starter ($29/mo) | 60 req/min | 50K |
| Growth ($89/mo) | 250 req/min | 200K |
| Business ($349/mo) | 1,000 req/min | 1M |
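Client-side, you can stay under your plan's per-minute quota with a simple sliding-window limiter (a local sketch; the server still enforces the real limits):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter for a req/min quota (e.g. 60 on Starter)."""

    def __init__(self, max_per_minute):
        self.max = max_per_minute
        self.sent = deque()  # timestamps of requests inside the window

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()          # drop requests older than 60s
        if len(self.sent) < self.max:
            self.sent.append(now)
            return True
        return False

limiter = RateLimiter(3)  # Free plan: 3 req/min
print([limiter.allow(now=t) for t in (0, 1, 2, 3, 61)])  # [True, True, True, False, True]
```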

Available Models

| Model | Tier | Price |
| --- | --- | --- |
| GPT-4o | Premium | 100 cr/1K tokens |
| GPT-4o Mini | Medium | 10 cr/1K tokens |
| Llama 3.1 8B | Budget | 1 cr/1K tokens |
| Llama 3.1 70B | Budget+ | 3 cr/1K tokens |

In auto mode, the LLM router picks the model. Use X-Force-Model header to override (Starter+ plans).

Error Codes

| Code | Meaning |
| --- | --- |
| 200 | Success |
| 402 | Insufficient credits |
| 403 | Security shield blocked |
| 429 | Rate limit exceeded |
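A minimal client-side handler for these codes (retry only 429; treat 402 and 403 as non-retryable; the exception name is illustrative):

```python
class NeuralRoutingError(Exception):
    pass

def handle_status(status: int, attempt: int, max_retries: int = 3) -> str:
    """Return 'ok' or 'retry', or raise for non-retryable statuses."""
    if status == 200:
        return "ok"
    if status == 429:
        if attempt < max_retries:
            return "retry"           # back off, then retry the request
        raise NeuralRoutingError("Rate limit exceeded and retries exhausted")
    if status == 402:
        raise NeuralRoutingError("Insufficient credits: top up your plan")
    if status == 403:
        raise NeuralRoutingError("Request blocked by the Security Shield")
    raise NeuralRoutingError(f"Unhandled status {status}")

print(handle_status(200, 0))  # ok
print(handle_status(429, 1))  # retry
```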

Start saving on every AI request.

Join the teams optimizing their AI infrastructure. No credit card required to start.

Get Started Now