Route smarter. Spend less. Scale faster.
Cut your AI costs from your first request with our intelligent multi-provider gateway.
1. Request
Unified API
2. Optimize
Cost/Latency
3. Route
Best Provider
4. Fallback
100% Uptime
Why NeuralRouting?
Legacy Approach
Single provider, fixed high costs, and a single point of failure. If OpenAI goes down, your business stops.
NeuralRouting
Multi-cloud resilience with real-time cost optimization. We pick the best model for every single prompt.
"You don't pick the model. The best model is picked for you."
Zero config. Real-time optimization per request.
Drop-in Integration
Replace your OpenAI baseURL and start saving. No refactoring needed.
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: "https://api.neuralrouting.io/v1",
apiKey: "nr_live_your_api_key"
});
const response = await client.chat.completions.create({
model: "neural-router-v2",
messages: [{ role: "user", content: "Analyze this data" }]
});Example Response
{
"status": "success",
"model_used": "claude-3.5-sonnet",
"output": { "ai_answer": "..." },
"business_metrics": {
"cost_usd": 0.0020,
"estimated_gpt4_cost": 0.0052,
"savings_percentage": 61.5
}
}Streaming Responses
Use /v1/dispatch/stream to receive tokens as they are generated — ideal for chat UIs and real-time applications. Returns standard SSE (Server-Sent Events) in OpenAI-compatible format.
JavaScript / TypeScript
const res = await fetch("https://api.neuralrouting.io/v1/dispatch/stream", {
method: "POST",
headers: {
"Content-Type": "application/json",
"X-API-KEY": "nr_live_your_api_key",
},
body: JSON.stringify({
messages: [{ role: "user", content: "Explain streaming in one paragraph." }],
session_id: "my-session-01",
}),
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
for (const line of decoder.decode(value).split("\n")) {
if (!line.startsWith("data: ")) continue;
const raw = line.slice(6).trim();
if (raw === "[DONE]") break;
const chunk = JSON.parse(raw);
// Standard token chunks
if (chunk.object === "chat.completion.chunk") {
const token = chunk.choices?.[0]?.delta?.content;
if (token) process.stdout.write(token);
}
// NeuralRouting billing event (last event before [DONE])
if (chunk.object === "nr.billing") {
console.log("Model:", chunk.model_used);
console.log("Cost: $" + chunk.financials.billed_price.toFixed(6));
}
}
}SSE Event Types
AI Agent Integration
NeuralRouting exposes a fully OpenAI-compatible /v1/chat/completions endpoint — including tool/function calling. Point any agent framework at NeuralRouting and every step in your loop gets routed to the cheapest model that can handle it.
Complex reasoning steps → GPT-4o. Simple decisions and summaries → Llama (budget). Automatically, per request.
LangChain / Python
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate
# Just change the base_url — everything else stays the same
llm = ChatOpenAI(
base_url="https://api.neuralrouting.io/v1",
api_key="nr_live_your_api_key",
model="neural-optimizer",
)
# Your tools, prompts, and agent logic — unchanged
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Research and summarize AI pricing trends"})OpenAI SDK (any framework)
from openai import OpenAI
client = OpenAI(
base_url="https://api.neuralrouting.io/v1",
api_key="nr_live_your_api_key",
)
# Full tool/function calling support
response = client.chat.completions.create(
model="neural-optimizer",
messages=[{"role": "user", "content": "What is 15% of 2400?"}],
tools=[{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate a math expression",
"parameters": {
"type": "object",
"properties": {"expression": {"type": "string"}},
"required": ["expression"]
}
}
}],
tool_choice="auto",
)Why this matters for agents
Agent loops are expensive
A typical research agent makes 10–30 LLM calls per task. Sending all of them to GPT-4o is wasteful.
Not every step needs GPT-4o
Tool selection, intermediate summaries, and simple decisions route to budget models automatically.
10x cost reduction
An agent that costs $0.50/task with GPT-4o can drop to $0.05 with intelligent per-step routing.
Scale with confidence
Instantly offload simple tasks to economical models.
Achieve total resilience without multi-provider complexity.
Zero infrastructure setup. Global availability from day one.
Financial Control Center
- Pinpoint cost leaks by filtering by endpoint, user, or feature.
- Detect expensive requests that don't require high-tier models.
- Real-time audit logs for every cent spent and saved.
Semantic Cache
Every prompt that passes through NeuralRouting is embedded and stored. Future requests that are semantically similar (not just identical) hit the cache and are returned instantly — no model call, no credit deduction.
Level 1 — Exact match
SHA-256 hash lookup. Zero cost, < 1ms.
Level 2 — Semantic match
Cosine similarity via pgvector. Threshold: 0.92.
Cache miss
Routes normally, stores result async. No latency added.
When a cache hit occurs, the response includes two extra fields:
{
"status": "success",
"model_used": "claude-3.5-sonnet",
"output": { "ai_answer": "..." },
"cache_hit": true, // ← served from semantic cache
"cache_exact": false, // ← exact hash (true) or semantic match (false)
"cache_similarity": 0.9541, // ← cosine similarity score
"business_metrics": { ... }
}Configure via env: SEMANTIC_CACHE_ENABLED, CACHE_SIMILARITY_THRESHOLD (default: 0.92), CACHE_TTL_DAYS (default: 7).
Prompt Injection Shield
Every request is scanned by a real-time heuristic engine before any model call or credit deduction. It detects and blocks prompt injection attempts, jailbreaks, DAN patterns, and system-prompt extraction — in under 1ms with no LLM calls.
CRITICAL (blocked)
DAN jailbreaks, ignore-all-instructions, token smuggling, bypass-safety patterns
HIGH (blocked)
System-tag injection ([INST], <system>, [[SYSTEM]]), role override, prompt extraction requests
MEDIUM (flagged)
Compound ignore-above constructs, base64 encoded payloads
Blocked requests return HTTP 403:
HTTP 403 Forbidden
{
"error": "Request blocked by NeuralRouting Security Shield",
"category": "CRITICAL",
"risk_score": 0.95
}All blocked requests are logged in your security audit trail, accessible from the dashboard. Configure threshold via SHIELD_BLOCK_THRESHOLD (default: 0.85).
User Attribution
Pass a user field in your request body to tag requests by end-user. This unlocks per-user cost breakdowns, savings attribution, and budget enforcement in the dashboard.
{
"messages": [{ "role": "user", "content": "..." }],
"user": "end-user-id-or-email", // ← attribution tag
"session_id": "optional-session"
}View per-user spend, request counts, and savings at neuralrouting.io/attribution.
Routing Modes
Control how the LLM router selects models. Pass X-Routing-Mode header or use the dashboard default.
| Mode | Strategy | Cost | Speed | Quality | Plans |
|---|---|---|---|---|---|
auto | Classifier analyzes complexity → optimal model | ★★★★★ | ★★★★ | ★★★★★ | All |
cost | Always economy tier (Llama 3.1 8B) | ★★★★★ | ★★★★★ | ★★★ | Starter+ |
quality | Always premium tier (GPT-4o) | ★ | ★★★ | ★★★★★ | Growth+ |
speed | Fastest model via Groq | ★★★★★ | ★★★★★ | ★★★ | Growth+ |
In auto mode, the zero-cost local classifier detects task type and complexity. Requests with complexity > 6 or high-risk content auto-escalate to premium. The Confidence Matrix further adjusts based on historical quality data.
Rate Limits & Models
Rate Limits by Plan
Available Models
In auto mode, the LLM router picks the model. Use X-Force-Model header to override (Starter+ plans).
Error Codes
Success
Insufficient credits
Security shield blocked
Rate limit exceeded
Start saving on every AI request.
Join the teams optimizing their AI infrastructure. No credit card required to start.
Get Started Now