Architecture
How It Works
Every request passes through 8 layers — cache, security, classification, quality matrix, routing, validation, learning, and reporting.
Semantic Cache Check
Before anything else, the prompt is hashed and matched against the semantic cache. If a similar prompt was routed before, the answer comes back instantly — no model call, zero cost.
// Neural Insight: Two-level lookup: exact SHA-256 hash (< 1ms, free) → cosine similarity via pgvector (threshold: 0.92). The cache grows smarter with every request.
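The two-level lookup can be sketched as follows. This is a minimal in-memory stand-in: the real system uses pgvector for the similarity search, and `embed`, `exact_cache`, and `vector_cache` are names assumed here for illustration.

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.92  # cosine threshold from the docs

exact_cache = {}    # SHA-256 hex digest -> cached answer (level 1)
vector_cache = []   # (embedding, answer) pairs; pgvector in production (level 2)

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cache_lookup(prompt, embed):
    # Level 1: exact SHA-256 match -- sub-millisecond, no embedding needed.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # Level 2: semantic match -- embed the prompt, find the nearest neighbor,
    # and only accept it above the similarity threshold.
    query = embed(prompt)
    best, best_sim = None, 0.0
    for vec, answer in vector_cache:
        sim = cosine(query, vec)
        if sim > best_sim:
            best, best_sim = answer, sim
    return best if best_sim >= SIMILARITY_THRESHOLD else None
```

A miss at both levels returns `None`, and the request proceeds to the security scan.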
Security Shield Scan
Every prompt is scanned by a real-time heuristic engine that detects prompt injection, DAN jailbreaks, system-prompt extraction, and token-smuggling attacks.
// Neural Insight: Pure regex/heuristic — no LLM call, under 1ms. Three tiers: CRITICAL and HIGH patterns are blocked (HTTP 403). MEDIUM patterns are flagged and logged.
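The tiered block/flag logic looks roughly like this. The patterns below are illustrative placeholders only; the production rule set is not part of these docs.

```python
import re

# Illustrative patterns only -- the real engine's rule set is much larger.
RULES = [
    ("CRITICAL", re.compile(r"ignore (all|previous) instructions", re.I)),
    ("HIGH",     re.compile(r"\bDAN\b|do anything now", re.I)),
    ("MEDIUM",   re.compile(r"repeat your system prompt", re.I)),
]

def scan(prompt):
    """Return (action, tier): block CRITICAL/HIGH, flag MEDIUM, else allow."""
    for tier, pattern in RULES:
        if pattern.search(prompt):
            if tier in ("CRITICAL", "HIGH"):
                return ("block", tier)   # surfaced to the client as HTTP 403
            return ("flag", tier)        # logged, request still proceeds
    return ("allow", None)
```

Because everything is regex, the scan adds effectively no latency and no model cost.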
Prompt Analysis & Classification
A lightweight intent model classifies the task type (coding, summarization, reasoning, chat...) and assigns a complexity score in real time.
// Neural Insight: This classification drives all downstream decisions — routing mode, confidence matrix lookup, and shadow engine triggers.
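The classifier's output shape can be sketched with a toy keyword heuristic. The real system uses a lightweight model, not keyword matching; the keyword lists and the length-based complexity proxy below are assumptions for illustration.

```python
# Toy stand-in for the intent model: returns (task_type, complexity in [0, 1]).
KEYWORDS = {
    "coding":        ("def ", "function", "bug", "compile"),
    "summarization": ("summarize", "tl;dr", "summary"),
    "reasoning":     ("prove", "why", "step by step"),
}

def classify(prompt):
    lowered = prompt.lower()
    task = "chat"  # default bucket when nothing matches
    for task_type, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            task = task_type
            break
    # Toy complexity proxy: longer prompts score higher, capped at 1.0.
    complexity = min(len(prompt) / 2000, 1.0)
    return task, complexity
```

Everything downstream keys off this pair: the confidence matrix is indexed by task type, and the complexity score feeds the routing decision.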
Confidence Matrix Lookup
Before routing, the engine checks a live quality matrix: has this model/task-type combination historically produced poor results? If yes, it auto-escalates to a premium model.
// Neural Insight: Matrix is rebuilt from shadow audit data every 30 minutes. Pairs with < 20 samples are skipped to avoid cold-start false positives.
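The lookup, including the cold-start guard, can be sketched like this. The 20-sample minimum comes from the docs; the failure-rate cutoff is an assumed value for illustration.

```python
MIN_SAMPLES = 20          # pairs with fewer samples are skipped (cold-start guard)
FAILURE_THRESHOLD = 0.15  # assumed cutoff -- not specified in the docs

# (model, task_type) -> (failure_count, sample_count),
# rebuilt from shadow audit data every 30 minutes.
matrix = {}

def should_escalate(model, task_type):
    failures, samples = matrix.get((model, task_type), (0, 0))
    if samples < MIN_SAMPLES:
        return False  # not enough evidence -- trust the default route
    return failures / samples > FAILURE_THRESHOLD
```

A `True` result overrides the cost-optimized route and sends the request straight to a premium model.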
Dynamic Model Routing
The router selects the cheapest model capable of handling the task at the required quality level — Auto, Cost, Speed, Quality, or your Custom rules.
// Neural Insight: Why pay for GPT-4o if a 10x cheaper model handles summarization at 99.9% accuracy? The router makes that call per request, in milliseconds.
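"Cheapest model that clears the quality bar" reduces to a filter-then-min. The model names, prices, and quality scores below are hypothetical placeholders, not the actual routing table.

```python
# Hypothetical model table: (name, cost per 1M tokens, quality score 0-1).
MODELS = [
    ("econ-small",  0.15, 0.80),
    ("mid-tier",    1.00, 0.90),
    ("premium-4o", 10.00, 0.98),
]

def route(required_quality):
    """Pick the cheapest model meeting the quality bar for this task."""
    capable = [m for m in MODELS if m[2] >= required_quality]
    if not capable:
        return MODELS[-1][0]  # nothing qualifies: fall back to the best model
    return min(capable, key=lambda m: m[1])[0]
```

The required quality level is derived from the routing mode (Auto, Cost, Speed, Quality, or Custom) plus the task type and complexity from the classification step.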
Shadow Quality Validation
For economy-tier responses, a silent A/B check runs in the background to validate quality before committing the routing decision to the confidence matrix.
// Neural Insight: If the shadow check flags a poor response, the result is escalated to a premium model automatically. This data feeds back into the confidence matrix.
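The escalation path can be sketched as below. `judge` and `premium_call` are stand-ins for the real audit check and premium-model client; the names are assumptions for illustration.

```python
def shadow_validate(prompt, economy_answer, judge, premium_call):
    """Silently audit an economy-tier answer; escalate if it fails.

    judge(prompt, answer) -> bool and premium_call(prompt) -> str are
    injected here because the real implementations are not public.
    """
    if judge(prompt, economy_answer):
        # Quality confirmed: keep the cheap answer, record a pass.
        return economy_answer, "economy"
    # Poor response: re-answer with a premium model. The failure is also
    # recorded against this model/task pair in the confidence matrix.
    return premium_call(prompt), "escalated"
```

Because the check runs in the background, the user only ever sees the final (possibly escalated) answer.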
Feedback Loop & Learning
Every shadow audit result updates the confidence matrix. Over time, routing decisions improve automatically without any manual configuration.
// Neural Insight: This is the data moat: a competitor running the same code today starts with zero historical quality data. Your routing gets better as your traffic grows.
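The write side of the loop is a simple per-pair counter that the matrix rebuild later snapshots. The structure below is an assumption about shape, not the actual storage schema.

```python
from collections import defaultdict

# (model, task_type) -> [failure_count, sample_count]; periodically
# snapshotted into the confidence matrix the router consults.
audit_log = defaultdict(lambda: [0, 0])

def record_audit(model, task_type, passed):
    # Every shadow audit result lands here, pass or fail.
    entry = audit_log[(model, task_type)]
    entry[1] += 1
    if not passed:
        entry[0] += 1

def failure_rate(model, task_type):
    failures, samples = audit_log[(model, task_type)]
    return failures / samples if samples else 0.0
```

The accumulated counts are what make routing traffic-dependent: more audits per pair means tighter failure-rate estimates and better escalation decisions.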
FinOps Attribution & Reporting
Every cent saved is recorded. Per-user attribution, daily cost series, and ROI vs the GPT-4o benchmark are all tracked in real time.
// Neural Insight: Use the User Attribution field to track spend per end-user, generate monthly PDF reports, and show your CFO exactly how much AI routing saves.
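Per-user savings attribution amounts to comparing each request's actual cost against the benchmark price. The baseline rate below is a hypothetical number for the arithmetic, not the real GPT-4o price.

```python
# Toy savings ledger vs a GPT-4o-style baseline. The rate is assumed.
BASELINE_COST_PER_1M = 10.00  # hypothetical benchmark $/1M tokens

def record_request(ledger, user_id, tokens, actual_cost):
    # What the request would have cost on the benchmark model...
    baseline = tokens / 1_000_000 * BASELINE_COST_PER_1M
    # ...minus what it actually cost = savings attributed to this user.
    ledger[user_id] = ledger.get(user_id, 0.0) + (baseline - actual_cost)

ledger = {}
record_request(ledger, "alice", 2_000_000, 0.30)
# alice's 2M tokens: 20.00 baseline - 0.30 actual = 19.70 saved
```

Summing the ledger per user, per day, or per month is all the reporting layer needs for the cost series and ROI views.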