23 endpoints, one key, 80 ms median overhead
ByteSpike fans out to 23 model endpoints across image, video, text, and async-job categories, all through a single Anthropic-shaped API. The interesting question isn't routing; it's the latency budget. Here's how we keep gateway median overhead under 80 ms while still doing OAuth pool stickiness, retry semantics, and per-request quota math.
The budget
When a user calls ByteSpike, we have to do four things before the upstream provider sees the request: validate the key and read the org's quota, pick a sticky OAuth-pool slot, attach observability headers, and serialize the request body to the upstream's specific shape (an Anthropic Messages → OpenAI Chat Completions translation layer, etc.). Then on the way back out: read response headers, increment per-request usage, hash for the cache, and reserialize.
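As a sketch, the pre-upstream half of that pipeline looks something like the following. Every name and data shape here (`validate_key_and_quota`, `pick_pool_slot`, the toy key store) is an illustrative assumption, not ByteSpike's actual code:

```python
import hashlib
from dataclasses import dataclass

# Illustrative sketch of the four pre-upstream steps; all names and
# data shapes are stand-ins, not ByteSpike's actual code.

@dataclass
class Request:
    api_key: str
    user_id: str
    body: dict

ORGS = {"sk-test": {"org": "acme", "credits": 100}}  # toy key/wallet store

def validate_key_and_quota(req: Request) -> dict:
    """Step 1: validate the key and read the org's quota."""
    org = ORGS.get(req.api_key)
    if org is None or org["credits"] <= 0:
        raise PermissionError("bad key or empty wallet")
    return org

def pick_pool_slot(user_id: str, n_slots: int = 8) -> int:
    """Step 2: sticky OAuth-pool slot -- same user, same slot."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % n_slots

def observability_headers(req: Request) -> dict:
    """Step 3: attach tracing headers."""
    return {"x-request-id": hashlib.sha256(repr(req.body).encode()).hexdigest()[:12]}

def to_upstream_shape(body: dict) -> dict:
    """Step 4: reserialize to the upstream's expected shape."""
    return {"model": body.get("model", "default"), "messages": body["messages"]}

def pre_upstream(req: Request):
    org = validate_key_and_quota(req)
    slot = pick_pool_slot(req.user_id)
    headers = observability_headers(req)
    return org, slot, headers, to_upstream_shape(req.body)
```

The point of the sketch is the shape: four small, synchronous steps, each cheap enough to measure individually against the budget.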
Total budget: 80 ms median. Typical upstream TTFB for Claude Sonnet is 600-1200 ms, so 80 ms is a 7-13% tax. We chose the number after benchmarking three competing gateways and finding that their median tax was 130-220 ms. We thought we could do better.
Where the milliseconds went
- Connection pool warmth: ~20 ms saved by keeping per-upstream HTTP/2 connections alive across requests. The first request to a cold pool member pays the full connection-setup cost; every request after rides the warm connection.
- Quota math at the edge: the org-level wallet read is the heaviest synchronous step. We pin it to a hot Redis replica colocated with the gateway pod and hold a 30-second-TTL local cache, so most checks never leave the process.
- Sticky pool selection: rather than re-balance per request, we hash the user_id into a pool slot and only re-shard on slot health changes. That trades fairness for predictable latency, and we'd take that trade again.
- Streaming reserialize: we don't buffer the response. Bytes from the upstream get reframed inline (Anthropic SSE → OpenAI SSE if needed) and forwarded. The transformer is a small state machine, not a JSON parser.
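The wallet-read bullet above can be sketched as a tiny read-through cache in front of the replica read, with the 30-second TTL. `backend_read` stands in for the colocated Redis call; all names are assumptions:

```python
import time

# Sketch of a 30-second-TTL local quota cache. `backend_read` stands in
# for the colocated Redis replica read; names are illustrative.

class QuotaCache:
    def __init__(self, backend_read, ttl: float = 30.0):
        self.backend_read = backend_read
        self.ttl = ttl
        self.entries = {}  # org_id -> (credits, fetched_at)

    def credits(self, org_id: str, now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        hit = self.entries.get(org_id)
        if hit and now - hit[1] < self.ttl:
            return hit[0]                    # local hit: no network round trip
        credits = self.backend_read(org_id)  # the heavy synchronous read
        self.entries[org_id] = (credits, now)
        return credits
```

The trade is staleness: an org can overspend by up to 30 seconds of traffic before the next backend read catches it, which is a deliberate exchange of accounting precision for hot-path latency.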
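The sticky-selection bullet reduces to a pure function of the user id and the current healthy-slot set; the mapping only moves when the health set changes. Names here are illustrative:

```python
import hashlib

# Illustrative sketch of sticky pool selection: hash user_id into a slot
# and only recompute when the set of healthy slots changes.

def pick_slot(user_id: str, healthy_slots: list) -> int:
    # Deterministic: same user + same health set -> same slot, so no
    # per-request rebalancing work sits on the hot path.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return healthy_slots[h % len(healthy_slots)]
```

A plain modulo re-shards more users than strictly necessary when a slot dies; consistent or rendezvous hashing would reduce that churn, at the cost of a slightly more expensive lookup.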
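A minimal version of the inline reframer might look like the generator below: line-oriented, buffering only up to newline boundaries, rewriting event names via a lookup table. The event-name mapping is invented for illustration, and real Anthropic → OpenAI reframing also rewrites the data payloads; this shows only the no-buffering shape:

```python
# Sketch of an inline SSE reframer: a line-oriented state machine that
# forwards bytes as they arrive, buffering only to newline boundaries,
# never the whole response. The event-name mapping is illustrative.

EVENT_MAP = {b"content_block_delta": b"chunk"}  # hypothetical rename

def reframe(chunks):
    """Yield reframed SSE bytes from an iterable of upstream byte chunks."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line.startswith(b"event: "):
                name = line[len(b"event: "):]
                yield b"event: " + EVENT_MAP.get(name, name) + b"\n"
            else:
                yield line + b"\n"  # data:/blank lines pass through untouched
    if buf:
        yield buf  # trailing partial line, if the stream ended mid-frame
```

Because the only state is the partial-line buffer, memory stays constant no matter how long the upstream stream runs.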
Where the milliseconds didn't go
We don't run the request body through a generic gateway. There's no plugin chain, no middleware soup, no schema validator that re-parses every JSON field. Every line of pre-upstream code has a measured contribution to the 80 ms budget. Anything that didn't pull its weight got cut.
What this enables
estimated_credits headers are returned at submit time, not after streaming completes. Failure refunds are processed inline (a 5xx releases reserved credits before the response leaves the gateway). Per-model retry semantics match each upstream's published policy. None of this would fit in a 220 ms budget; it all does in 80.
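The reserve-then-settle flow implied above can be sketched like this. The wallet shape, function names, and the charge-on-4xx choice are all assumptions, not ByteSpike's actual accounting code:

```python
# Sketch of inline failure refunds: reserve credits at submit time,
# release them before the response leaves the gateway on a 5xx.
# Wallet shape and names are illustrative assumptions.

class Wallet:
    def __init__(self, credits: int):
        self.credits = credits
        self.reserved = 0

    def reserve(self, n: int) -> None:
        if self.credits - self.reserved < n:
            raise RuntimeError("insufficient credits")
        self.reserved += n

    def settle(self, n: int, status: int) -> None:
        self.reserved -= n
        if status < 500:
            self.credits -= n  # charge on success (4xx handling is a policy choice)
        # 5xx: reserved credits released, nothing charged

def call_upstream(wallet: Wallet, estimated: int, do_request) -> int:
    wallet.reserve(estimated)      # estimated_credits known at submit time
    status = do_request()
    wallet.settle(estimated, status)  # refund happens before we respond
    return status
```

Because the settle runs synchronously in the response path, the client never sees a charged-then-refunded window; the refund is part of the same request.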
If you're shipping anything time-sensitive against frontier models, the gateway tax matters more than people think. We're going to keep grinding it down.