
23 endpoints, one key, 80 ms median overhead

ByteSpike fans out to 23 model endpoints across image, video, text, and async-job categories — through a single Anthropic-shape API. The interesting question isn't routing; it's the latency budget. Here's how we keep gateway median overhead under 80 ms while still doing OAuth pool stickiness, retry semantics, and per-request quota math.

6 min read

The budget

When a user calls ByteSpike, we have to do four things before the upstream provider sees the request: validate the key and read the org's quota, pick a sticky OAuth-pool slot, attach observability headers, and serialize the request body to the upstream's specific shape (Anthropic Messages → OpenAI Chat Completions translation layer, etc.). Then on the way back out: read response headers, increment per-request usage, hash for cache, and reserialize.
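
In outline, the hot path is those four steps run back to back, each one synchronous and budgeted. Here's a minimal Go sketch of the shape of it; every type and helper name is invented for illustration, not our actual code:

```go
package gateway

import (
	"context"
	"hash/fnv"
	"strconv"
)

// Illustrative shapes only; the real request types are richer.
type Request struct {
	APIKey, UserID string
	Body           []byte
}

type UpstreamRequest struct {
	Slot    int
	Headers map[string]string
	Body    []byte
}

type Gateway struct{ slots int }

// preUpstream is the four-step hot path: auth + quota, sticky slot,
// trace headers, body translation. Each helper below is a stub.
func (g *Gateway) preUpstream(ctx context.Context, req *Request) (*UpstreamRequest, error) {
	if err := g.checkKeyAndQuota(ctx, req); err != nil { // step 1: key + org wallet
		return nil, err
	}
	slot := g.stickySlot(req.UserID) // step 2: sticky OAuth-pool slot
	headers := map[string]string{    // step 3: observability headers
		"x-gateway-slot": strconv.Itoa(slot),
	}
	body, err := g.translate(req.Body) // step 4: Anthropic shape -> upstream shape
	if err != nil {
		return nil, err
	}
	return &UpstreamRequest{Slot: slot, Headers: headers, Body: body}, nil
}

func (g *Gateway) checkKeyAndQuota(ctx context.Context, req *Request) error { return nil }

func (g *Gateway) stickySlot(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID)) // stable hash: same user, same slot
	return int(h.Sum32()) % g.slots
}

func (g *Gateway) translate(body []byte) ([]byte, error) { return body, nil }
```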

Total budget: 80 ms median. Typical time-to-first-byte from an upstream Claude Sonnet call is 600-1200 ms, so 80 ms is a 7-13% tax. We chose the number after benchmarking three competing gateways and finding their median tax was 130-220 ms. We thought we could do better.

Where the milliseconds went

  • Connection pool warmth: ~20 ms saved by keeping per-upstream HTTP/2 connections alive across requests. The first request to a cold pool member pays the handshake; everyone after rides the warm connection (see the transport sketch after this list).
  • Quota math at the edge: the org-level wallet read is the heaviest synchronous step. We pin it to a hot Redis replica colocated with the gateway pod and hold a 30-second-TTL local cache for sub-millisecond checks (sketched below).
  • Sticky pool selection: rather than re-balance per request, we hash the user_id into a pool slot and only re-shard on slot health changes (sketched below). That trades fairness for predictable latency, and we'd take that trade again.
  • Streaming reserialize: we don't buffer the response. Bytes from the upstream get reframed inline (Anthropic SSE → OpenAI SSE if needed) and forwarded. The transformer is a small state machine, not a JSON parser (sketched below).
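
Concretely, connection-pool warmth is mostly transport configuration. A sketch of the idea with Go's net/http; the numbers are illustrative, not our production values:

```go
package gateway

import (
	"net/http"
	"time"
)

// One long-lived client per upstream: the TLS handshake and HTTP/2
// setup are paid once per pool member, then every request multiplexes
// over the warm connection.
func newUpstreamClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			ForceAttemptHTTP2:   true,             // multiplex over one connection where possible
			MaxIdleConnsPerHost: 16,               // keep warm connections to each upstream host
			IdleConnTimeout:     90 * time.Second, // don't tear warm connections down eagerly
		},
		// No Client.Timeout: responses stream, so deadlines belong on
		// per-request contexts instead.
	}
}
```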
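
The quota path is roughly a read-through cache in front of the replica. A minimal sketch, with a hypothetical quotaStore interface standing in for the Redis client:

```go
package gateway

import (
	"context"
	"sync"
	"time"
)

// quotaStore abstracts the hot Redis replica; any read-only client fits.
type quotaStore interface {
	RemainingCredits(ctx context.Context, orgID string) (int64, error)
}

type cachedQuota struct {
	credits   int64
	fetchedAt time.Time
}

// QuotaCache fronts the replica with a per-org local cache. The 30 s TTL
// keeps the synchronous check off the network on most requests;
// staleness is bounded by the TTL.
type QuotaCache struct {
	store quotaStore
	ttl   time.Duration

	mu    sync.RWMutex
	cache map[string]cachedQuota
}

func NewQuotaCache(store quotaStore) *QuotaCache {
	return &QuotaCache{store: store, ttl: 30 * time.Second, cache: make(map[string]cachedQuota)}
}

func (q *QuotaCache) Remaining(ctx context.Context, orgID string) (int64, error) {
	q.mu.RLock()
	entry, ok := q.cache[orgID]
	q.mu.RUnlock()
	if ok && time.Since(entry.fetchedAt) < q.ttl {
		return entry.credits, nil // local hit: no network round trip
	}
	credits, err := q.store.RemainingCredits(ctx, orgID) // miss: read the hot replica
	if err != nil {
		return 0, err
	}
	q.mu.Lock()
	q.cache[orgID] = cachedQuota{credits: credits, fetchedAt: time.Now()}
	q.mu.Unlock()
	return credits, nil
}
```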
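
Sticky selection with health-aware re-sharding can be as small as a hash plus a forward walk. A sketch; Pool and its health flags are invented names:

```go
package gateway

import "hash/fnv"

// Pool holds per-slot health flags for one upstream's OAuth pool.
type Pool struct {
	healthy []bool // index = slot; flipped by an out-of-band health checker
}

// Slot hashes the user ID to a stable home slot, then walks forward only
// if that slot is marked unhealthy. Same user, same slot, no per-request
// rebalancing: predictable latency at the cost of perfect fairness.
func (p *Pool) Slot(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	start := int(h.Sum32()) % len(p.healthy)
	for i := 0; i < len(p.healthy); i++ {
		slot := (start + i) % len(p.healthy)
		if p.healthy[slot] {
			return slot
		}
	}
	return start // all unhealthy: fall back to the home slot and let retries handle it
}
```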
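
And the streaming reserializer is line-oriented: it recognizes SSE framing prefixes and rewrites only those, never touching the JSON payload. A simplified Go sketch; the event-name mapping here is hypothetical, not our real table:

```go
package gateway

import (
	"bufio"
	"io"
	"strings"
)

// ReframeSSE copies server-sent events from upstream to the client,
// rewriting only framing lines it recognizes. It never buffers a whole
// response and never parses the JSON payload: a line either starts with
// a known prefix or passes through untouched.
func ReframeSSE(dst io.Writer, src io.Reader) error {
	sc := bufio.NewScanner(src)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow large data: lines
	for sc.Scan() {
		line := sc.Text()
		if name, ok := strings.CutPrefix(line, "event: "); ok {
			line = "event: " + mapEventName(name) // reframe, don't parse
		}
		if _, err := io.WriteString(dst, line+"\n"); err != nil {
			return err
		}
		if f, ok := dst.(interface{ Flush() }); ok {
			f.Flush() // push each line out immediately; no buffering
		}
	}
	return sc.Err()
}

// mapEventName stands in for the real state machine's mapping table.
func mapEventName(name string) string {
	switch name {
	case "content_block_delta":
		return "message" // illustrative mapping only
	default:
		return name
	}
}
```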

Where the milliseconds didn't go

We don't run the request body through a generic gateway framework. There's no plugin chain, no middleware soup, no schema validator that re-parses every JSON field. Every line of pre-upstream code has a measured contribution to the 80 ms budget. Anything that didn't pull its weight got cut.

What this enables

estimated_credits headers returned at submit time, not after streaming completes. Failure refunds processed inline (5xx → release reserved credits before the response leaves the gateway). Per-model retry semantics that match each upstream's published policy. None of this would fit in a 220 ms budget; it all does in 80.
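
The inline-refund flow is the part that leans hardest on the budget: reserve at submit time, release on 5xx before any bytes leave the gateway. A sketch under that description; Wallet and its methods are invented names:

```go
package gateway

import (
	"context"
	"net/http"
)

// Wallet is an illustrative interface over the credit store.
type Wallet interface {
	Reserve(ctx context.Context, orgID string, credits int64) (reservationID string, err error)
	Release(ctx context.Context, reservationID string) error // refund reserved credits
	Commit(ctx context.Context, reservationID string, used int64) error
}

// forward reserves estimated credits, calls upstream, and releases the
// reservation inline on a 5xx, before the response leaves the gateway.
// Error handling is trimmed for brevity.
func forward(ctx context.Context, w Wallet, orgID string, est int64, call func() (*http.Response, error)) (*http.Response, error) {
	resID, err := w.Reserve(ctx, orgID, est)
	if err != nil {
		return nil, err // quota exhausted: fail before touching the upstream
	}
	resp, err := call()
	if err != nil || resp.StatusCode >= 500 {
		_ = w.Release(ctx, resID) // upstream failure: refund before responding
		return resp, err
	}
	// Success path: actual usage is committed after the stream completes.
	return resp, nil
}
```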

If you're shipping anything time-sensitive against frontier models, the gateway tax matters more than people think. We're going to keep grinding it down.