Don't charge for failures: the engineering behind a one-sentence billing promise
"Failures don't bill" fits in a footer. Implementing it correctly across nine providers, two protocols, blocking and streaming responses, synchronous and asynchronous jobs, text and pixels takes several thousand lines of code and one published refund policy.
Every aggregator that wants to look fair puts "failures don't bill" on its pricing page. The sentence is free. The implementation is not. Once you sit down to actually enforce it across nine upstream vendors, two protocols, sync and streaming endpoints, and four output modalities, the edge cases start arriving faster than the policy can absorb them. This post is the operational map underneath that one sentence — what we count as a failure, what we don't, and where the lines had to be drawn explicitly because the universe did not draw them for us.
The boring 80%: 4xx and 5xx
Most failures are easy. The upstream returns 401, 429, 500, 502, 503, 504. We propagate the status, log the request, and write zero into the credits ledger. The accounting view of these is identical: bytes left our edge, bytes came back, no model produced usable tokens, no charge. About 80% of our failure volume is this boring shape and it never needs a human in the loop.
The tricky middle: 200 OK with an error in the body
Three of our upstream providers will return HTTP 200 with a JSON body that carries a top-level `error` field. From an HTTP perspective the request succeeded. From the user's perspective it didn't: there are no completion tokens, just an apology. Our gateway parses these and rewrites them to the appropriate 4xx or 5xx before they reach the client, then records the request as failed for billing. If we trusted the status code alone, we'd be silently charging users for requests that produced nothing.
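A minimal sketch of that rewrite step, under stated assumptions: the error-type names, the `ERROR_TYPE_TO_STATUS` table, the `Normalized` record, and the 502 fallback are all hypothetical and stand in for whatever each vendor actually sends.

```python
import json
from dataclasses import dataclass

# Hypothetical mapping from an upstream error "type" to an HTTP status.
# Real vendors use different vocabularies; these keys are assumptions.
ERROR_TYPE_TO_STATUS = {
    "invalid_request_error": 400,
    "authentication_error": 401,
    "rate_limit_error": 429,
    "overloaded_error": 503,
}


@dataclass
class Normalized:
    status: int     # status the client will see
    billable: bool  # whether the ledger may charge for this request
    body: object    # parsed JSON body, or None if it could not be parsed


def normalize_upstream(status: int, raw_body: str) -> Normalized:
    """Rewrite a 200-with-error body to an error status and mark it unbillable."""
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError:
        body = None

    # Plain non-2xx: propagate the status, never bill.
    if not 200 <= status < 300:
        return Normalized(status, False, body)

    # A 2xx with an unparseable body cannot carry a usable completion.
    if body is None:
        return Normalized(502, False, None)

    # 2xx with a top-level "error" field: treat as a failure.
    if isinstance(body, dict) and body.get("error") is not None:
        err = body["error"]
        err_type = err.get("type") if isinstance(err, dict) else None
        # Unknown error types fall back to 502 (an assumption of this sketch).
        return Normalized(ERROR_TYPE_TO_STATUS.get(err_type, 502), False, body)

    return Normalized(status, True, body)
```

The point of the design is that the billing flag is decided in the same place the status is rewritten, so the ledger never has to re-inspect the body.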
The hard part: mid-stream failure
Streaming is where every aggregator's billing promise starts to leak. You're 300 tokens into a 1000-token response. The upstream sends a delta with `"type": "error"` and closes the SSE channel. The client received and rendered 300 tokens. Did the request succeed? We picked the strictest answer we could stand by: a mid-stream abort bills nothing, regardless of how many tokens the client already rendered. The user got a partial answer whose completeness they cannot trust. We do not charge for that. Yes, this is more expensive than the alternative. It is also the only policy we can repeat back to a customer with a straight face.
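The rule can be sketched as a relay loop that counts tokens but only reports them once the stream finishes cleanly. The event shapes here (`delta`, `done`, a `tokens` field) are hypothetical, and treating a stream that ends without a terminal event as aborted is an assumption of this sketch rather than something the policy above states.

```python
from typing import Callable, Iterable


def relay_stream(events: Iterable[dict], send: Callable[[dict], None]) -> int:
    """Forward events to the client and return the output tokens to bill."""
    tokens = 0
    completed = False
    for event in events:
        send(event)  # the client sees everything, including the error event
        kind = event.get("type")
        if kind == "error":
            # Mid-stream abort: bill nothing, whatever was already rendered.
            return 0
        if kind == "delta":
            tokens += event.get("tokens", 0)
        elif kind == "done":
            completed = True
    # Assumption: a stream that ends without a terminal event is also unbilled.
    return tokens if completed else 0
```

With the 300-of-1000 example above, the client still receives the 300 tokens and the error event, and the function returns 0.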
Where partial billing is honest: video
Video generation breaks the symmetric pattern. A video job is async, multi-second to multi-minute, and held in a state machine. The vendor charges us per generated second the moment rendering starts on their GPUs — we cannot retroactively unbill them by cancelling a job. So our rate card says it explicitly: video jobs cancelled after the `running` state bill for the seconds already produced. This is the one exception, and it is written down where any customer can find it.
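As a rough illustration of the rate-card rule, the sketch below bills per produced second only for jobs that succeeded or were cancelled after entering `running`. The state names other than `running`, the `reached_running` flag, and the zero charge for failed jobs are assumptions made for the example.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"        # assumed to bill nothing, in line with "failures don't bill"
    CANCELLED = "cancelled"


def billable_seconds(final_state: JobState, reached_running: bool,
                     seconds_produced: float) -> float:
    """Return how many generated seconds a video job should be charged for."""
    if final_state is JobState.SUCCEEDED:
        return seconds_produced
    if final_state is JobState.CANCELLED and reached_running:
        # The vendor already charged for these seconds; the rate card passes them on.
        return seconds_produced
    # Cancelled while still queued, failed, or not yet finished: nothing to bill.
    return 0.0
```

For instance, a job cancelled 4.5 seconds into rendering yields `billable_seconds(JobState.CANCELLED, True, 4.5) == 4.5`, while the same cancellation issued while the job was still queued yields `0.0`.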
Image, embedding, rerank: atomic
These three modalities are the easy bucket. The request is synchronous, the response is either a payload (a file, a vector, or a score) or a non-2xx status, and there is no "partial" state. We charge on 2xx with a payload present and charge zero on any other shape. No grey zones, no special-case ledger entries. Most of the codebase that handles them is shape validation, not billing logic.
Tool use, thinking, and the cost-of-fidelity tax
Anthropic-shaped responses carry `tool_use` blocks and `thinking` blocks. Both count as output tokens upstream — and both can legitimately appear in a response that the client considers "useless" (a tool call to an endpoint the client cannot reach; a thinking pass that produced no final text). We do not refund these. The model did the work, the upstream billed us for the work, and "the client didn't use it" is not a failure of the gateway. We do log them prominently in the request inspector so customers can see exactly what they paid for. Transparency is not the same as refundability.
Idempotency, retries, and the 'paid twice' fear
Customers reading our SLA care less about how we count tokens than about whether their retry script can accidentally double-charge them. So every chat request accepts an `Idempotency-Key` header. Within a 24-hour window, a repeated key returns the cached response and writes zero credits the second time. The retries that matter (a Cloudflare layer dropping a stream, a client timeout firing one millisecond before our 200 lands) are the most common source of "why was I billed twice" tickets, and idempotency keys close that loop without making customers think about it.
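A minimal in-memory sketch of that deduplication, assuming a hypothetical `IdempotencyCache` class and a `run` callback that performs the real upstream call and returns the response together with its credit cost. A production version would need shared storage and handling for retries that arrive while the first attempt is still in flight; neither is shown here.

```python
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # the 24-hour window


class IdempotencyCache:
    """Hypothetical single-process cache keyed by the Idempotency-Key header."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def handle(self, key: Optional[str],
               run: Callable[[], Tuple[dict, int]]) -> Tuple[dict, int]:
        """Return (response, credits_to_charge) for one request."""
        now = self._clock()
        if key is not None:
            hit = self._entries.get(key)
            if hit is not None and now - hit[0] < TTL_SECONDS:
                # Repeated key inside the window: cached response, zero credits.
                return hit[1], 0
        response, credits = run()
        if key is not None:
            # Stored regardless of outcome, so a replay is always free.
            self._entries[key] = (now, response)
        return response, credits
```

Injecting the clock keeps the 24-hour expiry testable without waiting a day.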
What the policy looks like from the inside
- Any non-2xx upstream → zero credits, no exceptions.
- 200 OK with an `error` body field → rewrite to the right status, zero credits.
- Mid-stream `error` delta → zero credits, even if tokens were rendered.
- Video job cancelled post-`running` → bills for seconds produced (disclosed on rate card).
- Image / embedding / rerank: 2xx with payload bills; anything else is free.
- Idempotency-Key dedupes within 24h — second hit is free regardless of outcome.
- `tool_use` and `thinking` tokens do bill: they cost us, they cost the customer; the request inspector shows the breakdown.
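For token-billed requests, the list above collapses into a short chain of checks. The sketch below is illustrative only: the `Outcome` record and its field names are hypothetical, and video is left out because it is billed per second rather than per token.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical summary of one finished request."""
    upstream_status: int
    error_in_body: bool = False      # 200 OK whose body carried an `error` field
    stream_error: bool = False       # an `error` delta arrived mid-stream
    idempotent_replay: bool = False  # served from the Idempotency-Key cache
    has_payload: bool = True         # the response carried any output at all
    text_tokens: int = 0
    tool_use_tokens: int = 0
    thinking_tokens: int = 0


def billable_tokens(o: Outcome) -> int:
    """Apply the policy checks in order and return the output tokens to charge."""
    if o.idempotent_replay:
        return 0
    if not 200 <= o.upstream_status < 300:
        return 0
    if o.error_in_body or o.stream_error:
        return 0
    if not o.has_payload:
        return 0
    # tool_use and thinking tokens bill alongside ordinary text output.
    return o.text_tokens + o.tool_use_tokens + o.thinking_tokens
```

A thinking-only response, such as `Outcome(200, thinking_tokens=40)`, still bills its 40 tokens, which matches the last bullet.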
The refund layer above all of this
Even with this policy, customers will occasionally feel a charge was unfair — a model that produced a 2xx response they consider unusable, a stream that completed but only after Cloudflare retried it once. We don't argue these. The 30-day USD refund policy on unused credits, plus a discretionary credit-back path for ambiguous cases, covers what the per-request billing logic cannot. It is the only sustainable shape — code enforces the rules that can be enforced; humans handle the cases that genuinely cannot.
“Every pricing promise is a contract with the future. The cost of making one is one sentence; the cost of keeping one is the rest of the codebase.”
Failure billing was the first policy we wrote down for ByteSpike and the one whose implementation we have rewritten most often. Streaming changed it. Multi-modal changed it. The first customer with a retry script changed it. The first vendor we onboarded with 200-OK error bodies changed it. The sentence on the pricing page has not moved. The code under it is still moving. We expect it always will be.