One key for image, video, audio, and embeddings

ByteSpike's multimodal surface sits behind the same Anthropic-shape key as text — image, video, embeddings, rerank. Here are the three patterns we use ourselves and the failure-doesn't-bill semantics that make experimenting cheap.

May 14, 2026KL6 min read

A gateway that only routes text is a 2024 design. By 2026, every interesting workflow needs to mix modalities — caption an image, score documents by relevance, generate a 5-second product clip, embed a corpus and rerank the top-K. ByteSpike was built with the multimodal surface as a first-class concern, sharing one key + one billing surface with the text models.

The multimodal surface in one paragraph

Image — Seedream v4 / v5lite, Nano Banana / Pro / v2, GPT-Image-2 (both Anthropic shimmed and OpenAI official routes), GPT-4o-image.
Video — Sora-2 + Sora-2-Pro, Veo-3.1 standard + fast, Seedance 1.5-Pro / 2.0 / 2.0-Pro / 2.0-Pro-fast.
Embeddings + rerank — exposed as utility endpoints; same auth header, same per-request credit accounting.
Async tasks — long-running generations (most video models) hand back a task ID instead of streaming. Poll, cancel, query — three Anthropic-style endpoints share the surface.

Pattern 1 — text + image in one Anthropic Messages call

Caption an image and follow with an LLM summary in one round-trip. The image generation tool_use is opt-in; ByteSpike's gateway shims the OpenAI Image API into Anthropic's tool_use block, so you keep one client and one auth header.

Pattern 2 — embed-then-rerank (cheap recall, smart sort)

Embeddings are cheap per token; reranking is expensive per pair. The shape we run internally for our docs site search and our internal Lark knowledge base: embed everything once, recall top 50 by cosine, then send the top 50 + query to a rerank model and keep top 5. Both calls go to ByteSpike, same key.

bash

# Step 1 — embed the corpus (one-shot, cache on your side)
curl https://llm.bytespike.ai/v1/embeddings \
  -H "Authorization: Bearer sk-byts-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-large-zh-v1.5",
    "input": ["doc 1 text...", "doc 2 text..."]
  }'

# Step 2 — at query time, rerank the top-K candidates
curl https://llm.bytespike.ai/v1/rerank \
  -H "Authorization: Bearer sk-byts-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "user question",
    "documents": ["candidate 1...", "candidate 2...", "..."]
  }'

Pattern 3 — async video generation with task IDs

Sora and Seedance routinely take 30s–2min per clip. ByteSpike standardizes the wait on three Anthropic-style endpoints: /tasks/submit returns an ID; /tasks/query polls; /tasks/cancel kills the job and reclaims the credit reservation. Same key, same content-type, no per-vendor SDK.

Failures don't bill — what that means for multimodal

Multimodal generation has more ways to fail than text: NSFW filters reject prompts, watermark detection trips, the upstream model is overloaded, the file uploaded is corrupt. ByteSpike charges only on a successful generation that delivered the asset. NSFW rejections cost zero. Watermark trips cost zero. Upstream 502s cost zero. The credit account only debits when there's actually something to download.

“Charging for failures discourages exploration. The whole point of a frontier multimodal gateway is to make experiments cheap enough that you actually run them.”

Public per-model rates live on the /pricing table (the same one driven by lib/endpoints.ts joined with the live channel export). Every endpoint listed there speaks the same Anthropic-style auth, so swapping models is a model-name change, not an integration.

The multimodal surface in one paragraph

Image — Seedream v4 / v5lite, Nano Banana / Pro / v2, GPT-Image-2 (both Anthropic shimmed and OpenAI official routes), GPT-4o-image.

Video — Sora-2 + Sora-2-Pro, Veo-3.1 standard + fast, Seedance 1.5-Pro / 2.0 / 2.0-Pro / 2.0-Pro-fast.

Embeddings + rerank — exposed as utility endpoints; same auth header, same per-request credit accounting.

Async tasks — long-running generations (most video models) hand back a task ID instead of streaming. Poll, cancel, query — three Anthropic-style endpoints share the surface.

Pattern 2 — embed-then-rerank (cheap recall, smart sort)

bash

# Step 1 — embed the corpus (one-shot, cache on your side)
curl https://llm.bytespike.ai/v1/embeddings \
  -H "Authorization: Bearer sk-byts-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-large-zh-v1.5",
    "input": ["doc 1 text...", "doc 2 text..."]
  }'

# Step 2 — at query time, rerank the top-K candidates
curl https://llm.bytespike.ai/v1/rerank \
  -H "Authorization: Bearer sk-byts-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "user question",
    "documents": ["candidate 1...", "candidate 2...", "..."]
  }'

Failures don't bill — what that means for multimodal

“Charging for failures discourages exploration. The whole point of a frontier multimodal gateway is to make experiments cheap enough that you actually run them.”