Dark Tokens

You think you’re paying for your prompt and what you see in your agent console. You’re not.

Most of your token bill is for tokens you never wrote, never read, and often cannot see at all. I call these dark tokens.

What are dark tokens?

Dark tokens are tokens that get sent to the model, and billed, but are invisible to you in the interface you’re using. Often they cannot be accessed even with low-level debugging.

The analogy to dark matter is intentional. Dark matter dominates the mass budget of the universe. Dark tokens dominate the token budget of agentic AI usage. The visible part — the text you typed, the answer you read — is a small fraction.

Let me make this concrete. When you use a coding agent to fix a bug, here is what you see:

You: fix the null pointer in foo.py
Agent: I've fixed the issue. The problem was...

Here is what actually got tokenized and billed:

Your message, “fix the null pointer in foo.py”: maybe 10 tokens
The agent’s answer: maybe 200 tokens
The system prompt, AGENTS.md, CLAUDE.md, hook outputs: maybe 2000 tokens
The shell commands the agent ran and their outputs: maybe 5000 tokens
Every message from the four items above, resent verbatim to the model at each of the N inference rounds: maybe 40,000 tokens

Your 10-token prompt cost you 47,210 tokens, most of which you cannot even see (system prompt, replayed messages).

Dark tokens (simplified): 47,000 out of 47,210, or 99.6% of the bill. The visible exchange — your prompt plus the agent’s answer — is 0.4%.
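
The arithmetic is easy to check. Here is a minimal sketch that uses only the illustrative numbers from the breakdown above; the dictionary keys are labels I made up for this example, not fields from any real billing API.

```python
# Illustrative numbers from the bug-fix example; none of these are measurements.
bill = {
    "user prompt": 10,
    "agent answer": 200,
    "system prompt and context files": 2_000,
    "shell commands and outputs": 5_000,
    "replayed history": 40_000,
}

# Simplified split: only what you typed and what you read counts as visible.
visible = {"user prompt", "agent answer"}

total = sum(bill.values())
dark = sum(tokens for label, tokens in bill.items() if label not in visible)
dark_share = dark / total

print(f"{total:,} tokens billed, {dark:,} dark ({dark_share:.1%})")
# prints "47,210 tokens billed, 47,000 dark (99.6%)"
```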

This ratio is typical.

The three kinds of dark tokens

Replay tokens

This is the biggest one.

Agents work by calling the model repeatedly. Each call resends the full context: your prompt, the model’s previous answers, every tool call and tool result so far. By inference round N, you’re resending everything from rounds 1 through N-1.

This is not a bug. It is how transformer models work: they need the full context window to attend over. But it means token usage scales roughly as O(N²) in the number of inference rounds.
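
To see the quadratic growth, here is a toy model. It assumes every round resends the whole history and adds a fixed number of new tokens, with no cache discounts and no context compaction. The function name and the numbers are hypothetical, chosen only for illustration.

```python
def replay_prompt_tokens(rounds: int, base_context: int, added_per_round: int) -> int:
    """Total prompt tokens billed over `rounds` calls when each call resends everything.

    Toy model: the context starts at `base_context` tokens and grows by
    `added_per_round` tokens (tool calls, tool results, answers) after each call.
    """
    total = 0
    context = base_context
    for _ in range(rounds):
        total += context            # the whole context is billed again on this call
        context += added_per_round  # this round's new material joins the history
    return total


ten = replay_prompt_tokens(rounds=10, base_context=2_000, added_per_round=1_000)
twenty = replay_prompt_tokens(rounds=20, base_context=2_000, added_per_round=1_000)
print(ten, twenty)  # 65000 230000
```

Doubling the number of rounds multiplies the total by about 3.5 in this toy setup. The closer the base context is to zero, the closer that factor gets to 4.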

A trajectory with 26 inference calls (a real example from my traces) consumed 5.8 million prompt tokens. The user’s actual text: a few hundred tokens across a handful of messages.

Scaffold tokens

The agent harness adds tokens you never wrote and usually can’t inspect:

System prompts (“you are a helpful assistant that…”)
Policy enforcement wrappers
Internal routing metadata
Cache bookkeeping payloads
Bootstrap context injected at session start
AGENTS.md, CLAUDE.md, CURSOR_RULES — sent at every session start
Hook outputs — whatever your pre/post hooks emit

Some of these are visible if you know where to look. Provider-side ones are not documented and cannot be reconstructed from client-side traces.

Reasoning tokens

Thinking models (o1, o3, Sonnet with extended thinking) produce internal reasoning tokens before their answer. You often see a summary or nothing at all. You always pay for all of them.

Some providers (Claude, for example) deliberately obfuscate reasoning tokens to prevent model distillation. You get a redacted or summarized version. The full token count is billed.

Clear tokens

Not all invisible tokens are dark. Tool outputs — shell command results, file reads, search results — appear in your agent trace if you look at it. Same for model tool-call payloads (the JSON the model emits to invoke a tool). You didn’t write them, but you can inspect them. Call these clear tokens: billed, not authored by you, but auditable without any special access.

The distinction matters for auditability. Clear tokens are reconstructable from local traces with no guesswork. Dark tokens — replay, provider-side scaffold, obfuscated reasoning — require either inference or special access to recover.

Why this matters

It changes how you should think about cost. If you’re optimizing a prompt, you’re optimizing the wrong thing. The prompt is a single-digit percentage of your total cost, say 2%. The other 98% is dark tokens. Optimize those first.

It makes provider numbers hard to audit. When your provider reports 5 million input tokens for a session, how do you check that? The standard answer is: you can’t. You have no visibility into the scaffold tokens, and reconstructing replay requires re-running the full tokenizer over the structured payloads in the right order.

It creates a measurement gap. You are flying blind on your own usage.

Can you measure it?

Yes, but it requires work. I wrote a longer piece on measuring tokens locally.

The key insight is that agent clients write local traces. Codex writes to ~/.codex/sessions/. Claude Code writes to ~/.claude/projects/. These contain the raw structured payloads — tool calls, tool outputs, model answers — with timestamps.
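
A first step is simply finding those files. The sketch below walks the two directories named above and lists every file under them. It assumes nothing about file names or formats, since those may differ between client versions; `list_trace_files` is a hypothetical helper name.

```python
from pathlib import Path

# The two trace locations mentioned above; either may be absent on a given machine.
TRACE_DIRS = [
    Path.home() / ".codex" / "sessions",
    Path.home() / ".claude" / "projects",
]


def list_trace_files(roots=TRACE_DIRS):
    """Return every regular file under the given roots, skipping roots that don't exist."""
    found = []
    for root in roots:
        if not root.is_dir():
            continue
        found.extend(p for p in root.rglob("*") if p.is_file())
    return sorted(found)


for path in list_trace_files():
    print(path.stat().st_size, path)
```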

If you replay these traces with a local tokenizer (tiktoken’s o200k_base is a good baseline), you can reconstruct the token billing related to:

Prompt tokens: parse user messages
Tool output tokens: parse tool results
Model output tokens: parse assistant messages
Replay tokens: for each inference round n, sum all tokens from rounds 1..n-1
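
The four categories above can be sketched as a small accounting pass. Everything about the input shape is an assumption: real trace files have their own schemas, so I use a hypothetical list of events, each tagged with the round in which it is first sent to or produced by the model. The default token counter is a crude whitespace split so the example runs without dependencies; a real tokenizer such as tiktoken’s o200k_base can be passed in instead.

```python
from collections import defaultdict


def count_whitespace_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated chunks."""
    return len(text.split())


def account(events, count_tokens=count_whitespace_tokens):
    """Per-round token breakdown for a hypothetical, already-parsed trace.

    Each event is a dict with "round" (int), "kind" ("prompt", "tool_output"
    or "model_output") and "text". Replay for round n is the sum of all
    tokens first seen in rounds 1..n-1.
    """
    per_round = defaultdict(lambda: {"prompt": 0, "tool_output": 0, "model_output": 0})
    for ev in events:
        per_round[ev["round"]][ev["kind"]] += count_tokens(ev["text"])

    report, history = [], 0
    for n in sorted(per_round):
        new = per_round[n]
        report.append({"round": n, **new, "replay": history})
        history += sum(new.values())  # this round's tokens get replayed from now on
    return report


events = [
    {"round": 1, "kind": "prompt", "text": "fix the null pointer in foo.py"},
    {"round": 1, "kind": "model_output", "text": "run: cat foo.py"},
    {"round": 2, "kind": "tool_output", "text": "def foo(x): return x.bar"},
    {"round": 2, "kind": "model_output", "text": "I've fixed the issue."},
]
for row in account(events):
    print(row)
```

To count with a real tokenizer, pass something like `lambda s: len(enc.encode(s))` as `count_tokens`, where `enc = tiktoken.get_encoding("o200k_base")`.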

What you cannot reconstruct locally:

Scaffold tokens (provider-side, opaque)
Exact reasoning token counts (hidden by some providers, Claude among them)
Provider-internal cache bookkeeping

The practical takeaways

If you are a heavy agent user:

Stop optimizing prompts. Start optimizing trajectory length (number of inference rounds).
A 20-step trajectory costs roughly 4x a 10-step trajectory for the same task, because replay accumulates quadratically.
Tools that produce verbose output (verbose test runners, log dumps) multiply replay costs.
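
The cost of a verbose tool output is easy to estimate. Under the same assumption as before (full history resent on every later round, no cache discount, no compaction), an output produced in round k is billed as input in every round after k. The function name and the numbers below are hypothetical.

```python
def replayed_input_tokens(output_tokens: int, produced_in_round: int, total_rounds: int) -> int:
    """Input tokens billed for one tool output over the rest of the trajectory.

    The output enters the context after round `produced_in_round` and is
    resent in each of the remaining rounds.
    """
    remaining_rounds = max(total_rounds - produced_in_round, 0)
    return output_tokens * remaining_rounds


# A 5,000-token log dump in round 3 of a 26-round trajectory...
verbose = replayed_input_tokens(5_000, produced_in_round=3, total_rounds=26)
# ...versus the same output trimmed to 500 tokens.
trimmed = replayed_input_tokens(500, produced_in_round=3, total_rounds=26)
print(verbose, trimmed)  # 115000 11500
```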

If you are building an agent platform:

Implement local token accounting from day one. Read the raw trace files. Tokenize them. This gives you observability you cannot get from provider dashboards alone.
Expose per-round breakdown: prompt / tool_output / replay / model_output / reasoning. Your users will find it valuable.

If you are auditing AI costs:

Provider totals are not auditable without local reconstruction.
The replay component is auditable. The scaffold is not (unless the provider documents it).
A 2x discrepancy between your mental model and your bill is normal.

Incentive structure

Providers have no financial incentive to make dark tokens visible.

Every dark token is a billed token. Replay tokens, scaffold tokens, reasoning tokens — they all appear on the invoice as “input tokens” or “output tokens” with no further breakdown. The opacity is not accidental.

If providers exposed per-round token breakdowns, users would immediately see that 98% of their bill is replay. They would then pressure providers to offer smarter context management, selective replay, or cheaper cached-token pricing. Some providers do offer cache-read discounts, but the default behavior — full replay at full price — remains the norm.

Agent framework vendors have a similar incentive. Verbose tool outputs, large injected context files, and rich hook systems all increase dark tokens. They also increase perceived capability. A framework that injects 10,000 tokens of context at startup looks smarter. It is also more expensive to run.

The user, sitting at the visible end of the interface, sees only the task result. The token cost is abstracted away behind a monthly bill.

This is not a conspiracy. It is the natural outcome of an industry where the interface is optimized for capability demos and the billing is optimized for simplicity. But the effect is the same: dark tokens stay dark, and the user pays.

Conclusion

Dark tokens share the defining properties of dark matter: you can infer they exist from their effects (the bill), you cannot directly observe them in the interface, and they dominate the total.

The difference from dark matter is that dark tokens are not mysterious. They are entirely deterministic, reproducible, and inspectable, but only if you have access to the data.

The challenge is that most AI tooling hides this complexity. The chat interface shows you a conversation. The agent shows you a task result. Neither shows you the 5 million tokens that moved behind the scenes.

And that opacity benefits providers more than users.