by Martin Monperrus

You think you’re paying for your prompt and what you see in your agent console. You’re not.

Most of your token bill is for tokens you never wrote, never read, and often cannot see at all. I call these dark tokens.

What are dark tokens?

Dark tokens are tokens that get sent to the model — and billed — but are invisible to you in the interface you’re using.

The analogy to dark matter is intentional. Dark matter dominates the mass budget of the universe. Dark tokens dominate the token budget of agentic AI usage. The visible part — the text you typed, the answer you read — is a small fraction.

Let me make this concrete. When you use a coding agent to fix a bug, here is what you see:

You: fix the null pointer in foo.py
Agent: I've fixed the issue. The problem was...

Here is what actually got tokenized and billed:

  1. Your message: “fix the null pointer in foo.py” — maybe 10 tokens
  2. The agent’s answer — maybe 200 tokens
  3. The system prompt, AGENTS.md, CLAUDE.md, hook outputs — maybe 2000 tokens
  4. The shell commands the agent ran, their outputs — maybe 5000 tokens
  5. Every single message from step 1 to 4, resent verbatim to the model at each of the N inference rounds — maybe 40,000 tokens

Your 10-token prompt cost you 47,210 tokens, most of them you cannot even see (system prompt, repeated tokens).

Dark tokens (simplified): 47,000 out of 47,210, or 99.6% of the bill. The visible exchange — your prompt plus the agent’s answer — is 0.4%.

This ratio is typical.

The three kinds of dark tokens

Replay tokens

This is the biggest one.

Agents work by calling the model repeatedly. Each call resends the full context: your prompt, the model’s previous answers, every tool call and tool result so far. By inference round N, you’re resending everything from rounds 1 through N-1.

This is not a bug. It is how transformer models work: they need the full context window to attend over. But it means token usage scales roughly as O(N²) in the number of inference rounds.

A trajectory with 26 inference calls (a real example from my traces) consumed 5.8 million prompt tokens. The user’s actual text: a few hundred tokens across a handful of messages.

Scaffold tokens

The provider and the agent framework add tokens you never wrote and usually can’t inspect:

Some of these are visible if you know where to look. Provider-side ones are not documented and cannot be reconstructed from client-side traces.

Reasoning tokens

Thinking models (o1, o3, Sonnet with extended thinking) produce internal reasoning tokens before their answer. You often see a summary or nothing at all. You always pay for all of them.

Some providers deliberately obfuscate reasoning tokens to prevent model distillation. You get a redacted or summarized version. The full token count is billed.

Clear tokens

Not all invisible tokens are dark. Tool outputs — shell command results, file reads, search results — appear in your agent trace if you look at it. Same for model tool-call payloads (the JSON the model emits to invoke a tool). You didn’t write them, but you can inspect them. Call these clear tokens: billed, not authored by you, but auditable without any special access.

The distinction matters for auditability. Clear tokens are reconstructable from local traces with no guesswork. Dark tokens — replay, provider-side scaffold, obfuscated reasoning — require either inference or special access to recover.

Why this matters

How to think about cost. If you’re optimizing a prompt, you’re optimizing the wrong thing. The prompt is 0.02% of your total cost. The rest is 98%. Optimize for dark tokens first.

It makes provider numbers hard to audit. When your provider reports 5 million input tokens for a session, how do you check that? The standard answer is: you can’t, not easily. You have no visibility into the scaffold tokens, and reconstructing replay requires re-running the full tokenizer over the structured payloads in the right order.

It creates a measurement gap. You are flying blind on your own usage.

Can you measure it?

Yes, but it requires work. I wrote a longer piece on measuring tokens locally.

The key insight is that agent clients write local traces. Codex writes to ~/.codex/sessions/. Claude Code writes to ~/.claude/projects/. These contain the raw structured payloads — tool calls, tool outputs, model answers — with timestamps.

If you replay these traces with a local tokenizer (tiktoken’s o200k_base is a good baseline), you can reconstruct most of the token bill:

What you cannot reconstruct locally:

The practical takeaways

If you are a heavy agent user:

If you are building an agent platform:

If you are auditing AI costs:

Incentive structure

Providers have no financial incentive to make dark tokens visible.

Every dark token is a billed token. Replay tokens, scaffold tokens, reasoning tokens — they all appear on the invoice as “input tokens” or “output tokens” with no further breakdown. The opacity is not accidental.

If providers exposed per-round token breakdowns, users would immediately see that 98% of their bill is replay. They would then pressure providers to offer smarter context management, selective replay, or cheaper cached-token pricing. Some providers do offer cache-read discounts, but the default behavior — full replay at full price — remains the norm.

Agent framework vendors have a similar incentive. Verbose tool outputs, large injected context files, and rich hook systems all increase dark tokens. They also increase perceived capability. A framework that injects 10,000 tokens of context at startup looks smarter. It is also more expensive to run.

The user, sitting at the visible end of the interface, sees only the task result. The token cost is abstracted away behind a monthly bill.

This is not a conspiracy. It is the natural outcome of an industry where the interface is optimized for capability demos and the billing is optimized for simplicity. But the effect is the same: dark tokens stay dark, and the user pays.

Conclusion

“dark tokens” share the defining property of dark matter: you can infer they exist from their effects (the bill), you cannot directly observe them in the interface, and they dominate the total.

The difference from dark matter is that dark tokens are not mysterious. They are entirely deterministic, reproducible, and inspectable, only if you have access to the data.

The challenge is that most AI tooling hides this complexity. The chat interface shows you a conversation. The agent shows you a task result. Neither shows you the 5 million tokens that moved behind the scenes.

And it is one that benefits providers more than users.