If you use coding agents or LLM-powered tools seriously, token accounting stops being a curiosity and becomes an engineering problem.
You want to know:
- what was really sent to the model
- what the model really produced
- whether the provider’s billing numbers are plausible
This post explains:
- why measuring tokens locally matters
- why it is hard
- how to do it in practice
- a conceptual model of token categories and the formulas that get closest to provider-reported usage
Repository: https://github.com/superleanai/verify
Why measure tokens locally?
The short answer is: no trust.
Provider-side token numbers are useful, but they are not enough if you want to audit behavior precisely.
There are several reasons:
- Providers report totals, but usually not the exact reconstructed payload that produced those totals.
- Agent systems do much more than a single prompt/response pair. They call tools, replay earlier context and sometimes add hidden scaffolding.
- When costs or limits matter, you need an independent estimate.
- If you are building an agent platform, you want observability that does not depend entirely on one vendor’s dashboard.
Local measurement is not about accusing providers of fraud. It is about reproducibility and auditability. In the same way that serious systems engineers inspect network traffic instead of trusting a summary counter, serious agent users should inspect token flows instead of trusting a single opaque usage number.
Why is local token measurement hard?
At first glance it sounds simple: tokenize the prompt, tokenize the answer, add them up.
That is wrong for modern agents. The difficulty comes from the gap between:
- the human-visible interaction
- the actual structured payload sent to the model
- the obfuscated reasoning tokens due to anti-distillation measures.
Hidden structure
A chat UI shows text. The model sees structured messages, tool call payloads, tool results, attachments, and sometimes reasoning content.
Replay across inference calls
Many agent interactions are multi-step:
- user asks something
- model emits a tool call
- tool runs
- tool result is sent back
- model answers
Each new inference call often resends earlier relevant context (by default all). This “replay” (aka cached tokens) dominates token usage.
Provider-specific internal scaffolding
The provider may add:
- system instructions
- hook-generated context and hook outputs
- policy wrappers
- internal routing metadata
- cache bookkeeping
- serialization details
You often cannot reconstruct these exactly from the client side.
Different token categories
Not all tokens mean the same thing. Some are:
- user prompt text
- tool execution results
- model text output
- model tool call payloads
- attachments
- reasoning tokens
- replayed input
Caching complicates the picture
Providers may distinguish between:
- non-cached input
- cache creation input
- cache read input
Even if you can estimate “total resent material”, you may not know exactly how the provider splits it internally for billing or reporting.
A conceptual model of token kinds
Here is the most pedagogical way to think about token accounting.
A. Prompt tokens P
These are the tokens from the user’s own textual request.
Examples:
- “summarize this file”
- “fix the bug”
- “write a test”
This is the narrowest notion of input.
At trajectory level:
P = Σ_{n=1..N} P_n
B. Tool-output tokens
T
These are tokens from results returned by tools and fed back into the model.
Examples:
- shell command output
- file contents returned by a tool
- search results
- patch results
These are not written by the user, but they become model input.
At trajectory level:
T = Σ_{n=1..N} T_n
C. Attachment tokens A
These are sidecar payloads sent alongside the normal message flow.
Examples:
- session bootstrap payloads and startup context (AGENTS.md, CLAUDE.md)
- hook outputs
These are easy to miss if you only look at visible chat text.
At trajectory level:
A = Σ_{n=1..N} A_n
D. Model-output text tokens
M_text
These are ordinary textual tokens generated by the model in its answer.
Examples:
- prose explanation
- code block text
- final answer text
At trajectory level:
M_text = Σ_{n=1..N} M_text_n
E. Model-output
tool-call tokens M_tool
These are tokens used by the model to emit structured call payloads.
- tool name & arguments (usually in JSON, sometimes XML)
This is model output.
At trajectory level:
M_tool = Σ_{n=1..N} M_tool_n
F. Reasoning tokens R
These are thinking tokens, sometimes visible when the platform exposes them, sometimes obfuscated, sometimes hidden. But always billed.
At trajectory level:
R = Σ_{n=1..N} R_n
G. Inference calls N
N is the number of model inference rounds in the
trajectory.
If there is only one inference call, there is no replay. As soon as there are multiple inference calls, earlier input is often resent.
The formulas
The main lesson is that there is no single “local token count”. There are several increasingly realistic estimates.
For turn-indexed accounting, we write:
P_n= prompt tokens sent at inference roundnT_n= tool-output tokens sent at inference roundnA_n= attachment tokens sent at inference roundnX_n= replay tokens added at inference roundnM_text_n= model text output produced at inference roundnM_tool_n= model tool-call output produced at inference roundnR_n= reasoning output produced at inference roundn
Replay tokens X_n If the model needs
multiple rounds, earlier input often gets resent. Replay tokens are the
repeated, usually cached tokens. This repeated material is by far the
largest component of token counts.
Therefore:
P = Σ_{n=1..N} P_nT = Σ_{n=1..N} T_nA = Σ_{n=1..N} A_nM_text = Σ_{n=1..N} M_text_nM_tool = Σ_{n=1..N} M_tool_nR = Σ_{n=1..N} R_nX = Σ_{n=1..N} X_n
Then:
Formula 1: chatting
chatting = P + M_text
This is the naive formula of chatting. It is useful pedagogically.
Formula 2: agents
agent_visible = P + T + M_text + M_tool
This includes user prompt, tool results fed back into the model, model text output, and model tool-call output.
Formula 3: trajectory-level total input
input_n = P_n + T_n + A_n + X_n
output_n = R_n + M_text_n + M_tool_n
This is the cleanest conceptual split:
- what was fed into inference round
n - what was produced by inference round
n
To estimate total input over the whole trajectory:
trajectory_input = Σ_{n=1..N} input_n
trajectory_input = P + T + A + X
Symmetrically, total output over the whole trajectory:
trajectory_output = Σ_{n=1..N} output_n
trajectory_output = M_text + M_tool + R
Formula 4: entire-trajectory total
The full token size of a trajectory is input plus output, summed over every inference round:
trajectory_total = Σ_{n=1..N} (input_n + output_n)
trajectory_total = trajectory_input + trajectory_output
trajectory_total = P + T + A + X + M_text + M_tool + R
This is the broadest estimate: every token fed into the model
(including replay X) plus every token the model produced.
It is the local quantity closest to what a provider would bill across
the entire trajectory.
How to measure tokens locally
The practical approach is:
Step 1: collect the local traces
You need the raw local trajectory files produced by the agent client.
For example:
- Codex traces come from
~/.codex/sessions/... - Claude traces come from
~/.claude/projects/...
Step 2: tokenize the raw local content yourself
Use the tokenizer closest to the model family you want to approximate.
In this repository, the prototype uses tiktoken with
o200k_base.
This is not perfect for every provider, but it is a strong baseline for consistent local accounting.
Step 3: split the trace into conceptual buckets
A useful decomposition is:
prompttool_outputmodel_output_textmodel_output_tool_callsattachmentreasoningrequest_inputreplayinference_calls
This repository already does that in
genmon_tokens.py.
Step 4: compute the derived totals
The useful derived metrics are:
prompt_inputrequest_inputreplayed_input
They are approximations of realistic provider-side input accounting.
Step 5: compare local estimates with provider-reported usage
You will usually find:
- local prompt-only counts are too small
- local full-request counts are better
- replay-aware counts get much closer to provider input totals
That is exactly the point: the closer your local accounting gets to the actual repeated structured payload, the closer you get to provider-side numbers.
Code and prototype
See:
genmon_tokens.pyclaude-visible-vs-server-scatter.py