Measuring AI Tokens Locally

If you use coding agents or LLM-powered tools seriously, token accounting stops being a curiosity and becomes an engineering problem.

You want to know:

what was really sent to the model
what the model really produced
whether the provider’s billing numbers are plausible

This post explains:

why measuring tokens locally matters
why it is hard
how to do it in practice
a conceptual model of token categories and the formulas that get closest to provider-reported usage

Repository: https://github.com/superleanai/verify

Why measure tokens locally?

The short answer: so that you do not have to take usage numbers on trust.

Provider-side token numbers are useful, but they are not enough if you want to audit behavior precisely.

There are several reasons:

Providers report totals, but usually not the exact reconstructed payload that produced those totals.
Agent systems do much more than a single prompt/response pair. They call tools, replay earlier context and sometimes add hidden scaffolding.
When costs or limits matter, you need an independent estimate.
If you are building an agent platform, you want observability that does not depend entirely on one vendor’s dashboard.

Local measurement is not about accusing providers of fraud. It is about reproducibility and auditability. In the same way that serious systems engineers inspect network traffic instead of trusting a summary counter, serious agent users should inspect token flows instead of trusting a single opaque usage number.

Why is local token measurement hard?

At first glance it sounds simple: tokenize the prompt, tokenize the answer, add them up.

That is wrong for modern agents. The difficulty comes from the gap between:

the human-visible interaction
the actual structured payload sent to the model
the reasoning tokens, which providers often obfuscate or hide as an anti-distillation measure

Hidden structure

A chat UI shows text. The model sees structured messages, tool call payloads, tool results, attachments, and sometimes reasoning content.

Replay across inference calls

Many agent interactions are multi-step:

user asks something
model emits a tool call
tool runs
tool result is sent back
model answers

Each new inference call usually resends the earlier context (by default, all of it). This “replay”, which providers typically serve from a prompt cache and report as cached tokens, tends to dominate token usage.
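A toy model makes the effect concrete. The sketch below assumes that every round resends everything sent or produced in earlier rounds; real clients may trim or compact history, and the per-round numbers are invented for illustration.

```python
from typing import List, Tuple


def replay_per_round(rounds: List[Tuple[int, int]]) -> List[int]:
    """Return the replayed token count for each round.

    `rounds` holds (new_input_tokens, output_tokens) per inference round.
    Assumption: the client resends the full history on every round.
    """
    replayed = []
    history = 0  # tokens accumulated over all earlier rounds
    for new_input, output in rounds:
        replayed.append(history)
        history += new_input + output
    return replayed


# Hypothetical 4-round trajectory: (new input tokens, output tokens).
rounds = [(200, 50), (1000, 80), (300, 60), (500, 400)]
x = replay_per_round(rounds)
fresh = sum(new_input + output for new_input, output in rounds)
print("replay per round:", x)
print("total replay:", sum(x), "vs fresh tokens:", fresh)
```

Under this assumption replay grows roughly quadratically with the number of rounds, which is why it overtakes the fresh content after only a few tool calls.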

Provider-specific internal scaffolding

The provider may add:

system instructions
hook-generated context and hook outputs
policy wrappers
internal routing metadata
cache bookkeeping
serialization details

You often cannot reconstruct these exactly from the client side.

Different token categories

Not all tokens mean the same thing. Some are:

user prompt text
tool execution results
model text output
model tool call payloads
attachments
reasoning tokens
replayed input

Caching complicates the picture

Providers may distinguish between:

non-cached input
cache creation input
cache read input

Even if you can estimate “total resent material”, you may not know exactly how the provider splits it internally for billing or reporting.
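As a rough illustration, the three categories can be folded back into one input total. The field names below follow the Anthropic-style usage object, where cached tokens are reported alongside the non-cached count; that is an assumption about your provider. Other APIs use different names, and some report cached tokens as a subset of the input total rather than in addition to it. The sample numbers are hypothetical.

```python
from typing import Mapping, Optional

# Assumed field names (Anthropic-style usage object); adapt for other providers.
INPUT_FIELDS = (
    "input_tokens",                 # non-cached input
    "cache_creation_input_tokens",  # input written to the cache
    "cache_read_input_tokens",      # input served from the cache
)


def total_input_tokens(usage: Mapping[str, Optional[int]]) -> int:
    """Sum all three input categories; missing or null fields count as 0."""
    return sum(usage.get(field) or 0 for field in INPUT_FIELDS)


def cache_read_share(usage: Mapping[str, Optional[int]]) -> float:
    """Fraction of the total input that was read from the cache."""
    total = total_input_tokens(usage)
    return (usage.get("cache_read_input_tokens") or 0) / total if total else 0.0


# Hypothetical usage record for one inference call.
usage = {
    "input_tokens": 12,
    "cache_creation_input_tokens": 800,
    "cache_read_input_tokens": 15000,
}
print(total_input_tokens(usage), f"{cache_read_share(usage):.1%}")
```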

A conceptual model of token kinds

Here is the most pedagogical way to think about token accounting.

A. Prompt tokens `P`

These are the tokens from the user’s own textual request.

Examples:

“summarize this file”
“fix the bug”
“write a test”

This is the narrowest notion of input.

At trajectory level:

P = Σ_{n=1..N} P_n

B. Tool-output tokens `T`

These are tokens from results returned by tools and fed back into the model.

Examples:

shell command output
file contents returned by a tool
search results
patch results

These are not written by the user, but they become model input.

At trajectory level:

T = Σ_{n=1..N} T_n

C. Attachment tokens `A`

These are sidecar payloads sent alongside the normal message flow.

Examples:

session bootstrap payloads and startup context (AGENTS.md, CLAUDE.md)
hook outputs

These are easy to miss if you only look at visible chat text.

At trajectory level:

A = Σ_{n=1..N} A_n

D. Model-output text tokens `M_text`

These are ordinary textual tokens generated by the model in its answer.

Examples:

prose explanation
code block text
final answer text

At trajectory level:

M_text = Σ_{n=1..N} M_text_n

E. Model-output tool-call tokens `M_tool`

These are the tokens the model spends emitting structured tool-call payloads:

tool name and arguments (usually JSON, sometimes XML)

They count as model output, not input.

At trajectory level:

M_tool = Σ_{n=1..N} M_tool_n

F. Reasoning tokens `R`

These are the model's thinking tokens. Depending on the platform they may be visible, obfuscated, or hidden entirely, but they are always billed.

At trajectory level:

R = Σ_{n=1..N} R_n

G. Inference calls `N`

N is the number of model inference rounds in the trajectory.

If there is only one inference call, there is no replay. As soon as there are multiple inference calls, earlier input is often resent.

The formulas

The main lesson is that there is no single “local token count”. There are several increasingly realistic estimates.

For turn-indexed accounting, we write:

P_n = prompt tokens sent at inference round n
T_n = tool-output tokens sent at inference round n
A_n = attachment tokens sent at inference round n
X_n = replay tokens added at inference round n
M_text_n = model text output produced at inference round n
M_tool_n = model tool-call output produced at inference round n
R_n = reasoning output produced at inference round n

Replay tokens X_n deserve a note. If the model needs multiple rounds, earlier input is often resent, and X_n counts that repeated, usually cached, material. It is typically by far the largest component of the token count.

Summing over all N rounds gives the trajectory-level totals:

P = Σ_{n=1..N} P_n
T = Σ_{n=1..N} T_n
A = Σ_{n=1..N} A_n
M_text = Σ_{n=1..N} M_text_n
M_tool = Σ_{n=1..N} M_tool_n
R = Σ_{n=1..N} R_n
X = Σ_{n=1..N} X_n

Then:

Formula 1: chatting

chatting = P + M_text

This is the naive formula for plain chat: one user prompt, one text answer. It is useful pedagogically, but it ignores everything an agent adds.

Formula 2: agents

agent_visible = P + T + M_text + M_tool

This includes user prompt, tool results fed back into the model, model text output, and model tool-call output.

Formula 3: per-round and trajectory-level input and output

input_n = P_n + T_n + A_n + X_n

output_n = R_n + M_text_n + M_tool_n

This is the cleanest conceptual split:

what was fed into inference round n
what was produced by inference round n

To estimate total input over the whole trajectory:

trajectory_input = Σ_{n=1..N} input_n

trajectory_input = P + T + A + X

Symmetrically, total output over the whole trajectory:

trajectory_output = Σ_{n=1..N} output_n

trajectory_output = M_text + M_tool + R

Formula 4: entire-trajectory total

The full token size of a trajectory is input plus output, summed over every inference round:

trajectory_total = Σ_{n=1..N} (input_n + output_n)

trajectory_total = trajectory_input + trajectory_output

trajectory_total = P + T + A + X + M_text + M_tool + R

This is the broadest estimate: every token fed into the model (including replay X) plus every token the model produced. It is the local quantity closest to the token volume a provider meters across the entire trajectory, although cached input, non-cached input, and output are usually priced differently.
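The four formulas can be computed from per-round counts in a few lines. This is a minimal sketch, not the repository's implementation: the `Round` record and its field names are hypothetical, and the two sample rounds use invented numbers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Round:
    """Token counts for one inference round n (hypothetical record layout)."""
    p: int = 0       # P_n: prompt tokens
    t: int = 0       # T_n: tool-output tokens
    a: int = 0       # A_n: attachment tokens
    x: int = 0       # X_n: replay tokens
    m_text: int = 0  # M_text_n: model text output
    m_tool: int = 0  # M_tool_n: model tool-call output
    r: int = 0       # R_n: reasoning output

    @property
    def input_tokens(self) -> int:
        return self.p + self.t + self.a + self.x

    @property
    def output_tokens(self) -> int:
        return self.r + self.m_text + self.m_tool


def trajectory_totals(rounds: Iterable[Round]) -> Dict[str, int]:
    """Evaluate Formulas 1 to 4 over a whole trajectory."""
    rounds = list(rounds)
    p = sum(rd.p for rd in rounds)
    t = sum(rd.t for rd in rounds)
    m_text = sum(rd.m_text for rd in rounds)
    m_tool = sum(rd.m_tool for rd in rounds)
    traj_in = sum(rd.input_tokens for rd in rounds)
    traj_out = sum(rd.output_tokens for rd in rounds)
    return {
        "chatting": p + m_text,                      # Formula 1
        "agent_visible": p + t + m_text + m_tool,    # Formula 2
        "trajectory_input": traj_in,                 # Formula 3
        "trajectory_output": traj_out,               # Formula 3
        "trajectory_total": traj_in + traj_out,      # Formula 4
    }


# Hypothetical two-round trajectory: a tool call, then the final answer.
rounds = [
    Round(p=40, a=1200, m_tool=30, r=100),
    Round(t=900, x=1270, m_text=250, r=60),
]
totals = trajectory_totals(rounds)
for name, value in totals.items():
    print(f"{name:>17}: {value}")
```

Even in this tiny example the chatting formula captures well under a tenth of the trajectory total, which is the gap the later formulas are meant to close.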

How to measure tokens locally

The practical approach is:

Step 1: collect the local traces

You need the raw local trajectory files produced by the agent client.

For example:

Codex traces come from ~/.codex/sessions/...
Claude traces come from ~/.claude/projects/...
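A small helper can gather the trace files and stream their records. The sketch assumes the traces are stored as JSON Lines files (`*.jsonl`, one JSON object per line) somewhere below the directories named above; check what your client version actually writes. The function names are illustrative and are not taken from the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def find_trace_files(root: Path, pattern: str = "*.jsonl") -> List[Path]:
    """Recursively list trace files under `root`; empty list if it is missing."""
    root = root.expanduser()
    return sorted(root.rglob(pattern)) if root.is_dir() else []


def read_records(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line, skipping lines that are not JSON objects."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate truncated or partially written lines
            if isinstance(record, dict):
                yield record


# Usage sketch:
#   for root in (Path("~/.codex/sessions"), Path("~/.claude/projects")):
#       for trace in find_trace_files(root):
#           print(trace, sum(1 for _ in read_records(trace)))
```

Skipping malformed lines instead of raising is a deliberate choice here: a session that is still running may leave a half-written last line.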

Step 2: tokenize the raw local content yourself

Use the tokenizer closest to the model family you want to approximate.

In this repository, the prototype uses tiktoken with o200k_base.

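This is not exact for every provider: o200k_base is an OpenAI encoding, so counts for other model families, Claude included, are approximations. It is still a strong baseline for consistent local accounting.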
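Counting with tiktoken takes only a few lines. The sketch below is one way to wrap it, not the repository's code: the helper names are made up, and the encoder is passed in as a plain function so that a different tokenizer can be swapped in later.

```python
from typing import Callable, Sequence

Encoder = Callable[[str], Sequence]


def load_o200k_encoder() -> Encoder:
    """Return an encode function backed by tiktoken's o200k_base encoding.

    Requires `pip install tiktoken`; the encoding file is fetched on first use.
    """
    import tiktoken  # imported lazily so count_tokens works without it

    enc = tiktoken.get_encoding("o200k_base")
    # disallowed_special=() makes strings such as "<|endoftext|>" inside a
    # trace encode as ordinary text instead of raising ValueError.
    return lambda text: enc.encode(text, disallowed_special=())


def count_tokens(text: str, encode: Encoder) -> int:
    """Number of tokens in `text` under the given encoder."""
    return len(encode(text)) if text else 0


# Usage sketch:
#   encode = load_o200k_encoder()
#   count_tokens("fix the bug", encode)
```

Trace content can legitimately contain special-token strings (for example in pasted documentation), which is why the sketch disables tiktoken's default check rather than letting one odd line abort a whole run.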

Step 3: split the trace into conceptual buckets

A useful decomposition is:

prompt
tool_output
model_output_text
model_output_tool_calls
attachment
reasoning
request_input
replay
inference_calls

This repository already does that in genmon_tokens.py.
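To illustrate the idea without reproducing genmon_tokens.py, here is a minimal sketch of the content buckets. It assumes a provider-specific parser (not shown) has already turned the trace into `(bucket, text)` pairs; request_input, replay, and inference_calls are left out because they depend on how records group into inference rounds. A whitespace split stands in for the real tokenizer so the example runs without extra dependencies.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence, Tuple

CONTENT_BUCKETS = (
    "prompt",
    "tool_output",
    "model_output_text",
    "model_output_tool_calls",
    "attachment",
    "reasoning",
)


def bucket_token_counts(
    items: Iterable[Tuple[str, str]],
    encode: Callable[[str], Sequence],
) -> Counter:
    """Sum token counts per bucket; unknown labels go to 'unclassified'."""
    counts = Counter({bucket: 0 for bucket in CONTENT_BUCKETS})
    for bucket, text in items:
        if bucket not in CONTENT_BUCKETS:
            bucket = "unclassified"
        counts[bucket] += len(encode(text))
    return counts


# Hypothetical parsed trace; str.split is a stand-in for a real tokenizer.
items = [
    ("prompt", "fix the bug"),
    ("model_output_tool_calls", '{"cmd": "pytest -q"}'),
    ("tool_output", "2 passed in 0.03s"),
    ("model_output_text", "All tests pass."),
]
counts = bucket_token_counts(items, str.split)
for bucket, n in counts.items():
    print(f"{bucket:>24}: {n}")
```

Keeping an explicit 'unclassified' bucket makes it visible when the parser meets a record type it does not understand, instead of silently dropping tokens.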

Step 4: compute the derived totals

The useful derived metrics are:

prompt_input
request_input
replayed_input

They approximate provider-side input accounting with increasing realism: prompt only, full request, and replay-aware.

Step 5: compare local estimates with provider-reported usage

You will usually find:

local prompt-only counts are too small
local full-request counts are better
replay-aware counts get much closer to provider input totals

That is exactly the point: the closer your local accounting gets to the actual repeated structured payload, the closer you get to provider-side numbers.
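The comparison itself can be as simple as a signed relative gap per estimate. All numbers in this sketch are hypothetical and only illustrate the pattern described above; `relative_gap` is an illustrative helper, not part of the repository.

```python
def relative_gap(local_estimate: int, provider_reported: int) -> float:
    """Signed gap as a fraction of the provider number.

    Negative values mean the local estimate undercounts.
    """
    if provider_reported <= 0:
        raise ValueError("provider_reported must be positive")
    return (local_estimate - provider_reported) / provider_reported


# Hypothetical local input estimates for one trajectory.
estimates = {
    "prompt_only": 40,
    "full_request": 2140,
    "replay_aware": 3410,
}
provider_input = 3600  # hypothetical provider-reported input total

for name, value in estimates.items():
    gap = relative_gap(value, provider_input)
    print(f"{name:>13}: {value:>6} tokens, gap {gap:+.1%}")
```

Tracking the gap per session over time is more informative than a single comparison: a stable residual suggests provider-side scaffolding you cannot see, while a sudden jump suggests a parsing or bucketing error on the local side.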

Code and prototype

See:

genmon_tokens.py
claude-visible-vs-server-scatter.py
