by Martin Monperrus

If you use coding agents or LLM-powered tools seriously, token accounting stops being a curiosity and becomes an engineering problem.

You want to know:

This post explains:

  1. why measuring tokens locally matters
  2. why it is hard
  3. how to do it in practice
  4. a conceptual model of token categories and the formulas that get closest to provider-reported usage

Repository: https://github.com/superleanai/verify

Why measure tokens locally?

The short answer is: so that you do not have to take anyone's numbers on trust.

Provider-side token numbers are useful, but they are not enough if you want to audit behavior precisely.

There are several reasons:

Local measurement is not about accusing providers of fraud. It is about reproducibility and auditability. In the same way that serious systems engineers inspect network traffic instead of trusting a summary counter, serious agent users should inspect token flows instead of trusting a single opaque usage number.

Why is local token measurement hard?

At first glance it sounds simple: tokenize the prompt, tokenize the answer, add them up.

That is wrong for modern agents. The difficulty comes from the gap between:

Hidden structure

A chat UI shows text. The model sees structured messages, tool call payloads, tool results, attachments, and sometimes reasoning content.

Replay across inference calls

Many agent interactions are multi-step:

  1. user asks something
  2. model emits a tool call
  3. tool runs
  4. tool result is sent back
  5. model answers

Each new inference call typically resends the earlier context (by default, all of it). This “replay”, which providers usually report as cached tokens, often dominates token usage.
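A small simulation makes the effect visible. The sketch below assumes full replay (every call resends everything that came before) and uses made-up token counts; the function name `replay_per_call` and all numbers are illustrative and do not come from the repository.

```python
def replay_per_call(new_input, output):
    """Return the replayed token count for each inference call.

    new_input[i]: fresh input tokens first sent at call i (prompt, tool result).
    output[i]:    tokens the model generated at call i.
    Assumption: full replay, so call i resends all earlier input and output.
    """
    replayed = []
    history = 0  # tokens accumulated from all earlier calls
    for fresh, generated in zip(new_input, output):
        replayed.append(history)
        history += fresh + generated
    return replayed


# Hypothetical six-call trajectory: one prompt, then five tool results.
new_input = [200, 1000, 1000, 1000, 1000, 1000]
output = [50, 50, 50, 50, 50, 50]
replayed = replay_per_call(new_input, output)
print(replayed)        # [0, 250, 1300, 2350, 3400, 4450]
print(sum(replayed))   # 11750
print(sum(new_input))  # 5200
```

Even in this short made-up run, replayed tokens outnumber fresh input by more than two to one, and the gap widens with every additional call.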

Provider-specific internal scaffolding

The provider may add:

You often cannot reconstruct these exactly from the client side.

Different token categories

Not all tokens mean the same thing. Some are:

Caching complicates the picture

Providers may distinguish between:

Even if you can estimate “total resent material”, you may not know exactly how the provider splits it internally for billing or reporting.

A conceptual model of token kinds

Here is the most pedagogical way to think about token accounting.

A. Prompt tokens P

These are the tokens from the user’s own textual request.

Examples:

This is the narrowest notion of input.

At trajectory level:

P = Σ_{n=1..N} P_n

B. Tool-output tokens T

These are tokens from results returned by tools and fed back into the model.

Examples:

These are not written by the user, but they become model input.

At trajectory level:

T = Σ_{n=1..N} T_n

C. Attachment tokens A

These are sidecar payloads sent alongside the normal message flow.

Examples:

These are easy to miss if you only look at visible chat text.

At trajectory level:

A = Σ_{n=1..N} A_n

D. Model-output text tokens M_text

These are ordinary textual tokens generated by the model in its answer.

Examples:

At trajectory level:

M_text = Σ_{n=1..N} M_text_n

E. Model-output tool-call tokens M_tool

These are tokens used by the model to emit structured call payloads.

This is model output.

At trajectory level:

M_tool = Σ_{n=1..N} M_tool_n

F. Reasoning tokens R

These are thinking tokens. They are sometimes visible (when the platform exposes them), sometimes obfuscated, and sometimes hidden entirely, but they are billed in every case.

At trajectory level:

R = Σ_{n=1..N} R_n

G. Inference calls N

N is the number of model inference rounds in the trajectory.

If there is only one inference call, there is no replay. As soon as there are multiple inference calls, earlier input is often resent.

The formulas

The main lesson is that there is no single “local token count”. There are several increasingly realistic estimates.

For turn-indexed accounting, we write:

Replay tokens X_n

If the model needs multiple rounds, earlier input often gets resent. Replay tokens are these repeated, usually cached, tokens. In multi-round trajectories, this repeated material is typically by far the largest component of the token count.

Therefore:

Then:

Formula 1: chatting

chatting = P + M_text

This is the naive formula for plain chat. It is mainly useful pedagogically.

Formula 2: agents

agent_visible = P + T + M_text + M_tool

This includes user prompt, tool results fed back into the model, model text output, and model tool-call output.

Formula 3: trajectory-level input and output

input_n = P_n + T_n + A_n + X_n

output_n = R_n + M_text_n + M_tool_n

This is the cleanest conceptual split:

To estimate total input over the whole trajectory:

trajectory_input = Σ_{n=1..N} input_n

trajectory_input = P + T + A + X
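To make X concrete, here is one possible model. It is an assumption, not a description of any provider's actual behavior: every call resends all earlier input and all earlier visible output, and reasoning tokens are not resent. Under that assumption:

```latex
X_1 = 0, \qquad
X_n = \sum_{k=1}^{n-1} \left( P_k + T_k + A_k + M_{\text{text},k} + M_{\text{tool},k} \right)
\quad (n \ge 2), \qquad
X = \sum_{n=1}^{N} X_n
```

Because the material from call k is resent at each of the N - k later calls, X grows roughly quadratically with N when per-call sizes are similar. A client that truncates or summarizes context, or a provider that does resend reasoning, would need a different X_n.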

Symmetrically, total output over the whole trajectory:

trajectory_output = Σ_{n=1..N} output_n

trajectory_output = M_text + M_tool + R

Formula 4: entire-trajectory total

The full token size of a trajectory is input plus output, summed over every inference round:

trajectory_total = Σ_{n=1..N} (input_n + output_n)

trajectory_total = trajectory_input + trajectory_output

trajectory_total = P + T + A + X + M_text + M_tool + R

This is the broadest estimate: every token fed into the model (including replay X) plus every token the model produced. It is the local quantity closest to the usage a provider would report, and bill, across the entire trajectory.
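The four formulas can be computed side by side once per-call counts are available. The following is a minimal sketch: the `Round` class, the `totals` function, and the numbers are hypothetical, and only the field names are taken from the symbols defined above.

```python
from dataclasses import dataclass


@dataclass
class Round:
    """Token counts for one inference call (fields mirror the symbols above)."""
    P: int = 0       # user prompt tokens
    T: int = 0       # tool-output tokens
    A: int = 0       # attachment tokens
    X: int = 0       # replayed tokens
    M_text: int = 0  # model text output
    M_tool: int = 0  # model tool-call output
    R: int = 0       # reasoning tokens


def totals(rounds):
    """Compute Formulas 1 to 4 over a list of Round objects."""
    s = {f: sum(getattr(r, f) for r in rounds)
         for f in ("P", "T", "A", "X", "M_text", "M_tool", "R")}
    trajectory_input = s["P"] + s["T"] + s["A"] + s["X"]
    trajectory_output = s["M_text"] + s["M_tool"] + s["R"]
    return {
        "chatting": s["P"] + s["M_text"],
        "agent_visible": s["P"] + s["T"] + s["M_text"] + s["M_tool"],
        "trajectory_input": trajectory_input,
        "trajectory_output": trajectory_output,
        "trajectory_total": trajectory_input + trajectory_output,
    }


# Made-up two-call trajectory: a tool call, then the final answer.
rounds = [
    Round(P=200, M_tool=40, R=100),
    Round(T=1500, X=240, M_text=300, R=80),
]
result = totals(rounds)
print(result)
```

With these numbers, the chatting formula sees 500 tokens while the trajectory total is 2460, which illustrates how far the naive count can be from the broad one.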

How to measure tokens locally

The practical approach is:

Step 1: collect the local traces

You need the raw local trajectory files produced by the agent client.

For example:

Step 2: tokenize the raw local content yourself

Use the tokenizer closest to the model family you want to approximate.

In this repository, the prototype uses tiktoken with o200k_base.

This is not perfect for every provider, but it is a strong baseline for consistent local accounting.
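As a sketch of this step, the counter can be made pluggable so that every bucket is measured with the same tokenizer. The helper names below are hypothetical and are not the repository's code. `whitespace_counter` is a crude stand-in that keeps the example runnable without dependencies, while `tiktoken_counter` shows how the real `o200k_base` encoding could be plugged in.

```python
def whitespace_counter(text):
    """Crude stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def tiktoken_counter(encoding_name="o200k_base"):
    """Build a counter backed by tiktoken (third-party, imported lazily)."""
    import tiktoken
    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() makes strings such as "<|endoftext|>" found in
    # traces get encoded as plain text instead of raising an error.
    return lambda text: len(enc.encode(text, disallowed_special=()))


def count_all(texts, count=whitespace_counter):
    """Apply one counter to every text so all buckets are measured alike."""
    return sum(count(t) for t in texts)


samples = ["list the files in src/", "README.md main.py utils.py"]
print(count_all(samples))  # 8 with the whitespace stand-in
```

To use the real encoding, pass `count=tiktoken_counter()`. This requires tiktoken to be installed and may download the encoding file on first use.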

Step 3: split the trace into conceptual buckets

A useful decomposition is:

This repository already does that in genmon_tokens.py.
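To illustrate the idea independently of any particular client, here is a minimal sketch that buckets events from a JSON Lines trace. The event schema (`kind` and `text` fields) and the kind names are assumptions made for this example. Real trace formats differ per client, and this is not a description of what genmon_tokens.py does.

```python
import json

# Hypothetical mapping from event kind to the bucket symbols used in this post.
BUCKET_OF = {
    "user": "P",
    "tool_result": "T",
    "attachment": "A",
    "assistant_text": "M_text",
    "tool_call": "M_tool",
    "reasoning": "R",
}


def bucket_counts(jsonl_lines, count):
    """Sum token counts per bucket from JSON Lines events.

    Each event is assumed to look like {"kind": ..., "text": ...}.
    Events with an unknown kind go to "other" rather than being dropped.
    """
    totals = {name: 0 for name in BUCKET_OF.values()}
    totals["other"] = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        bucket = BUCKET_OF.get(event.get("kind"), "other")
        totals[bucket] += count(event.get("text", ""))
    return totals


trace = [
    '{"kind": "user", "text": "list the files"}',
    '{"kind": "tool_call", "text": "ls -la"}',
    '{"kind": "tool_result", "text": "a.py b.py c.py"}',
    '{"kind": "assistant_text", "text": "There are three files."}',
]
# Word count stands in for a real tokenizer here.
counts = bucket_counts(trace, count=lambda s: len(s.split()))
print(counts)
```

Keeping an explicit "other" bucket is a deliberate choice: unclassified events stay visible instead of silently lowering the total.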

Step 4: compute the derived totals

The useful derived metrics are:

They are approximations of realistic provider-side input accounting.

Step 5: compare local estimates with provider-reported usage

You will usually find:

That is exactly the point: the closer your local accounting gets to the actual repeated structured payload, the closer you get to provider-side numbers.
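The comparison itself can be as simple as a signed relative gap per estimate. The sketch below uses a hypothetical `relative_gap` helper and made-up numbers. In practice the provider figure would come from the usage data reported for the same trajectory.

```python
def relative_gap(local_estimate, provider_reported):
    """Signed gap of the local estimate relative to the provider's number."""
    if provider_reported <= 0:
        raise ValueError("provider_reported must be positive")
    return (local_estimate - provider_reported) / provider_reported


# Made-up numbers for one trajectory.
provider_total = 2600
estimates = {
    "chatting": 500,
    "agent_visible": 2040,
    "trajectory_total": 2460,
}
for name, value in estimates.items():
    print(f"{name}: {relative_gap(value, provider_total):+.1%}")
```

A negative gap means the local estimate undercounts relative to the provider. Tracking the gap per formula shows which missing category accounts for most of the difference.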

Code and prototype

See: