Reproducible Coding Agent Trajectories

TLDR: A coding agent trajectory is execution-reproducible when (1) its file edits replay from a known commit SHA and (2) its shell commands produce byte-identical outputs on re-execution. This is the natural extension of reproducible builds and reproducible notebooks to AI coding agents.

Definition

A coding agent trajectory is the complete, time-ordered log of an AI agent’s actions on a codebase: every file it read, every edit it made, and every shell command it ran, together with the outputs it received.

A trajectory is execution-reproducible when two criteria hold:

Edit criterion. All Write and Edit operations, replayed from the commit SHA that precedes the agent session, reproduce the exact file states that were committed.
Command criterion. All shell commands, re-executed verbatim in the same working directory, produce byte-identical stdout+stderr to the outputs recorded in the trajectory.
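
To make the edit criterion concrete, here is a minimal sketch of replaying edits against an in-memory snapshot of the files at the starting commit. The record shape (`tool`, `path`, `content`, `old_string`, `new_string`) and the function names are assumptions made for illustration, not the trajectory schema itself. The sketch also assumes that an Edit must match its target string exactly once.

```python
from typing import Dict, List


def replay_edits(initial: Dict[str, str], ops: List[dict]) -> Dict[str, str]:
    """Apply Write/Edit operations, in order, to an in-memory file snapshot."""
    files = dict(initial)  # copy so the caller's snapshot is left untouched
    for op in ops:
        path = op["path"]
        if op["tool"] == "Write":
            # Write replaces (or creates) the whole file.
            files[path] = op["content"]
        elif op["tool"] == "Edit":
            current = files[path]  # KeyError if the file does not exist yet
            # Assumption: an Edit is only valid if its target occurs exactly once.
            if current.count(op["old_string"]) != 1:
                raise ValueError(f"Edit target not unique in {path}")
            files[path] = current.replace(op["old_string"], op["new_string"], 1)
        else:
            raise ValueError(f"unknown tool: {op['tool']}")
    return files


def edit_criterion_holds(
    initial: Dict[str, str], ops: List[dict], committed: Dict[str, str]
) -> bool:
    """True if every committed file matches the replayed state exactly."""
    replayed = replay_edits(initial, ops)
    return all(replayed.get(path) == text for path, text in committed.items())
```

In a real check, `initial` and `committed` would be read from the git repository at the commit preceding the session and at the resulting commit. The sketch leaves that step out so the replay logic stays self-contained.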

Why It Matters

Scientific validity. Trajectory datasets are increasingly used to study how AI agents solve programming tasks. If trajectories are not reproducible, they may not faithfully record what actually happened. Reproducibility is the precondition for using trajectories as scientific evidence.

Audit and attribution. When a trajectory is linked to a commit, the edit criterion lets any third party verify that the recorded edits, replayed from the starting commit, account for the committed changes. This is evidence that the agent, not a human, produced them, which matters for questions of authorship, licensing, and accountability.

Environment characterization. A trajectory that fails the command criterion reveals that the agent’s behavior depended on environment state that is not captured in the trajectory: installed packages, network calls, file system contents, random seeds. This is diagnostic information for improving agent frameworks.
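
As a small illustration of how uncaptured state surfaces, the sketch below runs a command several times and checks whether the combined stdout and stderr stay byte-identical. The helper name `outputs_agree` and the two example commands are hypothetical. A command that reads OS randomness stands in for any dependence on state the trajectory does not record.

```python
import subprocess
import sys


def outputs_agree(argv, runs=2):
    """Run argv several times; True if the combined output is byte-identical each time."""
    seen = set()
    for _ in range(runs):
        proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        seen.add(proc.stdout)  # bytes, compared exactly
    return len(seen) == 1


# Depends only on its own arguments.
stable = [sys.executable, "-c", "print(1 + 1)"]
# Depends on state outside the command line (OS randomness).
unstable = [sys.executable, "-c", "import os; print(os.urandom(8).hex())"]
```

Agreement across repeated runs on one machine is a weaker property than the command criterion, since it says nothing about packages or files that differ between machines. It is still a cheap first check for commands that read clocks or random sources.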

Implementation

The reproducible-trajectories package implements both criteria:

pip install reproducible-trajectories
reproducible-trajectories check-execution-reproducible path/to/trajectory.jsonl

Output:

Edit criterion:    no_repo
Command criterion: reproducible
  [PASS] echo "hello world"
  [PASS] python3 -c "print(1 + 1)"
Execution reproducible: True

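The edit criterion requires a --repo option pointing to the git repository; the sample output above was produced without one, so the edit criterion is reported as no_repo. The command criterion runs entirely from the trajectory file. Both criteria return structured JSON with --json.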

The implementation is ~120 lines of Python: parse the trajectory JSONL, extract tool_use / tool_result pairs for Bash calls, re-execute with subprocess.run, compare strings exactly. No fuzzy matching. Byte identity is the standard, for the same reason it is the standard in reproducible builds.
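
The steps just described can be sketched as follows. This is an illustrative reconstruction, not the package's source. The JSONL record shape (`type`, `id`, `name`, `input.command`, `tool_use_id`, `content`) is an assumption, as are the function names and the choice to interleave stderr into stdout.

```python
import json
import subprocess
from typing import Iterable, List, Optional, Tuple


def bash_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each Bash tool_use with its tool_result as (command, recorded_output)."""
    commands = {}  # tool_use id -> command string
    pairs = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the JSONL file
        rec = json.loads(line)
        if rec.get("type") == "tool_use" and rec.get("name") == "Bash":
            commands[rec["id"]] = rec["input"]["command"]
        elif rec.get("type") == "tool_result" and rec.get("tool_use_id") in commands:
            pairs.append((commands[rec["tool_use_id"]], rec["content"]))
    return pairs


def check_commands(
    pairs: List[Tuple[str, str]], cwd: Optional[str] = None
) -> List[Tuple[str, bool]]:
    """Re-execute each command and compare its output byte-for-byte with the record."""
    results = []
    for command, recorded in pairs:
        proc = subprocess.run(
            command,
            shell=True,                # commands are recorded as shell strings
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # assumption: stderr is merged into stdout
        )
        # Exact comparison only: no whitespace trimming, no fuzzy matching.
        results.append((command, proc.stdout == recorded.encode("utf-8")))
    return results
```

Re-executing recorded commands runs arbitrary shell code, so a check like this belongs in a disposable environment such as a container or a throwaway checkout.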

Reproducible Builds

The reproducible builds project (reproducible-builds.org) defines a build as reproducible when, given the same source code, build environment, and build instructions, any party can recreate bit-for-bit identical copies of the specified artifacts. The key insight of reproducible builds: determinism is an auditable property. If builds are deterministic, you can verify that a distributed binary matches the public source. Backdoors inserted at build time become detectable.

Execution-reproducible trajectories carry the same insight. If a trajectory is reproducible, you can verify that the AI agent’s actions genuinely explain the resulting commit. You can detect post-hoc trajectory fabrication. You can replay the session on a different machine and confirm that the recorded actions still produce the recorded outputs.

Reproducible Notebooks

The notebook reproducibility literature addresses a closely related problem. A Jupyter notebook is reproducible when re-executing all cells top-to-bottom produces the same cell outputs. Papers like "Are My Deep Learning Systems Reproducible?" and tools like nbval operationalize this: they replay notebook cells and compare recorded vs. actual outputs.

Notebooks are self-contained computation documents. Trajectories mix computation (Bash, Python) with file system mutations (Write, Edit). This is why two separate criteria are needed: one for the file mutation side, one for the computation side.