TLDR: A coding agent trajectory is execution-reproducible when (1) its file edits replay from a known commit SHA and (2) its shell commands produce byte-identical outputs on re-execution. This is the natural extension of reproducible builds and reproducible notebooks to AI coding agents.
Definition
A coding agent trajectory is the complete, time-ordered log of an AI agent’s actions on a codebase: every file it read, every edit it made, and every shell command it ran, together with the outputs it received.
A trajectory is execution-reproducible when two criteria hold:
- Edit criterion. All Write and Edit operations, replayed from the commit SHA that precedes the agent session, reproduce the exact file states that were committed.
- Command criterion. All shell commands, re-executed verbatim in the same working directory, produce stdout+stderr byte-identical to the outputs recorded in the trajectory.
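The edit criterion can be sketched as a pure replay over in-memory file contents. This is a minimal illustration, not the package's implementation: the FileOp record, its field names, and the single-occurrence replacement semantics of Edit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FileOp:
    """One recorded file operation (hypothetical schema for illustration)."""
    tool: str          # "Write" or "Edit"
    path: str
    content: str = ""  # Write: full new file content
    old: str = ""      # Edit: text to be replaced
    new: str = ""      # Edit: replacement text


def replay_ops(base_files: dict[str, str], ops: list[FileOp]) -> dict[str, str]:
    """Apply ops in order to a copy of the file states at the base commit."""
    files = dict(base_files)
    for op in ops:
        if op.tool == "Write":
            files[op.path] = op.content
        elif op.tool == "Edit":
            current = files[op.path]
            if op.old not in current:
                raise ValueError(f"Edit target not found in {op.path}")
            # Assumption: an Edit replaces the first occurrence only.
            files[op.path] = current.replace(op.old, op.new, 1)
        else:
            raise ValueError(f"unknown tool: {op.tool}")
    return files


def edit_criterion_holds(base_files, ops, committed_files) -> bool:
    """True when the replayed state matches every committed file exactly."""
    replayed = replay_ops(base_files, ops)
    return all(replayed.get(path) == text for path, text in committed_files.items())
```

In practice, base_files and committed_files would be read from the repository at the pre-session SHA and at the resulting commit; how they are loaded is left out of this sketch.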
Why It Matters
Scientific validity. Trajectory datasets are increasingly used to study how AI agents solve programming tasks. If trajectories are not reproducible, they may not faithfully record what actually happened. Reproducibility is the precondition for using trajectories as scientific evidence.
Audit and attribution. When a trajectory is linked to a commit, the edit criterion lets any third party verify that the agent's recorded edits, rather than unrecorded human changes, account for the committed changes. This matters for questions of authorship, licensing, and accountability.
Environment characterization. A trajectory that fails the command criterion reveals that the agent’s behavior depended on environment state that is not captured in the trajectory: installed packages, network calls, file system contents, random seeds. This is diagnostic information for improving agent frameworks.
Implementation
The reproducible-trajectories package implements both criteria:
pip install reproducible-trajectories
reproducible-trajectories check-execution-reproducible path/to/trajectory.jsonl
Output:
Edit criterion: no_repo
Command criterion: reproducible
[PASS] echo "hello world"
[PASS] python3 -c "print(1 + 1)"
Execution reproducible: True
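The edit criterion requires a --repo pointing to the git repository; without it, the edit criterion is reported as no_repo, as in the output above. The command criterion runs entirely from the trajectory file. With --json, the results of both criteria are emitted as structured JSON.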
The implementation is ~120 lines of Python: parse the trajectory JSONL, extract tool_use / tool_result pairs for Bash calls, re-execute with subprocess.run, compare strings exactly. No fuzzy matching. Byte identity is the standard, for the same reason it is the standard in reproducible builds.
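The steps just described can be sketched roughly as follows. This is not the package's source. It assumes a flat JSONL layout in which each line is an object with a type of tool_use (carrying id, name, and input.command) or tool_result (carrying tool_use_id and content); real trajectory files may nest these records differently. The function names are hypothetical.

```python
import json
import subprocess


def bash_pairs(lines):
    """Yield (command, recorded_output) for each Bash tool_use with a tool_result."""
    pending = {}
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("type") == "tool_use" and rec.get("name") == "Bash":
            pending[rec["id"]] = rec["input"]["command"]
        elif rec.get("type") == "tool_result" and rec.get("tool_use_id") in pending:
            yield pending.pop(rec["tool_use_id"]), rec["content"]


def check_commands(lines, cwd="."):
    """Re-run each recorded command and compare its output byte for byte."""
    results = []
    for command, recorded in bash_pairs(lines):
        proc = subprocess.run(
            command,
            shell=True,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
        )
        # Exact comparison on bytes: no trimming, no normalization.
        results.append((command, proc.stdout == recorded.encode("utf-8")))
    return results
```

Because this re-executes whatever commands the trajectory recorded, a checker like this is best run in a disposable sandbox or container rather than on a working machine.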
Related work
Reproducible Builds
The reproducible builds project (reproducible-builds.org) defines a build as reproducible when any party can independently rebuild the same binary from the same source. The key insight of reproducible builds: determinism is an auditable property. If builds are deterministic, you can verify that a distributed binary matches the public source. Backdoors inserted at build time become detectable.
Execution-reproducible trajectories carry the same insight. If a trajectory is reproducible, you can verify that the AI agent’s actions genuinely explain the resulting commit. You can detect post-hoc trajectory fabrication. You can replay the session on a different machine and confirm the agent’s reasoning.
Reproducible Notebooks
The notebook reproducibility literature addresses a closely related problem. A Jupyter notebook is reproducible when re-executing all cells top-to-bottom produces the same cell outputs. Papers like Are My Deep Learning Systems Reproducible? and tools like nbval operationalize this: they replay notebook cells and compare recorded vs. actual outputs.
The difference lies in what is being replayed. Notebooks are self-contained computation documents. Trajectories, by contrast, mix computation (Bash, Python) with file system mutations (Write, Edit). This is why two separate criteria are needed: one for the file mutation side, one for the computation side.
See also
- reproducible-builds.org — the reference for build reproducibility
- nbval — notebook output validation
- Are Jupyter Notebooks Reproducible? — empirical study
- reproducible-trajectories on GitHub