Skip to content

Benchmark results -- token economy

Measured: 2026-04-15 19:36 UTC Tokenizer: Anthropic count_tokens API (requirements-bench.txt pins anthropic==0.72.0)

Task held constant across both scenarios: read 3 issues, edit 1, push the change. What differs is only the context the agent must ingest to get started.

Baseline comparison (MCP-mediated vs reposix)

Scenario Characters Real tokens (count_tokens)
MCP-mediated (tool catalog + schemas) 16,274 4,883
reposix (shell session transcript) 1,372 ** 531**

Reduction: reposix uses 89.1% fewer tokens than the MCP-mediated baseline for the same task. Equivalently, MCP costs ~9.2x more context.

Per-backend raw-JSON comparison (BENCH-02)

Backend Raw-API fixture Characters Real tokens reposix tokens Reduction
Jira (MCP) mcp_jira_catalog.json 16,274 4,883 531 89.1%
GitHub github_issues.json 10,100 3,661 531 85.5%
Confluence confluence_pages.json 6,647 2,251 531 76.4%
Jira (real adapter) (pending re-measurement) (pending) (pending) (pending) TBD (adapter shipped v0.11.x; bench rerun deferred to perf-dim P67)

What this does NOT measure

  • Actual inference cost (token price depends on the frontier model).
  • The agent's own reasoning tokens (they cancel out -- the task is identical).
  • Tool-call output tokens (small and comparable).
  • Re-fetch of schemas if the agent's context is compacted mid-session.

What this DOES measure

  • The raw bytes the agent's context window has to hold in order to be productive at minute 0.
  • The cost of "learning the tool" vs "using what you already know".
  • Token counts are produced by Anthropic's count_tokens endpoint (SDK pinned in requirements-bench.txt).

Fixture provenance

  • benchmarks/fixtures/mcp_jira_catalog.json -- a representative manifest of 35 Jira tools, modeled on the public Atlassian Forge surface and the schemas produced by the mcp-atlassian server. Full schemas for each tool, shaped like real JSON-Schema input descriptors.
  • benchmarks/fixtures/reposix_session.txt -- the ANSI-stripped excerpt of what an agent's shell actually contains after running the equivalent workflow through scripts/demo.sh.
  • benchmarks/fixtures/github_issues.json -- a synthetic GitHub REST v3 /repos/{owner}/{repo}/issues payload with 3 representative issues.
  • benchmarks/fixtures/confluence_pages.json -- a synthetic Confluence v2 /wiki/api/v2/pages payload with 3 pages including full ADF body content.

Reproduce: python3 scripts/bench_token_economy.py --offline (cache must be populated first)