Preprint · Independent Research
State, Not Tokens: Repository-Scale Agent Reasoning Is Bound by State Architecture
Cary Palmer — Independent Researcher, Dallas, TX
GitHub · LinkedIn
Abstract
The agent community has largely treated repository-scale forgetting as a
context-window problem: bigger windows (8k → 128k → 1M) are expected to
yield better whole-repo reasoning. We argue this is a misdiagnosis. Using a hard,
machine-checkable task (strict JavaScript→TypeScript migration of real OSS
repositories under an unforgeable oracle: strict tsc, immutable test
suites, mandatory .js→.ts replacement, and zero type escape hatches), we vary
a single axis: how state flows between bounded
workers. Three arms hold model, tools, scaffold, and oracle constant: a
single-context monolith, a durable arm that accumulates each completed
dependency layer as a committed artifact on a shared evolving tree, and a
stateless-RAG arm whose per-file workers retrieve context but never see each
other's results. On an independent third-party benchmark (NL2Repo-Bench), the same
durable-state orchestration reaches a 91.1% mean test-pass rate, about
2.28× the published ~40% state of the art. The contribution is a reframing —
state is an asset, not a prompt — with controls that isolate
which capability actually matters.
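The difference between the durable arm and the stateless arm can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy dependency graph, and the `toy_migrate` worker are all hypothetical, and the retrieval step of the stateless-RAG arm is omitted. The only point shown is what each worker is allowed to see when modules are processed in dependency layers.

```python
from graphlib import TopologicalSorter


def dependency_layers(deps):
    """Group modules into layers; each layer depends only on earlier layers.

    `deps` maps a module name to the set of modules it imports.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


def run_durable(deps, migrate):
    """Durable arm (sketch): each completed layer is committed to shared state."""
    committed = {}  # stands in for the shared evolving tree (assumption)
    for layer in dependency_layers(deps):
        # Workers in the same layer all read the same committed snapshot.
        snapshot = dict(committed)
        results = {module: migrate(module, snapshot) for module in layer}
        committed.update(results)  # commit the whole layer at once
    return committed


def run_stateless(deps, migrate):
    """Stateless arm (sketch): workers never see each other's results."""
    return {
        module: migrate(module, {})
        for layer in dependency_layers(deps)
        for module in layer
    }


def toy_migrate(module, visible):
    # Hypothetical worker: reports which migrated modules it could see.
    return sorted(visible)


# Toy graph (assumption): app imports utils and models; models imports utils.
DEPS = {"app": {"utils", "models"}, "models": {"utils"}, "utils": set()}
```

With this toy graph, the durable worker for `app` sees the committed results for `models` and `utils`, while every stateless worker sees nothing from its peers, which is the single axis the three arms are meant to isolate.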
Key findings
- The naive context thesis is refuted, then reframed. A single navigating
agent cleanly migrates up to 240 interdependent modules; it does
not break where the context-window thesis predicts. It breaks at the full
364-module tree, and the cause is capacity, not window overflow.
- Durable accumulation > stateless retrieval. Same model, tools, and
decomposition; only accumulation differs. RAG's blind workers emit code that
won't compile (TS2451 redeclaration conflicts appear only in RAG); durable
reaches a clean tsc --strict tree where the monolith has no repair seam.
- Two structural properties no transcript can match. Interruption-resumable
consistent checkpoints (durable preserves 70.8% and resumes to PASS vs. monolith 0%),
and zero-marginal-cost re-query — recalling a materialized discovery
is a database read, not an LLM call.
- An honestly localized limit. Parallel headroom grows with repo size,
but usable concurrency is capped at K≈10–12 on the Cursor backend; a second
backend (Claude Code) sustains 100% to C=32, so the cap is a property of the serving
platform, not of durable state.
- External validation on a public benchmark. On NL2Repo-Bench (build a full
Python library from a natural-language spec, scored by the benchmark's own pytest
suites), durable-state orchestration reaches 91.1% mean test-pass and
solves 53% of libraries to a fully green suite — about
2.28× the published ~40% state of the art.
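The two structural properties in the list above (resumable checkpoints and zero-marginal-cost re-query) can be sketched with a small persistent store. This is an illustration under stated assumptions, not the paper's system: `DiscoveryStore`, `run_layers`, the table schema, and the `llm_calls` counter are hypothetical names, and SQLite stands in for whatever database the durable arm actually uses.

```python
import sqlite3


class DiscoveryStore:
    """Hypothetical durable store: one row per materialized result."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS discoveries "
                "(key TEXT PRIMARY KEY, value TEXT)"
            )
        self.llm_calls = 0  # counts simulated model invocations

    def recall(self, key):
        """Re-query a materialized discovery: a database read, no model call."""
        row = self.db.execute(
            "SELECT value FROM discoveries WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def get_or_compute(self, key, compute):
        """Return the stored value, or call the worker once and persist it."""
        cached = self.recall(key)
        if cached is not None:
            return cached
        self.llm_calls += 1
        value = compute(key)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO discoveries (key, value) VALUES (?, ?)",
                (key, value),
            )
        return value


def run_layers(store, layers, work, fail_at=None):
    """Process layers in order; `fail_at` simulates an interruption."""
    for index, layer in enumerate(layers):
        if index == fail_at:
            raise RuntimeError("interrupted")
        for module in layer:
            store.get_or_compute(module, work)
```

If a run is interrupted before the third layer, a second call to `run_layers` on the same store skips the two committed layers and invokes the worker only for the remaining module. A later `recall` of any completed module costs one `SELECT` and leaves `llm_calls` unchanged.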
Paper