Preprint · Independent Research

State, Not Tokens: Repository-Scale Agent Reasoning Is Bound by State Architecture

Cary Palmer — Independent Researcher, Dallas, TX

Abstract

The agent community has largely treated repository-scale forgetting as a context-window problem: bigger windows (8k → 128k → 1M) are expected to yield better whole-repo reasoning. We argue this is a misdiagnosis. Using a hard, machine-checkable task (strict JavaScript→TypeScript migration of real OSS repositories under an unforgeable oracle: strict tsc, immutable test suites, mandatory .js→.ts replacement, and no type escape hatches), we vary a single axis: how state flows between bounded workers. Three arms hold the model, tools, scaffold, and oracle constant: a single-context monolith; a durable arm that accumulates each completed dependency layer as a committed artifact on a shared, evolving tree; and a stateless-RAG arm whose per-file workers retrieve context but never see each other's results. On an independent third-party benchmark (NL2Repo-Bench), the same durable-state orchestration reaches a 91.1% mean test-pass rate, about 2.28× the published ~40% state of the art. The contribution is a reframing: state is an asset, not a prompt, and the controls isolate which capability actually matters.
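The difference between the durable and stateless-RAG arms can be illustrated with a toy sketch. This is not the paper's orchestration code: the module graph `DEPS`, the `worker` stand-in, and the functions `run_durable` and `run_stateless` are all hypothetical names invented for illustration, and the retrieval step of the RAG arm is left out entirely. The only thing the sketch tries to show is the single axis being varied: whether a worker can see artifacts committed by earlier dependency layers.

```python
from graphlib import TopologicalSorter

# Hypothetical toy dependency graph: module -> modules it imports.
DEPS = {
    "util": set(),
    "config": {"util"},
    "db": {"config", "util"},
    "app": {"db", "config"},
}


def dependency_layers(deps):
    """Group modules into layers; each layer depends only on earlier layers."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


def worker(module, visible):
    """Stand-in for a bounded worker: records which migrated deps it could see."""
    seen = sorted(d for d in DEPS[module] if d in visible)
    return {"module": module, "saw_migrated_deps": seen}


def run_durable(deps):
    """Durable arm (sketch): each finished layer is committed to a shared tree."""
    tree = {}
    for layer in dependency_layers(deps):
        # Workers within one layer do not depend on each other.
        results = [worker(m, tree) for m in layer]
        # Commit the whole layer before the next layer starts.
        for result in results:
            tree[result["module"]] = result
    return tree


def run_stateless(deps):
    """Stateless arm (sketch): no worker sees another worker's output."""
    return {m: worker(m, {}) for m in deps}
```

In this toy, the worker for `app` in the durable run sees the already-migrated `config` and `db`, while the same worker in the stateless run sees nothing that its peers produced.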

Key findings

The naive context thesis is refuted, then reframed. A single navigating agent cleanly migrates up to 240 interdependent modules — it does not break where the context-window thesis predicts. It cracks at the full 364-module tree by capacity, not window overflow.
Durable accumulation > stateless retrieval. Same model, tools, and decomposition — only accumulation differs. RAG's blind workers emit code that won't compile (TS2451 conflicts appear only in RAG); durable reaches a clean tsc --strict tree where the monolith has no repair seam.
Two structural properties no transcript can match. First, interruption-resumable consistent checkpoints: durable preserves 70.8% and resumes to PASS, versus 0% for the monolith. Second, zero-marginal-cost re-query: recalling a materialized discovery is a database read, not an LLM call.
An honestly localized limit. Parallel headroom grows with repo size, but usable concurrency is capped at K≈10–12 on the Cursor backend; a second backend (Claude Code) sustains 100% to C=32, so the cap is a property of the serving platform, not of durable state.
External validation on a public benchmark. On NL2Repo-Bench (build a full Python library from a natural-language spec, scored by the benchmark's own pytest suites), durable-state orchestration reaches 91.1% mean test-pass and solves 53% of libraries to a fully green suite — about 2.28× the published ~40% state of the art.
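The two structural properties in the findings above (consistent checkpoints that survive interruption, and re-query as a plain database read) can be sketched with a small artifact store. This is a minimal illustration under stated assumptions, not the paper's implementation: the use of SQLite, the `artifacts` table, and the helper names `open_store`, `commit_layer`, `resume_point`, and `recall` are all hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a hypothetical artifact store; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS artifacts ("
        "module TEXT PRIMARY KEY, layer INTEGER NOT NULL, summary TEXT NOT NULL)"
    )
    return conn


def commit_layer(conn, layer_index, results):
    """Commit one dependency layer atomically: all of its rows land, or none do."""
    with conn:  # the connection context manager commits, or rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO artifacts VALUES (?, ?, ?)",
            [(module, layer_index, summary) for module, summary in results.items()],
        )


def resume_point(conn):
    """Return the index of the first layer that has not been committed yet."""
    row = conn.execute("SELECT MAX(layer) FROM artifacts").fetchone()
    return 0 if row[0] is None else row[0] + 1


def recall(conn, module):
    """Re-query a materialized discovery: a database read, with no model call."""
    row = conn.execute(
        "SELECT summary FROM artifacts WHERE module = ?", (module,)
    ).fetchone()
    return row[0] if row else None
```

Because each layer is committed in one transaction, an interruption mid-layer leaves the store at the last complete layer, and a restarted run can ask `resume_point` where to continue instead of replaying a transcript.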

Paper