Claude Code: Cut Agent Token Costs 63–87% with Layer Placement
A Coach agent body I inherited was 2,785 tokens. After the audit it was 1,030. A Competitive Marketing agent was 4,882 tokens. After the audit it was 620. Same judgment, same handoffs, same work: 63% off the first one, 87% off the second. The cuts were not cleverness. The cuts were removing content that should never have lived in the agent body in the first place.
That is the whole post. Where the bytes go, who pays for them, and how to move them to a layer that caches across every agent spawn instead of re-billing on each one.
How Claude Code builds a system prompt
Four layers, appended in order. Every request ships the whole stack.
Layer 1: Claude Code harness (~15,000 tokens)
Layer 2: CLAUDE.md + .claude/rules/ (project-wide, shared across all agents)
Layer 3: Agent body (unique per agent type)
Layer 4: Messages / conversation (unique per session)
Cache matching uses prefix-hash comparison. The system hashes everything up to your breakpoint; a change anywhere in that prefix produces a different hash on the next request, and everything below the breakpoint pays input price. The lookback window is 20 blocks before each breakpoint, and a request can set up to 4 cache breakpoints. Cached reads are priced at 10% of input. The write-premium math (5-minute TTL at 1.25× input, 1-hour TTL at 2× input) is in Prompt Caching: Read the usage Block.
Layers 1 and 2 are identical for every agent in a repo. Both cache. The divergence point is Layer 3: web-dev has a different body than backend-dev, so from that byte forward the cache is per-agent-type. Layer 4 is per-session.
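The divergence point can be illustrated with a toy model. This is a minimal sketch, not the real cache implementation (which is not public): it assumes each layer is a plain string and that the cache key is a running hash over the layers so far. The names `prefix_hashes` and `shared_layers` and the placeholder layer texts are hypothetical.

```python
import hashlib

def prefix_hashes(layers):
    """Return one running hash per layer; each covers that layer and all layers before it."""
    running = hashlib.sha256()
    hashes = []
    for text in layers:
        running.update(text.encode("utf-8"))
        running.update(b"\x00")  # separator so layer boundaries affect the hash
        hashes.append(running.hexdigest())
    return hashes

def shared_layers(a, b):
    """Count how many leading layers two prompt stacks have in common."""
    count = 0
    for hash_a, hash_b in zip(prefix_hashes(a), prefix_hashes(b)):
        if hash_a != hash_b:
            break
        count += 1
    return count

# Placeholder layer contents, for illustration only.
harness = "harness instructions"
rules = "CLAUDE.md + .claude/rules/"
web_dev = [harness, rules, "web-dev body", "session A messages"]
backend_dev = [harness, rules, "backend-dev body", "session B messages"]

print(shared_layers(web_dev, backend_dev))  # 2: Layers 1 and 2 match, Layer 3 diverges
```

Editing the rules text in only one of the two stacks drops the shared count to 1, which is the toy version of a Layer 2 change busting the cache for everything after it.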
Which cache tier your content lands in
| Tier | Where the bytes live | What it caches across |
|---|---|---|
| Tier 1 | .claude/rules/*.md | Every agent type, every turn |
| Tier 2 | Agent body (.claude/agents/*.md) | Same agent type, across turns |
| Tier 3 | Read results in messages | Within one session (re-billed on every new session) |
| Tier 4 | On-demand Read | From the turn they land onward (not available to earlier turns) |
Tier 1 content is written to the cache once and then billed at the cached-read rate (10% of input) on every subagent spawn for the life of the write’s TTL. Tier 3 content gets billed fresh per session per agent. Moving a 6,000-token rules file from a startup Read shared-rules.md (Tier 3) to .claude/rules/shared-rules.md (Tier 1) costs nothing to implement. The savings show up in every spawn.
The anti-pattern to look for: agent bodies with a startup checklist that reads files which could have been symlinked into .claude/rules/. Every such Read call re-bills those bytes against the session’s cache window. Place them in Tier 1 and the content loads into the prefix before the agent body divergence point, which means all agent types share the cache on that content.
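One way to hunt for this anti-pattern is a small text scan. The sketch below is a heuristic under stated assumptions: it treats any line containing the word "Read" and a `.md` filename as a startup Read, and it takes the set of filenames already in .claude/rules/ as an argument. The function name and the sample agent body are hypothetical.

```python
import re

def redundant_startup_reads(agent_body, rule_filenames):
    """Return agent-body lines that Read a file already present in .claude/rules/."""
    hits = []
    for line in agent_body.splitlines():
        if "Read" not in line:
            continue  # heuristic: only lines that mention a Read are candidates
        for name in re.findall(r"[\w./-]+\.md", line):
            # Compare on the basename so "docs/shared-rules.md" still matches.
            if name.rsplit("/", 1)[-1] in rule_filenames:
                hits.append(line.strip())
                break
    return hits

# Hypothetical agent body for illustration.
body = """\
Startup
1. Read shared-rules.md
2. Read docs/web-dev-notes.md
"""

print(redundant_startup_reads(body, {"shared-rules.md"}))
```

Lines it reports are candidates for deletion from the agent body; agent-specific reads such as the second line are left alone.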
Related: placement decisions for shared content across Skills, Rules, and Read calls get fuller treatment in Skills vs Rules vs Read.
The persona-vs-data audit
For every section in an agent body, label it persona, data, or redundant.
Persona: role identity, handoff routing, failure modes, decision heuristics. Keep.
Data: file paths, tool syntax, directory layouts, doc tables. Move to .claude/rules/project-context.md or load on demand.
Redundant: content that restates rules already in .claude/rules/. Cut.
Two real audits, measured before and after:
| Agent | Before | After | Reduction | Persona | Data | Redundant |
|---|---|---|---|---|---|---|
| Coach (orchestrator) | 2,785 | 1,030 | 63% | 27% | 54% | 10% |
| Competitive Marketing | 4,882 | 620 | 87% | 9% | 45% | 25% |
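The Reduction column is simple arithmetic on the before and after counts, and the label shares come from summing tokens per label. A minimal sketch follows, assuming you already have a token count per section from whatever tokenizer you use. The function names and the `example` sections are hypothetical.

```python
def reduction_pct(before_tokens, after_tokens):
    """Percentage of tokens removed, rounded to the nearest whole percent."""
    return round(100 * (before_tokens - after_tokens) / before_tokens)

def label_shares(sections):
    """Sum token counts per label and return each label's share of the body in percent."""
    total = sum(tokens for _, tokens in sections)
    per_label = {}
    for label, tokens in sections:
        per_label[label] = per_label.get(label, 0) + tokens
    return {label: round(100 * tokens / total) for label, tokens in per_label.items()}

print(reduction_pct(2785, 1030))  # 63 (Coach)
print(reduction_pct(4882, 620))   # 87 (Competitive Marketing)

# Hypothetical labelled sections: (label, token count).
example = [("persona", 300), ("data", 400), ("redundant", 200), ("data", 100)]
print(label_shares(example))
```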
The heuristic: judgment-heavy agents (a UX reviewer, a standards expert) are already about as small as they can get. The body is the role. Procedure-heavy agents (coaches, orchestrators, marketers) are where the bloat hides. Their bulk is process specs and reference tables masquerading as agent definition.
What belongs in an agent body and nothing else: role identity in two or three sentences, a startup checklist of agent-specific reads only (rules auto-load, so they should not appear here), responsibility boundaries, handoff routing, and judgment rules that do not live in shared rules. Point to command specs; do not inline them.
Adding an MCP server mid-session invalidates the tools layer and everything below it; the whole prefix writes fresh on the next call. That pattern deserves its own post and has one: The MCP Tool Prefix Trap.
Claude Code agent frontmatter: tools, memory, maxTurns
The agent frontmatter enforces behavior before the model ever sees it. Zero runtime tokens, and no way for the model to talk itself out of it [2].
---
name: web-dev
description: "Front-end development. Handles UI, components, responsive
design, accessibility, and integration tests. Spawned for Web Dev tasks."
tools: Bash, Read, Write, Edit, Glob, Grep, TodoWrite
model: inherit
color: green
memory: project
maxTurns: 30
---
description is what the parent agent reads when deciding whether to delegate. Write it as a delegation hint (“when to call this”), not a role summary. Routing accuracy goes up, which means fewer mis-delegated spawns eating tokens on the wrong agent.
tools is the whitelist. A Coach that should not write code does not have Edit. A reviewer that should not spawn subagents does not have Agent. Structural beats prose: “Don’t write code” is a suggestion; a missing Edit tool is not.
memory controls persistence: project (committed), local (gitignored), user (per-machine). Relevant to the next section.
maxTurns caps runaway agents. effort: low drops reasoning depth on simple tasks.
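Because the whitelist is structural, it can be linted. The sketch below is a deliberately small parser for flat `key: value` frontmatter, not a YAML implementation, and the policy it checks (a Coach without Edit or Write) is an assumed example. The `coach` agent text and both function names are hypothetical.

```python
def parse_frontmatter(text):
    """Parse flat `key: value` pairs between the first two `---` lines.

    Lines that do not start a new key (such as a wrapped description)
    are appended to the previous key's value.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields, last_key = {}, None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep and key.strip().isidentifier() and not line[0].isspace():
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            fields[last_key] += " " + line.strip()
    return fields

def forbidden_tools(fields, forbidden):
    """Return the granted tools that the policy says this agent must not have."""
    granted = {t.strip() for t in fields.get("tools", "").split(",") if t.strip()}
    return sorted(granted & set(forbidden))

# Hypothetical agent file for illustration.
COACH = """\
---
name: coach
description: "Orchestrates the pipeline. Spawned to plan and delegate."
tools: Read, Glob, Grep, Edit
maxTurns: 30
---
You are the Coach.
"""

print(forbidden_tools(parse_frontmatter(COACH), {"Edit", "Write"}))  # ['Edit']
```

Run against every file in .claude/agents/, a check like this catches a whitelist that drifted away from the role's boundaries.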
The subagent memory gap
This is the cost-impactful “aha” and it catches everyone.
Auto-memory (the MEMORY.md and its pointer files) loads into the main session’s system prompt. When the main session spawns a subagent, that subagent gets a fresh context. The parent’s memory does not propagate.
Every preference saved to memory (“use uv not pip,” “Charles is colorblind, avoid green/yellow adjacency”) is invisible to subagents unless you close the gap. Three ways to do it:
- .claude/rules/: Tier 1, auto-loaded for every agent including subagents.
- The task prompt: stated explicitly when spawning.
- Agent-specific memory via the memory frontmatter field: separate per agent type.
The expensive failure mode is subagents violating preferences the main session knows about, producing work that has to be redone. The redo pays full agent-body tokens a second time plus the output tokens from the rejected first attempt. Moving five preference memories to .claude/rules/ closed this gap in one commit across the repo.
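Moving preferences into a rules file can be scripted. This is a sketch under assumptions: the filename `preferences.md` and the heading are hypothetical, and the preferences are passed in as plain strings rather than read from any memory file format.

```python
from pathlib import Path
import tempfile

def write_preferences_rule(repo_root, preferences, filename="preferences.md"):
    """Write preferences as a bullet list under <repo_root>/.claude/rules/ and return the path."""
    rules_dir = Path(repo_root) / ".claude" / "rules"
    rules_dir.mkdir(parents=True, exist_ok=True)
    target = rules_dir / filename
    body = "# Preferences\n\n" + "\n".join(f"- {p}" for p in preferences) + "\n"
    target.write_text(body, encoding="utf-8")
    return target

# Demonstrate against a throwaway directory instead of a real repo.
with tempfile.TemporaryDirectory() as repo:
    path = write_preferences_rule(
        repo,
        ["Use uv, not pip.", "Charles is colorblind; avoid green/yellow adjacency."],
    )
    print(path.read_text(encoding="utf-8"))
```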
Measuring with OpenTelemetry
Claude Code emits cache metrics per API call. You verify the rewrite worked by reading the trace, not by squinting at the bill a week later. Full setup, Jaeger configuration, per-project tagging, and the dashboard fields that matter live in Track Claude Code spend with OTel and Jaeger. The short version: cache_read_input_tokens high on a subagent’s first API call means the shared prefix is caching across agent types; low means something upstream is busting the cache and the audit is not done.
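The "high or low" judgment is easier as a ratio. The sketch below assumes you have pulled one call's token counts into a dict keyed by the Anthropic usage-block field names (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); how your trace exporter names those attributes is not assumed here. The function name and the sample numbers are hypothetical.

```python
def cache_read_share(usage):
    """Fraction of a call's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0

# Hypothetical first call of a subagent whose shared prefix was already cached.
first_call = {
    "input_tokens": 50,
    "cache_creation_input_tokens": 50,
    "cache_read_input_tokens": 900,
}
print(cache_read_share(first_call))  # 0.9
```

A share near zero on a subagent's first call is the signal that something upstream is changing the prefix.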
The math across a pipeline run
Opus 4.6, fetched from Anthropic’s pricing page on 2026-04-14 [1]. Input $5/M. Cached read $0.50/M (input × 0.1). Cache write $6.25/M for the 5-minute TTL (input × 1.25), $10/M for the 1-hour TTL (input × 2).
Single spawn, agent body only, cold start:
Before: 2,800 tokens × $5/M = $0.014
After: 1,030 tokens × $5/M = $0.00515
Delta per cold spawn ≈ $0.009
A pipeline run spawning six subagents with 6,400 tokens of shared rules that used to live in Tier 3 Read calls:
Before: 6 × 6,400 × $5/M = $0.192 redundant input per run
After: shared rules cached Tier 1 = $0.00
A representative week, five features and six agents each, carrying the old 2,800-token Coach body plus the Tier-3 rules pattern:
Before: 5 × 6 × (2,800 + 6,400) × $5/M = $1.38 in redundant spawn tokens
After: 5 × 6 × 1,030 × $5/M + one cache write
= $0.154 + (6,400 tokens × $6.25/M once) ≈ $0.194
Weekly delta ≈ $1.19
Small per-spawn. A team running dozens daily accumulates these fast. Subagents that redo work because they missed a preference add a full agent-body write plus rejected output tokens to the tab; that cost does not appear in this math.
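The arithmetic above can be reproduced in a few lines, which makes it easy to plug in your own spawn counts and token sizes. This sketch uses the rates quoted in this section. The variable names are hypothetical.

```python
M = 1_000_000
INPUT = 5.00 / M            # dollars per uncached input token
CACHE_WRITE_5M = 6.25 / M   # dollars per token written to cache, 5-minute TTL

def cost(tokens, rate):
    """Dollar cost of a number of tokens at a per-token rate."""
    return tokens * rate

# Single cold spawn, agent body only.
cold_before = cost(2_800, INPUT)
cold_after = cost(1_030, INPUT)

# One pipeline run: six subagents each re-reading 6,400 tokens of shared rules.
run_before = cost(6 * 6_400, INPUT)

# One week: five features, six agents each.
spawns = 5 * 6
week_before = cost(spawns * (2_800 + 6_400), INPUT)
week_after = cost(spawns * 1_030, INPUT) + cost(6_400, CACHE_WRITE_5M)

print(f"cold spawn: ${cold_before:.5f} -> ${cold_after:.5f}")
print(f"pipeline run, redundant rules input: ${run_before:.3f}")
print(f"week: ${week_before:.2f} -> ${week_after:.4f}")
```

The sketch follows the section's simplification of pricing cached reads of the shared rules at zero. At the $0.50/M cached-read rate quoted above, 30 spawns each reading 6,400 cached tokens would add roughly $0.10, assuming every spawn lands inside the cache TTL.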
New project:
- Put universal content in .claude/rules/
- Keep CLAUDE.md short
- Write agent bodies with persona only
- Whitelist tools in frontmatter
- Set memory: project on agents that accumulate
- Write description as a delegation hint
- Turn on OTel before you claim anything cached
Existing project:
- Run the persona-vs-data audit
- Move data out, delete redundancy
- Strip startup Reads that load auto-loaded content
- Measure before and after
If the agent body reads like a manual, it is not an agent body. It is a manual that got filed in the wrong drawer and the bill shows up every time the drawer opens.
Fact Check
- Prompt caching (Anthropic API docs): pricing, TTL options, minimum prefix thresholds, write-premium multipliers. Fetched 2026-04-14.
- Claude Code subagents (Anthropic docs): frontmatter fields (tools, memory, description, maxTurns), subagent spawn behavior, context inheritance.