Skip to main content
FB.
technique claude-code context-engineering rtk token-reduction finops

3/6 · Four Layers, Not a Ranking: Mapping the Token-Reduction Toolbox

Token reduction is four layers, not competing tools. The map, three deep dives, and the cost blind spot that hides spending from Claude Max and Pro users.

13 min read

Context Engineering, a six-part series. You’re reading part 3: the tooling. Part 1: the science · Part 2: the discipline · Part 3: the tooling (you are here) · Part 4: the roles · Part 5: the team · Part 6: the portability

The frameToken reduction is FinOps for AI. Measure first, optimize second
The mapFour layers, each filtering a different point in the token flow
The leverageLayer 1 (CLI output) is the cheapest win: a small share of total volume, but pure noise and trivial to filter
The MCP taxOne GitHub MCP server can cost 44K tokens for a one-line question
The blind spotAPI gateways see nothing on Max/Pro. No API key, nothing to intercept

Disclosure up front: I’m a core contributor to RTK, one of the tools mapped below. I’ve kept the numbers sourced and the comparisons honest, but you should know where I sit.

In 2014, cloud teams were burning money they couldn’t see. EC2 instances running overnight for nothing, forgotten S3 buckets, credits vanishing into a bill nobody could break down. Then FinOps showed up. Not a tool, a discipline: measure first, optimize second, attribute everything. In 2026 the same thing is happening with tokens. The Claude Max license itself is fine, predictable, a fixed cost. What actually hurts is the noise you inject into every session without seeing it.

This article maps the tools that cut that noise. There are a lot of them, and from the outside they look like a crowded field of competitors all claiming the same thing. They aren’t competitors. They sit at different layers of the token flow, and each layer answers a different context-rot vector. It’s a map, not a ranking.


The four-layer map

Think of where a token can be intercepted between your command and the model’s reply. There are four places, and the tools cluster into them.

Layer 1, CLI proxy. Filters command output before it ever enters the context. This is where the easiest wins live, per-command reductions are large and setup is trivial.

ToolMechanismReductionSignal
RTKRust proxy, filters stdout before injection, native Claude Code hooks, SQLite analytics60-90%~56K stars
SnipGo binary, declarative YAML rules per command60-90%RTK alternative
DistillGeneric pipe, any CLI output to LLM-optimized answervarieszero config

Layer 2, MCP sandbox. Intercepts the tool calls that dump into context, and trims the schema cost of MCP servers.

ToolMechanismReductionSignal
context-modeSQLite + FTS5 session tracking, MCP output interception, “Think in Code”up to 98% MCP output10K stars
mcp2cliConverts MCP servers to CLI commands, avoids full schema injection~13K tokens/serverhidden gem
LeanCTXMCP server, 10 context read modes (signatures, diff, task-filtered), file read cache60-99%1.4K stars
Token SaviorAST symbol index, 34 Bash output compactors, cross-session SQLite memory-80% active tokens (author benchmark, synthetic codebase)1K stars
SerenaSemantic code indexing, symbol navigation40-70%code exploration
Context7Documentation retrieval, relevant chunks onlydoc-specific55K stars

Layer 3, output style. Reduces the verbosity of the model’s own responses.

ToolMechanismReductionSignal
CavemanTelegraphic responses, plus a PreCompact hook compressing input40-75% output, 46% input~5K stars
—uc modeSymbol-based communication protocol30-65% outputbuilt-in
LLMLinguaML token pruning before the call, under 2% accuracy loss60-80%6K stars
claude-token-efficientSingle CLAUDE.md file, no code, instructs concise output~63% output tokens (self-reported)5.7K stars

Layer 4, API gateway. Compresses history at the API boundary. Critical caveat below.

ToolMechanismReductionSignal
LiteLLMUnified proxy across 100+ providers, token counting, routingvaries50K stars
BifrostOSS gateway, per-dev budgets, semantic cachingup to 70% via semantic caching (self-reported)5.8K stars
HeadroomInput compression + output shaper, HTTP proxy or MCP server, CCR cross-agent cache47% full-session, 92% code search (author benchmark, v0.5.18)~43K stars
CompresrBackground compaction at 75% context via secondary modelclaims 200xYC-backed

Across all four sits a transversal band: observability tools (ccusage, claude-spend, cc-statistics, the 8.2K-star Claude Code Usage Monitor, CodeBurn) that don’t reduce tokens but tell you where they go. The Claude Code Usage Monitor gives real-time burn-rate predictions in the terminal. CodeBurn classifies spending by turn type across 13 categories and correlates sessions with git commits via codeburn yield, answering not “how much did I spend?” but “on what, and did it ship?”

The four-layer token-reduction map: CLI proxy, MCP sandbox, output style, API gateway, each with representative tools, star counts and reductions
Not a ranking, a map. Each layer answers a distinct context-rot vector, and the right setup usually combines one tool from several layers.

Headroom is worth singling out before the deep-dives. At around 43,000 stars in June 2026, it is the most-starred context engineering tool outside RTK, and it doesn’t fit neatly into one layer. In HTTP proxy mode it intercepts at Layer 4, compressing the full message history before the API call. In MCP mode it acts at Layer 2, triaging tool outputs. Its Compress-Cache-Retrieve (CCR) mechanism stores compressed content in a local SQLite index, sends {{HEADROOM_TAG_N}} placeholders to the model, and retrieves the original when the model calls headroom_retrieve. The headline 92% reduction applies to code search tasks specifically; full-session measurement from the author’s own benchmark runs closer to 47%. Two open production caveats: issue #1158 documents that headroom wrap claude silently caps Claude Max users at 200K context instead of 1M; issue #1227, currently unresolved, exposes CCR endpoints without loopback protection, meaning any local web page can read cached tool outputs without authentication. The CCR retrieval TTL issue from earlier versions was fixed in v0.25.0 via the HEADROOM_CCR_TTL_SECONDS environment variable. Core compression is stable. CCR in proxy mode on a shared workstation needs caution until #1227 is patched.

Three of these deserve a closer look, because they carry most of the practical leverage. The guide keeps a fuller, regularly updated inventory of the token compression tools in its context engineering reference.


Deep-dive 1: RTK and Layer 1

Layer 1 is where most people start, because tool output is pure noise wrapped around a little signal and it’s the easiest traffic to intercept. When Claude runs git diff, pnpm list, or a test suite, the raw output is thousands of lines wrapped around a few lines that matter, and all of it enters the window and stays there, degrading every subsequent turn.

RTK is a synchronous Rust proxy that sits on Claude Code’s PreToolUse hook. Claude triggers a command, RTK intercepts it, filters the noise, and returns only what matters. To the agent, nothing looks unusual. To the context, 60 to 90% of the tokens never arrive, and on the noisiest commands more than that. A git status that costs 2,000 tokens raw drops to around 200. A vitest run drops about 92%. A git log drops 92.3%.

The number I trust most is the longitudinal one. rtk gain --all reads three months of real command history and reports 437.8 million tokens saved across 70,310 commands, a 79.2% average reduction. That’s not an estimate from a benchmark, it’s the bill from actual usage. The whole thing is one 4.8MB Rust binary, no dependencies, sub-10ms startup. It crossed roughly 56,000 GitHub stars in five months, which is less a startup finding a market than a market finding its tool.

$ rtk gain --all   (monthly breakdown)

Month       Cmds      Input     Output      Saved   Save%
──────────────────────────────────────────────────────────
2026-03     6957     100.6M       8.8M      91.8M   91.3%
2026-04    20789     230.7M      41.6M     189.2M   82.0%
2026-05    32099     178.9M      34.8M     144.1M   80.6%
2026-06    10465      42.8M      30.3M      12.6M   29.5%
──────────────────────────────────────────────────────────
TOTAL      70310     553.0M     115.5M     437.8M   79.2%

Three months of real command history: 70,310 commands, 437.8M tokens that never reached the context. The recent dip is honest too. As more of a session is the model’s own output, which a CLI proxy can’t touch, the headline rate falls, exactly what the cost breakdown later in this article predicts.


Deep-dive 2: the MCP tax

The second hidden cost is one most people never see, because it’s paid before the first question. Every active MCP server injects its full tool schema into the system prompt, every turn, whether you use the tools or not.

The benchmark that makes it concrete comes from Scalekit. Asking “what language is this repo?” through the GitHub MCP server, which exposes 43 tools, cost 44,026 tokens. The same question answered with the gh CLI cost 1,365. A factor of 32, for an identical answer. Each active MCP server runs around 13K tokens of system-prompt overhead before you’ve typed anything.

The token count is only part of the problem. On IFTTD, Zineb Bendhiba argued that each additional tool exposed to a model also raises the hallucination rate, because the model must reason across a wider option space on every turn (ep 326). A related design principle from Frédéric Barthelet: build MCP tools as complete user journeys rather than atomic endpoints, because five sequential tool calls each multiply the model’s error probability, while a single well-scoped tool compresses that risk into one decision (ep 329).

At ManoMano, where 250 developers run Claude Code in production, Jocelyn N’takpe described an explicit rule that follows the same logic: anything that can be implemented as a native Claude Code skill should be, because skills carry no schema overhead and run without MCP round-trips (ep 346). MCP servers stay reserved for integrations that genuinely cannot be expressed any other way.

The fix has two halves. The mcp2cli pattern converts MCP servers into plain CLI commands, moving the schema cost client-side and cutting it 96 to 99%. Claude Code also addresses part of this natively with deferred tool loading, where schemas load on demand past a context threshold rather than all at once. One community report combined MCP cleanup with mcp2cli and compression scripts and took a session from 92K tokens to 5,500, a 94% cut, mostly by not paying for schemas nobody was using.

The practical move is duller than the number suggests: run claude mcp list, look at what’s actually loaded, and turn off the servers you’re not using this week. Most setups carry two or three servers they forgot they installed.

The MCP tax: 44,026 tokens through the GitHub MCP server versus 1,365 through the gh CLI for the same question, a 32x ratio
Same answer, 32 times the tokens. Each active MCP server runs around 13K tokens of system-prompt overhead before you type anything.

Deep-dive 3: output, and the Max/Pro blind spot

Layer 3 is smaller but real. Caveman strips the hedging and filler from responses for a 40 to 75% output reduction, and its PreCompact hook compresses input by around 46%. The built-in --uc mode does something similar with a symbol protocol. The practical split: Caveman for interactive sessions where you still read the output, cutting it around 40% in practice, and --uc for batch or agentic runs where density beats readability, closer to 65% on those runs specifically. At the opposite end of that complexity spectrum, claude-token-efficient is a single CLAUDE.md that instructs Claude to be concise: zero install, zero code, and 5.7K GitHub stars that measure exactly how much demand exists for that level of simplicity. The ceiling is real (output tokens only, no effect on file reads or input), but so is the signal. These gains are modest next to Layer 1, but they compound: a more compressed session leaves more room for the rules from part two to stay in the window.

Layer 4 is where the interesting trap lives. Every API gateway, LangFuse, Helicone, Bifrost, and the rest, works by intercepting API calls. That’s the whole mechanism. On Claude Code Max and Pro, there is no API key to intercept. Claude connects directly to Anthropic’s servers with subscription credentials, and the gateway sees nothing. These tools aren’t immature about it, they’re structurally blind to it. For the large population of developers on Max and Pro, the entire enterprise observability layer is dark, and the only thing that can give you per-session analytics is a tool operating at the CLI layer, below the billing boundary.

Cost-aware routing at the prompt level is the approach that survives the Max/Pro blind spot. On IFTTD, Antonio Goncalves described deploying a routing model that assesses prompt complexity and dispatches to the appropriate model tier, reporting that clients who adopted this pattern halved their inference bill (ep 357). A lightweight classifier doing that dispatch works at the application layer, with or without an API key to intercept.

Cost control on a subscription comes from different levers. Running a dual-instance setup, an Opus planner directing Sonnet executors, costs roughly $100 to $200 a month against $500 to $1K for Opus doing everything. Fast Mode trades quota for speed, it runs Opus with faster output and can chew through your Max allowance noticeably faster, so toggle it off with /fast when you’d rather conserve it. Per-task budget caps keep a runaway session from eating the week, with rough tiers of $2 for tests and docs, $5 for a medium refactor, $10 for architecture work. When a session nears its cap it summarizes what is left rather than stopping mid-task, so you lose tokens, not the thread. None of these show up in a gateway dashboard, because the gateway can’t see the traffic.

The Max/Pro blind spot: API gateways intercept API calls, but Claude Code Max and Pro have no API key to intercept, so only the CLI layer sees the traffic
Every API gateway works by intercepting API calls. On Max and Pro there is no key to intercept, so the whole enterprise observability layer goes dark.

What doesn’t work, and measuring first

Before any of that, the deepest measure-first principle is whether you need a model for the specific task. On IFTTD, Charles Cohen described a content-moderation system where rule-based symbolic AI outperformed LLMs on precision, latency, cost, and explainability in production, because the task was deterministic enough for rules to win (ep 333). Every token you never send is free and adds no noise to the context. A complementary finding from production: N’takpe at ManoMano cited a Berkeley study showing that 94% of agent errors would have been caught by a compilation step, which drove their team to migrate parts of their stack to strict TypeScript specifically to improve Claude Code output reliability (ep 346). A typed codebase is infrastructure for the agent layer, not just a style preference.

A lot of clever-sounding token tricks fail in practice, and it’s worth knowing which, so you don’t waste a weekend on them. Base64 and zlib compression fail because the model doesn’t decode binary. Binary diffs aren’t interpretable. Emoji encoding costs more than it saves, each emoji running 2 to 3 tokens. Regex redaction of secrets carries a 60%-plus miss rate, worse than useless because it feels safe. Path-prefix elision backfires when the model copies the shortened paths verbatim and the commands fail.

One honest caveat about the headline numbers, including RTK’s: most CLI-proxy savings are measured in characters or bytes, not BPE tokens. The two correlate well on noisy CLI output, but they aren’t identical, and an enterprise-grade benchmark should validate against the actual tokenizer rather than trust the byte count. I’d rather say that than quote a number I can’t fully stand behind.

And the breakdown that reframes the whole exercise: when you actually measure a session, CLI output is only around 2% of the total token volume. That is the honest counterweight to Layer 1. It is the cheapest, noisiest traffic to cut, which is why the per-command reductions run so high, but it was never the bulk of the bill. Native file reads (Read, Grep, Glob) are 40 to 60%, the single largest cost, and most tools don’t touch them. Static instructions are 5 to 15%. The model’s own output is 30 to 40%. This is the FinOps lesson over folk wisdom: you optimize where the spending is, not where the easy win feels good. Run ccusage, or rtk gain, or codeburn, and look at your own breakdown before you install anything. For tracking this across a team rather than one machine, the guide’s team metrics page covers what to measure and how.

Where the token cost actually is: CLI output ~2%, native file reads 40-60%, static instructions 5-15%, model output 30-40%
Optimize where the spending is, not where the easy win feels good. Native file reads are the largest cost, and most tools never touch them.

What to do Monday

Three actions, escalating effort, all visible by Monday morning.

Run rtk gain (about 2 minutes). It reads your command history and shows the real waste, per command and per session, pulled from actual usage, not a benchmark estimate.

Run claude mcp list and disable the servers you aren’t using (about 5 minutes). That alone reclaims the MCP tax on every future session.

Add two terse rules to CLAUDE.md (about 10 minutes): confirm success in one line, explain only on errors. That trims the output layer for free.

Together, depending on how much MCP overhead and output bloat you were carrying, these often land a 40 to 60% reduction on your next session, with no model change and no workflow change. The science in part one explains why that also makes the output better, not just cheaper. The discipline in part two keeps it that way over months. This is the layer where you prove it on the bill. And if you want the wider view, part four covers the roles this whole practice created, from context engineer to the jobs that didn’t exist three years ago.

From the field, via IFTTD episode transcripts: Zineb Bendhiba, ep 326 on tool count and hallucination; Frédéric Barthelet, ep 329 on MCP tool design as complete journeys; Antonio Goncalves, ep 357 on model routing; Charles Cohen, ep 333 on symbolic AI beating LLMs in production; Jocelyn N’takpe, ep 346 on skills-first configuration at 250-developer scale.

If you’ve measured your own token breakdown and it looks nothing like the 2 / 40-60 / 5-15 / 30-40 split above, I genuinely want to see it. The shape of real usage is more varied than any single study captures.

Talked about in

Where I covered this live or on a podcast.