Do AgentGateway Tool Modes and Headroom Stack? Measuring Two Token-Saving Layers Together

In a previous post I showed that Enterprise agentgateway can cut your LLM bill by changing how GitHub’s MCP tools are presented to the model — its toolMode (Standard / Search / Code) shrinks the tool-catalog tax, the 4,781 tokens of tool schema re-injected on every turn.

But that’s only one line item on the bill. Headroom attacks a different one: it’s a compression proxy that shrinks the content payload — the verbose GitHub JSON results, file contents, and conversation history — before they ever reach the model. It claims 60–95% token savings, including “GitHub issue triage 73%.”

So the obvious question: they touch different layers, so do the savings stack? If AgentGateway removes the catalog tax and Headroom removes the payload tax, do you get both by running them together — or does one make the other redundant?

I measured it. All the code, manifests, harness, judge, dashboard, and raw results are in the demo repo:

👉 sebbycorp/agentgateway-demos / 105-ent-headroom-comp-tokenomics

TL;DR: Yes — but only when there’s real payload to compress, and not for free. On a large repo, Headroom stacks on top of every AGW mode (Code −63%, Search −36%). On a small, catalog-dominated repo it barely helps and actually makes Standard mode 49% more expensive by busting the provider’s prompt cache. And it trades a little accuracy on exact-identifier tasks.

Two knobs, two layers

This is the whole mental model, so it’s worth being precise:

	What it reduces	How
AGW tool modes	the tool-catalog tax + orchestration round-trips	Search: 28 tools → 2 meta-tools (−91% catalog). Code: N calls → 1 round-trip, summary-only result
Headroom	the content payload + history	ML/AST/JSON compressors (SmartCrusher, CodeCompressor, the Kompress model), reversibly

AGW shrinks the schemas and round-trips. Headroom shrinks the result data and history. Because they act on different parts of the request, they can — in principle — compound. The experiment is whether they actually do.

The setup at a glance

Component	Value
Backend	GitHub’s external remote MCP server — `api.githubcopilot.com/mcp/readonly` (28 read-only tools)
Front door	Enterprise agentgateway (`toolMode` Standard/Search/Code; injects the GitHub PAT as a Bearer token)
Compression layer	Headroom proxy (`headroom proxy`, OpenAI-compatible), upstream pointed at the AGW `/openai` route
Model	`gpt-5.5` via the agentgateway `/openai` route
Scope	A small repo (`sebbycorp/agw-tokenomics-sandbox`) and a larger one (`sebbycorp/k8s-iceman`), both read-only-pinned
Matrix	3 tool modes × Headroom OFF/ON × 2 repos = 12 cells, 5 questions each + an LLM-judge quality score

Architecture — two independent switches in one pipeline

The key insight that makes the test clean: the two knobs sit on different arrows.

                          ┌─ Headroom OFF ─► AGW /openai ──► OpenAI gpt-5.5
harness ──build request──>┤
 (catalog baked in         └─ Headroom ON ──► Headroom proxy ─► AGW /openai ─► OpenAI
  per AGW tool mode)                           (compresses payload + history)
        │
        └──/mcp/gh-{std,search,code}── AGW (toolMode) ──TLS+PAT──► GitHub remote MCP

Knob 1 — AGW toolMode acts on the MCP catalog (the /mcp/gh-* arrow).
Knob 2 — Headroom acts on the LLM request body (the /openai arrow). OFF = harness posts straight to AGW; ON = it posts to the Headroom proxy, which compresses and forwards to AGW.

Crucially, the AGW catalog effect is independent of which LLM URL is used: the harness bakes the tool catalog into the request from the MCP tools/list response, so the two knobs stay orthogonal. Headroom forwards to AGW (not OpenAI directly) so AGW still supplies the model + key and traces every call.

A gotcha worth flagging. Headroom is OpenAI-compatible at /v1/chat/completions, and you point its upstream at AGW with OPENAI_TARGET_API_URL. But out of the box, its semantic cache and CCR tool-injection broke the Search/Code tool-orchestration flow — every Search/Code request through Headroom failed with an MCP TaskGroup error until I launched it with --no-cache --no-ccr-inject-tool --no-ccr-marker. In front of an MCP tool-calling agent, those flags are mandatory. (A quick sanity check that compression is actually happening: the same 29 KB tool-result blob went 10,859 → 9,772 prompt tokens through the proxy.)

The numbers

Single run, gpt-5.5 list-price, cache-aware. Δ = the Headroom saving (ON vs OFF): positive means ON is cheaper, negative means ON is more expensive.

Small repo — `agw-tokenomics-sandbox`

AGW mode	OFF cost	ON cost	Headroom Δ	OFF quality	ON quality
Standard	$0.0289	$0.0432	−49% (worse)	5.0	5.0
Search	$0.0149	$0.0138	+7%	5.0	4.2
Code	$0.0315	$0.0204	+35%	5.0	4.8

Large repo — `k8s-iceman`

AGW mode	OFF cost	ON cost	Headroom Δ	OFF quality	ON quality
Standard	$0.0424	$0.0391	+8%	5.0	4.2
Search	$0.0269	$0.0173	+36%	4.8	4.4
Code	$0.0623	$0.0233	+63%	3.4	4.6

The live Grafana dashboard (metrics pushed via Prometheus pushgateway) shows the blended cross-repo picture — Search drops from $0.0209 → $0.0155, a 26% stacked saving:

And the per-repo + quality detail — note Search’s first-call tool context is just 429 tokens (the AGW catalog win, independent of Headroom), and the answer-quality panel that keeps the comparison honest:

Why it comes out this way

1. On a large repo, the layers stack. Big result payloads give Headroom something to compress, and AGW has already collapsed the catalog. The biggest stack is Code + Headroom (−63%) — Code returns summaries only, and Headroom squeezes those further. The cheapest reliable cell overall is Search + Headroom ($0.0173): AGW removes the −91% catalog tax, Headroom removes the result-JSON tax.

2. On a small repo, Headroom can backfire. With little result data to compress, its upside is small — and on Standard mode it loses 49%. The reason is subtle but important: Headroom rewrites the prompt on every turn, which busts gpt-5.5’s prefix cache. On a small workload the big, stable 28-tool catalog is exactly what caches well, so the cache you destroy is worth more than the payload you compress. Payload compression needs payloads.

3. Cheaper isn’t free. The judge flagged it: the “list recent commits” question repeatedly scored 2/5 under Headroom, because compression mangles high-entropy commit SHA hashes. For tasks that depend on verbatim identifiers — hashes, tokens, IDs — lossy text compression is risky. Most other answers held at 4–5/5.

When to run both

Large result payloads (logs, file contents, big JSON)  → YES: AGW Search/Code + Headroom stacks (−36% to −63%)
Small, catalog-dominated workload                      → AGW alone; Headroom can hurt (prefix-cache busting)
Tasks needing exact IDs (SHAs, tokens, hashes)         → be careful: Headroom can mangle them
Any MCP tool-calling agent                             → Headroom needs --no-cache --no-ccr-* to not break

They’re complementary, not competing — AGW on the catalog, Headroom on the payload — and on the right workload they genuinely compound.

Reproduce it yourself

git clone https://github.com/sebbycorp/agentgateway-demos.git
cd agentgateway-demos/105-ent-headroom-comp-tokenomics

cp .env.example .env      # AGENTGATEWAY_LICENSE_KEY, OPENAI_API_KEY, GITHUB_PAT (read-only), REPO_LARGE
set -a; . .env; set +a
./deploy.sh               # kind + AGW + GitHub MCP backends + installs the Headroom proxy

./test.sh                                              # one question, Headroom OFF vs ON
REPO_LARGE=owner/big-readonly-repo ./run_matrix.sh     # full 12-cell matrix + LLM-judge

A Grafana dashboard (AGW Tool Modes + Headroom) visualizes cost, tokens, and quality OFF vs ON. You can even replay a saved run into Prometheus with observability/replay_to_pushgateway.py — no extra LLM spend:

kubectl port-forward svc/grafana -n observability 3001:80

Clean up with ./cleanup.sh.

Takeaways

AGW tool modes and Headroom stack on large payloads — large repo: Code −63%, Search −36%. Different layers, real compounding.
Headroom is payload-dependent. No payload, no benefit — and it can regress cache-friendly workloads by busting the prompt cache (Standard, small repo: +49% cost).
Mind the accuracy cost. Compression mangles exact identifiers like commit SHAs; gate any compression layer with an answer-quality check.
Headroom is not a clean drop-in for MCP. Its semantic cache + CCR tool-injection break tool-calling; disable them with --no-cache --no-ccr-inject-tool --no-ccr-marker.
Measure for your own workload. This is a single run on two repos — the verdict flips with payload size, just as the 104 tool-modes study flipped with catalog size.

The full repo, manifests, harness, judge, and dashboard: 👉 github.com/sebbycorp/agentgateway-demos/tree/main/105-ent-headroom-comp-tokenomics

Related reading: