Track Your Claude Code Spend with OTel + Jaeger (and the Gotcha That Wastes Your Morning)


The Claude Code dashboard shows you the total bill. It does not show you which subagent burned $80 in a single session, which MCP server doubled your prefix, or whether your cache is actually doing anything.

For that you need OpenTelemetry. The good news: Anthropic ships official OTel support in Claude Code, and the setup is one Docker command and a JSON block. The bad news: the most popular off-the-shelf LLM tracing tool will silently return zero tokens for every request — no error, no warning, just zero — and you will spend half a morning convinced your monitoring is broken.

Your monitoring is not broken. Let’s get into it.

Why bother tracing at all

The dashboard tells you the bill. It does not tell you where it came from.

Real example: GitHub issue #29966 documents a subagent caching bug where spawned agents fail to share cache with the parent session. The practical consequence is visible in the thread — one developer reported $400 in a single session. You will not see this in the dashboard. You will see a large number and no explanation.

OTel traces give you per-request breakdowns: which model, how many tokens, how much was cache hit vs cache miss, and how long the model thought before responding. You can spot cache busting the moment it happens, and you can attribute spend to specific tools, agents, or pipeline stages.

You cannot optimize what you cannot measure. This is the setup.

The standard setup

Prerequisites: Docker running locally. That’s it.

Step 1: Run Jaeger

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

Jaeger is the trace collector and UI. Port 4317 is the OTLP/gRPC receiver; port 16686 serves the web UI. It runs happily in the background.

Step 2: Enable telemetry in Claude Code

Open (or create) ~/.claude/settings.json and add these to the env block:

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "CLAUDE_CODE_ENHANCED_TELEMETRY_BETA": "1",
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"
  }
}

While you’re there, add this too — it lets you filter traces by project:

{
  "env": {
    "OTEL_RESOURCE_ATTRIBUTES": "project=your-project-name"
  }
}
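Merged, the two snippets form a single env block. A complete ~/.claude/settings.json would look like this (assuming no other settings; the project value is just a label you choose):

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "CLAUDE_CODE_ENHANCED_TELEMETRY_BETA": "1",
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_RESOURCE_ATTRIBUTES": "project=your-project-name"
  }
}
```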

Step 3: Restart Claude Code and watch the traces appear

Quit and relaunch Claude Code. Trigger a few prompts. Open http://localhost:16686, select the claude-code service, and click “Find Traces.” You should see traces flowing within seconds of your first API call.

The traces are there. The data is real. This is where it gets complicated.

The gotcha: why LLM tooling returns zero

Say you’ve connected traceloop/opentelemetry-mcp-server to query your traces. You call get_llm_usage. It returns:

{
  "total_tokens": 0,
  "total_cost": 0,
  "request_count": 0
}

Zero. Everything zero. You check that Jaeger is receiving traces (it is). You check that the timestamps are right (they are). You restart everything and try again (same result).

Here is what is happening.

Claude Code does not follow the standard OpenTelemetry GenAI semantic conventions. It emits flat, un-namespaced attributes instead:

Standard OTel GenAI convention    Claude Code attribute
gen_ai.usage.prompt_tokens        input_tokens
gen_ai.usage.completion_tokens    output_tokens
gen_ai.request.model              model
gen_ai.system                     (not emitted)

That last row is the killer. The traceloop/opentelemetry-mcp-server classifies a span as an “LLM span” only if it has a gen_ai.system attribute. Claude Code never emits it. Here’s the exact check in the server source (as of 2026-04-14):

# traceloop/opentelemetry-mcp-server, models.py
@property
def is_llm_span(self) -> bool:
    return self.attributes.gen_ai_system is not None  # ← Claude Code never sets this

Every Claude Code span fails this check. The aggregation functions (get_llm_usage, get_llm_model_stats, list_llm_models) filter to LLM spans only, so they return empty results. Silently. With no indication that the filter is the problem.

This is not a bug report — it’s a design mismatch. The MCP server was written for the standard conventions. Claude Code predates or ignores them.

The fix: query spans directly

Skip the aggregation helpers. Use search_traces and get_trace to read raw span attributes instead:

# Instead of get_llm_usage(), query raw spans:
# search_traces(service="claude-code", operation="claude_code.llm_request")
# Then read attributes directly from each span

The data is there. Here is what a real span looks like (anonymized):

{
  "operation_name": "claude_code.llm_request",
  "duration_ms": 2472,
  "attributes": {
    "model": "claude-sonnet-4-6",
    "input_tokens": 1,
    "output_tokens": 67,
    "cache_creation_tokens": 287,
    "cache_read_tokens": 30433,
    "ttft_ms": 1345
  }
}

That cache_read_tokens: 30433 against input_tokens: 1 is a very healthy cache hit. More on that shortly.
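Once you have raw spans in hand (from search_traces/get_trace, or however your query tool returns them), rebuilding the aggregation yourself is a few lines. A minimal sketch, assuming span dicts shaped like the example above:

```python
# Sketch: aggregate usage from raw Claude Code spans, replacing the
# broken get_llm_usage() helper. Assumes spans are dicts shaped like
# the anonymized example above.

def summarize_usage(spans):
    """Sum token counts across claude_code.llm_request spans."""
    totals = {"requests": 0, "input_tokens": 0, "output_tokens": 0,
              "cache_creation_tokens": 0, "cache_read_tokens": 0}
    for span in spans:
        if span.get("operation_name") != "claude_code.llm_request":
            continue  # skip tool-use and other non-LLM spans
        attrs = span.get("attributes", {})
        totals["requests"] += 1
        for key in ("input_tokens", "output_tokens",
                    "cache_creation_tokens", "cache_read_tokens"):
            totals[key] += attrs.get(key, 0)
    return totals
```

Feed it the span above and you get one request, 30,433 cache-read tokens. No gen_ai.system filter in sight.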

Optional: patch the MCP server

If you want the aggregation helpers to work, the fix is five lines. In models.py, change is_llm_span to also match Claude Code’s attribute schema:

@property
def is_llm_span(self) -> bool:
    # Standard OTel GenAI convention
    if self.attributes.gen_ai_system is not None:
        return True
    # Claude Code uses flat attributes without gen_ai.system
    if hasattr(self.attributes, 'model') and hasattr(self.attributes, 'input_tokens'):
        return True
    return False

This is not upstream yet — you’re patching a local copy. Worth it if you lean on the aggregation tools heavily; skip it if you’re comfortable querying spans directly.
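To sanity-check the patched predicate without running the full server, you can exercise the same logic against stand-in attribute objects (SimpleNamespace here; the real server uses its own attribute model, so treat this as a test harness, not server code):

```python
from types import SimpleNamespace

def is_llm_span(attributes) -> bool:
    """Patched classifier logic from above, extracted for testing."""
    # Standard OTel GenAI convention
    if getattr(attributes, "gen_ai_system", None) is not None:
        return True
    # Claude Code's flat attributes, no gen_ai.system
    if hasattr(attributes, "model") and hasattr(attributes, "input_tokens"):
        return True
    return False

claude_span = SimpleNamespace(model="claude-sonnet-4-6", input_tokens=1)
standard_span = SimpleNamespace(gen_ai_system="anthropic")
tool_span = SimpleNamespace(tool_name="Bash")
```

Both span flavors now classify as LLM spans; tool-use spans still do not.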

A note on the Prometheus/metrics path

If you’re already running a metrics stack (Prometheus + Grafana or VictoriaMetrics), there’s a cleaner path: Claude Code emits a claude_code.token.usage counter that uses simpler attributes and does not have the gen_ai.system classifier problem. Community walkthroughs from tcude.net and Quesma cover this path in detail. It’s worth it if you want persistent dashboards. Jaeger is lighter for ad-hoc trace inspection.

Three queries that pay back the setup time

Once traces are flowing and you can read raw span attributes, here’s what to actually look at.

1. Cache hit rate per session

From the span above: cache_read_tokens: 30433, input_tokens: 1, cache_creation_tokens: 287. The cache hit rate for that request is approximately 30433 / (30433 + 1 + 287) ≈ 99%. Excellent.

The general formula: cache_read_tokens / (input_tokens + cache_creation_tokens + cache_read_tokens).

For repeated work on the same codebase, target 50% or higher. If you’re seeing 10-20%, your shared prefix is not caching — check whether your rules files are positioned before agent-specific content.

Pricing context (as of 2026-04-14): Claude Sonnet 4.6 charges $3/M for input tokens and $0.30/M for cache reads. A session with a 90% cache hit rate on 1M input tokens costs roughly (100K × $3/M) + (900K × $0.30/M) = $0.57 instead of $3.00, and that 81% saving compounds across every request in a long session. Cache is not a nice-to-have. The full mechanics of how caching works and what breaks it are in Prompt Caching Basics.
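The arithmetic generalizes into two small helpers. A sketch, using the Sonnet pricing figures quoted above (treat the constants as assumptions, and note this ignores the cache-write premium on cache_creation_tokens):

```python
SONNET_INPUT_PER_M = 3.00       # $/M input tokens (figure quoted above)
SONNET_CACHE_READ_PER_M = 0.30  # $/M cache-read tokens

def cache_hit_rate(input_tokens, cache_creation, cache_read):
    """Fraction of the prompt served from cache."""
    total = input_tokens + cache_creation + cache_read
    return cache_read / total if total else 0.0

def input_cost(total_input_tokens, hit_rate):
    """Approximate input-side cost in dollars, ignoring the
    cache-write premium for simplicity."""
    cached = total_input_tokens * hit_rate
    uncached = total_input_tokens - cached
    return (uncached * SONNET_INPUT_PER_M
            + cached * SONNET_CACHE_READ_PER_M) / 1e6
```

For the span above, cache_hit_rate(1, 287, 30433) comes out around 0.99; input_cost(1_000_000, 0.9) is about $0.57 versus $3.00 fully uncached.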

2. Cache miss spikes

Look for spans where cache_creation_tokens is large and cache_read_tokens is small on requests that should be hitting cache. This pattern means cache busting: something in the shared prefix changed between requests.

Common causes:

  • A new MCP server was added (changes the tool list, which is part of the prefix)
  • A rules file was edited
  • Agent definitions were modified mid-session
  • The session spawned agents in a different order

Cache busting burns money on every subsequent request until the new prefix is warm again. Spotting it in traces tells you exactly when it happened and which session triggered it.
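The pattern is easy to flag programmatically. A sketch over the same span shape as earlier (the 1,000-token threshold is an arbitrary assumption; tune it to your prefix size):

```python
def find_cache_busts(spans, creation_threshold=1000):
    """Flag spans that rebuilt a large prefix instead of reading cache."""
    busts = []
    for span in spans:
        attrs = span.get("attributes", {})
        created = attrs.get("cache_creation_tokens", 0)
        read = attrs.get("cache_read_tokens", 0)
        # Large cache write plus small cache read: the shared prefix changed
        if created >= creation_threshold and created > read:
            busts.append(span)
    return busts
```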

3. Subagent vs main session spend split

Filter by the model attribute, or by the project resource attribute if you set OTEL_RESOURCE_ATTRIBUTES earlier. Subagents show up as separate traces: they start fresh contexts, and their cache_read_tokens on turn 1 shows whether they successfully inherited cache from the parent.

This is how you find the #29966 caching bug in practice: subagents with unexpectedly low cache_read_tokens on turn 1, even for large shared prefixes. Without per-request traces, this is invisible — you just see a higher-than-expected bill and no explanation.
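A sketch of that check: group spans by trace, take the first request in each, and flag traces whose opening request read almost nothing from cache. The trace_id and start_time fields are assumptions about how your query tool labels spans, and the 5,000-token floor is arbitrary:

```python
def cold_start_traces(spans, min_cache_read=5000):
    """Return trace_ids whose first request barely read from cache."""
    first_by_trace = {}
    for span in spans:
        tid = span["trace_id"]
        prev = first_by_trace.get(tid)
        if prev is None or span["start_time"] < prev["start_time"]:
            first_by_trace[tid] = span  # keep the earliest span per trace
    return [tid for tid, span in first_by_trace.items()
            if span["attributes"].get("cache_read_tokens", 0) < min_cache_read]
```

A subagent trace showing up in this list despite a large shared prefix is exactly the #29966 signature.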

What we wish we’d known

The official Claude Code monitoring docs publish the attribute schema — it’s the authoritative reference for what’s in each span. The anthropics/claude-code-monitoring-guide repo has a working setup with Docker Compose.

What neither of those documents mentions: the gen_ai.system classifier issue with third-party tooling. It’s not in any official docs. It’s not in the MCP server README. You’d find it only by reading the server source, which is how we found it.

Hopefully this post saves you the debugging session.


Sources

  1. Monitoring — Claude Code Docs — official Claude Code OTel attribute schema
  2. anthropics/claude-code-monitoring-guide — official monitoring setup repo with Docker Compose
  3. traceloop/opentelemetry-mcp-server — source of the is_llm_span classifier check
  4. How I Monitor My Claude Code Usage with Grafana, OpenTelemetry, and VictoriaMetrics — community metrics-based setup walkthrough
  5. Track Claude Code Usage and Limits with Grafana Cloud — Quesma community walkthrough
  6. anthropics/claude-code#29966 — subagent caching bug; motivating example for per-request tracing