[BUG] PostHog telemetry: exception inflation from double-capture and control-flow misuse of captureException #513

@edelauna

Problem (one or two sentences)

PostHog's $exception event count is ~50–100× higher than conversation message volume (10,800 exceptions vs 10 messages on 2026-06-05), caused by two compounding bugs: captureException being called multiple times for a single API error (double-capture), and control-flow signals (ConsecutiveMistakeError) being routed through captureException when they are not real errors.

Context (who is affected and when)

Affects all users with telemetry enabled. Makes PostHog's Error Tracking dashboard unusable (real errors are buried in noise) and causes misleading metrics. The exception count grew ~3.5× over 14 days while DAUs grew only modestly, which suggests the inflation compounds with retry volume.

Reproduction steps

  1. Any environment with telemetry opted in
  2. Run a task that hits an API error with auto-approval enabled — provider captures, throws, Task loop retries → provider captures again on each attempt
  3. Run a task where the LLM doesn't use tools → ConsecutiveMistakeError captured as a $exception event repeatedly
  4. Observe PostHog: $exception count dwarfs all other events; $exception_handled is true for 100% of events

Expected result

One $exception event per distinct error occurrence. Control-flow signals (consecutive mistake limits) emitted as named telemetry events, not exceptions. Error tracking dashboard reflects real, actionable errors.

Actual result

3–6 $exception events per API error (due to retry amplification). ~1,000–1,900 no_tools_used and ~100–300 tool_repetition $exception events per day from the consecutive mistake guard. ~7,000–10,000 untagged exceptions per day with no reason, many of which are duplicates.


Audit findings (grounded in PostHog Node.js SDK docs)

Best-practice references

  • PostHog error tracking capture docs: "Never manually construct a $exception event. Always use captureException(). Only capture unexpected errors — not control flow."
  • PostHog Node.js SDK docs: Avoid calling captureException in a catch block that re-throws, if an outer catch will also call captureException on the same error.
  • PostHog burst protection throttles at 10 exceptions of the same type per 10s, but client-side deduplication is still recommended for retry loops.

Bug 1: ConsecutiveMistakeError misrouted through captureException

Files:

  • src/core/task/Task.ts:2384–2394
  • src/core/assistant-message/presentAssistantMessage.ts:655–662

What happens: captureConsecutiveMistakeError(), which correctly fires the named event CONSECUTIVE_MISTAKE_ERROR, is called, and then captureException(new ConsecutiveMistakeError(..., "no_tools_used"/"tool_repetition")) is called immediately afterwards. The captureException call is redundant and routes a deliberate control-flow limit into the error tracking system.

Volume: ~1,100–2,200 spurious $exception events/day (the sum of the no_tools_used and tool_repetition ranges above).

Fix: Remove the captureException(new ConsecutiveMistakeError(...)) calls. The captureConsecutiveMistakeError() event already captures this signal correctly as a named event.
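
A minimal sketch of the guard after this fix, assuming a stand-in telemetry class. Only the method names captureConsecutiveMistakeError and captureException and the event name CONSECUTIVE_MISTAKE_ERROR come from this issue; FakeTelemetry, onConsecutiveMistakeLimit, and the argument shapes are hypothetical.

```typescript
// Hypothetical stand-in for the telemetry service. The real method signatures
// are not shown in this issue, so the parameters here are assumptions.
type MistakeReason = "no_tools_used" | "tool_repetition"

class FakeTelemetry {
    namedEvents: { event: string; taskId: string; reason: MistakeReason }[] = []
    exceptions: Error[] = []

    captureConsecutiveMistakeError(taskId: string, reason: MistakeReason): void {
        this.namedEvents.push({ event: "CONSECUTIVE_MISTAKE_ERROR", taskId, reason })
    }

    captureException(error: Error): void {
        this.exceptions.push(error)
    }
}

// After the fix: the guard reports the limit as a named event only.
function onConsecutiveMistakeLimit(telemetry: FakeTelemetry, taskId: string, reason: MistakeReason): void {
    telemetry.captureConsecutiveMistakeError(taskId, reason)
    // Removed: telemetry.captureException(new ConsecutiveMistakeError(..., reason))
}

const bug1Telemetry = new FakeTelemetry()
onConsecutiveMistakeLimit(bug1Telemetry, "task-1", "no_tools_used")
onConsecutiveMistakeLimit(bug1Telemetry, "task-1", "tool_repetition")
// bug1Telemetry.namedEvents now holds two events; bug1Telemetry.exceptions stays empty.
```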


Bug 2: Provider-level capture + re-throw → retry loop re-captures (N times per error)

Pattern present in all API providers:

// Inside provider (e.g. anthropic.ts:183, openrouter.ts:366, gemini.ts:481, etc.)
} catch (error) {
    TelemetryService.instance.captureException(apiError)  // ← captured here
    throw error  // ← re-thrown
}

// In Task.ts attemptApiRequest (line 4214), with auto-approval:
yield* this.attemptApiRequest(retryAttempt + 1)  // ← re-enters provider, which captures again on next failure

Affected providers: anthropic.ts, openrouter.ts, openai-native.ts, openai-codex.ts, gemini.ts, bedrock.ts, mistral.ts, xai.ts

Amplification: With default retry behavior, one persistent API error produces 3–6+ $exception captures. With auto-approval on (common), the multiplier is bounded only by the maximum retry depth.

Fix options (pick one):

  • Option A (preferred): Remove captureException from all provider catch blocks. Let attemptApiRequest in Task.ts capture once, only on final failure (when it gives up and surfaces to the user).
  • Option B: Add a captured flag to ApiProviderError and check it before capturing, so re-throws of the same error object are no-ops in captureException.
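
Option A could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual attemptApiRequest: requestWithRetries, callProvider, maxRetries, and the capture callback are hypothetical names, and the real code is a generator with backoff and approval logic that is omitted here.

```typescript
// Hypothetical sketch of Option A: providers throw without capturing, and the
// retry loop captures exactly once, when it gives up.
function requestWithRetries<T>(callProvider: () => T, maxRetries: number, capture: (error: unknown) => void): T {
    for (let attempt = 0; ; attempt++) {
        try {
            // The provider throws on failure and does NOT call captureException.
            return callProvider()
        } catch (error) {
            if (attempt >= maxRetries) {
                // Final failure: the single capture point for this error.
                capture(error)
                throw error
            }
            // Otherwise fall through and retry without capturing.
        }
    }
}

const optionACaptured: unknown[] = []
const optionAAttempts: number[] = []
let optionASurfaced = false
try {
    requestWithRetries(
        () => {
            optionAAttempts.push(optionAAttempts.length)
            throw new Error("429 rate limited")
        },
        3,
        (error) => optionACaptured.push(error),
    )
} catch {
    optionASurfaced = true
}
// Four attempts were made (one initial call plus three retries), one capture.
```

The error still surfaces to the caller after the last attempt, so user-facing behavior is unchanged; only the number of capture calls drops.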

Bug 3: openai-native.ts — outer/inner double-capture in single request

Files: src/api/providers/openai-native.ts:644–648 (outer) and :1127–1130 (inner)

The outer createMessage catch wraps yield* this.handleStreamResponse(...). If handleStreamResponse throws after capturing at line 1130, the outer catch at line 648 captures the same error again — two $exception calls for one stream failure, before any retries.

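Fix: handleStreamResponse should either capture without throwing or throw without capturing. Because the outer createMessage handler already captures, the simplest choice is for handleStreamResponse to throw without capturing.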
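
The throw-and-not-capture shape might look like this in miniature. The function names mirror the ones in this issue, but the bodies are hypothetical, synchronous generators are used instead of the real async streams, and the bug3Captured array stands in for captureException.

```typescript
// Hypothetical miniature of the nested handlers in openai-native.ts.
const bug3Captured: unknown[] = []

// Inner handler: throws on a stream failure and does not capture.
function* handleStreamResponse(chunks: string[]): Generator<string> {
    for (const chunk of chunks) {
        if (chunk === "<error>") {
            throw new Error("stream failed")
        }
        yield chunk
    }
}

// Outer handler: the only capture point for this request.
function* createMessage(chunks: string[]): Generator<string> {
    try {
        yield* handleStreamResponse(chunks)
    } catch (error) {
        bug3Captured.push(error)
        throw error
    }
}

const bug3Received: string[] = []
let bug3Failed = false
try {
    for (const text of createMessage(["hello", "<error>", "never sent"])) {
        bug3Received.push(text)
    }
} catch {
    bug3Failed = true
}
// One chunk was delivered before the failure, and the failure was captured once.
```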


Bug 4: openrouter.ts handleStreamingError + caller both capture

File: src/api/providers/openrouter.ts:194–206, 631–637

handleStreamingError() calls captureException(apiError) then throws a new Error(...). The caller's catch receives this new error and calls captureException again (lines 631/637). Two events for one stream error.

Fix: handleStreamingError should not call captureException; capturing should be the caller's responsibility. Alternatively, rename the method to make clear that it captures internally and that callers should not.


Proposed approach

  1. Remove captureException(new ConsecutiveMistakeError(...)) calls. captureConsecutiveMistakeError() already tracks this correctly as a named event.
  2. Single capture point policy: Capture at the boundary closest to the user (i.e., attemptApiRequest on final failure), not inside every provider catch block. Remove provider-level captureException calls, or suppress re-capture on re-throw.
  3. Fix openai-native.ts nested catch double-capture by removing captureException from handleStreamResponse.
  4. Fix openrouter.ts handleStreamingError by removing its internal captureException call (let callers capture).
  5. Consider client-side deduplication in PostHogTelemetryClient.captureException using a short-lived fingerprint cache (as recommended in PostHog docs) to catch any remaining retry-amplification cases.
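
The fingerprint cache in step 5 could be sketched as below. Everything here is an assumption for illustration: the class name, the fingerprint (error name plus message), the 10-second window, and the injectable clock. None of it is PostHog SDK API, and the real PostHogTelemetryClient may need a different fingerprint.

```typescript
// Hypothetical deduplicating wrapper around an exception sender.
class DedupingExceptionCapture {
    private lastSeen = new Map<string, number>()
    private send: (error: Error) => void
    private windowMs: number
    private now: () => number

    constructor(send: (error: Error) => void, windowMs = 10000, now: () => number = () => Date.now()) {
        this.send = send
        this.windowMs = windowMs
        this.now = now
    }

    // Forwards the error unless the same fingerprint was forwarded within the
    // window. Returns true when forwarded, false when suppressed.
    capture(error: Error): boolean {
        const currentTime = this.now()
        // Drop expired fingerprints so the cache stays small.
        this.lastSeen.forEach((seenAt, key) => {
            if (currentTime - seenAt >= this.windowMs) {
                this.lastSeen.delete(key)
            }
        })
        const fingerprint = `${error.name}:${error.message}`
        if (this.lastSeen.has(fingerprint)) {
            return false
        }
        this.lastSeen.set(fingerprint, currentTime)
        this.send(error)
        return true
    }
}

// Usage with a fake clock so the window is easy to follow.
let dedupClock = 0
const dedupForwarded: Error[] = []
const dedupCapture = new DedupingExceptionCapture((error) => dedupForwarded.push(error), 10000, () => dedupClock)
const dedupResults: boolean[] = []
dedupResults.push(dedupCapture.capture(new Error("429 rate limited"))) // first occurrence
dedupClock = 2000
dedupResults.push(dedupCapture.capture(new Error("429 rate limited"))) // retry inside the window
dedupClock = 15000
dedupResults.push(dedupCapture.capture(new Error("429 rate limited"))) // window has elapsed
```

A cache like this is a safety net for leftover retry amplification; it does not replace fixing the capture points in steps 1 to 4.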

Trade-offs / risks

  • Removing provider-level captures means that errors which never surface to attemptApiRequest (e.g. in completePrompt, which is not called from the retry loop) would need a capture point added at their call site. An audit is needed before removing them.
  • Option B (flag on the error object) is safer and more surgical; Option A is cleaner but requires confirming that all code paths are covered.
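
For comparison, Option B could look roughly like this. ApiProviderError is named in this issue, but its real shape is not shown, so the class below is a stand-in; captureOnce, providerCall, and optionBSent are hypothetical.

```typescript
// Hypothetical stand-in for ApiProviderError with a "captured" flag.
class ApiProviderError extends Error {
    captured = false

    constructor(message: string) {
        super(message)
        this.name = "ApiProviderError"
    }
}

const optionBSent: Error[] = []

// Forwards the error the first time it is seen; later calls with the same
// object are no-ops.
function captureOnce(error: ApiProviderError): void {
    if (error.captured) {
        return
    }
    error.captured = true
    optionBSent.push(error)
}

function providerCall(): never {
    const apiError = new ApiProviderError("503 upstream unavailable")
    captureOnce(apiError) // provider-level capture stays in place
    throw apiError // same object is re-thrown to the outer handler
}

try {
    providerCall()
} catch (error) {
    if (error instanceof ApiProviderError) {
        captureOnce(error) // outer handler: flag is already set, nothing is sent
    }
}
// optionBSent holds a single entry even though capture was requested twice.
```

The flag lives on the error object, so it only suppresses re-capture when the same object is re-thrown. A handler that throws a new Error, as handleStreamingError does in Bug 4, would not carry the flag over.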

App Version

Observed on current main / feature/bot-pr-human-approvals-3hbkmkjic8h7h — data from PostHog covering 2026-05-23 to 2026-06-06.
