Skip to content

fix(bigquery-analytics): stop exporting plugin-owned OTel spans (#94)#5

Merged
caohy1988 merged 1 commit into
mainfrom
fix/bqaa-no-otel-span-export
Jun 1, 2026
Merged

fix(bigquery-analytics): stop exporting plugin-owned OTel spans (#94)#5
caohy1988 merged 1 commit into
mainfrom
fix/bqaa-no-otel-span-export

Conversation

@caohy1988

Copy link
Copy Markdown
Owner

Closes the duplicate-span bug Guillaume reported on haiyuan-eng-google#94 — when Agent Engine telemetry is on (GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true), every BQAA-instrumented operation surfaces as two spans in Cloud Trace: the framework's real span plus a duplicate plugin-owned span.

Root cause

TraceManager.push_span() called tracer.start_span(...) to mint an OTel span purely as an ID carrier for BigQuery span_id / parent_span_id columns. The author was aware of one related pitfall — re-parenting the framework's own spans, fixed by not attaching the plugin span to the ambient context (cf. google#4561) — but that guard does not prevent the plugin span from going through the configured SpanProcessor → BatchSpanProcessor → Cloud Trace exporter pipeline. Once Agent Engine wires the global provider to Cloud Trace, every push/pop pair gets exported.

The plugin already tracked every parent/child relationship on its own contextvar-backed _SpanRecord stack — the OTel span object was incidental to correctness.

Fix (scoped to TraceManager + _SpanRecord)

Change Detail
_SpanRecord No longer holds an OTel span. Now: span_id, trace_id, owns_span, start_time_ns, first_token_time.
push_span Generates a 16-hex span_id directly. Inherits trace_id by precedence: top of internal stack → ambient OTel span → freshly generated 32-hex. No tracer.start_span(...) call.
attach_current_span Extracts trace_id / span_id from the ambient span as strings (the existing pattern); no longer stores the OTel object.
pop_span / clear_stack Drop .end() / OTel start_time machinery — there is no OTel span to end.
get_trace_id / get_start_time Read directly from the record.

Method signatures unchanged. The module-level tracer = trace.get_tracer(...) is retained for ABI compat (currently unused by plugin code; can be removed in a follow-up if no external consumers are found).

Cross-system correlation preserved

BigQuery rows still carry trace_id, inherited from the ambient Agent Engine / Runner span when present. Joining agent_events to Cloud Trace by trace_id continues to work end-to-end. That was the actual reason the OTel span was created in the first place — and it's preserved by extracting the trace_id from the ambient context instead of starting a new span.

Tests

Three existing tests directly asserted the bug (the "OTel spans are created and exported" contract). They're rewritten as inverted regression guards against re-introduction:

Old test New test (same code path, inverted assertion)
test_otel_integration test_push_pop_does_not_call_tracer_start_span
test_otel_integration_real_provider test_push_pop_does_not_export_spans_through_real_provider
test_clear_stack_ends_owned_spans test_clear_stack_does_not_export_spans

Each wires a real InMemorySpanExporter (or mock spy on tracer) and asserts the exporter sees zero spans / tracer.start_span is not called after a full push → pop cycle.

Four new tests lock in the lifecycle / inheritance contracts the plugin must keep:

Test Asserts
test_push_span_inherits_ambient_trace_id When an ambient OTel span exists (Agent Engine pattern), the plugin's trace_id matches it.
test_llm_request_response_share_span_id_contract Paired LLM_REQUEST / LLM_RESPONSE rows share one span_id — the documented BQAA join contract.
test_tool_starting_completed_share_span_id_contract Same invariant for the tool lifecycle pair.
test_streaming_llm_response_shares_span_id_until_final_contract Multiple partial LLM_RESPONSE callbacks reuse the same span_id; only the final fire pops. Prevents a future "dedupe" of streaming rows from breaking the contract.

226/229 plugin tests pass (6 skipped for unrelated optional deps; 0 fail). pyink + isort clean.

What's deliberately NOT in this PR

  • No changes to the BQ write path. This fix targets Cloud Trace duplication only. If Guillaume's follow-up confirms BigQuery rows are also duplicated, that's a separate write-path issue tracked separately.
  • No removal of module-level tracer symbol. Kept for ABI compat; can be cleaned up in a follow-up after a downstream-consumer scan.
  • No upstream sync (google/adk-python). PR lands on the fork first; cherry-pick upstream after review.

Diff stat

src/google/adk/plugins/bigquery_agent_analytics_plugin.py | 174 ++++++-------
tests/unittests/plugins/test_bigquery_agent_analytics_plugin.py | 282 ++++++++++++++++-----
2 files changed, 306 insertions(+), 150 deletions(-)

Refs

Test plan

  • pytest tests/unittests/plugins/test_bigquery_agent_analytics_plugin.py -q — 226 pass / 6 skip / 0 fail
  • pyink --config pyproject.toml --check — clean
  • Producers CI on PR head
  • Manual reproduction in a real Agent Engine deployment (defer to maintainer's environment — recommend Guillaume verify on next ADK patch)

…le#94)

When Agent Engine telemetry is enabled
(GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true) and Cloud Trace
export is configured on the global OTel tracer provider, the
BigQuery Agent Analytics plugin causes every instrumented operation
to appear as TWO spans in Cloud Trace — the framework's real span
plus a duplicate plugin-owned span. Reported in
GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK#94 (also at
haiyuan-eng-google#94) by a regular BQAA + Agent Engine user.

Root cause
----------
TraceManager.push_span() called tracer.start_span(...) purely as an
ID carrier, so the plugin could populate BigQuery span_id /
parent_span_id columns. The author was aware of one related class
of pitfall (re-parenting the framework's spans — fixed by not
attaching the plugin span to the ambient context, see google#4561), but
that guard does not stop the plugin span from going through the
configured SpanProcessor → BatchSpanProcessor → Cloud Trace exporter.
Once Agent Engine wires the global provider to Cloud Trace, every
push/pop pair is exported.

The plugin already tracked every parent/child relationship on its
own contextvar-backed _SpanRecord stack — the OTel span object was
incidental to correctness.

Fix (scoped to TraceManager methods + _SpanRecord)
-------------------------------------------------
* _SpanRecord no longer holds an OTel span object. It carries
  span_id, trace_id, owns_span, start_time_ns, first_token_time.
* push_span generates a 16-hex span_id directly and inherits
  trace_id by precedence: top of internal stack → ambient OTel span
  → freshly generated 32-hex. No tracer.start_span call.
* attach_current_span extracts trace_id/span_id from the ambient
  span (the existing path the framework already supports) and
  stores them as plain strings — no longer holds the OTel object.
* pop_span / clear_stack drop the .end()/.start_time machinery
  since there is no OTel span to end.
* get_trace_id / get_start_time read directly from the record.

The signatures of push_span / attach_current_span / pop_span /
clear_stack / get_trace_id / get_start_time are unchanged.
Module-level `tracer = trace.get_tracer(...)` is retained for ABI
compat (currently unused by plugin code; can be removed in a
follow-up if no external consumers are identified).

attach_current_span() is otherwise untouched — it only observed
the ambient span; it never created one. That path was already
correct and remains so.

Cross-system correlation
------------------------
BigQuery rows still carry trace_id, inherited from the ambient
Agent Engine / Runner span when present. Joining `agent_events` to
Cloud Trace by trace_id continues to work end-to-end.

Tests
-----
Three existing tests that were directly asserting the bug
(test_otel_integration, test_otel_integration_real_provider,
test_clear_stack_ends_owned_spans) are rewritten as inverted
regression guards:
  * test_push_pop_does_not_call_tracer_start_span
  * test_push_pop_does_not_export_spans_through_real_provider
  * test_clear_stack_does_not_export_spans
Each asserts that the corresponding code path does NOT export an
OTel span via an InMemorySpanExporter wired to a real provider, or
does NOT invoke tracer.start_span via a mock spy.

Four new tests added to lock in the lifecycle / inheritance
contracts the plugin must keep:
  * test_push_span_inherits_ambient_trace_id — when an ambient
    OTel span exists (the Agent Engine pattern), the plugin's
    trace_id matches it.
  * test_llm_request_response_share_span_id_contract — the
    paired LLM_REQUEST / LLM_RESPONSE rows share one span_id (the
    documented BQAA join contract).
  * test_tool_starting_completed_share_span_id_contract — same
    invariant for the tool lifecycle pair.
  * test_streaming_llm_response_shares_span_id_until_final_contract
    — multiple partial LLM_RESPONSE callbacks reuse the same
    span_id and only the final fire pops the span. Prevents a
    future "dedupe" of streaming rows from breaking the contract.

226/229 plugin tests pass (6 skipped for unrelated optional
deps); pyink + isort clean.

Refs:
  - haiyuan-eng-google/BigQuery-Agent-Analytics-SDK#94 (reported by a customer)
  - google#4561 (prior fix for span-hierarchy re-parenting)
  - google#4645 (prior fix for trace_id fracture)
@caohy1988 caohy1988 force-pushed the fix/bqaa-no-otel-span-export branch from 9c8f27d to a01758a Compare June 1, 2026 23:08
@caohy1988 caohy1988 merged commit 7d29bfa into main Jun 1, 2026
11 checks passed
@caohy1988 caohy1988 deleted the fix/bqaa-no-otel-span-export branch June 1, 2026 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant