chore(weave): Add EvaluationLoggerV2 #6675

Open
andrewtruong wants to merge 12 commits into master from andrew/eval-logger-v2

Conversation

@andrewtruong
Collaborator

@andrewtruong andrewtruong commented Apr 22, 2026

This PR adds a variant of EvaluationLogger that uses the server-side APIs instead of our client-side hack.

@wandbot-3000

wandbot-3000 Bot commented Apr 22, 2026

@codecov

codecov Bot commented Apr 22, 2026

@w-b-hivemind

w-b-hivemind Bot commented Apr 22, 2026

HiveMind Sessions

2 sessions · 17h 38m · $225

| Session | Agent | Duration | Tokens | Cost | Lines |
| --- | --- | --- | --- | --- | --- |
| EvaluationLoggerV2 Implementation and Performance (2a78ad93-0ef2-46a9-9f7c-6fee229fa02f) | claude | 17h 18m | 541.0K | $208 | +4962 -1128 |
| Run V1 Tests Against V2 Implementation (d59280cc-807b-49a9-9b11-28677b841710) | claude | 19m | 62.4K | $17 | +761 -110 |
| Total | | 17h 38m | 603.4K | $225 | +5723 -1238 |

View all sessions in HiveMind →

Run claude --resume 2a78ad93-0ef2-46a9-9f7c-6fee229fa02f to pick up where you left off.

@andrewtruong andrewtruong marked this pull request as ready for review April 23, 2026 16:05
@andrewtruong andrewtruong requested a review from a team as a code owner April 23, 2026 16:05
Contributor

@chance-wnb chance-wnb left a comment


Let's zoom into the ALTER TABLE ... UPDATE issue.

# In practice the mutation only fires when a caller mutates
# `pred.output` AFTER logging a score, or overrides via
# `pred.finish(output=...)`. For high-frequency eval workloads that
# rely on that pattern, the right long-term fix is to change
Contributor

@chance-wnb chance-wnb Apr 23, 2026


the right long-term fix is to

I'm afraid that is the correct fix, and it should happen before this ships to production traffic. I think directly attacking the issue by refactoring the AggregatingMergeTree is the better approach; otherwise, sooner or later we will pay for it.

Your mention of the right long-term fix is the bottom line :)

# `call_end` row with the create-time output, and `calls_merged` uses
# `SimpleAggregateFunction(any, ...)` for `output_dump` — so a second
# `call_end` row alone cannot win against the first one. We issue an
# `ALTER TABLE ... UPDATE` mutation on both `call_parts` (source) and
Contributor

@chance-wnb chance-wnb Apr 23, 2026


IMHO, ALTER TABLE ... UPDATE is very costly. Our calls_merged table also contains data unrelated to the eval business, so a single-row update might rewrite a multi-GB part, on every replica, while blocking merges. (An ALTER TABLE ... UPDATE mutation rewrites parts, not granules.)
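To see why the code comment says a second `call_end` row "cannot win" (and why the PR falls back to a mutation), here is a minimal Python sketch of how ClickHouse's `SimpleAggregateFunction(any, ...)` behaves during merges: the first non-null value is kept, so later inserts for the same call are simply ignored. The row values and the standalone `merge_any`/`merge_rows` helpers are illustrative simplifications, not the real `call_parts` schema or server code.

```python
# Hedged sketch: mimic SimpleAggregateFunction(any, ...) merge semantics.
# When parts merge, `any` keeps the first non-null value it has seen, so a
# later `call_end` row for the same call id cannot replace the earlier one.

def merge_any(current, incoming):
    """Keep the first non-null value, as `any` does."""
    return current if current is not None else incoming

def merge_rows(rows):
    """Collapse rows (in insert order) for one call id into a merged value."""
    merged = None
    for row in rows:
        merged = merge_any(merged, row)
    return merged

# The first call_end row carries the create-time output...
first = '{"output": "create-time"}'
# ...and a second call_end row later tries to overwrite it.
second = '{"output": "final"}'

print(merge_rows([first, second]))  # the first value still wins
```

This is exactly the situation the `ALTER TABLE ... UPDATE` works around: with `any`, inserting more rows can never change the already-materialized output.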

# output for this call. `prediction_create` already emitted a
# `call_end` row with the create-time output, and `calls_merged` uses
# `SimpleAggregateFunction(any, ...)` for `output_dump` — so a second
# `call_end` row alone cannot win against the first one. We issue an
Contributor


so a second call_end row alone cannot win

I think this could work with the current AggregatingMergeTree

  • Add updated_at DateTime64(6) to call_parts, stamped now64(6) per insert.

  • calls_merged: MODIFY COLUMN output_dump AggregateFunction(argMax, String, DateTime64(6)).

  • calls_merged_view: change from anySimpleState(output_dump) to argMaxState(output_dump, updated_at).

  • Query time: SELECT argMaxMerge(output_dump) AS output_dump.

This will require rebuilding the output_dump column, and I am not sure how realistic that is. Let's ask an expert's opinion: @gtarpenning
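A quick way to sanity-check the proposed semantics outside ClickHouse: a hedged Python sketch of `argMax`-style state merging, where each insert carries an `updated_at` stamp and merging keeps the value with the greatest timestamp, so the latest write wins. The `arg_max_merge`/`finalize` helpers and the sample rows are hypothetical, not the actual migration code.

```python
# Hedged sketch: mimic AggregateFunction(argMax, String, DateTime64) merging.
# Each inserted row carries (value, updated_at); combining two states keeps
# the value with the greater timestamp, so the latest write wins.

def arg_max_merge(a, b):
    """Combine two (value, updated_at) states, keeping the newer one."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a[1] >= b[1] else b

def finalize(rows):
    """Fold per-insert states for one call id, like argMaxMerge at query time."""
    state = None
    for row in rows:
        state = arg_max_merge(state, row)
    return state[0] if state else None

rows = [
    ('{"output": "create-time"}', 1),  # prediction_create stamps t=1
    ('{"output": "final"}', 2),        # a later finish() stamps t=2
]
print(finalize(rows))  # the newest value wins, regardless of insert order
```

Under this scheme, the second `call_end` row alone would be enough to update the output, removing the need for the `ALTER TABLE ... UPDATE` mutation entirely.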

assert ev._accumulated_predictions[1]._captured_scores == {"s": 0.5}


def test_log_example_after_finalization_raises(client, logger_cls):
Member


can we collapse these tests into like 4 that cover real user flows with multiple assertions, instead of a zillion tests each doing very simple things?

id_param_parts = pb_parts.add_param(call_id)
output_dump_param_parts = pb_parts.add_param(output_dump)
output_refs_param_parts = pb_parts.add_param(output_refs)
call_parts_query = f"""
Member


oh man we are going to do two updates? yikes

)

@ddtrace.tracer.wrap(name="clickhouse_trace_server_batched._update_call_output")
def _update_call_output(
Member


most of this belongs in a /query_builder/ file

Member

@gtarpenning gtarpenning left a comment


This seems very dangerous

Collaborator

@neutralino1 neutralino1 left a comment


The concerns from @gtarpenning and @chance-wnb make me ask: should we migrate the data model first? That is, create new tables dedicated to evals and decouple from the calls table, before we do this switchover?
