chore(weave): Add EvaluationLoggerV2 #6675

Open
andrewtruong wants to merge 12 commits into master from andrew/eval-logger-v2

Conversation

@andrewtruong
Collaborator

@andrewtruong andrewtruong commented Apr 22, 2026

This PR adds a variant of EvaluationLogger that uses the server-side APIs instead of our client-side hack.

@wandbot-3000

wandbot-3000 Bot commented Apr 22, 2026

@codecov

codecov Bot commented Apr 22, 2026

@w-b-hivemind

w-b-hivemind Bot commented Apr 22, 2026

HiveMind Sessions

2 sessions · 17h 38m · $225

| Session | Agent | Duration | Tokens | Cost | Lines |
| --- | --- | --- | --- | --- | --- |
| EvaluationLoggerV2 Implementation and Performance (2a78ad93-0ef2-46a9-9f7c-6fee229fa02f) | claude | 17h 18m | 541.0K | $208 | +4962 -1128 |
| Run V1 Tests Against V2 Implementation (d59280cc-807b-49a9-9b11-28677b841710) | claude | 19m | 62.4K | $17 | +761 -110 |
| Total | | 17h 38m | 603.4K | $225 | +5723 -1238 |

View all sessions in HiveMind →

Run claude --resume 2a78ad93-0ef2-46a9-9f7c-6fee229fa02f to pick up where you left off.

@andrewtruong andrewtruong marked this pull request as ready for review April 23, 2026 16:05
@andrewtruong andrewtruong requested a review from a team as a code owner April 23, 2026 16:05
Contributor

@chance-wnb chance-wnb left a comment


Let's zoom into the ALTER TABLE ... UPDATE issue.

# In practice the mutation only fires when a caller mutates
# `pred.output` AFTER logging a score, or overrides via
# `pred.finish(output=...)`. For high-frequency eval workloads that
# rely on that pattern, the right long-term fix is to change
Contributor

@chance-wnb chance-wnb Apr 23, 2026


the right long-term fix is to

I'm afraid that is the correct fix, and it should happen before this ships to production traffic. I think directly attacking the issue by refactoring the AggregatingMergeTree is the better approach; otherwise, sooner or later we will pay for it.

Your mention of the right long-term fix is the bottom line :)

# `call_end` row with the create-time output, and `calls_merged` uses
# `SimpleAggregateFunction(any, ...)` for `output_dump` — so a second
# `call_end` row alone cannot win against the first one. We issue an
# `ALTER TABLE ... UPDATE` mutation on both `call_parts` (source) and
Contributor

@chance-wnb chance-wnb Apr 23, 2026


IMHO, ALTER TABLE ... UPDATE is very costly. Our calls_merged table also contains data unrelated to the eval business, so a single-row update might rewrite a multi-GB part, on every replica, while blocking merges. (An ALTER TABLE ... UPDATE mutation rewrites parts, not granules.)
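To see why the code comment says a second `call_end` row "cannot win" (and why the PR falls back to a mutation), here is a minimal Python sketch of how ClickHouse's `SimpleAggregateFunction(any, ...)` behaves during merges: the first non-null value is kept, so later inserts for the same call are simply ignored. The row values and the standalone `merge_any`/`merge_rows` helpers are illustrative simplifications, not the real `call_parts` schema or server code.

```python
# Hedged sketch: mimic SimpleAggregateFunction(any, ...) merge semantics.
# When parts merge, `any` keeps the first non-null value it has seen, so a
# later `call_end` row for the same call id cannot replace the earlier one.

def merge_any(current, incoming):
    """Keep the first non-null value, as `any` does."""
    return current if current is not None else incoming

def merge_rows(rows):
    """Collapse rows (in insert order) for one call id into a merged value."""
    merged = None
    for row in rows:
        merged = merge_any(merged, row)
    return merged

# The first call_end row carries the create-time output...
first = '{"output": "create-time"}'
# ...and a second call_end row later tries to overwrite it.
second = '{"output": "final"}'

print(merge_rows([first, second]))  # the first value still wins
```

This is exactly the situation the `ALTER TABLE ... UPDATE` works around: with `any`, inserting more rows can never change the already-materialized output.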

# output for this call. `prediction_create` already emitted a
# `call_end` row with the create-time output, and `calls_merged` uses
# `SimpleAggregateFunction(any, ...)` for `output_dump` — so a second
# `call_end` row alone cannot win against the first one. We issue an
Contributor


so a second call_end row alone cannot win

I think this could work with the current AggregatingMergeTree

  • Add updated_at DateTime64(6) to call_parts, stamped now64(6) per insert.

  • calls_merged: MODIFY COLUMN output_dump AggregateFunction(argMax, String, DateTime64(6)).

  • calls_merged_view: change from anySimpleState(output_dump) to argMaxState(output_dump, updated_at).

  • Query time: SELECT argMaxMerge(output_dump) AS output_dump.

This will require rebuilding the output_dump column, and I am not sure how realistic that is. Let's ask an expert's opinion: @gtarpenning
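A quick way to sanity-check the proposed semantics outside ClickHouse: a hedged Python sketch of `argMax`-style state merging, where each insert carries an `updated_at` stamp and merging keeps the value with the greatest timestamp, so the latest write wins. The `arg_max_merge`/`finalize` helpers and the sample rows are hypothetical, not the actual migration code.

```python
# Hedged sketch: mimic AggregateFunction(argMax, String, DateTime64) merging.
# Each inserted row carries (value, updated_at); combining two states keeps
# the value with the greater timestamp, so the latest write wins.

def arg_max_merge(a, b):
    """Combine two (value, updated_at) states, keeping the newer one."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a[1] >= b[1] else b

def finalize(rows):
    """Fold per-insert states for one call id, like argMaxMerge at query time."""
    state = None
    for row in rows:
        state = arg_max_merge(state, row)
    return state[0] if state else None

rows = [
    ('{"output": "create-time"}', 1),  # prediction_create stamps t=1
    ('{"output": "final"}', 2),        # a later finish() stamps t=2
]
print(finalize(rows))  # the newest value wins, regardless of insert order
```

Under this scheme, the second `call_end` row alone would be enough to update the output, removing the need for the `ALTER TABLE ... UPDATE` mutation entirely.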

assert ev._accumulated_predictions[1]._captured_scores == {"s": 0.5}


def test_log_example_after_finalization_raises(client, logger_cls):
Member


can we collapse these tests into like 4 that cover real user flows with multiple assertions, instead of a zillion tests each doing very simple things?

id_param_parts = pb_parts.add_param(call_id)
output_dump_param_parts = pb_parts.add_param(output_dump)
output_refs_param_parts = pb_parts.add_param(output_refs)
call_parts_query = f"""
Member


oh man we are going to do two updates? yikes

)

@ddtrace.tracer.wrap(name="clickhouse_trace_server_batched._update_call_output")
def _update_call_output(
Member


most of this belongs in a /query_builder/ file

Member

@gtarpenning gtarpenning left a comment


This seems very dangerous

Collaborator

@neutralino1 neutralino1 left a comment


The concerns from @gtarpenning and @chance-wnb make me ask: should we migrate the data model first? That is, create new tables dedicated to evals and decouple from the calls table, before we do this switchover?
