fix: make async row groups lazy #729
Review: PR #729, fix: make async row groups lazy
Summary
Replaces eager
Threads the new types through
The benchmark numbers in the PR description (253 MiB → 0.018 MiB peak for 1M
Findings
Correctness: looks solid, with one ambiguous protocol contract
Backward compatibility
Validation hardening
Style / project conventions
Test coverage
Performance
Nits (non-blocking)
Verdict
Approve with minor suggestions. The refactor is well-scoped, preserves the
Greptile Summary
This PR replaces eager, fully-materialized row-group metadata (lists and dicts allocated upfront for every logical row group) with a compact, formula-based

| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/row_group_plan.py | New file introducing RowGroupPlanLike protocol, CompactRowGroupPlan, ExplicitRowGroupPlan, and normalize_row_group_plan; core of the memory optimization; dual-strategy approach for min/max stats and offset computation is correct. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Replaces eager list/dict row-group construction with CompactRowGroupPlan.fresh/resume; adds early validation guard for corrupt metadata where original_target_num_records > target_num_records. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Removes _rg_size_map and _rg_start_offset_map dicts; _get_rg_size and _get_rg_start_offset delegate to RowGroupPlanLike; diagnostics updated to use plan properties. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/completion.py | Replaces _row_group_sizes dict with _row_group_plan reference; adds _row_group_size and _row_group_size_or_default helpers; semantics preserved correctly. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py | Adds memory-regression test for large fresh async preparation asserting peak < 5 MiB. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py | Adds scheduler-preparation memory bound test; updates start-offset test to use CompactRowGroupPlan.resume instead of raw tuples + manual offset dict. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Adds tests for corrupt-metadata rejection, near-complete resume memory bounds, and half-complete boundary branch. |
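
The memory-regression tests listed in the table bound peak allocation during plan preparation. A minimal sketch of that style of check, using only the standard-library `tracemalloc` module, might look like the following. The helpers `eager_row_groups`, `compact_row_groups`, and `peak_bytes` are hypothetical stand-ins written for this illustration; they are not the PR's test code, and the row-group count is deliberately smaller than the one used in the PR.

```python
import tracemalloc


def eager_row_groups(num_records: int, buffer_size: int) -> list[tuple[int, int]]:
    """Materialize one (rg_id, size) tuple per logical row group (the old shape)."""
    num_groups = -(-num_records // buffer_size)  # ceiling division
    return [
        (rg_id, min(buffer_size, num_records - rg_id * buffer_size))
        for rg_id in range(num_groups)
    ]


def compact_row_groups(num_records: int, buffer_size: int) -> tuple[int, int]:
    """Keep only the two integers from which every size can be recomputed."""
    return (num_records, buffer_size)


def peak_bytes(build, *args) -> int:
    """Return the peak traced allocation while build(*args) runs and its result is alive."""
    tracemalloc.start()
    try:
        result = build(*args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak


# Assumed sizes for illustration only: 200,000 records in row groups of 2.
eager_peak = peak_bytes(eager_row_groups, 200_000, 2)
compact_peak = peak_bytes(compact_row_groups, 200_000, 2)
print(f"eager: {eager_peak / 2**20:.3f} MiB, compact: {compact_peak / 2**20:.3f} MiB")
```

A regression test in this style would assert that the compact peak stays under a fixed budget, in the spirit of the `peak < 5 MiB` assertion mentioned in the table.
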
Sequence Diagram
sequenceDiagram
    participant DB as DatasetBuilder
    participant PLAN as CompactRowGroupPlan
    participant CT as CompletionTracker
    participant SCHED as AsyncTaskScheduler
    alt Fresh run
        DB->>PLAN: CompactRowGroupPlan.fresh(num_records, buffer_size)
        Note over PLAN: O(1) no list/dict allocated
    else Resume run
        DB->>DB: "validate original_target_num_records <= target_num_records"
        DB->>PLAN: CompactRowGroupPlan.resume(original_target, num_records, buffer_size, completed_ids)
        Note over PLAN: Stores smaller side of frontier
    end
    DB->>CT: CompletionTracker.with_graph(graph, plan)
    DB->>SCHED: "AsyncTaskScheduler(..., row_groups=plan, ...)"
    loop Per row group
        SCHED->>PLAN: row_group_size(rg_id)
        SCHED->>PLAN: row_group_start_offset(rg_id)
        Note over PLAN: Formula-based, no dict lookup
        SCHED->>CT: mark_cell_complete(column, rg_id, row_index)
    end
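
The "formula-based, no dict lookup" notes in the diagram can be illustrated for the fresh-run case with a small sketch. `FreshPlanSketch` is a hypothetical class written for this example, not the PR's `CompactRowGroupPlan`; only the method names `row_group_size` and `row_group_start_offset` come from the diagram, and the assumption is that a fresh run splits `num_records` into consecutive groups of `buffer_size` with a possibly shorter final group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshPlanSketch:
    """Hypothetical compact plan: two integers, no per-row-group containers."""

    num_records: int
    buffer_size: int

    def __post_init__(self) -> None:
        if self.num_records < 0 or self.buffer_size <= 0:
            raise ValueError("num_records must be >= 0 and buffer_size must be > 0")

    @property
    def num_row_groups(self) -> int:
        # Ceiling division without floats.
        return -(-self.num_records // self.buffer_size)

    def _check(self, rg_id: int) -> None:
        if not 0 <= rg_id < self.num_row_groups:
            raise IndexError(f"row group {rg_id} is out of range")

    def row_group_start_offset(self, rg_id: int) -> int:
        """Offset of the first record in the row group, computed on demand."""
        self._check(rg_id)
        return rg_id * self.buffer_size

    def row_group_size(self, rg_id: int) -> int:
        """Full groups hold buffer_size records; the last one holds the remainder."""
        self._check(rg_id)
        return min(self.buffer_size, self.num_records - rg_id * self.buffer_size)


plan = FreshPlanSketch(num_records=10, buffer_size=4)
print([(plan.row_group_start_offset(i), plan.row_group_size(i))
       for i in range(plan.num_row_groups)])
```

Because every answer is derived from the two stored integers, construction cost does not grow with the number of logical row groups, which is the property the diagram labels O(1).
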
Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/fix-726-l..."
Avoid preallocating per-row-group list and dictionary metadata for huge async runs. The async builder now passes a compact row-group plan through the completion tracker and scheduler while preserving resume offsets and explicit small-list compatibility.
Fixes NVIDIA-NeMo#726
Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>
90a2323 to ff99564
johnnygreco left a comment
Review complete. I did not find any actionable issues.
I checked the compact row-group plan, resume offset behavior with holes, fresh large-run preparation, and the scheduler/tracker transition from materialized lists and dictionaries to the plan protocol. The implementation preserves row-group sizes and original offsets while avoiding the old per-row-group metadata allocation.
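
The "resume offset behavior with holes" property can be sketched as follows: when some row groups are already complete, the remaining ones keep their original IDs and start offsets rather than being renumbered. `iter_remaining_row_groups` is a hypothetical helper for illustration. It assumes the target size is unchanged between runs, and it keeps the full completed-ID set in memory, which the PR description says the real plan avoids for near-complete resumes.

```python
from collections.abc import Iterable, Iterator


def iter_remaining_row_groups(
    num_records: int, buffer_size: int, completed_ids: Iterable[int]
) -> Iterator[tuple[int, int, int]]:
    """Lazily yield (rg_id, size, start_offset) for row groups not yet completed."""
    completed = frozenset(completed_ids)
    num_groups = -(-num_records // buffer_size)  # ceiling division
    for rg_id in range(num_groups):
        if rg_id in completed:
            continue  # a hole: skipped, but later IDs and offsets are unaffected
        start = rg_id * buffer_size
        yield rg_id, min(buffer_size, num_records - start), start


# Row group 1 is already done; groups 0 and 2 keep their original offsets.
print(list(iter_remaining_row_groups(10, 4, completed_ids={1})))
# → [(0, 4, 0), (2, 2, 8)]
```
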
Verification run:
- `ruff check` on the changed files
- `ruff format --check` on the changed files
- `pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -q` (207 passed)
- `git diff --check origin/main...HEAD`
johnnygreco left a comment
Approved. My review found no actionable issues or required updates.
📋 Summary
Fixes #726 by replacing eager async row-group metadata construction with a compact row-group plan. Scheduler/tracker preparation now stays proportional to the active/sparse row groups instead of materializing list and dictionary metadata for every logical row group, while preserving resume offsets for ordered seed datasets.
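
One way to picture how the scheduler and tracker can depend on a plan instead of materialized lists and dictionaries is a small structural protocol. Everything in this sketch is hypothetical: `PlanLike`, `ExplicitPlanSketch`, and `normalize_plan_sketch` only loosely mirror the roles of the PR's `RowGroupPlanLike`, `ExplicitRowGroupPlan`, and `normalize_row_group_plan`, whose real signatures are not shown here. The cumulative-offset rule for explicit lists is an assumption made for the example.

```python
from typing import Protocol, Sequence, Union, runtime_checkable


@runtime_checkable
class PlanLike(Protocol):
    """Hypothetical protocol: the only two lookups a consumer needs."""

    def row_group_size(self, rg_id: int) -> int: ...

    def row_group_start_offset(self, rg_id: int) -> int: ...


class ExplicitPlanSketch:
    """List-backed plan for small inputs given as (rg_id, size) pairs."""

    def __init__(self, row_groups: Sequence[tuple[int, int]]) -> None:
        self._sizes: dict[int, int] = {}
        self._offsets: dict[int, int] = {}
        running = 0
        for rg_id, size in row_groups:
            self._sizes[rg_id] = size
            # Assumption: offsets accumulate in the order the pairs are given.
            self._offsets[rg_id] = running
            running += size

    def row_group_size(self, rg_id: int) -> int:
        return self._sizes[rg_id]

    def row_group_start_offset(self, rg_id: int) -> int:
        return self._offsets[rg_id]


def normalize_plan_sketch(
    row_groups: Union[PlanLike, Sequence[tuple[int, int]]],
) -> PlanLike:
    """Pass plans through unchanged; wrap raw (rg_id, size) lists."""
    if isinstance(row_groups, PlanLike):
        return row_groups
    return ExplicitPlanSketch(row_groups)


def describe(plan: PlanLike, rg_id: int) -> str:
    """A consumer that never sees how the plan stores its data."""
    start = plan.row_group_start_offset(rg_id)
    return f"rg {rg_id}: rows [{start}, {start + plan.row_group_size(rg_id)})"


print(describe(normalize_plan_sketch([(0, 4), (1, 4), (2, 2)]), 2))
# → rg 2: rows [8, 10)
```

With this shape, callers that still pass a short explicit list keep working, while large runs can supply a compact implementation of the same two methods.
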
🔗 Related Issue
Fixes #726
🔄 Changes
`original_target_num_records` exceeds the requested target.
🧪 Testing
- `make lint-engine`
- `make test-engine`: 2220 passed in 36.28s
Performance demonstration, measured locally with `tracemalloc` for 2,000,000 records, `buffer_size=2`, and 1,000,000 logical row groups:
`[(999998, 2), (999999, 2)]`; avoids retaining the near-full completed-ID set in the plan.
✅ Checklist