fix: make async row groups lazy by eric-tramel · Pull Request #729 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-06-01T23:26:54Z

📋 Summary

Fixes #726 by replacing eager async row-group metadata construction with a compact row-group plan. Scheduler/tracker preparation now stays proportional to the active/sparse row groups instead of materializing list and dictionary metadata for every logical row group, while preserving resume offsets for ordered seed datasets.

🔗 Related Issue

Fixes #726

🔄 Changes

- Add compact and explicit row-group plan types for async scheduling metadata.
- Thread row-group plans through the async builder, completion tracker, and scheduler instead of preallocating full row-group lists and lookup dictionaries.
- Preserve original resume offsets for remaining row groups with holes, and reject corrupt resume metadata where original_target_num_records exceeds the requested target.
- Add regression coverage for large fresh async preparation, near-complete resume sparsity, and resume offset behavior.
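
A minimal sketch of the idea behind a compact plan for a fresh run, assuming sizes and offsets can be derived from num_records and buffer_size alone. The class name SketchRowGroupPlan is hypothetical and this is not the PR's CompactRowGroupPlan; it ignores resume and extension handling entirely.

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRowGroupPlan:
    """Hypothetical stand-in for a formula-driven plan (fresh runs only)."""

    num_records: int
    buffer_size: int

    @property
    def total_row_groups(self) -> int:
        # Ceiling division without floats.
        return -(-self.num_records // self.buffer_size)

    def has_row_group(self, row_group: int) -> bool:
        return 0 <= row_group < self.total_row_groups

    def row_group_start_offset(self, row_group: int) -> int:
        if not self.has_row_group(row_group):
            raise KeyError(row_group)
        return row_group * self.buffer_size

    def row_group_size(self, row_group: int) -> int:
        # Every group is full except possibly the last one.
        start = self.row_group_start_offset(row_group)
        return min(self.buffer_size, self.num_records - start)

    def __iter__(self) -> Iterator[tuple[int, int]]:
        # Yields (row_group, size) lazily; nothing is stored per group.
        for row_group in range(self.total_row_groups):
            yield row_group, self.row_group_size(row_group)


plan = SketchRowGroupPlan(num_records=7, buffer_size=3)
print(list(plan))  # [(0, 3), (1, 3), (2, 1)]
```

The object holds two integers regardless of how many logical row groups it describes, which is the property the change list above relies on.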

🧪 Testing

- make lint-engine
- make test-engine: 2220 passed in 36.28s
- Unit tests added/updated
- E2E tests: N/A (engine metadata/scheduler fix)

Performance demonstration, measured locally with tracemalloc for 2,000,000 records, buffer_size=2, and 1,000,000 logical row groups:

| Scenario | Memory | Time | Notes |
| --- | --- | --- | --- |
| Simulated old list/dict metadata path | 253.3 MiB peak | 1.338s | Kept row-group list plus tracker/scheduler/offset dictionaries alive; 4,000,000 retained metadata entries. |
| New compact scheduler/tracker preparation | 0.018 MiB peak | 0.000473s | Same 1,000,000 logical row groups and 2,000,000 scheduled records. |
| New near-complete resume plan with two groups remaining | 0.001 MiB retained | 0.327970s | Remaining row groups: `[(999998, 2), (999999, 2)]`; avoids retaining the near-full completed-ID set in the plan. |
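
For readers who want to reproduce this kind of comparison, here is a rough sketch of measuring peak allocation with tracemalloc. The helpers peak_mib, eager_metadata, and lazy_metadata are hypothetical and only imitate the shape of the two paths; they are not the benchmark script behind the numbers above, and the sketch uses a smaller run so it finishes quickly.

```python
import tracemalloc


def peak_mib(build) -> float:
    """Return the peak traced allocation (MiB) while build() runs."""
    tracemalloc.start()
    try:
        result = build()  # keep the result alive while the peak is read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak / (1024 * 1024)


def eager_metadata(num_records: int, buffer_size: int):
    """Imitates the old path: one list plus lookup dicts covering every group."""
    total = -(-num_records // buffer_size)
    row_groups = [
        (rg, min(buffer_size, num_records - rg * buffer_size)) for rg in range(total)
    ]
    sizes = dict(row_groups)
    offsets = {rg: rg * buffer_size for rg, _ in row_groups}
    return row_groups, sizes, offsets


def lazy_metadata(num_records: int, buffer_size: int):
    """Imitates the compact path: keep only the parameters of the formula."""
    return (num_records, buffer_size)


eager_peak = peak_mib(lambda: eager_metadata(200_000, 2))
lazy_peak = peak_mib(lambda: lazy_metadata(200_000, 2))
print(f"eager: {eager_peak:.3f} MiB, lazy: {lazy_peak:.6f} MiB")
```

Absolute figures depend on the Python build and platform, so only the relative gap is meaningful here.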

✅ Checklist

- Follows commit message conventions
- Commits are signed off (DCO)
- Architecture docs updated: N/A (internal engine scheduling metadata representation only)

github-actions · 2026-06-01T23:29:43Z

Review: PR #729 — fix: make async row groups lazy

Summary

Replaces eager list[tuple[int,int]] + dict[int,int] row-group metadata with
a lazy plan abstraction so scheduler/tracker preparation cost no longer scales
linearly with the logical row-group count. Introduces a new module
row_group_plan.py containing:

- RowGroupPlanLike (Protocol): the scheduler-facing interface.
- CompactRowGroupPlan: formula-driven plan used for fresh runs and resume.
- ExplicitRowGroupPlan: adapter for already-materialized tuples (kept for
  test/small-caller convenience).
- normalize_row_group_plan / RowGroupInput: input coercion.

Threads the new types through AsyncTaskScheduler, CompletionTracker, and
DatasetBuilder._prepare_async_run. Drops the now-redundant
row_group_start_offsets parameter, the _rg_size_map / _rg_start_offset_map
caches in the scheduler, and the _build_row_group_start_offsets helper.
Also adds a validation guard rejecting resume metadata where
original_target_num_records > target_num_records.

The benchmark numbers in the PR description (253 MiB → 0.018 MiB peak for 1M
logical groups) are corroborated by new tracemalloc-based regression tests.

Findings

Correctness: looks solid, with one ambiguous protocol contract

- The size/offset formulas in CompactRowGroupPlan._row_group_size_for and
  row_group_start_offset reproduce the original closed-form computation from
  build_row_group_resume_plan. Original-group offsets remain
  row_group * buffer_size regardless of holes, preserving the
  resume-with-holes invariant the previous row_group_start_offsets dict
  delivered.
- The completion-density heuristic
  (valid_completed_count > total_row_groups // 2) correctly switches between
  storing scheduled IDs vs completed IDs so the in-memory filter is always
  proportional to the smaller set. The near-complete-resume regression test
  pins this behavior.
- has_row_group is the canonical membership predicate now; _get_rg_start_offset
  in AsyncTaskScheduler swallows KeyError and returns None,
  but the RowGroupPlanLike protocol declares
  row_group_start_offset(self, row_group: int) -> int with no documented
  raise contract. A short docstring (or int | None with explicit None
  semantics) on the protocol would prevent future implementations from
  silently returning 0 or similar. Minor.
- describe_known_row_groups for the compact plan returns
  "{n} scheduled of {m} total row groups" instead of the previous
  sorted(...) of all known IDs. Resulting ValueError messages from
  CompletionTracker._validate_row_group are now less specific. Acceptable
  given the materialization cost we're avoiding, but worth flagging: anyone
  grepping logs for known IDs will see different output.
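
The density heuristic described above can be sketched as follows. This is an illustration under assumptions, not the code in row_group_plan.py: the function names choose_id_filter and is_scheduled are hypothetical, and this version builds a temporary set of valid completed IDs that the real plan may well avoid.

```python
def choose_id_filter(
    total_row_groups: int, completed_ids: set[int]
) -> tuple[bool, frozenset[int]]:
    """Return (filter_includes_scheduled, id_filter), keeping the smaller side."""
    valid_completed = {rg for rg in completed_ids if 0 <= rg < total_row_groups}
    if len(valid_completed) > total_row_groups // 2:
        # Mostly complete: remember the few groups that still need to run.
        scheduled = frozenset(
            rg for rg in range(total_row_groups) if rg not in valid_completed
        )
        return True, scheduled
    # Mostly incomplete: remember the few groups that are already done.
    return False, frozenset(valid_completed)


def is_scheduled(
    row_group: int,
    total_row_groups: int,
    filter_includes_scheduled: bool,
    id_filter: frozenset[int],
) -> bool:
    """Membership test that works for either filter polarity."""
    if not 0 <= row_group < total_row_groups:
        return False
    return (row_group in id_filter) == filter_includes_scheduled


# Near-complete resume: 8 of 10 done, plus one out-of-range ID that is ignored.
print(choose_id_filter(10, {0, 1, 2, 3, 4, 5, 6, 7, 99}))
# Sparse resume: 2 of 10 done, so the completed IDs are the smaller set.
print(choose_id_filter(10, {1, 2}))
```

In both cases the retained frozenset never holds more than half of the logical row groups.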

Backward compatibility

- RowGroupResumePlan.remaining_row_groups changes type from
  list[tuple[int, int]] to CompactRowGroupPlan, and
  row_group_start_offsets is removed entirely from the dataclass. This is
  internal to data_designer.engine.dataset_builders, so the layering rule
  (interface → engine → config) protects external callers. No public surface
  appears affected.
- precomputed_row_groups/row_groups parameters are widened to
  RowGroupInput, which still accepts Sequence[tuple[int, int]], so
  existing direct callers (and the explicit-list test) continue to work.
- The _build_async resume completion check changed from
  len(completed_ids) >= resume_plan.total_row_groups to
  remaining_row_group_count == 0. The new form is actually more correct
  because the plan filters out-of-range completed IDs before counting.
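
A small worked example of why the two completion checks can disagree, assuming out-of-range IDs can appear in the completed set (for instance from stale metadata). The helper remaining_row_group_count below is a hypothetical free function written for illustration, not the plan's actual attribute.

```python
def remaining_row_group_count(total_row_groups: int, completed_ids: set[int]) -> int:
    """Count groups still to run, ignoring completed IDs outside the plan."""
    valid = sum(1 for rg in completed_ids if 0 <= rg < total_row_groups)
    return total_row_groups - valid


total = 4
completed = {0, 1, 2, 7}  # 7 is outside the range 0..3

old_check_says_done = len(completed) >= total
new_check_says_done = remaining_row_group_count(total, completed) == 0

print(old_check_says_done)  # True, although row group 3 never ran
print(new_check_says_done)  # False
```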

Validation hardening

- The new original_target_num_records > target_num_records guard in
  _load_resume_state is reasonable defense against corrupt metadata and
  has direct test coverage
  (test_build_resume_raises_when_original_target_metadata_exceeds_target).
- CompactRowGroupPlan.__post_init__ rejects negative inputs and
  num_records < original_target. Negative-extension test covers one path;
  the others are simple invariants.
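
A guard of the kind described here might look like the sketch below. The function name validate_resume_targets, the ValueError type, and the message wording are all assumptions; only the two field names and the comparison come from the review.

```python
def validate_resume_targets(
    original_target_num_records: int, target_num_records: int
) -> None:
    """Hypothetical guard: reject resume metadata that cannot be valid."""
    if original_target_num_records < 0 or target_num_records < 0:
        raise ValueError("record targets must be non-negative")
    if original_target_num_records > target_num_records:
        raise ValueError(
            "corrupt resume metadata: original_target_num_records="
            f"{original_target_num_records} exceeds "
            f"target_num_records={target_num_records}"
        )


validate_resume_targets(100, 150)  # extended run: accepted
validate_resume_targets(100, 100)  # plain resume: accepted
try:
    validate_resume_targets(200, 150)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Checking before the plan is built keeps the error message tied to the metadata file rather than to a later arithmetic failure.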

Style / project conventions

- from __future__ import annotations, SPDX header, absolute imports,
  modern type syntax, and type annotations on all members are all present.
- @dataclass(frozen=True, slots=True) with object.__setattr__ in
  __post_init__ is the correct pattern for derived fields on a frozen
  dataclass. InitVar is used appropriately for completed_ids.
- No relative imports; all imports use data_designer.engine.dataset_builders.*.
- Lazy-heavy-imports rule: row_group_plan.py only imports stdlib, so no
  concern here.

Test coverage

New tests:
- test_prepare_async_run_uses_compact_plan_for_large_fresh_runs —
  _prepare_async_run end-to-end with 1M groups, asserts
  CompactRowGroupPlan is propagated and peak memory < 5 MiB.
- test_scheduler_preparation_memory_stays_bounded_for_million_row_groups
  — full scheduler construction at 1M groups under 5 MiB.
- test_compact_row_group_plan_rejects_negative_extension — validation.
- test_row_group_resume_plan_stays_sparse_when_almost_complete —
  near-complete resume optimization.
- test_build_resume_raises_when_original_target_metadata_exceeds_target
  — corrupt-metadata guard.
Existing tests updated to use CompactRowGroupPlan.resume(...) and the
iterator/method API instead of list/dict equality.
Coverage gap, minor: I don't see an explicit test for the
intermediate completion-density case (some completed, but
<= total // 2), where the plan stores id_filter = frozenset(completed).
This branch is exercised indirectly by the existing builder tests, but a
direct unit test pinning _filter_includes_scheduled is False would make
the heuristic boundary explicit.

Performance

- Behaves as advertised: zero-allocation fresh path, sparse-allocation resume
  path. The two tracemalloc tests will catch a regression that
  reintroduces eager materialization. 5 MiB ceiling is generous enough to
  resist flakiness on shared CI without hiding real regressions.
- row_group_min_size / row_group_max_size build a tiny candidate list on
  every property access; called from diagnostics so this is fine. If they
  show up in hot paths later they can be cached at construction.

Nits (non-blocking)

- row_group_plan.py:412: a one-line comment on the
  valid_completed_count > total_row_groups // 2 branch (e.g., "store the
  smaller set: scheduled IDs when most groups are complete, completed IDs
  otherwise") would help a future reader understand the heuristic without
  re-deriving it.
- RowGroupPlanLike protocol could document row_group_size /
  row_group_start_offset raising KeyError for unknown groups
  (AsyncTaskScheduler._get_rg_size / _get_rg_start_offset rely on this).

Verdict

Approve with minor suggestions. The refactor is well-scoped, preserves the
"declare, don't orchestrate" boundary between config and engine, and has
strong regression coverage for the bug it fixes. The protocol-contract
docstring and a single intermediate-density unit test would be small
follow-ups; nothing here blocks merge.

greptile-apps · 2026-06-01T23:33:39Z

Greptile Summary

This PR replaces eager, fully-materialized row-group metadata (lists and dicts allocated upfront for every logical row group) with a compact, formula-based CompactRowGroupPlan that computes sizes and offsets on demand, cutting preparation memory from ~253 MiB to ~0.02 MiB for a 1 M row-group run. The RowGroupPlanLike protocol unifies the fresh, resume, and test-adapter code paths so the scheduler, tracker, and builder all work against a single interface.

- row_group_plan.py introduces CompactRowGroupPlan (lazy, formula-driven) and ExplicitRowGroupPlan (adapter for already-materialized tuples used in tests); CompactRowGroupPlan selects between a "store remaining IDs" and "store completed IDs" strategy based on which side of the resume frontier is smaller.
- dataset_builder.py / async_scheduler.py / completion.py thread RowGroupPlanLike throughout, removing the pre-built _rg_size_map, _rg_start_offset_map, and row_group_start_offsets dicts.
- A new metadata validation guard rejects corrupt resume state where original_target_num_records exceeds target_num_records before attempting to build the plan.

Confidence Score: 5/5

Safe to merge — the refactoring is a clean substitution of eager list/dict allocations with a protocol-backed lazy plan; all production code paths use CompactRowGroupPlan and no old eager path remains reachable.

The offset formula in CompactRowGroupPlan correctly reproduces the original sequential computation for both original and extension groups, the dual-filter strategy keeps retained state proportional to the sparse side, and the updated tests cover fresh runs, near-complete resumes, sparse resumes, and corrupt-metadata rejection. No behavioral regressions were identified.

No files require special attention. The only subtlety to be aware of is that ExplicitRowGroupPlan assigns sequential start offsets while CompactRowGroupPlan preserves original offsets via formula; this distinction is intentional and tested.
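
The difference between the two offset schemes can be seen on a small resume with holes. This sketch assumes "sequential" means a running sum over the listed groups and "original" means row_group * buffer_size, as stated elsewhere in this thread; it uses plain dictionaries for illustration rather than either plan class.

```python
from itertools import accumulate

buffer_size = 2
# Row groups 0 and 2 already completed, so only 1 and 3 remain.
remaining = [(1, 2), (3, 2)]

# Explicit style: offsets follow the order of the supplied list.
sizes = [size for _, size in remaining]
sequential_offsets = dict(
    zip((rg for rg, _ in remaining), accumulate([0] + sizes[:-1]))
)

# Compact style: offsets follow each group's original position.
original_offsets = {rg: rg * buffer_size for rg, _ in remaining}

print(sequential_offsets)  # {1: 0, 3: 2}
print(original_offsets)    # {1: 2, 3: 6}
```

For an ordered seed dataset, only the second mapping points each remaining group back at the records it was originally assigned.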

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/row_group_plan.py | New file introducing RowGroupPlanLike protocol, CompactRowGroupPlan, ExplicitRowGroupPlan, and normalize_row_group_plan; core of the memory optimization; dual-strategy approach for min/max stats and offset computation is correct. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Replaces eager list/dict row-group construction with CompactRowGroupPlan.fresh/resume; adds early validation guard for corrupt metadata where original_target_num_records > target_num_records. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Removes _rg_size_map and _rg_start_offset_map dicts; _get_rg_size and _get_rg_start_offset delegate to RowGroupPlanLike; diagnostics updated to use plan properties. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/completion.py | Replaces _row_group_sizes dict with _row_group_plan reference; adds _row_group_size and _row_group_size_or_default helpers; semantics preserved correctly. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py | Adds memory-regression test for large fresh async preparation asserting peak < 5 MiB. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py | Adds scheduler-preparation memory bound test; updates start-offset test to use CompactRowGroupPlan.resume instead of raw tuples + manual offset dict. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Adds tests for corrupt-metadata rejection, near-complete resume memory bounds, and half-complete boundary branch. |

Sequence Diagram

sequenceDiagram
    participant DB as DatasetBuilder
    participant PLAN as CompactRowGroupPlan
    participant CT as CompletionTracker
    participant SCHED as AsyncTaskScheduler

    alt Fresh run
        DB->>PLAN: CompactRowGroupPlan.fresh(num_records, buffer_size)
        Note over PLAN: O(1) no list/dict allocated
    else Resume run
        DB->>DB: "validate original_target_num_records <= target_num_records"
        DB->>PLAN: CompactRowGroupPlan.resume(original_target, num_records, buffer_size, completed_ids)
        Note over PLAN: Stores smaller side of frontier
    end
    DB->>CT: CompletionTracker.with_graph(graph, plan)
    DB->>SCHED: "AsyncTaskScheduler(..., row_groups=plan, ...)"
    loop Per row group
        SCHED->>PLAN: row_group_size(rg_id)
        SCHED->>PLAN: row_group_start_offset(rg_id)
        Note over PLAN: Formula-based, no dict lookup
        SCHED->>CT: mark_cell_complete(column, rg_id, row_index)
    end

Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/fix-726-l..."

Avoid preallocating per-row-group list and dictionary metadata for huge async runs. The async builder now passes a compact row-group plan through the completion tracker and scheduler while preserving resume offsets and explicit small-list compatibility.

Fixes NVIDIA-NeMo#726

Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>

johnnygreco

Review complete. I did not find any actionable issues.

I checked the compact row-group plan, resume offset behavior with holes, fresh large-run preparation, and the scheduler/tracker transition from materialized lists and dictionaries to the plan protocol. The implementation preserves row-group sizes and original offsets while avoiding the old per-row-group metadata allocation.

Verification run:

ruff check on the changed files
ruff format --check on the changed files
pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -q (207 passed)
git diff --check origin/main...HEAD

johnnygreco

Approved. My review found no actionable issues or required updates.

eric-tramel requested a review from a team as a code owner June 1, 2026 23:26

eric-tramel temporarily deployed to agentic-ci June 1, 2026 23:27 — with GitHub Actions Inactive

eric-tramel self-assigned this Jun 1, 2026

eric-tramel added the 🏎️ performance label Jun 1, 2026

eric-tramel force-pushed the codex/fix-726-lazy-row-groups branch from 90a2323 to ff99564 Compare June 1, 2026 23:43

johnnygreco reviewed Jun 2, 2026

johnnygreco approved these changes Jun 2, 2026

Merge branch 'main' into codex/fix-726-lazy-row-groups

538ed1d

eric-tramel merged commit 1baebd0 into NVIDIA-NeMo:main Jun 2, 2026
60 checks passed

eric-tramel merged 2 commits into NVIDIA-NeMo:main from eric-tramel:codex/fix-726-lazy-row-groups
