Public async runs cannot tune row-group horizon/adaptive admission #741

@eric-tramel

Priority Level

High

Describe the bug

Public async fire-and-forget settings do not expose a row-group horizon or adaptive row-group admission policy. Users can tune RunConfig.buffer_size, RunConfig.max_in_flight_tasks, and model/provider request caps, but they cannot control how many row groups are actively admitted or whether that horizon adapts to DAG shape, latency tails, and endpoint capacity.

As a result, large async runs can leave fast endpoints idle or delay durable row completion even when public concurrency settings look generous. The useful setting is workload-dependent: a smaller fixed horizon tends to checkpoint earlier but under-exposes ready work, while a wider/adaptive horizon can improve endpoint occupancy but may delay first checkpoint or leave too many row groups open.

This makes it hard for public API users to choose a setting that reliably maximizes durable record completion and endpoint utilization across dependency chains, fan-in/fan-out DAGs, mixed latency, and retry-heavy workloads.

Steps/Code to reproduce bug

  1. Configure a large async dataset with many records and multiple LLM/model columns.
  2. Use either a dependency chain or fan-in/fan-out DAG so later columns depend on earlier model output.
  3. Set only public fire-and-forget controls, for example:
run_config = RunConfig(
    buffer_size=16,
    max_in_flight_tasks=2048,
)

# Configure model/provider caps through the public model config surface,
# then run DataDesigner.create(...) with the run config above.
  4. Increase buffer_size, max_in_flight_tasks, and model/provider request caps across a few runs.
  5. Observe that public settings do not provide a way to directly choose or adapt the active row-group horizon. Depending on workload shape, the run may either checkpoint slowly with low endpoint occupancy or expose more work while delaying durable row completion.

Expected behavior

Users running large async synthetic data jobs should have a documented, supported way to let DataDesigner choose, expose, or adapt the active row-group horizon from declared workload shape and model capacity.

A public fire-and-forget user should be able to reason about the tradeoff between early durable checkpoints and endpoint occupancy without reaching into private scheduler internals.

Agent Diagnostic / Prior Investigation

Important observations from bounded large-job probes:

  • A default/fixed small row-group horizon can leave fast endpoints idle even when buffer_size, max_in_flight_tasks, and model caps are set high enough to suggest more work should run.
  • Wider/adaptive row-group admission can improve endpoint occupancy, especially when some columns are slow, long-tail, or cooling down after retryable responses, but it may delay first checkpoint and keep older row groups open longer.
  • In a same-workload comparison, a default-like fixed small horizon checkpointed records but did not maximize endpoint occupancy; a wider/adaptive policy improved occupancy but could produce no full row-group checkpoint inside the same short observation window.
  • In a practical fan-in DAG, a fixed small horizon made visible record progress while leaving a fast endpoint mostly idle; a wider horizon fed endpoints better but produced no checkpointed records in the same timebox.
  • Public-only sweeps over buffer_size, max_in_flight_tasks, and model caps did not find a setting that consistently achieved both high endpoint utilization and high durable row throughput.
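The first observation can be illustrated with a back-of-the-envelope bound. This is a toy model, not the scheduler's actual logic: it assumes that only admitted row groups expose tasks, that each row group holds a fixed number of rows, and that each row has a fixed number of ready columns at a time. The function name and all parameters are hypothetical.

```python
def occupancy_upper_bound(
    horizon: int,
    rows_per_group: int,
    endpoint_capacity: int,
    max_in_flight_tasks: int,
    ready_columns_per_row: int = 1,
) -> float:
    """Upper bound on the fraction of endpoint slots that can be busy.

    Toy assumption: work is only exposed from admitted row groups, so the
    number of dispatchable tasks is capped by horizon * rows_per_group.
    """
    args = (horizon, rows_per_group, endpoint_capacity,
            max_in_flight_tasks, ready_columns_per_row)
    if min(args) < 1:
        raise ValueError("all arguments must be >= 1")
    # Tasks the admitted row groups can expose at once.
    exposed = horizon * rows_per_group * ready_columns_per_row
    # The global in-flight cap only matters if it is the tighter limit.
    dispatchable = min(exposed, max_in_flight_tasks)
    return min(1.0, dispatchable / endpoint_capacity)


# Values mirror the repro settings; endpoint_capacity=256 is made up.
for horizon in (1, 2, 8, 32):
    bound = occupancy_upper_bound(
        horizon,
        rows_per_group=16,
        endpoint_capacity=256,
        max_in_flight_tasks=2048,
    )
    print(f"horizon={horizon:>2}  occupancy <= {bound:.3f}")
```

Under these assumptions, raising max_in_flight_tasks or the endpoint cap does not raise the bound while the horizon is the tighter limit, which is consistent with the public-only sweeps above.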

API/config surface check:

  • RunConfig exposes buffer_size and max_in_flight_tasks, but not max_concurrent_row_groups, an adaptive row-group admission toggle, an adaptive initial target, or a max-admitted-row guard.
  • The public builder path constructs the async scheduler without passing row-group horizon/adaptive admission arguments, so users inherit the hidden scheduler default.
  • The scheduler has internal controls for row-group horizon/adaptive admission, but those are not part of the supported public run configuration.

Additional context

Related open issues exist but do not appear to be clear duplicates:

This issue is specifically about the missing public row-group horizon/adaptive admission control and default policy.

Suggested fix

Expose row-group horizon/adaptive row-group admission in RunConfig, for example through a supported policy field rather than requiring direct scheduler construction.
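One possible shape for such a policy field is sketched below. Everything here is hypothetical: RowGroupAdmissionPolicy is not an existing DataDesigner class, the field names are borrowed from the missing controls listed in the API/config surface check, and the default of 2 is a placeholder rather than the scheduler's real default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowGroupAdmissionPolicy:
    """Hypothetical public policy object; not part of the current API."""

    mode: str = "fixed"                      # "fixed" or "adaptive"
    max_concurrent_row_groups: int = 2       # placeholder default, horizon ceiling
    adaptive_initial_target: Optional[int] = None
    max_admitted_rows: Optional[int] = None  # optional guard on open rows

    def __post_init__(self) -> None:
        if self.mode not in ("fixed", "adaptive"):
            raise ValueError("mode must be 'fixed' or 'adaptive'")
        if self.max_concurrent_row_groups < 1:
            raise ValueError("max_concurrent_row_groups must be >= 1")
        if self.adaptive_initial_target is not None:
            if self.mode != "adaptive":
                raise ValueError("adaptive_initial_target requires mode='adaptive'")
            if not 1 <= self.adaptive_initial_target <= self.max_concurrent_row_groups:
                raise ValueError("adaptive_initial_target must be within [1, ceiling]")
        if self.max_admitted_rows is not None and self.max_admitted_rows < 1:
            raise ValueError("max_admitted_rows must be >= 1")

    def initial_horizon(self) -> int:
        """Horizon to start a run with under this policy."""
        if self.mode == "fixed":
            return self.max_concurrent_row_groups
        # Adaptive runs start at the declared target, or narrow if none is given.
        return self.adaptive_initial_target or 1


policy = RowGroupAdmissionPolicy(
    mode="adaptive",
    max_concurrent_row_groups=8,
    adaptive_initial_target=2,
)
print(policy.initial_horizon())
```

Validating at construction time would let a misconfigured policy fail before the run starts instead of surfacing as unexplained idle endpoints.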

Add an adaptive default or diagnostics-driven policy that can balance endpoint occupancy against first-checkpoint latency and oldest-open-row-group age for common large async DAG shapes.
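As a rough illustration of what an adaptive policy could look like, the sketch below uses an additive-increase, multiplicative-decrease rule: widen the horizon by one while endpoints are under-occupied, and halve it when the oldest open row group gets too old. This is one simple option, not the scheduler's existing internal logic; the function name, thresholds, and defaults are all assumptions.

```python
def next_horizon(
    current: int,
    occupancy: float,
    oldest_open_age_s: float,
    *,
    target_occupancy: float = 0.9,
    max_open_age_s: float = 300.0,
    min_horizon: int = 1,
    max_horizon: int = 32,
) -> int:
    """Propose the next row-group horizon from two observed signals.

    occupancy is the busy fraction of endpoint slots in [0, 1];
    oldest_open_age_s is the age of the oldest row group not yet checkpointed.
    """
    if oldest_open_age_s > max_open_age_s:
        # Durability is suffering: back off quickly so open groups can drain.
        proposed = current // 2
    elif occupancy < target_occupancy:
        # Endpoints are idle and nothing is too old: expose one more row group.
        proposed = current + 1
    else:
        proposed = current
    # Keep the result inside the configured bounds.
    return max(min_horizon, min(max_horizon, proposed))


horizon = 2
for occupancy, age_s in [(0.3, 20.0), (0.5, 60.0), (0.7, 400.0), (0.95, 100.0)]:
    horizon = next_horizon(horizon, occupancy, age_s)
    print(f"occupancy={occupancy:.2f} oldest_open={age_s:.0f}s -> horizon={horizon}")
```

Checking the age signal first means the sketch prefers durable completion over occupancy when the two conflict; a real default might weigh first-checkpoint latency as well.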

Surface first-checkpoint, oldest-open-row-group, active-row-group count, row-group admission blocked reason, and configured/effective horizon diagnostics so users can tune without private scheduler instrumentation.
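The diagnostics listed above could be carried in a single snapshot record. The sketch below is hypothetical: RowGroupDiagnostics and tuning_hints do not exist in DataDesigner, the field names simply follow the list in this section, and the 300-second age limit is an arbitrary example value.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RowGroupDiagnostics:
    """Hypothetical per-run snapshot of row-group admission state."""

    first_checkpoint_s: Optional[float]      # None until a row group checkpoints
    oldest_open_row_group_age_s: float
    active_row_groups: int
    admission_blocked_reason: Optional[str]  # None when admission is not blocked
    configured_horizon: int
    effective_horizon: int


def tuning_hints(d: RowGroupDiagnostics, max_open_age_s: float = 300.0) -> List[str]:
    """Turn a snapshot into short, human-readable tuning hints."""
    hints = []
    if d.first_checkpoint_s is None:
        hints.append("no row group has checkpointed yet")
    if d.oldest_open_row_group_age_s > max_open_age_s:
        hints.append("oldest open row group exceeds the age limit")
    if d.effective_horizon < d.configured_horizon:
        hints.append("effective horizon is below the configured horizon")
    if d.admission_blocked_reason is not None:
        hints.append(f"admission blocked: {d.admission_blocked_reason}")
    return hints


# Example values are invented for illustration only.
snapshot = RowGroupDiagnostics(
    first_checkpoint_s=None,
    oldest_open_row_group_age_s=412.0,
    active_row_groups=8,
    admission_blocked_reason="max_admitted_rows reached",
    configured_horizon=8,
    effective_horizon=8,
)
print(json.dumps(asdict(snapshot), sort_keys=True))
for hint in tuning_hints(snapshot):
    print("hint:", hint)
```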

Checklist

  • Searched existing open issues for a clear duplicate.
  • Kept the report anonymized and excluded local paths, hostnames, usernames, branch names, temp harness names, output directory names, commit hashes, and machine-specific setup details.
  • Used the repo bug report section format.
  • Included a concrete suggested fix.

Labels: bug
