Public async runs cannot tune row-group horizon/adaptive admission #741

### Priority Level

High

### Describe the bug

Public async fire-and-forget settings do not expose a row-group horizon or adaptive row-group admission policy. Users can tune `RunConfig.buffer_size`, `RunConfig.max_in_flight_tasks`, and model/provider request caps, but they cannot control how many row groups are actively admitted or whether that horizon adapts to DAG shape, latency tails, and endpoint capacity.

As a result, large async runs can leave fast endpoints idle or delay durable row completion even when public concurrency settings look generous. The useful setting is workload-dependent: a smaller fixed horizon tends to checkpoint earlier but under-exposes ready work, while a wider/adaptive horizon can improve endpoint occupancy but may delay first checkpoint or leave too many row groups open.

Without a supported horizon control, public API users cannot pick a setting that reliably maximizes durable record completion and endpoint utilization across dependency chains, fan-in/fan-out DAGs, mixed latency, and retry-heavy workloads.
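The horizon tradeoff described above can be illustrated with a toy model. This is a minimal sketch, not DataDesigner's scheduler: it assumes every row is a chain of one-tick model calls, the endpoint serves a fixed number of calls per tick, and ready rows are dispatched round-robin across admitted row groups. The function name and all parameters are hypothetical.

```python
def simulate(num_groups, rows_per_group, stages, capacity, horizon):
    """Toy tick-based model of row-group admission (illustrative only)."""
    if horizon < 1 or capacity < 1:
        raise ValueError("horizon and capacity must be >= 1")
    # remaining[g][r] = number of sequential calls row r of group g still needs
    remaining = [[stages] * rows_per_group for _ in range(num_groups)]
    active, next_group = [], 0
    tick = used_slots = completed = 0
    first_checkpoint = None
    while completed < num_groups:
        # Admit row groups until the horizon is full.
        while len(active) < horizon and next_group < num_groups:
            active.append(next_group)
            next_group += 1
        # Round-robin across admitted groups: row 0 of each group, then row 1, ...
        ready = [(g, r) for r in range(rows_per_group) for g in active
                 if remaining[g][r] > 0]
        tick += 1
        for g, r in ready[:capacity]:  # the endpoint serves `capacity` calls per tick
            remaining[g][r] -= 1
            used_slots += 1
        # A row group checkpoints only when every one of its rows is finished.
        for g in list(active):
            if all(left == 0 for left in remaining[g]):
                active.remove(g)
                completed += 1
                if first_checkpoint is None:
                    first_checkpoint = tick
    return {
        "first_checkpoint_tick": first_checkpoint,
        "total_ticks": tick,
        "occupancy": used_slots / (tick * capacity),
    }


narrow = simulate(num_groups=8, rows_per_group=4, stages=3, capacity=16, horizon=1)
wide = simulate(num_groups=8, rows_per_group=4, stages=3, capacity=16, horizon=8)
print(narrow)  # first checkpoint at tick 3, 24 ticks total, occupancy 0.25
print(wide)    # first checkpoint at tick 6, 6 ticks total, occupancy 1.0
```

In this toy setting the narrow horizon checkpoints first but leaves three quarters of the endpoint idle, while the wide horizon saturates the endpoint and delays the first checkpoint. Real workloads add latency tails and retries, which this sketch does not model.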

### Steps/Code to reproduce bug

1. Configure a large async dataset with many records and multiple LLM/model columns.
2. Use either a dependency chain or fan-in/fan-out DAG so later columns depend on earlier model output.
3. Set only public fire-and-forget controls, for example:

```python
run_config = RunConfig(
    buffer_size=16,
    max_in_flight_tasks=2048,
)

# Configure model/provider caps through the public model config surface,
# then run DataDesigner.create(...) with the run config above.
```

4. Increase `buffer_size`, `max_in_flight_tasks`, and model/provider request caps across a few runs.
5. Observe that public settings do not provide a way to directly choose or adapt the active row-group horizon. Depending on workload shape, the run may either checkpoint slowly with low endpoint occupancy or expose more work while delaying durable row completion.
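The sweep in step 4 can be enumerated along these lines. This is a sketch only: the value ranges are illustrative rather than the ones used in the probes, `sweep_grid` is a hypothetical helper, and model/provider caps are left out because they are configured through the model config surface rather than `RunConfig`.

```python
from itertools import product

# Illustrative values only; choose ranges that suit the workload under test.
BUFFER_SIZES = (16, 64, 256)
MAX_IN_FLIGHT_TASKS = (512, 2048)


def sweep_grid(buffer_sizes=BUFFER_SIZES, max_in_flight=MAX_IN_FLIGHT_TASKS):
    """Return one kwargs dict per combination of the two public RunConfig knobs."""
    return [
        {"buffer_size": b, "max_in_flight_tasks": m}
        for b, m in product(buffer_sizes, max_in_flight)
    ]


grid = sweep_grid()
for kwargs in grid:
    print(kwargs)  # each dict is intended to be passed as RunConfig(**kwargs)
```

None of the combinations touches the row-group horizon, which is the gap this report describes.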

### Expected behavior

Users running large async synthetic data jobs should have a documented, supported way to have DataDesigner choose or adapt the active row-group horizon from declared workload shape and model capacity, or to expose that horizon for direct tuning.

A public fire-and-forget user should be able to reason about the tradeoff between early durable checkpoints and endpoint occupancy without reaching into private scheduler internals.

### Agent Diagnostic / Prior Investigation

Important observations from bounded large-job probes:

- A default/fixed small row-group horizon can leave fast endpoints idle even when `buffer_size`, `max_in_flight_tasks`, and model caps are set high enough to suggest more work should run.
- Wider/adaptive row-group admission can improve endpoint occupancy, especially when some columns are slow, long-tail, or cooling down after retryable responses, but it may delay first checkpoint and keep older row groups open longer.
- In a same-workload comparison, a default-like fixed small horizon checkpointed records but did not maximize endpoint occupancy; a wider/adaptive policy improved occupancy but could produce no full row-group checkpoint inside the same short observation window.
- In a practical fan-in DAG, a fixed small horizon made visible record progress while leaving a fast endpoint mostly idle; a wider horizon fed endpoints better but produced no checkpointed records in the same timebox.
- Public-only sweeps over `buffer_size`, `max_in_flight_tasks`, and model caps did not find a setting that consistently achieved both high endpoint utilization and high durable row throughput.

API/config surface check:

- `RunConfig` exposes `buffer_size` and `max_in_flight_tasks`, but not `max_concurrent_row_groups`, an adaptive row-group admission toggle, an adaptive initial target, or a max-admitted-row guard.
- The public builder path constructs the async scheduler without passing row-group horizon/adaptive admission arguments, so users inherit the hidden scheduler default.
- The scheduler has internal controls for row-group horizon/adaptive admission, but those are not part of the supported public run configuration.

### Additional context

Related open issues exist but do not appear to be clear duplicates:

- #727 covers durable async capacity diagnostics for explaining endpoint idle time.
- #700 covers disambiguating `buffer_size` from scheduler work admission semantics.
- #645 tracks the broader async scheduling/resource metadata epic.

This issue is specifically about the missing public row-group horizon/adaptive admission control and default policy.

### Suggested fix

Expose row-group horizon/adaptive row-group admission in `RunConfig`, for example through a supported policy field rather than requiring direct scheduler construction.
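One possible shape for such a policy field is sketched below. Everything here is hypothetical: `RowGroupAdmissionPolicy` and its field names do not exist in `RunConfig` today; they only mirror the missing controls listed in the API/config surface check above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowGroupAdmissionPolicy:
    """Hypothetical public policy object (not part of the current API)."""

    max_concurrent_row_groups: Optional[int] = None  # None = keep the scheduler default
    adaptive: bool = False                           # let the scheduler widen or narrow the horizon
    adaptive_initial_target: Optional[int] = None    # starting horizon when adaptive is on
    max_admitted_rows: Optional[int] = None          # guard on the total rows held open

    def __post_init__(self):
        for name in ("max_concurrent_row_groups", "adaptive_initial_target",
                     "max_admitted_rows"):
            value = getattr(self, name)
            if value is not None and value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")
        if self.adaptive_initial_target is not None and not self.adaptive:
            raise ValueError("adaptive_initial_target requires adaptive=True")
        if (self.max_concurrent_row_groups is not None
                and self.adaptive_initial_target is not None
                and self.adaptive_initial_target > self.max_concurrent_row_groups):
            raise ValueError("adaptive_initial_target exceeds max_concurrent_row_groups")


# A fixed horizon favoring early checkpoints, and an adaptive one with a ceiling.
fixed = RowGroupAdmissionPolicy(max_concurrent_row_groups=2)
adaptive = RowGroupAdmissionPolicy(
    adaptive=True, adaptive_initial_target=4, max_concurrent_row_groups=32
)
```

Validating at construction time would let a misconfigured policy fail before a long run starts instead of surfacing as unexplained idle endpoints.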

Add an adaptive default or diagnostics-driven policy that can balance endpoint occupancy against first-checkpoint latency and oldest-open-row-group age for common large async DAG shapes.
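As a rough illustration of balancing occupancy against open-row-group age, the sketch below uses an additive-increase/multiplicative-decrease rule. The rule is chosen here for illustration only; this report does not prescribe an algorithm, and the function name, thresholds, and defaults are all assumptions.

```python
def next_horizon(
    current,
    occupancy,
    oldest_open_age_s,
    *,
    min_horizon=1,
    max_horizon=64,
    target_occupancy=0.85,
    max_open_age_s=300.0,
):
    """Return the row-group horizon to use for the next admission decision.

    Hypothetical AIMD-style rule:
    - if the oldest open row group is too old, halve the horizon so existing
      groups can finish and checkpoint;
    - otherwise, if endpoints are under-occupied, admit one more row group;
    - otherwise keep the horizon unchanged.
    """
    if oldest_open_age_s > max_open_age_s:
        return max(min_horizon, current // 2)
    if occupancy < target_occupancy:
        return min(max_horizon, current + 1)
    return current


print(next_horizon(4, occupancy=0.50, oldest_open_age_s=10.0))   # 5: widen, endpoints idle
print(next_horizon(8, occupancy=0.50, oldest_open_age_s=400.0))  # 4: narrow, old group stuck
print(next_horizon(4, occupancy=0.90, oldest_open_age_s=10.0))   # 4: hold
```

Checking age before occupancy means durable progress takes priority whenever both signals fire, which matches the concern about row groups staying open too long.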

Surface first-checkpoint, oldest-open-row-group, active-row-group count, row-group admission blocked reason, and configured/effective horizon diagnostics so users can tune without private scheduler instrumentation.
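The diagnostics listed above could be reported as a single snapshot record, for example as sketched below. The class, field names, and the blocked-reason string are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RowGroupDiagnostics:
    """Hypothetical per-run snapshot of row-group admission state."""

    first_checkpoint_s: Optional[float]      # None until the first row group checkpoints
    oldest_open_row_group_age_s: float
    active_row_group_count: int
    admission_blocked_reason: Optional[str]  # e.g. "horizon_full"; None when not blocked
    configured_horizon: Optional[int]        # None when the user left the default
    effective_horizon: int                   # what the scheduler is actually using

    def horizon_saturated(self) -> bool:
        """True when admission is limited by the horizon, not by missing work."""
        return self.active_row_group_count >= self.effective_horizon


snapshot = RowGroupDiagnostics(
    first_checkpoint_s=None,
    oldest_open_row_group_age_s=42.5,
    active_row_group_count=4,
    admission_blocked_reason="horizon_full",
    configured_horizon=None,
    effective_horizon=4,
)
print(asdict(snapshot))
print(snapshot.horizon_saturated())
```

Reporting the configured and effective horizon side by side would show users when they are inheriting a default they never set.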

### Checklist

- [x] Searched existing open issues for a clear duplicate.
- [x] Kept the report anonymized and excluded local paths, hostnames, usernames, branch names, temp harness names, output directory names, commit hashes, and machine-specific setup details.
- [x] Used the repo bug report section format.
- [x] Included a concrete suggested fix.
