fix: prevent request admission timeout row drops #730
Code Review: PR #730 —
820e795 to 0c57a64
Greptile Summary
This PR fixes Issue #725, where request-admission queue timeouts were misclassified as generic provider timeouts and caused dropped rows. It introduces
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Renames rate-limit-specific fields/methods to general preserved retryable equivalents; adds ModelRequestAdmissionTimeoutError to the preserved set; adds asyncio.sleep(0) yield points after dispatch; fixes queue_empty event ordering; integrates per-model request resource limits into task admission |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py | Adds per-model request resource keys to SchedulerResourceRequest for each model task, and builds a request_resource_limits map from generator metadata weights |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py | Widens SchedulerResourceKey from a Literal type to str; adds request_scheduler_resource_key() helper; updates post_init validation |
| packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py | Extracts _provider_error_from_request_admission() to distinguish queue_timeout from other RequestAdmissionErrors; updates _should_retry to use the classified kind |
| packages/data-designer-engine/src/data_designer/engine/models/errors.py | Adds ModelRequestAdmissionTimeoutError and maps REQUEST_ADMISSION_TIMEOUT to it with a descriptive user-facing message |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py | Adds tests for request admission timeout preservation, pacing, and request resource admission; parameterizes rate-limit tests over both error kinds |
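
To make the per-model request resource limit concrete, here is a minimal sketch of how a `request:provider/model` key could gate task admission. The key format follows the `request:provider/model` label used in this PR's sequence diagram. The function signature, the `ResourceLedger` class, and the rule that keys without a configured limit are unbounded are all assumptions for illustration, not the code in `resources.py` or `async_scheduler.py`.

```python
from collections import Counter


def request_resource_key(provider: str, model: str) -> str:
    # Hypothetical helper: builds a key shaped like "request:provider/model".
    return f"request:{provider}/{model}"


class ResourceLedger:
    """Toy admission ledger (assumption, not the real TaskAdmissionController)."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._limits = dict(limits)
        self._in_use: Counter = Counter()

    def is_eligible(self, keys: list[str]) -> bool:
        # A task is eligible only if every limited key it needs has spare capacity.
        # Keys with no configured limit are treated as unbounded (assumption).
        return all(
            key not in self._limits or self._in_use[key] < self._limits[key]
            for key in keys
        )

    def acquire(self, keys: list[str]) -> bool:
        # Admit the task and record its usage, or refuse so the caller can defer it.
        if not self.is_eligible(keys):
            return False
        for key in keys:
            self._in_use[key] += 1
        return True

    def release(self, keys: list[str]) -> None:
        # Return capacity once the task finishes.
        for key in keys:
            self._in_use[key] -= 1


key = request_resource_key("example-provider", "example-model")
ledger = ResourceLedger({key: 2})
admitted = [ledger.acquire([key]) for _ in range(3)]
print(admitted)
```

In this sketch the third task is deferred rather than dispatched, so it never reaches the request-admission queue where it could time out.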
Sequence Diagram
sequenceDiagram
participant S as AsyncTaskScheduler
participant TAC as TaskAdmissionController
participant MRE as ModelRequestExecutor
participant RAC as RequestAdmissionController
Note over S,TAC: Task Admission (new per-model resource limit)
S->>TAC: is_eligible(task, view)?
TAC-->>S: check submission, llm_wait, request:provider/model
alt resource at limit
TAC-->>S: not eligible - defer dispatch
else within limit
TAC-->>S: eligible
S->>MRE: agenerate(data)
S->>S: await asyncio.sleep(0)
end
Note over MRE,RAC: Error Classification
MRE->>RAC: acquire_async(item)
alt queue_timeout
RAC-->>MRE: RequestAdmissionError(queue_timeout)
MRE-->>S: ProviderError(REQUEST_ADMISSION_TIMEOUT)
Note over S: preserved retryable - defer, no row drop
else other
RAC-->>MRE: RequestAdmissionError(other)
MRE-->>S: ProviderError(TIMEOUT)
end
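
The error-classification branch of the diagram can be sketched as follows. The classes below are simplified stand-ins that reuse names from this PR. The `reason` attribute and the constructor signatures are assumptions, and the real `_provider_error_from_request_admission()` in `model_request_executor.py` may differ.

```python
from enum import Enum


class ProviderErrorKind(Enum):
    # Only the two kinds named in the diagram; values are illustrative.
    TIMEOUT = "timeout"
    REQUEST_ADMISSION_TIMEOUT = "request_admission_timeout"


class RequestAdmissionError(Exception):
    """Stand-in for the local admission error; `reason` is an assumed attribute."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


class ProviderError(Exception):
    """Stand-in for the executor-level error carrying a classified kind."""

    def __init__(self, kind: ProviderErrorKind, message: str) -> None:
        super().__init__(message)
        self.kind = kind


def provider_error_from_request_admission(exc: RequestAdmissionError) -> ProviderError:
    # A queue timeout is local request pressure, not a provider failure,
    # so it gets its own kind instead of the generic TIMEOUT.
    if exc.reason == "queue_timeout":
        return ProviderError(
            ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT,
            "timed out waiting in the local request-admission queue",
        )
    # Every other admission error keeps the generic TIMEOUT kind, as in the diagram.
    return ProviderError(ProviderErrorKind.TIMEOUT, f"request admission failed: {exc.reason}")


local = provider_error_from_request_admission(RequestAdmissionError("queue_timeout"))
other = provider_error_from_request_admission(RequestAdmissionError("other"))
print(local.kind.name, other.kind.name)
```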
Reviews (4): Last reviewed commit: "Merge branch 'main' into codex/fix-725-r..."
0c57a64 to d9da186
d9da186 to 6d8c86f
- Classify local request-admission queue timeouts separately from provider timeouts
- Preserve request-admission timeouts through async salvage like rate limits
- Bound model task admission by provider/model request capacity
- Add regression coverage for Issue NVIDIA-NeMo#725

Fixes NVIDIA-NeMo#725

Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>
6d8c86f to 0756416
johnnygreco left a comment
Review complete. I did not find any actionable issues.
I checked the request-admission timeout classification, DataDesigner error mapping, preserved retryable salvage behavior, scheduler request-resource admission, and the architecture note. The change keeps provider failures distinct from local request-admission pressure and preserves the prior finite-salvage behavior for non-preserved retryables.
Verification run:
- `ruff check` on the changed files
- `ruff format --check` on the changed Python files
- `pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py packages/data-designer-engine/tests/engine/models/test_model_errors.py -q` (146 passed)
- `git diff --check origin/main...HEAD`
johnnygreco left a comment
Approved. My review found no actionable issues or required updates.
📋 Summary
Fixes Issue #725 by treating local request-admission queue timeouts as scheduler/request-pressure retryables instead of provider failures, and by bounding scheduler model-task admission with provider/model request capacity. This keeps healthy endpoints from dropping rows when async scheduling load creates local request-admission pressure.
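
As a rough illustration of the difference between a preserved retryable and an ordinary one, the sketch below requeues preserved errors without spending any salvage budget, while ordinary retryables get a finite number of salvage attempts before the row is dropped. All names here (`PreservedRetryable`, `OrdinaryRetryable`, `run_rows`, `salvage_budget`) are hypothetical, and the synchronous queue is a simplification of the async scheduler, not its actual implementation.

```python
from collections import deque


class PreservedRetryable(Exception):
    """Hypothetical: local pressure such as a rate limit or request-admission timeout."""


class OrdinaryRetryable(Exception):
    """Hypothetical: a retryable failure that consumes the finite salvage budget."""


def run_rows(rows, call, salvage_budget=1):
    # Each pending entry tracks how many salvage attempts the row has used.
    pending = deque((row, 0) for row in rows)
    completed, dropped = [], []
    while pending:
        row, attempts = pending.popleft()
        try:
            completed.append(call(row))
        except PreservedRetryable:
            # Defer and try again later; the salvage budget is untouched.
            pending.append((row, attempts))
        except OrdinaryRetryable:
            if attempts < salvage_budget:
                pending.append((row, attempts + 1))
            else:
                dropped.append(row)
    return completed, dropped


def make_flaky(failures):
    # Test double: raises the queued exceptions for a row, then succeeds.
    remaining = {row: list(excs) for row, excs in failures.items()}

    def call(row):
        queue = remaining.get(row)
        if queue:
            raise queue.pop(0)
        return row

    return call


call = make_flaky({
    "a": [PreservedRetryable()] * 3,  # local pressure three times, then success
    "b": [OrdinaryRetryable()] * 2,   # exceeds a salvage budget of 1
})
completed, dropped = run_rows(["a", "b"], call, salvage_budget=1)
print(completed, dropped)
```

Row `a` survives three deferrals because none of them count against the budget, while row `b` is dropped after its single salvage attempt fails.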
🔗 Related Issue
Fixes #725
🔄 Changes
- Classify `queue_timeout` as `ProviderErrorKind.REQUEST_ADMISSION_TIMEOUT` / `ModelRequestAdmissionTimeoutError` so model callers see the right local-boundary failure.
🧪 Testing
- `uv run ruff check architecture/dataset-builders.md packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resolver.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/resources.py packages/data-designer-engine/src/data_designer/engine/models/clients/errors.py packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py packages/data-designer-engine/src/data_designer/engine/models/errors.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py packages/data-designer-engine/tests/engine/models/test_model_errors.py`
- `uv run ruff format --check <touched Python files>` (`architecture/dataset-builders.md` excluded from format check because ruff requires preview for Markdown formatting)
- `uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resources.py packages/data-designer-engine/tests/engine/models/test_model_errors.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py -q` (146 passed)
- `uv run pytest packages/data-designer-engine/tests -q` (2224 passed)

Performance demonstration:
origin-main-baseline
simplified-working-tree
✅ Checklist