refactor: orchestration improvements#230
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (13)
Disabled knowledge base sources:
WalkthroughIntroduces needs_attention run status and starvation escalation with durable persistence, exposes new scheduler metrics, adds run recovery endpoint and CLI, wires daemon/scheduler/store, updates UI and tests, and adds DB migrations to support new status and starvation tracking. ChangesStarvation escalation and task-run recovery
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI
participant API as HTTP API
participant Task as Task Service
participant DB as GlobalDB
participant Events as Events
CLI->>API: POST /api/runs/:id/recover {reason, metadata}
API->>Task: RecoverRun(runID, request, actor)
Task->>DB: RecoverTaskRun(mutation)
DB-->>Task: RetryRunResult {previous, run}
Task->>Events: Emit task.run_recovered_from_attention
Task-->>API: RetryRunResult
API-->>CLI: 201 {previous_run, run}
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (3)
web/e2e/__tests__/knowledge.spec.ts (1)
271-273: ⚡ Quick winMake this prompt-shape assertion newline-tolerant.
This exact multiline literal is prone to platform/serializer newline differences. Prefer a regex with
\r?\nwhile keeping the same structure requirement.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@web/e2e/__tests__/knowledge.spec.ts` around lines 271 - 273, The assertion that checks recalledPrompt uses a hardcoded multiline string which is brittle across newline styles; replace the toContain literal with a regex-based assertion that tolerates CRLF by calling expect(recalledPrompt).toMatch(...) and use a pattern that preserves the same structure but uses \r?\n between lines (e.g. "<\/turn-recall>\r?\n\r?\n<user-message>\r?\nremember me\r?\n<\/user-message>"). Target the recalledPrompt variable in this test and update the assertion accordingly.internal/daemon/daemon_bridge_extension_integration_test.go (1)
479-486: ⚡ Quick winAvoid pinning the production min version literal in test rewrite logic.
This helper will fail on any legitimate
min_agh_versionbump insdk/examples/telegram-reference, creating unrelated integration-test churn. Replace by matching themin_agh_versionkey/value pattern and rewriting that value only.As per coding guidelines, “Never hardcode configuration values — always retrieve configuration from the config system or environment, never from constants or literals.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/daemon/daemon_bridge_extension_integration_test.go` around lines 479 - 486, The current test rewrites the manifest by looking for the literal productionMinVersion and will break whenever that literal changes; instead, update the logic that uses productionMinVersion/e2eMinVersion and strings.Replace to (1) parse and match the key pattern min_agh_version\s*=\s*"<value>" (use a regexp matching the key and capturing the quoted value) and (2) replace only the captured value with the test target version sourced from configuration or an environment variable (do not hardcode e2eMinVersion). Keep the check that fails the test if the pattern was not found (i.e., updated == string(contents) equivalent) so you still error when the manifest lacks a min_agh_version entry.internal/daemon/spawn_reaper_test.go (1)
111-157: ⚡ Quick winUse the required
t.Run("Should ...")subtest shape here.This new case is the only standalone test in the file. Please wrap it in a
t.Run("Should reap only expired starvation workers", ...)block and keept.Parallel()inside that subtest so it matches the repository’s Go test structure.As per coding guidelines, "
**/*_test.go: MUST uset.Run(\"Should...\")pattern for ALL test casesandStructure tests witht.Run(\"Should ...\")subtests and mark them witht.Parallel()by default unless there is a specific reason to disable parallelization.`"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/daemon/spawn_reaper_test.go` around lines 111 - 157, Wrap the existing test body of TestSpawnReaperReapsTTLExpiredStarvationWorkers in a t.Run("Should reap only expired starvation workers", func(t *testing.T) { ... }) subtest and move the existing t.Parallel() call inside that subtest closure; keep the outer test function and all logic (fakeSessionManager setup, newSpawnReaper call, reaper.Sweep invocation, report assertions, and assertStopWithCause) unchanged so symbols like TestSpawnReaperReapsTTLExpiredStarvationWorkers, newSpawnReaper, reaper.Sweep, and assertStopWithCause remain valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/config/autonomy_test.go`:
- Around line 63-67: Rename the failing subtests so their t.Run descriptions
follow the "Should ..." pattern (e.g., "Should reject non-positive fan_out",
"Should reject spawn before fan_out", "Should reject event before spawn",
"Should reject needs_attention before event", "Should reject non-positive
min_queued_age") and replace the loose err != nil checks with a specific error
assertion that inspects the error message/type (for example use
require.ErrorContains(t, err, "<expected substring>") or require.ErrorAs when
matching an error type) when validating the SchedulerConfig rejection cases that
modify SchedulerConfig fields FanOutAfter, SpawnAfter, EventAfter,
NeedsAttentionAfter, and MinQueuedAge.
In `@internal/daemon/task_role_runtime.go`:
- Around line 18-26: The two hardcoded constants defaultStarvationWorkerTTL and
defaultStarvationMaxActivePerWorkspace should be moved to configurable settings
and read from the config system (or environment) at startup; update
task_role_runtime.go to remove those literals and instead use config-backed
values (e.g., a TaskRoleRuntimeConfig with fields StarvationWorkerTTL and
StarvationMaxActivePerWorkspace) and replace all uses of
defaultStarvationWorkerTTL and defaultStarvationMaxActivePerWorkspace with the
config values (falling back to the current values if the config key is absent).
Locate references to defaultStarvationWorkerTTL and
defaultStarvationMaxActivePerWorkspace in the file and the spawn/reaper logic
and wire in the config getters (or env parsers) so environments can tune these
parameters without code changes.
In `@internal/daemon/task_runtime_test.go`:
- Around line 991-1076: The test
TestBootTasksWiresSchedulerStarvationAgeToTaskManager must be converted to use a
t.Run("Should ...") subtest: wrap the entire current body (everything inside the
function) in a t.Run call with a descriptive name (e.g., "Should wire scheduler
min queued age to task manager"), move the t.Parallel() call into that subtest
(remove the top-level t.Parallel()), and keep all existing setup and assertions
(resolver creation, daemon.bootTasks, state.tasks.manager.CreateTask/EnqueueRun,
updating runRecord.QueuedAt, and SchedulerStatus checks) unchanged so the logic
in TestBootTasksWiresSchedulerStarvationAgeToTaskManager still executes inside
the subtest.
In `@internal/scheduler/scheduler.go`:
- Around line 546-547: The starvation check is still using s.starvationAge
instead of the new StarvationThresholds value; update the starved computation to
use s.starveThresholds.MinQueuedAge (e.g. replace s.starvationAge > 0 and the
comparison now.Sub(... ) >= s.starvationAge with s.starveThresholds.MinQueuedAge
> 0 and now.Sub(... ) >= s.starveThresholds.MinQueuedAge) so MinQueuedAge
becomes the single source of truth (or alternatively ensure both fields are
synchronized); adjust the starved assignment that references s.starvationAge,
keeping work.Run.QueuedAt and now.Sub(...) unchanged.
In `@internal/scheduler/types.go`:
- Around line 78-96: normalize() currently only backfills zeros from
DefaultStarvationThresholds() but does not enforce the monotonic escalation
invariant; update normalize() (function StarvationThresholds.normalize) to after
filling defaults ensure each later threshold is at least the previous one (e.g.
if t.SpawnAfter < t.FanOutAfter then set t.SpawnAfter = t.FanOutAfter; if
t.EventAfter < t.SpawnAfter then set t.EventAfter = t.SpawnAfter; if
t.NeedsAttentionAfter < t.EventAfter then set t.NeedsAttentionAfter =
t.EventAfter) and likewise ensure MinQueuedAge is not less than
NeedsAttentionAfter, so callers of WithStarvationThresholds() cannot produce a
descending ladder even when non-zero values are provided.
In `@internal/store/globaldb/global_db_task_claim_test.go`:
- Around line 246-278: Add parallelization to the test by calling t.Parallel()
at the start of TestGlobalDBClaimNextRunSkipsNeedsAttention and inside its
subtest (the t.Run block); ensure both the top-level test function
TestGlobalDBClaimNextRunSkipsNeedsAttention and the anonymous subtest accept
parallel execution so openTestGlobalDB(t) and t.TempDir() remain safe when tests
run concurrently.
In `@internal/store/globaldb/global_db_task_force.go`:
- Around line 364-376: The task-run transitions that create `failed` and then
insert a retry (via forceFailedTaskRun -> updateTaskRunRecordWithSnapshotCAS ->
g.retryTaskRunTask -> nextTaskRunAttemptWithExecutor -> g.insertRetryTaskRun) do
not update the task-level current-run projection; add a call to
updateTaskCurrentRunProjectionForRunUpdate(ctx, exec, source, failed.TaskID,
failed.RunID, /* appropriate snapshot or mutated run record */) after the
run-record CAS update (and similarly in the parallel path around lines 393-418
for the force-release/retry flow) so the task current-run/backlog views are
refreshed whenever a run is escalated to needs_attention or a new queued child
is inserted. Ensure you pass the same run identity and context used in
updateTaskRunRecordWithSnapshotCAS so the projection reflects the latest run
state.
In `@internal/store/globaldb/global_db_task_test.go`:
- Around line 2007-2084: The subtests inside TestGlobalDBRecoverTaskRun are
missing t.Parallel() so they don't run in parallel with the rest of the suite;
add t.Parallel() at the top of each subtest closure (the two t.Run blocks:
"Should terminalize a needs_attention run and queue a linked child" and "Should
reject recovering a non-needs_attention run") so each subtest calls t.Parallel()
immediately after entering the closure, ensuring TestGlobalDBRecoverTaskRun and
its subtests follow the suite's parallel execution convention.
In `@internal/store/globaldb/global_db_test.go`:
- Around line 802-817: The test currently sets preV39 :=
globalSchemaMigrations[:len(globalSchemaMigrations)-1], which still includes the
v39 migration (drop_task_run_status_check) when there are 40 migrations, so it
does not produce a DB at v38; change the slice to stop one element earlier (e.g.
preV39 := globalSchemaMigrations[:len(globalSchemaMigrations)-2]) so
RunMigrations runs only up to v38, or alternatively select migrations up to the
migration named "drop_task_run_status_check" by index to ensure the DB is
migrated to v38 before the upgrade test.
In `@internal/task/validate_test.go`:
- Around line 704-706: Rename the subtest case name string from "task run status
needs_attention valid" to follow the "Should..." pattern (e.g., "Should accept
needs_attention run status") so the t.Run call uses the required naming
convention; locate the table entry where name is set and the run closure
invoking TaskRunStatusNeedsAttention.Validate("run.status") and update the name
value accordingly.
In `@internal/task/validate.go`:
- Around line 156-157: Run.Validate() currently only treats TaskRunStatusQueued
as requiring no SessionID, but TaskRunStatusNeedsAttention was added as a valid
status and must follow the same "no session_id" invariant; update the
Run.Validate() logic to include TaskRunStatusNeedsAttention wherever it enforces
the queued-only rules (e.g., the check that SessionID must be empty for
TaskRunStatusQueued), and ensure any other status-specific shape checks that
apply to queued are applied equally to needs_attention (use the Run.Validate()
method, TaskRunStatusQueued, TaskRunStatusNeedsAttention, and the Run.SessionID
field as anchors).
---
Nitpick comments:
In `@internal/daemon/daemon_bridge_extension_integration_test.go`:
- Around line 479-486: The current test rewrites the manifest by looking for the
literal productionMinVersion and will break whenever that literal changes;
instead, update the logic that uses productionMinVersion/e2eMinVersion and
strings.Replace to (1) parse and match the key pattern
min_agh_version\s*=\s*"<value>" (use a regexp matching the key and capturing the
quoted value) and (2) replace only the captured value with the test target
version sourced from configuration or an environment variable (do not hardcode
e2eMinVersion). Keep the check that fails the test if the pattern was not found
(i.e., updated == string(contents) equivalent) so you still error when the
manifest lacks a min_agh_version entry.
In `@internal/daemon/spawn_reaper_test.go`:
- Around line 111-157: Wrap the existing test body of
TestSpawnReaperReapsTTLExpiredStarvationWorkers in a t.Run("Should reap only
expired starvation workers", func(t *testing.T) { ... }) subtest and move the
existing t.Parallel() call inside that subtest closure; keep the outer test
function and all logic (fakeSessionManager setup, newSpawnReaper call,
reaper.Sweep invocation, report assertions, and assertStopWithCause) unchanged
so symbols like TestSpawnReaperReapsTTLExpiredStarvationWorkers, newSpawnReaper,
reaper.Sweep, and assertStopWithCause remain valid.
In `@web/e2e/__tests__/knowledge.spec.ts`:
- Around line 271-273: The assertion that checks recalledPrompt uses a hardcoded
multiline string which is brittle across newline styles; replace the toContain
literal with a regex-based assertion that tolerates CRLF by calling
expect(recalledPrompt).toMatch(...) and use a pattern that preserves the same
structure but uses \r?\n between lines (e.g.
"<\/turn-recall>\r?\n\r?\n<user-message>\r?\nremember me\r?\n<\/user-message>").
Target the recalledPrompt variable in this test and update the assertion
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 429e36c0-bdae-41e1-b745-7d425c01a887
⛔ Files ignored due to path filters (7)
internal/sandbox/daytona/sidecar_assets/agh-daytona-sidecar-linux-amd64.gzis excluded by!**/*.gzinternal/sandbox/daytona/sidecar_assets/agh-daytona-sidecar-linux-arm64.gzis excluded by!**/*.gzopenapi/agh.jsonis excluded by!**/*.jsonskills/agh/references/capabilities-and-bundles.mdis excluded by!**/*.mdskills/agh/references/tasks-and-orchestration.mdis excluded by!**/*.mdweb/e2e/fixtures/runtime-seed.tsis excluded by!**/fixtures/**web/src/generated/agh-openapi.d.tsis excluded by!**/generated/**
📒 Files selected for processing (87)
internal/api/contract/scheduler.gointernal/api/contract/tasks.gointernal/api/core/scheduler.gointernal/api/core/tasks.gointernal/api/httpapi/handlers_test.gointernal/api/httpapi/routes.gointernal/api/spec/spec.gointernal/api/spec/spec_test.gointernal/api/testutil/task_stub.gointernal/api/udsapi/handlers_test.gointernal/api/udsapi/routes.gointernal/cli/client.gointernal/cli/helpers_test.gointernal/cli/scheduler.gointernal/cli/task.gointernal/cli/task_test.gointernal/config/autonomy.gointernal/config/autonomy_test.gointernal/config/config.gointernal/config/merge.gointernal/daemon/daemon_bridge_extension_integration_test.gointernal/daemon/daemon_test.gointernal/daemon/native_tools_test.gointernal/daemon/review_router.gointernal/daemon/scheduler_runtime.gointernal/daemon/spawn_reaper.gointernal/daemon/spawn_reaper_test.gointernal/daemon/starvation_spawn.gointernal/daemon/starvation_spawn_test.gointernal/daemon/task_role_runtime.gointernal/daemon/task_role_runtime_test.gointernal/daemon/task_runtime.gointernal/daemon/task_runtime_test.gointernal/diagnosticcontract/diagnostics.gointernal/events/registry.gointernal/scheduler/scheduler.gointernal/scheduler/scheduler_integration_test.gointernal/scheduler/scheduler_lifecycle_test.gointernal/scheduler/scheduler_test.gointernal/scheduler/starvation.gointernal/scheduler/types.gointernal/store/globaldb/global_db.gointernal/store/globaldb/global_db_scheduler_pause.gointernal/store/globaldb/global_db_scheduler_pause_test.gointernal/store/globaldb/global_db_task_claim_test.gointernal/store/globaldb/global_db_task_force.gointernal/store/globaldb/global_db_task_pause.gointernal/store/globaldb/global_db_task_starvation.gointernal/store/globaldb/global_db_task_starvation_test.gointernal/store/globaldb/global_db_task_test.gointernal/store/globaldb/global_db_test.gointernal/store/globaldb/migrate_prod_ready_foundation.gointernal/store/globaldb/migrate_task_run_starvation.gointernal/store/globaldb/migrate_task_run_status.gointernal/store/globaldb/migrate_workspace.gointernal/store/schema.gointernal/task/force_ops.gointernal/task/interfaces.gointernal/task/manager.gointernal/task/manager_test.gointernal/task/scheduler_controls.gointernal/task/scheduler_controls_test.gointernal/task/starvation_events.gointernal/task/types.gointernal/task/validate.gointernal/task/validate_test.gopackages/ui/src/components/custom/__tests__/run-card.test.tsxpackages/ui/src/components/custom/run-card.tsxpackages/ui/src/components/custom/stories/run-card.stories.tsxweb/e2e/__tests__/extensibility.spec.tsweb/e2e/__tests__/knowledge.spec.tsweb/e2e/__tests__/session-provider-override.spec.tsweb/src/lib/__tests__/status-tone.test.tsweb/src/lib/status-tone.tsweb/src/systems/scheduler/components/__tests__/scheduler-controls-panel.test.tsxweb/src/systems/scheduler/components/scheduler-controls-panel.tsxweb/src/systems/scheduler/components/stories/scheduler-controls-panel.stories.tsxweb/src/systems/scheduler/mocks/fixtures.tsweb/src/systems/scheduler/mocks/index.tsweb/src/systems/tasks/components/__tests__/task-run-timeline-panel.test.tsxweb/src/systems/tasks/components/stories/task-run-detail.stories.tsxweb/src/systems/tasks/components/task-run-timeline-panel.tsxweb/src/systems/tasks/hooks/__tests__/use-task-stream.test.tsxweb/src/systems/tasks/hooks/use-task-stream.tsweb/src/systems/tasks/lib/__tests__/task-formatters.test.tsweb/src/systems/tasks/lib/task-formatters.tsweb/src/systems/tasks/lib/timeline-visuals.ts
Summary by CodeRabbit