feat(compute): add ComputeStrategy interface with ECS Fargate backend#8
Conversation
236d7bd to
bad8867
Compare
There was a problem hiding this comment.
Pull request overview
Introduces a compute strategy abstraction for orchestrating agent sessions, extracting AgentCore-specific session invocation logic out of the shared orchestrator and updating the durable “start-session” step to delegate compute-specific work to the selected strategy.
Changes:
- Added a
ComputeStrategyinterface plusresolveComputeStrategy()factory keyed offblueprintConfig.compute_type. - Extracted AgentCore session invocation/stop logic into
AgentCoreComputeStrategy, and refactored orchestration to use it. - Added/updated unit tests for the new strategy + factory, and adjusted existing orchestrator-related tests accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
cdk/src/handlers/shared/compute-strategy.ts |
Adds strategy interface/types and a factory for selecting a compute backend. |
cdk/src/handlers/shared/strategies/agentcore-strategy.ts |
Implements AgentCore-specific session start/poll/stop behind the strategy interface. |
cdk/src/handlers/shared/orchestrator.ts |
Removes AgentCore session start logic from the shared orchestrator. |
cdk/src/handlers/orchestrate-task.ts |
Updates the durable start-session step to resolve and use a compute strategy. |
cdk/test/handlers/shared/compute-strategy.test.ts |
Tests strategy resolution and unknown compute type behavior. |
cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts |
Tests AgentCore strategy session start/poll/stop behavior with SDK mocking. |
cdk/test/handlers/orchestrate-task.test.ts |
Removes old startSession tests and adjusts mocks (now mostly orchestrator-focused). |
.github/workflows/build.yml |
Updates CI environment variables and disables several security/scanning tools via MISE_DISABLE_TOOLS. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thank you ! Couple of findings, will add each one of them as a separate comment. Regarding the CI failures, don't worry about it in this PR, i'll fix it separately |
|
Finding 1 — Dual source of truth for runtimeArn Category: Architecture What happens: constructor(options: { runtimeArn: string }) { Then in startSession, the method immediately overrides it from blueprintConfig: const runtimeArn = input.blueprintConfig.runtime_arn ?? this.runtimeArn; // source 2 Why it matters: return new AgentCoreComputeStrategy({ runtimeArn: blueprintConfig.runtime_arn }); So this.runtimeArn and input.blueprintConfig.runtime_arn are always the same value at construction time. The ?? fallback can never fire. This creates the false impression that two Recommendation: |
|
Finding 2 — Lost SDK client singleton Category: Architecture / Performance Before (main branch): After (PR): The factory resolveComputeStrategy() is called inside the durable execution step 'start-session' at orchestrate-task.ts:120. In durable execution, steps can replay on Lambda Why it matters: Recommendation: let _client: BedrockAgentCoreClient | undefined; |
|
Finding 3 — compute_type is an untyped string Category: Types The CDK construct already defines a strict union at blueprint.ts:50: readonly type?: 'agentcore' | 'ecs'; But the runtime types in BlueprintConfig and RepoConfig declare it as a bare string: // repo-config.ts:49 The factory switch in compute-strategy.ts then does: switch (computeType) { Why it matters:
Recommendation: // repo-config.ts export interface BlueprintConfig { // compute-strategy.ts |
|
Finding 4 — stopSession swallows all errors Category: Error handling async stopSession(handle: SessionHandle): Promise { The entire catch is a logger.warn — the error is swallowed unconditionally. Why it matters:
Logging at warn for all of these treats a billing leak the same as a benign no-op. Recommendation: |
|
Finding 5 — Missing logger.info('Session started') after full sequence Category: Observability Before (main branch — orchestrator.ts:340): After (PR — orchestrate-task.ts:120-133): The strategy itself logs 'AgentCore session invoked' at agentcore-strategy.ts:61, but that fires before the DDB transition and event emit. The original code logged after the entire Why it matters: Recommendation: logger.info('Session started', { task_id: taskId, session_id: handle.sessionId, strategy: handle.strategyType }); |
|
Finding 6 — No integration tests for the rewritten start-session step Category: Testing What was removed: What was added:
What's missing:
No test validates this composition. Specifically:
The old startSession function was monolithic and tested end-to-end. The refactored version split the logic across two layers but only tested the bottom layer. Recommendation: |
|
Finding 7 — pollSession() silently returns 'running' forever Category: Interface contract async pollSession(_handle: SessionHandle): Promise { The ComputeStrategy interface at compute-strategy.ts:38 declares: pollSession(handle: SessionHandle): Promise; This returns SessionStatus, a discriminated union of 'running' | 'completed' | 'failed'. The current implementation always returns 'running'. Why it matters: The existing orchestrator polling in orchestrator.ts:352-366 (pollTaskStatus) reads from DDB, which is the actual source of truth. So currently pollSession is dead code. But:
Recommendation:
|
|
@krokoko All 7 findings addressed in d68edd6. Summary: Finding 1 — Dual source of truth for runtimeArn: Removed Finding 2 — Lost SDK client singleton: Hoisted Finding 3 — compute_type is untyped string: Added Finding 4 — stopSession swallows all errors: Distinguished error types:
Finding 5 — Missing logger.info after full sequence: Added Finding 6 — No integration tests: Created
Finding 7 — pollSession returns 'running' forever: Now throws |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
cdk/src/stacks/agent.ts:257
hasEcsBlueprintis currently hard-coded to always be false (.some(() => false)), soneedsEcseffectively depends only onABCA_ENABLE_ECS. This makes the preceding comment (“Wire ECS infrastructure when any blueprint uses compute.type: 'ecs'”) inaccurate and will silently skip provisioning ECS even if a Blueprint is later configured withcompute.type: 'ecs'. Consider either removinghasEcsBlueprintentirely and documenting that ECS infra is env-flag driven, or implementing a real detection mechanism (e.g., tracking blueprint compute type in code or exposing it from theBlueprintconstruct).
new CfnOutput(this, 'RepoTableName', {
value: repoTable.table.tableName,
description: 'Name of the DynamoDB repo config table',
});
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep) - Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType - Trim/filter ECS_SUBNETS to handle whitespace and trailing commas - Handle undefined exit code in ECS pollSession (container never started) - Scope iam:PassRole to specific ECS task/execution role ARNs - Validate all-or-nothing ECS props in TaskOrchestrator constructor - Remove dead hasEcsBlueprint detection; document env-flag driven approach - Add comment noting strategy_type as additive event field
…AgentCore logic Introduce ComputeStrategy interface with SessionHandle/SessionStatus types and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern for session lifecycle (start/poll/stop). Pure refactor — no behavior change, identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases. Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for security scanning tasks, not for the build/test/synth pipeline. Disable them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable trivy/grype/semgrep which need GitHub API downloads) - Remove unused @aws-sdk/client-bedrock-agentcore mock from orchestrate-task.test.ts (SDK is no longer imported by orchestrator) - Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
(info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
'running' status — clear signal for future developers
- Replace require() with ES import for BedrockAgentCoreClient mock - Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy interface, using the existing durable Lambda orchestrator. No separate stacks or Step Functions — ECS is a strategy option alongside AgentCore. Changes: - EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks state mapping), stopSession (StopTask with graceful error handling) - EcsAgentCluster construct: ECS Cluster (container insights), Fargate task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only), CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock) - TaskOrchestrator: optional ECS props for env vars and IAM policies (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN, iam:PassRole conditioned on ecs-tasks.amazonaws.com) - Orchestrator polling: ECS compute-level crash detection alongside existing DDB polling (non-fatal, wrapped in try/catch) - AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var) - Full test coverage: 15 ECS strategy tests, 9 construct tests, 5 orchestrator ECS tests. All 563 tests pass. Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag, agent task running on AgentCore path unaffected.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep) - Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType - Trim/filter ECS_SUBNETS to handle whitespace and trailing commas - Handle undefined exit code in ECS pollSession (container never started) - Scope iam:PassRole to specific ECS task/execution role ARNs - Validate all-or-nothing ECS props in TaskOrchestrator constructor - Remove dead hasEcsBlueprint detection; document env-flag driven approach - Add comment noting strategy_type as additive event field
926778b to
eb0bfa9
Compare
The ECS container's default CMD starts uvicorn server:app which waits for HTTP POST to /invocations — but in standalone ECS nobody sends that request, leaving the agent idle. Override the container command to invoke entrypoint.run_task() directly with the full orchestrator payload via AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task definition base environment.
|
Pulled locally and merged main, when looking at it: Critical Issues
cdk/src/handlers/cancel-task.ts:110-125 — The cancel handler hardcodes StopRuntimeSessionCommand for AgentCore. When a user cancels an ECS-backed task, the DDB record transitions to Fix: The cancel handler should use resolveComputeStrategy() to dispatch to the correct strategy's stopSession(), or at minimum store enough metadata (compute_type, session handle) on
cdk/src/handlers/orchestrate-task.ts:128-131 — The old startSession wrote agent_runtime_arn to the task record during the HYDRATING→RUNNING transition. The new strategy-based path omits Fix: Restore writing agent_runtime_arn for AgentCore strategy. For ECS, store compute_type and session metadata on the task record.
cdk/src/constructs/task-orchestrator.ts:180-185, 277-280 — The all-or-nothing validation checks 5 ECS props but omits ecsTaskRoleArn and ecsExecutionRoleArn. If someone provides the 5 Fix: Include role ARNs in the all-or-nothing validation, or throw when ecsClusterArn is set but role ARNs are missing. Never fall back to ['*'].
cdk/src/handlers/orchestrate-task.ts:180-185 — The catch block downgrades ALL exceptions from pollSession to logger.warn, including AccessDeniedException, TypeError, and other Fix: Track consecutive poll failures. After N consecutive failures, escalate to logger.error and fail the task. Distinguish transient errors (throttling) from persistent errors (access Important Issues
cdk/src/handlers/shared/orchestrator.ts:366-395 — The old startSession function and its AgentCore client are still exported but no longer imported by any handler. The remaining test in
cdk/src/constructs/ecs-agent-cluster.ts:75 — /ecs/abca-agent-tasks is hardcoded. Two stacks in the same account/region will collide with ResourceAlreadyExistsException. Let CDK
Both strategies immediately cast fields out of the untyped metadata bag (handle.metadata.runtimeArn as string, handle.metadata.clusterArn as string). A discriminated union keyed on export type SessionHandle =
It throws unconditionally ("not implemented"), so callers cannot safely call pollSession on an arbitrary ComputeStrategy without first checking the type. Either make pollSession
The all-or-nothing constraint is invisible at the type level. A single ecsConfig?: EcsComputeConfig with all fields required would express the constraint at the type level and eliminate
cdk/src/handlers/orchestrate-task.ts:171-179 — Only 'failed' is checked. If the ECS task exits with code 0 but the agent fails to write terminal status to DDB, the orchestrator polls
cdk/src/handlers/orchestrate-task.ts:169 — Creates a new EcsComputeStrategy() up to 1020 times inside the poll loop. Resolve once before the loop and reuse. Test Coverage Assessment: 7/10 Strengths:
Critical Gaps:
Positive Observations
|
End-to-End Verification ✅Deployed the latest Test task: Build a simple calculator web app (HTML/CSS/JS) against
What was verified:
|
|
@MichaelWalker-git could you please just fix the mutation and I'll merge it, thank you ! |
…aker (krokoko review aws-samples#6, aws-samples#8) Two related hardening changes on ``agent/src/progress_writer.py``. Grouped because the shared circuit breaker reuses the error- classification decision to know when NOT to flip itself, and separating the commits would force an awkward "half-fix" intermediate state. ## Findings addressed **aws-samples#6 — Circuit breaker trips on ValidationException (permanent errors)** The pre-fix ``except Exception`` branch fed ALL errors into the same ``_failure_count`` counter. A persistent schema/size error (e.g. ``ValidationException`` from an item >400 KB under a trace-heavy event) counted against the transient-failure budget and tripped the breaker within 3 events, silencing the entire progress stream for the rest of the task — even though most subsequent events were normal size and would have written fine. New behaviour classifies each DDB error into three buckets: - **Permanent (drop event, keep stream alive):** ``ValidationException``, ``ItemCollectionSizeLimitExceededException`` — the individual event is malformed or oversized, retrying or treating as transient would not help. Log at WARN, skip the event, do NOT increment the failure counter. - **Immediate-disable (fatal, don't even try to retry):** ``ResourceNotFoundException``, ``AccessDeniedException``, ``UnauthorizedOperation`` — wrong deploy or IAM misconfig. Disable the breaker on the first occurrence instead of waiting for 3 failures; log at WARN with the error code. Avoids spamming operator dashboards with 3 copies of the same permissions error. - **Transient (trip the breaker on repeated failures, as today):** ``ProvisionedThroughputExceededException``, ``RequestLimitExceeded``, ``ServiceUnavailable``, ``InternalServerError``, plus network-layer (``ConnectionError``, ``EndpointConnectionError``, ``ReadTimeoutError``). Same counter semantics as pre-fix. - **Unknown (default conservative):** counted as transient (counter increments) but logged at ERROR with an explicit ``UNKNOWN`` marker so operators notice and can add new codes to the permanent/transient lists. Does NOT instant-disable — over- correcting from pre-fix behaviour would swap one failure mode for another. Uses ``botocore.exceptions.ClientError`` + ``err.response["Error"]["Code"]`` for AWS errors; class-name matching for non-ClientError (network- layer) paths. Helper: ``_classify_ddb_error(exc) -> Literal["permanent", "immediate_disable", "transient", "unknown"]``. **aws-samples#8 — Dual _ProgressWriter instances with independent circuit breakers** Pre-fix, the runner-level writer (turn/tool events at ``runner.py:240``) and the pipeline-level writer (milestones at ``pipeline.py:303``) each held their own ``_failure_count`` / ``_disabled`` state. If throttling tripped one writer, the other kept writing — creating visible event gaps in the stream that operators could not distinguish from agent activity (milestones firing after turn events stop, or vice versa). Fix: consolidate circuit-breaker state into a module-level ``_SharedCircuitBreaker`` singleton keyed by ``task_id``. Both writers for the same task read/write the same ``(_failure_count, _disabled)`` pair through named methods (``is_disabled``, ``record_failure``, ``record_success``, ``disable``). One task's stream is either healthy (all events flow from both writers) or degraded (no events flow from either). Cannot have a half-alive stream. Semantics notes: - ``record_success`` resets the shared counter but NOT the ``_disabled`` flag. Re-enabling mid-task would let a flaky minute burn the failure budget repeatedly and defeat the breaker's purpose. - Empty-string ``task_id`` (``runner.py`` falls back to sentinel ``"unknown"``) collapses to shared state for all ``"unknown"`` writers. Real task_ids stay isolated. - Writers retain ``_disabled`` / ``_failure_count`` as properties that proxy to the shared map. Existing callers (``hooks.py`` does ``getattr(progress, "_disabled", False)``) and tests that assign ``writer._failure_count = 0`` keep working unchanged — no constructor signature change required, no ``runner.py`` / ``pipeline.py`` edits. - Single ``threading.Lock()`` protects the shared map; DDB write rate (single-digit events/sec) never contends meaningfully. - Test hygiene: ``_reset_circuit_breakers()`` helper rebinds the module global so autouse fixtures give each test a clean slate. ## Tests +24 regression tests net (36 → 60 in ``test_progress_writer.py``). Coverage: - Finding aws-samples#6 classification: ``test_permanent_error_does_not_trip_breaker`` (10 consecutive ``ValidationException`` writes keep ``_disabled=False``), ``test_transient_error_trips_breaker_as_before``, ``test_access_denied_disables_writer_immediately_with_loud_log``, ``test_unknown_exception_treated_as_transient_with_error_log``. - Finding aws-samples#8 sharing: ``test_shared_circuit_breaker_across_writers_same_task_id`` (writer-A trips the breaker; writer-B sees ``is_disabled`` and skips the DDB call), ``test_separate_tasks_have_independent_breakers``, ``test_unknown_sentinel_task_id_is_isolated``, ``test_reset_helper_clears_shared_state_between_tests``. - Plus edge cases: success-interleave resets the counter across writers; ``_disabled`` stays open after re-enabling a success mid-task; thread-safety via concurrent writes. Agent suite: 497 passing (was 473; +24). Refs: krokoko code review on PR aws-samples#52 (findings 6, 8)
Adds the 14 agent-side approval milestone writers (§11.1) on
``_ProgressWriter`` so Chunk 3's hook integration has a typed API
instead of stringly-typed ``write_agent_milestone`` calls, and the
per-task gate counter / per-container sliding-window rate limit /
denial-injection queue on ``PolicyEngine`` that §6.5 requires.
Why now: the hook work lands cleanly only after these surfaces exist
— every code path in ``pre_tool_use_hook``'s REQUIRE_APPROVAL branch
calls one of these helpers. Shipping them separately lets the hook
commit be about the state machine, not the event-shape bookkeeping.
Engine additions:
- ``approval_gate_count`` / ``increment_approval_gate_count``: the
per-task counter §12.9 bounds at ``approvalGateCap``. Session-scoped
in v1; persistence tracked in §17.
- ``approvals_in_last_minute`` / ``record_approval_gate_timestamp``:
sliding-window rate limit (20/min/container, §12.9). Prune on read
so callers see the current count without a separate tick.
- ``queue_denial_injection`` / ``drain_denial_injections``: queue
consumed by ``_denial_between_turns_hook`` at the next Stop seam
(§6.5). Reason is pre-sanitized upstream by ``DenyTaskFn``.
- ``mark_ceiling_shrinking_emitted``: emit-once latch for IMPL-26.
- ``APPROVAL_RATE_LIMIT`` / ``APPROVAL_RATE_WINDOW_S`` module consts
the hook imports rather than re-deriving.
Milestone writers (§11.1 table, 14 agent-emitted of 15):
- ``pre_approvals_loaded``, ``approval_requested``,
``approval_granted``, ``approval_denied``, ``approval_timed_out``,
``approval_stranded``, ``approval_write_failed``,
``approval_resume_failed``, ``approval_poll_degraded``,
``approval_timeout_capped``, ``approval_ceiling_shrinking``,
``approval_cap_exceeded``, ``approval_rate_limit_exceeded``,
``approval_late_win``.
- ``approval_decision_recorded`` (Lambda audit) and
``approval_timeout_capped_at_submit`` (CreateTaskFn) stay on the
Lambda side — Chunk 5 owns those.
Each helper is a thin wrapper over ``_put_event("agent_milestone",
...)`` so the shared circuit-breaker + classifier path (finding aws-samples#6/aws-samples#8)
continues to apply. Metadata keys mirror the §11.1 shapes verbatim
(``maxLifetime_remaining_s`` preserves the design-doc spelling for
downstream parsers).
Tests: +29 total. 17 on ``TestApprovalMilestoneHelpers`` pin the DDB
payload shape for each helper (including the two
``approval_timeout_capped`` reason variants — rule_annotation carries
matching_rule_ids; maxLifetime_ceiling omits the field). 12 on the
engine: counter monotonicity, rate-window prune semantics at window
boundary, denial-queue FIFO + drain-clears, ceiling-shrinking latch
idempotency.
No caller changes — engine and writer surfaces are additive. Hook
integration lands in commit C.
Lands the three read/discovery handlers Chunk 6 (CLI) needs to power
``bgagent pending``, ``bgagent policies list/show``, and
``bgagent notifications configure slack``. Completes §15.2 tasks
14, 15, and 25 (handler side).
Handlers
--------
``get-pending.ts`` (§7.7 — GET /v1/pending)
- Queries ``user_id-status-index`` GSI on TaskApprovalsTable with
``user_id = :caller AND status = :pending``. Without the GSI
this would be a full-table Scan per call — under
``watch -n1 bgagent pending`` that exhausts burst capacity for
the whole fleet (§10.1 finding aws-samples#8).
- Response maps each row to ``PendingApprovalSummary`` with a
derived ``expires_at = created_at + timeout_s`` so the CLI can
render time-to-timeout without doing arithmetic on ISO strings.
- Severity coerced to ``medium`` on unknown values so GSI writes
that drift from the enum don't break the list response.
- Rate-limited 10/min/user (synthetic row on the same table,
namespaced ``RATE#<user>#PENDING`` so it does not collide with
the approve/deny counter).
``get-policies.ts`` (§7.6 — GET /v1/repos/{repo_id}/policies)
- Combines ``BUILTIN_HARD_DENY_POLICIES`` + ``BUILTIN_SOFT_DENY_POLICIES``
with the repo's ``cedar_policies`` blueprint override. Runs the
combined text through ``parseRules`` and returns
``{hard[], soft[]}`` rule summaries.
- 5-minute per-repo in-Lambda cache; cold starts throw it away.
``_resetCacheForTests`` exposed for unit-test isolation.
- Repo ID is URL-decoded from the path (``owner%2Frepo`` common in
CLI UX).
- Rate-limited 30/min/user.
- Blueprint load failure falls back to built-ins with a WARN log;
invalid blueprint cedar text returns 503 ``SERVICE_UNAVAILABLE``
rather than a misleading empty list.
``link-slack-user.ts`` (§11.2 finding aws-samples#4 — POST /v1/notifications/slack/link)
- Writes to SlackUserMappingTable with
``ConditionExpression: attribute_not_exists(slack_user_id)``. This
guard is the entire admission control the §11.2 design hinges on:
even a compromised Slack admin cannot overwrite an existing
mapping.
- Validates ``slack_user_id`` shape (letters, digits, underscores,
2–40 chars) so junk rows cannot land.
- Conflict surface is 409 ``REQUEST_ALREADY_DECIDED`` — reused
error code (the payload message directs the user to unlink via
support).
- Slack link_token end-to-end validation against Slack OAuth is
deferred — v1 accepts the token on trust from the Cognito-authed
caller; it is persisted in CloudWatch for audit.
Supporting primitives
---------------------
``shared/builtin-policies.ts`` — mirrors ``agent/policies/hard_deny.cedar``
and ``agent/policies/soft_deny.cedar`` as TypeScript string constants.
Embedded rather than read from disk because Lambda's esbuild bundler
does not copy non-TS assets by default and a dedicated bundling hook
is more code than the embed. A drift test
(``builtin-policies.test.ts``) asserts byte-equality with the agent
files so any change on one side without the other flips red at build
time.
``shared/cedar-policy.ts`` — ``parseRules`` now skips the unannotated
``base_permit`` entry (both tiers need it as a Cedar catch-all; it
is not a user-facing rule so it stays out of ParsedRule[]). This
matches the agent-side ``_parse_policy_annotations`` behaviour.
Tests: +37 total.
- get-pending (8): 401 on missing auth, 429 on rate limit, empty
result, GSI query shape, row → PendingApprovalSummary with
derived expires_at, severity fallback, missing timeout → expires_at
falls back to created_at, 500 on DDB error.
- get-policies (11): 401/400 validation, built-in rules listed on
empty repo, URL-decoded repo path, custom blueprint rule lands
in soft, per-repo cache across calls, 429 rate limit, 503 on
invalid blueprint cedar, fallback on load failure, hard rules
omit severity / approval_timeout_s, soft rules carry them.
- link-slack-user (8): 401/400 validation, shape check, 201 on
success, 409 on overwrite attempt, 500 on unknown DDB error,
whitespace trim on slack_user_id, ConditionExpression verified.
- builtin-policies (4): drift byte-equality with both agent files,
parseRules round-trip for hard/soft rule IDs.
- cedar-policy (updated): ``base_permit`` is skipped from
ParsedRule[] rather than rejected.
Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout) lands in
the next commit.
…ws-samples#79 review aws-samples#8) Original set omitted documented Slack permission/scope failures. Codes outside the set fall to the retryable branch, so a misconfiguration like ``ekm_access_denied`` or ``missing_scope`` would burn 3 Lambda retries before DLQ on every event — even though the failure is fundamentally a configuration bug that no retry can clear. Adds: - Permission/scope: missing_scope, ekm_access_denied, team_access_not_granted, posting_to_general_channel_denied - Payload shape: invalid_arguments Reorganized the set into commented blocks (channel-shape, auth, permission/scope, payload-shape) so future additions go in the right bucket and the rationale stays visible. Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS set (21 codes) — every one must throw SlackApiError so the router swallows it. The existing retryable test.each remains intact and covers the negative-class case (codes outside the set throw a plain Error and escalate to retry).
#79) * feat(fanout): migrate SlackNotifyFn to FanOutConsumer subscriber (#64) Move the Slack outbound delivery off its own DynamoDB Streams consumer onto FanOutConsumer as a per-channel dispatcher. Drops TaskEventsTable from 2 concurrent stream readers to 1, restoring headroom for future channels (Email, Teams, etc.) without exceeding the documented DynamoDB Streams 2-reader-per-shard practical limit. The PR also addresses an adversarial code review on the original migration; the body below walks through each piece in the order it landed. ## (a) Migration - `cdk/src/handlers/slack-notify.ts` — rewritten as exported `dispatchSlackEvent(event, ddb)` plus a tagged `SlackApiError` class. The standalone `handler(event)` stream entrypoint is gone; the FanOutConsumer is now the only TaskEventsTable stream reader. Behaviour preserved bit-for-bit: channel_source==='slack' gate, terminal-event dedup via conditional UpdateItem on `channel_metadata.slack_notified_terminal`, threaded replies under the @mention or task_created message, emoji transitions (eyes -> hourglass -> ✅/❌/🚫/⏲), DM channel_id -> user_id rewrite, intermediate session+created message cleanup on terminal events. - `cdk/src/handlers/fanout-task-events.ts` — replaces the log-only `dispatchToSlack` stub with a wrapper that calls dispatchSlackEvent and routes errors via the new typed contract (see (b) below). Slack defaults gain task_created, session_started, task_timed_out so the router fans out the lifecycle events the old SlackNotifyFn handled; the dispatcher's channel_source gate keeps non-Slack tasks unaffected. - `cdk/src/constructs/fanout-consumer.ts` — adds a scoped `secretsmanager:GetSecretValue` grant on `bgagent/slack/*` so the fanout Lambda can fetch per-workspace bot tokens. Same scope the old SlackNotifyFn role held. - `cdk/src/constructs/slack-integration.ts` — deletes SlackNotifyFn, its DynamoEventSource, its IAM policy, and its NagSuppressions entry. Drops the now-unused StartingPosition / FilterCriteria / FilterRule / lambdaEventSources imports. After this lands, `aws lambda list-event-source-mappings` shows exactly one consumer of the TaskEventsTable stream (FanOutFn); verified on the dev stack with end-to-end @mention + cancel + CLI isolation scenarios. ## (b) Review fix #1 — partial-batch retry semantics (BLOCKER) The first review pass found that the post-migration handler silently dropped Slack-side infra errors (DDB throttle on the GetItem, Secrets Manager 5xx, transient Slack timeout). Pre-migration the SlackNotifyFn handler rethrew non-SlackApiError so Lambda retried the batch; post-migration `Promise.allSettled` swallowed the rejection and routeEvent returned an empty list with no escalation path to `batchItemFailures`. routeEvent's return type changed from `NotificationChannel[]` to `{ dispatched, infraRejections }`. The handler now pushes the record into `batchItemFailures` whenever `infraRejections.length>0`, so Lambda replays the record under the partial-batch contract. The warn line on rejection is tagged `retryable: true` so operators can alert distinctly from the channel-terminal swallow path. GitHub got the symmetric treatment: 4xx (excluding the existing 401 and 404 handling) is now treated as a channel-terminal swallow via `fanout.github.api_error` instead of escalating to retry. ## (c) Review fix #2 — split SlackApiError into terminal + retryable Originally any `!result.ok` Slack response was wrapped in SlackApiError and swallowed. That collapsed retryable codes (`ratelimited`, `service_unavailable`, `internal_error`, `fatal_error`, `request_timeout`) into the same swallow as `channel_not_found` — a tier-1 Slack outage would silently drop every message. Introduced `TERMINAL_SLACK_API_ERRORS` set + `classifySlackError` helper. Terminal codes still throw SlackApiError (router swallows). Retryable codes throw a plain Error so the router classifies them as infra rejections and Lambda replays. ## (d) Review fix #3 — NOTIFIABLE_EVENTS / CHANNEL_DEFAULTS drift The original migration added task_created/session_started/task_timed_out to CHANNEL_DEFAULTS.slack but the dispatcher's NOTIFIABLE_EVENTS gate already excluded several events the router was subscribing Slack to (agent_error, pr_created, task_stranded). Result: Slack was reported as `dispatched` for events it silently dropped — telemetry lied, agent_error never reached operators on Slack-origin tasks, and task_stranded rendered the generic "Event: task_stranded for owner/repo" fallback (UX regression). Added render cases for task_stranded and agent_error in slack-blocks.ts and added them to NOTIFIABLE_EVENTS. Forward-compat approval_required and status_response stay out of NOTIFIABLE_EVENTS until their emitters ship; a new cross-file consistency test in fanout-task-events.test.ts fails if anyone re-introduces the drift. The Slack dispatcher wrapper now passes `effectiveEventType` so an agent_milestone(pr_created) wrapper is unwrapped before NOTIFIABLE_EVENTS matching. Without the rewrite, the dispatcher would short-circuit on the wrapper string `agent_milestone`. ## (e) Review fix #4 — conditional UpdateItem on lifecycle persists Once the BLOCKER fix made batches retry, the original task_created and session_started UpdateItem calls became hazardous: a Slack POST that succeeded but whose follow-up UpdateItem failed transiently would, on retry, post a second root and overwrite slack_thread_ts — orphaning every threaded reply that had threaded under the first ts. Both UpdateItems now carry an `attribute_not_exists` ConditionExpression on the relevant `channel_metadata.slack_*_msg_ts`. On ConditionalCheckFailedException the handler logs at info, deletes the duplicate Slack message via `chat.delete`, and returns. Sibling retry wins the race; the duplicate is cleaned up. ## (f) Dev-stack regression: drop pr_created from Slack defaults Live verification surfaced a UX duplication: pr_created (subscribed in CHANNEL_DEFAULTS.slack as the original §6.2 design called for) and task_completed both rendered messages with View PR buttons, posted seconds apart. The original SlackNotifyFn had silently dropped pr_created (NOTIFIABLE_EVENTS gate), so users hadn't relied on it. Removed pr_created from CHANNEL_DEFAULTS.slack and from NOTIFIABLE_EVENTS, and removed the prCreatedMessage renderer. GitHub keeps pr_created (its edit-in-place comment surface genuinely benefits from the early checkpoint). ## Verification - mise //cdk:compile — clean - mise //cdk:test — 1183 / 1183 pass (8 net-new tests added for the review fixes: NOTIFIABLE_EVENTS drift guard, retryable Slack codes, GitHub 4xx swallow, infra rejection escalation, SlackApiError swallow, task_stranded render) - mise //cdk:eslint — clean - mise //cdk:synth — confirms exactly one Lambda::EventSourceMapping on TaskEventsTable, pointing at FanOutFn - Dev-stack scenarios — @mention happy path, Cancel button, CLI submit (channel_source=api -> zero Slack dispatches, GitHub edit-in-place still fires) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(fanout): retry GitHub 403/429 instead of swallowing as terminal (#79 review #1) PR #79 review found that the new 4xx terminal-swallow path treats HTTP 403 and 429 as channel-terminal — but on GitHub these are transient rate-limit responses (403 with "API rate limit exceeded", 429 "Too Many Requests"). Under a reconciliation wave that touches many tasks, an entire window of GitHub comment updates would be permanently dropped with only a warn log. Carve out 403 and 429 from the swallow guard so they propagate as infra rejections through ``Promise.allSettled``. The record lands in ``batchItemFailures`` and Lambda replays until the rate-limit window clears (or DLQs after ``retryAttempts``). Test coverage: parametrized over 403 + 429 with a GitHubCommentError mock at the helper boundary, asserting the record's eventID surfaces in ``batchItemFailures`` rather than being absorbed. * fix(fanout): guard Slack Secrets Manager grant on a prop (#79 review #2) Every other external-service grant in FanOutConsumer (taskTable, repoTable, githubTokenSecret) is gated by ``if (props.X)``, so a deployment that hasn't onboarded the corresponding service stays free of dangling IAM permissions. The original migration broke the pattern with an unconditional ``bgagent/slack/*`` Secrets Manager grant — dev stacks without Slack onboarding ended up holding read permission on a resource pattern they never use, with a misleading ``cdk-nag AwsSolutions-IAM5`` suppression reason. Adds an optional ``slackSecretArnPattern`` prop on ``FanOutConsumerProps``; the policy statement is only attached when the prop is set. ``cdk/src/stacks/agent.ts`` now computes the ``bgagent/slack/*`` ARN inline and passes it through, mirroring the other guarded props. ``ArnFormat`` and ``Stack`` imports moved out of fanout-consumer.ts since the construct no longer needs them. No changes to live behaviour — agent.ts always passes the prop, so the IAM policy still attaches in production. The dispatcher will log-and-fail-retry on a missing pattern (covered by review #3 fix). Test gap covering the construct itself ships in a follow-up commit (test gap #34). * fix(fanout): throw on missing TASK_TABLE_NAME env var (#79 review #3) Pre-fix: when ``TASK_TABLE_NAME`` was unset on a Slack-subscribed event, ``dispatchSlackEvent`` returned silently after a warn line. The router counted Slack as ``dispatched`` and a broken stack quietly dropped every Slack notification — operators only saw it in the warn-rate metric, with no rejected-channel signal. Post-fix: throw a plain Error so the rejection propagates as an infra rejection through ``Promise.allSettled``. The router pushes the record into ``batchItemFailures``, Lambda retries the batch, the ``fanout.dispatcher.rejected`` warn fires per record, and operators get a distinct alarm. Also bumps the existing log line from ``warn`` to ``error`` and attaches an ``error_id: FANOUT_SLACK_MISSING_TASK_TABLE`` so the deployment-bug case can be distinguished from per-record failures. Test: ``throws when TASK_TABLE_NAME env var is missing`` deletes the env var, asserts the throw, asserts no DDB call was attempted (env-var guard fires first). * fix(fanout): match SlackApiError by name as well as instanceof (#79 review #7) When a bundler ever duplicates the slack-notify module (rare with NodejsFunction tree-shaking but possible if dual-bundled), two distinct SlackApiError classes coexist and ``instanceof`` against one fails for instances of the other. The dispatcher would see a foreign-class SlackApiError, fall through to the rethrow branch, and the router would treat it as an infra rejection — flipping a channel-terminal swallow into infinite Lambda retries. Add an ``err.name === 'SlackApiError'`` fallback so the swallow branch fires either way. Mirrors the duck-typed ``GitHubCommentError`` check used elsewhere in the same handler. Test: synthesise a plain Error with name === 'SlackApiError' (NOT an instance of the mock's SlackApiError class) and assert batchItemFailures stays empty — proving the swallow path catches both shapes. * fix(fanout): extend TERMINAL_SLACK_API_ERRORS with permission codes (#79 review #8) Original set omitted documented Slack permission/scope failures. Codes outside the set fall to the retryable branch, so a misconfiguration like ``ekm_access_denied`` or ``missing_scope`` would burn 3 Lambda retries before DLQ on every event — even though the failure is fundamentally a configuration bug that no retry can clear. Adds: - Permission/scope: missing_scope, ekm_access_denied, team_access_not_granted, posting_to_general_channel_denied - Payload shape: invalid_arguments Reorganized the set into commented blocks (channel-shape, auth, permission/scope, payload-shape) so future additions go in the right bucket and the rationale stays visible. Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS set (21 codes) — every one must throw SlackApiError so the router swallows it. The existing retryable test.each remains intact and covers the negative-class case (codes outside the set throw a plain Error and escalate to retry). * fix(fanout): promote Slack reaction/delete network errors to error logs (#79 review #5) The reaction / delete helpers (``addReaction``, ``removeReaction``, ``deleteMessage``) used to log every catch at warn with a single generic event key, lumping API-level rejections (e.g. ``no_reaction``) together with infrastructure failures (DNS lookup, TLS handshake, fetch timeout, JSON parse error from a hostile gateway). Operators who alarmed on the warn rate saw a flat signal that masked genuine infra problems. Split the boundary: - API-level (``!result.ok`` after a successful HTTP call) stays at warn with channel-specific event keys (``fanout.slack.reaction_add_api_error``, ``fanout.slack.reaction_remove_api_error``, ``fanout.slack.message_delete_api_error``). These are per-message UX problems; operators don't page. - Network errors (the outer ``catch (err)`` after ``fetch``) promote to ``logger.error`` with dedicated event keys (``fanout.slack.reaction_add_network_error``, ``fanout.slack.reaction_remove_network_error``, ``fanout.slack.message_delete_network_error``) and ``error_id``s (``FANOUT_SLACK_REACTION_NETWORK``, ``FANOUT_SLACK_DELETE_NETWORK``) so each has its own alarmable signal. User-visible symptoms when these fire silently: stale emoji reactions (hourglass never swaps to ✅) and orphaned intermediate messages. Behaviour unchanged: errors are still swallowed (per-message reactions and intermediate cleanup are best-effort by design; they must not fail the batch), but operators now get distinct metrics for each failure class. * fix(fanout): emit fanout.slack.dup_delete_failed on ghost-message accumulation (#79 review #6) The conditional UpdateItem dup-delete path (``task_created`` / ``session_started`` lifecycle persists) calls ``deleteMessage`` to clean up the duplicate Slack message that landed when a sibling retry won the race. The delete is inherently best-effort — but if it fails, the duplicate becomes a permanent ghost in the thread and operators had no way to alarm on the rate. Refactor ``deleteMessage`` to return a boolean (``true`` on success or ``message_not_found``-as-already-gone, ``false`` otherwise) and emit a dedicated ``fanout.slack.dup_delete_failed`` event with an ``error_id: FANOUT_SLACK_DUP_DELETE_FAILED`` from the dup-delete callsites when the cleanup couldn't complete. The terminal-event cleanup paths (``slack_session_msg_ts``, ``slack_created_msg_ts``) intentionally don't fire this event — those paths target genuinely-stale UX cleanup, not retry-driven duplicates, so an alarm there would be noise. No new tests beyond the existing dup-delete coverage; the ``deleteMessage`` return value isn't yet asserted at the unit level, but the behavior is fully exercised by the existing ``dup-delete`` integration paths (test gap #31 will add an explicit failure-path assertion when it lands). * chore(fanout): tighten RouteOutcome arrays to ReadonlyArray (#79 review #9) ``RouteOutcome.dispatched`` and ``infraRejections`` were typed as plain ``NotificationChannel[]`` — which made ``readonly`` on the property prevent reassignment but still allow callers to mutate the underlying array via ``.push``, ``.splice``, or ``.sort``. Inconsistent with the ``ReadonlySet<string>`` used for ``CHANNEL_DEFAULTS`` in the same file. Tightening to ``ReadonlyArray<NotificationChannel>`` makes the contract honest: the router owns the arrays, callers read them. Test suite updated to use ``[...outcome.dispatched].sort()`` where it previously called ``.sort()`` directly — the explicit copy makes the intent clear and would have surfaced any silent test-side mutation. * refactor(fanout): make SlackDispatchEvent a type alias of FanOutEvent (#79 review #10) The two interfaces were structurally identical: same five fields, same readonly modifiers, same metadata shape. The decoupling was purely nominal and a silent-drift footgun — adding a field to ``FanOutEvent`` (e.g. when the router starts plumbing an ``approval_required`` ID through) would not flow into ``SlackDispatchEvent``, leaving the dispatcher unaware until a downstream test happened to fail. Replace with a one-line type alias: export type SlackDispatchEvent = FanOutEvent; The slack-notify module now type-imports ``FanOutEvent`` from fanout-task-events. ``import type`` is erased at compile time, so the runtime bundle still has the one-way dep (fanout-task-events → slack-notify) — no module-cycle hazard. Reviewer-suggested ``Pick<FanOutEvent, 'task_id' | …>`` was considered and rejected: the dispatcher uses every field of ``FanOutEvent``, so the Pick would just enumerate the same five fields with extra noise. A direct alias keeps the intent obvious and prevents drift identically. * fix(fanout): generalize Slack dedup to cover agent_error + log Retry-After (#79 review #4) PR #79 review #4 surfaced a sibling-channel-failure hazard: when GitHub or Email rate-limits, the record lands in ``batchItemFailures``. On the Lambda retry, every Slack-subscribed event for that record runs again. Terminal events were already guarded by ``slack_notified_terminal``; ``agent_error`` was not — operators would page twice on a single agent failure if a sibling channel happened to fail. Generalize the dedup mechanism. ``TERMINAL_EVENTS`` is replaced by a ``SLACK_DEDUP_ATTRIBUTE`` map that marks each event type with the ``channel_metadata`` attribute that should guard the post: - 5 terminals share ``slack_notified_terminal`` (any first-arriving terminal claims the right; subsequent terminals dedup against it) - ``agent_error`` gets its own ``slack_dispatched_agent_error`` so a duplicate agent_error doesn't reuse the terminal slot - ``task_created`` / ``session_started`` map to ``null`` because they already use the per-event ``slack_*_msg_ts`` conditional persists from review #1 — the conditional already provides full idempotency (a separate marker would be redundant) Also surfaces Slack's ``Retry-After`` header on rate-limited responses through a dedicated ``fanout.slack.retryable_api_error`` warn so operators reading CloudWatch can see the recovery window instead of guessing from sustained warn rate. Tests: - logs Retry-After header on rate-limited Slack responses (new): asserts ``retry_after_seconds`` propagates from Slack's response header into the warn metadata - existing terminal-codes parametrized test untouched (terminal branch doesn't read headers) - existing retryable test gains a ``headers: { get: () => null }`` stub on the fetch mock so the headers.get call doesn't crash Reviewer suggested a per-channel dispatch bitmap as the alternative. Rejected as premature: the duplicate-GitHub-PATCH is harmless (idempotent), Email is still a stub, and the dedup map covers the specific agent_error pain identified above. A bitmap would add a new table + IAM grants + per-dispatch DDB cost for a hypothetical problem (Slack rate-limiting AND a sibling channel failure). * test(fanout): conditional UpdateItem race + dup-delete coverage (#79 test gap) Adds 4 tests covering the lifecycle-persist conditional path that review fix #1 introduced and review fix #6 hardened. Pre-PR-#79 the only ConditionalCheckFailed coverage was the terminal-dedup path; the new lifecycle-persist + dup-delete code lacked direct assertions and was flagged 9/10 criticality by the reviewer. - task_created persist ConditionalCheckFailed → posts duplicate then deletes it: pins the cleanup behaviour that prevents ghost task_created posts in the channel - session_started persist ConditionalCheckFailed → posts duplicate then deletes it: parallel coverage for the other lifecycle attribute (slack_session_msg_ts) - dup-delete failure emits fanout.slack.dup_delete_failed with error_id: pins the operator-alarm signal added in review fix #6; asserts both the event key and the FANOUT_SLACK_DUP_DELETE_FAILED error_id propagate - chat.delete returning message_not_found is treated as success (no dup_delete_failed): negative-class assertion. Prevents false-positive alarms when the race resolves cleanly (the duplicate was already deleted by a prior retry). The ghost / message_not_found tests use ``fetchMock.mockImplementation`` URL-routing rather than ``.mockResolvedValueOnce`` chains because ``updateReaction`` issues 2-3 reaction-API fetches between chat.postMessage and chat.delete; routing by URL keeps the test focused on the load-bearing chat.delete behaviour without coupling to reaction call order. * test(fanout): cover task_stranded + agent_error renderers (#79 test gap #32) Pre-PR-#79 the new ``taskStrandedMessage`` and ``agentErrorMessage`` helpers in slack-blocks.ts had no direct unit tests. Reviewer flagged this as a 7/10 gap because the renderers carry the prior_status / error_type / message_preview metadata threaded through from the event source — silent drift in the metadata field names would produce ugly fallback messages in production. Adds 5 tests: - task_stranded WITH metadata renders the prior_status parenthetical (``Task stranded for org/repo (last status: RUNNING)``) so operators can tell at a glance whether the task hung in HYDRATING vs RUNNING — without the parenthetical the reviewer's "generic Event: ..." UX regression would resurface. - task_stranded WITHOUT metadata still renders cleanly (legacy events written before the reconciler started stamping metadata must not crash or leak ``undefined``). - agent_error with full metadata (error_type + message_preview) renders the rotating_light, type, and preview. - agent_error WITHOUT metadata stays sensible — no leaked ``undefined`` strings or empty ``_Type:_`` line. - agent_error truncates a 500-char message_preview to keep Slack channel UX readable. * test(fanout): cover agent_error dedup + dedup-slot isolation (#79 test gap #33) Pre-PR-#79 review-fix #4 there was no direct test for the ``slack_dispatched_agent_error`` dedup attribute or its interaction with the existing ``slack_notified_terminal`` slot. A future refactor that collapsed the two slots — or renamed one of them — would silently break the sibling-channel-failure-retry guarantee that fix #4 added. Adds 4 tests: - ``agent_error claims its own dedup attribute``: pins the UpdateExpression and ConditionExpression strings so a refactor that renames the attribute breaks loudly. - ``agent_error retry hits the dedup guard``: end-to-end scenario matching review #4 — task already has ``slack_dispatched_agent_error: true``, retry must short-circuit before chat.postMessage. Without the guard, a second rotating_light fires. - ``terminal dedup attribute is per-class``: a flaky task_completed-then-task_failed sequence dedups against the same ``slack_notified_terminal`` slot. Catches the regression where the orchestrator emits both terminal types and we'd otherwise post both ✅ and ❌. - ``agent_error and terminals use distinct dedup slots``: the important negative — having ``slack_dispatched_agent_error`` set must NOT shadow a subsequent ``task_completed``. Pins the slot separation so a future merge into a single slot can't silently drop terminals after an agent_error. * test(fanout): add construct-level tests for FanOutConsumer (#79 test gap #34) The construct shipped on issue #64 with no unit-level coverage of its IAM contract. The only synth-level signal lived inside ``slack-integration.test.ts`` ("0 EventSourceMapping") which proved the migration didn't regress the OTHER construct. Reviewer flagged this 6/10 — and the gap is what allowed review #2 (unconditional Slack secret grant) to slip through in the first place. Adds 6 tests: - ``attaches a single DynamoEventSource on the TaskEventsTable stream``: pins the architectural invariant — issue #64 was fundamentally about reaching exactly-one stream reader. Adding a second consumer must fail this test loudly. - ``creates a DLQ for the fanout Lambda``: pins retention period + presence; a DLQ-less deployment would silently drop poison-pill records past retryAttempts. - ``omits the bgagent/slack/* grant when slackSecretArnPattern is not provided``: the review #2 invariant. Iterates every IAM::Policy and asserts NONE of them grant secretsmanager:* on a bgagent/slack/* ARN. A regression that re-introduces the unconditional grant breaks this test. - ``attaches the bgagent/slack/* grant only when slackSecretArnPattern is provided``: the positive case. Pins the grant shape (action, effect, resource pattern). - ``passes TASK_TABLE_NAME env var when taskTable is provided``: review #3 dependency — the dispatcher throws on missing env. - ``omits TASK_TABLE_NAME env var when taskTable is not provided``: graceful degrade for dev stacks that haven't onboarded the TaskTable yet (matches the construct's documented contract). * test(fanout): cover task_stranded through terminal dedup (#79 test gap #35) The reconciler at handlers/reconcile-stranded-tasks.ts:170 emits BOTH ``task_stranded`` and ``task_failed`` for a heartbeat-expired task — one for the operator signal, one to drive the FAILED status transition. Pre-PR-#79 this pair had no test coverage; reviewer flagged this 8/10 because the visible failure mode (a paired "Task stranded" + "Task failed" double-page in Slack) would surface in production but be silent in CI. Adds 2 tests: - ``task_stranded posts and writes the terminal dedup marker on first arrival``: pins that task_stranded participates in the shared terminal slot and renders the warning message with metadata. Catches a regression that omits task_stranded from the dedup map entirely. - ``task_stranded after a sibling task_failed dedups``: the operational scenario — task_failed already claimed ``slack_notified_terminal``; the subsequent task_stranded must short-circuit before chat.postMessage. Without this guard, operators get the double-page the reviewer warned about. * fix(fanout): re-read TaskRecord before terminal cleanup to close orphan-message race Live observation during PR #79 review verification: the same Slack @mention happy path sometimes leaves the 🚀 task_created message in the thread (orphaned beside the ✅ task_completed) and sometimes deletes it cleanly. The race window: 1. ``task_created`` stream batch posts the rocket message and persists ``slack_created_msg_ts`` via the conditional UpdateItem introduced in PR #79 review fix #1. 2. ``task_completed`` stream batch fires ~30s later. Its initial GetItem races the prior UpdateItem and sees a stale ``channel_metadata`` WITHOUT ``slack_created_msg_ts``. 3. The terminal cleanup branch checks ``channelMeta.slack_created_msg_ts`` — undefined — silently skips the chat.delete. The rocket message stays in the thread. Add a fresh GetItem inside the TERMINAL_EVENTS cleanup branch, after the dedup UpdateItem has linearized our view of the table. Any prior ``slack_*_msg_ts`` writes are visible by then, so the cleanup fires correctly. On a re-read failure (DDB throttle / transient blip) we fall back to the dispatch-entry snapshot and emit ``fanout.slack.cleanup_reread_failed`` so operators can alarm on the rate. Pre-existing race (the unconditional UpdateItem in pre-PR-#79 was the same shape — wrote, GetItem on the next batch could miss it). PR #79 doesn't introduce it but doesn't fix it either; this commit does, since the live screenshot evidence appeared during review verification. Tests: - ``terminal cleanup re-reads TaskRecord``: scripts a stale dispatch-entry GetItem followed by a fresh re-read GetItem with ``slack_created_msg_ts`` present; asserts chat.delete fires against the freshly-read ts. - ``terminal cleanup falls back to dispatch-entry snapshot when re-read fails``: defense-in-depth — DDB throttle on the re-read must not break terminal delivery; cleanup uses the entry snapshot and emits the fallback warn. --------- Co-authored-by: bgagent <bgagent@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Alain Krok <alkrok@amazon.com>
Adds the 14 agent-side approval milestone writers (§11.1) on
``_ProgressWriter`` so Chunk 3's hook integration has a typed API
instead of stringly-typed ``write_agent_milestone`` calls, and the
per-task gate counter / per-container sliding-window rate limit /
denial-injection queue on ``PolicyEngine`` that §6.5 requires.
Why now: the hook work lands cleanly only after these surfaces exist
— every code path in ``pre_tool_use_hook``'s REQUIRE_APPROVAL branch
calls one of these helpers. Shipping them separately lets the hook
commit be about the state machine, not the event-shape bookkeeping.
Engine additions:
- ``approval_gate_count`` / ``increment_approval_gate_count``: the
per-task counter §12.9 bounds at ``approvalGateCap``. Session-scoped
in v1; persistence tracked in §17.
- ``approvals_in_last_minute`` / ``record_approval_gate_timestamp``:
sliding-window rate limit (20/min/container, §12.9). Prune on read
so callers see the current count without a separate tick.
- ``queue_denial_injection`` / ``drain_denial_injections``: queue
consumed by ``_denial_between_turns_hook`` at the next Stop seam
(§6.5). Reason is pre-sanitized upstream by ``DenyTaskFn``.
- ``mark_ceiling_shrinking_emitted``: emit-once latch for IMPL-26.
- ``APPROVAL_RATE_LIMIT`` / ``APPROVAL_RATE_WINDOW_S`` module consts
the hook imports rather than re-deriving.
Milestone writers (§11.1 table, 14 agent-emitted of 15):
- ``pre_approvals_loaded``, ``approval_requested``,
``approval_granted``, ``approval_denied``, ``approval_timed_out``,
``approval_stranded``, ``approval_write_failed``,
``approval_resume_failed``, ``approval_poll_degraded``,
``approval_timeout_capped``, ``approval_ceiling_shrinking``,
``approval_cap_exceeded``, ``approval_rate_limit_exceeded``,
``approval_late_win``.
- ``approval_decision_recorded`` (Lambda audit) and
``approval_timeout_capped_at_submit`` (CreateTaskFn) stay on the
Lambda side — Chunk 5 owns those.
Each helper is a thin wrapper over ``_put_event("agent_milestone",
...)`` so the shared circuit-breaker + classifier path (finding aws-samples#6/aws-samples#8)
continues to apply. Metadata keys mirror the §11.1 shapes verbatim
(``maxLifetime_remaining_s`` preserves the design-doc spelling for
downstream parsers).
Tests: +29 total. 17 on ``TestApprovalMilestoneHelpers`` pin the DDB
payload shape for each helper (including the two
``approval_timeout_capped`` reason variants — rule_annotation carries
matching_rule_ids; maxLifetime_ceiling omits the field). 12 on the
engine: counter monotonicity, rate-window prune semantics at window
boundary, denial-queue FIFO + drain-clears, ceiling-shrinking latch
idempotency.
No caller changes — engine and writer surfaces are additive. Hook
integration lands in commit C.
Lands the three read/discovery handlers Chunk 6 (CLI) needs to power
``bgagent pending``, ``bgagent policies list/show``, and
``bgagent notifications configure slack``. Completes §15.2 tasks
14, 15, and 25 (handler side).
Handlers
--------
``get-pending.ts`` (§7.7 — GET /v1/pending)
- Queries ``user_id-status-index`` GSI on TaskApprovalsTable with
``user_id = :caller AND status = :pending``. Without the GSI
this would be a full-table Scan per call — under
``watch -n1 bgagent pending`` that exhausts burst capacity for
the whole fleet (§10.1 finding aws-samples#8).
- Response maps each row to ``PendingApprovalSummary`` with a
derived ``expires_at = created_at + timeout_s`` so the CLI can
render time-to-timeout without doing arithmetic on ISO strings.
- Severity coerced to ``medium`` on unknown values so GSI writes
that drift from the enum don't break the list response.
- Rate-limited 10/min/user (synthetic row on the same table,
namespaced ``RATE#<user>#PENDING`` so it does not collide with
the approve/deny counter).
``get-policies.ts`` (§7.6 — GET /v1/repos/{repo_id}/policies)
- Combines ``BUILTIN_HARD_DENY_POLICIES`` + ``BUILTIN_SOFT_DENY_POLICIES``
with the repo's ``cedar_policies`` blueprint override. Runs the
combined text through ``parseRules`` and returns
``{hard[], soft[]}`` rule summaries.
- 5-minute per-repo in-Lambda cache; cold starts throw it away.
``_resetCacheForTests`` exposed for unit-test isolation.
- Repo ID is URL-decoded from the path (``owner%2Frepo`` common in
CLI UX).
- Rate-limited 30/min/user.
- Blueprint load failure falls back to built-ins with a WARN log;
invalid blueprint cedar text returns 503 ``SERVICE_UNAVAILABLE``
rather than a misleading empty list.
``link-slack-user.ts`` (§11.2 finding aws-samples#4 — POST /v1/notifications/slack/link)
- Writes to SlackUserMappingTable with
``ConditionExpression: attribute_not_exists(slack_user_id)``. This
guard is the entire admission control the §11.2 design hinges on:
even a compromised Slack admin cannot overwrite an existing
mapping.
- Validates ``slack_user_id`` shape (letters, digits, underscores,
2–40 chars) so junk rows cannot land.
- Conflict surface is 409 ``REQUEST_ALREADY_DECIDED`` — reused
error code (the payload message directs the user to unlink via
support).
- Slack link_token end-to-end validation against Slack OAuth is
deferred — v1 accepts the token on trust from the Cognito-authed
caller; it is persisted in CloudWatch for audit.
Supporting primitives
---------------------
``shared/builtin-policies.ts`` — mirrors ``agent/policies/hard_deny.cedar``
and ``agent/policies/soft_deny.cedar`` as TypeScript string constants.
Embedded rather than read from disk because Lambda's esbuild bundler
does not copy non-TS assets by default and a dedicated bundling hook
is more code than the embed. A drift test
(``builtin-policies.test.ts``) asserts byte-equality with the agent
files so any change on one side without the other flips red at build
time.
``shared/cedar-policy.ts`` — ``parseRules`` now skips the unannotated
``base_permit`` entry (both tiers need it as a Cedar catch-all; it
is not a user-facing rule so it stays out of ParsedRule[]). This
matches the agent-side ``_parse_policy_annotations`` behaviour.
Tests: +37 total.
- get-pending (8): 401 on missing auth, 429 on rate limit, empty
result, GSI query shape, row → PendingApprovalSummary with
derived expires_at, severity fallback, missing timeout → expires_at
falls back to created_at, 500 on DDB error.
- get-policies (11): 401/400 validation, built-in rules listed on
empty repo, URL-decoded repo path, custom blueprint rule lands
in soft, per-repo cache across calls, 429 rate limit, 503 on
invalid blueprint cedar, fallback on load failure, hard rules
omit severity / approval_timeout_s, soft rules carry them.
- link-slack-user (8): 401/400 validation, shape check, 201 on
success, 409 on overwrite attempt, 500 on unknown DDB error,
whitespace trim on slack_user_id, ConditionExpression verified.
- builtin-policies (4): drift byte-equality with both agent files,
parseRules round-trip for hard/soft rule IDs.
- cedar-policy (updated): ``base_permit`` is skipped from
ParsedRule[] rather than rejected.
Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout) lands in
the next commit.
* docs(cedar-hitl): restore and revise HITL gates design, fold adversarial findings
Design doc was accidentally removed in 0742ebe; restored from b34d7cd and
substantially revised under a new filename. "Phase 3" framing dropped — this
is the Cedar HITL approval gates feature.
- Renamed PHASE3_CEDAR_HITL.md → CEDAR_HITL_GATES.md; all "phase" gating
removed (Phase 3a/3b → v1 / future work §17).
- Integrated 16 findings from 2026-05-06 adversarial review with realistic
scenarios. Major structural changes:
- Decision #23 (new): cross-engine parity contract between cedarpy (agent,
Python) and @cedar-policy/cedar-wasm@4.10.0 (Lambda, TS).
- §11.2: SlackUserMappingTable with OAuth user-initiated mapping; severity-
gated Slack approvals; admin has no write path.
- §7.1/§12.3: ApproveTaskFn uses cross-table TransactWriteItems for atomicity.
- §10.1: user_id-status-index GSI on TaskApprovalsTable; v1 not v-later.
- §15.6: cedar-wasm as a Lambda layer shared across policy Lambdas.
- Gate-cap revision (2026-05-07): decision #13 — default 50, blueprint-
configurable via security.approvalGateCap (bounded 1–500), persisted on
TaskTable. Cache memory bound decoupled: 50-entry LRU regardless of cap.
IMPL-22 adds telemetry-driven re-evaluation criteria.
- Timeout adversarial+advocate pass (2026-05-07):
- §6.5 VM-throttle race fix: re-read row on failed TIMED_OUT
ConditionCheckFailed; honor APPROVED if user beat the timer. IMPL-24.
- Sub-120s @approval_timeout_s emits blueprint-load WARN. IMPL-25.
- User-visible timeout cap milestones (approval_timeout_capped_at_submit,
approval_ceiling_shrinking). IMPL-26.
- Runtime JWT: no refresh logic in agent/src/ (container uses IAM role);
ceiling stays min(1h, maxLifetime_remaining - 120s). IMPL-27.
- Three new CloudWatch metrics for timeout tuning. IMPL-28.
- §14.8 new: off-hours trade-off section (fail-closed is the invariant).
- §13.13 new: notification-delivery failure does NOT pause the timer
(bypass-prevention).
- Added six mermaid diagrams: three-outcome decision flow, end-to-end round-
trip, TaskApprovalsTable state machine, Slack user-mapping, fail-closed
decision flow, cross-engine parity check.
- Cross-references updated in INTERACTIVE_AGENTS.md and SECURITY.md.
- Starlight mirror regenerated via docs/scripts/sync-starlight.mjs.
No code changes in this commit — design work only. Implementation lands in a
follow-up PR per §15.2 task list.
* feat(cedar-hitl): pin Cedar engines and seed cross-engine parity contract
Chunk 1 of the Cedar HITL gates PR (docs/design/CEDAR_HITL_GATES.md).
Lays the foundation before engine rewrites in Chunk 2+: both Cedar engines
pinned exactly per decision #23, annotation surface validated by Day-1
spikes per decision #22, and the golden-file parity fixtures seeded so
every subsequent chunk can rely on the contract.
- Pin cedarpy==4.8.0 (agent) and @cedar-policy/cedar-wasm@4.10.0 (cdk)
exactly (no ^/~); document both in mise.toml header.
- Add agent/tests/test_cedarpy_annotations_contract.py (10 tests)
validating all 5 annotations round-trip verbatim via
policies_to_json_str() under staticPolicies.<id>.annotations.
- Add cdk/test/handlers/shared/cedar-policy.test.ts (12 tests) validating
policySetTextToParts + policyToJson extract the same annotations
verbatim and isAuthorized returns the documented {type, response}
wrapper shape.
- Add contracts/cedar-parity/ with 5 golden-file fixtures (single-match,
multi-match, hard-deny, soft-deny write, no-match default-allow) +
README documenting the contract. Every fixture policy carries a
@rule_id - including the base permit as @rule_id("base_permit") - so
the parity tests raise if either engine returns an unannotated match
instead of silently dropping it.
- Add agent/tests/test_cedar_parity.py (6 tests, cedarpy side) and
cdk/test/handlers/shared/cedar-parity.test.ts (6 tests, cedar-wasm
side) loading the shared fixtures and asserting (decision, sorted
rule_ids) match expected. Both tests hard-import cedarpy/cedar-wasm
so a dependency regression fails loud rather than silently skipping.
- Update docs/design/CEDAR_HITL_GATES.md sections 15.2 row 3, 15.6
prose and the parity mermaid diagram to point at contracts/cedar-parity/
(the precedent set by contracts/memory-hash-vectors.json) instead of a
new tests/fixtures/ dir. Regenerate the Starlight mirror.
- Add IMPL-29 noting the cedarpy diagnostics.reasons / cedar-wasm
diagnostics.reason naming asymmetry surfaced by the spikes; engine
code normalizes at the boundary.
- Fix rev-4 -> rev-5 cosmetic footer drift.
Test counts: agent 500 -> 516 (+16), cdk 1036 -> 1054 (+18), cli 190
unchanged. No production code changes in this chunk; engine rewrite
lands in Chunk 2.
Follow-up: separate chore issue to move contracts/memory-hash-vectors.json
into a self-named subdir for consistency with contracts/cedar-parity/.
* feat(cedar-hitl): three-outcome PolicyEngine core
Chunk 2 of the Cedar HITL gates PR. Rewrites agent/src/policy.py into
the three-outcome engine specified in docs/design/CEDAR_HITL_GATES.md
section 6. The REQUIRE_APPROVAL outcome is the human-in-the-loop surface
the next chunks (PreToolUse hook extension, REST API, CLI) plug into.
This chunk ships the engine and its load-time validation; no hook or
wire-format changes yet.
Engine:
- Outcome enum (ALLOW, DENY, REQUIRE_APPROVAL) + extended PolicyDecision
with .allowed backward-compat shim for Phase 1a/1b/2 callers. Custom
__init__ accepts both outcome= and legacy allowed= kwargs so existing
tests keep working verbatim.
- Three-outcome pipeline per section 6.2: hard-deny eval (absolute) ->
allowlist fast-path (tool_type/tool_group/bash_pattern/write_path/
all_session) -> recent-decision cache (60s TTL on DENIED/TIMED_OUT) ->
soft-deny eval (with post-eval rule-scope allowlist check and
blueprint_disable filtering) -> default ALLOW.
- ApprovalAllowlist (section 6.4): parses and matches every scope type.
Strips whitespace and rejects empty-after-strip values so
"tool_type: Read " normalizes instead of silently mismatching (review
finding 6).
- RecentDecisionCache (section 12.9): 50-entry LRU, INDEPENDENT of
approvalGateCap. Populated only on DENIED/TIMED_OUT. Session-scoped
(documented section 12.8 caveat).
- Annotation handling (sections 5.2 + 6.3): parses @rule_id, @tier,
@approval_timeout_s, @severity, @category via
cedarpy.policies_to_json_str(); merges on multi-match with min timeout
(clamped by 30s floor) and max severity.
- Load-time validation (sections 5.1, 12.4): rejects missing/mismatched
@tier, missing @rule_id, sub-floor timeouts, duplicate rule_ids across
tiers, blueprint text > 64 KB, disable entries naming built-in
hard-deny rules (finding 9), approval_gate_cap outside [1, 500]
(decision 13). Sub-120s @approval_timeout_s emits WARN but accepts
(IMPL-25).
- Fail-closed posture (section 13): cedarpy parse errors surface via
diagnostics.errors -> RuntimeError raised inside _eval_tier -> outer
handler returns DENY with reason "fail-closed: <ExceptionType>".
TypeError on json.dumps of unhashable tool_input surfaces as distinct
"fail-closed: unhashable_tool_input" reason (review finding 5).
Built-in policies:
- agent/policies/hard_deny.cedar: base_permit catch-all + rm_slash +
write_git_internals + write_git_internals_nested + drop_table +
pr_review-specific Write/Edit forbids (absolute).
- agent/policies/soft_deny.cedar: base_permit (catch-all required in
each tier so cedarpy default-deny does not convert no-match into
DENY) + force_push_any + force_push_main + push_to_protected_branch
+ write_env_files + write_credentials. All soft rules carry @tier,
@rule_id, @approval_timeout_s, @severity, @category per section 15.4
starter set.
Review findings addressed (1 blocker, 8 significant, plus minor):
- blueprint_disable actually disables soft rules at eval time instead
of silently no-op (the blocker: test coverage had been a silent-pass).
- Legacy extra_policies with @tier/@rule_id rejected to avoid undefined
double-annotation behavior.
- _matching_rule_ids logs WARN on unknown policy IDs (state-drift
signal).
- base_permit validator exemption restricted to effect=="permit" so
misnamed forbid rules cannot bypass validation (finding 7).
- Hard-tier Cedar no_decision logged at WARN (signals missing/malformed
base_permit catch-all).
- Allowlist whitespace normalization + empty-value rejection.
- StrEnum upgrade, Callable moved to TYPE_CHECKING, assert replaced
with explicit RuntimeError for S101 compliance.
Phase 1 compatibility:
- All 39 existing test_policy.py tests pass unchanged via the
.allowed property. One test (test_invalid_policy_syntax_fails_closed)
updated to patch _hard_policies instead of the removed _policies
attribute; docstring explains the rewrite.
- extra_policies kwarg preserved; callers with annotated rules must
migrate to blueprint_soft_policies / blueprint_hard_policies.
Test counts: agent 516 -> 576 (+60: 51 three-outcome + 9 regression
fixes). cli 190 unchanged. cdk 1054 unchanged.
Carry-forward to Chunk 3:
- extra_policies semantic shift (Phase 1 DENY -> Chunk 2
REQUIRE_APPROVAL); .allowed=False preserved but .outcome differs.
Switchover happens when hooks.py adopts the three-outcome branching.
- Cross-tier action-context asymmetry (review finding 8): document
rule-authoring constraint in section 5.5 of design.
- Probe entity-shape coverage (finding 10): extend _probe_cedar to
exercise Write/Edit/Bash action paths, not just invoke_tool.
* feat(cedar-hitl): approval milestone writers + engine counters
Adds the 14 agent-side approval milestone writers (§11.1) on
``_ProgressWriter`` so Chunk 3's hook integration has a typed API
instead of stringly-typed ``write_agent_milestone`` calls, and the
per-task gate counter / per-container sliding-window rate limit /
denial-injection queue on ``PolicyEngine`` that §6.5 requires.
Why now: the hook work lands cleanly only after these surfaces exist
— every code path in ``pre_tool_use_hook``'s REQUIRE_APPROVAL branch
calls one of these helpers. Shipping them separately lets the hook
commit be about the state machine, not the event-shape bookkeeping.
Engine additions:
- ``approval_gate_count`` / ``increment_approval_gate_count``: the
per-task counter §12.9 bounds at ``approvalGateCap``. Session-scoped
in v1; persistence tracked in §17.
- ``approvals_in_last_minute`` / ``record_approval_gate_timestamp``:
sliding-window rate limit (20/min/container, §12.9). Prune on read
so callers see the current count without a separate tick.
- ``queue_denial_injection`` / ``drain_denial_injections``: queue
consumed by ``_denial_between_turns_hook`` at the next Stop seam
(§6.5). Reason is pre-sanitized upstream by ``DenyTaskFn``.
- ``mark_ceiling_shrinking_emitted``: emit-once latch for IMPL-26.
- ``APPROVAL_RATE_LIMIT`` / ``APPROVAL_RATE_WINDOW_S`` module consts
the hook imports rather than re-deriving.
Milestone writers (§11.1 table, 14 agent-emitted of 15):
- ``pre_approvals_loaded``, ``approval_requested``,
``approval_granted``, ``approval_denied``, ``approval_timed_out``,
``approval_stranded``, ``approval_write_failed``,
``approval_resume_failed``, ``approval_poll_degraded``,
``approval_timeout_capped``, ``approval_ceiling_shrinking``,
``approval_cap_exceeded``, ``approval_rate_limit_exceeded``,
``approval_late_win``.
- ``approval_decision_recorded`` (Lambda audit) and
``approval_timeout_capped_at_submit`` (CreateTaskFn) stay on the
Lambda side — Chunk 5 owns those.
Each helper is a thin wrapper over ``_put_event("agent_milestone",
...)`` so the shared circuit-breaker + classifier path (finding #6/#8)
continues to apply. Metadata keys mirror the §11.1 shapes verbatim
(``maxLifetime_remaining_s`` preserves the design-doc spelling for
downstream parsers).
Tests: +29 total. 17 on ``TestApprovalMilestoneHelpers`` pin the DDB
payload shape for each helper (including the two
``approval_timeout_capped`` reason variants — rule_annotation carries
matching_rule_ids; maxLifetime_ceiling omits the field). 12 on the
engine: counter monotonicity, rate-window prune semantics at window
boundary, denial-queue FIFO + drain-clears, ceiling-shrinking latch
idempotency.
No caller changes — engine and writer surfaces are additive. Hook
integration lands in commit C.
* feat(cedar-hitl): TaskApprovals + AWAITING_APPROVAL transition primitives
Adds the four agent-side DDB primitives §6.5 + IMPL-24 need for the
three-outcome hook integration in the next commit:
- ``transact_write_approval_request`` — cross-table TransactWriteItems:
Put(TaskApprovalsTable) with ``attribute_not_exists(request_id)`` +
Update(TaskTable) gated on ``status = RUNNING``. Atomic per §12.3 so
a concurrent cancel cannot land the task in AWAITING_APPROVAL with
no matching approval row (or vice versa).
- ``transact_resume_from_approval`` — Update(TaskTable) gated on
``status = AWAITING_APPROVAL AND awaiting_approval_request_id =
:rid``. The ``request_id`` condition prevents resuming with a stale
ID after a reconciler race (§13.9).
- ``best_effort_update_approval_status`` — conditional UpdateItem on
the approval row with ``status = :pending`` guard. Returns False on
``ConditionalCheckFailedException``; this is the signal IMPL-24's
re-read path fires on (§6.5 pseudocode lines 846-879, §13.12).
- ``get_approval_row`` — GetItem with ``ConsistentRead=True`` by
default. Required by IMPL-24's re-read; kept opt-out (bool flag) for
future cold-path callers that don't need the strong read.
Errors:
- ``ApprovalTablesUnavailable`` for env-var-missing — raised loud so
a pre-Chunk-4 deploy fails closed (hook will map to DENY) rather
than silently no-op'ing the gate.
- ``ApprovalWriteError`` / ``ApprovalResumeError`` wrap
``TransactionCanceledException`` with the cancellation reasons
list. The hook uses these to distinguish the "concurrent cancel"
branch from real DDB outages.
- ``ConditionalCheckFailedException`` on ``update_item`` is consumed
and returned as ``False`` from ``best_effort_update_approval_status``
— the caller (hook) needs the boolean to decide whether to
re-read, not to propagate.
- All other DDB errors propagate so the hook's outer try/except can
classify fail-closed with a specific reason.
Implementation notes:
- Uses ``boto3.client("dynamodb")`` low-level API (not resource).
``transact_write_items`` lives on the client, and marshalling the
approval row attributes explicitly gives deterministic DDB shapes
that the tests can assert on. ``_py_to_ddb_attr`` covers the
subset of Python types §10.1 actually uses (str/int/bool/None/list
of str); any other type raises TypeError loudly rather than
silently writing something unexpected.
- ``_extract_error_code`` / ``_extract_cancellation_reasons`` duck-type
on ``exc.response`` so we don't need botocore at import time (tests
use a minimal exception class).
- Errors from unsupported types (floats, dicts, etc.) are caught
BEFORE the DDB round-trip so the unit-test asserts
``transact_write_items`` was not called — catches schema drift
early.
- Status constants (``_STATUS_RUNNING`` / ``_STATUS_AWAITING_APPROVAL``)
named so a rename in CDK cannot silently diverge the Python path.
Tests: +20 total.
- 5 on TransactWriteApprovalRequest: env-missing, happy-path shape
assertion (both items + conditions), TransactionCanceled → ApprovalWriteError
with reasons preserved, other errors propagate, unsupported type rejected
before any DDB call.
- 3 on TransactResumeFromApproval: env-missing, happy-path expression
shape (includes REMOVE awaiting_approval_request_id), cancel →
ApprovalResumeError.
- 4 on BestEffortUpdateApprovalStatus: happy path returns True,
``reason`` kwarg attaches ``deny_reason``, ConditionalCheckFailed
returns False (IMPL-24's signal), other errors propagate.
- 4 on GetApprovalRow: ConsistentRead default True, opt-out False,
row-not-found returns None, row unmarshalling through every
supported DDB attribute type.
- 4 on helpers: error-code extraction with and without
ClientError-shape, cancellation-reasons extraction with and without.
No runtime callers yet — hook integration lands in commit C. Physical
TaskApprovalsTable lands in Chunk 4; Python side is wire-compatible so
the hook work can be unit-tested today with mocked clients.
* feat(cedar-hitl): PreToolUse three-outcome REQUIRE_APPROVAL path
Wires the agent to the full §6.5 pseudocode: cap + rate-limit check,
atomic TransactWriteItems for pending row + TaskTable AWAITING_APPROVAL,
2s→5s ConsistentRead poll, IMPL-24 VM-throttle race re-read, resume
transition, scope propagation to allowlist, and denial-injection queue
consumed at the next Stop seam. Completes §15.2 rows 26 + 27.
Hook control flow (three outcomes)
----------------------------------
- ALLOW / DENY: existing Phase 1 behavior, now switching on
``.outcome`` rather than ``.allowed``. Legacy Phase 1/2 tests still
green because PolicyDecision preserves the ``.allowed`` shim.
- REQUIRE_APPROVAL (new): extracted into ``_handle_require_approval``
for readability. Delegates to ``task_state`` primitives and
``engine.*`` counter surfaces from the prior commits; no new DDB
client construction here.
Key pieces:
- ``_compute_effective_timeout`` applies the §6.5 min(rule, default,
lifetime) formula. The engine's ``_merge_annotations`` has already
clipped decision.timeout_s against the task default; the hook adds
the remaining-lifetime ceiling and floors at FLOOR_30S.
``clip_reason`` distinguishes ``rule_annotation`` (rule was tighter
than task default) from ``maxLifetime_ceiling`` (task is late in
its life) so ``approval_timeout_capped`` carries the right reason.
- ``_remaining_maxlifetime_s`` reads ``AGENTCORE_MAX_LIFETIME_S`` +
``TASK_STARTED_AT`` env vars (8h default). Returns ``None`` when
the start timestamp is absent — the hook treats that as "unknown,
don't clip" rather than pre-DENYing, so Phase 1 test paths that
don't set the env var still see the old task-default behaviour.
Chunk 4/5 will wire these at task launch.
- ``_poll_for_decision`` uses 2s cadence for the first 30s then 5s
(IMPL-12). All polls use ``ConsistentRead=True`` per IMPL-24. 3
consecutive GetItem failures emit ``approval_poll_degraded``; 10
consecutive failures fall through as TIMED_OUT with a specific
reason (§13.2).
- ``_reconcile_late_decision`` implements IMPL-24 re-read: on a
ConditionCheckFailed from the TIMED_OUT write, re-read with
ConsistentRead. APPROVED → rebuild outcome, propagate scope to
allowlist, run normal allow flow, emit ``approval_late_win``.
DENIED → honor the user's sanitized reason. PENDING or row gone
→ fall through with TIMED_OUT (fail-closed, §13.12 last paragraph).
Cancel-wins semantics (finding #2)
----------------------------------
``_denial_between_turns_hook`` is registered AFTER
``_nudge_between_turns_hook`` in ``between_turns_hooks`` so cancel
short-circuits both. The hook re-checks ``_cancel_requested`` itself
as belt-and-braces (matching the nudge hook) so a future reorder does
not silently break cancel-wins. Denial queue is PRESERVED on cancel —
not drained — so a denial still sitting on the queue when the task is
being torn down does not leak across tasks (the engine is per-task
per §IMPL-7).
``stop_hook`` threads ``engine`` into ``ctx`` so the denial hook can
``drain_denial_injections``. ``build_hook_matchers`` accepts a new
``user_id`` kwarg (§12.2) so approval rows carry caller identity for
the REST side's ownership check.
``permissionDecisionReason`` guaranteed surface
-----------------------------------------------
The hook's deny return is the ONLY guaranteed surface the SDK emits
to the agent; denial injection is best-effort (pre-empted by cancel).
``_deny_response`` pipes every reason through ``_strip_ansi`` +
``_truncate(500)``: ANSI sequences can never reach the model, and the
line stays loggable. §12.7 requirement.
Tests: +24 agent hook tests (47 total in test_hooks.py). Run in 0.92s
via a ``_fast_poll`` fixture that collapses ``asyncio.sleep`` to a
no-op AND advances ``hooks.time.monotonic`` by the requested duration
so the poll wall-clock deadline actually trips.
Happy paths:
- APPROVED + scope propagation to allowlist + milestones.
- APPROVED with scope=this_call does NOT grow allowlist.
- DENIED queues denial injection + populates recent-decision cache
(next identical call auto-denies).
- TIMED_OUT writes TIMED_OUT row and emits approval_timed_out.
IMPL-24 race: four branches.
- APPROVED re-read → allow flow, approval_late_win milestone, scope
propagated, resume succeeds.
- DENIED re-read → deny flow, approval_late_win milestone, user's
reason is the permissionDecisionReason.
- Still-PENDING re-read → fail-closed fall-through (no late_win).
- Row-gone re-read → same fail-closed fall-through.
Cap / rate-limit / write failure / resume failure branches all:
- Short-circuit before any DDB write when the local guard fires
(cap, rate limit).
- Emit the right approval_* milestone.
- Return DENY with a specific permissionDecisionReason.
Sanitization:
- ANSI stripped from deny reason.
- Deny reason truncated to ≤500 chars.
Timeout clipping:
- rule_annotation reason when a rule's approval_timeout_s is below
the task default; matching_rule_ids populated.
- maxLifetime_ceiling reason when remaining lifetime is the tightest
bound; matching_rule_ids is None.
- approval_ceiling_shrinking emits exactly once per task (IMPL-26
latch).
Denial injection hook (6 tests):
- Draining produces a <user_denial request_id=... decided_at=...>
block with XML-escaped reason.
- Cancel short-circuit preserves the queue so the denial is not
lost; just not injected into a dying agent.
- Hostile reason (</user_denial>...<user_nudge>) is XML-escaped so
the envelope cannot be forged.
- No-engine ctx returns [] (Phase 1 call sites still work).
- Registered LAST in ``between_turns_hooks`` (invariant for §6.5
finding #2).
- End-to-end via stop_hook: queued denial becomes
``decision=block`` + reason on the Stop return.
Carry-forward
-------------
- ``_remaining_maxlifetime_s`` returns None when TASK_STARTED_AT is
unset — Chunk 4/5 will wire this at task launch. Tracked in §16.
- ``approval_gate_count`` lives on the engine (session-scoped) not on
TaskTable in v1. §13.6 notes that the reconciler + approval_gate_cap
still bound worst-case across container restarts. Chunk 7+ tracks
persistence when telemetry justifies it.
- Denial injection emits a ``user_denial_injected`` milestone that is
NOT in the §11.1 enumerated table. It mirrors ``nudge_acknowledged``
for stream visibility; keep the name distinct from the ``approval_*``
prefix so future §11.1 consumers can't confuse it with an approval
outcome.
* feat(cedar-hitl): TaskApprovalsTable + SlackUserMapping + status enum
Lands the stateless CDK primitives for Cedar-HITL approval gates so
Chunk 5's REST handlers can be wired onto concrete tables. Completes
§15.2 tasks 9, 20, and 25.
Constructs
----------
``TaskApprovalsTable`` (§10.1)
- PK ``task_id`` + SK ``request_id`` (ULID). Matches the agent-side
primitives landed in the prior commit.
- GSI ``user_id-status-index`` with user_id PK + status SK and an
``INCLUDE`` projection limited to the fields GET /v1/pending
renders. Three deny-sensitive attrs (``deny_reason``, ``scope``,
``tool_input_sha256``) deliberately omitted from the projection —
the list endpoint only returns PENDING rows in practice, but
excluding them kills the projection-leak concern outright and
costs no bytes today.
- Exports ``USER_STATUS_INDEX_NAME`` as a module constant + mirrors
it on ``construct.userStatusIndexName`` so handlers referencing
the GSI fail compile-time on a rename.
- TTL attribute ``ttl`` (agent writes ``created_at + timeout_s +
120s``).
- No DynamoDB streams per §11.2. TaskEventsTable carries the audit
fan-out; streams here would duplicate.
- Default RemovalPolicy.DESTROY to match the rest of the sample.
Production deploys override to RETAIN per §10.1.
``SlackUserMappingTable`` (§11.2, finding #4)
- Single-key (``slack_user_id`` PK). No SK, no TTL, no GSI, no
stream. The forward-only shape is the trust boundary — a reverse
GSI (Cognito → Slack) would let a compromised Cognito sub
enumerate Slack identities without adding v1 capability.
- Writes land through LinkSlackUserFn (Chunk 5) which enforces the
``attribute_not_exists(slack_user_id)`` condition so a prior
legitimate mapping cannot be overwritten by a later compromise.
``task-status.ts`` — AWAITING_APPROVAL (§10.3)
- Added to TaskStatus enum + ACTIVE_STATUSES (NOT TERMINAL_STATUSES:
the task is alive, paused on a human decision).
- VALID_TRANSITIONS wires the five edges §10.3 enumerates:
RUNNING → AWAITING_APPROVAL (soft-deny entry)
HYDRATING → AWAITING_APPROVAL (rare early-gate case)
AWAITING_APPROVAL → RUNNING (approve / deny resume)
AWAITING_APPROVAL → CANCELLED (user cancel mid-approval)
AWAITING_APPROVAL → FAILED (stranded-approval reconciler)
- Notably NOT added:
AWAITING_APPROVAL → FINALIZING (approve-during-cleanup race)
AWAITING_APPROVAL → COMPLETED (skip RUNNING)
AWAITING_APPROVAL → TIMED_OUT (timer lives on the approval
row, not the task clock)
These are regression tests so a future refactor cannot quietly
add them and bypass the `awaiting_approval_request_id = :rid`
invariant.
Tests: +29 total.
- TaskApprovalsTable (11 tests): PK/SK schema, PAY_PER_REQUEST,
PITR default + override, TTL attribute, NO streams, GSI schema +
projection + sensitive-attr exclusion, removal policy default +
override, ``USER_STATUS_INDEX_NAME`` constant parity with the
construct field.
- SlackUserMappingTable (8 tests): single-key schema (explicit
KeySchema length assertion), PAY_PER_REQUEST, PITR, no streams,
no reverse GSI, DESTROY default, TTL absent.
- TaskStatus (+10 tests over existing: 5 new assertions on the
9-state cardinality, AWAITING_APPROVAL membership, and the
transition graph including the three forbidden edges). The
existing assertions updated for the new state count.
No stack wiring yet — ``agent.ts`` instantiation + env var plumbing +
grants land in the next commit alongside the Cedar-WASM Lambda layer.
* feat(cedar-hitl): Cedar-wasm layer + wire approval tables into agent stack
Activates the agent-side approval path and ships the Lambda layer
Chunk 5's REST handlers need.
Cedar-wasm Lambda layer (§15.2 task 10)
----------------------------------------
``CedarWasmLayer`` bundles ``@cedar-policy/cedar-wasm@4.10.0`` into
``/opt/nodejs/node_modules/`` so Lambdas can
``require('@cedar-policy/cedar-wasm/nodejs')`` without shipping the
4 MB wasm binary in every function package. A dedicated
``cdk/layers/cedar-wasm/`` directory carries a minimal ``package.json``
pinning the exact version — bundling runs ``npm install --omit=dev``
against that manifest, so the layer build is hermetic from any
``cdk/node_modules/`` drift.
The bundler has two fallbacks:
- Docker (``public.ecr.aws/sam/build-nodejs22.x``) for CI / prod
deploys.
- Local-npm fallback for environments without Docker (unit-test
synths + `cdk synth` on runners that lack Docker). The local
path is safe here because the layer ships pure JS + a prebuilt
wasm binary — no native build step.
Three constants exposed from the module:
- ``CEDAR_WASM_VERSION`` — single source of truth for the pinned
version; tests assert this matches both ``cdk/package.json`` and
the layer manifest, so the three places the version lives stay
in sync.
- ``CEDAR_WASM_MIN_LAMBDA_MEMORY_MB`` — 512 MB floor for attaching
Lambdas per §15.2 task 10.
- ``CedarWasmLayer.layer`` — the underlying ``LayerVersion`` for
Chunk 5 handlers to attach via ``fn.addLayers(...)``.
Agent stack wiring (§15.2 task 19)
------------------------------------
``agent.ts`` now instantiates:
- ``TaskApprovalsTable`` (prior commit) — grants RW to the runtime
so ``pre_tool_use_hook`` can TransactWriteItems + ConsistentRead
the PENDING row.
- ``SlackUserMappingTable`` (prior commit) — not granted to the
runtime; only the link-user Lambda (Chunk 5) writes here.
- ``CedarWasmLayer`` — the layer's asset lands in the synthed
template so Chunk 5 handlers can reference ``.layer`` without
causing a new asset on their deploy.
New runtime env vars:
- ``TASK_APPROVALS_TABLE_NAME`` — consumed by
``task_state._require_tables``; its absence previously raised
``ApprovalTablesUnavailable`` → hook DENY. Now set, so the
approval path is live on deploy.
- ``AGENTCORE_MAX_LIFETIME_S = '28800'`` — 8 hours, matching
``lifecycleConfiguration.maxLifetime``. Consumed by the hook's
``_remaining_maxlifetime_s`` for the maxLifetime ceiling clip
(§6.5). Kept in sync with the lifecycle via a direct test
assertion so drift surfaces at build time.
New CfnOutputs: ``TaskApprovalsTableName``, ``SlackUserMappingTableName``,
``CedarWasmLayerArn``. Each is useful for post-deploy smoke tests
(`aws dynamodb describe-table` / `aws lambda get-layer-version`).
Tests: +8 layer tests + 9 agent-stack assertions.
Layer:
- LayerVersion resource count.
- Compatible runtimes (nodejs20/22).
- Description carries the pinned version.
- CEDAR_WASM_VERSION matches ``cdk/package.json``.
- CEDAR_WASM_VERSION matches ``layers/cedar-wasm/package.json``.
- CEDAR_WASM_MIN_LAMBDA_MEMORY_MB ≥ 512.
- Custom description override works.
- ``.layer`` exposes a real ``LayerVersion``.
Agent stack:
- Table count updated from 6 → 8.
- TaskApprovalsTable schema match (task_id PK / request_id SK,
user_id-status-index GSI presence).
- SlackUserMappingTable single-key schema.
- LayerVersion count + compatibleRuntimes.
- Three new CfnOutputs present.
- TASK_APPROVALS_TABLE_NAME env var on the runtime.
- AGENTCORE_MAX_LIFETIME_S == '28800' (drift guard).
Carry-forward
-------------
- ``TASK_STARTED_AT`` is the other input the hook's
``_remaining_maxlifetime_s`` consumes — it's a PER-TASK value the
orchestrator must stamp at invocation time, not a stack-level env
var. Chunk 5's orchestrator changes need to add it to the runtime
invocation payload / session env. For now the hook's fallback
("unknown, don't clip") keeps approvals functional.
- Chunk 5 will attach the CedarWasmLayer onto ApproveTaskFn,
DenyTaskFn, GetPoliciesFn, CreateTaskFn and assert
``memorySize >= CEDAR_WASM_MIN_LAMBDA_MEMORY_MB`` for each.
* feat(cedar-hitl): approve + deny handlers + shared types (§7.1, §7.2)
Lands the two user-facing REST handlers that flip a PENDING approval
row to APPROVED / DENIED, the shared types both call sites and the
CLI consume, and the Lambda-side Cedar parser future Chunk-5 handlers
(get-policies, create-task validation) will use.
Wire types (shared/types.ts)
----------------------------
- ApprovalScope union covering every shape the agent's
ApprovalAllowlist understands. Typed so approve-task / create-task /
CLI (Chunk 6) all agree at compile time.
- ApprovalRecord / ApprovalRequest / ApprovalResponse / DenyRequest /
DenyResponse / PendingApprovalSummary / GetPendingResponse /
PolicyRuleSummary / GetPoliciesResponse / LinkSlackUserRequest /
LinkSlackUserResponse / SlackUserMappingRecord /
ApprovalDecisionRecordedEvent / CreateTaskApprovalExtensions.
- Constants: DENY_REASON_MAX_LENGTH=2000, INITIAL_APPROVALS_MAX_ENTRIES=20,
INITIAL_APPROVALS_MAX_ENTRY_LENGTH=128, APPROVAL_TIMEOUT_S_MIN=30,
APPROVAL_TIMEOUT_S_MAX=3600, APPROVAL_TIMEOUT_S_DEFAULT=300.
New error codes: REQUEST_NOT_FOUND, REQUEST_ALREADY_DECIDED,
TASK_NOT_AWAITING_APPROVAL.
Shared helpers
--------------
- shared/approval-scope.ts — parseApprovalScope validates every shape;
rejects unknown tool types / groups / prefixes, empty values,
over-128-char strings. isDegeneratePattern implements §7.4 (length
≤ 2, all-wildcard, wildcard ratio > 50%) for Chunk-5 create-task.
- shared/deny-reason-scanner.ts — scanDenyReason redacts AWS keys,
GitHub PATs (classic + fine-grained), Slack tokens, PEM blocks,
bearer tokens with [REDACTED-...] markers. Mirrors
agent/src/output_scanner.py so the deny reason the agent
ultimately reads is never raw user input.
- shared/cedar-policy.ts — parseRules pulls the five HITL annotations
(tier/rule_id/severity/approval_timeout_s/category) into a
ParsedRule[], preserving positional policy_id for IMPL-29
diagnostics-to-rule_id resolution. isHardDenyRule, isValidRuleId,
matchingRuleIds, concatPolicies exposed for future handlers.
Handlers
--------
- approve-task.ts (§7.1) — POST /v1/tasks/{task_id}/approve
- Cross-table TransactWriteItems: approval row PENDING → APPROVED
guarded by user_id = :caller AND status = :pending; TaskTable
no-op Update guarded by status = AWAITING_APPROVAL AND
awaiting_approval_request_id = :rid.
- TransactionCanceledException classified by per-item
CancellationReasons. Approval-row failure collapses to 404
REQUEST_NOT_FOUND (no existence oracle per §7.1 finding #6);
task-row failure → 409 TASK_NOT_AWAITING_APPROVAL.
- Optional scope defaults to this_call.
- Per-user per-minute rate limit (30/min, synthetic row).
- Writes approval_decision_recorded audit event (IMPL-6). Audit
failure is logged but does not fail the request — decision is
already committed.
- deny-task.ts (§7.2) — POST /v1/tasks/{task_id}/deny
- Same cross-table pattern; status → DENIED + deny_reason.
- Reason is scanDenyReason-sanitized + truncated to
DENY_REASON_MAX_LENGTH BEFORE any persistence — agent and audit
both read sanitized form; raw input never stored.
- Same rate-limit namespace as approve.
Tests: +64 total (cedar-policy-parser 24, approval-scope 28,
deny-reason-scanner 13, approve-task 14, deny-task 9). Secret test
fixtures are assembled from string fragments so the source never
holds a contiguous secret literal — Code Defender pre-commit hook
otherwise blocks.
Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout +
LinkSlackUserFn + GetPolicies + GetPending) lands in the next
commit.
* feat(cedar-hitl): get-pending + get-policies + link-slack-user handlers
Lands the three read/discovery handlers Chunk 6 (CLI) needs to power
``bgagent pending``, ``bgagent policies list/show``, and
``bgagent notifications configure slack``. Completes §15.2 tasks
14, 15, and 25 (handler side).
Handlers
--------
``get-pending.ts`` (§7.7 — GET /v1/pending)
- Queries ``user_id-status-index`` GSI on TaskApprovalsTable with
``user_id = :caller AND status = :pending``. Without the GSI
this would be a full-table Scan per call — under
``watch -n1 bgagent pending`` that exhausts burst capacity for
the whole fleet (§10.1 finding #8).
- Response maps each row to ``PendingApprovalSummary`` with a
derived ``expires_at = created_at + timeout_s`` so the CLI can
render time-to-timeout without doing arithmetic on ISO strings.
- Severity coerced to ``medium`` on unknown values so GSI writes
that drift from the enum don't break the list response.
- Rate-limited 10/min/user (synthetic row on the same table,
namespaced ``RATE#<user>#PENDING`` so it does not collide with
the approve/deny counter).
``get-policies.ts`` (§7.6 — GET /v1/repos/{repo_id}/policies)
- Combines ``BUILTIN_HARD_DENY_POLICIES`` + ``BUILTIN_SOFT_DENY_POLICIES``
with the repo's ``cedar_policies`` blueprint override. Runs the
combined text through ``parseRules`` and returns
``{hard[], soft[]}`` rule summaries.
- 5-minute per-repo in-Lambda cache; cold starts throw it away.
``_resetCacheForTests`` exposed for unit-test isolation.
- Repo ID is URL-decoded from the path (``owner%2Frepo`` common in
CLI UX).
- Rate-limited 30/min/user.
- Blueprint load failure falls back to built-ins with a WARN log;
invalid blueprint cedar text returns 503 ``SERVICE_UNAVAILABLE``
rather than a misleading empty list.
``link-slack-user.ts`` (§11.2 finding #4 — POST /v1/notifications/slack/link)
- Writes to SlackUserMappingTable with
``ConditionExpression: attribute_not_exists(slack_user_id)``. This
guard is the entire admission control the §11.2 design hinges on:
even a compromised Slack admin cannot overwrite an existing
mapping.
- Validates ``slack_user_id`` shape (letters, digits, underscores,
2–40 chars) so junk rows cannot land.
- Conflict surface is 409 ``REQUEST_ALREADY_DECIDED`` — reused
error code (the payload message directs the user to unlink via
support).
- Slack link_token end-to-end validation against Slack OAuth is
deferred — v1 accepts the token on trust from the Cognito-authed
caller; it is persisted in CloudWatch for audit.
Supporting primitives
---------------------
``shared/builtin-policies.ts`` — mirrors ``agent/policies/hard_deny.cedar``
and ``agent/policies/soft_deny.cedar`` as TypeScript string constants.
Embedded rather than read from disk because Lambda's esbuild bundler
does not copy non-TS assets by default and a dedicated bundling hook
is more code than the embed. A drift test
(``builtin-policies.test.ts``) asserts byte-equality with the agent
files so any change on one side without the other flips red at build
time.
``shared/cedar-policy.ts`` — ``parseRules`` now skips the unannotated
``base_permit`` entry (both tiers need it as a Cedar catch-all; it
is not a user-facing rule so it stays out of ParsedRule[]). This
matches the agent-side ``_parse_policy_annotations`` behaviour.
Tests: +37 total.
- get-pending (8): 401 on missing auth, 429 on rate limit, empty
result, GSI query shape, row → PendingApprovalSummary with
derived expires_at, severity fallback, missing timeout → expires_at
falls back to created_at, 500 on DDB error.
- get-policies (11): 401/400 validation, built-in rules listed on
empty repo, URL-decoded repo path, custom blueprint rule lands
in soft, per-repo cache across calls, 429 rate limit, 503 on
invalid blueprint cedar, fallback on load failure, hard rules
omit severity / approval_timeout_s, soft rules carry them.
- link-slack-user (8): 401/400 validation, shape check, 201 on
success, 409 on overwrite attempt, 500 on unknown DDB error,
whitespace trim on slack_user_id, ConditionExpression verified.
- builtin-policies (4): drift byte-equality with both agent files,
parseRules round-trip for hard/soft rule IDs.
- cedar-policy (updated): ``base_permit`` is skipped from
ParsedRule[] rather than rejected.
Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout) lands in
the next commit.
* feat(cedar-hitl): wire Chunk 5 routes + orchestrator + reconciler + agent plumbing
Completes Chunk 5 end-to-end: the five new Lambdas are instantiated
and wired onto the REST API, the orchestrator threads approval-related
data through to the agent runtime, the stranded-task reconciler sweeps
AWAITING_APPROVAL tasks, and the agent pipeline accepts the new
per-task approval configuration.
Stack wiring (agent.ts + task-api.ts)
-------------------------------------
- TaskApi construct accepts `taskApprovalsTable`, `slackUserMappingTable`,
`cedarWasmLayer` props. Approve/Deny/GetPending Lambdas are created
when the approvals table is present; GetPolicies also requires the
cedar-wasm layer + RepoTable. Slack-link Lambda attaches when the
slack mapping table is provided.
- New routes:
POST /tasks/{task_id}/approve
POST /tasks/{task_id}/deny
GET /pending
GET /repos/{repo_id}/policies
POST /notifications/slack/link
- GetPoliciesFn configures `memorySize: 512` (Cedar-wasm floor from
§15.2 task 10) and externalizes `@cedar-policy/cedar-wasm` from the
esbuild bundle so the layer provides the wasm binary at runtime.
- CedarWasmLayer compatibleRuntimes extended to include nodejs24.x
(the Lambda runtime) — the Node 20/22 list was the original §15.2
spec but the actual function uses Node 24.
- agent.ts passes all three new constructs into TaskApi.
Orchestrator (shared/orchestrator.ts)
-------------------------------------
- `finalizeTask` now treats AWAITING_APPROVAL as a "task still alive"
terminal-timeout source: on poll exhaustion the task transitions to
TIMED_OUT with a distinct `approval_poll_timeout` reason + error
message ("Orchestrator poll timeout exceeded while awaiting approval").
The stranded-approval reconciler is the secondary safety net (§13.6)
for tasks the orchestrator already lost track of.
- Invocation payload now carries three new fields:
- `task_started_at` (ISO 8601 at HYDRATING → RUNNING time) —
consumed by the agent hook's `_remaining_maxlifetime_s` so the
§6.5 maxLifetime ceiling math uses the real task clock instead
of the fail-open fallback.
- `approval_timeout_s` (when the submit payload supplied it).
- `initial_approvals` (when the submit payload supplied entries).
Stranded-task reconciler
------------------------
- Sweeps AWAITING_APPROVAL in addition to SUBMITTED/HYDRATING.
- New `APPROVAL_STRANDED_TIMEOUT_SECONDS` env var (default 7200s =
2h) — double §7.3's 1h ceiling so this reconciler never races the
happy-path timer.
- Distinct failure message on approval-stranded vs generic-stranded
so users see "approval stranded — container evicted" rather than
the misleading "no pipeline attached" copy.
Fanout (handlers/fanout-task-events.ts)
---------------------------------------
- Slack channel default set replaces the forward-compat
`approval_required` stub with the real §11.1 events:
`approval_requested` and `approval_stranded`. Other approval
milestones (granted/denied/timed_out/late_win/etc.) stay out of
default routing to avoid notification fatigue — the CLI surfaces
those confirmations directly.
- Email default replaces `approval_required` with `approval_requested`
(high-severity gates only; severity gating happens in the dispatcher).
Create-task validation (shared/create-task-core.ts)
---------------------------------------------------
- New request fields:
- `approval_timeout_s` — integer within
`[APPROVAL_TIMEOUT_S_MIN, APPROVAL_TIMEOUT_S_MAX]`.
- `initial_approvals` — array of scope strings; each entry must
be a valid `ApprovalScope` per `parseApprovalScope`; bash_pattern
and write_path scopes get the §7.4 degenerate-pattern check.
- TaskRecord extended with `approval_timeout_s`, `initial_approvals`,
`approval_gate_count` (seeded to 0 at admission), and
`awaiting_approval_request_id` (written atomically by the agent's
`transact_write_approval_request` primitive).
Agent plumbing (models.py / config.py / pipeline.py / runner.py / server.py)
----------------------------------------------------------------------------
- `TaskConfig` adds `approval_timeout_s`, `initial_approvals`.
- `build_config`, `run_task`, `_run_task_background`, and
`_extract_invocation_params` thread the two new fields from payload
→ config → PolicyEngine.
- `server._extract_invocation_params` stamps `os.environ["TASK_STARTED_AT"]`
from the payload so the hook's `_remaining_maxlifetime_s` returns
real values (carry-forward from Chunk 3 resolved).
- `runner.py` constructs PolicyEngine with `initial_approvals` +
`task_default_timeout_s` when supplied; the engine clamps bad
values at construction time.
Tests
-----
All CDK tests pass: 1219 / 1219.
All agent tests pass: 648 / 648.
Affected suites (changes only):
- test/stacks/agent.test.ts: cedar-wasm layer CompatibleRuntimes
now expects `nodejs24.x`; table count still 8.
- test/constructs/cedar-wasm-layer.test.ts: same runtime expansion.
- test/handlers/fanout-task-events.test.ts: approval_required →
approval_requested/approval_stranded in Slack default set;
approval_required → approval_requested in Email default set.
- test/handlers/reconcile-stranded-tasks.test.ts: primeResponses
now queue a third `Items: []` for AWAITING_APPROVAL queries;
queryCalls assertion bumped to 3.
Carry-forward (non-blocking)
----------------------------
- GetPoliciesFn has write access to TaskApprovalsTable (for the
rate-limit counter path). A future permissions audit should
tighten this to a single-item write scoped to `RATE#<user>#*`.
- TASK_STARTED_AT env var is only set when a payload supplies it;
server.py still supports the Phase 2 no-payload startup path.
* feat(cedar-hitl): Chunk 6 CLI — approve / deny / pending / policies
Ships the four user-facing commands that close the Cedar HITL loop:
once Chunks 1-5 have a PENDING approval row and the Slack/Email fan-out
has notified the user, Chunk 6 is how they actually respond.
New commands (cli/src/commands/)
--------------------------------
- `bgagent approve <task-id> <request-id> [--scope <scope>] [--yes]`
Default scope is `this_call`; callers extend allowlist with
`tool_type:Bash`, `rule:<id>`, etc. `all_session` is the only scope
that requires `--yes` to confirm — mirrors the safety UX from
§8.4. Error classification maps 404 → "run `bgagent pending`", 409
→ "task no longer awaiting approval", 429 → rate-limit, 401 → login.
- `bgagent deny <task-id> <request-id> [--reason ... | --reason-file ...]`
`--reason-file` accepts multi-line reasons that would otherwise
need shell quoting. Client-side `DENY_REASON_MAX_LENGTH` cap avoids
a round-trip on obviously-too-long reasons; the server still
truncates. Reason is sanitized server-side (output_scanner) before
ever reaching the agent.
- `bgagent pending [--output text|json]`
Lists every PENDING approval owned by the caller. Rendered with
approve/deny hints inline so the user can copy-paste the next
command. JSON output for scripting. Rate-limited server-side.
- `bgagent policies list --repo <owner/repo> [--tier hard|soft]`
`bgagent policies show --repo <owner/repo> --rule <rule_id>`
Discovery commands so users can find rule IDs without reading CDK
source. Both subcommands reuse a single `listPolicies` API call
and filter locally.
Wire changes
------------
- `cli/src/api-client.ts`: `approveTask`, `denyTask`, `listPending`,
`listPolicies` — each matching the §7.1 / §7.2 / §7.6 / §7.7
request/response shapes. `approveTask` omits the `scope` body field
when unset so the server's `this_call` default applies.
- `cli/src/types.ts`: mirrors the Chunk 5 server types verbatim —
`ApprovalScope` union, `ApprovalRequest/Response`, `DenyRequest/Response`,
`PendingApprovalSummary`, `GetPoliciesResponse`, `PolicyRuleSummary`,
plus the five constants (`DENY_REASON_MAX_LENGTH`,
`INITIAL_APPROVALS_MAX_ENTRIES`, `INITIAL_APPROVALS_MAX_ENTRY_LENGTH`,
`APPROVAL_TIMEOUT_S_MIN/MAX/DEFAULT`).
- `cli/src/bin/bgagent.ts`: registers the four new commands in the
order they appear in help output.
Tests: +27 new (217 total).
- approve (9): default scope, custom scope, all_session guard +
`--yes` bypass, JSON output, 404/409/401/429 error classifications.
- deny (6): no-reason path, `--reason`, `--reason-file` with
tmpdir fixture, mutually-exclusive rejection, over-length rejection,
404 classification.
- pending (5): empty render, populated render with approve/deny
hints, JSON output, 401 and 429 classifications.
- policies (7): list both tiers, `--tier` filter, `--output json`,
bad `--tier`, show found rule, show unknown rule, 404
repo-not-onboarded classification.
Carry-forward
-------------
- `submit.ts` extension with `--approval-timeout` / `--pre-approve`
flags is deferred to a follow-up commit — the server already accepts
these fields on POST /v1/tasks (Chunk 5), and `bgagent submit`
already forwards unknown payload fields through the existing
request path, so users can set them via `--body-file` today until
the explicit flags land.
- `--output json` on error branches currently returns a CliError
instead of a JSON error envelope; matches the pattern the existing
commands use (status, cancel, nudge). Follow-up to standardize
JSON error envelopes across the whole CLI if that becomes a
common scripting pain point.
* feat(cedar-hitl): Chunk 7a — persist gate counter + IMPL-23 cache observability
Persist approval_gate_count to TaskTable across container restarts per
§13.6 so the cumulative gate budget survives eviction. Emit
pre_approvals_loaded after PolicyEngine init per §4 step 7 / §11.1 so
operators see the starting approval posture in the live SSE stream.
Add IMPL-23 cache-hit observability: cache hits attach metadata to
PolicyDecision, hook forwards to new write_policy_decision_cached
progress helper (decision_source="recent_decision_cache").
Why: container restarts were silently resetting the per-task gate
counter, re-exposing users to another approvalGateCap-worth of gates
per restart. Cache-driven denies were invisible in TaskEventsTable
beyond the initial gate. Fresh tasks emitted no "starting posture"
signal so dashboards could not distinguish "no pre-approvals seeded"
from "agent has not started".
Surface additions:
- task_state.increment_approval_gate_count_in_ddb — best-effort
atomic ADD on approval_gate_count
- PolicyEngine(initial_approval_gate_count=N) — seed session counter
- TaskConfig.initial_approval_gate_count — orchestrator payload field
- progress_writer.write_policy_decision_cached — IMPL-23 emitter
- PolicyDecision.cache_hit_metadata — observability-only field
- _CachedDecision.original_decision_ts — wall-clock preservation
- runner._initialize_policy_engine_and_hooks — extracted helper
Counter survival is a safety bound, not correctness: DDB failure
does NOT block the gate (§13.6). Joint-update invariant on status
+ awaiting_approval_request_id (§10.2) is preserved — counter uses
separate UpdateItem, not merged into resume transaction.
Tests: +36 agent (648→684), +8 CDK (1219→1227), +6 new runner tests.
* feat(cedar-hitl): Chunk 7b — persist approval_gate_cap from blueprint
Capture the per-task approval-gate cap at submit-time (§4 step 5,
decision #13, §13.6) so a blueprint-configured override is frozen
onto the TaskRecord. Mid-task blueprint edits cannot shift the cap
beneath a running task; container restarts re-seed the agent's
PolicyEngine from the persisted value instead of its compile-time
default-50.
Why: Chunk 7a added approval_gate_count persistence but the cap
itself was still resolved from the blueprint on every restart —
so an operator lowering security.approvalGateCap mid-task would
retroactively fail-close the running task. The design has always
said cap is frozen at submit; this chunk makes the implementation
match.
Surface additions:
- BlueprintProps.security.approvalGateCap (CDK, synth-validated
[1, 500] integer) — new per-repo blueprint prop
- RepoConfig.approval_gate_cap + BlueprintConfig.approval_gate_cap
- TaskRecord.approval_gate_cap + APPROVAL_GATE_CAP_{MIN,MAX,DEFAULT}
- create-task-core now calls loadRepoConfig, resolves cap, bounds-
checks, persists; returns 503 SERVICE_UNAVAILABLE on invalid
blueprint data (permanent until admin re-deploys, not transient)
- orchestrator.ts: isValidApprovalGateCap integer+bounds guard;
logs warn if a persisted cap is structurally invalid (schema
drift / hand-edited DDB row)
- TaskConfig.approval_gate_cap: int | None = None (agent-side);
runner threads to PolicyEngine kwarg when not None
- "Task created" log line now carries approval_gate_cap +
approval_gate_cap_source ("blueprint" | "platform_default") so
operators can detect a broken-plumbing deploy at the single
chokepoint where all fallback layers converge
Per silent-failure review:
- HIGH: 500 → 503 + logger.error for permanent misconfig
- HIGH: cap + source in task-created log (catches 4-layer cascade)
- MEDIUM: orchestrator guard tightened past typeof (NaN, Infinity,
floats, out-of-bounds all omitted + warned)
Tests: CDK 1263/1263 (+36), agent 694/694 (+10). CLI unchanged.
* feat(cedar-hitl): Chunk 7c — observability wrap-up for resolved cap + warn path
Close three deferred items from Chunks 7a/7b before Chunks 8-10:
- runner.py init log now carries approval_gate_cap=N +
approval_gate_cap_source=threaded|engine_default. Matches the
handler log key so CloudWatch Insights can join across the
cascade; agent can't distinguish blueprint-override from
platform-default-frozen (handler log is the ground truth).
- server.py adds _warn_cw helper routing [server/warn] lines to
a dedicated CloudWatch stream (server_warn/<task_id>). stdout
print is preserved for local dev + existing capsys tests.
AgentCore does not forward container stdout to APPLICATION_LOGS,
so pre-7c warnings about malformed invocation payloads were
invisible in production. Failure counter shared with _debug_cw
for a single alarm surface; hoisted above writer defs for
import-time ordering safety.
- blueprint.ts emits a synth-time info annotation when
security.approvalGateCap is omitted so operators see a signal
that the repo will rely on the platform default of 50. Without
this, the default was a silent fallback at the handler layer —
only visible by inspecting a TaskRecord at runtime.
Tests: agent 694→700 (+6), cdk 1263→1265 (+2), cli unchanged.
Design refs: §4 step 5, §11.1, §13.6, decision #13.
* feat(cedar-hitl): Chunk 8a — extend approval outcome event schema
Add created_at / effective_timeout_s / matching_rule_ids to
approval_granted / approval_denied / approval_timed_out events so
the incoming ApprovalMetricsPublisher Lambda (Chunk 8b) can compute
decision latency and emit a rule_id-dimensioned timeout breakdown
without a round-trip GetItem against TaskEventsTable.
Fields are added conditionally — omitted from metadata when the
caller did not supply them — so the event stream stays free of
null-value noise and legacy callers continue to produce valid
payloads. Publisher handles missing fields via explicit skip-and-log
on the specific metric branch (not fallback-to-zero).
Agent tests extended: +6 progress_writer tests, +3 hooks tests.
Baseline 700 → 710. No consumer wired yet — this commit is a
forward-compatible superset; Chunk 8b ships the CDK publisher +
dashboard widgets.
* feat(cedar-hitl): Chunk 8b — ApprovalMetricsPublisher + native CloudWatch dashboard widgets
Ship the Cedar-HITL dashboard widgets from §11.3 / IMPL-28 via the
MetricsPublisher architecture (Option E):
- New ApprovalMetricsPublisher Lambda consumes TaskEventsTable DDB
stream as consumer #2 (FanoutConsumer is #1; stream is within its
2-consumer soft cap — documented in task-events-table.ts).
- Handler emits CloudWatch EMF for 3 metrics in namespace
ABCA/Cedar-HITL:
* ApprovalRequestCount + ClippedApprovalCount (reason dim) →
ApprovalTimeoutClipRate widget (MathExpression with IF-guard
against NaN on zero-denominator periods)
* TimedOutEffectiveTimeout (rule_id dim with allowlist
cardinality cap) → ApprovalTimeoutBreakdown widget
* ApprovalDecisionLatencyMs (outcome dim) → ApprovalDecisionLatency
widget with per-outcome p50/p90/p99
- Observability-of-observability (silent-failure review):
* MetricsPublisherHeartbeat per batch so dashboard gaps
distinguish "no traffic" from "pipeline broken"
* MetricEmitSkipped with a reason dim on schema mismatches,
parse anomalies, unknown rule ids — never fall back to
latency=0 or count=0 which would poison percentile widgets
* Expected high-volume skip reasons (non-milestone events,
REMOVE records) DO NOT emit MetricEmitSkipped — only
anomaly reasons (missing keys, missing milestone name) do,
so real signal isn't drowned
* Structured log lines alongside every skip so the absence of
metrics is also observable via CloudWatch Logs Insights
- Cardinality caps via ``RULE_ID_ALLOWLIST`` + ``normalizeClipReason``.
Unknown values collapse to ``other`` / ``unknown`` buckets with
dashboard series so the collapse is discoverable rather than
silently accruing custom-metric cost.
- Event-source-mapping filter pattern rejects non-agent_milestone
records at the service layer; handler-layer allowlist catches
anything that slips through. Filter pattern correctness tested
structurally + positively/negatively probed (silent-failure H3).
- Per-record try/catch + reportBatchItemFailures + SQS DLQ mirror
the fanout-task-events.ts poison-pill pattern exactly.
Deferred to Chunk 10 chore issues:
- DLQ alarms (fanout + publisher) — fire-into-void until
notification channel lands, so wire with §11.5 alarms as a group
- Explicit log-group declaration (IAM drift defense)
- stdout-flush race documentation (pre-existing pattern in fanout)
- EMF 100-updates/sec throttle alarm
Tests: cdk 1265 → 1327 (+62); agent 710 (unchanged); cli 217
(unchanged). All pass. §11.5 alarm plumbing now unblocked —
publisher provides the metrics infrastructure the design always
intended; only the notification-channel SNS wiring is left.
* docs(cedar-hitl): Chunk 9 — sync design doc to Chunks 7b / 8a / 8b
Bring CEDAR_HITL_GATES.md current with the code that shipped in
Chunks 7b (approval_gate_cap persist), 8a (outcome event schema
superset), and 8b (ApprovalMetricsPublisher + dashboard widgets):
- §10.2 adds the missing approval_gate_cap row (carry-forward
drift from Chunk 7b). Bounds + frozen-at-submit semantics
documented.
- §11.1 outcome events (approval_granted / approval_denied /
approval_timed_out) now document the Chunk 8a optional fields
(created_at, effective_timeout_s, matching_rule_ids) plus the
publisher's skip-on-missing-field policy.
- §11.1 intro names ApprovalMetricsPublisherFn as consumer #2 and
points to §11.3 for the metric schema.
- §11.3 rewritten to describe the Option E architecture:
publisher Lambda + EMF + native CloudWatch metrics in namespace
ABCA/Cedar-HITL, MathExpression with divide-by-zero guard,
rule_id cardinality cap, observability-of-observability via
heartbeat + skip meta-metrics, widget layout (12/12 over 24),
2-consumer stream budget. Dropped the stale "Retired the old
bundled widget" line — that widget never shipped.
- §11.5 reframed as "deferred (notification-channel gated)" with
a plumbing-status paragraph noting the metric infra now exists;
only SNS wiring remains. Alarm list expanded to include DLQ
and publisher-health alarms.
- §16 IMPL-28 rewritten for Option E; §15.2 row 46 expanded to
reference the 4 new test files; Appendix B checklist updated.
Starlight mirror regenerated via ``cd docs && node
scripts/sync-starlight.mjs``.
No code changes. Test baselines unchanged. Adversarial
comment-analyzer review verified every new claim against
committed code — zero inaccuracies.
* feat(cedar-hitl): Chunk 10 review fixes — close 2 blockers + tighten 2 mediums
Full-branch adversarial review (code-reviewer + silent-failure-hunter
on all 18 commits) surfaced findings that only appear at final-state.
Addressing the blockers + low-cost meds before deploy:
B2 — stranded approvals were invisible to the dashboard:
- Reconciler writes ``event_type: 'task_stranded'``; the metrics
publisher's event-source filter only accepts
``event_type: 'agent_milestone'``, so AWAITING_APPROVAL evictions
produced zero §11.3 signal.
- Fix: reconciler now additionally emits an ``agent_milestone``
with ``milestone: 'approval_stranded'`` when the stranded task
was AWAITING_APPROVAL. Publisher allowlist extended; classifier
emits ``ApprovalStrandedCount`` counter. SUBMITTED / HYDRATING
stranded events unchanged (guarded by test).
B1 — heartbeat comment was false reassurance:
- Event-source filter blocks Lambda invocation when no
``agent_milestone`` records exist in the poll window, so a
quiet period produces the same widget gap as a broken
pipeline. The code + design-doc wording claimed "gap =
pipeline broken" which would mislead the on-call.
- Fix: corrected module + function docstrings to describe the
heartbeat as "present when active, not pipeline-alive-always."
Operators should alarm on the combination
(heartbeat-absent + recent TaskEventsTable traffic) or wire
a scheduled canary — the latter tracked as a §11.5 follow-up.
M1 — safety-critical milestones produced zero dashboard signal:
- ``approval_cap_exceeded`` (§12.9 per-task cap) and
``approval_rate_limit_exceeded`` (per-user per-minute rate)
were emitted by the agent but not on the publisher allowlist.
A production bug where every gate hit the cap would have
been invisible.
- Fix: both added to APPROVAL_METRIC_MILESTONES with
``ApprovalCapExceededCount`` / ``ApprovalRateLimitExceededCount``
counters. No dimensions — the request_id in the event carries
per-user correlation for ad-hoc log-insights investigation.
H2 — filter / handler eventName disagreement:
- Event-source filter required ``INSERT``; handler accepted
``INSERT`` and ``MODIFY``. Benign today (TaskEventsTable is
put-only), but a future chunk MODIFY-ing records would be
silently dropped by the filter while the handler was ready
to process them.
- Fix: handler now INSERT-only, matching the filter. Single
source of truth on the eventName invariant.
M1-rename — ``expected_non_approval_milestone`` skip reason was
misleading (the non-metric approval milestones like
``approval_late_win`` also land in this bucket). Renamed to
``expected_milestone_not_tracked``.
Tests: cdk 1327 → 1332 (+5: 3 classifier branches for new metrics,
1 reconciler AWAITING_APPROVAL path, 1 SUBMITTED-not-double-counted
guard). Agent + CLI unchanged. All pre-commit hooks green; pre-push
security fails only on the 3 pre-existing CVEs tracked for chore
issue filing.
Deferred findings from the same review (file as chore issues):
- H1: agent-dies-between-TIMED_OUT-and-resume loses latency
(edge, affects p99 bias)
- H3: late-win APPROVED created_at staleness invariant
(works today, document invariant)
- H4: _warn_cw daemon-thread burst under adversarial payload
- M2-M4: late-win metric, rename helpers, etc.
No upstream PR filing this chunk — deploy to Sam's AWS account
for integration testing first.
* fix(cedar-hitl): suppress AwsSolutions-IAM5 on Runtime ExecutionRole overflow policies
Synth + deploy were blocked by cdk-nag: the Cedar HITL additions
(TaskApprovalsTable grant + SlackUserMappingTable + extra env vars
threaded to the AgentCore runtime) pushed the runtime ExecutionRole
past CDK's inline-policy size limit, so CDK auto-splits excess
statements into ``OverflowPolicy1``. The overflow inherits the same
wildcard ``bedrock:InvokeModel*`` / CloudWatch actions as the base
policy but lives at a path
(``Runtime/ExecutionRole/OverflowPolicy1/Resource``) that the
existing ``addResourceSuppressions(runtime, ..., applyToChildren:
true)`` cannot reach — CDK creates overflow policies lazily during
synth ``prepare()``, after the construct tree has been frozen and
after static suppressions have been cached.
Suppress via an Aspect at MUTATING priority so the suppression is
applied before cdk-nag's READONLY visitor runs. Matches any path
containing ``/Runtime/ExecutionRole/OverflowPolicy`` + ending
``/Resource`` so future ``OverflowPolicy2``, etc. are covered
without hardcoding indices.
Verified: ``mise //cdk:synth`` now completes cleanly.
``mise //cdk:test`` still 1332/1332.
* fix(cedar-hitl): E2E deploy-readiness — policies bundle + onboarding gate + CLI error visibility
Three E2E T1.4 + T2.2 findings from the Chunk 10 integration-test
session. Batched into one commit since all three need the same
redeploy to verify:
1. agent/Dockerfile: COPY policies/ into the container image.
``PolicyEngine.__init__`` reads
``/app/policies/hard_deny.cedar`` + ``soft_deny.cedar`` at import
time via ``_POLICIES_DIR = Path(__file__).parent.parent /
"policies"``. The Dockerfile only copied ``src/``, so the
directory was missing and every Cedar-HITL task failed at 0 turns
with ``missing built-in hard-deny policies``. Introduced
alongside Chunk 2 when the policy files were first added —
Dockerfile was never updated. Zero tasks on this branch ever
su…
Summary
Introduces a
ComputeStrategyabstraction that decouples the orchestrator from any specific compute backend. Implements two strategies behind this interface:compute_type: 'ecs'in blueprint configKey changes
ComputeStrategyinterface (compute-strategy.ts):startSession,pollSession,stopSessionwithSessionHandlefor durable serializationEcsComputeStrategy(strategies/ecs-strategy.ts): RunTask with container command override to invokeentrypoint.run_task()directly in batch mode (bypasses the uvicorn server that would otherwise sit idle)EcsAgentClusterCDK construct (ecs-agent-cluster.ts): ECS cluster, Fargate task definition (2 vCPU / 4 GB ARM64), security group (egress 443 only), CloudWatch log group, scoped IAM rolesorchestrate-task.ts): Returns fullSessionHandlefrom start-session step; adds ECS-level crash detection in the polling loop alongside existing DDB pollingTaskOrchestratorconstruct: Conditional ECS env vars, IAM policies (ecs:RunTask/DescribeTasks/StopTask, scopediam:PassRole), all-or-nothing prop validationAgentStackwiring: ECS infrastructure gated behindABCA_ENABLE_ECS=trueenv flag — zero impact on existing deploymentsTesting
ABCA_ENABLE_ECS=true, submitted a task withcompute_type: 'ecs'blueprint, agent completed in ~5 min and opened a PR via ECS FargateTest plan
mise run buildpasses (compile, test, synth, lint, docs)ABCA_ENABLE_ECS=trueproduces valid template