Skip to content

feat(compute): add ComputeStrategy interface with ECS Fargate backend#8

Merged
krokoko merged 15 commits into
mainfrom
feat/compute-strategy
Apr 13, 2026
Merged

feat(compute): add ComputeStrategy interface with ECS Fargate backend#8
krokoko merged 15 commits into
mainfrom
feat/compute-strategy

Conversation

@MichaelWalker-git
Copy link
Copy Markdown
Contributor

@MichaelWalker-git MichaelWalker-git commented Apr 7, 2026

Summary

Introduces a ComputeStrategy abstraction that decouples the orchestrator from any specific compute backend. Implements two strategies behind this interface:

  • AgentCoreComputeStrategy — extracted from the existing hardcoded AgentCore logic (zero behavior change for existing deployments)
  • EcsComputeStrategy — new ECS Fargate backend that runs agent tasks as standalone containers, selected via compute_type: 'ecs' in blueprint config

Key changes

  • ComputeStrategy interface (compute-strategy.ts): startSession, pollSession, stopSession with SessionHandle for durable serialization
  • EcsComputeStrategy (strategies/ecs-strategy.ts): RunTask with container command override to invoke entrypoint.run_task() directly in batch mode (bypasses the uvicorn server that would otherwise sit idle)
  • EcsAgentCluster CDK construct (ecs-agent-cluster.ts): ECS cluster, Fargate task definition (2 vCPU / 4 GB ARM64), security group (egress 443 only), CloudWatch log group, scoped IAM roles
  • Orchestrator adaptation (orchestrate-task.ts): Returns full SessionHandle from start-session step; adds ECS-level crash detection in the polling loop alongside existing DDB polling
  • TaskOrchestrator construct: Conditional ECS env vars, IAM policies (ecs:RunTask/DescribeTasks/StopTask, scoped iam:PassRole), all-or-nothing prop validation
  • AgentStack wiring: ECS infrastructure gated behind ABCA_ENABLE_ECS=true env flag — zero impact on existing deployments
  • CI: Re-enabled gitleaks and osv-scanner in build workflow

Testing

  • Unit tests for both strategies, compute-strategy resolver, ECS construct, orchestrator construct
  • Integration-tested end-to-end: deployed with ABCA_ENABLE_ECS=true, submitted a task with compute_type: 'ecs' blueprint, agent completed in ~5 min and opened a PR via ECS Fargate

Test plan

  • mise run build passes (compile, test, synth, lint, docs)
  • CDK synth with ABCA_ENABLE_ECS=true produces valid template
  • Existing AgentCore path unchanged (no ECS resources created without flag)
  • ECS task starts, agent runs in batch mode, writes terminal status, exits cleanly
  • Orchestrator detects ECS container completion/failure via poll loop

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a compute strategy abstraction for orchestrating agent sessions, extracting AgentCore-specific session invocation logic out of the shared orchestrator and updating the durable “start-session” step to delegate compute-specific work to the selected strategy.

Changes:

  • Added a ComputeStrategy interface plus resolveComputeStrategy() factory keyed off blueprintConfig.compute_type.
  • Extracted AgentCore session invocation/stop logic into AgentCoreComputeStrategy, and refactored orchestration to use it.
  • Added/updated unit tests for the new strategy + factory, and adjusted existing orchestrator-related tests accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cdk/src/handlers/shared/compute-strategy.ts Adds strategy interface/types and a factory for selecting a compute backend.
cdk/src/handlers/shared/strategies/agentcore-strategy.ts Implements AgentCore-specific session start/poll/stop behind the strategy interface.
cdk/src/handlers/shared/orchestrator.ts Removes AgentCore session start logic from the shared orchestrator.
cdk/src/handlers/orchestrate-task.ts Updates the durable start-session step to resolve and use a compute strategy.
cdk/test/handlers/shared/compute-strategy.test.ts Tests strategy resolution and unknown compute type behavior.
cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts Tests AgentCore strategy session start/poll/stop behavior with SDK mocking.
cdk/test/handlers/orchestrate-task.test.ts Removes old startSession tests and adjusts mocks (now mostly orchestrator-focused).
.github/workflows/build.yml Updates CI environment variables and disables several security/scanning tools via MISE_DISABLE_TOOLS.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/build.yml Outdated
Comment thread cdk/src/handlers/orchestrate-task.ts
Comment thread cdk/test/handlers/orchestrate-task.test.ts
@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Thank you ! Couple of findings, will add each one of them as a separate comment. Regarding the CI failures, don't worry about it in this PR, i'll fix it separately

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 1 — Dual source of truth for runtimeArn

Category: Architecture
Severity: Medium
Location: agentcore-strategy.ts:31 (constructor) and agentcore-strategy.ts:47 (startSession)

What happens:
The AgentCoreComputeStrategy constructor accepts runtimeArn and stores it as this.runtimeArn:

constructor(options: { runtimeArn: string }) {
this.runtimeArn = options.runtimeArn; // source 1: from factory
this.client = new BedrockAgentCoreClient({});
}

Then in startSession, the method immediately overrides it from blueprintConfig:

const runtimeArn = input.blueprintConfig.runtime_arn ?? this.runtimeArn; // source 2

Why it matters:
The factory in compute-strategy.ts:49 already passes blueprintConfig.runtime_arn to the constructor:

return new AgentCoreComputeStrategy({ runtimeArn: blueprintConfig.runtime_arn });

So this.runtimeArn and input.blueprintConfig.runtime_arn are always the same value at construction time. The ?? fallback can never fire. This creates the false impression that two
different ARNs could be in play, which is confusing for future maintainers.

Recommendation:
Either pass the ARN only through the constructor (remove it from the startSession input), or only read it from blueprintConfig at call time (remove the constructor parameter). One
source of truth, not two.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 2 — Lost SDK client singleton

Category: Architecture / Performance
Severity: Medium
Location: agentcore-strategy.ts:33 vs the deleted orchestrator.ts:35

Before (main branch):
// orchestrator.ts — module-level singleton, reused across Lambda invocations
const agentCoreClient = new BedrockAgentCoreClient({});

After (PR):
// agentcore-strategy.ts — new instance every time resolveComputeStrategy() is called
constructor(options: { runtimeArn: string }) {
this.runtimeArn = options.runtimeArn;
this.client = new BedrockAgentCoreClient({}); // new client per call
}

The factory resolveComputeStrategy() is called inside the durable execution step 'start-session' at orchestrate-task.ts:120. In durable execution, steps can replay on Lambda
re-invocations, and the factory runs on each replay. Even without replay, every new task invocation goes through the factory.

Why it matters:
AWS SDK clients hold HTTP connection pools and TLS sessions. A module-level singleton amortizes the TLS handshake across all invocations on the same Lambda container. Creating a new
client per call forces a fresh TLS negotiation on every task. Under moderate concurrency, this adds latency and garbage-collection pressure.

Recommendation:
Hoist the BedrockAgentCoreClient to module scope or use a lazy singleton pattern:

let _client: BedrockAgentCoreClient | undefined;
function getClient(): BedrockAgentCoreClient {
return (_client ??= new BedrockAgentCoreClient({}));
}

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 3 — compute_type is an untyped string

Category: Types
Severity: Medium
Location: repo-config.ts:49, compute-strategy.ts:46-52

The CDK construct already defines a strict union at blueprint.ts:50:

readonly type?: 'agentcore' | 'ecs';

But the runtime types in BlueprintConfig and RepoConfig declare it as a bare string:

// repo-config.ts:49
readonly compute_type: string;

The factory switch in compute-strategy.ts then does:

switch (computeType) {
case 'agentcore':
return new AgentCoreComputeStrategy({ ... });
default:
throw new Error(Unknown compute_type: '${computeType}');
}

Why it matters:

  1. The default branch is a runtime bomb. If someone adds 'ecs' to the CDK construct but forgets to add a case 'ecs' here, the factory throws at runtime with no compile-time warning.
  2. TypeScript's exhaustiveness checking only works on discriminated unions, not string. With a proper union type, adding a new variant without handling it would be a compile error.

Recommendation:
Define a shared type and use the never exhaustiveness pattern:

// repo-config.ts
export type ComputeType = 'agentcore' | 'ecs';

export interface BlueprintConfig {
readonly compute_type: ComputeType;
// ...
}

// compute-strategy.ts
switch (computeType) {
case 'agentcore':
return new AgentCoreComputeStrategy({ ... });
// case 'ecs': return new EcsComputeStrategy({ ... });
default: {
const _exhaustive: never = computeType;
throw new Error(Unknown compute_type: '${_exhaustive}');
}
}

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 4 — stopSession swallows all errors

Category: Error handling
Severity: High
Location: agentcore-strategy.ts:78-87

async stopSession(handle: SessionHandle): Promise {
// ...
try {
await this.client.send(new StopRuntimeSessionCommand({ ... }));
logger.info('AgentCore session stopped', { session_id: handle.sessionId });
} catch (err) {
logger.warn('Failed to stop AgentCore session (best-effort)', {
session_id: handle.sessionId,
error: err instanceof Error ? err.message : String(err),
});
}
}

The entire catch is a logger.warn — the error is swallowed unconditionally.

Why it matters:
Not all stop failures are equal:

  • ResourceNotFoundException — the session already ended. Swallowing this is correct.
  • ThrottlingException — rate limited. The session is still running, consuming compute, and accumulating cost. Swallowing this means the session runs until it self-terminates or times
    out.
  • AccessDeniedException — a permissions misconfiguration. This will never self-heal and affects all future stop attempts. Swallowing it means the operator never learns about it except
    by reviewing warn-level logs.
  • ServiceUnavailableException — transient. A retry would likely succeed.

Logging at warn for all of these treats a billing leak the same as a benign no-op.

Recommendation:
At minimum, distinguish retriable/transient errors from permanent ones. Consider re-throwing non-retriable errors (or a subset like auth/config errors), and retrying transient ones. If
the intent is truly "best-effort fire-and-forget", at least log AccessDeniedException at error level so alarms can fire.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 5 — Missing logger.info('Session started') after full sequence

Category: Observability
Severity: Low
Location: orchestrate-task.ts:120-133 (the new start-session step)

Before (main branch — orchestrator.ts:340):
await agentCoreClient.send(command);
await transitionTask(task.task_id, TaskStatus.HYDRATING, TaskStatus.RUNNING, { ... });
await emitTaskEvent(task.task_id, 'session_started', { session_id: sessionId });
logger.info('Session started', { task_id: task.task_id, session_id: sessionId }); // ← present

After (PR — orchestrate-task.ts:120-133):
const strategy = resolveComputeStrategy(blueprintConfig);
const handle = await strategy.startSession({ taskId, payload, blueprintConfig });
await transitionTask(taskId, TaskStatus.HYDRATING, TaskStatus.RUNNING, { ... });
await emitTaskEvent(taskId, 'session_started', { ... });
return handle.sessionId;
// ← no logger.info after the full sequence

The strategy itself logs 'AgentCore session invoked' at agentcore-strategy.ts:61, but that fires before the DDB transition and event emit. The original code logged after the entire
sequence completed (invoke + transition + event), confirming the step succeeded end-to-end.

Why it matters:
In a durable execution context, the invoke can succeed but the DDB transition can fail (and the step retries). The strategy-level log gives a false positive — "session invoked" appears
in logs even when the step ultimately fails. The old post-sequence log confirmed the step completed atomically.

Recommendation:
Add a logger.info after emitTaskEvent in the start-session step:

logger.info('Session started', { task_id: taskId, session_id: handle.sessionId, strategy: handle.strategyType });

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 6 — No integration tests for the rewritten start-session step

Category: Testing
Severity: High
Location: orchestrate-task.test.ts — deleted startSession tests at lines 194-206 and 375-387 (old), no replacements for the orchestrator-level flow

What was removed:
// Deleted: tested startSession() directly
describe('startSession', () => {
test('invokes agent runtime and transitions to RUNNING', async () => { ... });
});
describe('startSession with blueprint config', () => {
test('uses blueprint runtime ARN override', async () => { ... });
});

What was added:

  • compute-strategy.test.ts — 2 tests: factory returns correct type, factory throws on unknown type
  • agentcore-strategy.test.ts — 5 tests: type check, startSession invoke + handle, ARN override, pollSession stub, stopSession (success, failure, missing ARN)

What's missing:
The new strategy tests validate the strategy class in isolation. But the orchestrator's start-session step in orchestrate-task.ts:117-133 is itself a composition of three operations:

  1. resolveComputeStrategy(blueprintConfig)
  2. strategy.startSession({ taskId, payload, blueprintConfig })
  3. transitionTask(...) + emitTaskEvent(...)

No test validates this composition. Specifically:

  • Happy path: strategy starts → DDB transitions → event emits → returns sessionId
  • Error path: strategy.startSession throws → failTask is called → error re-thrown
  • Partial failure: strategy starts successfully → transitionTask throws (e.g., conditional check fails because task was cancelled concurrently)

The old startSession function was monolithic and tested end-to-end. The refactored version split the logic across two layers but only tested the bottom layer.

Recommendation:
Add integration-level tests for the start-session step that mock the strategy and DDB, verifying the full composition, especially the error path where failTask must be called before
re-throwing.

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 8, 2026

Finding 7 — pollSession() silently returns 'running' forever

Category: Interface contract
Severity: Medium
Location: agentcore-strategy.ts:67-70

async pollSession(_handle: SessionHandle): Promise {
// Polling is currently done at the orchestrator level via DDB reads.
// This method exists for PR 2 where different strategies may poll differently.
return { status: 'running' };
}

The ComputeStrategy interface at compute-strategy.ts:38 declares:

pollSession(handle: SessionHandle): Promise;

This returns SessionStatus, a discriminated union of 'running' | 'completed' | 'failed'. The current implementation always returns 'running'.

Why it matters:
If any code path ever calls strategy.pollSession(handle) to check whether a session has finished, it will get { status: 'running' } forever, and the polling loop will never terminate.
The caller has no way to know this method is a stub — the type system says it returns a real answer.

The existing orchestrator polling in orchestrator.ts:352-366 (pollTaskStatus) reads from DDB, which is the actual source of truth. So currently pollSession is dead code. But:

  1. A future developer seeing the ComputeStrategy interface will reasonably assume pollSession works and use it.
  2. The test at agentcore-strategy.test.ts:85-92 asserts result.status is 'running', which locks in the stub behavior and will pass even when someone intends to implement it.

Recommendation:
Either:

  • Throw new Error('Not implemented') — makes it immediately clear this is a stub and any caller will fail loudly, or
  • Remove the method from the interface entirely until it's needed (YAGNI), or
  • Add a @deprecated / JSDoc warning and name the test clearly (e.g., 'pollSession is a stub — returns running (NOT IMPLEMENTED)')

@MichaelWalker-git
Copy link
Copy Markdown
Contributor Author

@krokoko All 7 findings addressed in d68edd6. Summary:

Finding 1 — Dual source of truth for runtimeArn: Removed runtimeArn from the constructor. AgentCoreComputeStrategy is now stateless — reads runtime_arn exclusively from blueprintConfig in startSession(). Factory no longer passes constructor args.

Finding 2 — Lost SDK client singleton: Hoisted BedrockAgentCoreClient to a module-level lazy singleton via getClient(). Created once on first use, reused across all strategy instances within the same Lambda execution context.

Finding 3 — compute_type is untyped string: Added ComputeType = 'agentcore' | 'ecs' union type in repo-config.ts. Updated BlueprintConfig.compute_type and RepoConfig.compute_type to use it. Added exhaustiveness checking with never pattern in resolveComputeStrategy switch — adding a new compute type without a case arm is now a compile error.

Finding 4 — stopSession swallows all errors: Distinguished error types:

  • ResourceNotFoundExceptionlogger.info (session already gone, benign)
  • ThrottlingException / AccessDeniedExceptionlogger.error (actionable, potential billing leak or config error)
  • Other errors → logger.warn (best-effort, as before)

Finding 5 — Missing logger.info after full sequence: Added logger.info('Session started', { task_id, session_id, strategy_type }) in orchestrate-task.ts after the invoke + transition + event sequence completes.

Finding 6 — No integration tests: Created start-session-composition.test.ts with 3 tests:

  1. Happy path: strategy.startSession → transitionTask → emitTaskEvent (all succeed)
  2. Error path: strategy.startSession throws → failTask is called
  3. Partial failure: strategy succeeds but transitionTask throws → failTask is called

Finding 7 — pollSession returns 'running' forever: Now throws Error('pollSession is not implemented for AgentCore — use orchestrator-level DDB polling'). Updated test to verify it throws. Clear signal for future developers; future strategies (ECS/Fargate) will implement real compute-level polling.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

cdk/src/stacks/agent.ts:257

  • hasEcsBlueprint is currently hard-coded to always be false (.some(() => false)), so needsEcs effectively depends only on ABCA_ENABLE_ECS. This makes the preceding comment (“Wire ECS infrastructure when any blueprint uses compute.type: 'ecs'”) inaccurate and will silently skip provisioning ECS even if a Blueprint is later configured with compute.type: 'ecs'. Consider either removing hasEcsBlueprint entirely and documenting that ECS infra is env-flag driven, or implementing a real detection mechanism (e.g., tracking blueprint compute type in code or exposing it from the Blueprint construct).
    new CfnOutput(this, 'RepoTableName', {
      value: repoTable.table.tableName,
      description: 'Name of the DynamoDB repo config table',
    });


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cdk/src/constructs/task-orchestrator.ts
Comment thread cdk/src/handlers/shared/strategies/ecs-strategy.ts Outdated
Comment thread cdk/src/handlers/shared/strategies/ecs-strategy.ts
Comment thread cdk/src/handlers/shared/compute-strategy.ts Outdated
Comment thread cdk/src/constructs/task-orchestrator.ts
MichaelWalker-git added a commit that referenced this pull request Apr 9, 2026
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
…AgentCore logic

Introduce ComputeStrategy interface with SessionHandle/SessionStatus types
and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from
orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern
for session lifecycle (start/poll/stop). Pure refactor — no behavior change,
identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases.
Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr
rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated
GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for
security scanning tasks, not for the build/test/synth pipeline. Disable
them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise
tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable
  trivy/grype/semgrep which need GitHub API downloads)
- Remove unused @aws-sdk/client-bedrock-agentcore mock from
  orchestrate-task.test.ts (SDK is no longer imported by orchestrator)
- Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
   strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
   client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
   and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
   (info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
   sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
   happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
   'running' status — clear signal for future developers
- Replace require() with ES import for BedrockAgentCoreClient mock
- Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy
interface, using the existing durable Lambda orchestrator. No separate
stacks or Step Functions — ECS is a strategy option alongside AgentCore.

Changes:
- EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks
  state mapping), stopSession (StopTask with graceful error handling)
- EcsAgentCluster construct: ECS Cluster (container insights), Fargate
  task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only),
  CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock)
- TaskOrchestrator: optional ECS props for env vars and IAM policies
  (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN,
  iam:PassRole conditioned on ecs-tasks.amazonaws.com)
- Orchestrator polling: ECS compute-level crash detection alongside
  existing DDB polling (non-fatal, wrapped in try/catch)
- AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var)
- Full test coverage: 15 ECS strategy tests, 9 construct tests,
  5 orchestrator ECS tests. All 563 tests pass.

Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag,
agent task running on AgentCore path unaffected.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
The ECS container's default CMD starts uvicorn server:app which waits
for HTTP POST to /invocations — but in standalone ECS nobody sends that
request, leaving the agent idle. Override the container command to invoke
entrypoint.run_task() directly with the full orchestrator payload via
AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task
definition base environment.
@MichaelWalker-git MichaelWalker-git changed the title feat: introduce ComputeStrategy interface and extract AgentCoreComputeStrategy feat(compute): add ComputeStrategy interface with ECS Fargate backend Apr 9, 2026
@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 12, 2026

Pulled locally and merged main, when looking at it:

Critical Issues

  1. ECS tasks cannot be cancelled (cancel-task.ts only knows AgentCore)

cdk/src/handlers/cancel-task.ts:110-125 — The cancel handler hardcodes StopRuntimeSessionCommand for AgentCore. When a user cancels an ECS-backed task, the DDB record transitions to
CANCELLED, but the Fargate container keeps running — consuming compute, potentially pushing commits or opening PRs the user explicitly cancelled. This is both a correctness bug and a
cost leak.

Fix: The cancel handler should use resolveComputeStrategy() to dispatch to the correct strategy's stopSession(), or at minimum store enough metadata (compute_type, session handle) on
the task record for the cancel path to use.

  1. Missing agent_runtime_arn in DDB breaks cancellation for blueprint-overridden runtimes

cdk/src/handlers/orchestrate-task.ts:128-131 — The old startSession wrote agent_runtime_arn to the task record during the HYDRATING→RUNNING transition. The new strategy-based path omits
it. The cancel handler falls back to the default RUNTIME_ARN env var, which is incorrect for tasks using per-repo runtime ARN overrides from blueprints.

Fix: Restore writing agent_runtime_arn for AgentCore strategy. For ECS, store compute_type and session metadata on the task record.

  1. iam:PassRole * fallback when ECS role ARNs are omitted

cdk/src/constructs/task-orchestrator.ts:180-185, 277-280 — The all-or-nothing validation checks 5 ECS props but omits ecsTaskRoleArn and ecsExecutionRoleArn. If someone provides the 5
main props but forgets the role ARNs, the iam:PassRole policy silently falls back to resources: ['*'] — allowing PassRole to any role in the account. This is a security gap.

Fix: Include role ARNs in the all-or-nothing validation, or throw when ecsClusterArn is set but role ARNs are missing. Never fall back to ['*'].

  1. ECS pollSession errors silently swallowed in orchestrator polling loop

cdk/src/handlers/orchestrate-task.ts:180-185 — The catch block downgrades ALL exceptions from pollSession to logger.warn, including AccessDeniedException, TypeError, and other
persistent failures. If DescribeTasks consistently fails (e.g., missing IAM permission), the ECS crash detection path is permanently dead. Tasks sit in RUNNING state for up to 8.5 hours
before timing out.

Fix: Track consecutive poll failures. After N consecutive failures, escalate to logger.error and fail the task. Distinguish transient errors (throttling) from persistent errors (access
denied).


Important Issues

  1. Dead code: startSession function in orchestrator.ts is now unused

cdk/src/handlers/shared/orchestrator.ts:366-395 — The old startSession function and its AgentCore client are still exported but no longer imported by any handler. The remaining test in
orchestrate-task.test.ts validates this dead code path.

  1. Hardcoded CloudWatch log group name will collide across stacks

cdk/src/constructs/ecs-agent-cluster.ts:75 — /ecs/abca-agent-tasks is hardcoded. Two stacks in the same account/region will collide with ResourceAlreadyExistsException. Let CDK
auto-generate the name or include the stack name.

  1. SessionHandle.metadata: Record<string, unknown> undermines type safety

Both strategies immediately cast fields out of the untyped metadata bag (handle.metadata.runtimeArn as string, handle.metadata.clusterArn as string). A discriminated union keyed on
strategyType would eliminate all casts and make required fields self-documenting:

export type SessionHandle =
| { sessionId: string; strategyType: 'agentcore'; runtimeArn: string }
| { sessionId: string; strategyType: 'ecs'; clusterArn: string; taskArn: string };

  1. AgentCoreComputeStrategy.pollSession violates Liskov Substitution Principle

It throws unconditionally ("not implemented"), so callers cannot safely call pollSession on an arbitrary ComputeStrategy without first checking the type. Either make pollSession
optional in the interface, or return { status: 'running' } (which is semantically correct for AgentCore — it's running until DDB says otherwise).

  1. Seven flat optional ECS fields in TaskOrchestratorProps should be a grouped object

The all-or-nothing constraint is invisible at the type level. A single ecsConfig?: EcsComputeConfig with all fields required would express the constraint at the type level and eliminate
the runtime validation.

  1. No ecsStatus.status === 'completed' handling in polling loop

cdk/src/handlers/orchestrate-task.ts:171-179 — Only 'failed' is checked. If the ECS task exits with code 0 but the agent fails to write terminal status to DDB, the orchestrator polls
for up to 8.5 hours waiting for a status that never arrives.

  1. resolveComputeStrategy called on every poll cycle

cdk/src/handlers/orchestrate-task.ts:169 — Creates a new EcsComputeStrategy() up to 1020 times inside the poll loop. Resolve once before the loop and reuse.


Test Coverage Assessment: 7/10

Strengths:

  • Strategy isolation tests are thorough (agentcore: 203 lines, ecs: 331 lines)
  • CDK construct tests verify security-critical properties (SG egress, IAM scoping)
  • Composition test covers start-session happy path, error path, and partial failure

Critical Gaps:

  • No test for ECS crash detection polling logic — the most impactful new code path is untested
  • No test for ECS startSession with missing env vars (module-level const binding prevents it)
  • start-session-composition.test.ts only tests AgentCore path — no ECS composition test
  • No test for iam:PassRole * fallback when role ARNs are missing

Positive Observations

  • Clean strategy pattern with exhaustive never switch for compile-time safety
  • Proper cdk-nag suppression coverage with updated reason strings
  • IAM policies are cluster-scoped via conditions (ArnEquals: { 'ecs:cluster': ... })
  • ECS infrastructure is correctly gated behind ABCA_ENABLE_ECS=true — zero impact on existing deployments
  • Security group restricts egress to TCP 443 only (tight network policy)
  • ARM64 Fargate with container insights and 3-month log retention — production-grade defaults

@MichaelWalker-git
Copy link
Copy Markdown
Contributor Author

End-to-End Verification ✅

Deployed the latest feat/compute-strategy branch to backgroundagent-dev and ran an end-to-end test.

Test task: Build a simple calculator web app (HTML/CSS/JS) against MichaelWalker-git/aws-agents

Metric Value
Task ID 01KP431X0TJ7NGZNDJ03R8Y0CG
Status COMPLETED
Duration 153.7s (~2.5 min)
Cost $0.29
PR opened https://github.com/MichaelWalker-git/aws-agents/pull/3

What was verified:

  • Stack deployed successfully with all review changes (discriminated union SessionHandle, EcsComputeConfig grouped interface, scoped PassRole, cancel handler strategy dispatch, consecutive poll error tracking, etc.)
  • AgentCore compute path works end-to-end: task submitted → preflight passed → agent session started → code written → branch pushed → PR opened → task marked COMPLETED
  • All 627 CDK tests + 70 CLI tests pass
  • CDK synth clean (no cdk-nag errors)
  • Full monorepo build passes

@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 13, 2026

@MichaelWalker-git could you please just fix the mutation and I'll merge it, thank you !

@krokoko krokoko enabled auto-merge April 13, 2026 19:29
@krokoko krokoko disabled auto-merge April 13, 2026 19:29
@krokoko krokoko merged commit cee7331 into main Apr 13, 2026
6 checks passed
scoropeza pushed a commit to scoropeza/sample-autonomous-cloud-coding-agents that referenced this pull request May 5, 2026
…aker (krokoko review aws-samples#6, aws-samples#8)

Two related hardening changes on ``agent/src/progress_writer.py``.
Grouped because the shared circuit breaker reuses the error-
classification decision to know when NOT to flip itself, and
separating the commits would force an awkward "half-fix" intermediate
state.

## Findings addressed

**aws-samples#6 — Circuit breaker trips on ValidationException (permanent errors)**

The pre-fix ``except Exception`` branch fed ALL errors into the same
``_failure_count`` counter. A persistent schema/size error
(e.g. ``ValidationException`` from an item >400 KB under a
trace-heavy event) counted against the transient-failure budget and
tripped the breaker within 3 events, silencing the entire progress
stream for the rest of the task — even though most subsequent
events were normal size and would have written fine.

New behaviour classifies each DDB error into three buckets:

  - **Permanent (drop event, keep stream alive):**
    ``ValidationException``, ``ItemCollectionSizeLimitExceededException``
    — the individual event is malformed or oversized, retrying or
    treating as transient would not help. Log at WARN, skip the
    event, do NOT increment the failure counter.
  - **Immediate-disable (fatal, don't even try to retry):**
    ``ResourceNotFoundException``, ``AccessDeniedException``,
    ``UnauthorizedOperation`` — wrong deploy or IAM misconfig. Disable
    the breaker on the first occurrence instead of waiting for 3
    failures; log at WARN with the error code. Avoids spamming
    operator dashboards with 3 copies of the same permissions error.
  - **Transient (trip the breaker on repeated failures, as today):**
    ``ProvisionedThroughputExceededException``,
    ``RequestLimitExceeded``, ``ServiceUnavailable``,
    ``InternalServerError``, plus network-layer (``ConnectionError``,
    ``EndpointConnectionError``, ``ReadTimeoutError``). Same counter
    semantics as pre-fix.
  - **Unknown (default conservative):** counted as transient (counter
    increments) but logged at ERROR with an explicit ``UNKNOWN``
    marker so operators notice and can add new codes to the
    permanent/transient lists. Does NOT instant-disable — over-
    correcting from pre-fix behaviour would swap one failure mode
    for another.

Uses ``botocore.exceptions.ClientError`` + ``err.response["Error"]["Code"]``
for AWS errors; class-name matching for non-ClientError (network-
layer) paths. Helper: ``_classify_ddb_error(exc) ->
Literal["permanent", "immediate_disable", "transient", "unknown"]``.

**aws-samples#8 — Dual _ProgressWriter instances with independent circuit breakers**

Pre-fix, the runner-level writer (turn/tool events at ``runner.py:240``)
and the pipeline-level writer (milestones at ``pipeline.py:303``) each
held their own ``_failure_count`` / ``_disabled`` state. If throttling
tripped one writer, the other kept writing — creating visible event
gaps in the stream that operators could not distinguish from agent
activity (milestones firing after turn events stop, or vice versa).

Fix: consolidate circuit-breaker state into a module-level
``_SharedCircuitBreaker`` singleton keyed by ``task_id``. Both writers
for the same task read/write the same ``(_failure_count, _disabled)``
pair through named methods (``is_disabled``, ``record_failure``,
``record_success``, ``disable``). One task's stream is either healthy
(all events flow from both writers) or degraded (no events flow from
either). Cannot have a half-alive stream.

Semantics notes:

  - ``record_success`` resets the shared counter but NOT the
    ``_disabled`` flag. Re-enabling mid-task would let a flaky
    minute burn the failure budget repeatedly and defeat the
    breaker's purpose.
  - Empty-string ``task_id`` (``runner.py`` falls back to sentinel
    ``"unknown"``) collapses to shared state for all ``"unknown"``
    writers. Real task_ids stay isolated.
  - Writers retain ``_disabled`` / ``_failure_count`` as properties
    that proxy to the shared map. Existing callers (``hooks.py``
    does ``getattr(progress, "_disabled", False)``) and tests that
    assign ``writer._failure_count = 0`` keep working unchanged —
    no constructor signature change required, no
    ``runner.py`` / ``pipeline.py`` edits.
  - Single ``threading.Lock()`` protects the shared map; DDB write
    rate (single-digit events/sec) never contends meaningfully.
  - Test hygiene: ``_reset_circuit_breakers()`` helper rebinds the
    module global so autouse fixtures give each test a clean slate.

## Tests

+24 regression tests net (36 → 60 in ``test_progress_writer.py``).
Coverage:

  - Finding aws-samples#6 classification:
    ``test_permanent_error_does_not_trip_breaker`` (10 consecutive
    ``ValidationException`` writes keep ``_disabled=False``),
    ``test_transient_error_trips_breaker_as_before``,
    ``test_access_denied_disables_writer_immediately_with_loud_log``,
    ``test_unknown_exception_treated_as_transient_with_error_log``.
  - Finding aws-samples#8 sharing:
    ``test_shared_circuit_breaker_across_writers_same_task_id``
    (writer-A trips the breaker; writer-B sees ``is_disabled`` and
    skips the DDB call),
    ``test_separate_tasks_have_independent_breakers``,
    ``test_unknown_sentinel_task_id_is_isolated``,
    ``test_reset_helper_clears_shared_state_between_tests``.
  - Plus edge cases: success-interleave resets the counter across
    writers; ``_disabled`` stays open after re-enabling a success
    mid-task; thread-safety via concurrent writes.

Agent suite: 497 passing (was 473; +24).

Refs: krokoko code review on PR aws-samples#52 (findings 6, 8)
scoropeza pushed a commit to scoropeza/sample-autonomous-cloud-coding-agents that referenced this pull request May 7, 2026
Adds the 14 agent-side approval milestone writers (§11.1) on
``_ProgressWriter`` so Chunk 3's hook integration has a typed API
instead of stringly-typed ``write_agent_milestone`` calls, and the
per-task gate counter / per-container sliding-window rate limit /
denial-injection queue on ``PolicyEngine`` that §6.5 requires.

Why now: the hook work lands cleanly only after these surfaces exist
— every code path in ``pre_tool_use_hook``'s REQUIRE_APPROVAL branch
calls one of these helpers. Shipping them separately lets the hook
commit be about the state machine, not the event-shape bookkeeping.

Engine additions:
  - ``approval_gate_count`` / ``increment_approval_gate_count``: the
    per-task counter §12.9 bounds at ``approvalGateCap``. Session-scoped
    in v1; persistence tracked in §17.
  - ``approvals_in_last_minute`` / ``record_approval_gate_timestamp``:
    sliding-window rate limit (20/min/container, §12.9). Prune on read
    so callers see the current count without a separate tick.
  - ``queue_denial_injection`` / ``drain_denial_injections``: queue
    consumed by ``_denial_between_turns_hook`` at the next Stop seam
    (§6.5). Reason is pre-sanitized upstream by ``DenyTaskFn``.
  - ``mark_ceiling_shrinking_emitted``: emit-once latch for IMPL-26.
  - ``APPROVAL_RATE_LIMIT`` / ``APPROVAL_RATE_WINDOW_S`` module consts
    the hook imports rather than re-deriving.

Milestone writers (§11.1 table, 14 agent-emitted of 15):
  - ``pre_approvals_loaded``, ``approval_requested``,
    ``approval_granted``, ``approval_denied``, ``approval_timed_out``,
    ``approval_stranded``, ``approval_write_failed``,
    ``approval_resume_failed``, ``approval_poll_degraded``,
    ``approval_timeout_capped``, ``approval_ceiling_shrinking``,
    ``approval_cap_exceeded``, ``approval_rate_limit_exceeded``,
    ``approval_late_win``.
  - ``approval_decision_recorded`` (Lambda audit) and
    ``approval_timeout_capped_at_submit`` (CreateTaskFn) stay on the
    Lambda side — Chunk 5 owns those.

Each helper is a thin wrapper over ``_put_event("agent_milestone",
...)`` so the shared circuit-breaker + classifier path (finding aws-samples#6/aws-samples#8)
continues to apply. Metadata keys mirror the §11.1 shapes verbatim
(``maxLifetime_remaining_s`` preserves the design-doc spelling for
downstream parsers).

Tests: +29 total. 17 on ``TestApprovalMilestoneHelpers`` pin the DDB
payload shape for each helper (including the two
``approval_timeout_capped`` reason variants — rule_annotation carries
matching_rule_ids; maxLifetime_ceiling omits the field). 12 on the
engine: counter monotonicity, rate-window prune semantics at window
boundary, denial-queue FIFO + drain-clears, ceiling-shrinking latch
idempotency.

No caller changes — engine and writer surfaces are additive. Hook
integration lands in commit C.
scoropeza pushed a commit to scoropeza/sample-autonomous-cloud-coding-agents that referenced this pull request May 7, 2026
Lands the three read/discovery handlers Chunk 6 (CLI) needs to power
``bgagent pending``, ``bgagent policies list/show``, and
``bgagent notifications configure slack``. Completes §15.2 tasks
14, 15, and 25 (handler side).

Handlers
--------

``get-pending.ts`` (§7.7 — GET /v1/pending)
  - Queries ``user_id-status-index`` GSI on TaskApprovalsTable with
    ``user_id = :caller AND status = :pending``. Without the GSI
    this would be a full-table Scan per call — under
    ``watch -n1 bgagent pending`` that exhausts burst capacity for
    the whole fleet (§10.1 finding aws-samples#8).
  - Response maps each row to ``PendingApprovalSummary`` with a
    derived ``expires_at = created_at + timeout_s`` so the CLI can
    render time-to-timeout without doing arithmetic on ISO strings.
  - Severity coerced to ``medium`` on unknown values so GSI writes
    that drift from the enum don't break the list response.
  - Rate-limited 10/min/user (synthetic row on the same table,
    namespaced ``RATE#<user>#PENDING`` so it does not collide with
    the approve/deny counter).

``get-policies.ts`` (§7.6 — GET /v1/repos/{repo_id}/policies)
  - Combines ``BUILTIN_HARD_DENY_POLICIES`` + ``BUILTIN_SOFT_DENY_POLICIES``
    with the repo's ``cedar_policies`` blueprint override. Runs the
    combined text through ``parseRules`` and returns
    ``{hard[], soft[]}`` rule summaries.
  - 5-minute per-repo in-Lambda cache; cold starts throw it away.
    ``_resetCacheForTests`` exposed for unit-test isolation.
  - Repo ID is URL-decoded from the path (``owner%2Frepo`` common in
    CLI UX).
  - Rate-limited 30/min/user.
  - Blueprint load failure falls back to built-ins with a WARN log;
    invalid blueprint cedar text returns 503 ``SERVICE_UNAVAILABLE``
    rather than a misleading empty list.

``link-slack-user.ts`` (§11.2 finding aws-samples#4 — POST /v1/notifications/slack/link)
  - Writes to SlackUserMappingTable with
    ``ConditionExpression: attribute_not_exists(slack_user_id)``. This
    guard is the entire admission control the §11.2 design hinges on:
    even a compromised Slack admin cannot overwrite an existing
    mapping.
  - Validates ``slack_user_id`` shape (letters, digits, underscores,
    2–40 chars) so junk rows cannot land.
  - Conflict surface is 409 ``REQUEST_ALREADY_DECIDED`` — reused
    error code (the payload message directs the user to unlink via
    support).
  - Slack link_token end-to-end validation against Slack OAuth is
    deferred — v1 accepts the token on trust from the Cognito-authed
    caller; it is persisted in CloudWatch for audit.

Supporting primitives
---------------------

``shared/builtin-policies.ts`` — mirrors ``agent/policies/hard_deny.cedar``
and ``agent/policies/soft_deny.cedar`` as TypeScript string constants.
Embedded rather than read from disk because Lambda's esbuild bundler
does not copy non-TS assets by default and a dedicated bundling hook
is more code than the embed. A drift test
(``builtin-policies.test.ts``) asserts byte-equality with the agent
files so any change on one side without the other flips red at build
time.

``shared/cedar-policy.ts`` — ``parseRules`` now skips the unannotated
``base_permit`` entry (both tiers need it as a Cedar catch-all; it
is not a user-facing rule so it stays out of ParsedRule[]). This
matches the agent-side ``_parse_policy_annotations`` behaviour.

Tests: +37 total.
  - get-pending (8): 401 on missing auth, 429 on rate limit, empty
    result, GSI query shape, row → PendingApprovalSummary with
    derived expires_at, severity fallback, missing timeout → expires_at
    falls back to created_at, 500 on DDB error.
  - get-policies (11): 401/400 validation, built-in rules listed on
    empty repo, URL-decoded repo path, custom blueprint rule lands
    in soft, per-repo cache across calls, 429 rate limit, 503 on
    invalid blueprint cedar, fallback on load failure, hard rules
    omit severity / approval_timeout_s, soft rules carry them.
  - link-slack-user (8): 401/400 validation, shape check, 201 on
    success, 409 on overwrite attempt, 500 on unknown DDB error,
    whitespace trim on slack_user_id, ConditionExpression verified.
  - builtin-policies (4): drift byte-equality with both agent files,
    parseRules round-trip for hard/soft rule IDs.
  - cedar-policy (updated): ``base_permit`` is skipped from
    ParsedRule[] rather than rejected.

Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout) lands in
the next commit.
isadeks pushed a commit to isadeks/sample-autonomous-cloud-coding-agents that referenced this pull request May 13, 2026
…ws-samples#79 review aws-samples#8)

Original set omitted documented Slack permission/scope failures.
Codes outside the set fall to the retryable branch, so a
misconfiguration like ``ekm_access_denied`` or ``missing_scope``
would burn 3 Lambda retries before DLQ on every event — even
though the failure is fundamentally a configuration bug that no
retry can clear.

Adds:
  - Permission/scope: missing_scope, ekm_access_denied,
    team_access_not_granted, posting_to_general_channel_denied
  - Payload shape: invalid_arguments

Reorganized the set into commented blocks (channel-shape, auth,
permission/scope, payload-shape) so future additions go in the
right bucket and the rationale stays visible.

Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS
set (21 codes) — every one must throw SlackApiError so the router
swallows it. The existing retryable test.each remains intact and
covers the negative-class case (codes outside the set throw a
plain Error and escalate to retry).
github-merge-queue Bot pushed a commit that referenced this pull request May 13, 2026
#79)

* feat(fanout): migrate SlackNotifyFn to FanOutConsumer subscriber (#64)

Move the Slack outbound delivery off its own DynamoDB Streams consumer
onto FanOutConsumer as a per-channel dispatcher. Drops TaskEventsTable
from 2 concurrent stream readers to 1, restoring headroom for future
channels (Email, Teams, etc.) without exceeding the documented
DynamoDB Streams 2-reader-per-shard practical limit.

The PR also addresses an adversarial code review on the original
migration; the body below walks through each piece in the order it
landed.

## (a) Migration

- `cdk/src/handlers/slack-notify.ts` — rewritten as exported
  `dispatchSlackEvent(event, ddb)` plus a tagged `SlackApiError`
  class. The standalone `handler(event)` stream entrypoint is gone;
  the FanOutConsumer is now the only TaskEventsTable stream reader.
  Behaviour preserved bit-for-bit: channel_source==='slack' gate,
  terminal-event dedup via conditional UpdateItem on
  `channel_metadata.slack_notified_terminal`, threaded replies under
  the @mention or task_created message, emoji transitions
  (eyes -> hourglass -> ✅/❌/🚫/⏲), DM channel_id -> user_id rewrite,
  intermediate session+created message cleanup on terminal events.

- `cdk/src/handlers/fanout-task-events.ts` — replaces the log-only
  `dispatchToSlack` stub with a wrapper that calls dispatchSlackEvent
  and routes errors via the new typed contract (see (b) below). Slack
  defaults gain task_created, session_started, task_timed_out so the
  router fans out the lifecycle events the old SlackNotifyFn handled;
  the dispatcher's channel_source gate keeps non-Slack tasks unaffected.

- `cdk/src/constructs/fanout-consumer.ts` — adds a scoped
  `secretsmanager:GetSecretValue` grant on `bgagent/slack/*` so the
  fanout Lambda can fetch per-workspace bot tokens. Same scope the old
  SlackNotifyFn role held.

- `cdk/src/constructs/slack-integration.ts` — deletes SlackNotifyFn,
  its DynamoEventSource, its IAM policy, and its NagSuppressions
  entry. Drops the now-unused StartingPosition / FilterCriteria /
  FilterRule / lambdaEventSources imports.

After this lands, `aws lambda list-event-source-mappings` shows
exactly one consumer of the TaskEventsTable stream (FanOutFn);
verified on the dev stack with end-to-end @mention + cancel + CLI
isolation scenarios.

## (b) Review fix #1 — partial-batch retry semantics (BLOCKER)

The first review pass found that the post-migration handler silently
dropped Slack-side infra errors (DDB throttle on the GetItem,
Secrets Manager 5xx, transient Slack timeout). Pre-migration the
SlackNotifyFn handler rethrew non-SlackApiError so Lambda retried
the batch; post-migration `Promise.allSettled` swallowed the
rejection and routeEvent returned an empty list with no escalation
path to `batchItemFailures`.

routeEvent's return type changed from `NotificationChannel[]` to
`{ dispatched, infraRejections }`. The handler now pushes the
record into `batchItemFailures` whenever `infraRejections.length>0`,
so Lambda replays the record under the partial-batch contract. The
warn line on rejection is tagged `retryable: true` so operators can
alert distinctly from the channel-terminal swallow path.

GitHub got the symmetric treatment: 4xx (excluding the existing 401
and 404 handling) is now treated as a channel-terminal swallow via
`fanout.github.api_error` instead of escalating to retry.

## (c) Review fix #2 — split SlackApiError into terminal + retryable

Originally any `!result.ok` Slack response was wrapped in
SlackApiError and swallowed. That collapsed retryable codes
(`ratelimited`, `service_unavailable`, `internal_error`,
`fatal_error`, `request_timeout`) into the same swallow as
`channel_not_found` — a tier-1 Slack outage would silently drop
every message.

Introduced `TERMINAL_SLACK_API_ERRORS` set + `classifySlackError`
helper. Terminal codes still throw SlackApiError (router swallows).
Retryable codes throw a plain Error so the router classifies them
as infra rejections and Lambda replays.

## (d) Review fix #3 — NOTIFIABLE_EVENTS / CHANNEL_DEFAULTS drift

The original migration added task_created/session_started/task_timed_out
to CHANNEL_DEFAULTS.slack but the dispatcher's NOTIFIABLE_EVENTS gate
already excluded several events the router was subscribing Slack to
(agent_error, pr_created, task_stranded). Result: Slack was reported
as `dispatched` for events it silently dropped — telemetry lied,
agent_error never reached operators on Slack-origin tasks, and
task_stranded rendered the generic "Event: task_stranded for
owner/repo" fallback (UX regression).

Added render cases for task_stranded and agent_error in slack-blocks.ts
and added them to NOTIFIABLE_EVENTS. Forward-compat approval_required
and status_response stay out of NOTIFIABLE_EVENTS until their emitters
ship; a new cross-file consistency test in fanout-task-events.test.ts
fails if anyone re-introduces the drift.

The Slack dispatcher wrapper now passes `effectiveEventType` so an
agent_milestone(pr_created) wrapper is unwrapped before NOTIFIABLE_EVENTS
matching. Without the rewrite, the dispatcher would short-circuit
on the wrapper string `agent_milestone`.

## (e) Review fix #4 — conditional UpdateItem on lifecycle persists

Once the BLOCKER fix made batches retry, the original task_created
and session_started UpdateItem calls became hazardous: a Slack POST
that succeeded but whose follow-up UpdateItem failed transiently
would, on retry, post a second root and overwrite slack_thread_ts —
orphaning every threaded reply that had threaded under the first ts.

Both UpdateItems now carry an `attribute_not_exists` ConditionExpression
on the relevant `channel_metadata.slack_*_msg_ts`. On
ConditionalCheckFailedException the handler logs at info, deletes
the duplicate Slack message via `chat.delete`, and returns. Sibling
retry wins the race; the duplicate is cleaned up.

## (f) Dev-stack regression: drop pr_created from Slack defaults

Live verification surfaced a UX duplication: pr_created (subscribed
in CHANNEL_DEFAULTS.slack as the original §6.2 design called for) and
task_completed both rendered messages with View PR buttons, posted
seconds apart. The original SlackNotifyFn had silently dropped
pr_created (NOTIFIABLE_EVENTS gate), so users hadn't relied on it.

Removed pr_created from CHANNEL_DEFAULTS.slack and from
NOTIFIABLE_EVENTS, and removed the prCreatedMessage renderer.
GitHub keeps pr_created (its edit-in-place comment surface
genuinely benefits from the early checkpoint).

## Verification

- mise //cdk:compile  — clean
- mise //cdk:test     — 1183 / 1183 pass (8 net-new tests added for
                        the review fixes: NOTIFIABLE_EVENTS drift
                        guard, retryable Slack codes, GitHub 4xx
                        swallow, infra rejection escalation,
                        SlackApiError swallow, task_stranded render)
- mise //cdk:eslint   — clean
- mise //cdk:synth    — confirms exactly one Lambda::EventSourceMapping
                        on TaskEventsTable, pointing at FanOutFn
- Dev-stack scenarios — @mention happy path, Cancel button, CLI submit
                        (channel_source=api -> zero Slack dispatches,
                        GitHub edit-in-place still fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(fanout): retry GitHub 403/429 instead of swallowing as terminal (#79 review #1)

PR #79 review found that the new 4xx terminal-swallow path treats
HTTP 403 and 429 as channel-terminal — but on GitHub these are
transient rate-limit responses (403 with "API rate limit exceeded",
429 "Too Many Requests"). Under a reconciliation wave that touches
many tasks, an entire window of GitHub comment updates would be
permanently dropped with only a warn log.

Carve out 403 and 429 from the swallow guard so they propagate as
infra rejections through ``Promise.allSettled``. The record lands
in ``batchItemFailures`` and Lambda replays until the rate-limit
window clears (or DLQs after ``retryAttempts``).

Test coverage: parametrized over 403 + 429 with a GitHubCommentError
mock at the helper boundary, asserting the record's eventID surfaces
in ``batchItemFailures`` rather than being absorbed.

* fix(fanout): guard Slack Secrets Manager grant on a prop (#79 review #2)

Every other external-service grant in FanOutConsumer (taskTable,
repoTable, githubTokenSecret) is gated by ``if (props.X)``, so a
deployment that hasn't onboarded the corresponding service stays
free of dangling IAM permissions. The original migration broke the
pattern with an unconditional ``bgagent/slack/*`` Secrets Manager
grant — dev stacks without Slack onboarding ended up holding read
permission on a resource pattern they never use, with a misleading
``cdk-nag AwsSolutions-IAM5`` suppression reason.

Adds an optional ``slackSecretArnPattern`` prop on
``FanOutConsumerProps``; the policy statement is only attached when
the prop is set. ``cdk/src/stacks/agent.ts`` now computes the
``bgagent/slack/*`` ARN inline and passes it through, mirroring the
other guarded props. ``ArnFormat`` and ``Stack`` imports moved out
of fanout-consumer.ts since the construct no longer needs them.

No changes to live behaviour — agent.ts always passes the prop, so
the IAM policy still attaches in production. The dispatcher will
log-and-fail-retry on a missing pattern (covered by review #3 fix).
Test gap covering the construct itself ships in a follow-up commit
(test gap #34).

* fix(fanout): throw on missing TASK_TABLE_NAME env var (#79 review #3)

Pre-fix: when ``TASK_TABLE_NAME`` was unset on a Slack-subscribed
event, ``dispatchSlackEvent`` returned silently after a warn line.
The router counted Slack as ``dispatched`` and a broken stack
quietly dropped every Slack notification — operators only saw it
in the warn-rate metric, with no rejected-channel signal.

Post-fix: throw a plain Error so the rejection propagates as an
infra rejection through ``Promise.allSettled``. The router pushes
the record into ``batchItemFailures``, Lambda retries the batch,
the ``fanout.dispatcher.rejected`` warn fires per record, and
operators get a distinct alarm.

Also bumps the existing log line from ``warn`` to ``error`` and
attaches an ``error_id: FANOUT_SLACK_MISSING_TASK_TABLE`` so the
deployment-bug case can be distinguished from per-record failures.

Test: ``throws when TASK_TABLE_NAME env var is missing`` deletes
the env var, asserts the throw, asserts no DDB call was attempted
(env-var guard fires first).

* fix(fanout): match SlackApiError by name as well as instanceof (#79 review #7)

When a bundler ever duplicates the slack-notify module (rare with
NodejsFunction tree-shaking but possible if dual-bundled), two
distinct SlackApiError classes coexist and ``instanceof`` against
one fails for instances of the other. The dispatcher would see a
foreign-class SlackApiError, fall through to the rethrow branch,
and the router would treat it as an infra rejection — flipping a
channel-terminal swallow into infinite Lambda retries.

Add an ``err.name === 'SlackApiError'`` fallback so the swallow
branch fires either way. Mirrors the duck-typed
``GitHubCommentError`` check used elsewhere in the same handler.

Test: synthesise a plain Error with name === 'SlackApiError'
(NOT an instance of the mock's SlackApiError class) and assert
batchItemFailures stays empty — proving the swallow path catches
both shapes.

* fix(fanout): extend TERMINAL_SLACK_API_ERRORS with permission codes (#79 review #8)

Original set omitted documented Slack permission/scope failures.
Codes outside the set fall to the retryable branch, so a
misconfiguration like ``ekm_access_denied`` or ``missing_scope``
would burn 3 Lambda retries before DLQ on every event — even
though the failure is fundamentally a configuration bug that no
retry can clear.

Adds:
  - Permission/scope: missing_scope, ekm_access_denied,
    team_access_not_granted, posting_to_general_channel_denied
  - Payload shape: invalid_arguments

Reorganized the set into commented blocks (channel-shape, auth,
permission/scope, payload-shape) so future additions go in the
right bucket and the rationale stays visible.

Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS
set (21 codes) — every one must throw SlackApiError so the router
swallows it. The existing retryable test.each remains intact and
covers the negative-class case (codes outside the set throw a
plain Error and escalate to retry).

* fix(fanout): promote Slack reaction/delete network errors to error logs (#79 review #5)

The reaction / delete helpers (``addReaction``, ``removeReaction``,
``deleteMessage``) used to log every catch at warn with a single
generic event key, lumping API-level rejections (e.g. ``no_reaction``)
together with infrastructure failures (DNS lookup, TLS handshake,
fetch timeout, JSON parse error from a hostile gateway). Operators
who alarmed on the warn rate saw a flat signal that masked
genuine infra problems.

Split the boundary:

  - API-level (``!result.ok`` after a successful HTTP call) stays at
    warn with channel-specific event keys
    (``fanout.slack.reaction_add_api_error``,
    ``fanout.slack.reaction_remove_api_error``,
    ``fanout.slack.message_delete_api_error``). These are per-message
    UX problems; operators don't page.

  - Network errors (the outer ``catch (err)`` after ``fetch``)
    promote to ``logger.error`` with dedicated event keys
    (``fanout.slack.reaction_add_network_error``,
    ``fanout.slack.reaction_remove_network_error``,
    ``fanout.slack.message_delete_network_error``) and ``error_id``s
    (``FANOUT_SLACK_REACTION_NETWORK``,
    ``FANOUT_SLACK_DELETE_NETWORK``) so each has its own alarmable
    signal. User-visible symptoms when these fire silently:
    stale emoji reactions (hourglass never swaps to ✅) and
    orphaned intermediate messages.

Behaviour unchanged: errors are still swallowed (per-message
reactions and intermediate cleanup are best-effort by design;
they must not fail the batch), but operators now get distinct
metrics for each failure class.

* fix(fanout): emit fanout.slack.dup_delete_failed on ghost-message accumulation (#79 review #6)

The conditional UpdateItem dup-delete path
(``task_created`` / ``session_started`` lifecycle persists)
calls ``deleteMessage`` to clean up the duplicate Slack message
that landed when a sibling retry won the race. The delete is
inherently best-effort — but if it fails, the duplicate becomes a
permanent ghost in the thread and operators had no way to alarm
on the rate.

Refactor ``deleteMessage`` to return a boolean (``true`` on success
or ``message_not_found``-as-already-gone, ``false`` otherwise) and
emit a dedicated ``fanout.slack.dup_delete_failed`` event with an
``error_id: FANOUT_SLACK_DUP_DELETE_FAILED`` from the dup-delete
callsites when the cleanup couldn't complete.

The terminal-event cleanup paths (``slack_session_msg_ts``,
``slack_created_msg_ts``) intentionally don't fire this event —
those paths target genuinely-stale UX cleanup, not retry-driven
duplicates, so an alarm there would be noise.

No new tests beyond the existing dup-delete coverage; the
``deleteMessage`` return value isn't yet asserted at the unit
level, but the behavior is fully exercised by the existing
``dup-delete`` integration paths (test gap #31 will add an
explicit failure-path assertion when it lands).

* chore(fanout): tighten RouteOutcome arrays to ReadonlyArray (#79 review #9)

``RouteOutcome.dispatched`` and ``infraRejections`` were typed as
plain ``NotificationChannel[]`` — which made ``readonly`` on the
property prevent reassignment but still allow callers to mutate the
underlying array via ``.push``, ``.splice``, or ``.sort``. Inconsistent
with the ``ReadonlySet<string>`` used for ``CHANNEL_DEFAULTS`` in the
same file.

Tightening to ``ReadonlyArray<NotificationChannel>`` makes the
contract honest: the router owns the arrays, callers read them.
Test suite updated to use ``[...outcome.dispatched].sort()`` where it
previously called ``.sort()`` directly — the explicit copy makes the
intent clear and would have surfaced any silent test-side mutation.

* refactor(fanout): make SlackDispatchEvent a type alias of FanOutEvent (#79 review #10)

The two interfaces were structurally identical: same five fields,
same readonly modifiers, same metadata shape. The decoupling was
purely nominal and a silent-drift footgun — adding a field to
``FanOutEvent`` (e.g. when the router starts plumbing an
``approval_required`` ID through) would not flow into
``SlackDispatchEvent``, leaving the dispatcher unaware until a
downstream test happened to fail.

Replace with a one-line type alias:

  export type SlackDispatchEvent = FanOutEvent;

The slack-notify module now type-imports ``FanOutEvent`` from
fanout-task-events. ``import type`` is erased at compile time, so
the runtime bundle still has the one-way dep
(fanout-task-events → slack-notify) — no module-cycle hazard.

Reviewer-suggested ``Pick<FanOutEvent, 'task_id' | …>`` was
considered and rejected: the dispatcher uses every field of
``FanOutEvent``, so the Pick would just enumerate the same five
fields with extra noise. A direct alias keeps the intent obvious
and prevents drift identically.

* fix(fanout): generalize Slack dedup to cover agent_error + log Retry-After (#79 review #4)

PR #79 review #4 surfaced a sibling-channel-failure hazard: when
GitHub or Email rate-limits, the record lands in
``batchItemFailures``. On the Lambda retry, every Slack-subscribed
event for that record runs again. Terminal events were already
guarded by ``slack_notified_terminal``; ``agent_error`` was not —
operators would page twice on a single agent failure if a sibling
channel happened to fail.

Generalize the dedup mechanism. ``TERMINAL_EVENTS`` is replaced by
a ``SLACK_DEDUP_ATTRIBUTE`` map that marks each event type with the
``channel_metadata`` attribute that should guard the post:

  - 5 terminals share ``slack_notified_terminal`` (any first-arriving
    terminal claims the right; subsequent terminals dedup against it)
  - ``agent_error`` gets its own ``slack_dispatched_agent_error``
    so a duplicate agent_error doesn't reuse the terminal slot
  - ``task_created`` / ``session_started`` map to ``null`` because
    they already use the per-event ``slack_*_msg_ts`` conditional
    persists from review #1 — the conditional already provides
    full idempotency (a separate marker would be redundant)

Also surfaces Slack's ``Retry-After`` header on rate-limited
responses through a dedicated ``fanout.slack.retryable_api_error``
warn so operators reading CloudWatch can see the recovery window
instead of guessing from sustained warn rate.

Tests:
  - logs Retry-After header on rate-limited Slack responses (new):
    asserts ``retry_after_seconds`` propagates from Slack's response
    header into the warn metadata
  - existing terminal-codes parametrized test untouched (terminal
    branch doesn't read headers)
  - existing retryable test gains a ``headers: { get: () => null }``
    stub on the fetch mock so the headers.get call doesn't crash

Reviewer suggested a per-channel dispatch bitmap as the alternative.
Rejected as premature: the duplicate-GitHub-PATCH is harmless
(idempotent), Email is still a stub, and the dedup map covers
the specific agent_error pain identified above. A bitmap would add
a new table + IAM grants + per-dispatch DDB cost for a hypothetical
problem (Slack rate-limiting AND a sibling channel failure).

* test(fanout): conditional UpdateItem race + dup-delete coverage (#79 test gap)

Adds 4 tests covering the lifecycle-persist conditional path that
review fix #1 introduced and review fix #6 hardened. Pre-PR-#79 the
only ConditionalCheckFailed coverage was the terminal-dedup path;
the new lifecycle-persist + dup-delete code lacked direct assertions
and was flagged 9/10 criticality by the reviewer.

  - task_created persist ConditionalCheckFailed → posts duplicate
    then deletes it: pins the cleanup behaviour that prevents ghost
    task_created posts in the channel
  - session_started persist ConditionalCheckFailed → posts duplicate
    then deletes it: parallel coverage for the other lifecycle
    attribute (slack_session_msg_ts)
  - dup-delete failure emits fanout.slack.dup_delete_failed with
    error_id: pins the operator-alarm signal added in review fix #6;
    asserts both the event key and the FANOUT_SLACK_DUP_DELETE_FAILED
    error_id propagate
  - chat.delete returning message_not_found is treated as success
    (no dup_delete_failed): negative-class assertion. Prevents
    false-positive alarms when the race resolves cleanly (the
    duplicate was already deleted by a prior retry).

The ghost / message_not_found tests use ``fetchMock.mockImplementation``
URL-routing rather than ``.mockResolvedValueOnce`` chains because
``updateReaction`` issues 2-3 reaction-API fetches between
chat.postMessage and chat.delete; routing by URL keeps the test
focused on the load-bearing chat.delete behaviour without coupling
to reaction call order.

* test(fanout): cover task_stranded + agent_error renderers (#79 test gap #32)

Pre-PR-#79 the new ``taskStrandedMessage`` and ``agentErrorMessage``
helpers in slack-blocks.ts had no direct unit tests. Reviewer flagged
this as a 7/10 gap because the renderers carry the prior_status /
error_type / message_preview metadata threaded through from the
event source — silent drift in the metadata field names would
produce ugly fallback messages in production.

Adds 5 tests:

  - task_stranded WITH metadata renders the prior_status parenthetical
    (``Task stranded for org/repo (last status: RUNNING)``) so
    operators can tell at a glance whether the task hung in HYDRATING
    vs RUNNING — without the parenthetical the reviewer's "generic
    Event: ..." UX regression would resurface.
  - task_stranded WITHOUT metadata still renders cleanly (legacy
    events written before the reconciler started stamping metadata
    must not crash or leak ``undefined``).
  - agent_error with full metadata (error_type + message_preview)
    renders the rotating_light, type, and preview.
  - agent_error WITHOUT metadata stays sensible — no leaked
    ``undefined`` strings or empty ``_Type:_`` line.
  - agent_error truncates a 500-char message_preview to keep Slack
    channel UX readable.

* test(fanout): cover agent_error dedup + dedup-slot isolation (#79 test gap #33)

Pre-PR-#79 review-fix #4 there was no direct test for the
``slack_dispatched_agent_error`` dedup attribute or its interaction
with the existing ``slack_notified_terminal`` slot. A future
refactor that collapsed the two slots — or renamed one of them —
would silently break the sibling-channel-failure-retry guarantee
that fix #4 added.

Adds 4 tests:

  - ``agent_error claims its own dedup attribute``: pins the
    UpdateExpression and ConditionExpression strings so a refactor
    that renames the attribute breaks loudly.
  - ``agent_error retry hits the dedup guard``: end-to-end scenario
    matching review #4 — task already has
    ``slack_dispatched_agent_error: true``, retry must short-circuit
    before chat.postMessage. Without the guard, a second
    rotating_light fires.
  - ``terminal dedup attribute is per-class``: a flaky
    task_completed-then-task_failed sequence dedups against the
    same ``slack_notified_terminal`` slot. Catches the regression
    where the orchestrator emits both terminal types and we'd
    otherwise post both ✅ and ❌.
  - ``agent_error and terminals use distinct dedup slots``: the
    important negative — having ``slack_dispatched_agent_error``
    set must NOT shadow a subsequent ``task_completed``. Pins
    the slot separation so a future merge into a single slot
    can't silently drop terminals after an agent_error.

* test(fanout): add construct-level tests for FanOutConsumer (#79 test gap #34)

The construct shipped on issue #64 with no unit-level coverage of
its IAM contract. The only synth-level signal lived inside
``slack-integration.test.ts`` ("0 EventSourceMapping") which proved
the migration didn't regress the OTHER construct. Reviewer flagged
this 6/10 — and the gap is what allowed review #2 (unconditional
Slack secret grant) to slip through in the first place.

Adds 6 tests:

  - ``attaches a single DynamoEventSource on the TaskEventsTable
    stream``: pins the architectural invariant — issue #64 was
    fundamentally about reaching exactly-one stream reader. Adding
    a second consumer must fail this test loudly.
  - ``creates a DLQ for the fanout Lambda``: pins retention period
    + presence; a DLQ-less deployment would silently drop
    poison-pill records past retryAttempts.
  - ``omits the bgagent/slack/* grant when slackSecretArnPattern
    is not provided``: the review #2 invariant. Iterates every
    IAM::Policy and asserts NONE of them grant secretsmanager:*
    on a bgagent/slack/* ARN. A regression that re-introduces the
    unconditional grant breaks this test.
  - ``attaches the bgagent/slack/* grant only when
    slackSecretArnPattern is provided``: the positive case. Pins
    the grant shape (action, effect, resource pattern).
  - ``passes TASK_TABLE_NAME env var when taskTable is provided``:
    review #3 dependency — the dispatcher throws on missing env.
  - ``omits TASK_TABLE_NAME env var when taskTable is not provided``:
    graceful degrade for dev stacks that haven't onboarded the
    TaskTable yet (matches the construct's documented contract).

* test(fanout): cover task_stranded through terminal dedup (#79 test gap #35)

The reconciler at handlers/reconcile-stranded-tasks.ts:170 emits
BOTH ``task_stranded`` and ``task_failed`` for a heartbeat-expired
task — one for the operator signal, one to drive the FAILED status
transition. Pre-PR-#79 this pair had no test coverage; reviewer
flagged this 8/10 because the visible failure mode (a paired
"Task stranded" + "Task failed" double-page in Slack) would
surface in production but be silent in CI.

Adds 2 tests:

  - ``task_stranded posts and writes the terminal dedup marker on
    first arrival``: pins that task_stranded participates in the
    shared terminal slot and renders the warning message with
    metadata. Catches a regression that omits task_stranded from
    the dedup map entirely.
  - ``task_stranded after a sibling task_failed dedups``: the
    operational scenario — task_failed already claimed
    ``slack_notified_terminal``; the subsequent task_stranded must
    short-circuit before chat.postMessage. Without this guard,
    operators get the double-page the reviewer warned about.

* fix(fanout): re-read TaskRecord before terminal cleanup to close orphan-message race

Live observation during PR #79 review verification: the same Slack
@mention happy path sometimes leaves the 🚀 task_created message
in the thread (orphaned beside the ✅ task_completed) and sometimes
deletes it cleanly. The race window:

  1. ``task_created`` stream batch posts the rocket message and
     persists ``slack_created_msg_ts`` via the conditional UpdateItem
     introduced in PR #79 review fix #1.
  2. ``task_completed`` stream batch fires ~30s later. Its initial
     GetItem races the prior UpdateItem and sees a stale
     ``channel_metadata`` WITHOUT ``slack_created_msg_ts``.
  3. The terminal cleanup branch checks
     ``channelMeta.slack_created_msg_ts`` — undefined — silently skips
     the chat.delete. The rocket message stays in the thread.

Add a fresh GetItem inside the TERMINAL_EVENTS cleanup branch, after
the dedup UpdateItem has linearized our view of the table. Any prior
``slack_*_msg_ts`` writes are visible by then, so the cleanup fires
correctly. On a re-read failure (DDB throttle / transient blip) we
fall back to the dispatch-entry snapshot and emit
``fanout.slack.cleanup_reread_failed`` so operators can alarm on
the rate.

Pre-existing race (the unconditional UpdateItem in pre-PR-#79 was
the same shape — wrote, GetItem on the next batch could miss it).
PR #79 doesn't introduce it but doesn't fix it either; this commit
does, since the live screenshot evidence appeared during review
verification.

Tests:
  - ``terminal cleanup re-reads TaskRecord``: scripts a stale
    dispatch-entry GetItem followed by a fresh re-read GetItem with
    ``slack_created_msg_ts`` present; asserts chat.delete fires
    against the freshly-read ts.
  - ``terminal cleanup falls back to dispatch-entry snapshot when
    re-read fails``: defense-in-depth — DDB throttle on the re-read
    must not break terminal delivery; cleanup uses the entry snapshot
    and emits the fallback warn.

---------

Co-authored-by: bgagent <bgagent@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alain Krok <alkrok@amazon.com>
scoropeza added a commit to scoropeza/sample-autonomous-cloud-coding-agents that referenced this pull request May 16, 2026
Adds the 14 agent-side approval milestone writers (§11.1) on
``_ProgressWriter`` so Chunk 3's hook integration has a typed API
instead of stringly-typed ``write_agent_milestone`` calls, and the
per-task gate counter / per-container sliding-window rate limit /
denial-injection queue on ``PolicyEngine`` that §6.5 requires.

Why now: the hook work lands cleanly only after these surfaces exist
— every code path in ``pre_tool_use_hook``'s REQUIRE_APPROVAL branch
calls one of these helpers. Shipping them separately lets the hook
commit be about the state machine, not the event-shape bookkeeping.

Engine additions:
  - ``approval_gate_count`` / ``increment_approval_gate_count``: the
    per-task counter §12.9 bounds at ``approvalGateCap``. Session-scoped
    in v1; persistence tracked in §17.
  - ``approvals_in_last_minute`` / ``record_approval_gate_timestamp``:
    sliding-window rate limit (20/min/container, §12.9). Prune on read
    so callers see the current count without a separate tick.
  - ``queue_denial_injection`` / ``drain_denial_injections``: queue
    consumed by ``_denial_between_turns_hook`` at the next Stop seam
    (§6.5). Reason is pre-sanitized upstream by ``DenyTaskFn``.
  - ``mark_ceiling_shrinking_emitted``: emit-once latch for IMPL-26.
  - ``APPROVAL_RATE_LIMIT`` / ``APPROVAL_RATE_WINDOW_S`` module consts
    the hook imports rather than re-deriving.

Milestone writers (§11.1 table, 14 agent-emitted of 15):
  - ``pre_approvals_loaded``, ``approval_requested``,
    ``approval_granted``, ``approval_denied``, ``approval_timed_out``,
    ``approval_stranded``, ``approval_write_failed``,
    ``approval_resume_failed``, ``approval_poll_degraded``,
    ``approval_timeout_capped``, ``approval_ceiling_shrinking``,
    ``approval_cap_exceeded``, ``approval_rate_limit_exceeded``,
    ``approval_late_win``.
  - ``approval_decision_recorded`` (Lambda audit) and
    ``approval_timeout_capped_at_submit`` (CreateTaskFn) stay on the
    Lambda side — Chunk 5 owns those.

Each helper is a thin wrapper over ``_put_event("agent_milestone",
...)`` so the shared circuit-breaker + classifier path (finding aws-samples#6/aws-samples#8)
continues to apply. Metadata keys mirror the §11.1 shapes verbatim
(``maxLifetime_remaining_s`` preserves the design-doc spelling for
downstream parsers).

Tests: +29 total. 17 on ``TestApprovalMilestoneHelpers`` pin the DDB
payload shape for each helper (including the two
``approval_timeout_capped`` reason variants — rule_annotation carries
matching_rule_ids; maxLifetime_ceiling omits the field). 12 on the
engine: counter monotonicity, rate-window prune semantics at window
boundary, denial-queue FIFO + drain-clears, ceiling-shrinking latch
idempotency.

No caller changes — engine and writer surfaces are additive. Hook
integration lands in commit C.
scoropeza added a commit to scoropeza/sample-autonomous-cloud-coding-agents that referenced this pull request May 16, 2026
Lands the three read/discovery handlers Chunk 6 (CLI) needs to power
``bgagent pending``, ``bgagent policies list/show``, and
``bgagent notifications configure slack``. Completes §15.2 tasks
14, 15, and 25 (handler side).

Handlers
--------

``get-pending.ts`` (§7.7 — GET /v1/pending)
  - Queries ``user_id-status-index`` GSI on TaskApprovalsTable with
    ``user_id = :caller AND status = :pending``. Without the GSI
    this would be a full-table Scan per call — under
    ``watch -n1 bgagent pending`` that exhausts burst capacity for
    the whole fleet (§10.1 finding aws-samples#8).
  - Response maps each row to ``PendingApprovalSummary`` with a
    derived ``expires_at = created_at + timeout_s`` so the CLI can
    render time-to-timeout without doing arithmetic on ISO strings.
  - Severity coerced to ``medium`` on unknown values so GSI writes
    that drift from the enum don't break the list response.
  - Rate-limited 10/min/user (synthetic row on the same table,
    namespaced ``RATE#<user>#PENDING`` so it does not collide with
    the approve/deny counter).

``get-policies.ts`` (§7.6 — GET /v1/repos/{repo_id}/policies)
  - Combines ``BUILTIN_HARD_DENY_POLICIES`` + ``BUILTIN_SOFT_DENY_POLICIES``
    with the repo's ``cedar_policies`` blueprint override. Runs the
    combined text through ``parseRules`` and returns
    ``{hard[], soft[]}`` rule summaries.
  - 5-minute per-repo in-Lambda cache; cold starts throw it away.
    ``_resetCacheForTests`` exposed for unit-test isolation.
  - Repo ID is URL-decoded from the path (``owner%2Frepo`` common in
    CLI UX).
  - Rate-limited 30/min/user.
  - Blueprint load failure falls back to built-ins with a WARN log;
    invalid blueprint cedar text returns 503 ``SERVICE_UNAVAILABLE``
    rather than a misleading empty list.

``link-slack-user.ts`` (§11.2 finding aws-samples#4 — POST /v1/notifications/slack/link)
  - Writes to SlackUserMappingTable with
    ``ConditionExpression: attribute_not_exists(slack_user_id)``. This
    guard is the entire admission control the §11.2 design hinges on:
    even a compromised Slack admin cannot overwrite an existing
    mapping.
  - Validates ``slack_user_id`` shape (letters, digits, underscores,
    2–40 chars) so junk rows cannot land.
  - Conflict surface is 409 ``REQUEST_ALREADY_DECIDED`` — reused
    error code (the payload message directs the user to unlink via
    support).
  - Slack link_token end-to-end validation against Slack OAuth is
    deferred — v1 accepts the token on trust from the Cognito-authed
    caller; it is persisted in CloudWatch for audit.

Supporting primitives
---------------------

``shared/builtin-policies.ts`` — mirrors ``agent/policies/hard_deny.cedar``
and ``agent/policies/soft_deny.cedar`` as TypeScript string constants.
Embedded rather than read from disk because Lambda's esbuild bundler
does not copy non-TS assets by default and a dedicated bundling hook
is more code than the embed. A drift test
(``builtin-policies.test.ts``) asserts byte-equality with the agent
files so any change on one side without the other flips red at build
time.

``shared/cedar-policy.ts`` — ``parseRules`` now skips the unannotated
``base_permit`` entry (both tiers need it as a Cedar catch-all; it
is not a user-facing rule so it stays out of ParsedRule[]). This
matches the agent-side ``_parse_policy_annotations`` behaviour.

Tests: +37 total.
  - get-pending (8): 401 on missing auth, 429 on rate limit, empty
    result, GSI query shape, row → PendingApprovalSummary with
    derived expires_at, severity fallback, missing timeout → expires_at
    falls back to created_at, 500 on DDB error.
  - get-policies (11): 401/400 validation, built-in rules listed on
    empty repo, URL-decoded repo path, custom blueprint rule lands
    in soft, per-repo cache across calls, 429 rate limit, 503 on
    invalid blueprint cedar, fallback on load failure, hard rules
    omit severity / approval_timeout_s, soft rules carry them.
  - link-slack-user (8): 401/400 validation, shape check, 201 on
    success, 409 on overwrite attempt, 500 on unknown DDB error,
    whitespace trim on slack_user_id, ConditionExpression verified.
  - builtin-policies (4): drift byte-equality with both agent files,
    parseRules round-trip for hard/soft rule IDs.
  - cedar-policy (updated): ``base_permit`` is skipped from
    ParsedRule[] rather than rejected.

Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout) lands in
the next commit.
krokoko added a commit to isadeks/sample-autonomous-cloud-coding-agents that referenced this pull request May 18, 2026
* docs(cedar-hitl): restore and revise HITL gates design, fold adversarial findings

Design doc was accidentally removed in 0742ebe; restored from b34d7cd and
substantially revised under a new filename. "Phase 3" framing dropped — this
is the Cedar HITL approval gates feature.

- Renamed PHASE3_CEDAR_HITL.md → CEDAR_HITL_GATES.md; all "phase" gating
  removed (Phase 3a/3b → v1 / future work §17).
- Integrated 16 findings from 2026-05-06 adversarial review with realistic
  scenarios. Major structural changes:
  - Decision #23 (new): cross-engine parity contract between cedarpy (agent,
    Python) and @cedar-policy/cedar-wasm@4.10.0 (Lambda, TS).
  - §11.2: SlackUserMappingTable with OAuth user-initiated mapping; severity-
    gated Slack approvals; admin has no write path.
  - §7.1/§12.3: ApproveTaskFn uses cross-table TransactWriteItems for atomicity.
  - §10.1: user_id-status-index GSI on TaskApprovalsTable; v1 not v-later.
  - §15.6: cedar-wasm as a Lambda layer shared across policy Lambdas.
- Gate-cap revision (2026-05-07): decision #13 — default 50, blueprint-
  configurable via security.approvalGateCap (bounded 1–500), persisted on
  TaskTable. Cache memory bound decoupled: 50-entry LRU regardless of cap.
  IMPL-22 adds telemetry-driven re-evaluation criteria.
- Timeout adversarial+advocate pass (2026-05-07):
  - §6.5 VM-throttle race fix: re-read row on failed TIMED_OUT
    ConditionCheckFailed; honor APPROVED if user beat the timer. IMPL-24.
  - Sub-120s @approval_timeout_s emits blueprint-load WARN. IMPL-25.
  - User-visible timeout cap milestones (approval_timeout_capped_at_submit,
    approval_ceiling_shrinking). IMPL-26.
  - Runtime JWT: no refresh logic in agent/src/ (container uses IAM role);
    ceiling stays min(1h, maxLifetime_remaining - 120s). IMPL-27.
  - Three new CloudWatch metrics for timeout tuning. IMPL-28.
  - §14.8 new: off-hours trade-off section (fail-closed is the invariant).
  - §13.13 new: notification-delivery failure does NOT pause the timer
    (bypass-prevention).
- Added six mermaid diagrams: three-outcome decision flow, end-to-end round-
  trip, TaskApprovalsTable state machine, Slack user-mapping, fail-closed
  decision flow, cross-engine parity check.
- Cross-references updated in INTERACTIVE_AGENTS.md and SECURITY.md.
- Starlight mirror regenerated via docs/scripts/sync-starlight.mjs.

No code changes in this commit — design work only. Implementation lands in a
follow-up PR per §15.2 task list.

* feat(cedar-hitl): pin Cedar engines and seed cross-engine parity contract

Chunk 1 of the Cedar HITL gates PR (docs/design/CEDAR_HITL_GATES.md).
Lays the foundation before engine rewrites in Chunk 2+: both Cedar engines
pinned exactly per decision #23, annotation surface validated by Day-1
spikes per decision #22, and the golden-file parity fixtures seeded so
every subsequent chunk can rely on the contract.

- Pin cedarpy==4.8.0 (agent) and @cedar-policy/cedar-wasm@4.10.0 (cdk)
  exactly (no ^/~); document both in mise.toml header.
- Add agent/tests/test_cedarpy_annotations_contract.py (10 tests)
  validating all 5 annotations round-trip verbatim via
  policies_to_json_str() under staticPolicies.<id>.annotations.
- Add cdk/test/handlers/shared/cedar-policy.test.ts (12 tests) validating
  policySetTextToParts + policyToJson extract the same annotations
  verbatim and isAuthorized returns the documented {type, response}
  wrapper shape.
- Add contracts/cedar-parity/ with 5 golden-file fixtures (single-match,
  multi-match, hard-deny, soft-deny write, no-match default-allow) +
  README documenting the contract. Every fixture policy carries a
  @rule_id - including the base permit as @rule_id("base_permit") - so
  the parity tests raise if either engine returns an unannotated match
  instead of silently dropping it.
- Add agent/tests/test_cedar_parity.py (6 tests, cedarpy side) and
  cdk/test/handlers/shared/cedar-parity.test.ts (6 tests, cedar-wasm
  side) loading the shared fixtures and asserting (decision, sorted
  rule_ids) match expected. Both tests hard-import cedarpy/cedar-wasm
  so a dependency regression fails loud rather than silently skipping.
- Update docs/design/CEDAR_HITL_GATES.md sections 15.2 row 3, 15.6
  prose and the parity mermaid diagram to point at contracts/cedar-parity/
  (the precedent set by contracts/memory-hash-vectors.json) instead of a
  new tests/fixtures/ dir. Regenerate the Starlight mirror.
- Add IMPL-29 noting the cedarpy diagnostics.reasons / cedar-wasm
  diagnostics.reason naming asymmetry surfaced by the spikes; engine
  code normalizes at the boundary.
- Fix rev-4 -> rev-5 cosmetic footer drift.

Test counts: agent 500 -> 516 (+16), cdk 1036 -> 1054 (+18), cli 190
unchanged. No production code changes in this chunk; engine rewrite
lands in Chunk 2.

Follow-up: separate chore issue to move contracts/memory-hash-vectors.json
into a self-named subdir for consistency with contracts/cedar-parity/.

* feat(cedar-hitl): three-outcome PolicyEngine core

Chunk 2 of the Cedar HITL gates PR. Rewrites agent/src/policy.py into
the three-outcome engine specified in docs/design/CEDAR_HITL_GATES.md
section 6. The REQUIRE_APPROVAL outcome is the human-in-the-loop surface
the next chunks (PreToolUse hook extension, REST API, CLI) plug into.
This chunk ships the engine and its load-time validation; no hook or
wire-format changes yet.

Engine:
- Outcome enum (ALLOW, DENY, REQUIRE_APPROVAL) + extended PolicyDecision
  with .allowed backward-compat shim for Phase 1a/1b/2 callers. Custom
  __init__ accepts both outcome= and legacy allowed= kwargs so existing
  tests keep working verbatim.
- Three-outcome pipeline per section 6.2: hard-deny eval (absolute) ->
  allowlist fast-path (tool_type/tool_group/bash_pattern/write_path/
  all_session) -> recent-decision cache (60s TTL on DENIED/TIMED_OUT) ->
  soft-deny eval (with post-eval rule-scope allowlist check and
  blueprint_disable filtering) -> default ALLOW.
- ApprovalAllowlist (section 6.4): parses and matches every scope type.
  Strips whitespace and rejects empty-after-strip values so
  "tool_type: Read " normalizes instead of silently mismatching (review
  finding 6).
- RecentDecisionCache (section 12.9): 50-entry LRU, INDEPENDENT of
  approvalGateCap. Populated only on DENIED/TIMED_OUT. Session-scoped
  (documented section 12.8 caveat).
- Annotation handling (sections 5.2 + 6.3): parses @rule_id, @tier,
  @approval_timeout_s, @severity, @category via
  cedarpy.policies_to_json_str(); merges on multi-match with min timeout
  (clamped by 30s floor) and max severity.
- Load-time validation (sections 5.1, 12.4): rejects missing/mismatched
  @tier, missing @rule_id, sub-floor timeouts, duplicate rule_ids across
  tiers, blueprint text > 64 KB, disable entries naming built-in
  hard-deny rules (finding 9), approval_gate_cap outside [1, 500]
  (decision 13). Sub-120s @approval_timeout_s emits WARN but accepts
  (IMPL-25).
- Fail-closed posture (section 13): cedarpy parse errors surface via
  diagnostics.errors -> RuntimeError raised inside _eval_tier -> outer
  handler returns DENY with reason "fail-closed: <ExceptionType>".
  TypeError on json.dumps of unhashable tool_input surfaces as distinct
  "fail-closed: unhashable_tool_input" reason (review finding 5).

Built-in policies:
- agent/policies/hard_deny.cedar: base_permit catch-all + rm_slash +
  write_git_internals + write_git_internals_nested + drop_table +
  pr_review-specific Write/Edit forbids (absolute).
- agent/policies/soft_deny.cedar: base_permit (catch-all required in
  each tier so cedarpy default-deny does not convert no-match into
  DENY) + force_push_any + force_push_main + push_to_protected_branch
  + write_env_files + write_credentials. All soft rules carry @tier,
  @rule_id, @approval_timeout_s, @severity, @category per section 15.4
  starter set.

Review findings addressed (1 blocker, 8 significant, plus minor):
- blueprint_disable actually disables soft rules at eval time instead
  of silently no-op (the blocker: test coverage had been a silent-pass).
- Legacy extra_policies with @tier/@rule_id rejected to avoid undefined
  double-annotation behavior.
- _matching_rule_ids logs WARN on unknown policy IDs (state-drift
  signal).
- base_permit validator exemption restricted to effect=="permit" so
  misnamed forbid rules cannot bypass validation (finding 7).
- Hard-tier Cedar no_decision logged at WARN (signals missing/malformed
  base_permit catch-all).
- Allowlist whitespace normalization + empty-value rejection.
- StrEnum upgrade, Callable moved to TYPE_CHECKING, assert replaced
  with explicit RuntimeError for S101 compliance.

Phase 1 compatibility:
- All 39 existing test_policy.py tests pass unchanged via the
  .allowed property. One test (test_invalid_policy_syntax_fails_closed)
  updated to patch _hard_policies instead of the removed _policies
  attribute; docstring explains the rewrite.
- extra_policies kwarg preserved; callers with annotated rules must
  migrate to blueprint_soft_policies / blueprint_hard_policies.

Test counts: agent 516 -> 576 (+60: 51 three-outcome + 9 regression
fixes). cli 190 unchanged. cdk 1054 unchanged.

Carry-forward to Chunk 3:
- extra_policies semantic shift (Phase 1 DENY -> Chunk 2
  REQUIRE_APPROVAL); .allowed=False preserved but .outcome differs.
  Switchover happens when hooks.py adopts the three-outcome branching.
- Cross-tier action-context asymmetry (review finding 8): document
  rule-authoring constraint in section 5.5 of design.
- Probe entity-shape coverage (finding 10): extend _probe_cedar to
  exercise Write/Edit/Bash action paths, not just invoke_tool.

* feat(cedar-hitl): approval milestone writers + engine counters

Adds the 14 agent-side approval milestone writers (§11.1) on
``_ProgressWriter`` so Chunk 3's hook integration has a typed API
instead of stringly-typed ``write_agent_milestone`` calls, and the
per-task gate counter / per-container sliding-window rate limit /
denial-injection queue on ``PolicyEngine`` that §6.5 requires.

Why now: the hook work lands cleanly only after these surfaces exist
— every code path in ``pre_tool_use_hook``'s REQUIRE_APPROVAL branch
calls one of these helpers. Shipping them separately lets the hook
commit be about the state machine, not the event-shape bookkeeping.

Engine additions:
  - ``approval_gate_count`` / ``increment_approval_gate_count``: the
    per-task counter §12.9 bounds at ``approvalGateCap``. Session-scoped
    in v1; persistence tracked in §17.
  - ``approvals_in_last_minute`` / ``record_approval_gate_timestamp``:
    sliding-window rate limit (20/min/container, §12.9). Prune on read
    so callers see the current count without a separate tick.
  - ``queue_denial_injection`` / ``drain_denial_injections``: queue
    consumed by ``_denial_between_turns_hook`` at the next Stop seam
    (§6.5). Reason is pre-sanitized upstream by ``DenyTaskFn``.
  - ``mark_ceiling_shrinking_emitted``: emit-once latch for IMPL-26.
  - ``APPROVAL_RATE_LIMIT`` / ``APPROVAL_RATE_WINDOW_S`` module consts
    the hook imports rather than re-deriving.

Milestone writers (§11.1 table, 14 agent-emitted of 15):
  - ``pre_approvals_loaded``, ``approval_requested``,
    ``approval_granted``, ``approval_denied``, ``approval_timed_out``,
    ``approval_stranded``, ``approval_write_failed``,
    ``approval_resume_failed``, ``approval_poll_degraded``,
    ``approval_timeout_capped``, ``approval_ceiling_shrinking``,
    ``approval_cap_exceeded``, ``approval_rate_limit_exceeded``,
    ``approval_late_win``.
  - ``approval_decision_recorded`` (Lambda audit) and
    ``approval_timeout_capped_at_submit`` (CreateTaskFn) stay on the
    Lambda side — Chunk 5 owns those.

Each helper is a thin wrapper over ``_put_event("agent_milestone",
...)`` so the shared circuit-breaker + classifier path (finding #6/#8)
continues to apply. Metadata keys mirror the §11.1 shapes verbatim
(``maxLifetime_remaining_s`` preserves the design-doc spelling for
downstream parsers).

Tests: +29 total. 17 on ``TestApprovalMilestoneHelpers`` pin the DDB
payload shape for each helper (including the two
``approval_timeout_capped`` reason variants — rule_annotation carries
matching_rule_ids; maxLifetime_ceiling omits the field). 12 on the
engine: counter monotonicity, rate-window prune semantics at window
boundary, denial-queue FIFO + drain-clears, ceiling-shrinking latch
idempotency.

No caller changes — engine and writer surfaces are additive. Hook
integration lands in commit C.

* feat(cedar-hitl): TaskApprovals + AWAITING_APPROVAL transition primitives

Adds the four agent-side DDB primitives §6.5 + IMPL-24 need for the
three-outcome hook integration in the next commit:

  - ``transact_write_approval_request`` — cross-table TransactWriteItems:
    Put(TaskApprovalsTable) with ``attribute_not_exists(request_id)`` +
    Update(TaskTable) gated on ``status = RUNNING``. Atomic per §12.3 so
    a concurrent cancel cannot land the task in AWAITING_APPROVAL with
    no matching approval row (or vice versa).
  - ``transact_resume_from_approval`` — Update(TaskTable) gated on
    ``status = AWAITING_APPROVAL AND awaiting_approval_request_id =
    :rid``. The ``request_id`` condition prevents resuming with a stale
    ID after a reconciler race (§13.9).
  - ``best_effort_update_approval_status`` — conditional UpdateItem on
    the approval row with ``status = :pending`` guard. Returns False on
    ``ConditionalCheckFailedException``; this is the signal IMPL-24's
    re-read path fires on (§6.5 pseudocode lines 846-879, §13.12).
  - ``get_approval_row`` — GetItem with ``ConsistentRead=True`` by
    default. Required by IMPL-24's re-read; kept opt-out (bool flag) for
    future cold-path callers that don't need the strong read.

Errors:
  - ``ApprovalTablesUnavailable`` for env-var-missing — raised loud so
    a pre-Chunk-4 deploy fails closed (hook will map to DENY) rather
    than silently no-op'ing the gate.
  - ``ApprovalWriteError`` / ``ApprovalResumeError`` wrap
    ``TransactionCanceledException`` with the cancellation reasons
    list. The hook uses these to distinguish the "concurrent cancel"
    branch from real DDB outages.
  - ``ConditionalCheckFailedException`` on ``update_item`` is consumed
    and returned as ``False`` from ``best_effort_update_approval_status``
    — the caller (hook) needs the boolean to decide whether to
    re-read, not to propagate.
  - All other DDB errors propagate so the hook's outer try/except can
    classify fail-closed with a specific reason.

Implementation notes:
  - Uses ``boto3.client("dynamodb")`` low-level API (not resource).
    ``transact_write_items`` lives on the client, and marshalling the
    approval row attributes explicitly gives deterministic DDB shapes
    that the tests can assert on. ``_py_to_ddb_attr`` covers the
    subset of Python types §10.1 actually uses (str/int/bool/None/list
    of str); any other type raises TypeError loudly rather than
    silently writing something unexpected.
  - ``_extract_error_code`` / ``_extract_cancellation_reasons`` duck-type
    on ``exc.response`` so we don't need botocore at import time (tests
    use a minimal exception class).
  - Errors from unsupported types (floats, dicts, etc.) are caught
    BEFORE the DDB round-trip so the unit-test asserts
    ``transact_write_items`` was not called — catches schema drift
    early.
  - Status constants (``_STATUS_RUNNING`` / ``_STATUS_AWAITING_APPROVAL``)
    named so a rename in CDK cannot silently diverge the Python path.

Tests: +20 total.
  - 5 on TransactWriteApprovalRequest: env-missing, happy-path shape
    assertion (both items + conditions), TransactionCanceled → ApprovalWriteError
    with reasons preserved, other errors propagate, unsupported type rejected
    before any DDB call.
  - 3 on TransactResumeFromApproval: env-missing, happy-path expression
    shape (includes REMOVE awaiting_approval_request_id), cancel →
    ApprovalResumeError.
  - 4 on BestEffortUpdateApprovalStatus: happy path returns True,
    ``reason`` kwarg attaches ``deny_reason``, ConditionalCheckFailed
    returns False (IMPL-24's signal), other errors propagate.
  - 4 on GetApprovalRow: ConsistentRead default True, opt-out False,
    row-not-found returns None, row unmarshalling through every
    supported DDB attribute type.
  - 4 on helpers: error-code extraction with and without
    ClientError-shape, cancellation-reasons extraction with and without.

No runtime callers yet — hook integration lands in commit C. Physical
TaskApprovalsTable lands in Chunk 4; Python side is wire-compatible so
the hook work can be unit-tested today with mocked clients.

* feat(cedar-hitl): PreToolUse three-outcome REQUIRE_APPROVAL path

Wires the agent to the full §6.5 pseudocode: cap + rate-limit check,
atomic TransactWriteItems for pending row + TaskTable AWAITING_APPROVAL,
2s→5s ConsistentRead poll, IMPL-24 VM-throttle race re-read, resume
transition, scope propagation to allowlist, and denial-injection queue
consumed at the next Stop seam. Completes §15.2 rows 26 + 27.

Hook control flow (three outcomes)
----------------------------------
- ALLOW / DENY: existing Phase 1 behavior, now switching on
  ``.outcome`` rather than ``.allowed``. Legacy Phase 1/2 tests still
  green because PolicyDecision preserves the ``.allowed`` shim.
- REQUIRE_APPROVAL (new): extracted into ``_handle_require_approval``
  for readability. Delegates to ``task_state`` primitives and
  ``engine.*`` counter surfaces from the prior commits; no new DDB
  client construction here.

Key pieces:
  - ``_compute_effective_timeout`` applies the §6.5 min(rule, default,
    lifetime) formula. The engine's ``_merge_annotations`` has already
    clipped decision.timeout_s against the task default; the hook adds
    the remaining-lifetime ceiling and floors at FLOOR_30S.
    ``clip_reason`` distinguishes ``rule_annotation`` (rule was tighter
    than task default) from ``maxLifetime_ceiling`` (task is late in
    its life) so ``approval_timeout_capped`` carries the right reason.
  - ``_remaining_maxlifetime_s`` reads ``AGENTCORE_MAX_LIFETIME_S`` +
    ``TASK_STARTED_AT`` env vars (8h default). Returns ``None`` when
    the start timestamp is absent — the hook treats that as "unknown,
    don't clip" rather than pre-DENYing, so Phase 1 test paths that
    don't set the env var still see the old task-default behaviour.
    Chunk 4/5 will wire these at task launch.
  - ``_poll_for_decision`` uses 2s cadence for the first 30s then 5s
    (IMPL-12). All polls use ``ConsistentRead=True`` per IMPL-24. 3
    consecutive GetItem failures emit ``approval_poll_degraded``; 10
    consecutive failures fall through as TIMED_OUT with a specific
    reason (§13.2).
  - ``_reconcile_late_decision`` implements IMPL-24 re-read: on a
    ConditionCheckFailed from the TIMED_OUT write, re-read with
    ConsistentRead. APPROVED → rebuild outcome, propagate scope to
    allowlist, run normal allow flow, emit ``approval_late_win``.
    DENIED → honor the user's sanitized reason. PENDING or row gone
    → fall through with TIMED_OUT (fail-closed, §13.12 last paragraph).

Cancel-wins semantics (finding #2)
----------------------------------
``_denial_between_turns_hook`` is registered AFTER
``_nudge_between_turns_hook`` in ``between_turns_hooks`` so cancel
short-circuits both. The hook re-checks ``_cancel_requested`` itself
as belt-and-braces (matching the nudge hook) so a future reorder does
not silently break cancel-wins. Denial queue is PRESERVED on cancel —
not drained — so a denial still sitting on the queue when the task is
being torn down does not leak across tasks (the engine is per-task
per §IMPL-7).

``stop_hook`` threads ``engine`` into ``ctx`` so the denial hook can
``drain_denial_injections``. ``build_hook_matchers`` accepts a new
``user_id`` kwarg (§12.2) so approval rows carry caller identity for
the REST side's ownership check.

``permissionDecisionReason`` guaranteed surface
-----------------------------------------------
The hook's deny return is the ONLY guaranteed surface the SDK emits
to the agent; denial injection is best-effort (pre-empted by cancel).
``_deny_response`` pipes every reason through ``_strip_ansi`` +
``_truncate(500)``: ANSI sequences can never reach the model, and the
line stays loggable. §12.7 requirement.

Tests: +24 agent hook tests (47 total in test_hooks.py). Run in 0.92s
via a ``_fast_poll`` fixture that collapses ``asyncio.sleep`` to a
no-op AND advances ``hooks.time.monotonic`` by the requested duration
so the poll wall-clock deadline actually trips.

Happy paths:
  - APPROVED + scope propagation to allowlist + milestones.
  - APPROVED with scope=this_call does NOT grow allowlist.
  - DENIED queues denial injection + populates recent-decision cache
    (next identical call auto-denies).
  - TIMED_OUT writes TIMED_OUT row and emits approval_timed_out.

IMPL-24 race: four branches.
  - APPROVED re-read → allow flow, approval_late_win milestone, scope
    propagated, resume succeeds.
  - DENIED re-read → deny flow, approval_late_win milestone, user's
    reason is the permissionDecisionReason.
  - Still-PENDING re-read → fail-closed fall-through (no late_win).
  - Row-gone re-read → same fail-closed fall-through.

Cap / rate-limit / write failure / resume failure branches all:
  - Short-circuit before any DDB write when the local guard fires
    (cap, rate limit).
  - Emit the right approval_* milestone.
  - Return DENY with a specific permissionDecisionReason.

Sanitization:
  - ANSI stripped from deny reason.
  - Deny reason truncated to ≤500 chars.

Timeout clipping:
  - rule_annotation reason when a rule's approval_timeout_s is below
    the task default; matching_rule_ids populated.
  - maxLifetime_ceiling reason when remaining lifetime is the tightest
    bound; matching_rule_ids is None.
  - approval_ceiling_shrinking emits exactly once per task (IMPL-26
    latch).

Denial injection hook (6 tests):
  - Draining produces a <user_denial request_id=... decided_at=...>
    block with XML-escaped reason.
  - Cancel short-circuit preserves the queue so the denial is not
    lost; just not injected into a dying agent.
  - Hostile reason (</user_denial>...<user_nudge>) is XML-escaped so
    the envelope cannot be forged.
  - No-engine ctx returns [] (Phase 1 call sites still work).
  - Registered LAST in ``between_turns_hooks`` (invariant for §6.5
    finding #2).
  - End-to-end via stop_hook: queued denial becomes
    ``decision=block`` + reason on the Stop return.

Carry-forward
-------------
- ``_remaining_maxlifetime_s`` returns None when TASK_STARTED_AT is
  unset — Chunk 4/5 will wire this at task launch. Tracked in §16.
- ``approval_gate_count`` lives on the engine (session-scoped) not on
  TaskTable in v1. §13.6 notes that the reconciler + approval_gate_cap
  still bound worst-case across container restarts. Chunk 7+ tracks
  persistence when telemetry justifies it.
- Denial injection emits a ``user_denial_injected`` milestone that is
  NOT in the §11.1 enumerated table. It mirrors ``nudge_acknowledged``
  for stream visibility; keep the name distinct from the ``approval_*``
  prefix so future §11.1 consumers can't confuse it with an approval
  outcome.

* feat(cedar-hitl): TaskApprovalsTable + SlackUserMapping + status enum

Lands the stateless CDK primitives for Cedar-HITL approval gates so
Chunk 5's REST handlers can be wired onto concrete tables. Completes
§15.2 tasks 9, 20, and 25.

Constructs
----------

``TaskApprovalsTable`` (§10.1)
  - PK ``task_id`` + SK ``request_id`` (ULID). Matches the agent-side
    primitives landed in the prior commit.
  - GSI ``user_id-status-index`` with user_id PK + status SK and an
    ``INCLUDE`` projection limited to the fields GET /v1/pending
    renders. Three deny-sensitive attrs (``deny_reason``, ``scope``,
    ``tool_input_sha256``) deliberately omitted from the projection —
    the list endpoint only returns PENDING rows in practice, but
    excluding them kills the projection-leak concern outright and
    costs no bytes today.
  - Exports ``USER_STATUS_INDEX_NAME`` as a module constant + mirrors
    it on ``construct.userStatusIndexName`` so handlers referencing
    the GSI fail compile-time on a rename.
  - TTL attribute ``ttl`` (agent writes ``created_at + timeout_s +
    120s``).
  - No DynamoDB streams per §11.2. TaskEventsTable carries the audit
    fan-out; streams here would duplicate.
  - Default RemovalPolicy.DESTROY to match the rest of the sample.
    Production deploys override to RETAIN per §10.1.

``SlackUserMappingTable`` (§11.2, finding #4)
  - Single-key (``slack_user_id`` PK). No SK, no TTL, no GSI, no
    stream. The forward-only shape is the trust boundary — a reverse
    GSI (Cognito → Slack) would let a compromised Cognito sub
    enumerate Slack identities without adding v1 capability.
  - Writes land through LinkSlackUserFn (Chunk 5) which enforces the
    ``attribute_not_exists(slack_user_id)`` condition so a prior
    legitimate mapping cannot be overwritten by a later compromise.

``task-status.ts`` — AWAITING_APPROVAL (§10.3)
  - Added to TaskStatus enum + ACTIVE_STATUSES (NOT TERMINAL_STATUSES:
    the task is alive, paused on a human decision).
  - VALID_TRANSITIONS wires the five edges §10.3 enumerates:
      RUNNING      → AWAITING_APPROVAL  (soft-deny entry)
      HYDRATING    → AWAITING_APPROVAL  (rare early-gate case)
      AWAITING_APPROVAL → RUNNING       (approve / deny resume)
      AWAITING_APPROVAL → CANCELLED     (user cancel mid-approval)
      AWAITING_APPROVAL → FAILED        (stranded-approval reconciler)
  - Notably NOT added:
      AWAITING_APPROVAL → FINALIZING    (approve-during-cleanup race)
      AWAITING_APPROVAL → COMPLETED     (skip RUNNING)
      AWAITING_APPROVAL → TIMED_OUT     (timer lives on the approval
                                         row, not the task clock)
    These are regression tests so a future refactor cannot quietly
    add them and bypass the `awaiting_approval_request_id = :rid`
    invariant.

Tests: +29 total.
  - TaskApprovalsTable (11 tests): PK/SK schema, PAY_PER_REQUEST,
    PITR default + override, TTL attribute, NO streams, GSI schema +
    projection + sensitive-attr exclusion, removal policy default +
    override, ``USER_STATUS_INDEX_NAME`` constant parity with the
    construct field.
  - SlackUserMappingTable (8 tests): single-key schema (explicit
    KeySchema length assertion), PAY_PER_REQUEST, PITR, no streams,
    no reverse GSI, DESTROY default, TTL absent.
  - TaskStatus (+10 tests over existing: 5 new assertions on the
    9-state cardinality, AWAITING_APPROVAL membership, and the
    transition graph including the three forbidden edges). The
    existing assertions updated for the new state count.

No stack wiring yet — ``agent.ts`` instantiation + env var plumbing +
grants land in the next commit alongside the Cedar-WASM Lambda layer.

* feat(cedar-hitl): Cedar-wasm layer + wire approval tables into agent stack

Activates the agent-side approval path and ships the Lambda layer
Chunk 5's REST handlers need.

Cedar-wasm Lambda layer (§15.2 task 10)
----------------------------------------

``CedarWasmLayer`` bundles ``@cedar-policy/cedar-wasm@4.10.0`` into
``/opt/nodejs/node_modules/`` so Lambdas can
``require('@cedar-policy/cedar-wasm/nodejs')`` without shipping the
4 MB wasm binary in every function package. A dedicated
``cdk/layers/cedar-wasm/`` directory carries a minimal ``package.json``
pinning the exact version — bundling runs ``npm install --omit=dev``
against that manifest, so the layer build is hermetic from any
``cdk/node_modules/`` drift.

The bundler has two fallbacks:
  - Docker (``public.ecr.aws/sam/build-nodejs22.x``) for CI / prod
    deploys.
  - Local-npm fallback for environments without Docker (unit-test
    synths + `cdk synth` on runners that lack Docker). The local
    path is safe here because the layer ships pure JS + a prebuilt
    wasm binary — no native build step.

Three constants exposed from the module:
  - ``CEDAR_WASM_VERSION`` — single source of truth for the pinned
    version; tests assert this matches both ``cdk/package.json`` and
    the layer manifest, so the three places the version lives stay
    in sync.
  - ``CEDAR_WASM_MIN_LAMBDA_MEMORY_MB`` — 512 MB floor for attaching
    Lambdas per §15.2 task 10.
  - ``CedarWasmLayer.layer`` — the underlying ``LayerVersion`` for
    Chunk 5 handlers to attach via ``fn.addLayers(...)``.

Agent stack wiring (§15.2 task 19)
------------------------------------

``agent.ts`` now instantiates:
  - ``TaskApprovalsTable`` (prior commit) — grants RW to the runtime
    so ``pre_tool_use_hook`` can TransactWriteItems + ConsistentRead
    the PENDING row.
  - ``SlackUserMappingTable`` (prior commit) — not granted to the
    runtime; only the link-user Lambda (Chunk 5) writes here.
  - ``CedarWasmLayer`` — the layer's asset lands in the synthed
    template so Chunk 5 handlers can reference ``.layer`` without
    causing a new asset on their deploy.

New runtime env vars:
  - ``TASK_APPROVALS_TABLE_NAME`` — consumed by
    ``task_state._require_tables``; its absence previously raised
    ``ApprovalTablesUnavailable`` → hook DENY. Now set, so the
    approval path is live on deploy.
  - ``AGENTCORE_MAX_LIFETIME_S = '28800'`` — 8 hours, matching
    ``lifecycleConfiguration.maxLifetime``. Consumed by the hook's
    ``_remaining_maxlifetime_s`` for the maxLifetime ceiling clip
    (§6.5). Kept in sync with the lifecycle via a direct test
    assertion so drift surfaces at build time.

New CfnOutputs: ``TaskApprovalsTableName``, ``SlackUserMappingTableName``,
``CedarWasmLayerArn``. Each is useful for post-deploy smoke tests
(`aws dynamodb describe-table` / `aws lambda get-layer-version`).

Tests: +8 layer tests + 9 agent-stack assertions.

Layer:
  - LayerVersion resource count.
  - Compatible runtimes (nodejs20/22).
  - Description carries the pinned version.
  - CEDAR_WASM_VERSION matches ``cdk/package.json``.
  - CEDAR_WASM_VERSION matches ``layers/cedar-wasm/package.json``.
  - CEDAR_WASM_MIN_LAMBDA_MEMORY_MB ≥ 512.
  - Custom description override works.
  - ``.layer`` exposes a real ``LayerVersion``.

Agent stack:
  - Table count updated from 6 → 8.
  - TaskApprovalsTable schema match (task_id PK / request_id SK,
    user_id-status-index GSI presence).
  - SlackUserMappingTable single-key schema.
  - LayerVersion count + compatibleRuntimes.
  - Three new CfnOutputs present.
  - TASK_APPROVALS_TABLE_NAME env var on the runtime.
  - AGENTCORE_MAX_LIFETIME_S == '28800' (drift guard).

Carry-forward
-------------
- ``TASK_STARTED_AT`` is the other input the hook's
  ``_remaining_maxlifetime_s`` consumes — it's a PER-TASK value the
  orchestrator must stamp at invocation time, not a stack-level env
  var. Chunk 5's orchestrator changes need to add it to the runtime
  invocation payload / session env. For now the hook's fallback
  ("unknown, don't clip") keeps approvals functional.
- Chunk 5 will attach the CedarWasmLayer onto ApproveTaskFn,
  DenyTaskFn, GetPoliciesFn, CreateTaskFn and assert
  ``memorySize >= CEDAR_WASM_MIN_LAMBDA_MEMORY_MB`` for each.

* feat(cedar-hitl): approve + deny handlers + shared types (§7.1, §7.2)

Lands the two user-facing REST handlers that flip a PENDING approval
row to APPROVED / DENIED, the shared types both call sites and the
CLI consume, and the Lambda-side Cedar parser future Chunk-5 handlers
(get-policies, create-task validation) will use.

Wire types (shared/types.ts)
----------------------------
- ApprovalScope union covering every shape the agent's
  ApprovalAllowlist understands. Typed so approve-task / create-task /
  CLI (Chunk 6) all agree at compile time.
- ApprovalRecord / ApprovalRequest / ApprovalResponse / DenyRequest /
  DenyResponse / PendingApprovalSummary / GetPendingResponse /
  PolicyRuleSummary / GetPoliciesResponse / LinkSlackUserRequest /
  LinkSlackUserResponse / SlackUserMappingRecord /
  ApprovalDecisionRecordedEvent / CreateTaskApprovalExtensions.
- Constants: DENY_REASON_MAX_LENGTH=2000, INITIAL_APPROVALS_MAX_ENTRIES=20,
  INITIAL_APPROVALS_MAX_ENTRY_LENGTH=128, APPROVAL_TIMEOUT_S_MIN=30,
  APPROVAL_TIMEOUT_S_MAX=3600, APPROVAL_TIMEOUT_S_DEFAULT=300.

New error codes: REQUEST_NOT_FOUND, REQUEST_ALREADY_DECIDED,
TASK_NOT_AWAITING_APPROVAL.

Shared helpers
--------------
- shared/approval-scope.ts — parseApprovalScope validates every shape;
  rejects unknown tool types / groups / prefixes, empty values,
  over-128-char strings. isDegeneratePattern implements §7.4 (length
  ≤ 2, all-wildcard, wildcard ratio > 50%) for Chunk-5 create-task.
- shared/deny-reason-scanner.ts — scanDenyReason redacts AWS keys,
  GitHub PATs (classic + fine-grained), Slack tokens, PEM blocks,
  bearer tokens with [REDACTED-...] markers. Mirrors
  agent/src/output_scanner.py so the deny reason the agent
  ultimately reads is never raw user input.
- shared/cedar-policy.ts — parseRules pulls the five HITL annotations
  (tier/rule_id/severity/approval_timeout_s/category) into a
  ParsedRule[], preserving positional policy_id for IMPL-29
  diagnostics-to-rule_id resolution. isHardDenyRule, isValidRuleId,
  matchingRuleIds, concatPolicies exposed for future handlers.

Handlers
--------
- approve-task.ts (§7.1) — POST /v1/tasks/{task_id}/approve
  - Cross-table TransactWriteItems: approval row PENDING → APPROVED
    guarded by user_id = :caller AND status = :pending; TaskTable
    no-op Update guarded by status = AWAITING_APPROVAL AND
    awaiting_approval_request_id = :rid.
  - TransactionCanceledException classified by per-item
    CancellationReasons. Approval-row failure collapses to 404
    REQUEST_NOT_FOUND (no existence oracle per §7.1 finding #6);
    task-row failure → 409 TASK_NOT_AWAITING_APPROVAL.
  - Optional scope defaults to this_call.
  - Per-user per-minute rate limit (30/min, synthetic row).
  - Writes approval_decision_recorded audit event (IMPL-6). Audit
    failure is logged but does not fail the request — decision is
    already committed.
- deny-task.ts (§7.2) — POST /v1/tasks/{task_id}/deny
  - Same cross-table pattern; status → DENIED + deny_reason.
  - Reason is scanDenyReason-sanitized + truncated to
    DENY_REASON_MAX_LENGTH BEFORE any persistence — agent and audit
    both read sanitized form; raw input never stored.
  - Same rate-limit namespace as approve.

Tests: +64 total (cedar-policy-parser 24, approval-scope 28,
deny-reason-scanner 13, approve-task 14, deny-task 9). Secret test
fixtures are assembled from string fragments so the source never
holds a contiguous secret literal — Code Defender pre-commit hook
otherwise blocks.

Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout +
LinkSlackUserFn + GetPolicies + GetPending) lands in the next
commit.

* feat(cedar-hitl): get-pending + get-policies + link-slack-user handlers

Lands the three read/discovery handlers Chunk 6 (CLI) needs to power
``bgagent pending``, ``bgagent policies list/show``, and
``bgagent notifications configure slack``. Completes §15.2 tasks
14, 15, and 25 (handler side).

Handlers
--------

``get-pending.ts`` (§7.7 — GET /v1/pending)
  - Queries ``user_id-status-index`` GSI on TaskApprovalsTable with
    ``user_id = :caller AND status = :pending``. Without the GSI
    this would be a full-table Scan per call — under
    ``watch -n1 bgagent pending`` that exhausts burst capacity for
    the whole fleet (§10.1 finding #8).
  - Response maps each row to ``PendingApprovalSummary`` with a
    derived ``expires_at = created_at + timeout_s`` so the CLI can
    render time-to-timeout without doing arithmetic on ISO strings.
  - Severity coerced to ``medium`` on unknown values so GSI writes
    that drift from the enum don't break the list response.
  - Rate-limited 10/min/user (synthetic row on the same table,
    namespaced ``RATE#<user>#PENDING`` so it does not collide with
    the approve/deny counter).

``get-policies.ts`` (§7.6 — GET /v1/repos/{repo_id}/policies)
  - Combines ``BUILTIN_HARD_DENY_POLICIES`` + ``BUILTIN_SOFT_DENY_POLICIES``
    with the repo's ``cedar_policies`` blueprint override. Runs the
    combined text through ``parseRules`` and returns
    ``{hard[], soft[]}`` rule summaries.
  - 5-minute per-repo in-Lambda cache; cold starts throw it away.
    ``_resetCacheForTests`` exposed for unit-test isolation.
  - Repo ID is URL-decoded from the path (``owner%2Frepo`` common in
    CLI UX).
  - Rate-limited 30/min/user.
  - Blueprint load failure falls back to built-ins with a WARN log;
    invalid blueprint cedar text returns 503 ``SERVICE_UNAVAILABLE``
    rather than a misleading empty list.

``link-slack-user.ts`` (§11.2 finding #4 — POST /v1/notifications/slack/link)
  - Writes to SlackUserMappingTable with
    ``ConditionExpression: attribute_not_exists(slack_user_id)``. This
    guard is the entire admission control the §11.2 design hinges on:
    even a compromised Slack admin cannot overwrite an existing
    mapping.
  - Validates ``slack_user_id`` shape (letters, digits, underscores,
    2–40 chars) so junk rows cannot land.
  - Conflict surface is 409 ``REQUEST_ALREADY_DECIDED`` — reused
    error code (the payload message directs the user to unlink via
    support).
  - Slack link_token end-to-end validation against Slack OAuth is
    deferred — v1 accepts the token on trust from the Cognito-authed
    caller; it is persisted in CloudWatch for audit.

Supporting primitives
---------------------

``shared/builtin-policies.ts`` — mirrors ``agent/policies/hard_deny.cedar``
and ``agent/policies/soft_deny.cedar`` as TypeScript string constants.
Embedded rather than read from disk because Lambda's esbuild bundler
does not copy non-TS assets by default and a dedicated bundling hook
is more code than the embed. A drift test
(``builtin-policies.test.ts``) asserts byte-equality with the agent
files so any change on one side without the other flips red at build
time.

``shared/cedar-policy.ts`` — ``parseRules`` now skips the unannotated
``base_permit`` entry (both tiers need it as a Cedar catch-all; it
is not a user-facing rule so it stays out of ParsedRule[]). This
matches the agent-side ``_parse_policy_annotations`` behaviour.

Tests: +37 total.
  - get-pending (8): 401 on missing auth, 429 on rate limit, empty
    result, GSI query shape, row → PendingApprovalSummary with
    derived expires_at, severity fallback, missing timeout → expires_at
    falls back to created_at, 500 on DDB error.
  - get-policies (11): 401/400 validation, built-in rules listed on
    empty repo, URL-decoded repo path, custom blueprint rule lands
    in soft, per-repo cache across calls, 429 rate limit, 503 on
    invalid blueprint cedar, fallback on load failure, hard rules
    omit severity / approval_timeout_s, soft rules carry them.
  - link-slack-user (8): 401/400 validation, shape check, 201 on
    success, 409 on overwrite attempt, 500 on unknown DDB error,
    whitespace trim on slack_user_id, ConditionExpression verified.
  - builtin-policies (4): drift byte-equality with both agent files,
    parseRules round-trip for hard/soft rule IDs.
  - cedar-policy (updated): ``base_permit`` is skipped from
    ParsedRule[] rather than rejected.

Stack wiring (task-api.ts routes, agent.ts layer attachment,
CreateTaskFn extension, orchestrator + reconciler + fanout) lands in
the next commit.

* feat(cedar-hitl): wire Chunk 5 routes + orchestrator + reconciler + agent plumbing

Completes Chunk 5 end-to-end: the five new Lambdas are instantiated
and wired onto the REST API, the orchestrator threads approval-related
data through to the agent runtime, the stranded-task reconciler sweeps
AWAITING_APPROVAL tasks, and the agent pipeline accepts the new
per-task approval configuration.

Stack wiring (agent.ts + task-api.ts)
-------------------------------------
- TaskApi construct accepts `taskApprovalsTable`, `slackUserMappingTable`,
  `cedarWasmLayer` props. Approve/Deny/GetPending Lambdas are created
  when the approvals table is present; GetPolicies also requires the
  cedar-wasm layer + RepoTable. Slack-link Lambda attaches when the
  slack mapping table is provided.
- New routes:
    POST /tasks/{task_id}/approve
    POST /tasks/{task_id}/deny
    GET  /pending
    GET  /repos/{repo_id}/policies
    POST /notifications/slack/link
- GetPoliciesFn configures `memorySize: 512` (Cedar-wasm floor from
  §15.2 task 10) and externalizes `@cedar-policy/cedar-wasm` from the
  esbuild bundle so the layer provides the wasm binary at runtime.
- CedarWasmLayer compatibleRuntimes extended to include nodejs24.x
  (the Lambda runtime) — the Node 20/22 list was the original §15.2
  spec but the actual function uses Node 24.
- agent.ts passes all three new constructs into TaskApi.

Orchestrator (shared/orchestrator.ts)
-------------------------------------
- `finalizeTask` now treats AWAITING_APPROVAL as a "task still alive"
  terminal-timeout source: on poll exhaustion the task transitions to
  TIMED_OUT with a distinct `approval_poll_timeout` reason + error
  message ("Orchestrator poll timeout exceeded while awaiting approval").
  The stranded-approval reconciler is the secondary safety net (§13.6)
  for tasks the orchestrator already lost track of.
- Invocation payload now carries three new fields:
    - `task_started_at` (ISO 8601 at HYDRATING → RUNNING time) —
      consumed by the agent hook's `_remaining_maxlifetime_s` so the
      §6.5 maxLifetime ceiling math uses the real task clock instead
      of the fail-open fallback.
    - `approval_timeout_s` (when the submit payload supplied it).
    - `initial_approvals` (when the submit payload supplied entries).

Stranded-task reconciler
------------------------
- Sweeps AWAITING_APPROVAL in addition to SUBMITTED/HYDRATING.
- New `APPROVAL_STRANDED_TIMEOUT_SECONDS` env var (default 7200s =
  2h) — double §7.3's 1h ceiling so this reconciler never races the
  happy-path timer.
- Distinct failure message on approval-stranded vs generic-stranded
  so users see "approval stranded — container evicted" rather than
  the misleading "no pipeline attached" copy.

Fanout (handlers/fanout-task-events.ts)
---------------------------------------
- Slack channel default set replaces the forward-compat
  `approval_required` stub with the real §11.1 events:
  `approval_requested` and `approval_stranded`. Other approval
  milestones (granted/denied/timed_out/late_win/etc.) stay out of
  default routing to avoid notification fatigue — the CLI surfaces
  those confirmations directly.
- Email default replaces `approval_required` with `approval_requested`
  (high-severity gates only; severity gating happens in the dispatcher).

Create-task validation (shared/create-task-core.ts)
---------------------------------------------------
- New request fields:
    - `approval_timeout_s` — integer within
      `[APPROVAL_TIMEOUT_S_MIN, APPROVAL_TIMEOUT_S_MAX]`.
    - `initial_approvals` — array of scope strings; each entry must
      be a valid `ApprovalScope` per `parseApprovalScope`; bash_pattern
      and write_path scopes get the §7.4 degenerate-pattern check.
- TaskRecord extended with `approval_timeout_s`, `initial_approvals`,
  `approval_gate_count` (seeded to 0 at admission), and
  `awaiting_approval_request_id` (written atomically by the agent's
  `transact_write_approval_request` primitive).

Agent plumbing (models.py / config.py / pipeline.py / runner.py / server.py)
----------------------------------------------------------------------------
- `TaskConfig` adds `approval_timeout_s`, `initial_approvals`.
- `build_config`, `run_task`, `_run_task_background`, and
  `_extract_invocation_params` thread the two new fields from payload
  → config → PolicyEngine.
- `server._extract_invocation_params` stamps `os.environ["TASK_STARTED_AT"]`
  from the payload so the hook's `_remaining_maxlifetime_s` returns
  real values (carry-forward from Chunk 3 resolved).
- `runner.py` constructs PolicyEngine with `initial_approvals` +
  `task_default_timeout_s` when supplied; the engine clamps bad
  values at construction time.

Tests
-----
All CDK tests pass: 1219 / 1219.
All agent tests pass: 648 / 648.

Affected suites (changes only):
  - test/stacks/agent.test.ts: cedar-wasm layer CompatibleRuntimes
    now expects `nodejs24.x`; table count still 8.
  - test/constructs/cedar-wasm-layer.test.ts: same runtime expansion.
  - test/handlers/fanout-task-events.test.ts: approval_required →
    approval_requested/approval_stranded in Slack default set;
    approval_required → approval_requested in Email default set.
  - test/handlers/reconcile-stranded-tasks.test.ts: primeResponses
    now queue a third `Items: []` for AWAITING_APPROVAL queries;
    queryCalls assertion bumped to 3.

Carry-forward (non-blocking)
----------------------------
- GetPoliciesFn has write access to TaskApprovalsTable (for the
  rate-limit counter path). A future permissions audit should
  tighten this to a single-item write scoped to `RATE#<user>#*`.
- TASK_STARTED_AT env var is only set when a payload supplies it;
  server.py still supports the Phase 2 no-payload startup path.

* feat(cedar-hitl): Chunk 6 CLI — approve / deny / pending / policies

Ships the four user-facing commands that close the Cedar HITL loop:
once Chunks 1-5 have a PENDING approval row and the Slack/Email fan-out
has notified the user, Chunk 6 is how they actually respond.

New commands (cli/src/commands/)
--------------------------------
- `bgagent approve <task-id> <request-id> [--scope <scope>] [--yes]`
  Default scope is `this_call`; callers extend allowlist with
  `tool_type:Bash`, `rule:<id>`, etc. `all_session` is the only scope
  that requires `--yes` to confirm — mirrors the safety UX from
  §8.4. Error classification maps 404 → "run `bgagent pending`", 409
  → "task no longer awaiting approval", 429 → rate-limit, 401 → login.
- `bgagent deny <task-id> <request-id> [--reason ... | --reason-file ...]`
  `--reason-file` accepts multi-line reasons that would otherwise
  need shell quoting. Client-side `DENY_REASON_MAX_LENGTH` cap avoids
  a round-trip on obviously-too-long reasons; the server still
  truncates. Reason is sanitized server-side (output_scanner) before
  ever reaching the agent.
- `bgagent pending [--output text|json]`
  Lists every PENDING approval owned by the caller. Rendered with
  approve/deny hints inline so the user can copy-paste the next
  command. JSON output for scripting. Rate-limited server-side.
- `bgagent policies list --repo <owner/repo> [--tier hard|soft]`
  `bgagent policies show --repo <owner/repo> --rule <rule_id>`
  Discovery commands so users can find rule IDs without reading CDK
  source. Both subcommands reuse a single `listPolicies` API call
  and filter locally.

Wire changes
------------
- `cli/src/api-client.ts`: `approveTask`, `denyTask`, `listPending`,
  `listPolicies` — each matching the §7.1 / §7.2 / §7.6 / §7.7
  request/response shapes. `approveTask` omits the `scope` body field
  when unset so the server's `this_call` default applies.
- `cli/src/types.ts`: mirrors the Chunk 5 server types verbatim —
  `ApprovalScope` union, `ApprovalRequest/Response`, `DenyRequest/Response`,
  `PendingApprovalSummary`, `GetPoliciesResponse`, `PolicyRuleSummary`,
  plus the five constants (`DENY_REASON_MAX_LENGTH`,
  `INITIAL_APPROVALS_MAX_ENTRIES`, `INITIAL_APPROVALS_MAX_ENTRY_LENGTH`,
  `APPROVAL_TIMEOUT_S_MIN/MAX/DEFAULT`).
- `cli/src/bin/bgagent.ts`: registers the four new commands in the
  order they appear in help output.

Tests: +27 new (217 total).
  - approve (9): default scope, custom scope, all_session guard +
    `--yes` bypass, JSON output, 404/409/401/429 error classifications.
  - deny (6): no-reason path, `--reason`, `--reason-file` with
    tmpdir fixture, mutually-exclusive rejection, over-length rejection,
    404 classification.
  - pending (5): empty render, populated render with approve/deny
    hints, JSON output, 401 and 429 classifications.
  - policies (7): list both tiers, `--tier` filter, `--output json`,
    bad `--tier`, show found rule, show unknown rule, 404
    repo-not-onboarded classification.

Carry-forward
-------------
- `submit.ts` extension with `--approval-timeout` / `--pre-approve`
  flags is deferred to a follow-up commit — the server already accepts
  these fields on POST /v1/tasks (Chunk 5), and `bgagent submit`
  already forwards unknown payload fields through the existing
  request path, so users can set them via `--body-file` today until
  the explicit flags land.
- `--output json` on error branches currently returns a CliError
  instead of a JSON error envelope; matches the pattern the existing
  commands use (status, cancel, nudge). Follow-up to standardize
  JSON error envelopes across the whole CLI if that becomes a
  common scripting pain point.

* feat(cedar-hitl): Chunk 7a — persist gate counter + IMPL-23 cache observability

Persist approval_gate_count to TaskTable across container restarts per
§13.6 so the cumulative gate budget survives eviction. Emit
pre_approvals_loaded after PolicyEngine init per §4 step 7 / §11.1 so
operators see the starting approval posture in the live SSE stream.
Add IMPL-23 cache-hit observability: cache hits attach metadata to
PolicyDecision, hook forwards to new write_policy_decision_cached
progress helper (decision_source="recent_decision_cache").

Why: container restarts were silently resetting the per-task gate
counter, re-exposing users to another approvalGateCap-worth of gates
per restart. Cache-driven denies were invisible in TaskEventsTable
beyond the initial gate. Fresh tasks emitted no "starting posture"
signal so dashboards could not distinguish "no pre-approvals seeded"
from "agent has not started".

Surface additions:
- task_state.increment_approval_gate_count_in_ddb — best-effort
  atomic ADD on approval_gate_count
- PolicyEngine(initial_approval_gate_count=N) — seed session counter
- TaskConfig.initial_approval_gate_count — orchestrator payload field
- progress_writer.write_policy_decision_cached — IMPL-23 emitter
- PolicyDecision.cache_hit_metadata — observability-only field
- _CachedDecision.original_decision_ts — wall-clock preservation
- runner._initialize_policy_engine_and_hooks — extracted helper

Counter survival is a safety bound, not correctness: DDB failure
does NOT block the gate (§13.6). Joint-update invariant on status
+ awaiting_approval_request_id (§10.2) is preserved — counter uses
separate UpdateItem, not merged into resume transaction.

Tests: +36 agent (648→684), +8 CDK (1219→1227), +6 new runner tests.

* feat(cedar-hitl): Chunk 7b — persist approval_gate_cap from blueprint

Capture the per-task approval-gate cap at submit-time (§4 step 5,
decision #13, §13.6) so a blueprint-configured override is frozen
onto the TaskRecord. Mid-task blueprint edits cannot shift the cap
beneath a running task; container restarts re-seed the agent's
PolicyEngine from the persisted value instead of its compile-time
default-50.

Why: Chunk 7a added approval_gate_count persistence but the cap
itself was still resolved from the blueprint on every restart —
so an operator lowering security.approvalGateCap mid-task would
retroactively fail-close the running task. The design has always
said cap is frozen at submit; this chunk makes the implementation
match.

Surface additions:
- BlueprintProps.security.approvalGateCap (CDK, synth-validated
  [1, 500] integer) — new per-repo blueprint prop
- RepoConfig.approval_gate_cap + BlueprintConfig.approval_gate_cap
- TaskRecord.approval_gate_cap + APPROVAL_GATE_CAP_{MIN,MAX,DEFAULT}
- create-task-core now calls loadRepoConfig, resolves cap, bounds-
  checks, persists; returns 503 SERVICE_UNAVAILABLE on invalid
  blueprint data (permanent until admin re-deploys, not transient)
- orchestrator.ts: isValidApprovalGateCap integer+bounds guard;
  logs warn if a persisted cap is structurally invalid (schema
  drift / hand-edited DDB row)
- TaskConfig.approval_gate_cap: int | None = None (agent-side);
  runner threads to PolicyEngine kwarg when not None
- "Task created" log line now carries approval_gate_cap +
  approval_gate_cap_source ("blueprint" | "platform_default") so
  operators can detect a broken-plumbing deploy at the single
  chokepoint where all fallback layers converge

Per silent-failure review:
- HIGH: 500 → 503 + logger.error for permanent misconfig
- HIGH: cap + source in task-created log (catches 4-layer cascade)
- MEDIUM: orchestrator guard tightened past typeof (NaN, Infinity,
  floats, out-of-bounds all omitted + warned)

Tests: CDK 1263/1263 (+36), agent 694/694 (+10). CLI unchanged.

* feat(cedar-hitl): Chunk 7c — observability wrap-up for resolved cap + warn path

Close three deferred items from Chunks 7a/7b before Chunks 8-10:

- runner.py init log now carries approval_gate_cap=N +
  approval_gate_cap_source=threaded|engine_default. Matches the
  handler log key so CloudWatch Insights can join across the
  cascade; agent can't distinguish blueprint-override from
  platform-default-frozen (handler log is the ground truth).

- server.py adds _warn_cw helper routing [server/warn] lines to
  a dedicated CloudWatch stream (server_warn/<task_id>). stdout
  print is preserved for local dev + existing capsys tests.
  AgentCore does not forward container stdout to APPLICATION_LOGS,
  so pre-7c warnings about malformed invocation payloads were
  invisible in production. Failure counter shared with _debug_cw
  for a single alarm surface; hoisted above writer defs for
  import-time ordering safety.

- blueprint.ts emits a synth-time info annotation when
  security.approvalGateCap is omitted so operators see a signal
  that the repo will rely on the platform default of 50. Without
  this, the default was a silent fallback at the handler layer —
  only visible by inspecting a TaskRecord at runtime.

Tests: agent 694→700 (+6), cdk 1263→1265 (+2), cli unchanged.
Design refs: §4 step 5, §11.1, §13.6, decision #13.

* feat(cedar-hitl): Chunk 8a — extend approval outcome event schema

Add created_at / effective_timeout_s / matching_rule_ids to
approval_granted / approval_denied / approval_timed_out events so
the incoming ApprovalMetricsPublisher Lambda (Chunk 8b) can compute
decision latency and emit a rule_id-dimensioned timeout breakdown
without a round-trip GetItem against TaskEventsTable.

Fields are added conditionally — omitted from metadata when the
caller did not supply them — so the event stream stays free of
null-value noise and legacy callers continue to produce valid
payloads. Publisher handles missing fields via explicit skip-and-log
on the specific metric branch (not fallback-to-zero).

Agent tests extended: +6 progress_writer tests, +3 hooks tests.
Baseline 700 → 710. No consumer wired yet — this commit is a
forward-compatible superset; Chunk 8b ships the CDK publisher +
dashboard widgets.

* feat(cedar-hitl): Chunk 8b — ApprovalMetricsPublisher + native CloudWatch dashboard widgets

Ship the Cedar-HITL dashboard widgets from §11.3 / IMPL-28 via the
MetricsPublisher architecture (Option E):

- New ApprovalMetricsPublisher Lambda consumes TaskEventsTable DDB
  stream as consumer #2 (FanoutConsumer is #1; stream is within its
  2-consumer soft cap — documented in task-events-table.ts).
- Handler emits CloudWatch EMF for 3 metrics in namespace
  ABCA/Cedar-HITL:
    * ApprovalRequestCount  +  ClippedApprovalCount (reason dim)  →
      ApprovalTimeoutClipRate widget (MathExpression with IF-guard
      against NaN on zero-denominator periods)
    * TimedOutEffectiveTimeout (rule_id dim with allowlist
      cardinality cap) → ApprovalTimeoutBreakdown widget
    * ApprovalDecisionLatencyMs (outcome dim) → ApprovalDecisionLatency
      widget with per-outcome p50/p90/p99
- Observability-of-observability (silent-failure review):
    * MetricsPublisherHeartbeat per batch so dashboard gaps
      distinguish "no traffic" from "pipeline broken"
    * MetricEmitSkipped with a reason dim on schema mismatches,
      parse anomalies, unknown rule ids — never fall back to
      latency=0 or count=0 which would poison percentile widgets
    * Expected high-volume skip reasons (non-milestone events,
      REMOVE records) DO NOT emit MetricEmitSkipped — only
      anomaly reasons (missing keys, missing milestone name) do,
      so real signal isn't drowned
    * Structured log lines alongside every skip so the absence of
      metrics is also observable via CloudWatch Logs Insights
- Cardinality caps via ``RULE_ID_ALLOWLIST`` + ``normalizeClipReason``.
  Unknown values collapse to ``other`` / ``unknown`` buckets with
  dashboard series so the collapse is discoverable rather than
  silently accruing custom-metric cost.
- Event-source-mapping filter pattern rejects non-agent_milestone
  records at the service layer; handler-layer allowlist catches
  anything that slips through. Filter pattern correctness tested
  structurally + positively/negatively probed (silent-failure H3).
- Per-record try/catch + reportBatchItemFailures + SQS DLQ mirror
  the fanout-task-events.ts poison-pill pattern exactly.

Deferred to Chunk 10 chore issues:
- DLQ alarms (fanout + publisher) — fire-into-void until
  notification channel lands, so wire with §11.5 alarms as a group
- Explicit log-group declaration (IAM drift defense)
- stdout-flush race documentation (pre-existing pattern in fanout)
- EMF 100-updates/sec throttle alarm

Tests: cdk 1265 → 1327 (+62); agent 710 (unchanged); cli 217
(unchanged). All pass. §11.5 alarm plumbing now unblocked —
publisher provides the metrics infrastructure the design always
intended; only the notification-channel SNS wiring is left.

* docs(cedar-hitl): Chunk 9 — sync design doc to Chunks 7b / 8a / 8b

Bring CEDAR_HITL_GATES.md current with the code that shipped in
Chunks 7b (approval_gate_cap persist), 8a (outcome event schema
superset), and 8b (ApprovalMetricsPublisher + dashboard widgets):

- §10.2 adds the missing approval_gate_cap row (carry-forward
  drift from Chunk 7b). Bounds + frozen-at-submit semantics
  documented.
- §11.1 outcome events (approval_granted / approval_denied /
  approval_timed_out) now document the Chunk 8a optional fields
  (created_at, effective_timeout_s, matching_rule_ids) plus the
  publisher's skip-on-missing-field policy.
- §11.1 intro names ApprovalMetricsPublisherFn as consumer #2 and
  points to §11.3 for the metric schema.
- §11.3 rewritten to describe the Option E architecture:
  publisher Lambda + EMF + native CloudWatch metrics in namespace
  ABCA/Cedar-HITL, MathExpression with divide-by-zero guard,
  rule_id cardinality cap, observability-of-observability via
  heartbeat + skip meta-metrics, widget layout (12/12 over 24),
  2-consumer stream budget. Dropped the stale "Retired the old
  bundled widget" line — that widget never shipped.
- §11.5 reframed as "deferred (notification-channel gated)" with
  a plumbing-status paragraph noting the metric infra now exists;
  only SNS wiring remains. Alarm list expanded to include DLQ
  and publisher-health alarms.
- §16 IMPL-28 rewritten for Option E; §15.2 row 46 expanded to
  reference the 4 new test files; Appendix B checklist updated.

Starlight mirror regenerated via ``cd docs && node
scripts/sync-starlight.mjs``.

No code changes. Test baselines unchanged. Adversarial
comment-analyzer review verified every new claim against
committed code — zero inaccuracies.

* feat(cedar-hitl): Chunk 10 review fixes — close 2 blockers + tighten 2 mediums

Full-branch adversarial review (code-reviewer + silent-failure-hunter
on all 18 commits) surfaced findings that only appear at final-state.
Addressing the blockers + low-cost meds before deploy:

B2 — stranded approvals were invisible to the dashboard:
  - Reconciler writes ``event_type: 'task_stranded'``; the metrics
    publisher's event-source filter only accepts
    ``event_type: 'agent_milestone'``, so AWAITING_APPROVAL evictions
    produced zero §11.3 signal.
  - Fix: reconciler now additionally emits an ``agent_milestone``
    with ``milestone: 'approval_stranded'`` when the stranded task
    was AWAITING_APPROVAL. Publisher allowlist extended; classifier
    emits ``ApprovalStrandedCount`` counter. SUBMITTED / HYDRATING
    stranded events unchanged (guarded by test).

B1 — heartbeat comment was false reassurance:
  - Event-source filter blocks Lambda invocation when no
    ``agent_milestone`` records exist in the poll window, so a
    quiet period produces the same widget gap as a broken
    pipeline. The code + design-doc wording claimed "gap =
    pipeline broken" which would mislead the on-call.
  - Fix: corrected module + function docstrings to describe the
    heartbeat as "present when active, not pipeline-alive-always."
    Operators should alarm on the combination
    (heartbeat-absent + recent TaskEventsTable traffic) or wire
    a scheduled canary — the latter tracked as a §11.5 follow-up.

M1 — safety-critical milestones produced zero dashboard signal:
  - ``approval_cap_exceeded`` (§12.9 per-task cap) and
    ``approval_rate_limit_exceeded`` (per-user per-minute rate)
    were emitted by the agent but not on the publisher allowlist.
    A production bug where every gate hit the cap would have
    been invisible.
  - Fix: both added to APPROVAL_METRIC_MILESTONES with
    ``ApprovalCapExceededCount`` / ``ApprovalRateLimitExceededCount``
    counters. No dimensions — the request_id in the event carries
    per-user correlation for ad-hoc log-insights investigation.

H2 — filter / handler eventName disagreement:
  - Event-source filter required ``INSERT``; handler accepted
    ``INSERT`` and ``MODIFY``. Benign today (TaskEventsTable is
    put-only), but a future chunk MODIFY-ing records would be
    silently dropped by the filter while the handler was ready
    to process them.
  - Fix: handler now INSERT-only, matching the filter. Single
    source of truth on the eventName invariant.

M1-rename — ``expected_non_approval_milestone`` skip reason was
misleading (the non-metric approval milestones like
``approval_late_win`` also land in this bucket). Renamed to
``expected_milestone_not_tracked``.

Tests: cdk 1327 → 1332 (+5: 3 classifier branches for new metrics,
1 reconciler AWAITING_APPROVAL path, 1 SUBMITTED-not-double-counted
guard). Agent + CLI unchanged. All pre-commit hooks green; pre-push
security fails only on the 3 pre-existing CVEs tracked for chore
issue filing.

Deferred findings from the same review (file as chore issues):
- H1: agent-dies-between-TIMED_OUT-and-resume loses latency
  (edge, affects p99 bias)
- H3: late-win APPROVED created_at staleness invariant
  (works today, document invariant)
- H4: _warn_cw daemon-thread burst under adversarial payload
- M2-M4: late-win metric, rename helpers, etc.

No upstream PR filing this chunk — deploy to Sam's AWS account
for integration testing first.

* fix(cedar-hitl): suppress AwsSolutions-IAM5 on Runtime ExecutionRole overflow policies

Synth + deploy were blocked by cdk-nag: the Cedar HITL additions
(TaskApprovalsTable grant + SlackUserMappingTable + extra env vars
threaded to the AgentCore runtime) pushed the runtime ExecutionRole
past CDK's inline-policy size limit, so CDK auto-splits excess
statements into ``OverflowPolicy1``. The overflow inherits the same
wildcard ``bedrock:InvokeModel*`` / CloudWatch actions as the base
policy but lives at a path
(``Runtime/ExecutionRole/OverflowPolicy1/Resource``) that the
existing ``addResourceSuppressions(runtime, ..., applyToChildren:
true)`` cannot reach — CDK creates overflow policies lazily during
synth ``prepare()``, after the construct tree has been frozen and
after static suppressions have been cached.

Suppress via an Aspect at MUTATING priority so the suppression is
applied before cdk-nag's READONLY visitor runs. Matches any path
containing ``/Runtime/ExecutionRole/OverflowPolicy`` + ending
``/Resource`` so future ``OverflowPolicy2``, etc. are covered
without hardcoding indices.

Verified: ``mise //cdk:synth`` now completes cleanly.
``mise //cdk:test`` still 1332/1332.

* fix(cedar-hitl): E2E deploy-readiness — policies bundle + onboarding gate + CLI error visibility

Three E2E T1.4 + T2.2 findings from the Chunk 10 integration-test
session. Batched into one commit since all three need the same
redeploy to verify:

1. agent/Dockerfile: COPY policies/ into the container image.
   ``PolicyEngine.__init__`` reads
   ``/app/policies/hard_deny.cedar`` + ``soft_deny.cedar`` at import
   time via ``_POLICIES_DIR = Path(__file__).parent.parent /
   "policies"``. The Dockerfile only copied ``src/``, so the
   directory was missing and every Cedar-HITL task failed at 0 turns
   with ``missing built-in hard-deny policies``. Introduced
   alongside Chunk 2 when the policy files were first added —
   Dockerfile was never updated. Zero tasks on this branch ever
   su…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants