Skip to content

feat(compute): add EC2 fleet compute strategy#31

Closed
MichaelWalker-git wants to merge 16 commits into
mainfrom
feat/compute-strategy
Closed

feat(compute): add EC2 fleet compute strategy#31
MichaelWalker-git wants to merge 16 commits into
mainfrom
feat/compute-strategy

Conversation

@MichaelWalker-git
Copy link
Copy Markdown
Contributor

Summary

  • Add EC2 fleet compute strategy with SSM Run Command dispatch — a third compute backend alongside AgentCore (default) and ECS Fargate
  • New Ec2ComputeStrategy handler: finds idle instances via tags, uploads payload to S3, dispatches via SSM AWS-RunShellScript, polls GetCommandInvocation, cancels with CancelCommand
  • New Ec2AgentFleet CDK construct: Auto Scaling Group with launch template (AL2023 ARM64), security group (443 egress only), S3 payload bucket, IAM role with scoped permissions, Docker user data for pre-pulling images
  • Wire orchestrator polling, cancel-task SSM dispatch, and task-api SSM permissions for EC2
  • Stack wiring is commented-out (same pattern as ECS) — ready to enable per-repo via blueprint compute_type: 'ec2'
  • Add instance_type field to RepoConfig and BlueprintConfig for future GPU/custom instance type support

Test plan

  • mise //cdk:compile — no TypeScript errors
  • mise //cdk:test — 43 suites, 697 tests all passing (including new ec2-strategy and ec2-agent-fleet tests)
  • mise //cdk:synth — synthesizes without errors (EC2 block commented out)
  • mise //cdk:build — full build including lint passes
  • Deploy with EC2 block uncommented and run an end-to-end task with compute_type: 'ec2'

MichaelWalker-git and others added 16 commits April 9, 2026 13:30
…AgentCore logic

Introduce ComputeStrategy interface with SessionHandle/SessionStatus types
and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from
orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern
for session lifecycle (start/poll/stop). Pure refactor — no behavior change,
identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases.
Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr
rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated
GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for
security scanning tasks, not for the build/test/synth pipeline. Disable
them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise
tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable
  trivy/grype/semgrep which need GitHub API downloads)
- Remove unused @aws-sdk/client-bedrock-agentcore mock from
  orchestrate-task.test.ts (SDK is no longer imported by orchestrator)
- Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
   strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
   client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
   and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
   (info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
   sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
   happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
   'running' status — clear signal for future developers
- Replace require() with ES import for BedrockAgentCoreClient mock
- Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy
interface, using the existing durable Lambda orchestrator. No separate
stacks or Step Functions — ECS is a strategy option alongside AgentCore.

Changes:
- EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks
  state mapping), stopSession (StopTask with graceful error handling)
- EcsAgentCluster construct: ECS Cluster (container insights), Fargate
  task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only),
  CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock)
- TaskOrchestrator: optional ECS props for env vars and IAM policies
  (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN,
  iam:PassRole conditioned on ecs-tasks.amazonaws.com)
- Orchestrator polling: ECS compute-level crash detection alongside
  existing DDB polling (non-fatal, wrapped in try/catch)
- AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var)
- Full test coverage: 15 ECS strategy tests, 9 construct tests,
  5 orchestrator ECS tests. All 563 tests pass.

Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag,
agent task running on AgentCore path unaffected.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
The ECS container's default CMD starts uvicorn server:app which waits
for HTTP POST to /invocations — but in standalone ECS nobody sends that
request, leaving the agent idle. Override the container command to invoke
entrypoint.run_task() directly with the full orchestrator payload via
AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task
definition base environment.
Add a third compute backend (EC2 fleet with SSM Run Command) alongside
the existing AgentCore and ECS strategies. This provides maximum
flexibility with no image size limits, configurable instance types
(including GPU), and full control over the compute environment.

New files:
- ec2-strategy.ts: ComputeStrategy implementation using EC2 tags for
  instance tracking and SSM RunShellScript for task dispatch
- ec2-agent-fleet.ts: CDK construct with ASG, launch template,
  security group, S3 payload bucket, and IAM role
- ec2-strategy.test.ts and ec2-agent-fleet.test.ts: full test coverage

Wiring:
- repo-config.ts: add 'ec2' to ComputeType, add instance_type field
- compute-strategy.ts: add EC2 SessionHandle variant and resolver case
- task-orchestrator.ts: add ec2Config prop with env vars and IAM grants
- orchestrate-task.ts: enable compute polling for EC2
- cancel-task.ts: add SSM CancelCommand for EC2 tasks
- task-api.ts: add ssm:CancelCommand permission for cancel Lambda
- agent.ts: add commented-out EC2 fleet block (same pattern as ECS)
@MichaelWalker-git
Copy link
Copy Markdown
Contributor Author

Recreating from a clean branch off main to avoid conflicts from prior commits

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a third compute backend (“EC2 fleet”) alongside the existing AgentCore and ECS options, wiring strategy selection into the orchestrator and extending CDK constructs/tests/docs to support the new backend.

Changes:

  • Introduces compute-strategy abstraction with implementations for AgentCore, ECS Fargate, and EC2 fleet (SSM Run Command + S3 payload).
  • Updates orchestrator start-session and polling to use the selected compute strategy and persist compute metadata for cancellation.
  • Adds CDK constructs/tests for ECS agent cluster and EC2 agent fleet, plus small docs/CI updates.

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
yarn.lock Adds AWS SDK clients (EC2/ECS/S3/SSM) and transitive deps.
docs/src/content/docs/design/Architecture.md Adds rationale section on separating orchestrator vs agent loops.
docs/design/ARCHITECTURE.md Same rationale section mirrored into top-level design doc.
cdk/test/handlers/start-session-composition.test.ts Integration-style orchestration step composition tests.
cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts Unit tests for AgentCore compute strategy.
cdk/test/handlers/shared/strategies/ecs-strategy.test.ts Unit tests for ECS compute strategy.
cdk/test/handlers/shared/strategies/ec2-strategy.test.ts Unit tests for EC2 compute strategy.
cdk/test/handlers/shared/preflight.test.ts Normalizes compute_type casing in tests.
cdk/test/handlers/shared/compute-strategy.test.ts Tests strategy resolution for agentcore/ecs/ec2.
cdk/test/handlers/orchestrate-task.test.ts Removes older startSession tests now handled by strategies/composition tests.
cdk/test/handlers/cancel-task.test.ts Adds ECS cancellation coverage and behavior tests.
cdk/test/constructs/task-orchestrator.test.ts Adds ECS env var + IAM wiring tests.
cdk/test/constructs/task-api.test.ts Adds cancel-task ECS env var + IAM wiring tests.
cdk/test/constructs/ecs-agent-cluster.test.ts New tests for ECS cluster construct.
cdk/test/constructs/ec2-agent-fleet.test.ts New tests for EC2 fleet construct.
cdk/src/stacks/agent.ts Adds commented wiring blocks for ECS/EC2 backends.
cdk/src/handlers/shared/types.ts Persists compute_type + compute_metadata on task records.
cdk/src/handlers/shared/strategies/agentcore-strategy.ts New AgentCore compute strategy implementation.
cdk/src/handlers/shared/strategies/ecs-strategy.ts New ECS compute strategy implementation.
cdk/src/handlers/shared/strategies/ec2-strategy.ts New EC2 compute strategy implementation.
cdk/src/handlers/shared/repo-config.ts Adds ComputeType union + instance_type config field.
cdk/src/handlers/shared/orchestrator.ts Adds PollState fields + instance_type wiring; removes old startSession helper.
cdk/src/handlers/shared/compute-strategy.ts New strategy interface + resolver.
cdk/src/handlers/orchestrate-task.ts Uses compute strategies for start + compute-level polling.
cdk/src/handlers/cancel-task.ts Adds ECS StopTask + EC2 SSM CancelCommand cancellation paths.
cdk/src/constructs/task-orchestrator.ts Adds optional ECS/EC2 config env vars + IAM grants.
cdk/src/constructs/task-api.ts Adds optional ECS/EC2 cancellation wiring + IAM grants.
cdk/src/constructs/ecs-agent-cluster.ts New ECS cluster construct (Fargate task def + SG + IAM).
cdk/src/constructs/ec2-agent-fleet.ts New EC2 ASG-based fleet construct (SSM-managed instances).
cdk/src/constructs/blueprint.ts Extends blueprint compute type to include ec2.
cdk/package.json Adds AWS SDK clients needed for new strategies.
.gitignore Ignores local-docs directory.
.github/workflows/build.yml Sets GitHub token env vars for CI tools; adjusts MISE_DISABLE_TOOLS.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +99 to +108
// 3. Tag instance as busy
await getEc2Client().send(new CreateTagsCommand({
Resources: [instanceId],
Tags: [
{ Key: 'bgagent:status', Value: 'busy' },
{ Key: 'bgagent:task-id', Value: taskId },
],
}));

// 4. Build the boot command (mirrors ECS strategy env vars and Python boot command)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The instance is tagged bgagent:status=busy before the SSM command is dispatched, but if SendCommand throws or returns no CommandId, the instance will remain stuck in busy (and with bgagent:task-id set). Wrap the dispatch in a try/catch/finally that reverts tags on failure, or tag busy only after a successful SendCommand response.

Suggested change
// 3. Tag instance as busy
await getEc2Client().send(new CreateTagsCommand({
Resources: [instanceId],
Tags: [
{ Key: 'bgagent:status', Value: 'busy' },
{ Key: 'bgagent:task-id', Value: taskId },
],
}));
// 4. Build the boot command (mirrors ECS strategy env vars and Python boot command)
// 3. Build the boot command (mirrors ECS strategy env vars and Python boot command)

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +76
// The ECS container's default CMD starts the FastAPI server (uvicorn) which
// waits for HTTP POST to /invocations — but in standalone ECS nobody sends
// that request. We override the container command to invoke run_task()
// directly with the full orchestrator payload (including hydrated_context).
// This avoids the server entirely and runs the agent in batch mode.
const payloadJson = JSON.stringify(payload);

const containerEnv = [
{ name: 'TASK_ID', value: taskId },
{ name: 'REPO_URL', value: String(payload.repo_url ?? '') },
...(payload.prompt ? [{ name: 'TASK_DESCRIPTION', value: String(payload.prompt) }] : []),
...(payload.issue_number ? [{ name: 'ISSUE_NUMBER', value: String(payload.issue_number) }] : []),
{ name: 'MAX_TURNS', value: String(payload.max_turns ?? 100) },
...(payload.max_budget_usd !== undefined ? [{ name: 'MAX_BUDGET_USD', value: String(payload.max_budget_usd) }] : []),
...(blueprintConfig.model_id ? [{ name: 'ANTHROPIC_MODEL', value: blueprintConfig.model_id }] : []),
...(blueprintConfig.system_prompt_overrides ? [{ name: 'SYSTEM_PROMPT_OVERRIDES', value: blueprintConfig.system_prompt_overrides }] : []),
{ name: 'CLAUDE_CODE_USE_BEDROCK', value: '1' },
// Full orchestrator payload as JSON — the Python wrapper reads this to
// call run_task() with all fields including hydrated_context.
{ name: 'AGENT_PAYLOAD', value: payloadJson },
...(payload.github_token_secret_arn
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strategy serializes the full orchestrator payload (including hydrated_context) into an ECS environment variable (AGENT_PAYLOAD). ECS task overrides have fairly small limits on environment size, so larger contexts can cause RunTask to fail at runtime. Consider switching to an S3-backed payload (pass an S3 URI/key in env) or another mechanism that doesn't depend on env-var size limits.

Copilot uses AI. Check for mistakes.
Comment on lines +145 to +168
} else if (computeType === 'ec2') {
// EC2-backed task — cancel the SSM command
const commandId = record.compute_metadata?.commandId;
const instanceId = record.compute_metadata?.instanceId;
if (commandId) {
try {
await ssmClient.send(new CancelCommandCommand({
CommandId: commandId,
...(instanceId && { InstanceIds: [instanceId] }),
}));
logger.info('SSM CancelCommand invoked after cancel', { task_id: taskId, command_id: commandId, request_id: requestId });
} catch (stopErr) {
logger.warn('SSM CancelCommand failed after cancel (command may already be done)', {
task_id: taskId,
request_id: requestId,
error: stopErr instanceof Error ? stopErr.message : String(stopErr),
});
}
} else {
logger.warn('EC2 task cancel skipped: missing commandId in compute_metadata', {
task_id: taskId,
request_id: requestId,
});
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For compute_type: 'ec2', cancel currently only calls SSM CancelCommand. If the command is cancelled mid-script, the cleanup/tag-reset section in the boot script may never run, leaving the instance stuck in bgagent:status=busy and effectively reducing fleet capacity. Consider also re-tagging the instance back to idle here (and deleting bgagent:task-id), or invoking the EC2 strategy’s stopSession logic from the cancel handler.

Copilot uses AI. Check for mistakes.
Comment on lines +179 to +235
let consecutiveEcsPollFailures = 0;
let consecutiveEcsCompletedPolls = 0;

// ECS compute-level crash detection: if DDB is not terminal, check ECS task status
if (
ddbState.lastStatus &&
!TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
computeStrategy
) {
try {
const ecsStatus = await computeStrategy.pollSession(sessionHandle);
if (ecsStatus.status === 'failed') {
const errorMsg = 'error' in ecsStatus ? ecsStatus.error : 'ECS task failed';
logger.warn('ECS task failed before DDB terminal write', {
task_id: taskId,
error: errorMsg,
});
await failTask(taskId, ddbState.lastStatus, `ECS container failed: ${errorMsg}`, task.user_id, true);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
if (ecsStatus.status === 'completed') {
consecutiveEcsCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
if (consecutiveEcsCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
// ECS task exited successfully but DDB never reached terminal — the agent
// likely crashed after container exit code 0 but before writing status.
logger.error('ECS task completed but DDB never caught up — failing task', {
task_id: taskId,
consecutive_completed_polls: consecutiveEcsCompletedPolls,
});
await failTask(taskId, ddbState.lastStatus, `ECS task exited successfully but agent never wrote terminal status after ${consecutiveEcsCompletedPolls} polls`, task.user_id, true);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
logger.warn('ECS task completed but DDB not terminal — waiting for DDB catchup', {
task_id: taskId,
consecutive_completed_polls: consecutiveEcsCompletedPolls,
});
}
} catch (err) {
consecutiveEcsPollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
if (consecutiveEcsPollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
logger.error('ECS pollSession failed repeatedly — failing task', {
task_id: taskId,
consecutive_failures: consecutiveEcsPollFailures,
error: err instanceof Error ? err.message : String(err),
});
await failTask(taskId, ddbState.lastStatus, `ECS poll failed ${consecutiveEcsPollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`, task.user_id, true);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
logger.warn('ECS pollSession check failed (non-fatal)', {
task_id: taskId,
consecutive_failures: consecutiveEcsPollFailures,
error: err instanceof Error ? err.message : String(err),
});
}
}

return { ...ddbState, consecutiveEcsPollFailures, consecutiveEcsCompletedPolls };
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compute-level polling block is gated on blueprintConfig.compute_type === 'ecs' || 'ec2', but the variables/log messages/errors are all ECS-specific (e.g., consecutiveEcs*, "ECS container failed"). This will produce misleading failures for EC2 and makes the logic harder to extend. Consider renaming this to backend-neutral terminology and branching error messages based on sessionHandle.strategyType (or splitting ECS vs EC2 polling).

Suggested change
let consecutiveEcsPollFailures = 0;
let consecutiveEcsCompletedPolls = 0;
// ECS compute-level crash detection: if DDB is not terminal, check ECS task status
if (
ddbState.lastStatus &&
!TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
computeStrategy
) {
try {
const ecsStatus = await computeStrategy.pollSession(sessionHandle);
if (ecsStatus.status === 'failed') {
const errorMsg = 'error' in ecsStatus ? ecsStatus.error : 'ECS task failed';
logger.warn('ECS task failed before DDB terminal write', {
task_id: taskId,
error: errorMsg,
});
await failTask(taskId, ddbState.lastStatus, `ECS container failed: ${errorMsg}`, task.user_id, true);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
if (ecsStatus.status === 'completed') {
consecutiveEcsCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
if (consecutiveEcsCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
// ECS task exited successfully but DDB never reached terminal — the agent
// likely crashed after container exit code 0 but before writing status.
logger.error('ECS task completed but DDB never caught up — failing task', {
task_id: taskId,
consecutive_completed_polls: consecutiveEcsCompletedPolls,
});
await failTask(taskId, ddbState.lastStatus, `ECS task exited successfully but agent never wrote terminal status after ${consecutiveEcsCompletedPolls} polls`, task.user_id, true);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
logger.warn('ECS task completed but DDB not terminal — waiting for DDB catchup', {
task_id: taskId,
consecutive_completed_polls: consecutiveEcsCompletedPolls,
});
}
} catch (err) {
consecutiveEcsPollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
if (consecutiveEcsPollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
logger.error('ECS pollSession failed repeatedly — failing task', {
task_id: taskId,
consecutive_failures: consecutiveEcsPollFailures,
error: err instanceof Error ? err.message : String(err),
});
await failTask(taskId, ddbState.lastStatus, `ECS poll failed ${consecutiveEcsPollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`, task.user_id, true);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
logger.warn('ECS pollSession check failed (non-fatal)', {
task_id: taskId,
consecutive_failures: consecutiveEcsPollFailures,
error: err instanceof Error ? err.message : String(err),
});
}
}
return { ...ddbState, consecutiveEcsPollFailures, consecutiveEcsCompletedPolls };
let consecutiveComputePollFailures = 0;
let consecutiveComputeCompletedPolls = 0;
const computeBackendLabel = sessionHandle.strategyType === 'ec2' ? 'EC2' : 'ECS';
// Compute-level crash detection: if DDB is not terminal, check compute session status.
if (
ddbState.lastStatus &&
!TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
computeStrategy
) {
try {
const computeStatus = await computeStrategy.pollSession(sessionHandle);
if (computeStatus.status === 'failed') {
const errorMsg =
'error' in computeStatus ? computeStatus.error : `${computeBackendLabel} task failed`;
logger.warn(`${computeBackendLabel} task failed before DDB terminal write`, {
task_id: taskId,
error: errorMsg,
});
await failTask(
taskId,
ddbState.lastStatus,
`${computeBackendLabel} compute failed: ${errorMsg}`,
task.user_id,
true,
);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
if (computeStatus.status === 'completed') {
consecutiveComputeCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
if (consecutiveComputeCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
// Compute session exited successfully but DDB never reached terminal —
// the agent likely crashed after compute completion but before writing status.
logger.error(`${computeBackendLabel} task completed but DDB never caught up — failing task`, {
task_id: taskId,
consecutive_completed_polls: consecutiveComputeCompletedPolls,
});
await failTask(
taskId,
ddbState.lastStatus,
`${computeBackendLabel} task exited successfully but agent never wrote terminal status after ${consecutiveComputeCompletedPolls} polls`,
task.user_id,
true,
);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
logger.warn(`${computeBackendLabel} task completed but DDB not terminal — waiting for DDB catchup`, {
task_id: taskId,
consecutive_completed_polls: consecutiveComputeCompletedPolls,
});
}
} catch (err) {
consecutiveComputePollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
if (consecutiveComputePollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
logger.error(`${computeBackendLabel} pollSession failed repeatedly — failing task`, {
task_id: taskId,
consecutive_failures: consecutiveComputePollFailures,
error: err instanceof Error ? err.message : String(err),
});
await failTask(
taskId,
ddbState.lastStatus,
`${computeBackendLabel} poll failed ${consecutiveComputePollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`,
task.user_id,
true,
);
return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
}
logger.warn(`${computeBackendLabel} pollSession check failed (non-fatal)`, {
task_id: taskId,
consecutive_failures: consecutiveComputePollFailures,
error: err instanceof Error ? err.message : String(err),
});
}
}
return {
...ddbState,
consecutiveEcsPollFailures: consecutiveComputePollFailures,
consecutiveEcsCompletedPolls: consecutiveComputeCompletedPolls,
};

Copilot uses AI. Check for mistakes.
Comment on lines +283 to +315
// EC2 fleet compute strategy permissions (only when EC2 is configured)
if (props.ec2Config) {
this.fn.addToRolePolicy(new iam.PolicyStatement({
actions: [
'ec2:DescribeInstances',
'ec2:CreateTags',
],
resources: ['*'],
}));

this.fn.addToRolePolicy(new iam.PolicyStatement({
actions: [
'ssm:SendCommand',
'ssm:GetCommandInvocation',
'ssm:CancelCommand',
],
resources: ['*'],
}));

this.fn.addToRolePolicy(new iam.PolicyStatement({
actions: ['s3:PutObject'],
resources: [`arn:${Aws.PARTITION}:s3:::${props.ec2Config.payloadBucketName}/*`],
}));

this.fn.addToRolePolicy(new iam.PolicyStatement({
actions: ['iam:PassRole'],
resources: [props.ec2Config.instanceRoleArn],
conditions: {
StringEquals: {
'iam:PassedToService': 'ec2.amazonaws.com',
},
},
}));
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EC2 orchestrator IAM statements are very broad (ec2:CreateTags on *, ssm:* on *) and the iam:PassRole grant conditioned on ec2.amazonaws.com doesn’t appear to be used by the EC2 strategy (it doesn’t call APIs that pass a role). Consider removing the EC2 iam:PassRole statement, and scoping EC2 tagging permissions with conditions (e.g., ec2:ResourceTag/${fleetTagKey} == fleetTagValue and limiting aws:TagKeys) to reduce blast radius.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +116
*/
readonly ec2FleetConfig?: {
readonly instanceRoleArn: string;
};
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ec2FleetConfig includes instanceRoleArn, but that value isn’t used (the construct only conditionally grants ssm:CancelCommand on *). This makes the API misleading and suggests a scoping that doesn’t exist. Either remove instanceRoleArn from the props or use it to implement a more scoped permission model (e.g., document/service role, or additional permissions needed to reset instance tags on cancel).

Suggested change
*/
readonly ec2FleetConfig?: {
readonly instanceRoleArn: string;
};
* This object currently serves only as an enablement flag for EC2-backed cancellation.
*/
readonly ec2FleetConfig?: {};

Copilot uses AI. Check for mistakes.

// Tag the ASG instances for fleet identification
// CDK auto-propagates tags from the ASG to instances
this.autoScalingGroup.node.defaultChild;
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.autoScalingGroup.node.defaultChild; is a no-op statement and can be removed. If the intent was to access/modify the underlying CFN resource, assign it to a variable and use it explicitly (otherwise this line adds noise without effect).

Suggested change
this.autoScalingGroup.node.defaultChild;

Copilot uses AI. Check for mistakes.
'',
'# Set environment variables',
...envExports,
'',
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EC2 boot script uses aws ecr get-login-password --region $AWS_REGION, but AWS_REGION is never set in the script. On stock AL2023 instances this is typically unset, which will make ECR login/pull fail. Consider deriving the region from instance metadata (like the user-data does) and exporting AWS_REGION/AWS_DEFAULT_REGION before the AWS CLI and docker run steps.

Suggested change
'',
'',
'# Resolve AWS region from instance metadata',
'IMDS_TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")',
'AWS_REGION=$(curl -sS -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document | sed -n \'s/.*"region"[[:space:]]*:[[:space:]]*"\\([^"]*\\)".*/\\1/p\')',
'export AWS_REGION',
'export AWS_DEFAULT_REGION="$AWS_REGION"',
'',

Copilot uses AI. Check for mistakes.
Comment on lines +127 to +130
'# Fetch payload from S3',
`aws s3 cp "s3://${EC2_PAYLOAD_BUCKET}/${payloadKey}" /tmp/payload.json`,
'export AGENT_PAYLOAD=$(cat /tmp/payload.json)',
'',
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

export AGENT_PAYLOAD=$(cat /tmp/payload.json) is unsafe: command substitution strips newlines and can mangle JSON, and large hydrated payloads may exceed shell/env-var limits. Prefer keeping the payload as a file (e.g., mount /tmp/payload.json into the container and have the Python entrypoint read it) or at least quote the assignment to preserve content.

Copilot uses AI. Check for mistakes.
isadeks pushed a commit to isadeks/sample-autonomous-cloud-coding-agents that referenced this pull request May 13, 2026
…umulation (aws-samples#79 review aws-samples#6)

The conditional UpdateItem dup-delete path
(``task_created`` / ``session_started`` lifecycle persists)
calls ``deleteMessage`` to clean up the duplicate Slack message
that landed when a sibling retry won the race. The delete is
inherently best-effort — but if it fails, the duplicate becomes a
permanent ghost in the thread and operators had no way to alarm
on the rate.

Refactor ``deleteMessage`` to return a boolean (``true`` on success
or ``message_not_found``-as-already-gone, ``false`` otherwise) and
emit a dedicated ``fanout.slack.dup_delete_failed`` event with an
``error_id: FANOUT_SLACK_DUP_DELETE_FAILED`` from the dup-delete
callsites when the cleanup couldn't complete.

The terminal-event cleanup paths (``slack_session_msg_ts``,
``slack_created_msg_ts``) intentionally don't fire this event —
those paths target genuinely-stale UX cleanup, not retry-driven
duplicates, so an alarm there would be noise.

No new tests beyond the existing dup-delete coverage; the
``deleteMessage`` return value isn't yet asserted at the unit
level, but the behavior is fully exercised by the existing
``dup-delete`` integration paths (test gap aws-samples#31 will add an
explicit failure-path assertion when it lands).
github-merge-queue Bot pushed a commit that referenced this pull request May 13, 2026
#79)

* feat(fanout): migrate SlackNotifyFn to FanOutConsumer subscriber (#64)

Move the Slack outbound delivery off its own DynamoDB Streams consumer
onto FanOutConsumer as a per-channel dispatcher. Drops TaskEventsTable
from 2 concurrent stream readers to 1, restoring headroom for future
channels (Email, Teams, etc.) without exceeding the documented
DynamoDB Streams 2-reader-per-shard practical limit.

The PR also addresses an adversarial code review on the original
migration; the body below walks through each piece in the order it
landed.

## (a) Migration

- `cdk/src/handlers/slack-notify.ts` — rewritten as exported
  `dispatchSlackEvent(event, ddb)` plus a tagged `SlackApiError`
  class. The standalone `handler(event)` stream entrypoint is gone;
  the FanOutConsumer is now the only TaskEventsTable stream reader.
  Behaviour preserved bit-for-bit: channel_source==='slack' gate,
  terminal-event dedup via conditional UpdateItem on
  `channel_metadata.slack_notified_terminal`, threaded replies under
  the @mention or task_created message, emoji transitions
  (eyes -> hourglass -> ✅/❌/🚫/⏲), DM channel_id -> user_id rewrite,
  intermediate session+created message cleanup on terminal events.

- `cdk/src/handlers/fanout-task-events.ts` — replaces the log-only
  `dispatchToSlack` stub with a wrapper that calls dispatchSlackEvent
  and routes errors via the new typed contract (see (b) below). Slack
  defaults gain task_created, session_started, task_timed_out so the
  router fans out the lifecycle events the old SlackNotifyFn handled;
  the dispatcher's channel_source gate keeps non-Slack tasks unaffected.

- `cdk/src/constructs/fanout-consumer.ts` — adds a scoped
  `secretsmanager:GetSecretValue` grant on `bgagent/slack/*` so the
  fanout Lambda can fetch per-workspace bot tokens. Same scope the old
  SlackNotifyFn role held.

- `cdk/src/constructs/slack-integration.ts` — deletes SlackNotifyFn,
  its DynamoEventSource, its IAM policy, and its NagSuppressions
  entry. Drops the now-unused StartingPosition / FilterCriteria /
  FilterRule / lambdaEventSources imports.

After this lands, `aws lambda list-event-source-mappings` shows
exactly one consumer of the TaskEventsTable stream (FanOutFn);
verified on the dev stack with end-to-end @mention + cancel + CLI
isolation scenarios.

## (b) Review fix #1 — partial-batch retry semantics (BLOCKER)

The first review pass found that the post-migration handler silently
dropped Slack-side infra errors (DDB throttle on the GetItem,
Secrets Manager 5xx, transient Slack timeout). Pre-migration the
SlackNotifyFn handler rethrew non-SlackApiError so Lambda retried
the batch; post-migration `Promise.allSettled` swallowed the
rejection and routeEvent returned an empty list with no escalation
path to `batchItemFailures`.

routeEvent's return type changed from `NotificationChannel[]` to
`{ dispatched, infraRejections }`. The handler now pushes the
record into `batchItemFailures` whenever `infraRejections.length>0`,
so Lambda replays the record under the partial-batch contract. The
warn line on rejection is tagged `retryable: true` so operators can
alert distinctly from the channel-terminal swallow path.

GitHub got the symmetric treatment: 4xx (excluding the existing 401
and 404 handling) is now treated as a channel-terminal swallow via
`fanout.github.api_error` instead of escalating to retry.

## (c) Review fix #2 — split SlackApiError into terminal + retryable

Originally any `!result.ok` Slack response was wrapped in
SlackApiError and swallowed. That collapsed retryable codes
(`ratelimited`, `service_unavailable`, `internal_error`,
`fatal_error`, `request_timeout`) into the same swallow as
`channel_not_found` — a tier-1 Slack outage would silently drop
every message.

Introduced `TERMINAL_SLACK_API_ERRORS` set + `classifySlackError`
helper. Terminal codes still throw SlackApiError (router swallows).
Retryable codes throw a plain Error so the router classifies them
as infra rejections and Lambda replays.

## (d) Review fix #3 — NOTIFIABLE_EVENTS / CHANNEL_DEFAULTS drift

The original migration added task_created/session_started/task_timed_out
to CHANNEL_DEFAULTS.slack but the dispatcher's NOTIFIABLE_EVENTS gate
already excluded several events the router was subscribing Slack to
(agent_error, pr_created, task_stranded). Result: Slack was reported
as `dispatched` for events it silently dropped — telemetry lied,
agent_error never reached operators on Slack-origin tasks, and
task_stranded rendered the generic "Event: task_stranded for
owner/repo" fallback (UX regression).

Added render cases for task_stranded and agent_error in slack-blocks.ts
and added them to NOTIFIABLE_EVENTS. Forward-compat approval_required
and status_response stay out of NOTIFIABLE_EVENTS until their emitters
ship; a new cross-file consistency test in fanout-task-events.test.ts
fails if anyone re-introduces the drift.

The Slack dispatcher wrapper now passes `effectiveEventType` so an
agent_milestone(pr_created) wrapper is unwrapped before NOTIFIABLE_EVENTS
matching. Without the rewrite, the dispatcher would short-circuit
on the wrapper string `agent_milestone`.

## (e) Review fix #4 — conditional UpdateItem on lifecycle persists

Once the BLOCKER fix made batches retry, the original task_created
and session_started UpdateItem calls became hazardous: a Slack POST
that succeeded but whose follow-up UpdateItem failed transiently
would, on retry, post a second root and overwrite slack_thread_ts —
orphaning every threaded reply that had threaded under the first ts.

Both UpdateItems now carry an `attribute_not_exists` ConditionExpression
on the relevant `channel_metadata.slack_*_msg_ts`. On
ConditionalCheckFailedException the handler logs at info, deletes
the duplicate Slack message via `chat.delete`, and returns. Sibling
retry wins the race; the duplicate is cleaned up.

## (f) Dev-stack regression: drop pr_created from Slack defaults

Live verification surfaced a UX duplication: pr_created (subscribed
in CHANNEL_DEFAULTS.slack as the original §6.2 design called for) and
task_completed both rendered messages with View PR buttons, posted
seconds apart. The original SlackNotifyFn had silently dropped
pr_created (NOTIFIABLE_EVENTS gate), so users hadn't relied on it.

Removed pr_created from CHANNEL_DEFAULTS.slack and from
NOTIFIABLE_EVENTS, and removed the prCreatedMessage renderer.
GitHub keeps pr_created (its edit-in-place comment surface
genuinely benefits from the early checkpoint).

## Verification

- mise //cdk:compile  — clean
- mise //cdk:test     — 1183 / 1183 pass (8 net-new tests added for
                        the review fixes: NOTIFIABLE_EVENTS drift
                        guard, retryable Slack codes, GitHub 4xx
                        swallow, infra rejection escalation,
                        SlackApiError swallow, task_stranded render)
- mise //cdk:eslint   — clean
- mise //cdk:synth    — confirms exactly one Lambda::EventSourceMapping
                        on TaskEventsTable, pointing at FanOutFn
- Dev-stack scenarios — @mention happy path, Cancel button, CLI submit
                        (channel_source=api -> zero Slack dispatches,
                        GitHub edit-in-place still fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(fanout): retry GitHub 403/429 instead of swallowing as terminal (#79 review #1)

PR #79 review found that the new 4xx terminal-swallow path treats
HTTP 403 and 429 as channel-terminal — but on GitHub these are
transient rate-limit responses (403 with "API rate limit exceeded",
429 "Too Many Requests"). Under a reconciliation wave that touches
many tasks, an entire window of GitHub comment updates would be
permanently dropped with only a warn log.

Carve out 403 and 429 from the swallow guard so they propagate as
infra rejections through ``Promise.allSettled``. The record lands
in ``batchItemFailures`` and Lambda replays until the rate-limit
window clears (or DLQs after ``retryAttempts``).

Test coverage: parametrized over 403 + 429 with a GitHubCommentError
mock at the helper boundary, asserting the record's eventID surfaces
in ``batchItemFailures`` rather than being absorbed.

* fix(fanout): guard Slack Secrets Manager grant on a prop (#79 review #2)

Every other external-service grant in FanOutConsumer (taskTable,
repoTable, githubTokenSecret) is gated by ``if (props.X)``, so a
deployment that hasn't onboarded the corresponding service stays
free of dangling IAM permissions. The original migration broke the
pattern with an unconditional ``bgagent/slack/*`` Secrets Manager
grant — dev stacks without Slack onboarding ended up holding read
permission on a resource pattern they never use, with a misleading
``cdk-nag AwsSolutions-IAM5`` suppression reason.

Adds an optional ``slackSecretArnPattern`` prop on
``FanOutConsumerProps``; the policy statement is only attached when
the prop is set. ``cdk/src/stacks/agent.ts`` now computes the
``bgagent/slack/*`` ARN inline and passes it through, mirroring the
other guarded props. ``ArnFormat`` and ``Stack`` imports moved out
of fanout-consumer.ts since the construct no longer needs them.

No changes to live behaviour — agent.ts always passes the prop, so
the IAM policy still attaches in production. The dispatcher will
log-and-fail-retry on a missing pattern (covered by review #3 fix).
Test gap covering the construct itself ships in a follow-up commit
(test gap #34).

* fix(fanout): throw on missing TASK_TABLE_NAME env var (#79 review #3)

Pre-fix: when ``TASK_TABLE_NAME`` was unset on a Slack-subscribed
event, ``dispatchSlackEvent`` returned silently after a warn line.
The router counted Slack as ``dispatched`` and a broken stack
quietly dropped every Slack notification — operators only saw it
in the warn-rate metric, with no rejected-channel signal.

Post-fix: throw a plain Error so the rejection propagates as an
infra rejection through ``Promise.allSettled``. The router pushes
the record into ``batchItemFailures``, Lambda retries the batch,
the ``fanout.dispatcher.rejected`` warn fires per record, and
operators get a distinct alarm.

Also bumps the existing log line from ``warn`` to ``error`` and
attaches an ``error_id: FANOUT_SLACK_MISSING_TASK_TABLE`` so the
deployment-bug case can be distinguished from per-record failures.

Test: ``throws when TASK_TABLE_NAME env var is missing`` deletes
the env var, asserts the throw, asserts no DDB call was attempted
(env-var guard fires first).

* fix(fanout): match SlackApiError by name as well as instanceof (#79 review #7)

When a bundler ever duplicates the slack-notify module (rare with
NodejsFunction tree-shaking but possible if dual-bundled), two
distinct SlackApiError classes coexist and ``instanceof`` against
one fails for instances of the other. The dispatcher would see a
foreign-class SlackApiError, fall through to the rethrow branch,
and the router would treat it as an infra rejection — flipping a
channel-terminal swallow into infinite Lambda retries.

Add an ``err.name === 'SlackApiError'`` fallback so the swallow
branch fires either way. Mirrors the duck-typed
``GitHubCommentError`` check used elsewhere in the same handler.

Test: synthesise a plain Error with name === 'SlackApiError'
(NOT an instance of the mock's SlackApiError class) and assert
batchItemFailures stays empty — proving the swallow path catches
both shapes.

* fix(fanout): extend TERMINAL_SLACK_API_ERRORS with permission codes (#79 review #8)

Original set omitted documented Slack permission/scope failures.
Codes outside the set fall to the retryable branch, so a
misconfiguration like ``ekm_access_denied`` or ``missing_scope``
would burn 3 Lambda retries before DLQ on every event — even
though the failure is fundamentally a configuration bug that no
retry can clear.

Adds:
  - Permission/scope: missing_scope, ekm_access_denied,
    team_access_not_granted, posting_to_general_channel_denied
  - Payload shape: invalid_arguments

Reorganized the set into commented blocks (channel-shape, auth,
permission/scope, payload-shape) so future additions go in the
right bucket and the rationale stays visible.

Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS
set (21 codes) — every one must throw SlackApiError so the router
swallows it. The existing retryable test.each remains intact and
covers the negative-class case (codes outside the set throw a
plain Error and escalate to retry).

* fix(fanout): promote Slack reaction/delete network errors to error logs (#79 review #5)

The reaction / delete helpers (``addReaction``, ``removeReaction``,
``deleteMessage``) used to log every catch at warn with a single
generic event key, lumping API-level rejections (e.g. ``no_reaction``)
together with infrastructure failures (DNS lookup, TLS handshake,
fetch timeout, JSON parse error from a hostile gateway). Operators
who alarmed on the warn rate saw a flat signal that masked
genuine infra problems.

Split the boundary:

  - API-level (``!result.ok`` after a successful HTTP call) stays at
    warn with channel-specific event keys
    (``fanout.slack.reaction_add_api_error``,
    ``fanout.slack.reaction_remove_api_error``,
    ``fanout.slack.message_delete_api_error``). These are per-message
    UX problems; operators don't page.

  - Network errors (the outer ``catch (err)`` after ``fetch``)
    promote to ``logger.error`` with dedicated event keys
    (``fanout.slack.reaction_add_network_error``,
    ``fanout.slack.reaction_remove_network_error``,
    ``fanout.slack.message_delete_network_error``) and ``error_id``s
    (``FANOUT_SLACK_REACTION_NETWORK``,
    ``FANOUT_SLACK_DELETE_NETWORK``) so each has its own alarmable
    signal. User-visible symptoms when these fire silently:
    stale emoji reactions (hourglass never swaps to ✅) and
    orphaned intermediate messages.

Behaviour unchanged: errors are still swallowed (per-message
reactions and intermediate cleanup are best-effort by design;
they must not fail the batch), but operators now get distinct
metrics for each failure class.

* fix(fanout): emit fanout.slack.dup_delete_failed on ghost-message accumulation (#79 review #6)

The conditional UpdateItem dup-delete path
(``task_created`` / ``session_started`` lifecycle persists)
calls ``deleteMessage`` to clean up the duplicate Slack message
that landed when a sibling retry won the race. The delete is
inherently best-effort — but if it fails, the duplicate becomes a
permanent ghost in the thread and operators had no way to alarm
on the rate.

Refactor ``deleteMessage`` to return a boolean (``true`` on success
or ``message_not_found``-as-already-gone, ``false`` otherwise) and
emit a dedicated ``fanout.slack.dup_delete_failed`` event with an
``error_id: FANOUT_SLACK_DUP_DELETE_FAILED`` from the dup-delete
callsites when the cleanup couldn't complete.

The terminal-event cleanup paths (``slack_session_msg_ts``,
``slack_created_msg_ts``) intentionally don't fire this event —
those paths target genuinely-stale UX cleanup, not retry-driven
duplicates, so an alarm there would be noise.

No new tests beyond the existing dup-delete coverage; the
``deleteMessage`` return value isn't yet asserted at the unit
level, but the behavior is fully exercised by the existing
``dup-delete`` integration paths (test gap #31 will add an
explicit failure-path assertion when it lands).

* chore(fanout): tighten RouteOutcome arrays to ReadonlyArray (#79 review #9)

``RouteOutcome.dispatched`` and ``infraRejections`` were typed as
plain ``NotificationChannel[]`` — which made ``readonly`` on the
property prevent reassignment but still allow callers to mutate the
underlying array via ``.push``, ``.splice``, or ``.sort``. Inconsistent
with the ``ReadonlySet<string>`` used for ``CHANNEL_DEFAULTS`` in the
same file.

Tightening to ``ReadonlyArray<NotificationChannel>`` makes the
contract honest: the router owns the arrays, callers read them.
Test suite updated to use ``[...outcome.dispatched].sort()`` where it
previously called ``.sort()`` directly — the explicit copy makes the
intent clear and would have surfaced any silent test-side mutation.

* refactor(fanout): make SlackDispatchEvent a type alias of FanOutEvent (#79 review #10)

The two interfaces were structurally identical: same five fields,
same readonly modifiers, same metadata shape. The decoupling was
purely nominal and a silent-drift footgun — adding a field to
``FanOutEvent`` (e.g. when the router starts plumbing an
``approval_required`` ID through) would not flow into
``SlackDispatchEvent``, leaving the dispatcher unaware until a
downstream test happened to fail.

Replace with a one-line type alias:

  export type SlackDispatchEvent = FanOutEvent;

The slack-notify module now type-imports ``FanOutEvent`` from
fanout-task-events. ``import type`` is erased at compile time, so
the runtime bundle still has the one-way dep
(fanout-task-events → slack-notify) — no module-cycle hazard.

Reviewer-suggested ``Pick<FanOutEvent, 'task_id' | …>`` was
considered and rejected: the dispatcher uses every field of
``FanOutEvent``, so the Pick would just enumerate the same five
fields with extra noise. A direct alias keeps the intent obvious
and prevents drift identically.

* fix(fanout): generalize Slack dedup to cover agent_error + log Retry-After (#79 review #4)

PR #79 review #4 surfaced a sibling-channel-failure hazard: when
GitHub or Email rate-limits, the record lands in
``batchItemFailures``. On the Lambda retry, every Slack-subscribed
event for that record runs again. Terminal events were already
guarded by ``slack_notified_terminal``; ``agent_error`` was not —
operators would page twice on a single agent failure if a sibling
channel happened to fail.

Generalize the dedup mechanism. ``TERMINAL_EVENTS`` is replaced by
a ``SLACK_DEDUP_ATTRIBUTE`` map that marks each event type with the
``channel_metadata`` attribute that should guard the post:

  - 5 terminals share ``slack_notified_terminal`` (any first-arriving
    terminal claims the right; subsequent terminals dedup against it)
  - ``agent_error`` gets its own ``slack_dispatched_agent_error``
    so a duplicate agent_error doesn't reuse the terminal slot
  - ``task_created`` / ``session_started`` map to ``null`` because
    they already use the per-event ``slack_*_msg_ts`` conditional
    persists from review #1 — the conditional already provides
    full idempotency (a separate marker would be redundant)

Also surfaces Slack's ``Retry-After`` header on rate-limited
responses through a dedicated ``fanout.slack.retryable_api_error``
warn so operators reading CloudWatch can see the recovery window
instead of guessing from sustained warn rate.

Tests:
  - logs Retry-After header on rate-limited Slack responses (new):
    asserts ``retry_after_seconds`` propagates from Slack's response
    header into the warn metadata
  - existing terminal-codes parametrized test untouched (terminal
    branch doesn't read headers)
  - existing retryable test gains a ``headers: { get: () => null }``
    stub on the fetch mock so the headers.get call doesn't crash

Reviewer suggested a per-channel dispatch bitmap as the alternative.
Rejected as premature: the duplicate-GitHub-PATCH is harmless
(idempotent), Email is still a stub, and the dedup map covers
the specific agent_error pain identified above. A bitmap would add
a new table + IAM grants + per-dispatch DDB cost for a hypothetical
problem (Slack rate-limiting AND a sibling channel failure).

* test(fanout): conditional UpdateItem race + dup-delete coverage (#79 test gap)

Adds 4 tests covering the lifecycle-persist conditional path that
review fix #1 introduced and review fix #6 hardened. Pre-PR-#79 the
only ConditionalCheckFailed coverage was the terminal-dedup path;
the new lifecycle-persist + dup-delete code lacked direct assertions
and was flagged 9/10 criticality by the reviewer.

  - task_created persist ConditionalCheckFailed → posts duplicate
    then deletes it: pins the cleanup behaviour that prevents ghost
    task_created posts in the channel
  - session_started persist ConditionalCheckFailed → posts duplicate
    then deletes it: parallel coverage for the other lifecycle
    attribute (slack_session_msg_ts)
  - dup-delete failure emits fanout.slack.dup_delete_failed with
    error_id: pins the operator-alarm signal added in review fix #6;
    asserts both the event key and the FANOUT_SLACK_DUP_DELETE_FAILED
    error_id propagate
  - chat.delete returning message_not_found is treated as success
    (no dup_delete_failed): negative-class assertion. Prevents
    false-positive alarms when the race resolves cleanly (the
    duplicate was already deleted by a prior retry).

The ghost / message_not_found tests use ``fetchMock.mockImplementation``
URL-routing rather than ``.mockResolvedValueOnce`` chains because
``updateReaction`` issues 2-3 reaction-API fetches between
chat.postMessage and chat.delete; routing by URL keeps the test
focused on the load-bearing chat.delete behaviour without coupling
to reaction call order.

* test(fanout): cover task_stranded + agent_error renderers (#79 test gap #32)

Pre-PR-#79 the new ``taskStrandedMessage`` and ``agentErrorMessage``
helpers in slack-blocks.ts had no direct unit tests. Reviewer flagged
this as a 7/10 gap because the renderers carry the prior_status /
error_type / message_preview metadata threaded through from the
event source — silent drift in the metadata field names would
produce ugly fallback messages in production.

Adds 5 tests:

  - task_stranded WITH metadata renders the prior_status parenthetical
    (``Task stranded for org/repo (last status: RUNNING)``) so
    operators can tell at a glance whether the task hung in HYDRATING
    vs RUNNING — without the parenthetical the reviewer's "generic
    Event: ..." UX regression would resurface.
  - task_stranded WITHOUT metadata still renders cleanly (legacy
    events written before the reconciler started stamping metadata
    must not crash or leak ``undefined``).
  - agent_error with full metadata (error_type + message_preview)
    renders the rotating_light, type, and preview.
  - agent_error WITHOUT metadata stays sensible — no leaked
    ``undefined`` strings or empty ``_Type:_`` line.
  - agent_error truncates a 500-char message_preview to keep Slack
    channel UX readable.

* test(fanout): cover agent_error dedup + dedup-slot isolation (#79 test gap #33)

Pre-PR-#79 review-fix #4 there was no direct test for the
``slack_dispatched_agent_error`` dedup attribute or its interaction
with the existing ``slack_notified_terminal`` slot. A future
refactor that collapsed the two slots — or renamed one of them —
would silently break the sibling-channel-failure-retry guarantee
that fix #4 added.

Adds 4 tests:

  - ``agent_error claims its own dedup attribute``: pins the
    UpdateExpression and ConditionExpression strings so a refactor
    that renames the attribute breaks loudly.
  - ``agent_error retry hits the dedup guard``: end-to-end scenario
    matching review #4 — task already has
    ``slack_dispatched_agent_error: true``, retry must short-circuit
    before chat.postMessage. Without the guard, a second
    rotating_light fires.
  - ``terminal dedup attribute is per-class``: a flaky
    task_completed-then-task_failed sequence dedups against the
    same ``slack_notified_terminal`` slot. Catches the regression
    where the orchestrator emits both terminal types and we'd
    otherwise post both ✅ and ❌.
  - ``agent_error and terminals use distinct dedup slots``: the
    important negative — having ``slack_dispatched_agent_error``
    set must NOT shadow a subsequent ``task_completed``. Pins
    the slot separation so a future merge into a single slot
    can't silently drop terminals after an agent_error.

* test(fanout): add construct-level tests for FanOutConsumer (#79 test gap #34)

The construct shipped on issue #64 with no unit-level coverage of
its IAM contract. The only synth-level signal lived inside
``slack-integration.test.ts`` ("0 EventSourceMapping") which proved
the migration didn't regress the OTHER construct. Reviewer flagged
this 6/10 — and the gap is what allowed review #2 (unconditional
Slack secret grant) to slip through in the first place.

Adds 6 tests:

  - ``attaches a single DynamoEventSource on the TaskEventsTable
    stream``: pins the architectural invariant — issue #64 was
    fundamentally about reaching exactly-one stream reader. Adding
    a second consumer must fail this test loudly.
  - ``creates a DLQ for the fanout Lambda``: pins retention period
    + presence; a DLQ-less deployment would silently drop
    poison-pill records past retryAttempts.
  - ``omits the bgagent/slack/* grant when slackSecretArnPattern
    is not provided``: the review #2 invariant. Iterates every
    IAM::Policy and asserts NONE of them grant secretsmanager:*
    on a bgagent/slack/* ARN. A regression that re-introduces the
    unconditional grant breaks this test.
  - ``attaches the bgagent/slack/* grant only when
    slackSecretArnPattern is provided``: the positive case. Pins
    the grant shape (action, effect, resource pattern).
  - ``passes TASK_TABLE_NAME env var when taskTable is provided``:
    review #3 dependency — the dispatcher throws on missing env.
  - ``omits TASK_TABLE_NAME env var when taskTable is not provided``:
    graceful degrade for dev stacks that haven't onboarded the
    TaskTable yet (matches the construct's documented contract).

* test(fanout): cover task_stranded through terminal dedup (#79 test gap #35)

The reconciler at handlers/reconcile-stranded-tasks.ts:170 emits
BOTH ``task_stranded`` and ``task_failed`` for a heartbeat-expired
task — one for the operator signal, one to drive the FAILED status
transition. Pre-PR-#79 this pair had no test coverage; reviewer
flagged this 8/10 because the visible failure mode (a paired
"Task stranded" + "Task failed" double-page in Slack) would
surface in production but be silent in CI.

Adds 2 tests:

  - ``task_stranded posts and writes the terminal dedup marker on
    first arrival``: pins that task_stranded participates in the
    shared terminal slot and renders the warning message with
    metadata. Catches a regression that omits task_stranded from
    the dedup map entirely.
  - ``task_stranded after a sibling task_failed dedups``: the
    operational scenario — task_failed already claimed
    ``slack_notified_terminal``; the subsequent task_stranded must
    short-circuit before chat.postMessage. Without this guard,
    operators get the double-page the reviewer warned about.

* fix(fanout): re-read TaskRecord before terminal cleanup to close orphan-message race

Live observation during PR #79 review verification: the same Slack
@mention happy path sometimes leaves the 🚀 task_created message
in the thread (orphaned beside the ✅ task_completed) and sometimes
deletes it cleanly. The race window:

  1. ``task_created`` stream batch posts the rocket message and
     persists ``slack_created_msg_ts`` via the conditional UpdateItem
     introduced in PR #79 review fix #1.
  2. ``task_completed`` stream batch fires ~30s later. Its initial
     GetItem races the prior UpdateItem and sees a stale
     ``channel_metadata`` WITHOUT ``slack_created_msg_ts``.
  3. The terminal cleanup branch checks
     ``channelMeta.slack_created_msg_ts`` — undefined — silently skips
     the chat.delete. The rocket message stays in the thread.

Add a fresh GetItem inside the TERMINAL_EVENTS cleanup branch, after
the dedup UpdateItem has linearized our view of the table. Any prior
``slack_*_msg_ts`` writes are visible by then, so the cleanup fires
correctly. On a re-read failure (DDB throttle / transient blip) we
fall back to the dispatch-entry snapshot and emit
``fanout.slack.cleanup_reread_failed`` so operators can alarm on
the rate.

Pre-existing race (the unconditional UpdateItem in pre-PR-#79 was
the same shape — wrote, GetItem on the next batch could miss it).
PR #79 doesn't introduce it but doesn't fix it either; this commit
does, since the live screenshot evidence appeared during review
verification.

Tests:
  - ``terminal cleanup re-reads TaskRecord``: scripts a stale
    dispatch-entry GetItem followed by a fresh re-read GetItem with
    ``slack_created_msg_ts`` present; asserts chat.delete fires
    against the freshly-read ts.
  - ``terminal cleanup falls back to dispatch-entry snapshot when
    re-read fails``: defense-in-depth — DDB throttle on the re-read
    must not break terminal delivery; cleanup uses the entry snapshot
    and emits the fallback warn.

---------

Co-authored-by: bgagent <bgagent@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alain Krok <alkrok@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants