feat(k8s): add k8s-health-check reusable workflow #103

Open
chrisamti wants to merge 8 commits into main from feat/k8s-health-check-workflow

Conversation

@chrisamti
Member

@chrisamti chrisamti commented Mar 19, 2026

Summary

  • Adds a new reusable workflow, k8s-health-check.yaml, that verifies that key cluster components are healthy after a Terraform apply
  • Intended to gate prod deploys on dev health, sitting between tf-apply (dev) and tf-apply (prod)
  • Checks performed:
    • Karpenter rollout status
    • Datadog operator + cluster-agent + node-agent rollout status
    • DatadogAgent custom resource status conditions: checks the resource directly for Error/Degraded conditions (catches immutable field errors, reconciliation failures, etc.)
    • Datadog operator logs: scans for ERROR-level entries from the last 3 minutes
    • Lacework rollout status (skipped gracefully if not deployed)
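
As a rough illustration of two of the checks listed above (the operator log scan and the "skip gracefully if not deployed" rollout check), here is a minimal shell sketch. It is not taken from the workflow itself: the function names, the default namespace and deployment names, and the rule for what counts as an ERROR entry are all assumptions made for the example.

```shell
# Hypothetical helper: count log lines that look like ERROR-level entries.
# Reads log text on stdin and prints the count.
# Assumption: an error entry contains the token ERROR or "level":"error".
count_error_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  grep -c -E '(^|[^A-Za-z])ERROR([^A-Za-z]|$)|"level":"error"' || true
}

# Hypothetical wrapper: fail if the operator logged errors in the time window.
# The namespace and deployment defaults are placeholders, not the real names.
# Note: a failing kubectl call is masked by the pipe in this sketch.
check_operator_logs() {
  ns="${1:-datadog}"
  deploy="${2:-datadog-operator}"
  window="${3:-3m}"
  errors=$(kubectl logs -n "$ns" "deployment/$deploy" --since="$window" | count_error_lines)
  if [ "$errors" -gt 0 ]; then
    echo "Found $errors ERROR entries in $deploy logs (last $window)" >&2
    return 1
  fi
}

# Hypothetical helper: wait for a rollout, or skip if the resource is absent.
# Caveat: any failure of "kubectl get" (not only "not found") leads to a skip.
rollout_or_skip() {
  resource="$1"   # e.g. daemonset/<name>
  ns="$2"
  if ! kubectl get "$resource" -n "$ns" >/dev/null 2>&1; then
    echo "$resource not found in namespace $ns, skipping"
    return 0
  fi
  kubectl rollout status "$resource" -n "$ns" --timeout=120s
}
```

Keeping the log matching in a function that only reads stdin means it can be exercised without a cluster; only the thin wrappers call kubectl.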

Usage

apply-dev:
  uses: DND-IT/github-workflows/.github/workflows/tf-apply.yaml@v3
  with:
    environment: platform-dev

health-check-dev:
  needs: apply-dev
  uses: DND-IT/github-workflows/.github/workflows/k8s-health-check.yaml@v3
  with:
    environment: platform-dev

apply-prod:
  needs: health-check-dev
  uses: DND-IT/github-workflows/.github/workflows/tf-apply.yaml@v3
  with:
    environment: platform-prod

Testing

No integration test is included in this repo. This workflow is by nature an integration test: it runs kubectl rollout status and inspects live custom resource status on a real cluster. The sandbox account has no EKS cluster with the required components (Karpenter, Datadog), so the workflow cannot be tested here.

The workflow is tested end-to-end via the disco-infra-terraform pipeline (see companion PR), which has access to the real platform-dev cluster.

Test plan

  • Verify workflow runs correctly via the disco-infra-terraform PR pipeline against platform-dev

Validates key cluster components after a Terraform apply to gate
prod deploys on dev health. Checks Karpenter, Datadog (operator,
cluster-agent, node-agent) and Lacework rollout status, plus Datadog
operator reconciliation error logs.
github-actions bot added the documentation label Mar 19, 2026
…v for test

- Fail with a clear error if no cluster is found in the region instead
  of passing 'None' to aws eks update-kubeconfig
- Switch test environment from sandbox (no cluster) to platform-dev
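
The first bullet above describes a guard against an empty cluster lookup. A minimal sketch of such a guard follows, assuming the cluster name comes from an `aws eks list-clusters` query with text output (which prints `None` for an empty result). The function name `require_cluster_name` and the `AWS_REGION` variable are hypothetical and not taken from the workflow.

```shell
# Hypothetical guard: reject an empty or "None" cluster name with a clear error.
# Prints the validated name on stdout so callers can capture it.
require_cluster_name() {
  name="$1"
  region="$2"
  if [ -z "$name" ] || [ "$name" = "None" ]; then
    echo "No EKS cluster found in region $region" >&2
    return 1
  fi
  printf '%s\n' "$name"
}

# Possible usage in a workflow step (not executed here; AWS_REGION is assumed
# to be provided by the job environment):
#   cluster=$(aws eks list-clusters --region "$AWS_REGION" \
#     --query 'clusters[0]' --output text)
#   cluster=$(require_cluster_name "$cluster" "$AWS_REGION") || exit 1
#   aws eks update-kubeconfig --region "$AWS_REGION" --name "$cluster"
```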

The workflow requires access to a specific EKS cluster in the platform-dev
AWS account. This environment is not available in this repo and would require
additional OIDC trust configuration. The workflow is effectively tested via
the disco-infra-terraform pipeline, which already has the correct access.

Adds a direct check on the DatadogAgent custom resource status conditions,
catching Error/Degraded states (e.g. immutable field errors) at the source
rather than relying solely on operator log parsing.
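
One way the status condition check described above might look is sketched below. It assumes the conditions are flattened to `Type=Status` lines with a kubectl JSONPath query, and that a failing condition is one whose type contains `Error` or `Degraded` and whose status is `True`. The actual condition type names emitted by the Datadog operator are not given in this PR, so the pattern, the function name, and the namespace are assumptions.

```shell
# Hypothetical helper: read "Type=Status" lines on stdin, print the offending
# conditions, and return 1 if any Error/Degraded condition has status True.
find_bad_conditions() {
  bad=$(grep -E '(Error|Degraded)[^=]*=True$' || true)
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
}

# Possible usage (not executed here; the namespace is a placeholder):
#   kubectl get datadogagent -n datadog \
#     -o jsonpath='{range .items[*].status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | find_bad_conditions || exit 1
```

Reading the resource status directly avoids depending on the operator's log format, which is the motivation given in the commit message.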
@swibrow
Contributor

swibrow commented Mar 20, 2026

@codex[agent] Please review

@Codex
Contributor

Codex AI commented Mar 20, 2026

@swibrow I've opened a new pull request, #107, to work on those changes. Once the pull request is ready, I'll request review from you.

Labels

ci-cd, documentation
