A small reference for orchestrating ML and CI/CD pipelines on Kubernetes with Argo Workflows.
Status: Proof of concept. Built as a reference for teammates evaluating Argo Workflows — not production-ready.
A minimal, self-contained playground that explores three distinct Argo Workflows patterns and demonstrates how to pin individual steps to specific node types (CPU vs GPU) using `nodeSelector`. Everything runs on a local k3d cluster, so you can study the patterns end-to-end without any cloud infrastructure or real GPU hardware.
There is no application code in this repo — only Argo Workflow manifests under `workflows/`. Each workflow uses public container images, so it runs as-is.
The PoC demonstrates:

- A sequential `steps:` template for an ML training pipeline (data prep → train → eval).
- A linear `dag:` template for a baseline CI/CD pipeline (build → test → scan → deploy).
- A branching `dag:` with `when:` conditionals that fan out into success/failure paths.
- Pinning individual steps to specific nodes via `nodeSelector`, using k3d node labels (`hardware=cpu`, `hardware=gpu`) to simulate heterogeneous hardware (sketched below).
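To make the first and last items concrete, here is a minimal sketch of the `steps:` plus `nodeSelector` shape. It is illustrative only; the template names and images are placeholders, not copied from the files under `workflows/`:

```yaml
# Illustrative sketch, not one of the repo's manifests: a two-step
# sequential workflow where only the training step is pinned to the
# gpu-labeled node.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-pinning-demo-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:
        - - name: prep                 # runs first
            template: prep-data
        - - name: train                # runs after prep completes
            template: train-model
    - name: prep-data
      nodeSelector:
        hardware: cpu                  # pinned to the cpu-labeled agent
      container:
        image: alpine:3.19
        command: [sh, -c, "echo preparing data"]
    - name: train-model
      nodeSelector:
        hardware: gpu                  # routed by the k3d label; no real GPU involved
      container:
        image: alpine:3.19
        command: [sh, -c, "echo pretend training"]
```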
Each workflow file isolates one Argo pattern. Rather than cramming every concept into a single mega-workflow, the PoC keeps the patterns side-by-side so a reader can study them independently and copy whichever one fits their use case:
| File | Pattern | One-line summary |
|---|---|---|
| `workflows/model-training-steps.yaml` | Sequential `steps:` | ML training pipeline; the training step is pinned to a GPU node, the rest run on CPU. |
| `workflows/cicd-pipeline-dag.yaml` | Linear `dag:` | Baseline CI/CD pipeline (build → test → scan → deploy), all CPU. |
| `workflows/cicd-pipeline-complex-dag.yaml` | Branching `dag:` with conditionals | Adds parallel scan/lint after a successful test and a failure-notification branch when the test fails. |
```
.
├── README.md
├── CLAUDE.md
└── workflows/
    ├── model-training-steps.yaml        # steps: pattern, CPU + GPU
    ├── cicd-pipeline-dag.yaml           # dag: linear
    └── cicd-pipeline-complex-dag.yaml   # dag: branching with when:
```
```bash
k3d cluster create mycluster --agents 2 -p "8081:80@loadbalancer"

# Label the agents to simulate heterogeneous hardware
kubectl label nodes k3d-mycluster-agent-0 hardware=gpu
kubectl label nodes k3d-mycluster-agent-1 hardware=cpu

kubectl get nodes --show-labels
```

Note: there is no real GPU here. The `hardware=gpu` label is purely a scheduling hint so you can see `nodeSelector` route the training step to a specific node.
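If the full `--show-labels` dump is noisy, `kubectl` can also print the label as its own column:

```bash
# Show a HARDWARE column per node (same information, easier to scan)
kubectl get nodes -L hardware
```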
```bash
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm install argo argo/argo-workflows --namespace argo --create-namespace \
  --set workflow.serviceAccount.create=true \
  --set 'workflow.serviceAccount.name=argo-workflow' \
  --set 'server.authModes[0]=server' \
  --version 0.45.20
```

```bash
# ML training pipeline (steps:)
argo submit -n argo --watch --serviceaccount argo-workflow ./workflows/model-training-steps.yaml

# Linear CI/CD DAG
argo submit -n argo --watch --serviceaccount argo-workflow ./workflows/cicd-pipeline-dag.yaml

# Branching CI/CD DAG (default: test passes)
argo submit -n argo --watch --serviceaccount argo-workflow ./workflows/cicd-pipeline-complex-dag.yaml
```

```bash
kubectl get pods -n argo -o wide
```

You should see the training pod scheduled onto `k3d-mycluster-agent-0` (the gpu-labeled node) and the surrounding steps on `k3d-mycluster-agent-1` (cpu).
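To confirm the placement came from the manifest rather than chance, you can also dump a pod's `nodeSelector` field directly (`<training-pod-name>` is a placeholder; copy a real name from the pod listing above):

```bash
# Dumps the pod's nodeSelector (expect hardware=gpu for the training pod)
kubectl get pod <training-pod-name> -n argo -o jsonpath='{.spec.nodeSelector}'
```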
The complex DAG accepts a `test_mode` parameter. Pass `fail` to make the test step throw, which causes the `when:` conditionals to skip scan/lint/deploy and instead run `failure-notification`:
```bash
argo submit -n argo --watch --serviceaccount argo-workflow \
  ./workflows/cicd-pipeline-complex-dag.yaml \
  -p test_mode=fail
```

If you'd rather study the YAML than run it, here's where to look:

- `steps:` vs `dag:` syntax — compare the `templates[0]` blocks of `model-training-steps.yaml` and `cicd-pipeline-dag.yaml`.
- `nodeSelector` hardware pinning — every template body sets `nodeSelector: { hardware: cpu | gpu }`. The training step in `model-training-steps.yaml` is the one that targets `gpu`.
- Conditional branching — see the `when: "{{tasks.test.status}} == Succeeded"` and `== Failed` clauses in `cicd-pipeline-complex-dag.yaml`. The `test-task` template in that file shows how the `test_mode=fail` parameter is wired in to force the failure path (shape sketched below).
- Parameters — both `training-step` (in `model-training-steps.yaml`) and `test-task` (in both DAG files) demonstrate `inputs.parameters` with default values that can be overridden via `argo submit -p`.
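For orientation, here is a minimal sketch of that branching-plus-parameters shape. It is illustrative, not a copy of `cicd-pipeline-complex-dag.yaml`; one detail worth noting is that for a failure branch to run at all, the failing task needs `continueOn: failed`, otherwise the DAG stops there:

```yaml
# Illustrative sketch: a parameterized test task gating two branches.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: branching-demo-
spec:
  entrypoint: pipeline
  arguments:
    parameters:
      - name: test_mode
        value: pass                  # override with: argo submit ... -p test_mode=fail
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: test
            template: test-task
            continueOn:
              failed: true           # let the DAG continue so the failure branch can fire
            arguments:
              parameters:
                - name: test_mode
                  value: "{{workflow.parameters.test_mode}}"
          - name: deploy
            dependencies: [test]
            template: announce
            when: "{{tasks.test.status}} == Succeeded"
          - name: failure-notification
            dependencies: [test]
            template: announce
            when: "{{tasks.test.status}} == Failed"
    - name: test-task
      inputs:
        parameters:
          - name: test_mode
            default: pass            # used if the caller passes nothing
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["test {{inputs.parameters.test_mode}} = pass"]   # exits non-zero when test_mode=fail
    - name: announce
      container:
        image: alpine:3.19
        command: [sh, -c, "echo branch taken"]
```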
- No application code. All steps run public container images and print fake output. There's nothing real being built, tested, or trained.
- GPU is simulated. The `hardware=gpu` label is just a label; the training step uses a CUDA base image but does no GPU work.
- No tests or CI for the manifests themselves. YAML changes are not validated automatically.
- Pinned versions. The argo-workflows Helm chart is pinned to `0.45.20`; newer charts may need flag adjustments.
- Local-only. Setup assumes k3d on a single machine; nothing here is hardened for shared or remote clusters.
Ideas for anyone adapting this as a starting point:
- Replace the placeholder containers with real build/test/training images.
- Add artifacts to pass data between steps (a sketch follows this list).
- Promote shared templates into a `WorkflowTemplate` so multiple workflows can reuse them.
- Wire the DAG into a real trigger (Argo Events, a Git webhook, or `argo cron`).
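For the artifacts idea, here is a minimal sketch of Argo's step-to-step artifact hand-off. It is illustrative only (nothing in this repo uses artifacts yet) and assumes an artifact repository, such as MinIO or S3, is configured for the cluster; the quick-start install above does not set one up:

```yaml
# Illustrative artifact hand-off between two steps (not in this repo).
# Requires an artifact repository (e.g. MinIO/S3) to be configured.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-demo-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:
        - - name: generate
            template: produce
        - - name: consume
            template: consume
            arguments:
              artifacts:
                - name: dataset
                  from: "{{steps.generate.outputs.artifacts.dataset}}"
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c, "echo fake-data > /tmp/dataset.txt"]
      outputs:
        artifacts:
          - name: dataset
            path: /tmp/dataset.txt   # uploaded to the artifact repository
    - name: consume
      inputs:
        artifacts:
          - name: dataset
            path: /tmp/dataset.txt   # downloaded here before the container starts
      container:
        image: alpine:3.19
        command: [sh, -c, "cat /tmp/dataset.txt"]
```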