
fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase #1951

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from damdo:fix-skip-imds-crawl-init-local-phase on Mar 18, 2026

Conversation

@damdo
Member

@damdo damdo commented Mar 14, 2026

Change description

Issue

DataSourceEc2KubernetesLocal runs during cloud-init's init-local stage (pre-network). Its _get_data() delegates to DataSourceEc2._get_data() which attempts to crawl the IMDS at 169.254.169.254, but no network is available yet.

The TCP connection retries for ~200s before timing out, adding a boot penalty to every EC2 node.

This was not visible until cloud-init 25.1.4 changed ds-identify to respect user-configured datasource_list in /etc/cloud/cloud.cfg.d/ [1].

This update was included in Ubuntu 22.04 and 24.04 base images on Feb 17, 2026, so any AMI built from a base image after that date is affected.

Previously ds-identify wrote its own datasource_list: [ Ec2, None ] to /run/cloud-init/cloud.cfg (highest merge priority), silently overriding the custom Ec2Kubernetes datasource with the standard Ec2Local, which handles init-local correctly by setting up ephemeral DHCP first.
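The override can be illustrated with the two config locations involved (file names and contents below are illustrative, not taken from this PR):

```yaml
# /etc/cloud/cloud.cfg.d/99-datasource.cfg  (hypothetical name; baked into
# the AMI by the image build): selects the custom datasource.
datasource_list: [ Ec2Kubernetes ]

# /run/cloud-init/cloud.cfg  (written by ds-identify at boot, highest merge
# priority): before cloud-init 25.1.4 this silently won the merge, so the
# standard Ec2 datasource (whose Ec2Local variant sets up ephemeral DHCP
# before touching IMDS) ran instead of Ec2Kubernetes.
datasource_list: [ Ec2, None ]
```

Since 25.1.4, ds-identify respects the user-configured list, so Ec2Kubernetes actually runs in init-local and hits the no-network IMDS crawl described above.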

Proposed Fix

Return False immediately from DataSourceEc2KubernetesLocal._get_data() so cloud-init proceeds to the init-network phase where DataSourceEc2Kubernetes runs with full network access. This matches the existing end-state (init-local always failed) but eliminates the ~200s timeout.
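A minimal sketch of the shape of the fix (not the actual patch): the local-stage datasource short-circuits _get_data() instead of delegating to the parent. The base class here is a simplified stand-in for cloud-init's real DataSourceEc2 hierarchy, and its timeout behaviour is only illustrative.

```python
class DataSourceEc2:
    # Stand-in for cloudinit.sources.DataSourceEc2. In real cloud-init,
    # _get_data() crawls http://169.254.169.254 with a ~240s retry budget;
    # pre-network, that just burns boot time before failing.
    def _get_data(self):
        raise TimeoutError("IMDS unreachable: network not up yet")


class DataSourceEc2KubernetesLocal(DataSourceEc2):
    def _get_data(self):
        # Fix: bail out immediately in the init-local (pre-network) stage.
        # Returning False tells cloud-init "no local data found"; the
        # network-stage DataSourceEc2Kubernetes then crawls IMDS once
        # networking exists, exactly as it did before.
        print("Skipping metadata crawl in init-local phase (no network).")
        return False


result = DataSourceEc2KubernetesLocal()._get_data()
print(result)  # prints: False
```

The end state is unchanged (init-local never found data anyway); only the ~200s of futile retries is removed.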


More detail

Evidence: Before and After

Tested this fix by running the CAPA e2e suite (kubernetes-sigs/cluster-api-provider-aws#5898) with and without the patched DataSourceEc2Kubernetes.py.

Both runs use Ubuntu 24.04 with cloud-init v25.3 (which includes the ds-identify behavioral change from 25.1.4 that surfaces this bug).

Before the fix — prow job #2032348607742480384

systemd.log from quick-start-ufkxn1-control-plane-qgkkx
# cloud-init-local starts, finds DataSourceEc2KubernetesLocal, begins IMDS crawl
Mar 13 09:00:39 systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
Mar 13 09:00:39 cloud-init[470]: Cloud-init v. 25.3-0ubuntu1~24.04.1 running 'init-local' ... Up 4.93 seconds.
Mar 13 09:00:39 cloud-init[470]: Searching for local data source in: ['DataSourceEc2KubernetesLocal']
Mar 13 09:00:39 cloud-init[470]: Fetching Ec2 IMDSv2 API Token
Mar 13 09:00:39 cloud-init[470]: [0/1] open 'http://169.254.169.254/latest/api/token' ...

# IMDS fails repeatedly — "Network is unreachable" (138 retries over ~232s)
Mar 13 09:00:40 cloud-init[470]: ... 'http://[fd00:ec2::254]/latest/api/token' failed [0/240s]: ... Network is unreachable
...
Mar 13 09:04:32 cloud-init[470]: ... 'http://[fd00:ec2::254]/latest/api/token' failed [232/240s]: ... Network is unreachable

# init-local gives up after 232s
Mar 13 09:04:32 cloud-init[470]: Getting metadata took 232.092 seconds
Mar 13 09:04:32 cloud-init[470]: finish: init-local/search-Ec2KubernetesLocal: FAIL
Mar 13 09:04:32 cloud-init[470]: cloud-init stage: 'init-local' took 232.783 seconds
Mar 13 09:04:32 systemd[1]: Finished cloud-init-local.service

# Network comes up → IMDS succeeds instantly
Mar 13 09:04:34 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Mar 13 09:04:34 cloud-init[632]: Read from http://169.254.169.254/latest/api/token (200, 56b) after 1 attempts
Mar 13 09:04:40 cloud-init[632]: finish: init-network/search-Ec2Kubernetes: SUCCESS
Mar 13 09:04:42 cloud-init[632]: cloud-init stage: 'init-network' took 7.508 seconds

# Total boot penalty
Mar 13 09:05:10 systemd[1]: Startup finished in 1.272s (kernel) + 4min 34.372s (userspace) = 4min 35.645s.

(full log)

Result: init-local wastes ~233s per node retrying IMDS with no network. The cumulative delay across all nodes causes the entire e2e suite to exceed its 5-hour timeout: FAIL! — 19 Passed | 3 Failed | 6 Pending | 1 Skipped — Suite Timeout Elapsed.

After the fix — prow job #2033678665199390720

systemd.log from quick-start-dej7cy-gpcmf-spzx4 (control plane)
# cloud-init-local starts, finds DataSourceEc2KubernetesLocal
Mar 17 02:50:28 systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
Mar 17 02:50:29 cloud-init[470]: Cloud-init v. 25.3-0ubuntu1~24.04.1 running 'init-local' ... Up 5.18 seconds.
Mar 17 02:50:29 cloud-init[470]: Searching for local data source in: ['DataSourceEc2KubernetesLocal']

# Fix kicks in - _get_data() returns False immediately, no IMDS crawl
Mar 17 02:50:29 cloud-init[470]: DataSourceEc2Kubernetes.py[DEBUG]: Skipping metadata crawl in init-local phase (no network). DataSourceEc2Kubernetes will run in init-network phase.
Mar 17 02:50:29 cloud-init[470]: Getting metadata took 0.001 seconds
Mar 17 02:50:29 cloud-init[470]: finish: init-local/search-Ec2KubernetesLocal: SUCCESS: no local data found
Mar 17 02:50:30 cloud-init[470]: cloud-init stage: 'init-local' took 0.974 seconds
Mar 17 02:50:30 systemd[1]: Finished cloud-init-local.service

# Network comes up → init-network succeeds as before
Mar 17 02:50:31 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Mar 17 02:50:32 cloud-init[539]: Read from http://169.254.169.254/latest/api/token (200, 56b) after 1 attempts
Mar 17 02:50:41 cloud-init[539]: finish: init-network/search-Ec2Kubernetes: SUCCESS: found network data
Mar 17 02:50:43 cloud-init[539]: cloud-init stage: 'init-network' took 11.063 seconds

# Total boot - no penalty
Mar 17 02:51:37 systemd[1]: Startup finished in 1.241s (kernel) + 1min 11.074s (userspace) = 1min 12.315s.

(full log)

systemd.log from quick-start-dej7cy-md-0-zwf9x-m9dvt-wz2cx (worker)
Mar 17 02:52:44 systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
Mar 17 02:52:46 cloud-init[470]: Skipping metadata crawl in init-local phase (no network). DataSourceEc2Kubernetes will run in init-network phase.
Mar 17 02:52:46 cloud-init[470]: Getting metadata took 0.005 seconds
Mar 17 02:52:47 cloud-init[470]: cloud-init stage: 'init-local' took 1.503 seconds
Mar 17 02:52:48 systemd[1]: Finished cloud-init-local.service

Mar 17 02:52:50 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Mar 17 02:52:59 cloud-init[539]: finish: init-network/search-Ec2Kubernetes: SUCCESS: found network data
Mar 17 02:53:06 cloud-init[539]: cloud-init stage: 'init-network' took 15.175 seconds

Mar 17 02:53:22 systemd[1]: Startup finished in 1.520s (kernel) + 41.872s (userspace) = 43.392s.

(full log)

Result: init-local completes in <1s (vs ~233s before). Node startup drops from 4min 35s → 43s–1min 12s. The e2e suite completes in 4h14m (vs 5h timeout): 21 Passed | 1 Failed | 6 Pending | 1 Skipped - the single failure is unrelated (quick-start cluster provisioning timeout, not cloud-init).

Summary

|                           | Before              | After            |
|---------------------------|---------------------|------------------|
| init-local metadata crawl | 232s (IMDS timeout) | 0.001s (skipped) |
| init-local stage          | 232.783s            | ~1s              |
| Node startup              | 4min 35s            | 43s – 1min 12s   |
| E2E suite (5h budget)     | Timeout (19P/3F)    | 4h14m (21P/1F)   |

[1] https://cloudinit.readthedocs.io/en/latest/reference/breaking_changes.html#strict-datasource-identity-before-network
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 14, 2026
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 14, 2026
@damdo
Member Author

damdo commented Mar 14, 2026

/test ?

@damdo damdo marked this pull request as ready for review March 17, 2026 17:48
@damdo damdo changed the title WIP: fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase Mar 17, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@damdo
Member Author

damdo commented Mar 17, 2026

/assign @dlipovetsky @AndiDog @richardcase @nrb

@damdo
Member Author

damdo commented Mar 17, 2026

/retest

@AndiDog
Contributor

AndiDog commented Mar 18, 2026

I've tested this manually with k8s 1.35.2, latest CAPI+CAPA. Seems a reasonable fix.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 18, 2026
@damdo
Member Author

damdo commented Mar 18, 2026

/assign @mboersma @justinsb

@richardcase
Member

I think this is good.

/lgtm

Contributor

@mboersma mboersma left a comment


/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2026
@k8s-ci-robot k8s-ci-robot merged commit 64373a4 into kubernetes-sigs:main Mar 18, 2026
10 checks passed