
fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase #1951

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from damdo:fix-skip-imds-crawl-init-local-phase on Mar 18, 2026

Conversation

@damdo
Member

@damdo damdo commented Mar 14, 2026

Change description

Issue

DataSourceEc2KubernetesLocal runs during cloud-init's init-local stage (pre-network). Its _get_data() delegates to DataSourceEc2._get_data() which attempts to crawl the IMDS at 169.254.169.254, but no network is available yet.

The TCP connection retries for ~200s before timing out, adding a boot penalty to every EC2 node.

This was not visible until cloud-init 25.1.4 changed ds-identify to respect user-configured datasource_list in /etc/cloud/cloud.cfg.d/ [1].

This update was included in Ubuntu 22.04 and 24.04 base images on Feb 17, 2026, so any AMI built from a base image after that date is affected.

Previously ds-identify wrote its own datasource_list: [ Ec2, None ] to /run/cloud-init/cloud.cfg (highest merge priority), silently overriding the custom Ec2Kubernetes datasource with the standard Ec2Local, which handles init-local correctly by setting up ephemeral DHCP first.
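The override can be illustrated with the two config locations involved (file names and contents below are illustrative, not taken from this PR):

```yaml
# /etc/cloud/cloud.cfg.d/99-datasource.cfg  (hypothetical name; baked into
# the AMI by the image build): selects the custom datasource.
datasource_list: [ Ec2Kubernetes ]

# /run/cloud-init/cloud.cfg  (written by ds-identify at boot, highest merge
# priority): before cloud-init 25.1.4 this silently won the merge, so the
# standard Ec2 datasource (whose Ec2Local variant sets up ephemeral DHCP
# before touching IMDS) ran instead of Ec2Kubernetes.
datasource_list: [ Ec2, None ]
```

Since 25.1.4, ds-identify respects the user-configured list, so Ec2Kubernetes actually runs in init-local and hits the no-network IMDS crawl described above.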

Proposed Fix

Return False immediately from DataSourceEc2KubernetesLocal._get_data() so cloud-init proceeds to the init-network phase where DataSourceEc2Kubernetes runs with full network access. This matches the existing end-state (init-local always failed) but eliminates the ~200s timeout.
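A minimal sketch of the shape of the fix (not the actual patch): the local-stage datasource short-circuits _get_data() instead of delegating to the parent. The base class here is a simplified stand-in for cloud-init's real DataSourceEc2 hierarchy, and its timeout behaviour is only illustrative.

```python
class DataSourceEc2:
    # Stand-in for cloudinit.sources.DataSourceEc2. In real cloud-init,
    # _get_data() crawls http://169.254.169.254 with a ~240s retry budget;
    # pre-network, that just burns boot time before failing.
    def _get_data(self):
        raise TimeoutError("IMDS unreachable: network not up yet")


class DataSourceEc2KubernetesLocal(DataSourceEc2):
    def _get_data(self):
        # Fix: bail out immediately in the init-local (pre-network) stage.
        # Returning False tells cloud-init "no local data found"; the
        # network-stage DataSourceEc2Kubernetes then crawls IMDS once
        # networking exists, exactly as it did before.
        print("Skipping metadata crawl in init-local phase (no network).")
        return False


result = DataSourceEc2KubernetesLocal()._get_data()
print(result)  # prints: False
```

The end state is unchanged (init-local never found data anyway); only the ~200s of futile retries is removed.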


More detail

Evidence: Before and After

Tested this fix by running the CAPA e2e suite (kubernetes-sigs/cluster-api-provider-aws#5898) with and without the patched DataSourceEc2Kubernetes.py.

Both runs use Ubuntu 24.04 with cloud-init v25.3 (which includes the ds-identify behavioral change from 25.1.4 that surfaces this bug).

Before the fix — prow job #2032348607742480384

systemd.log from quick-start-ufkxn1-control-plane-qgkkx
# cloud-init-local starts, finds DataSourceEc2KubernetesLocal, begins IMDS crawl
Mar 13 09:00:39 systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
Mar 13 09:00:39 cloud-init[470]: Cloud-init v. 25.3-0ubuntu1~24.04.1 running 'init-local' ... Up 4.93 seconds.
Mar 13 09:00:39 cloud-init[470]: Searching for local data source in: ['DataSourceEc2KubernetesLocal']
Mar 13 09:00:39 cloud-init[470]: Fetching Ec2 IMDSv2 API Token
Mar 13 09:00:39 cloud-init[470]: [0/1] open 'http://169.254.169.254/latest/api/token' ...

# IMDS fails repeatedly — "Network is unreachable" (138 retries over ~232s)
Mar 13 09:00:40 cloud-init[470]: ... 'http://[fd00:ec2::254]/latest/api/token' failed [0/240s]: ... Network is unreachable
...
Mar 13 09:04:32 cloud-init[470]: ... 'http://[fd00:ec2::254]/latest/api/token' failed [232/240s]: ... Network is unreachable

# init-local gives up after 232s
Mar 13 09:04:32 cloud-init[470]: Getting metadata took 232.092 seconds
Mar 13 09:04:32 cloud-init[470]: finish: init-local/search-Ec2KubernetesLocal: FAIL
Mar 13 09:04:32 cloud-init[470]: cloud-init stage: 'init-local' took 232.783 seconds
Mar 13 09:04:32 systemd[1]: Finished cloud-init-local.service

# Network comes up → IMDS succeeds instantly
Mar 13 09:04:34 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Mar 13 09:04:34 cloud-init[632]: Read from http://169.254.169.254/latest/api/token (200, 56b) after 1 attempts
Mar 13 09:04:40 cloud-init[632]: finish: init-network/search-Ec2Kubernetes: SUCCESS
Mar 13 09:04:42 cloud-init[632]: cloud-init stage: 'init-network' took 7.508 seconds

# Total boot penalty
Mar 13 09:05:10 systemd[1]: Startup finished in 1.272s (kernel) + 4min 34.372s (userspace) = 4min 35.645s.

(full log)

Result: init-local wastes ~233s per node retrying IMDS with no network. The cumulative delay across all nodes causes the entire e2e suite to exceed its 5-hour timeout: FAIL! — 19 Passed | 3 Failed | 6 Pending | 1 Skipped — Suite Timeout Elapsed.

After the fix — prow job #2033678665199390720

systemd.log from quick-start-dej7cy-gpcmf-spzx4 (control plane)
# cloud-init-local starts, finds DataSourceEc2KubernetesLocal
Mar 17 02:50:28 systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
Mar 17 02:50:29 cloud-init[470]: Cloud-init v. 25.3-0ubuntu1~24.04.1 running 'init-local' ... Up 5.18 seconds.
Mar 17 02:50:29 cloud-init[470]: Searching for local data source in: ['DataSourceEc2KubernetesLocal']

# Fix kicks in - _get_data() returns False immediately, no IMDS crawl
Mar 17 02:50:29 cloud-init[470]: DataSourceEc2Kubernetes.py[DEBUG]: Skipping metadata crawl in init-local phase (no network). DataSourceEc2Kubernetes will run in init-network phase.
Mar 17 02:50:29 cloud-init[470]: Getting metadata took 0.001 seconds
Mar 17 02:50:29 cloud-init[470]: finish: init-local/search-Ec2KubernetesLocal: SUCCESS: no local data found
Mar 17 02:50:30 cloud-init[470]: cloud-init stage: 'init-local' took 0.974 seconds
Mar 17 02:50:30 systemd[1]: Finished cloud-init-local.service

# Network comes up → init-network succeeds as before
Mar 17 02:50:31 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Mar 17 02:50:32 cloud-init[539]: Read from http://169.254.169.254/latest/api/token (200, 56b) after 1 attempts
Mar 17 02:50:41 cloud-init[539]: finish: init-network/search-Ec2Kubernetes: SUCCESS: found network data
Mar 17 02:50:43 cloud-init[539]: cloud-init stage: 'init-network' took 11.063 seconds

# Total boot - no penalty
Mar 17 02:51:37 systemd[1]: Startup finished in 1.241s (kernel) + 1min 11.074s (userspace) = 1min 12.315s.

(full log)

systemd.log from quick-start-dej7cy-md-0-zwf9x-m9dvt-wz2cx (worker)
Mar 17 02:52:44 systemd[1]: Starting cloud-init-local.service - Cloud-init: Local Stage (pre-network)...
Mar 17 02:52:46 cloud-init[470]: Skipping metadata crawl in init-local phase (no network). DataSourceEc2Kubernetes will run in init-network phase.
Mar 17 02:52:46 cloud-init[470]: Getting metadata took 0.005 seconds
Mar 17 02:52:47 cloud-init[470]: cloud-init stage: 'init-local' took 1.503 seconds
Mar 17 02:52:48 systemd[1]: Finished cloud-init-local.service

Mar 17 02:52:50 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Mar 17 02:52:59 cloud-init[539]: finish: init-network/search-Ec2Kubernetes: SUCCESS: found network data
Mar 17 02:53:06 cloud-init[539]: cloud-init stage: 'init-network' took 15.175 seconds

Mar 17 02:53:22 systemd[1]: Startup finished in 1.520s (kernel) + 41.872s (userspace) = 43.392s.

(full log)

Result: init-local completes in <1s (vs ~233s before). Node startup drops from 4min 35s → 43s–1min 12s. The e2e suite completes in 4h14m (vs 5h timeout): 21 Passed | 1 Failed | 6 Pending | 1 Skipped - the single failure is unrelated (quick-start cluster provisioning timeout, not cloud-init).

Summary

|                           | Before              | After            |
|---------------------------|---------------------|------------------|
| init-local metadata crawl | 232s (IMDS timeout) | 0.001s (skipped) |
| init-local stage          | 232.783s            | ~1s              |
| Node startup              | 4min 35s            | 43s – 1min 12s   |
| E2E suite (5h budget)     | Timeout (19P/3F)    | 4h14m (21P/1F)   |

[1] https://cloudinit.readthedocs.io/en/latest/reference/breaking_changes.html#strict-datasource-identity-before-network
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 14, 2026
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 14, 2026
@damdo
Member Author

damdo commented Mar 14, 2026

/test ?

@damdo damdo marked this pull request as ready for review March 17, 2026 17:48
@damdo damdo changed the title WIP: fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase Mar 17, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@damdo
Member Author

damdo commented Mar 17, 2026

/assign @dlipovetsky @AndiDog @richardcase @nrb

@damdo
Member Author

damdo commented Mar 17, 2026

/retest

@AndiDog
Contributor

AndiDog commented Mar 18, 2026

I've tested this manually with k8s 1.35.2, latest CAPI+CAPA. Seems a reasonable fix.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 18, 2026
@damdo
Member Author

damdo commented Mar 18, 2026

/assign @mboersma @justinsb

@richardcase
Member

I think this is good.

/lgtm

Contributor

@mboersma mboersma left a comment


/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2026
@k8s-ci-robot k8s-ci-robot merged commit 64373a4 into kubernetes-sigs:main Mar 18, 2026
10 checks passed