fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase #1951
Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main on Mar 18, 2026
DataSourceEc2KubernetesLocal runs during cloud-init's init-local stage (pre-network). Its _get_data() delegates to DataSourceEc2._get_data(), which attempts to crawl the IMDS at 169.254.169.254, but no network is available yet. The TCP connection retries for ~200s before timing out, adding a boot penalty to every EC2 node.

This was not visible until cloud-init 25.1.4 changed ds-identify to respect the user-configured datasource_list in /etc/cloud/cloud.cfg.d/ [1]. This update was included in the Ubuntu 22.04 and 24.04 base images on Feb 17 2026, so any AMI built from a base image after that date is affected.

Previously, ds-identify wrote its own datasource_list: [ Ec2, None ] to /run/cloud-init/cloud.cfg (highest merge priority), silently overriding the custom Ec2Kubernetes datasource with the standard Ec2Local, which handles init-local correctly by setting up ephemeral DHCP first.

Fix: Return False immediately from DataSourceEc2KubernetesLocal._get_data() so cloud-init proceeds to the init-network phase, where DataSourceEc2Kubernetes runs with full network access. This matches the existing end state (init-local always failed) but eliminates the ~200s timeout.

[1] https://cloudinit.readthedocs.io/en/latest/reference/breaking_changes.html#strict-datasource-identity-before-network
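The shape of the fix can be sketched roughly as follows. This is a minimal illustration, not the patched file: the DataSourceEc2 class here is a stub standing in for cloud-init's real class, the inheritance chain is an assumption, and the crawl_attempts counter is a hypothetical addition used only to show that the IMDS crawl is never reached.

```python
# Stub standing in for cloud-init's DataSourceEc2 (NOT the real class).
class DataSourceEc2:
    def __init__(self):
        # Hypothetical counter, used only to observe whether a crawl was attempted.
        self.crawl_attempts = 0

    def _get_data(self):
        # The real method crawls IMDS at 169.254.169.254; the stub only counts calls.
        self.crawl_attempts += 1
        return True


# Assumed hierarchy: the network-stage datasource inherits the normal crawl.
class DataSourceEc2Kubernetes(DataSourceEc2):
    """Runs in the init-network stage, where IMDS is reachable."""


class DataSourceEc2KubernetesLocal(DataSourceEc2Kubernetes):
    """Runs in the init-local stage, before any network is configured."""

    def _get_data(self):
        # Report "no data found" right away instead of delegating to the
        # parent crawl, which cannot succeed without a network.
        return False


local = DataSourceEc2KubernetesLocal()
network = DataSourceEc2Kubernetes()
local_found = local._get_data()      # False, and no crawl attempted
network_found = network._get_data()  # the stub crawl runs once
```

Returning False tells cloud-init that this datasource found nothing in the current stage, so the outcome of init-local is the same as before, only reached without waiting for the connection retries to time out.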
Contributor: Skipping CI for Draft Pull Request.
Author (Member): /test ?
AndiDog reviewed Mar 17, 2026
...s/providers/files/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2Kubernetes.py
Author (Member): /assign @dlipovetsky @AndiDog @richardcase @nrb
Author (Member): /retest
Contributor: I've tested this manually with k8s 1.35.2 and the latest CAPI+CAPA. Seems a reasonable fix. /lgtm
Member: I think this is good. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mboersma
Evidence: Before and After
Tested this fix by running the CAPA e2e suite (kubernetes-sigs/cluster-api-provider-aws#5898) with and without the patched DataSourceEc2Kubernetes.py. Both runs use Ubuntu 24.04 with cloud-init v25.3 (which includes the ds-identify behavioral change from 25.1.4 that surfaces this bug).
Before the fix: prow job #2032348607742480384
systemd.log from quick-start-ufkxn1-control-plane-qgkkx (full log)
Result: init-local wastes ~233s per node retrying IMDS with no network. The cumulative delay across all nodes causes the entire e2e suite to exceed its 5-hour timeout: FAIL! - 19 Passed | 3 Failed | 6 Pending | 1 Skipped - Suite Timeout Elapsed.
After the fix: prow job #2033678665199390720
systemd.log from quick-start-dej7cy-gpcmf-spzx4 (control plane) (full log)
systemd.log from quick-start-dej7cy-md-0-zwf9x-m9dvt-wz2cx (worker) (full log)
Result: init-local completes in <1s (vs ~233s before). Node startup drops from 4min 35s to between 43s and 1min 12s. The e2e suite completes in 4h14m (vs the 5h timeout): 21 Passed | 1 Failed | 6 Pending | 1 Skipped. The single failure is unrelated (quick-start cluster provisioning timeout, not cloud-init).
Summary
init-local metadata crawl
init-local stage
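As a rough consistency check on the timings reported in the results, the arithmetic below subtracts the ~233s init-local penalty from the pre-fix node startup time and compares it with the post-fix range. The variable names and the to_seconds helper are illustrative only; the figures are the ones quoted in the results.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


startup_before = to_seconds(4, 35)      # node startup before the fix
init_local_penalty = 233                # approximate time lost in init-local
startup_after_low = to_seconds(0, 43)   # fastest node after the fix
startup_after_high = to_seconds(1, 12)  # slowest node after the fix

# Startup time expected once the init-local penalty is removed.
expected_after = startup_before - init_local_penalty

# Range of per-node savings actually observed.
saved_low = startup_before - startup_after_high
saved_high = startup_before - startup_after_low

print(expected_after, saved_low, saved_high)
```

Removing the penalty from 275s leaves about 42s, which sits at the fast end of the observed 43s to 1min 12s range, so the reported per-node savings of roughly 203s to 232s are in line with the ~233s lost in init-local.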