Skip to content

feat: enable localdns hosts plugin to cache critical AKS FQDNs#7639

Open
saewoni wants to merge 141 commits intomainfrom
sakwa/localdns_poc
Open

feat: enable localdns hosts plugin to cache critical AKS FQDNs#7639
saewoni wants to merge 141 commits intomainfrom
sakwa/localdns_poc

Conversation

@saewoni
Copy link
Contributor

@saewoni saewoni commented Jan 12, 2026

What type of PR is this?
/kind feature

What this PR does / why we need it:
This is adding the aks-setup-hosts.sh, aks-setup-hosts.service, and aks-service.hosts-timer to resolve the AKS critical FQDNs every 15 minutes, which are written to /etc/localdns/hosts file. This file can be mounted onto the LocalDNS corefile if EnableHostsPlugin bool from AKS-RP is passed in.

Which issue(s) this PR fixes:

Fixes #

Requirements:

  • uses conventional commit messages
  • includes documentation
  • adds unit tests
  • tested upgrade from previous version
  • commits are GPG signed and Github marks them as verified

Special notes for your reviewer:

Release note:

none

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a periodic “local DNS cache” for mcr.microsoft.com that writes resolved IPs into /etc/hosts.testing and wires CoreDNS/localdns to consult that file before forwarding, improving reliability and latency for MCR image pulls when LocalDNS is enabled (especially in the scriptless path).

Changes:

  • Adds mcr-hosts-setup script, systemd service, and timer into the VHD build pipeline and node provisioning flow, including a new shouldEnableMCRHostsSetup helper and CSE wiring to enable the timer when LocalDNS (scriptless) is enabled.
  • Updates the localdns CoreDNS template and associated tests to add a hosts /etc/hosts.testing plugin block so MCR lookups can be served from the generated hosts file before going to Azure DNS.
  • Adds targeted shellspec coverage for the new mcr-hosts-setup behavior and for enabling its timer, and refreshes baked CustomData blobs used in VHD-related tests to include the new artifacts.

Reviewed changes

Copilot reviewed 31 out of 83 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
vhdbuilder/packer/vhd-image-builder-mariner*.json, vhdbuilder/packer/vhd-image-builder-flatcar*.json, vhdbuilder/packer/vhd-image-builder-cvm.json, vhdbuilder/packer/vhd-image-builder-base.json Ensures new mcr-hosts-setup.sh, .service, and .timer artifacts are copied into /home/packer during various VHD builds so they can be installed into the image.
vhdbuilder/packer/vhd-image-builder-arm64-gen2.json Same as above for the ARM64 Gen2 image; one object is slightly misformatted compared to the rest of the JSON.
vhdbuilder/packer/packer_source.sh Copies mcr-hosts-setup.sh to /opt/azure/containers and installs the corresponding systemd service and timer units into /etc/systemd/system with appropriate permissions.
parts/linux/cloud-init/artifacts/mcr-hosts-setup.sh New script that resolves A/AAAA records for mcr.microsoft.com via dig and writes them into /etc/hosts.testing, logging both summary counts and the concrete IPs.
parts/linux/cloud-init/artifacts/mcr-hosts-setup.service Ones-shot systemd unit that runs the mcr-hosts-setup.sh script after network-online.target is reached.
parts/linux/cloud-init/artifacts/mcr-hosts-setup.timer Systemd timer that triggers mcr-hosts-setup.service at boot and every 5 minutes thereafter, with jitter and ordering relative to localdns.service.
parts/linux/cloud-init/artifacts/cse_config.sh Adds shouldEnableMCRHostsSetup, which uses systemctlEnableAndStart mcr-hosts-setup.timer 30 to enable/start the timer and logs descriptive messages.
parts/linux/cloud-init/artifacts/cse_main.sh Integrates shouldEnableMCRHostsSetup into the base provisioning flow, calling it when SHOULD_ENABLE_LOCALDNS is true so the timer is only enabled alongside LocalDNS scriptless corefile generation.
pkg/agent/baker.go Extends the LocalDNS CoreDNS template so that, when $isRootDomain is true, a hosts /etc/hosts.testing { fallthrough } block is inserted before the Azure DNS forwarder.
pkg/agent/baker_test.go Updates expected localdns corefile strings in tests to include the new hosts /etc/hosts.testing stanza, ensuring the template change is validated.
spec/parts/linux/cloud-init/artifacts/mcr_hosts_setup_spec.sh New shellspec tests that (by re-simulating the logic) verify hosts file generation and content based on mocked dig output; currently they do not execute the real script, which has maintainability implications.
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh Adds tests to ensure shouldEnableMCRHostsSetup echoes the expected messages and calls systemctlEnableAndStart mcr-hosts-setup.timer 30.
pkg/agent/testdata/CustomizedImage*/CustomData Refreshes gzipped CustomData payloads to include the new artifacts and behavior, keeping VHD-related tests aligned with the new provisioning logic.

@saewoni saewoni changed the title Sakwa/localdns poc feat: add hosts plugin to Jan 30, 2026
@saewoni saewoni changed the title feat: add hosts plugin to feat(localdns): enable mcr-hosts-setup timer for DNS caching Jan 30, 2026
@saewoni saewoni changed the title feat(localdns): enable mcr-hosts-setup timer for DNS caching feat: enable localdns hosts plugin to cache critical AKS FQDNs Jan 30, 2026
@saewoni saewoni marked this pull request as ready for review February 2, 2026 17:50
…refile selection

Changes:
1. Remove synchronous DNS resolution from enableAKSHostsSetup()
   - Previously ran aks-hosts-setup.sh immediately during CSE, blocking provisioning
   - Now only enables the timer asynchronously with systemctlEnableAndStartNoBlock
   - Hosts file populated by timer in background (OnBootSec=0 + RandomizedDelaySec=60s)
   - Reduces CSE provisioning time by ~5-40 seconds depending on DNS latency

2. Add 30-second timeout wait in select_localdns_corefile()
   - Polls hosts file every 5 seconds to check for IP→FQDN mappings
   - Uses hosts plugin corefile if mappings found within 30s
   - Falls back to non-hosts corefile on timeout (safe degradation)
   - Validates hosts file content with grep -qE '^[0-9a-fA-F.:]+[[:space:]]+[a-zA-Z]'

Benefits:
- Faster node provisioning (async DNS resolution doesn't block CSE)
- Still benefits from DNS caching when timer completes quickly
- Graceful fallback if DNS resolution takes longer than expected
- No provisioning failures if aks-hosts-setup timer is delayed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 46 out of 170 changed files in this pull request and generated 3 comments.


You can also share your feedback on Copilot code review. Take the survey.

kwaksaewon and others added 3 commits March 6, 2026 19:00
…ts-setup.sh

The validation logic at lines 264-293 was a weaker duplicate of the thorough
validation at lines 213-262. The second block:
- Used inverted logic (negative validation vs positive)
- Did not count valid entries (could pass with zero valid mappings)
- Provided less detailed error messages
- Was completely redundant since first block already validated everything

Removed 31 lines of duplicate code while keeping the more thorough first
validation block which:
- Counts valid entries and ensures at least one exists
- Provides detailed per-line error messages
- Uses clearer positive validation logic (if IPv4 then count, elif IPv6 then count)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the 60-second random delay from aks-hosts-setup.timer to fix a race
condition where select_localdns_corefile() times out before the hosts file
is populated.

The random delay was intended to prevent thundering herd when multiple nodes
boot simultaneously, but it causes the hosts plugin to be disabled on ~50%
of nodes:

- select_localdns_corefile() waits 30s for /etc/localdns/hosts
- Timer triggers immediately (OnBootSec=0) but adds 0-60s random delay
- If delay > 30s, selection times out and falls back to no-hosts corefile

Removing the delay is safe because:
1. DNS resolution naturally staggers (6 FQDNs × 3s timeout per query)
2. Azure DNS (168.63.129.16) is designed to handle massive scale
3. Operation is read-only (no shared write bottleneck)
4. DNS resolution completes in ~18-36s, well within 30s timeout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 47 out of 170 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

e2e/validation.go:59

  • for i := range maxRetries does not compile (cannot range over an int). This should be a standard counted loop (e.g., for i := 0; i < maxRetries; i++ { ... }).
func ValidatePodRunningWithRetry(ctx context.Context, s *Scenario, pod *corev1.Pod, maxRetries int) {
	var err error
	for i := range maxRetries {
		err = validatePodRunning(ctx, s, pod)
		if err != nil {
			time.Sleep(1 * time.Second)
			s.T.Logf("retrying pod %q validation (%d/%d)", pod.Name, i+1, maxRetries)
			continue

You can also share your feedback on Copilot code review. Take the survey.

kwaksaewon and others added 3 commits March 6, 2026 20:28
# Conflicts:
#	pkg/agent/testdata/AKSUbuntu2204+China/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+Containerd+CDI/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+Containerd+DevicePlugin/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+Containerd+MIG+ArtifactStreaming/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+Containerd+MIG/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+CustomCloud+USNat/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+CustomCloud+USSec/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+CustomCloud+ootcredentialprovider/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+CustomCloud/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+CustomKubeletConfig+CustomLinuxOSConfig/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+CustomKubeletConfig+SerializeImagePulls/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+DisableKubeletServingCertificateRotation+CustomKubeletConfig/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+DisableKubeletServingCertificateRotation/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+EnableManagedGPU+Disabled/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+EnableManagedGPU/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+IMDSRestrictionOff/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+IMDSRestrictionOnWithFilterTable/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+IMDSRestrictionOnWithMangleTable/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+ImplicitlyDisableKubeletServingCertificateRotation/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+KubeletServingCertificateRotation+CustomKubeletConfig/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+KubeletServingCertificateRotation/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+ManagedGPUExperienceAFEC+Disabled/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+ManagedGPUExperienceAFEC/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+MigStrategy+Mixed/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+MigStrategy+None/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+MigStrategy+Single/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+NoArtifactStreaming/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+OutboundTypeBlocked/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+OutboundTypeNil/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+OutboundTypeNone/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+SSHStatusEntraID/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+SSHStatusOff/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+SSHStatusOn/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+SecurityProfile/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+SerializeImagePulls/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+cgroupv2/CustomData
#	pkg/agent/testdata/AKSUbuntu2204+ootcredentialprovider/CustomData
#	pkg/agent/testdata/AKSUbuntu2404+CustomLinuxOSConfigUlimit/CustomData
#	pkg/agent/testdata/AKSUbuntu2404+NetworkPolicy/CustomData
#	pkg/agent/testdata/AKSUbuntu2404+Teleport/CustomData
#	pkg/agent/testdata/AzureLinuxV2+Kata/CustomData
#	pkg/agent/testdata/AzureLinuxV3+Kata+DisableUnattendedUpgrades=false/CustomData
#	pkg/agent/testdata/AzureLinuxV3+Kata+DisableUnattendedUpgrades=true/CustomData
#	pkg/agent/testdata/AzureLinuxV3+Kata/CustomData
#	pkg/agent/testdata/AzureLinuxv2+DisableUnattendedUpgrades=false/CustomData
#	pkg/agent/testdata/AzureLinuxv2+DisableUnattendedUpgrades=true/CustomData
#	pkg/agent/testdata/AzureLinuxv2+Kata+DisableUnattendedUpgrades=false/CustomData
#	pkg/agent/testdata/AzureLinuxv2+Kata+DisableUnattendedUpgrades=true/CustomData
#	pkg/agent/testdata/CustomizedImage/CustomData
#	pkg/agent/testdata/CustomizedImageKata/CustomData
#	pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData
#	pkg/agent/testdata/Flatcar+CustomCloud+USSec/CustomData
#	pkg/agent/testdata/Flatcar+CustomCloud+USSec/CustomData.inner
#	pkg/agent/testdata/Flatcar+CustomCloud/CustomData
#	pkg/agent/testdata/Flatcar+CustomCloud/CustomData.inner
#	pkg/agent/testdata/Flatcar/CustomData
#	pkg/agent/testdata/Flatcar/CustomData.inner
#	pkg/agent/testdata/MarinerV2+CustomCloud+USNat/CustomData
#	pkg/agent/testdata/MarinerV2+CustomCloud+USSec/CustomData
#	pkg/agent/testdata/MarinerV2+CustomCloud/CustomData
#	pkg/agent/testdata/MarinerV2+Kata/CustomData
#	pkg/agent/testdata/Marinerv2+DisableUnattendedUpgrades=false/CustomData
#	pkg/agent/testdata/Marinerv2+DisableUnattendedUpgrades=true/CustomData
#	pkg/agent/testdata/Marinerv2+Kata+DisableUnattendedUpgrades=false/CustomData
#	pkg/agent/testdata/Marinerv2+Kata+DisableUnattendedUpgrades=true/CustomData
Fix critical validation ordering bug in aks-hosts-setup.sh where validation
occurred AFTER the atomic rename, allowing invalid data to be published to
the production hosts file.

## Problem

The old code flow was:
1. Write to temp file
2. Rename temp → production (atomic, but unvalidated)
3. Validate production file
4. If validation fails, exit with error

This meant that if validation failed (e.g., FQDN with no IP address), the
INVALID data was already live at /etc/localdns/hosts. CoreDNS would read
the bad entries and serve incorrect DNS responses until the next timer run
(15+ minutes later), causing:
- Failed image pulls from mcr.microsoft.com
- Pod startup failures
- DaemonSet deployment issues

## Solution

New code flow:
1. Write to temp file
2. Validate temp file (before it goes live)
3. If validation fails, delete temp file and exit
4. Only if validation passes, rename temp → production (atomic)

This ensures CoreDNS never sees invalid data. If validation fails, the old
hosts file stays intact and DNS continues working with previous valid entries.

## Changes

- **aks-hosts-setup.sh**: Move all validation logic to occur on temp file
  before the atomic rename. Add cleanup (rm temp file) on all validation
  failure paths.

- **test_aks_hosts_validation.sh**: Add Test 6 to verify validate-before-rename
  behavior. Test confirms that when temp file has invalid data, production
  file remains unchanged.

- **verify_shell.sh**: Add aks-hosts-setup.sh and test_aks_hosts_validation.sh
  to BASH_ONLY_LIST to skip POSIX compliance checks. These scripts require
  bash features (arrays, [[]], regex matching with =~).

- **testdata/***: Regenerate all CustomData snapshots to reflect the validation
  ordering fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update the comment in select_localdns_corefile() to remove outdated
reference to RandomizedDelaySec=60s, which was removed from the timer
configuration in an earlier commit.

The comment now accurately reflects:
- Timer triggers immediately (OnBootSec=0)
- DNS resolution takes 18-36 seconds (6 FQDNs × 3s timeout)
- 30 second wait timeout is sufficient

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 48 out of 180 changed files in this pull request and generated 2 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +305 to +315
AKS_HOSTS_SETUP_SCRIPT_SRC=/home/packer/aks-hosts-setup.sh
AKS_HOSTS_SETUP_SCRIPT_DEST=/opt/azure/containers/aks-hosts-setup.sh
cpAndMode $AKS_HOSTS_SETUP_SCRIPT_SRC $AKS_HOSTS_SETUP_SCRIPT_DEST 0755

AKS_HOSTS_SETUP_SERVICE_SRC=/home/packer/aks-hosts-setup.service
AKS_HOSTS_SETUP_SERVICE_DEST=/etc/systemd/system/aks-hosts-setup.service
cpAndMode $AKS_HOSTS_SETUP_SERVICE_SRC $AKS_HOSTS_SETUP_SERVICE_DEST 0644

AKS_HOSTS_SETUP_TIMER_SRC=/home/packer/aks-hosts-setup.timer
AKS_HOSTS_SETUP_TIMER_DEST=/etc/systemd/system/aks-hosts-setup.timer
cpAndMode $AKS_HOSTS_SETUP_TIMER_SRC $AKS_HOSTS_SETUP_TIMER_DEST 0644
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyPackerFiles() now unconditionally copies /home/packer/aks-hosts-setup.{sh,service,timer}. This will fail VHD builds for templates that don't upload these artifacts (e.g. vhd-image-builder-acl.json currently only uploads localdns.* and not aks-hosts-setup.), because cpAndMode exits on any copy failure. Either add these files to every relevant vhd-image-builder-.json (including ACL) or make the copy conditional on the source files existing to keep older/variant templates building.

Copilot uses AI. Check for mistakes.
kwaksaewon and others added 4 commits March 6, 2026 21:52
Update the shellspec test expectations to match the actual stderr log
messages produced by select_localdns_corefile() in cse_helpers.sh.

Changed:
- "exists, checking for IP mappings" → "checking ${HOSTS_FILE} for content"

This matches the implementation at line 1325:
  echo "Hosts plugin is enabled, checking ${hosts_file_path} for content..." >&2

All other test assertions already match the implementation correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Block

Update enableAKSHostsSetup specs to mock systemctlEnableAndStartNoBlock
instead of systemctlEnableAndStart, matching the actual implementation in
cse_config.sh which calls systemctlEnableAndStartNoBlock.

Changes:
- Mock function: systemctlEnableAndStart → systemctlEnableAndStartNoBlock
- Update test assertions to expect "systemctlEnableAndStartNoBlock"
- Update test titles to reference correct function name

This ensures tests accurately reflect the non-blocking timer startup
behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `command -v /opt/bin/kubectl` with `[ -x /opt/bin/kubectl ]`
for checking kubectl binary existence. The `-x` test is more reliable
for absolute paths since `command -v` behavior varies across shells
and can produce false negatives/positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove test case expecting "Running initial aks-hosts-setup to resolve DNS..."
output. The enableAKSHostsSetup function does not run the setup script
directly - it only enables the systemd timer which runs the script
asynchronously. The test was expecting behavior that was never implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 48 out of 180 changed files in this pull request and generated 2 comments.


You can also share your feedback on Copilot code review. Take the survey.

kwaksaewon and others added 2 commits March 6, 2026 22:30
Remove `head -n1` from corefile variant detection log. The hosts plugin
directive "hosts /etc/localdns/hosts" does not appear on the first line
of the corefile, so the previous check would almost always report
"WITHOUT hosts plugin" even when the hosts plugin was present.

Now using `grep -q` to search the entire decoded corefile content,
which correctly identifies whether the hosts plugin is configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add guard to check if profile is nil before dereferencing it in
GenerateLocalDNSCoreFile. Without this check, the function would panic
if config.AgentPoolProfile is nil when GetGeneratedLocalDNSCoreFile or
GetGeneratedLocalDNSCoreFileNoHosts template functions are called.

Now returns empty corefile when profile is nil, matching the behavior
when LocalDNS is disabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 48 out of 180 changed files in this pull request and generated 2 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +1327 to +1350
# Wait up to 30 seconds for the hosts file to be populated with IP mappings
# aks-hosts-setup timer has OnBootSec=0 which triggers immediately when timer starts.
# DNS resolution typically completes within 18-36 seconds (6 FQDNs × 3s timeout per query).
# 30 seconds provides reasonable time for initial DNS resolution without blocking provisioning.
local timeout=30
local wait_interval=5
local elapsed=0

while [ $elapsed -lt $timeout ]; do
if [ -f "${hosts_file_path}" ]; then
if grep -qE '^[0-9a-fA-F.:]+[[:space:]]+[a-zA-Z]' "${hosts_file_path}"; then
echo "aks-hosts-setup produced hosts file with IP mappings after ${elapsed}s, using corefile with hosts plugin" >&2
echo "${corefile_with_hosts}"
return 0
fi
fi

if [ $elapsed -eq 0 ]; then
echo "Waiting for aks-hosts-setup to populate ${hosts_file_path} (timeout: ${timeout}s)..." >&2
fi

sleep $wait_interval
elapsed=$((elapsed + wait_interval))
done
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select_localdns_corefile waits only 30s for /etc/localdns/hosts to contain at least one mapping. aks-hosts-setup.sh does up to 2 DNS queries (A + AAAA) per FQDN with a 3s timeout each; for the public cloud list this can be ~12 queries, i.e., up to ~36s in the worst case (even before network-online delays). With the current 30s timeout, a healthy but slow first run can cause a permanent fallback to the no-hosts Corefile on boot. Consider increasing the timeout (or triggering an initial synchronous run of aks-hosts-setup.service) to better match the worst-case resolution time.

Copilot uses AI. Check for mistakes.
Comment on lines +983 to +991
It 'should warn when initial hosts setup script fails'
echo '#!/bin/bash
exit 1' > "$AKS_HOSTS_SETUP_SCRIPT"
chmod +x "$AKS_HOSTS_SETUP_SCRIPT"
When call enableAKSHostsSetup
The status should be success
The output should include "Warning: Initial hosts setup failed"
The output should include "Enabling aks-hosts-setup timer..."
End
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new spec expects enableAKSHostsSetup to attempt an initial run and log "Warning: Initial hosts setup failed" when the setup script exits non-zero, but enableAKSHostsSetup in cse_config.sh currently never runs aks-hosts-setup.sh (it only enables the timer and touches the hosts file). This will make this test (and any similar ones) fail. Either update enableAKSHostsSetup to perform an initial run (and emit the warning on failure) or adjust/remove this test case to match the implemented behavior.

Copilot uses AI. Check for mistakes.
Add aks-hosts-setup.{sh,service,timer} file uploads to the ACL VHD
build template to bring ACL to feature parity with other OS variants
(Ubuntu, Mariner, Flatcar).

These files were missing from the ACL template because ACL was added
to the codebase after aks-hosts-setup was introduced. Without these
files, packer_source.sh's copyPackerFiles() would fail during ACL VHD
builds when trying to copy non-existent files from /home/packer/.

This change ensures ACL VHDs include the localdns hosts plugin
functionality for DNS caching of critical AKS FQDNs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update the aks-hosts-setup.sh ShellSpec test to validate that
mcr.azk8s.cn (legacy China container registry) is resolved in
AzureChinaCloud environment, matching the recent addition to the
script itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 48 out of 181 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

if [ "$a" -le 255 ] && [ "$b" -le 255 ] && [ "$c" -le 255 ] && [ "$d" -le 255 ]; then
echo "${a}.${b}.${c}.${d}"
fi
done
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With set -euo pipefail, the resolve_ipv4 pipeline can exit non-zero when grep finds no matches (e.g., domains with no A records). Because the pipeline isn’t guarded, that non-zero status can terminate the whole script even though “no records” should be a normal case. Make the pipeline tolerant (e.g., ensure it returns 0 when there are no matches) so missing A records just yield an empty result instead of aborting the run.

Suggested change
done
done || true

Copilot uses AI. Check for mistakes.
kwaksaewon and others added 4 commits March 11, 2026 00:18
Limit aks-hosts-setup.sh cloud support to:
- AzurePublicCloud
- AzureChinaCloud
- AzureUSGovernmentCloud

Remove support for USNatCloud, USSecCloud, AzureStackCloud,
AzureGermanCloud, AzureGermanyCloud, and AzureBleuCloud.

Script now exits with error code 1 for unsupported clouds with
message: "ERROR: The following cloud is not supported: <cloud>"

Updated unit tests to verify removed clouds properly fail and
added tests for each removed cloud environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Accept theirs for all CustomData conflicts
- Will regenerate CustomData after merge
- Regenerated all CustomData files with 'make generate' after merging origin/main
- Removed outdated test 'should warn when initial hosts setup script fails'
  - This test expected a warning message that no longer exists in the current implementation
  - The enableAKSHostsSetup() function now only enables the timer without running an initial setup script
- All remaining tests for enableAKSHostsSetup() pass, including cloud validation tests
…ustomData

- Added back the cloud validation case statement that was lost during merge
- Validates TARGET_CLOUD is one of: AzurePublicCloud, AzureChinaCloud, or AzureUSGovernmentCloud
- Returns with warning for unsupported clouds
- Regenerated all CustomData files with 'make generate'
- All 85 unit tests now pass
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 195 changed files in this pull request and generated 3 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +724 to +727
// The corefile with hosts plugin is written to /opt/azure/containers/localdns/localdns.corefile.
// The corefile without hosts plugin is written to /opt/azure/containers/localdns/localdns-nohosts.corefile
// and used as a fallback when enableAKSHostsSetup fails or when older VHDs don't have aks-hosts-setup artifacts.
// Runtime selection happens in cse_main.sh based on the availability of /etc/localdns/hosts.
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the generated corefile variants are "written" to specific paths (including localdns-nohosts.corefile), but this helper only returns base64 content; the actual write happens later in shell. This is misleading for future maintainers—either update the comment to reflect that this function only generates/encodes the corefile, or implement writing the no-hosts variant if that’s the intent.

Suggested change
// The corefile with hosts plugin is written to /opt/azure/containers/localdns/localdns.corefile.
// The corefile without hosts plugin is written to /opt/azure/containers/localdns/localdns-nohosts.corefile
// and used as a fallback when enableAKSHostsSetup fails or when older VHDs don't have aks-hosts-setup artifacts.
// Runtime selection happens in cse_main.sh based on the availability of /etc/localdns/hosts.
// This helper only generates and encodes the corefile variants; it does not write them to disk.
// The provisioning scripts (e.g., cse_main.sh) use these values to write the corefile with hosts
// plugin to /opt/azure/containers/localdns/localdns.corefile and the corefile without hosts plugin
// to /opt/azure/containers/localdns/localdns-nohosts.corefile, using the latter as a fallback when
// enableAKSHostsSetup fails or when older VHDs don't have aks-hosts-setup artifacts. Runtime selection
// happens in cse_main.sh based on the availability of /etc/localdns/hosts.

Copilot uses AI. Check for mistakes.
Comment on lines 1293 to 1296

echo "localdns should be enabled."
systemctlEnableAndStart localdns 30 || exit $ERR_LOCALDNS_FAIL
echo "Enable localdns succeeded."
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enableLocalDNS hard-fails if systemctlEnableAndStart localdns fails. On older VHDs that don't ship localdns.service, this causes provisioning failure (systemctl restart retries then exits). To preserve backward compatibility, add a guard before starting/enabling localdns (e.g., check systemctl list-unit-files for localdns.service or verify the unit file exists) and skip localdns setup with a warning instead of exiting when the unit is missing.

Copilot generated this review using guidance from repository custom instructions.
ExecStart=/opt/azure/containers/aks-hosts-setup.sh

[Install]
WantedBy=multi-user.target kubelet.service
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aks-hosts-setup.service is a oneshot intended to be triggered by the timer. Having WantedBy=... kubelet.service means enabling the service could also hook it to kubelet restarts/ordering, which is surprising and can cause extra DNS resolution runs unrelated to the timer. Consider limiting WantedBy to multi-user.target (or removing the install section entirely if the service is never meant to be enabled directly) and rely on enabling aks-hosts-setup.timer only.

Suggested change
WantedBy=multi-user.target kubelet.service
WantedBy=multi-user.target

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants