feat: enable localdns hosts plugin to cache critical AKS FQDNs#7639
feat: enable localdns hosts plugin to cache critical AKS FQDNs#7639
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a periodic “local DNS cache” for mcr.microsoft.com that writes resolved IPs into /etc/hosts.testing and wires CoreDNS/localdns to consult that file before forwarding, improving reliability and latency for MCR image pulls when LocalDNS is enabled (especially in the scriptless path).
Changes:
- Adds
mcr-hosts-setupscript, systemd service, and timer into the VHD build pipeline and node provisioning flow, including a newshouldEnableMCRHostsSetuphelper and CSE wiring to enable the timer when LocalDNS (scriptless) is enabled. - Updates the localdns CoreDNS template and associated tests to add a
hosts /etc/hosts.testingplugin block so MCR lookups can be served from the generated hosts file before going to Azure DNS. - Adds targeted shellspec coverage for the new
mcr-hosts-setupbehavior and for enabling its timer, and refreshes bakedCustomDatablobs used in VHD-related tests to include the new artifacts.
Reviewed changes
Copilot reviewed 31 out of 83 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
vhdbuilder/packer/vhd-image-builder-mariner*.json, vhdbuilder/packer/vhd-image-builder-flatcar*.json, vhdbuilder/packer/vhd-image-builder-cvm.json, vhdbuilder/packer/vhd-image-builder-base.json |
Ensures new mcr-hosts-setup.sh, .service, and .timer artifacts are copied into /home/packer during various VHD builds so they can be installed into the image. |
vhdbuilder/packer/vhd-image-builder-arm64-gen2.json |
Same as above for the ARM64 Gen2 image; one object is slightly misformatted compared to the rest of the JSON. |
vhdbuilder/packer/packer_source.sh |
Copies mcr-hosts-setup.sh to /opt/azure/containers and installs the corresponding systemd service and timer units into /etc/systemd/system with appropriate permissions. |
parts/linux/cloud-init/artifacts/mcr-hosts-setup.sh |
New script that resolves A/AAAA records for mcr.microsoft.com via dig and writes them into /etc/hosts.testing, logging both summary counts and the concrete IPs. |
parts/linux/cloud-init/artifacts/mcr-hosts-setup.service |
Ones-shot systemd unit that runs the mcr-hosts-setup.sh script after network-online.target is reached. |
parts/linux/cloud-init/artifacts/mcr-hosts-setup.timer |
Systemd timer that triggers mcr-hosts-setup.service at boot and every 5 minutes thereafter, with jitter and ordering relative to localdns.service. |
parts/linux/cloud-init/artifacts/cse_config.sh |
Adds shouldEnableMCRHostsSetup, which uses systemctlEnableAndStart mcr-hosts-setup.timer 30 to enable/start the timer and logs descriptive messages. |
parts/linux/cloud-init/artifacts/cse_main.sh |
Integrates shouldEnableMCRHostsSetup into the base provisioning flow, calling it when SHOULD_ENABLE_LOCALDNS is true so the timer is only enabled alongside LocalDNS scriptless corefile generation. |
pkg/agent/baker.go |
Extends the LocalDNS CoreDNS template so that, when $isRootDomain is true, a hosts /etc/hosts.testing { fallthrough } block is inserted before the Azure DNS forwarder. |
pkg/agent/baker_test.go |
Updates expected localdns corefile strings in tests to include the new hosts /etc/hosts.testing stanza, ensuring the template change is validated. |
spec/parts/linux/cloud-init/artifacts/mcr_hosts_setup_spec.sh |
New shellspec tests that (by re-simulating the logic) verify hosts file generation and content based on mocked dig output; currently they do not execute the real script, which has maintainability implications. |
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh |
Adds tests to ensure shouldEnableMCRHostsSetup echoes the expected messages and calls systemctlEnableAndStart mcr-hosts-setup.timer 30. |
pkg/agent/testdata/CustomizedImage*/CustomData |
Refreshes gzipped CustomData payloads to include the new artifacts and behavior, keeping VHD-related tests aligned with the new provisioning logic. |
…refile selection Changes: 1. Remove synchronous DNS resolution from enableAKSHostsSetup() - Previously ran aks-hosts-setup.sh immediately during CSE, blocking provisioning - Now only enables the timer asynchronously with systemctlEnableAndStartNoBlock - Hosts file populated by timer in background (OnBootSec=0 + RandomizedDelaySec=60s) - Reduces CSE provisioning time by ~5-40 seconds depending on DNS latency 2. Add 30-second timeout wait in select_localdns_corefile() - Polls hosts file every 5 seconds to check for IP→FQDN mappings - Uses hosts plugin corefile if mappings found within 30s - Falls back to non-hosts corefile on timeout (safe degradation) - Validates hosts file content with grep -qE '^[0-9a-fA-F.:]+[[:space:]]+[a-zA-Z]' Benefits: - Faster node provisioning (async DNS resolution doesn't block CSE) - Still benefits from DNS caching when timer completes quickly - Graceful fallback if DNS resolution takes longer than expected - No provisioning failures if aks-hosts-setup timer is delayed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 46 out of 170 changed files in this pull request and generated 3 comments.
You can also share your feedback on Copilot code review. Take the survey.
…ts-setup.sh The validation logic at lines 264-293 was a weaker duplicate of the thorough validation at lines 213-262. The second block: - Used inverted logic (negative validation vs positive) - Did not count valid entries (could pass with zero valid mappings) - Provided less detailed error messages - Was completely redundant since first block already validated everything Removed 31 lines of duplicate code while keeping the more thorough first validation block which: - Counts valid entries and ensures at least one exists - Provides detailed per-line error messages - Uses clearer positive validation logic (if IPv4 then count, elif IPv6 then count) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the 60-second random delay from aks-hosts-setup.timer to fix a race condition where select_localdns_corefile() times out before the hosts file is populated. The random delay was intended to prevent thundering herd when multiple nodes boot simultaneously, but it causes the hosts plugin to be disabled on ~50% of nodes: - select_localdns_corefile() waits 30s for /etc/localdns/hosts - Timer triggers immediately (OnBootSec=0) but adds 0-60s random delay - If delay > 30s, selection times out and falls back to no-hosts corefile Removing the delay is safe because: 1. DNS resolution naturally staggers (6 FQDNs × 3s timeout per query) 2. Azure DNS (168.63.129.16) is designed to handle massive scale 3. Operation is read-only (no shared write bottleneck) 4. DNS resolution completes in ~18-36s, well within 30s timeout Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 47 out of 170 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
e2e/validation.go:59
for i := range maxRetriesdoes not compile (cannot range over an int). This should be a standard counted loop (e.g.,for i := 0; i < maxRetries; i++ { ... }).
func ValidatePodRunningWithRetry(ctx context.Context, s *Scenario, pod *corev1.Pod, maxRetries int) {
var err error
for i := range maxRetries {
err = validatePodRunning(ctx, s, pod)
if err != nil {
time.Sleep(1 * time.Second)
s.T.Logf("retrying pod %q validation (%d/%d)", pod.Name, i+1, maxRetries)
continue
You can also share your feedback on Copilot code review. Take the survey.
# Conflicts: # pkg/agent/testdata/AKSUbuntu2204+China/CustomData # pkg/agent/testdata/AKSUbuntu2204+Containerd+CDI/CustomData # pkg/agent/testdata/AKSUbuntu2204+Containerd+DevicePlugin/CustomData # pkg/agent/testdata/AKSUbuntu2204+Containerd+MIG+ArtifactStreaming/CustomData # pkg/agent/testdata/AKSUbuntu2204+Containerd+MIG/CustomData # pkg/agent/testdata/AKSUbuntu2204+CustomCloud+USNat/CustomData # pkg/agent/testdata/AKSUbuntu2204+CustomCloud+USSec/CustomData # pkg/agent/testdata/AKSUbuntu2204+CustomCloud+ootcredentialprovider/CustomData # pkg/agent/testdata/AKSUbuntu2204+CustomCloud/CustomData # pkg/agent/testdata/AKSUbuntu2204+CustomKubeletConfig+CustomLinuxOSConfig/CustomData # pkg/agent/testdata/AKSUbuntu2204+CustomKubeletConfig+SerializeImagePulls/CustomData # pkg/agent/testdata/AKSUbuntu2204+DisableKubeletServingCertificateRotation+CustomKubeletConfig/CustomData # pkg/agent/testdata/AKSUbuntu2204+DisableKubeletServingCertificateRotation/CustomData # pkg/agent/testdata/AKSUbuntu2204+EnableManagedGPU+Disabled/CustomData # pkg/agent/testdata/AKSUbuntu2204+EnableManagedGPU/CustomData # pkg/agent/testdata/AKSUbuntu2204+IMDSRestrictionOff/CustomData # pkg/agent/testdata/AKSUbuntu2204+IMDSRestrictionOnWithFilterTable/CustomData # pkg/agent/testdata/AKSUbuntu2204+IMDSRestrictionOnWithMangleTable/CustomData # pkg/agent/testdata/AKSUbuntu2204+ImplicitlyDisableKubeletServingCertificateRotation/CustomData # pkg/agent/testdata/AKSUbuntu2204+KubeletServingCertificateRotation+CustomKubeletConfig/CustomData # pkg/agent/testdata/AKSUbuntu2204+KubeletServingCertificateRotation/CustomData # pkg/agent/testdata/AKSUbuntu2204+ManagedGPUExperienceAFEC+Disabled/CustomData # pkg/agent/testdata/AKSUbuntu2204+ManagedGPUExperienceAFEC/CustomData # pkg/agent/testdata/AKSUbuntu2204+MigStrategy+Mixed/CustomData # pkg/agent/testdata/AKSUbuntu2204+MigStrategy+None/CustomData # pkg/agent/testdata/AKSUbuntu2204+MigStrategy+Single/CustomData # pkg/agent/testdata/AKSUbuntu2204+NoArtifactStreaming/CustomData # pkg/agent/testdata/AKSUbuntu2204+OutboundTypeBlocked/CustomData # pkg/agent/testdata/AKSUbuntu2204+OutboundTypeNil/CustomData # pkg/agent/testdata/AKSUbuntu2204+OutboundTypeNone/CustomData # pkg/agent/testdata/AKSUbuntu2204+SSHStatusEntraID/CustomData # pkg/agent/testdata/AKSUbuntu2204+SSHStatusOff/CustomData # pkg/agent/testdata/AKSUbuntu2204+SSHStatusOn/CustomData # pkg/agent/testdata/AKSUbuntu2204+SecurityProfile/CustomData # pkg/agent/testdata/AKSUbuntu2204+SerializeImagePulls/CustomData # pkg/agent/testdata/AKSUbuntu2204+cgroupv2/CustomData # pkg/agent/testdata/AKSUbuntu2204+ootcredentialprovider/CustomData # pkg/agent/testdata/AKSUbuntu2404+CustomLinuxOSConfigUlimit/CustomData # pkg/agent/testdata/AKSUbuntu2404+NetworkPolicy/CustomData # pkg/agent/testdata/AKSUbuntu2404+Teleport/CustomData # pkg/agent/testdata/AzureLinuxV2+Kata/CustomData # pkg/agent/testdata/AzureLinuxV3+Kata+DisableUnattendedUpgrades=false/CustomData # pkg/agent/testdata/AzureLinuxV3+Kata+DisableUnattendedUpgrades=true/CustomData # pkg/agent/testdata/AzureLinuxV3+Kata/CustomData # pkg/agent/testdata/AzureLinuxv2+DisableUnattendedUpgrades=false/CustomData # pkg/agent/testdata/AzureLinuxv2+DisableUnattendedUpgrades=true/CustomData # pkg/agent/testdata/AzureLinuxv2+Kata+DisableUnattendedUpgrades=false/CustomData # pkg/agent/testdata/AzureLinuxv2+Kata+DisableUnattendedUpgrades=true/CustomData # pkg/agent/testdata/CustomizedImage/CustomData # pkg/agent/testdata/CustomizedImageKata/CustomData # pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData # pkg/agent/testdata/Flatcar+CustomCloud+USSec/CustomData # pkg/agent/testdata/Flatcar+CustomCloud+USSec/CustomData.inner # pkg/agent/testdata/Flatcar+CustomCloud/CustomData # pkg/agent/testdata/Flatcar+CustomCloud/CustomData.inner # pkg/agent/testdata/Flatcar/CustomData # pkg/agent/testdata/Flatcar/CustomData.inner # pkg/agent/testdata/MarinerV2+CustomCloud+USNat/CustomData # pkg/agent/testdata/MarinerV2+CustomCloud+USSec/CustomData # pkg/agent/testdata/MarinerV2+CustomCloud/CustomData # pkg/agent/testdata/MarinerV2+Kata/CustomData # pkg/agent/testdata/Marinerv2+DisableUnattendedUpgrades=false/CustomData # pkg/agent/testdata/Marinerv2+DisableUnattendedUpgrades=true/CustomData # pkg/agent/testdata/Marinerv2+Kata+DisableUnattendedUpgrades=false/CustomData # pkg/agent/testdata/Marinerv2+Kata+DisableUnattendedUpgrades=true/CustomData
Fix critical validation ordering bug in aks-hosts-setup.sh where validation occurred AFTER the atomic rename, allowing invalid data to be published to the production hosts file. ## Problem The old code flow was: 1. Write to temp file 2. Rename temp → production (atomic, but unvalidated) 3. Validate production file 4. If validation fails, exit with error This meant that if validation failed (e.g., FQDN with no IP address), the INVALID data was already live at /etc/localdns/hosts. CoreDNS would read the bad entries and serve incorrect DNS responses until the next timer run (15+ minutes later), causing: - Failed image pulls from mcr.microsoft.com - Pod startup failures - DaemonSet deployment issues ## Solution New code flow: 1. Write to temp file 2. Validate temp file (before it goes live) 3. If validation fails, delete temp file and exit 4. Only if validation passes, rename temp → production (atomic) This ensures CoreDNS never sees invalid data. If validation fails, the old hosts file stays intact and DNS continues working with previous valid entries. ## Changes - **aks-hosts-setup.sh**: Move all validation logic to occur on temp file before the atomic rename. Add cleanup (rm temp file) on all validation failure paths. - **test_aks_hosts_validation.sh**: Add Test 6 to verify validate-before-rename behavior. Test confirms that when temp file has invalid data, production file remains unchanged. - **verify_shell.sh**: Add aks-hosts-setup.sh and test_aks_hosts_validation.sh to BASH_ONLY_LIST to skip POSIX compliance checks. These scripts require bash features (arrays, [[]], regex matching with =~). - **testdata/***: Regenerate all CustomData snapshots to reflect the validation ordering fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update the comment in select_localdns_corefile() to remove outdated reference to RandomizedDelaySec=60s, which was removed from the timer configuration in an earlier commit. The comment now accurately reflects: - Timer triggers immediately (OnBootSec=0) - DNS resolution takes 18-36 seconds (6 FQDNs × 3s timeout) - 30 second wait timeout is sufficient Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 180 changed files in this pull request and generated 2 comments.
You can also share your feedback on Copilot code review. Take the survey.
| AKS_HOSTS_SETUP_SCRIPT_SRC=/home/packer/aks-hosts-setup.sh | ||
| AKS_HOSTS_SETUP_SCRIPT_DEST=/opt/azure/containers/aks-hosts-setup.sh | ||
| cpAndMode $AKS_HOSTS_SETUP_SCRIPT_SRC $AKS_HOSTS_SETUP_SCRIPT_DEST 0755 | ||
|
|
||
| AKS_HOSTS_SETUP_SERVICE_SRC=/home/packer/aks-hosts-setup.service | ||
| AKS_HOSTS_SETUP_SERVICE_DEST=/etc/systemd/system/aks-hosts-setup.service | ||
| cpAndMode $AKS_HOSTS_SETUP_SERVICE_SRC $AKS_HOSTS_SETUP_SERVICE_DEST 0644 | ||
|
|
||
| AKS_HOSTS_SETUP_TIMER_SRC=/home/packer/aks-hosts-setup.timer | ||
| AKS_HOSTS_SETUP_TIMER_DEST=/etc/systemd/system/aks-hosts-setup.timer | ||
| cpAndMode $AKS_HOSTS_SETUP_TIMER_SRC $AKS_HOSTS_SETUP_TIMER_DEST 0644 |
There was a problem hiding this comment.
copyPackerFiles() now unconditionally copies /home/packer/aks-hosts-setup.{sh,service,timer}. This will fail VHD builds for templates that don't upload these artifacts (e.g. vhd-image-builder-acl.json currently only uploads localdns.* and not aks-hosts-setup.), because cpAndMode exits on any copy failure. Either add these files to every relevant vhd-image-builder-.json (including ACL) or make the copy conditional on the source files existing to keep older/variant templates building.
Update the shellspec test expectations to match the actual stderr log
messages produced by select_localdns_corefile() in cse_helpers.sh.
Changed:
- "exists, checking for IP mappings" → "checking ${HOSTS_FILE} for content"
This matches the implementation at line 1325:
echo "Hosts plugin is enabled, checking ${hosts_file_path} for content..." >&2
All other test assertions already match the implementation correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Block Update enableAKSHostsSetup specs to mock systemctlEnableAndStartNoBlock instead of systemctlEnableAndStart, matching the actual implementation in cse_config.sh which calls systemctlEnableAndStartNoBlock. Changes: - Mock function: systemctlEnableAndStart → systemctlEnableAndStartNoBlock - Update test assertions to expect "systemctlEnableAndStartNoBlock" - Update test titles to reference correct function name This ensures tests accurately reflect the non-blocking timer startup behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `command -v /opt/bin/kubectl` with `[ -x /opt/bin/kubectl ]` for checking kubectl binary existence. The `-x` test is more reliable for absolute paths since `command -v` behavior varies across shells and can produce false negatives/positives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove test case expecting "Running initial aks-hosts-setup to resolve DNS..." output. The enableAKSHostsSetup function does not run the setup script directly - it only enables the systemd timer which runs the script asynchronously. The test was expecting behavior that was never implemented. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 180 changed files in this pull request and generated 2 comments.
You can also share your feedback on Copilot code review. Take the survey.
Remove `head -n1` from corefile variant detection log. The hosts plugin directive "hosts /etc/localdns/hosts" does not appear on the first line of the corefile, so the previous check would almost always report "WITHOUT hosts plugin" even when the hosts plugin was present. Now using `grep -q` to search the entire decoded corefile content, which correctly identifies whether the hosts plugin is configured. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add guard to check if profile is nil before dereferencing it in GenerateLocalDNSCoreFile. Without this check, the function would panic if config.AgentPoolProfile is nil when GetGeneratedLocalDNSCoreFile or GetGeneratedLocalDNSCoreFileNoHosts template functions are called. Now returns empty corefile when profile is nil, matching the behavior when LocalDNS is disabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 180 changed files in this pull request and generated 2 comments.
You can also share your feedback on Copilot code review. Take the survey.
| # Wait up to 30 seconds for the hosts file to be populated with IP mappings | ||
| # aks-hosts-setup timer has OnBootSec=0 which triggers immediately when timer starts. | ||
| # DNS resolution typically completes within 18-36 seconds (6 FQDNs × 3s timeout per query). | ||
| # 30 seconds provides reasonable time for initial DNS resolution without blocking provisioning. | ||
| local timeout=30 | ||
| local wait_interval=5 | ||
| local elapsed=0 | ||
|
|
||
| while [ $elapsed -lt $timeout ]; do | ||
| if [ -f "${hosts_file_path}" ]; then | ||
| if grep -qE '^[0-9a-fA-F.:]+[[:space:]]+[a-zA-Z]' "${hosts_file_path}"; then | ||
| echo "aks-hosts-setup produced hosts file with IP mappings after ${elapsed}s, using corefile with hosts plugin" >&2 | ||
| echo "${corefile_with_hosts}" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| if [ $elapsed -eq 0 ]; then | ||
| echo "Waiting for aks-hosts-setup to populate ${hosts_file_path} (timeout: ${timeout}s)..." >&2 | ||
| fi | ||
|
|
||
| sleep $wait_interval | ||
| elapsed=$((elapsed + wait_interval)) | ||
| done |
There was a problem hiding this comment.
select_localdns_corefile waits only 30s for /etc/localdns/hosts to contain at least one mapping. aks-hosts-setup.sh does up to 2 DNS queries (A + AAAA) per FQDN with a 3s timeout each; for the public cloud list this can be ~12 queries, i.e., up to ~36s in the worst case (even before network-online delays). With the current 30s timeout, a healthy but slow first run can cause a permanent fallback to the no-hosts Corefile on boot. Consider increasing the timeout (or triggering an initial synchronous run of aks-hosts-setup.service) to better match the worst-case resolution time.
| It 'should warn when initial hosts setup script fails' | ||
| echo '#!/bin/bash | ||
| exit 1' > "$AKS_HOSTS_SETUP_SCRIPT" | ||
| chmod +x "$AKS_HOSTS_SETUP_SCRIPT" | ||
| When call enableAKSHostsSetup | ||
| The status should be success | ||
| The output should include "Warning: Initial hosts setup failed" | ||
| The output should include "Enabling aks-hosts-setup timer..." | ||
| End |
There was a problem hiding this comment.
The new spec expects enableAKSHostsSetup to attempt an initial run and log "Warning: Initial hosts setup failed" when the setup script exits non-zero, but enableAKSHostsSetup in cse_config.sh currently never runs aks-hosts-setup.sh (it only enables the timer and touches the hosts file). This will make this test (and any similar ones) fail. Either update enableAKSHostsSetup to perform an initial run (and emit the warning on failure) or adjust/remove this test case to match the implemented behavior.
Add aks-hosts-setup.{sh,service,timer} file uploads to the ACL VHD
build template to bring ACL to feature parity with other OS variants
(Ubuntu, Mariner, Flatcar).
These files were missing from the ACL template because ACL was added
to the codebase after aks-hosts-setup was introduced. Without these
files, packer_source.sh's copyPackerFiles() would fail during ACL VHD
builds when trying to copy non-existent files from /home/packer/.
This change ensures ACL VHDs include the localdns hosts plugin
functionality for DNS caching of critical AKS FQDNs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update the aks-hosts-setup.sh ShellSpec test to validate that mcr.azk8s.cn (legacy China container registry) is resolved in AzureChinaCloud environment, matching the recent addition to the script itself. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 181 changed files in this pull request and generated 1 comment.
You can also share your feedback on Copilot code review. Take the survey.
| if [ "$a" -le 255 ] && [ "$b" -le 255 ] && [ "$c" -le 255 ] && [ "$d" -le 255 ]; then | ||
| echo "${a}.${b}.${c}.${d}" | ||
| fi | ||
| done |
There was a problem hiding this comment.
With set -euo pipefail, the resolve_ipv4 pipeline can exit non-zero when grep finds no matches (e.g., domains with no A records). Because the pipeline isn’t guarded, that non-zero status can terminate the whole script even though “no records” should be a normal case. Make the pipeline tolerant (e.g., ensure it returns 0 when there are no matches) so missing A records just yield an empty result instead of aborting the run.
| done | |
| done || true |
Limit aks-hosts-setup.sh cloud support to: - AzurePublicCloud - AzureChinaCloud - AzureUSGovernmentCloud Remove support for USNatCloud, USSecCloud, AzureStackCloud, AzureGermanCloud, AzureGermanyCloud, and AzureBleuCloud. Script now exits with error code 1 for unsupported clouds with message: "ERROR: The following cloud is not supported: <cloud>" Updated unit tests to verify removed clouds properly fail and added tests for each removed cloud environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Accept theirs for all CustomData conflicts - Will regenerate CustomData after merge
- Regenerated all CustomData files with 'make generate' after merging origin/main - Removed outdated test 'should warn when initial hosts setup script fails' - This test expected a warning message that no longer exists in the current implementation - The enableAKSHostsSetup() function now only enables the timer without running an initial setup script - All remaining tests for enableAKSHostsSetup() pass, including cloud validation tests
…ustomData - Added back the cloud validation case statement that was lost during merge - Validates TARGET_CLOUD is one of: AzurePublicCloud, AzureChinaCloud, or AzureUSGovernmentCloud - Returns with warning for unsupported clouds - Regenerated all CustomData files with 'make generate' - All 85 unit tests now pass
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 195 changed files in this pull request and generated 3 comments.
You can also share your feedback on Copilot code review. Take the survey.
| // The corefile with hosts plugin is written to /opt/azure/containers/localdns/localdns.corefile. | ||
| // The corefile without hosts plugin is written to /opt/azure/containers/localdns/localdns-nohosts.corefile | ||
| // and used as a fallback when enableAKSHostsSetup fails or when older VHDs don't have aks-hosts-setup artifacts. | ||
| // Runtime selection happens in cse_main.sh based on the availability of /etc/localdns/hosts. |
There was a problem hiding this comment.
The comment says the generated corefile variants are "written" to specific paths (including localdns-nohosts.corefile), but this helper only returns base64 content; the actual write happens later in shell. This is misleading for future maintainers—either update the comment to reflect that this function only generates/encodes the corefile, or implement writing the no-hosts variant if that’s the intent.
| // The corefile with hosts plugin is written to /opt/azure/containers/localdns/localdns.corefile. | |
| // The corefile without hosts plugin is written to /opt/azure/containers/localdns/localdns-nohosts.corefile | |
| // and used as a fallback when enableAKSHostsSetup fails or when older VHDs don't have aks-hosts-setup artifacts. | |
| // Runtime selection happens in cse_main.sh based on the availability of /etc/localdns/hosts. | |
| // This helper only generates and encodes the corefile variants; it does not write them to disk. | |
| // The provisioning scripts (e.g., cse_main.sh) use these values to write the corefile with hosts | |
| // plugin to /opt/azure/containers/localdns/localdns.corefile and the corefile without hosts plugin | |
| // to /opt/azure/containers/localdns/localdns-nohosts.corefile, using the latter as a fallback when | |
| // enableAKSHostsSetup fails or when older VHDs don't have aks-hosts-setup artifacts. Runtime selection | |
| // happens in cse_main.sh based on the availability of /etc/localdns/hosts. |
|
|
||
| echo "localdns should be enabled." | ||
| systemctlEnableAndStart localdns 30 || exit $ERR_LOCALDNS_FAIL | ||
| echo "Enable localdns succeeded." |
There was a problem hiding this comment.
enableLocalDNS hard-fails if systemctlEnableAndStart localdns fails. On older VHDs that don't ship localdns.service, this causes provisioning failure (systemctl restart retries then exits). To preserve backward compatibility, add a guard before starting/enabling localdns (e.g., check systemctl list-unit-files for localdns.service or verify the unit file exists) and skip localdns setup with a warning instead of exiting when the unit is missing.
| ExecStart=/opt/azure/containers/aks-hosts-setup.sh | ||
|
|
||
| [Install] | ||
| WantedBy=multi-user.target kubelet.service |
There was a problem hiding this comment.
aks-hosts-setup.service is a oneshot intended to be triggered by the timer. Having WantedBy=... kubelet.service means enabling the service could also hook it to kubelet restarts/ordering, which is surprising and can cause extra DNS resolution runs unrelated to the timer. Consider limiting WantedBy to multi-user.target (or removing the install section entirely if the service is never meant to be enabled directly) and rely on enabling aks-hosts-setup.timer only.
| WantedBy=multi-user.target kubelet.service | |
| WantedBy=multi-user.target |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is adding the aks-setup-hosts.sh, aks-setup-hosts.service, and aks-service.hosts-timer to resolve the AKS critical FQDNs every 15 minutes, which are written to /etc/localdns/hosts file. This file can be mounted onto the LocalDNS corefile if EnableHostsPlugin bool from AKS-RP is passed in.
Which issue(s) this PR fixes:
Fixes #
Requirements:
Special notes for your reviewer:
Release note: