Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
feat(e2e): dynamically fetch latest VM extension version with caching
Replace hardcoded VM extension version "1.406" with a cached Azure API
call that fetches the latest version at runtime. Add a caching layer
via CachedGetLatestVMExtensionImageVersion to avoid redundant API calls
across tests. Update all callers to pass context for proper propagation.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
  • Loading branch information
surajssd committed Mar 10, 2026
commit a9e929329edc2bbf92051d273075f8b37072fc00
15 changes: 15 additions & 0 deletions e2e/cache.go
Original file line number Diff line number Diff line change
Expand Up @@ -245,3 +245,18 @@ var CachedVMSizeSupportsNVMe = cachedFunc(func(ctx context.Context, req VMSizeSK
var CachedIsVMSizeGen2Only = cachedFunc(func(ctx context.Context, req VMSizeSKURequest) (bool, error) {
return config.Azure.IsVMSizeGen2Only(ctx, req.Location, req.VMSize)
})

// GetLatestExtensionVersionRequest is the cache key for VM extension version lookups.
type GetLatestExtensionVersionRequest struct {
Location string
ExtType string
Publisher string
}

// CachedGetLatestVMExtensionImageVersion caches the result of querying the Azure API
// for the latest VM extension image version.
var CachedGetLatestVMExtensionImageVersion = cachedFunc(
func(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
return config.Azure.GetLatestVMExtensionImageVersion(ctx, req.Location, req.ExtType, req.Publisher)
},
)
Comment on lines +256 to +262
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cachedFunc caches the first error permanently for a given key. For VM extension version lookups this can turn a transient Azure API failure into a persistent failure for the rest of the test run (and it also prevents a later retry from succeeding). Consider not caching errors for this specific cached function (e.g., evict the entry on error) so warm calls only come from cache after a successful fetch.

Suggested change
// CachedGetLatestVMExtensionImageVersion caches the result of querying the Azure API
// for the latest VM extension image version.
var CachedGetLatestVMExtensionImageVersion = cachedFunc(
func(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
return config.Azure.GetLatestVMExtensionImageVersion(ctx, req.Location, req.ExtType, req.Publisher)
},
)
// vmExtensionVersionCache provides a thread-safe cache for VM extension versions.
// Only successful lookups are cached; errors are never memoized so that transient
// Azure API failures can be retried on subsequent calls.
type vmExtensionVersionCache struct {
mu sync.Mutex
data map[GetLatestExtensionVersionRequest]string
}
func newVMExtensionVersionCache() *vmExtensionVersionCache {
return &vmExtensionVersionCache{
data: make(map[GetLatestExtensionVersionRequest]string),
}
}
func (c *vmExtensionVersionCache) Get(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
// Fast path: check if we already have a cached value.
c.mu.Lock()
if v, ok := c.data[req]; ok {
c.mu.Unlock()
return v, nil
}
c.mu.Unlock()
// Not cached; fetch from Azure.
v, err := config.Azure.GetLatestVMExtensionImageVersion(ctx, req.Location, req.ExtType, req.Publisher)
if err != nil {
// Do not cache errors to avoid turning transient failures into
// persistent ones for the lifetime of the test run.
return "", err
}
// Cache the successful result.
c.mu.Lock()
c.data[req] = v
c.mu.Unlock()
return v, nil
}
var vmExtVersionCache = newVMExtensionVersionCache()
// CachedGetLatestVMExtensionImageVersion caches the result of querying the Azure API
// for the latest VM extension image version. Errors are never cached so that
// transient Azure failures can be retried.
var CachedGetLatestVMExtensionImageVersion = func(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
return vmExtVersionCache.Get(ctx, req)
}

Copilot uses AI. Check for mistakes.
46 changes: 23 additions & 23 deletions e2e/config/azure.go
Original file line number Diff line number Diff line change
Expand Up @@ -753,67 +753,67 @@ func (a *AzureClient) GetLatestVMExtensionImageVersion(ctx context.Context, loca
if err != nil {
return "", fmt.Errorf("listing extension versions: %w", err)
}

if len(resp.VirtualMachineExtensionImageArray) == 0 {
return "", fmt.Errorf("no extension versions found")
}

version := make([]VMExtenstionVersion, len(resp.VirtualMachineExtensionImageArray))
versions := make([]vmExtensionVersion, len(resp.VirtualMachineExtensionImageArray))
for i, ext := range resp.VirtualMachineExtensionImageArray {
version[i] = parseVersion(ext)
versions[i] = parseVersion(ext)
}

sort.Slice(version, func(i, j int) bool {
return version[i].Less(version[j])
sort.Slice(versions, func(i, j int) bool {
return versions[i].less(versions[j])
})

return *version[len(version)-1].Original.Name, nil
return *versions[len(versions)-1].original.Name, nil
}

// VMExtenstionVersion represents a parsed version of a VM extension image.
type VMExtenstionVersion struct {
Original *armcompute.VirtualMachineExtensionImage
Major int
Minor int
Patch int
// vmExtensionVersion represents a parsed version of a VM extension image.
type vmExtensionVersion struct {
original *armcompute.VirtualMachineExtensionImage
major int
minor int
patch int
}

// parseVersion parses the version from a VM extension image name, which can be in the format 1.151, 1.0.1, etc.
// You can find all the versions of a specific VM extension by running:
// az vm extension image list -n Compute.AKS.Linux.AKSNode
func parseVersion(v *armcompute.VirtualMachineExtensionImage) VMExtenstionVersion {
func parseVersion(v *armcompute.VirtualMachineExtensionImage) vmExtensionVersion {
// Split by dots
parts := strings.Split(*v.Name, ".")

version := VMExtenstionVersion{Original: v}
version := vmExtensionVersion{original: v}

if len(parts) >= 1 {
if major, err := strconv.Atoi(parts[0]); err == nil {
version.Major = major
version.major = major
}
}
if len(parts) >= 2 {
if minor, err := strconv.Atoi(parts[1]); err == nil {
version.Minor = minor
version.minor = minor
}
}
if len(parts) >= 3 {
if patch, err := strconv.Atoi(parts[2]); err == nil {
version.Patch = patch
version.patch = patch
}
}

return version
}

func (v VMExtenstionVersion) Less(other VMExtenstionVersion) bool {
if v.Major != other.Major {
return v.Major < other.Major
// less returns true if v is a lower version than other.
func (v vmExtensionVersion) less(other vmExtensionVersion) bool {
if v.major != other.major {
return v.major < other.major
}
if v.Minor != other.Minor {
return v.Minor < other.Minor
if v.minor != other.minor {
return v.minor < other.minor
}
return v.Patch < other.Patch
return v.patch < other.patch
}

// getResourceSKU queries the Azure Resource SKUs API to find the SKU for the given VM size and location.
Expand Down
10 changes: 5 additions & 5 deletions e2e/scenario_gpu_managed_experience_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -259,7 +259,7 @@ func Test_Ubuntu2404_NvidiaDevicePluginRunning(t *testing.T) {
vmss.Tags["EnableManagedGPUExperience"] = to.Ptr("true")

// Enable the AKS VM extension for GPU nodes
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using context.TODO() here bypasses the suite’s shared cancellation/timeout contexts (e.g., testCtx/newTestCtx), so the Azure API call used to resolve the extension version won’t be canceled during graceful shutdown. Prefer using the package-level testCtx (or a derived context with timeout) instead of context.TODO(); this pattern appears multiple times in this file and should be updated consistently.

Suggested change
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(testCtx, vmss.Location)

Copilot uses AI. Check for mistakes.
require.NoError(t, err, "creating AKS VM extension")
vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
},
Expand Down Expand Up @@ -334,7 +334,7 @@ func Test_Ubuntu2204_NvidiaDevicePluginRunning(t *testing.T) {
vmss.Tags["EnableManagedGPUExperience"] = to.Ptr("true")

// Enable the AKS VM extension for GPU nodes
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
require.NoError(t, err, "creating AKS VM extension")
vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
},
Expand Down Expand Up @@ -408,7 +408,7 @@ func Test_AzureLinux3_NvidiaDevicePluginRunning(t *testing.T) {
vmss.Tags["EnableManagedGPUExperience"] = to.Ptr("true")

// Enable the AKS VM extension for GPU nodes
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
require.NoError(t, err, "creating AKS VM extension")
vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
},
Expand Down Expand Up @@ -478,7 +478,7 @@ func Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG(t *testing.T) {
vmss.SKU.Name = to.Ptr("Standard_NC24ads_A100_v4")

// Enable the AKS VM extension for GPU nodes
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
require.NoError(t, err, "creating AKS VM extension")
vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
},
Expand Down Expand Up @@ -555,7 +555,7 @@ func Test_Ubuntu2204_NvidiaDevicePluginRunning_WithoutVMSSTag(t *testing.T) {
// to test that NBC EnableManagedGPU field works independently

// Enable the AKS VM extension for GPU nodes
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
require.NoError(t, err, "creating AKS VM extension")
vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
},
Expand Down
2 changes: 1 addition & 1 deletion e2e/scenario_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -2138,7 +2138,7 @@ func Test_Ubuntu2404_NPD_Basic(t *testing.T) {
BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) {
},
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using context.TODO() here bypasses the suite’s shared cancellation/timeout contexts (e.g., testCtx/newTestCtx), so the Azure API call used to resolve the extension version won’t be canceled during graceful shutdown. Prefer using the package-level testCtx (or a derived context with timeout) instead of context.TODO(), and apply the same change to the other createVMExtensionLinuxAKSNode call sites.

Suggested change
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(testCtx, vmss.Location)

Copilot uses AI. Check for mistakes.
require.NoError(t, err, "creating AKS VM extension")
vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
},
Expand Down
27 changes: 14 additions & 13 deletions e2e/test_helpers.go
Original file line number Diff line number Diff line change
Expand Up @@ -522,21 +522,22 @@ func addTrustedLaunchToVMSS(properties *armcompute.VirtualMachineScaleSetPropert
return properties
}

func createVMExtensionLinuxAKSNode(_ *string) (*armcompute.VirtualMachineScaleSetExtension, error) {
// Default to "westus" if location is nil.
// region := "westus"
// if location != nil {
// region = *location
// }
func createVMExtensionLinuxAKSNode(ctx context.Context, location *string) (*armcompute.VirtualMachineScaleSetExtension, error) {
region := config.Config.DefaultLocation
if location != nil {
region = *location
}

extensionName := "Compute.AKS.Linux.AKSNode"
publisher := "Microsoft.AKS"
extensionVersion := "1.406"
// NOTE (@surajssd): If this is gonna be called multiple times, then find a way to cache the latest version.
// extensionVersion, err := config.Azure.GetLatestVMExtensionImageVersion(context.TODO(), region, extensionName, publisher)
// if err != nil {
// return nil, fmt.Errorf("getting latest VM extension image version: %v", err)
// }
extensionVersion, err := CachedGetLatestVMExtensionImageVersion(ctx, GetLatestExtensionVersionRequest{
Location: region,
ExtType: extensionName,
Publisher: publisher,
})
if err != nil {
return nil, fmt.Errorf("getting latest VM extension image version: %w", err)
}
Comment on lines +525 to +540
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description/title mention adding a timing test that does a cold Azure call then a warm cached call and logs durations/versions, but this change set only switches the extension version resolution to use CachedGetLatestVMExtensionImageVersion and doesn’t add any timing assertions/logging. Please either add the intended timing test/logging (or an assertion that the second call is served from cache) or update the PR description/title to match the actual changes.

Copilot uses AI. Check for mistakes.

return &armcompute.VirtualMachineScaleSetExtension{
Name: to.Ptr(extensionName),
Expand Down Expand Up @@ -795,7 +796,7 @@ func runScenarioGPUNPD(t *testing.T, vmSize, location, k8sSystemPoolSKU string)
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
vmss.SKU.Name = to.Ptr(vmSize)

extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using context.TODO() here bypasses the suite’s shared cancellation/timeout contexts (e.g., testCtx/newTestCtx), so this Azure API call won’t be canceled during graceful shutdown and can hang longer than intended. Prefer using the package-level testCtx (or a derived context with timeout) instead of context.TODO(), and apply consistently to the other createVMExtensionLinuxAKSNode call sites.

Suggested change
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(testCtx, vmss.Location)

Copilot uses AI. Check for mistakes.
require.NoError(t, err, "creating AKS VM extension")

vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
Expand Down
Loading