Skip to content

test(e2e): add timing test for createVMExtensionLinuxAKSNode#8064

Open
surajssd wants to merge 1 commit intomainfrom
suraj/optimize-extensio-version-lookup
Open

test(e2e): add timing test for createVMExtensionLinuxAKSNode#8064
surajssd wants to merge 1 commit intomainfrom
suraj/optimize-extensio-version-lookup

Conversation

@surajssd
Copy link
Member

What this PR does / why we need it:

Validate the Azure API call timing and caching behavior of the dynamically fetched VM extension version. The test makes a cold call that hits the Azure API, then a warm call that should be served from cache, logging both durations and the returned extension version for manual inspection.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the e2e VMSS model generation to dynamically resolve the latest Compute.AKS.Linux.AKSNode VM extension version via the Azure API and memoize the result to avoid repeated lookups.

Changes:

  • Add an e2e cache key + cached wrapper for GetLatestVMExtensionImageVersion.
  • Update createVMExtensionLinuxAKSNode to accept a context.Context and use the cached Azure lookup instead of a hardcoded version.
  • Optimize GetLatestVMExtensionImageVersion to request only the top result using Top + Orderby.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
e2e/test_helpers.go Switch AKSNode extension creation to use cached latest-version lookup; update one call site for new signature.
e2e/scenario_test.go Update scenario to pass a context into createVMExtensionLinuxAKSNode.
e2e/scenario_gpu_managed_experience_test.go Update multiple GPU scenarios to pass a context into createVMExtensionLinuxAKSNode.
e2e/config/azure.go Change extension version lookup to a server-side “top 1, order by name desc” query.
e2e/cache.go Add cached wrapper + cache-key struct for VM extension latest-version lookups.


// Enable the AKS VM extension for GPU nodes
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using context.TODO() here bypasses the suite’s shared cancellation/timeout contexts (e.g., testCtx/newTestCtx), so the Azure API call used to resolve the extension version won’t be canceled during graceful shutdown. Prefer using the package-level testCtx (or a derived context with timeout) instead of context.TODO(); this pattern appears multiple times in this file and should be updated consistently.

Suggested change
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(testCtx, vmss.Location)

Copilot uses AI. Check for mistakes.
Comment on lines +525 to +540
func createVMExtensionLinuxAKSNode(ctx context.Context, location *string) (*armcompute.VirtualMachineScaleSetExtension, error) {
region := config.Config.DefaultLocation
if location != nil {
region = *location
}

extensionName := "Compute.AKS.Linux.AKSNode"
publisher := "Microsoft.AKS"
extensionVersion := "1.406"
// NOTE (@surajssd): If this is gonna be called multiple times, then find a way to cache the latest version.
// extensionVersion, err := config.Azure.GetLatestVMExtensionImageVersion(context.TODO(), region, extensionName, publisher)
// if err != nil {
// return nil, fmt.Errorf("getting latest VM extension image version: %v", err)
// }
extensionVersion, err := CachedGetLatestVMExtensionImageVersion(ctx, GetLatestExtensionVersionRequest{
Location: region,
ExtType: extensionName,
Publisher: publisher,
})
if err != nil {
return nil, fmt.Errorf("getting latest VM extension image version: %w", err)
}
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description/title mention adding a timing test that does a cold Azure call then a warm cached call and logs durations/versions, but this change set only switches the extension version resolution to use CachedGetLatestVMExtensionImageVersion and doesn’t add any timing assertions/logging. Please either add the intended timing test/logging (or an assertion that the second call is served from cache) or update the PR description/title to match the actual changes.

Copilot uses AI. Check for mistakes.
Comment on lines +256 to +262
// CachedGetLatestVMExtensionImageVersion caches the result of querying the Azure API
// for the latest VM extension image version.
var CachedGetLatestVMExtensionImageVersion = cachedFunc(
func(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
return config.Azure.GetLatestVMExtensionImageVersion(ctx, req.Location, req.ExtType, req.Publisher)
},
)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cachedFunc caches the first error permanently for a given key. For VM extension version lookups this can turn a transient Azure API failure into a persistent failure for the rest of the test run (and it also prevents a later retry from succeeding). Consider not caching errors for this specific cached function (e.g., evict the entry on error) so warm calls only come from cache after a successful fetch.

Suggested change
// CachedGetLatestVMExtensionImageVersion caches the result of querying the Azure API
// for the latest VM extension image version.
var CachedGetLatestVMExtensionImageVersion = cachedFunc(
func(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
return config.Azure.GetLatestVMExtensionImageVersion(ctx, req.Location, req.ExtType, req.Publisher)
},
)
// vmExtensionVersionCache provides a thread-safe cache for VM extension versions.
// Only successful lookups are cached; errors are never memoized so that transient
// Azure API failures can be retried on subsequent calls.
type vmExtensionVersionCache struct {
mu sync.Mutex
data map[GetLatestExtensionVersionRequest]string
}
func newVMExtensionVersionCache() *vmExtensionVersionCache {
return &vmExtensionVersionCache{
data: make(map[GetLatestExtensionVersionRequest]string),
}
}
func (c *vmExtensionVersionCache) Get(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
// Fast path: check if we already have a cached value.
c.mu.Lock()
if v, ok := c.data[req]; ok {
c.mu.Unlock()
return v, nil
}
c.mu.Unlock()
// Not cached; fetch from Azure.
v, err := config.Azure.GetLatestVMExtensionImageVersion(ctx, req.Location, req.ExtType, req.Publisher)
if err != nil {
// Do not cache errors to avoid turning transient failures into
// persistent ones for the lifetime of the test run.
return "", err
}
// Cache the successful result.
c.mu.Lock()
c.data[req] = v
c.mu.Unlock()
return v, nil
}
var vmExtVersionCache = newVMExtensionVersionCache()
// CachedGetLatestVMExtensionImageVersion caches the result of querying the Azure API
// for the latest VM extension image version. Errors are never cached so that
// transient Azure failures can be retried.
var CachedGetLatestVMExtensionImageVersion = func(ctx context.Context, req GetLatestExtensionVersionRequest) (string, error) {
return vmExtVersionCache.Get(ctx, req)
}

Copilot uses AI. Check for mistakes.
vmss.SKU.Name = to.Ptr(vmSize)

extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using context.TODO() here bypasses the suite’s shared cancellation/timeout contexts (e.g., testCtx/newTestCtx), so this Azure API call won’t be canceled during graceful shutdown and can hang longer than intended. Prefer using the package-level testCtx (or a derived context with timeout) instead of context.TODO(), and apply consistently to the other createVMExtensionLinuxAKSNode call sites.

Suggested change
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(testCtx, vmss.Location)

Copilot uses AI. Check for mistakes.
},
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
extension, err := createVMExtensionLinuxAKSNode(vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using context.TODO() here bypasses the suite’s shared cancellation/timeout contexts (e.g., testCtx/newTestCtx), so the Azure API call used to resolve the extension version won’t be canceled during graceful shutdown. Prefer using the package-level testCtx (or a derived context with timeout) instead of context.TODO(), and apply the same change to the other createVMExtensionLinuxAKSNode call sites.

Suggested change
extension, err := createVMExtensionLinuxAKSNode(context.TODO(), vmss.Location)
extension, err := createVMExtensionLinuxAKSNode(testCtx, vmss.Location)

Copilot uses AI. Check for mistakes.
Replace hardcoded VM extension version "1.406" with a cached Azure API
call that fetches the latest version at runtime. Add a caching layer
via CachedGetLatestVMExtensionImageVersion to avoid redundant API calls
across tests. Update all callers to pass context for proper propagation.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from 8982d50 to a9e9293 Compare March 10, 2026 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants