Fix the GAIE inferencepool install in the serving stack by negz · Pull Request #159 · modelplaneai/modelplane

negz · 2026-06-16T00:42:36Z

Fixes #157.
Fixes #158.

The serving stack's Gateway API Inference Extension (GAIE) install, added in #142, can't complete, so a freshly provisioned InferenceCluster never reaches Ready and no model can be scheduled onto it. Three problems, found while bringing up an EKS-backed cluster end to end.

First, the inferencepool chart was pulled from oci://ghcr.io/kubernetes-sigs/gateway-api-inference-extension/charts, which denies anonymous pulls — its token endpoint returns 403. provider-helm can't fetch the chart, so the Release never installs. GAIE publishes the chart for public consumption on registry.k8s.io, which serves it anonymously, so this points the repo there.
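For illustration, the registry change amounts to swapping the chart repository on the composed provider-helm Release. The following is a minimal sketch, not the code in fn.py: the helper name, the `helm.crossplane.io/v1beta1` API version, the namespace, the placeholder version string, and the exact `registry.k8s.io` path are all assumptions.

```python
# Repository named in the PR description; denies anonymous pulls.
GHCR_REPO = "oci://ghcr.io/kubernetes-sigs/gateway-api-inference-extension/charts"
# Assumed public path on registry.k8s.io; verify against the GAIE release notes.
PUBLIC_REPO = "oci://registry.k8s.io/gateway-api-inference-extension/charts"


def inferencepool_release(repository: str, version: str) -> dict:
    """Build a provider-helm Release manifest for the inferencepool chart (hypothetical helper)."""
    return {
        "apiVersion": "helm.crossplane.io/v1beta1",  # assumed API version
        "kind": "Release",
        "spec": {
            "forProvider": {
                "chart": {
                    "name": "inferencepool",  # chart name is unchanged by the fix
                    "repository": repository,
                    "version": version,
                },
                "namespace": "default",  # hypothetical namespace
            },
        },
    }


# Only the repository differs between the broken and fixed Release.
release = inferencepool_release(PUBLIC_REPO, "v0.0.0")  # placeholder version
```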

Second, with the registry corrected, the install still fails: the inferencepool chart ships no CRDs. It renders a running InferencePool instance plus its endpoint picker and requires inferencePool.modelServers.matchLabels, which the serving stack doesn't set, so the render errors out. The CRDs the serving stack actually wants live in the upstream release's manifests.yaml. This vendors them into the function and applies them as provider-kubernetes Objects on the remote cluster, the same way compose-inference-gateway installs the Gateway API CRDs that the Traefik chart likewise doesn't ship. The upstream manifests are a live-cluster export carrying a top-level status and metadata.creationTimestamp; those are stripped from the vendored copy (with a header note for re-vendoring) so the Object doesn't re-apply every reconcile and trip the composite's watch circuit breaker.
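The stripping and wrapping described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the vendoring script or fn.py itself: the function names, the `kubernetes.crossplane.io/v1alpha2` API version, the ProviderConfig name, and the sample CRD are hypothetical.

```python
import copy


def strip_live_fields(crd: dict) -> dict:
    """Return a copy of a CRD manifest without the exported live-cluster fields."""
    cleaned = copy.deepcopy(crd)
    # A top-level status and metadata.creationTimestamp come from the export,
    # not from the desired state, so they are dropped before applying.
    cleaned.pop("status", None)
    cleaned.get("metadata", {}).pop("creationTimestamp", None)
    return cleaned


def as_object(crd: dict, provider_config: str) -> dict:
    """Wrap a cleaned CRD manifest in a provider-kubernetes Object (hypothetical helper)."""
    return {
        "apiVersion": "kubernetes.crossplane.io/v1alpha2",  # assumed API version
        "kind": "Object",
        "spec": {
            "forProvider": {"manifest": strip_live_fields(crd)},
            "providerConfigRef": {"name": provider_config},
        },
    }


# Illustrative stand-in for one document from the upstream manifests.yaml.
exported = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "inferencepools.example.test", "creationTimestamp": None},
    "spec": {"group": "example.test"},
    "status": {"conditions": []},
}
obj = as_object(exported, "remote-cluster")  # hypothetical ProviderConfig name
```

Working on a deep copy keeps the loaded manifest untouched, so the same input can be wrapped more than once without surprises.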

Third, the ai-gateway-crds and ai-gateway releases were composed but never marked ready — they weren't in mark_readiness's list. So even with every release and object healthy on the cluster, the ServingStack's readiness aggregation waited on them forever and the InferenceCluster never reached Ready. This adds them to the list.

The first two surface in sequence (the registry fix reveals the chart-content problem); the third was hidden behind both. Each maps to its own commit.

Validated end to end on an EKS-backed InferenceCluster: with all three fixes the GAIE CRDs install and settle, the serving stack converges, and the InferenceCluster reaches Ready=True.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

The serving stack installs the Gateway API Inference Extension inferencepool Helm chart from oci://ghcr.io/kubernetes-sigs/gateway-api-inference-extension/charts. That ghcr path denies anonymous pulls: its token endpoint returns 403, so provider-helm can't fetch the chart and the Release never installs. Because the Release is always composed, this pins the ServingStack's BackendReady and the InferenceCluster's Ready to False, and the scheduler won't place a model on a cluster that isn't Ready.

GAIE publishes the chart for public consumption on registry.k8s.io, which serves it anonymously. This points the repo there. The chart name and version are unchanged.

Fixes #157.

Signed-off-by: Nic Cope <nicc@rk0n.org>

dennis-upbound

thanks for doing this. hope this works

Copilot

Pull request overview

This PR fixes the ServingStack’s Gateway API Inference Extension (GAIE) installation so newly provisioned InferenceClusters can converge to Ready and schedule models by installing GAIE CRDs directly (instead of via the inferencepool Helm chart) and ensuring AI Gateway releases are included in readiness aggregation.

Changes:

Vendor GAIE CRDs from the upstream release manifest and compose them as provider-kubernetes Objects on the workload cluster.
Update serving-stack readiness aggregation to include ai-gateway-crds and ai-gateway.
Adjust compose-serving-stack unit tests to validate the new GAIE CRD composition shape and readiness behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| functions/compose-serving-stack/function/fn.py | Loads vendored GAIE CRDs, composes them as provider-kubernetes Objects, and marks AI Gateway releases ready in readiness aggregation. |
| functions/compose-serving-stack/function/gaie_crds.yaml | Adds vendored GAIE CRDs (from upstream `manifests.yaml`) for installation onto workload clusters. |
| functions/compose-serving-stack/tests/test_fn.py | Updates tests to expect GAIE CRDs as individual composed Objects and to account for AI Gateway readiness propagation. |

compose-serving-stack installed the Gateway API Inference Extension inferencepool Helm chart to get the InferencePool CRD onto the workload cluster. That chart ships no CRDs: it renders a running InferencePool instance plus its endpoint picker, and requires inferencePool.modelServers.matchLabels, which the serving stack doesn't set. So the Release failed to render, pinning the ServingStack's BackendReady and the InferenceCluster's Ready to False, which blocks model scheduling.

The CRDs are published in the upstream release's manifests.yaml, not the chart. This vendors them into the function and applies them as provider-kubernetes Objects on the remote cluster, marking each ready once its Object is Ready. It follows the pattern compose-inference-gateway already uses to install the Gateway API CRDs that the Traefik chart likewise doesn't ship.

Fixes #158.

Signed-off-by: Nic Cope <nicc@rk0n.org>

compose-serving-stack composed the ai-gateway-crds and ai-gateway Helm releases but never marked them ready. mark_readiness only marks the resources in its condition_ready list, and these two weren't in it, so the ServingStack's readiness aggregation waited on them forever: the XR stayed Ready=False even with every release and object healthy on the cluster, and the InferenceCluster never reached Ready.

This adds both to condition_ready, so they're marked ready once their Releases report Ready, like the rest of the serving stack.

Signed-off-by: Nic Cope <nicc@rk0n.org>
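To make the failure mode concrete, here is a small sketch of list-gated readiness. It is not the real mark_readiness, whose signature isn't shown in this PR: the function names, the dict shapes, and the `example-release` entry are hypothetical, and only `ai-gateway-crds`, `ai-gateway`, and `condition_ready` come from the commit message.

```python
def is_ready(resource: dict) -> bool:
    """True if the observed resource reports a Ready=True condition."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def readiness(observed: dict, condition_ready: list) -> dict:
    """Per-resource readiness; names missing from condition_ready are never marked ready."""
    return {
        name: name in condition_ready and is_ready(resource)
        for name, resource in observed.items()
    }


# Every resource is healthy on the cluster in both cases.
healthy = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
observed = {
    "example-release": healthy,  # hypothetical stand-in for the other releases
    "ai-gateway-crds": healthy,
    "ai-gateway": healthy,
}

before = readiness(observed, ["example-release"])
after = readiness(observed, ["example-release", "ai-gateway-crds", "ai-gateway"])
print(all(before.values()), all(after.values()))  # prints: False True
```

With the two names missing from the list, the aggregate can never become true no matter how healthy the underlying Releases are, which matches the symptom described above.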

dennis-upbound approved these changes Jun 16, 2026

negz commented Jun 16, 2026

Comment thread functions/compose-serving-stack/function/fn.py Outdated

negz force-pushed the fix-gaie-serving-stack branch 2 times, most recently from 88c6219 to fb7c50f June 16, 2026 04:18

negz marked this pull request as ready for review June 16, 2026 04:21

Copilot AI review requested due to automatic review settings June 16, 2026 04:21

Copilot started reviewing on behalf of negz June 16, 2026 04:21

Copilot AI reviewed Jun 16, 2026

Comment thread functions/compose-serving-stack/function/gaie_crds.yaml Outdated

Comment thread functions/compose-serving-stack/function/fn.py

negz mentioned this pull request Jun 16, 2026

Fix deletion ordering for ProviderConfig consumers and ModelReplicas #154

Merged

4 tasks

negz added 2 commits June 15, 2026 21:37

negz force-pushed the fix-gaie-serving-stack branch from fb7c50f to 05e8b4c June 16, 2026 04:37

negz merged commit 74a2cee into main Jun 16, 2026
4 checks passed

negz deleted the fix-gaie-serving-stack branch June 16, 2026 16:55

negz merged 3 commits into main from fix-gaie-serving-stack

3 participants