Fix the GAIE inferencepool install in the serving stack#159

Merged
negz merged 3 commits into main from fix-gaie-serving-stack
Jun 16, 2026

@negz negz commented Jun 16, 2026

Collaborator

Fixes #157.
Fixes #158.

The serving stack's Gateway API Inference Extension (GAIE) install, added in #142, can't complete, so a freshly provisioned InferenceCluster never reaches Ready and no model can be scheduled onto it. Three problems, found while bringing up an EKS-backed cluster end to end.

First, the inferencepool chart was pulled from oci://ghcr.io/kubernetes-sigs/gateway-api-inference-extension/charts, which denies anonymous pulls — its token endpoint returns 403. provider-helm can't fetch the chart, so the Release never installs. GAIE publishes the chart for public consumption on registry.k8s.io, which serves it anonymously, so this points the repo there.
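As a rough sketch of where a registry change like this lands, a provider-helm Release composed from Python might carry the chart repository as shown below. The helper name, the `helm.crossplane.io/v1beta1` apiVersion, the field layout, and the placeholder version are assumptions for illustration, not the function's actual code. Only the ghcr.io URL and the chart name come from this PR, and the registry.k8s.io path is deliberately left for the caller to supply.

```python
# Hypothetical sketch: not the actual compose-serving-stack code.
# The repository that denies anonymous pulls, as quoted in this PR.
OLD_REPOSITORY = "oci://ghcr.io/kubernetes-sigs/gateway-api-inference-extension/charts"


def inferencepool_release(repository: str, version: str) -> dict:
    """Build a provider-helm Release body for the inferencepool chart.

    The apiVersion and field layout are assumptions about provider-helm;
    check them against the provider version in use.
    """
    return {
        "apiVersion": "helm.crossplane.io/v1beta1",
        "kind": "Release",
        "spec": {
            "forProvider": {
                "chart": {
                    "name": "inferencepool",
                    "repository": repository,
                    "version": version,
                },
            },
        },
    }


# Only the repository changes; the chart name and version stay the same.
before = inferencepool_release(OLD_REPOSITORY, "<version>")
```

Keeping the repository as a parameter means the registry swap is a one-argument change at the call site.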

Second, with the registry corrected, the install still fails: the inferencepool chart ships no CRDs. It renders a running InferencePool instance plus its endpoint picker and requires inferencePool.modelServers.matchLabels, which the serving stack doesn't set, so the render errors out. The CRDs the serving stack actually wants live in the upstream release's manifests.yaml. This vendors them into the function and applies them as provider-kubernetes Objects on the remote cluster, the same way compose-inference-gateway installs the Gateway API CRDs that the Traefik chart likewise doesn't ship. The upstream manifests are a live-cluster export carrying a top-level status and metadata.creationTimestamp; those are stripped from the vendored copy (with a header note for re-vendoring) so the Object doesn't re-apply every reconcile and trip the composite's watch circuit breaker.
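A minimal sketch of the stripping and wrapping described above, assuming each CRD has already been parsed into a dict. The function names, the `kubernetes.crossplane.io/v1alpha2` apiVersion, the provider config name, and the example CRD are all hypothetical; the real code in fn.py may be structured differently.

```python
import copy


def strip_exported_fields(crd: dict) -> dict:
    """Drop the fields a live-cluster export carries: status and creationTimestamp."""
    cleaned = copy.deepcopy(crd)  # leave the caller's dict untouched
    cleaned.pop("status", None)
    cleaned.get("metadata", {}).pop("creationTimestamp", None)
    return cleaned


def crd_object(crd: dict, provider_config: str) -> dict:
    """Wrap a CRD in a provider-kubernetes Object targeting the remote cluster.

    The apiVersion and field layout are assumptions about provider-kubernetes.
    """
    return {
        "apiVersion": "kubernetes.crossplane.io/v1alpha2",
        "kind": "Object",
        "spec": {
            "forProvider": {"manifest": strip_exported_fields(crd)},
            "providerConfigRef": {"name": provider_config},
        },
    }


# Hypothetical exported CRD, shaped like a live-cluster export.
exported = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "widgets.example.org", "creationTimestamp": None},
    "spec": {"group": "example.org"},
    "status": {"acceptedNames": {"kind": "", "plural": ""}},
}
obj = crd_object(exported, "remote-cluster")
```

In this PR the stripping is done once on the vendored file rather than at runtime; doing it in code, as here, is just one way to show which fields go.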

Third, the ai-gateway-crds and ai-gateway releases were composed but never marked ready — they weren't in mark_readiness's list. So even with every release and object healthy on the cluster, the ServingStack's readiness aggregation waited on them forever and the InferenceCluster never reached Ready. This adds them to the list.
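The failure mode can be modeled in a few lines. This is a simplified stand-in, not the real mark_readiness: the signature, the `observed_ready` map, and the `backend` resource name are hypothetical; only `ai-gateway-crds` and `ai-gateway` come from this PR.

```python
def mark_readiness(composed, observed_ready, condition_ready):
    """Return name -> ready for every composed resource.

    A resource is only marked ready if it is listed in condition_ready
    AND its observed state reports Ready. Anything outside the list
    stays not-ready no matter how healthy it is.
    """
    return {
        name: name in condition_ready and observed_ready.get(name, False)
        for name in composed
    }


def composite_ready(readiness):
    """The composite is ready only when every composed resource is."""
    return all(readiness.values())


composed = ["backend", "ai-gateway-crds", "ai-gateway"]
healthy = {name: True for name in composed}  # everything healthy on the cluster

# Before: the two AI Gateway releases are missing from the list.
before = mark_readiness(composed, healthy, ["backend"])

# After: both are listed, so they are marked ready once their Releases are.
after = mark_readiness(composed, healthy, ["backend", "ai-gateway-crds", "ai-gateway"])
```

With the shorter list, `composite_ready(before)` is False even though every resource is healthy, which matches the stuck Ready=False described above.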

The first two unmask each other in sequence (the registry fix reveals the chart-content problem); the third was hidden behind both. Each maps to its own commit.

Validated end to end on an EKS-backed InferenceCluster: with all three fixes the GAIE CRDs install and settle, the serving stack converges, and the InferenceCluster reaches Ready=True.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

The serving stack installs the Gateway API Inference Extension inferencepool
Helm chart from oci://ghcr.io/kubernetes-sigs/gateway-api-inference-extension/charts.
That ghcr path denies anonymous pulls: its token endpoint returns 403, so
provider-helm can't fetch the chart and the Release never installs. Because
the Release is always composed, this pins the ServingStack's BackendReady and
the InferenceCluster's Ready to False, and the scheduler won't place a model
on a cluster that isn't Ready.

GAIE publishes the chart for public consumption on registry.k8s.io, which
serves it anonymously. This points the repo there. The chart name and version
are unchanged.

Fixes #157.

Signed-off-by: Nic Cope <nicc@rk0n.org>

@dennis-upbound dennis-upbound left a comment

Collaborator

thanks for doing this. hope this works

Comment thread functions/compose-serving-stack/function/fn.py Outdated
@negz negz force-pushed the fix-gaie-serving-stack branch 2 times, most recently from 88c6219 to fb7c50f Compare June 16, 2026 04:18
@negz negz marked this pull request as ready for review June 16, 2026 04:21
Copilot AI review requested due to automatic review settings June 16, 2026 04:21

Copilot AI left a comment

Pull request overview

This PR fixes the ServingStack’s Gateway API Inference Extension (GAIE) installation so newly provisioned InferenceClusters can converge to Ready and schedule models by installing GAIE CRDs directly (instead of via the inferencepool Helm chart) and ensuring AI Gateway releases are included in readiness aggregation.

Changes:

  • Vendor GAIE CRDs from the upstream release manifest and compose them as provider-kubernetes Objects on the workload cluster.
  • Update serving-stack readiness aggregation to include ai-gateway-crds and ai-gateway.
  • Adjust compose-serving-stack unit tests to validate the new GAIE CRD composition shape and readiness behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| functions/compose-serving-stack/function/fn.py | Loads vendored GAIE CRDs, composes them as provider-kubernetes Objects, and marks AI Gateway releases ready in readiness aggregation. |
| functions/compose-serving-stack/function/gaie_crds.yaml | Adds vendored GAIE CRDs (from upstream manifests.yaml) for installation onto workload clusters. |
| functions/compose-serving-stack/tests/test_fn.py | Updates tests to expect GAIE CRDs as individual composed Objects and to account for AI Gateway readiness propagation. |

Comment thread functions/compose-serving-stack/function/gaie_crds.yaml Outdated
Comment thread functions/compose-serving-stack/function/fn.py
negz added 2 commits June 15, 2026 21:37

compose-serving-stack installed the Gateway API Inference Extension
inferencepool Helm chart to get the InferencePool CRD onto the workload
cluster. That chart ships no CRDs: it renders a running InferencePool
instance plus its endpoint picker, and requires
inferencePool.modelServers.matchLabels, which the serving stack doesn't
set. So the Release failed to render, pinning the ServingStack's
BackendReady and the InferenceCluster's Ready to False, which blocks
model scheduling.

The CRDs are published in the upstream release's manifests.yaml, not the
chart. This vendors them into the function and applies them as
provider-kubernetes Objects on the remote cluster, marking each ready
once its Object is Ready. It follows the pattern compose-inference-gateway
already uses to install the Gateway API CRDs that the Traefik chart
likewise doesn't ship.

Fixes #158.

Signed-off-by: Nic Cope <nicc@rk0n.org>

compose-serving-stack composed the ai-gateway-crds and ai-gateway Helm
releases but never marked them ready. mark_readiness only marks the
resources in its condition_ready list, and these two weren't in it, so
the ServingStack's readiness aggregation waited on them forever: the XR
stayed Ready=False even with every release and object healthy on the
cluster, and the InferenceCluster never reached Ready.

This adds both to condition_ready, so they're marked ready once their
Releases report Ready, like the rest of the serving stack.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz force-pushed the fix-gaie-serving-stack branch from fb7c50f to 05e8b4c Compare June 16, 2026 04:37
@negz negz merged commit 74a2cee into main Jun 16, 2026
4 checks passed
@negz negz deleted the fix-gaie-serving-stack branch June 16, 2026 16:55
3 participants