Skip to content

Fix deletion ordering for ProviderConfig consumers and ModelReplicas#154

Merged
dennis-upbound merged 2 commits into
mainfrom
usage-and-abusage
Jun 16, 2026
Merged

Fix deletion ordering for ProviderConfig consumers and ModelReplicas#154
dennis-upbound merged 2 commits into
mainfrom
usage-and-abusage

Conversation

@negz

@negz negz commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Description of your changes

Fixes #97.

Two deletion-ordering gaps let Crossplane tear down a resource before its dependents finish cleaning up, wedging them with finalizers that need clearing by hand. Both come down to using Usages correctly across the namespaced/cluster-scoped boundary.

ProviderConfig consumers (#97). compose-serving-stack used two Usages with bare matchControllerRef by selectors to hold its ProviderConfigs alive until the Releases and Objects using them are gone. But a Usage's by resolves to a single resource, so each ProviderConfig was protected against only one consumer while the stack installs several. On teardown a ProviderConfig could be deleted out from under the rest, wedging them with cannot get provider config: ... not found. This adds a generic compose-usages function that runs as a late pipeline step: for every Release or Object referencing a ProviderConfig it stamps a unique label and composes one Usage per consumer, so each one holds the ProviderConfig until it's gone. It carries no serving-stack knowledge — any pipeline gets correct protection by running it last. compose-serving-stack keeps its separate Envoy Gateway ordering Usages.

ModelReplicas vs their InferenceCluster. Deleting an InferenceCluster out from under running ModelReplicas stranded them. A Usage can't express this directly: the cluster is cluster-scoped and replicas are namespaced, so a ClusterUsage's by can't reach a replica and a namespaced Usage can't name the cluster as its of. Instead, compose-inference-cluster protects the cluster with a reason-only ClusterUsage (no by), which blocks deletion until the Usage itself is gone. It requires the ModelReplicas labelled for this cluster across all namespaces and composes the ClusterUsage while any exist; when the last goes, it stops composing it and the cluster becomes deletable. The guard runs ahead of compose()'s early returns so a transient (e.g. classes briefly unresolved) doesn't drop it.

Validation

End to end on an EKS-backed InferenceCluster (Qwen2.5-0.5B on a g6.xlarge L4):

  • ServingStack Usages protect ProviderConfigs against only one Release/Object, not all #97: one usage-pc-* Usage composed per Release/Object referencing each ProviderConfig; on teardown the ProviderConfigs outlived their consumers, then released.
  • ModelReplica guard: deploying the model composed the ClusterUsage; kubectl delete inferencecluster was rejected by the nousages.protection.crossplane.io webhook; deleting the ModelDeployment dropped the guard and the cluster's pending delete replayed to completion.

Worth flagging: replayDeletion replays with the original attempt's propagation policy, which the webhook defaults to Background. A bare kubectl delete inferencecluster thus replays as a background delete — the teardown the docs warn against, which in my run leaked an ELB security group that blocked the VPC. --cascade=foreground replays cleanly. Not introduced here, but the guard makes foreground the path to take.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

negz added 2 commits June 15, 2026 21:43
compose-serving-stack composed two Usages to keep its ProviderConfigs
alive until the Helm releases and provider-kubernetes Objects that use
them are deleted. Each Usage's `by` was a bare matchControllerRef
selector meant to mean "all releases" or "all objects".

A Usage's `by` selector resolves to a single resource: the Usage
controller lists candidates, takes the first that matches the
controller ref, and stops. So usage-helm-pc protected the helm
ProviderConfig against exactly one release and usage-k8s-pc the
kubernetes ProviderConfig against one object, whichever resolved first.
The serving stack installs several of each, so most were unprotected.
On teardown the ProviderConfigs could be deleted while releases and
objects still needed them, wedging those resources with "cannot get
provider config: ... not found".

This adds a compose-usages function that runs as a late pipeline step.
It reads the desired resources composed upstream and, for every Release
or Object that references a ProviderConfig, stamps a unique label on it
and composes one Usage of that ProviderConfig by that single labelled
resource. One Usage per consumer means every release and object holds
the ProviderConfig until it's gone. The function is generic: any
pipeline whose resources reference a ProviderConfig gets correct
protection by running it last.

The two bare-selector Usages are removed from compose-serving-stack;
its remaining Usages order the Envoy Gateway teardown, a separate
concern, and stay.

Fixes #97.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Deleting an InferenceCluster out from under running ModelReplicas
strands them. A replica's workloads are provider-kubernetes Objects on
the cluster; they need the cluster's ClusterProviderConfig, and the
cluster itself, to finalize. Delete the InferenceCluster and those
Objects wedge, needing finalizer surgery to remove.

A Usage would normally express this, but it can't here. The
InferenceCluster is cluster scoped and ModelReplicas are namespaced. A
ClusterUsage's `by` can't reference a namespaced replica, and a
namespaced Usage can't reference the cluster-scoped cluster as its `of`.
Neither scope can name the other.

Instead this protects the cluster with a ClusterUsage that has no `by`
at all. A Usage with only a reason blocks deletion of its `of` until the
Usage itself is gone. compose-inference-cluster requires the
ModelReplicas labelled for this cluster, across all namespaces, and
composes the ClusterUsage while any exist. When the last replica is
deleted the function stops composing it and the cluster becomes
deletable; replayDeletion completes a delete that was attempted while
the guard was up.

The guard runs before the rest of compose(), ahead of its early
returns. Gating it behind cluster source or class resolution would drop
the ClusterUsage on a reconcile where classes are transiently
unresolved, deleting it and letting the cluster be deleted while
replicas still use it.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz force-pushed the usage-and-abusage branch from 46d89f5 to 3038b58 Compare June 16, 2026 04:49
@negz negz marked this pull request as ready for review June 16, 2026 04:49
Copilot AI review requested due to automatic review settings June 16, 2026 04:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes two Crossplane deletion-ordering gaps that could leave composed resources stuck with finalizers by ensuring dependencies remain protected across namespace / cluster-scope boundaries.

Changes:

  • Added a new late-stage pipeline function (compose-usages) that labels each Helm Release / provider-kubernetes Object that references a ProviderConfig and composes one Usage per consumer, preventing premature ProviderConfig deletion.
  • Updated compose-serving-stack to remove its previous ProviderConfig-protection Usages (which only protected against one consumer) and rely on the new generic compose-usages step, while keeping the Envoy Gateway teardown ordering Usages.
  • Updated compose-inference-cluster to compose a reason-only ClusterUsage guard that blocks InferenceCluster deletion while any ModelReplica is scheduled to it, plus added/updated tests.

Reviewed changes

Copilot reviewed 11 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
uv.lock Adds compose-usages as a workspace member/package with its dependencies.
flake.nix Includes compose-usages in the Nix functions list.
crossplane-project.yaml Packages compose-usages as a function tarball for the Crossplane XPKG.
apis/servingstacks/composition.yaml Adds compose-usages as a late pipeline step after compose-serving-stack.
functions/compose-usages/pyproject.toml Defines the new function package and entrypoint.
functions/compose-usages/function/main.py Adds the standard function CLI/server bootstrap.
functions/compose-usages/function/fn.py Implements per-consumer labeling and per-consumer Usage composition for ProviderConfig references.
functions/compose-usages/tests/test_fn.py Adds unit tests validating labeling + per-consumer Usage emission and no-op cases.
functions/compose-serving-stack/function/fn.py Removes the incorrect “one Usage protects all consumers” ProviderConfig logic; keeps Envoy teardown ordering usages.
functions/compose-serving-stack/tests/test_fn.py Updates expectations to remove the deleted ProviderConfig Usages.
functions/compose-inference-cluster/function/fn.py Adds the ModelReplica-based deletion guard via reason-only ClusterUsage and required-resources selector.
functions/compose-inference-cluster/tests/test_fn.py Adds test coverage for the new guard behavior and ensures the requirement is always emitted.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@dennis-upbound dennis-upbound left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dennis-upbound dennis-upbound merged commit 95d06a3 into main Jun 16, 2026
7 checks passed
@negz negz deleted the usage-and-abusage branch June 16, 2026 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ServingStack Usages protect ProviderConfigs against only one Release/Object, not all

3 participants