Fix deletion ordering for ProviderConfig consumers and ModelReplicas#154
Merged
Conversation
compose-serving-stack composed two Usages to keep its ProviderConfigs alive until the Helm releases and provider-kubernetes Objects that use them are deleted. Each Usage's `by` was a bare matchControllerRef selector meant to mean "all releases" or "all objects". A Usage's `by` selector resolves to a single resource: the Usage controller lists candidates, takes the first that matches the controller ref, and stops. So usage-helm-pc protected the helm ProviderConfig against exactly one release and usage-k8s-pc the kubernetes ProviderConfig against one object, whichever resolved first. The serving stack installs several of each, so most were unprotected. On teardown the ProviderConfigs could be deleted while releases and objects still needed them, wedging those resources with "cannot get provider config: ... not found". This adds a compose-usages function that runs as a late pipeline step. It reads the desired resources composed upstream and, for every Release or Object that references a ProviderConfig, stamps a unique label on it and composes one Usage of that ProviderConfig by that single labelled resource. One Usage per consumer means every release and object holds the ProviderConfig until it's gone. The function is generic: any pipeline whose resources reference a ProviderConfig gets correct protection by running it last. The two bare-selector Usages are removed from compose-serving-stack; its remaining Usages order the Envoy Gateway teardown, a separate concern, and stay. Fixes #97. Signed-off-by: Nic Cope <nicc@rk0n.org>
Deleting an InferenceCluster out from under running ModelReplicas strands them. A replica's workloads are provider-kubernetes Objects on the cluster; they need the cluster's ClusterProviderConfig, and the cluster itself, to finalize. Delete the InferenceCluster and those Objects wedge, needing finalizer surgery to remove. A Usage would normally express this, but it can't here. The InferenceCluster is cluster scoped and ModelReplicas are namespaced. A ClusterUsage's `by` can't reference a namespaced replica, and a namespaced Usage can't reference the cluster-scoped cluster as its `of`. Neither scope can name the other. Instead this protects the cluster with a ClusterUsage that has no `by` at all. A Usage with only a reason blocks deletion of its `of` until the Usage itself is gone. compose-inference-cluster requires the ModelReplicas labelled for this cluster, across all namespaces, and composes the ClusterUsage while any exist. When the last replica is deleted the function stops composing it and the cluster becomes deletable; replayDeletion completes a delete that was attempted while the guard was up. The guard runs before the rest of compose(), ahead of its early returns. Gating it behind cluster source or class resolution would drop the ClusterUsage on a reconcile where classes are transiently unresolved, deleting it and letting the cluster be deleted while replicas still use it. Signed-off-by: Nic Cope <nicc@rk0n.org>
There was a problem hiding this comment.
Pull request overview
This PR fixes two Crossplane deletion-ordering gaps that could leave composed resources stuck with finalizers by ensuring dependencies remain protected across namespace / cluster-scope boundaries.
Changes:
- Added a new late-stage pipeline function (
compose-usages) that labels each HelmRelease/ provider-kubernetesObjectthat references aProviderConfigand composes oneUsageper consumer, preventing prematureProviderConfigdeletion. - Updated
compose-serving-stackto remove its previous ProviderConfig-protectionUsages (which only protected against one consumer) and rely on the new genericcompose-usagesstep, while keeping the Envoy Gateway teardown orderingUsages. - Updated
compose-inference-clusterto compose a reason-onlyClusterUsageguard that blocksInferenceClusterdeletion while anyModelReplicais scheduled to it, plus added/updated tests.
Reviewed changes
Copilot reviewed 11 out of 14 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Adds compose-usages as a workspace member/package with its dependencies. |
| flake.nix | Includes compose-usages in the Nix functions list. |
| crossplane-project.yaml | Packages compose-usages as a function tarball for the Crossplane XPKG. |
| apis/servingstacks/composition.yaml | Adds compose-usages as a late pipeline step after compose-serving-stack. |
| functions/compose-usages/pyproject.toml | Defines the new function package and entrypoint. |
| functions/compose-usages/function/main.py | Adds the standard function CLI/server bootstrap. |
| functions/compose-usages/function/fn.py | Implements per-consumer labeling and per-consumer Usage composition for ProviderConfig references. |
| functions/compose-usages/tests/test_fn.py | Adds unit tests validating labeling + per-consumer Usage emission and no-op cases. |
| functions/compose-serving-stack/function/fn.py | Removes the incorrect “one Usage protects all consumers” ProviderConfig logic; keeps Envoy teardown ordering usages. |
| functions/compose-serving-stack/tests/test_fn.py | Updates expectations to remove the deleted ProviderConfig Usages. |
| functions/compose-inference-cluster/function/fn.py | Adds the ModelReplica-based deletion guard via reason-only ClusterUsage and required-resources selector. |
| functions/compose-inference-cluster/tests/test_fn.py | Adds test coverage for the new guard behavior and ensures the requirement is always emitted. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of your changes
Fixes #97.
Two deletion-ordering gaps let Crossplane tear down a resource before its dependents finish cleaning up, wedging them with finalizers that need clearing by hand. Both come down to using Usages correctly across the namespaced/cluster-scoped boundary.
ProviderConfig consumers (#97). compose-serving-stack used two Usages with bare
matchControllerRefbyselectors to hold its ProviderConfigs alive until the Releases and Objects using them are gone. But a Usage'sbyresolves to a single resource, so each ProviderConfig was protected against only one consumer while the stack installs several. On teardown a ProviderConfig could be deleted out from under the rest, wedging them withcannot get provider config: ... not found. This adds a genericcompose-usagesfunction that runs as a late pipeline step: for every Release or Object referencing a ProviderConfig it stamps a unique label and composes one Usage per consumer, so each one holds the ProviderConfig until it's gone. It carries no serving-stack knowledge — any pipeline gets correct protection by running it last. compose-serving-stack keeps its separate Envoy Gateway ordering Usages.ModelReplicas vs their InferenceCluster. Deleting an InferenceCluster out from under running ModelReplicas stranded them. A Usage can't express this directly: the cluster is cluster-scoped and replicas are namespaced, so a ClusterUsage's
bycan't reach a replica and a namespaced Usage can't name the cluster as itsof. Instead, compose-inference-cluster protects the cluster with a reason-only ClusterUsage (noby), which blocks deletion until the Usage itself is gone. It requires the ModelReplicas labelled for this cluster across all namespaces and composes the ClusterUsage while any exist; when the last goes, it stops composing it and the cluster becomes deletable. The guard runs ahead of compose()'s early returns so a transient (e.g. classes briefly unresolved) doesn't drop it.Validation
End to end on an EKS-backed InferenceCluster (Qwen2.5-0.5B on a g6.xlarge L4):
usage-pc-*Usage composed per Release/Object referencing each ProviderConfig; on teardown the ProviderConfigs outlived their consumers, then released.kubectl delete inferenceclusterwas rejected by thenousages.protection.crossplane.iowebhook; deleting the ModelDeployment dropped the guard and the cluster's pending delete replayed to completion.Worth flagging:
replayDeletionreplays with the original attempt's propagation policy, which the webhook defaults toBackground. A barekubectl delete inferenceclusterthus replays as a background delete — the teardown the docs warn against, which in my run leaked an ELB security group that blocked the VPC.--cascade=foregroundreplays cleanly. Not introduced here, but the guard makes foreground the path to take.I have:
nix flake check(or./nix.sh flake check) and made sure it passes.git commit -s.