ModelDeployment's nodeSelector was a single CEL expression matched against a pool's merged attributes and capacity, in a flat dialect that resembled DRA CEL but didn't behave like it (attributes[...]/capacity[...] indexing, version() rather than semver()). It also couldn't describe a node with more than one kind of device: a GPU and a NIC flattened onto one synthetic device, so a ModelDeployment couldn't filter on both, and the ResourceClaim translation wasn't mechanical for multi-device pools. This change reworks the design so InferenceClass and nodeSelector describe hardware as a list of DRA-style devices. InferenceClass gains a devices[] list, each with a driver, count, typed attributes, capacity, and a claim discriminator (DRA or Synthetic) that says whether the device is claimed via a ResourceClaim or described for scheduling only. This replaces resources.gpu and the modelplane.ai/* attribute-prefix convention. nodeSelector becomes devices[], each request carrying a name, count, and a selectors[] list whose entries are real DRA CEL (device.attributes["domain"].name, quantity(), semver()). Scheduling no longer derives a physical shape from topology. A device request's count subsumes the old GPUs-per-node check, so the scheduler evaluates the CEL requests and gates only on available nodes. Topology becomes purely a provisioning concern. ResourceClaim translation is now one DeviceRequest per claim: DRA request, with synthetic devices dropped. For #103. Signed-off-by: Nic Cope <nicc@rk0n.org>
The repo pinned the Crossplane CLI to negz/cli:diy, a fork branch carrying an unreleased datamodel-code-generator bump. That bump (crossplane/cli#24 and #64) has since merged to crossplane/cli main, so this repins the CLI to crossplane/cli directly and regenerates the Python models. The regen reflows the affected models with the newer generator (mostly Optional[X] -> X | None). The newer generator (datamodel-code-generator 0.59.0) emits object-typed field defaults as a default_factory rather than a plain value. The Crossplane SDK's resource.update serializes composed resources with model_dump(exclude_defaults=True), which no longer recognizes the factory-built default as equal to the declared default, so unset fields leak into composed resources. This keeps crossplane-function-sdk-python pinned to a git branch that serializes with exclude_unset instead, which asks "did the caller set this field?" rather than "is it different from its default?" - which is the correct question under server-side apply and immune to how a default is represented. Switching the whole repo to exclude_unset surfaces a few places that explicitly set fields to None or to a defaulted value, which exclude_defaults previously dropped. compose-serving-stack built provider-kubernetes Objects and Helm Releases with metadata=None and ObjectMeta(namespace=None); those now only set the field when it's present. The compose-inference-cluster and compose-model-deployment test fixtures are updated to reflect that explicitly set values (a node pool's kubernetesVersion and diskSizeGb, a replica's worker count and pipeline) now appear in composed resources. Signed-off-by: Nic Cope <nicc@rk0n.org>
ModelDeployment matched hardware with a single nodeSelector.cel expression over a pool's flat, merged attributes and capacity. It resembled DRA CEL but didn't behave like it, and it couldn't describe a node with more than one kind of device: a GPU and a NIC flattened onto one synthetic device, so a deployment couldn't filter on both and the translation to a DRA ResourceClaim wasn't mechanical. Serving pods bound GPUs through the legacy nvidia.com/gpu device plugin, not DRA.
This reworks hardware as a list of DRA-style devices, end to end:
- InferenceClass declares spec.devices[]: each device has a driver, a count, a deviceClassName, typed attributes and capacity, and a claim discriminator (DRA or Synthetic).
- InferenceCluster copies a pool's devices verbatim into status.capacity.gpuPools[].devices for the scheduler to match against.
- ModelDeployment.nodeSelector becomes devices[]: each request carries a name, a count, and a list of CEL selectors that are real DRA CEL evaluated against one device (device.driver, device.attributes["<driver>"].<name>, device.capacity["<driver>"].<name>, with quantity() and semver()).
- The scheduler matches each request against a pool's devices, consuming device count across requests so it never places a replica onto a node DRA can't satisfy, and gates only on available nodes. It resolves the matched pool's claim: DRA devices into ModelReplica.spec.deviceRequests.
- compose-model-replica forms a DRA ResourceClaimTemplate (resource.k8s.io/v1) from those requests and wires every serving pod - the native Deployment pod and both the llm-d LeaderWorkerSet leader and worker - to claim through it, in place of the nvidia.com/gpu limit. A replica with no device requests (no nodeSelector) falls back to the device-plugin limit, so existing deployments are unaffected.
The CEL extensions (quantity(), semver(), compareTo/isGreaterThan/isLessThan) are reimplemented for the pure-Python celpy evaluator; quantity arithmetic uses Decimal for exact ordering, and any evaluation error is treated as a non-match so arbitrary user CEL can't crash a reconcile. For #103. Signed-off-by: Nic Cope <nicc@rk0n.org>
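As a rough illustration of the two ideas above (exact quantity ordering with Decimal, and treating any evaluation error as a non-match), here is a minimal sketch. parse_quantity and safe_match are hypothetical names, the suffix handling is a simplified subset of Kubernetes quantity syntax, and a plain Python callable stands in for a compiled CEL selector.

```python
from decimal import Decimal
from typing import Any, Callable

# Simplified suffix table: binary (Ki..Ei) and decimal (n..E) multipliers.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40, "Pi": Decimal(2) ** 50, "Ei": Decimal(2) ** 60,
    "n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3"),
    "k": Decimal("1e3"), "M": Decimal("1e6"), "G": Decimal("1e9"),
    "T": Decimal("1e12"), "P": Decimal("1e15"), "E": Decimal("1e18"),
}


def parse_quantity(text: str) -> Decimal:
    """Parse a quantity such as '80Gi' or '500m' into an exact Decimal."""
    # Try two-character suffixes first so 'Gi' is not mistaken for another unit.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def safe_match(selector: Callable[[dict[str, Any]], Any], device: dict[str, Any]) -> bool:
    """Evaluate a selector; any error or non-True result is a non-match."""
    try:
        return selector(device) is True
    except Exception:
        return False


def at_least_40gi(device: dict[str, Any]) -> bool:
    # Stand-in for a compiled CEL selector comparing device memory.
    return parse_quantity(device["memory"]) >= parse_quantity("40Gi")
```

With this shape, a selector that indexes a missing attribute or parses a malformed quantity raises inside safe_match and simply fails to match, rather than propagating out of the reconcile.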
A ModelDeployment couldn't place more than one ModelReplica on the same
InferenceCluster. The scheduler keyed every replica, endpoint, and
retention decision on the cluster name alone, so a second replica on a
cluster would collide with the first. A deployment wanting three replicas
across two clusters could only ever fill two, even when a cluster had
ample spare nodes.
This change makes a replica's identity the pair (cluster, index), where
index is a per-cluster-local integer that distinguishes co-located
replicas. The index is a collision breaker, not an ordering: replicas are
fungible, and a replica never moves cluster. Desired-resource keys become
replica-{cluster}-{index} and endpoint-{cluster}-{index}; names are
hashed from (deployment, cluster, index) so co-located replicas don't
collide. A new modelplane.ai/replica-index label carries the index, which
the scheduler reads back to reconstruct identity from observed state.
The scheduler is restructured as two phases over the observed state.
Retain keeps each existing replica on its (cluster, index) when the
cluster still exists and its pinned pool still matches the nodeSelector,
and never moves a healthy replica to rebalance. Fill places any shortfall
one replica at a time onto the eligible cluster hosting the fewest of this
deployment's replicas, so replicas spread across clusters and pack onto
fewer only when capacity forces it. Scale-down drops the highest-index
replica on the most-loaded cluster first, consolidating without emptying a
cluster that still holds a sole replica.
Placement runs against a node-capacity ledger built from each pool's
published nodes minus the replicas committed to it: other deployments'
replicas and this deployment's retained replicas, each charged at its own
observed node cost. Replicas dropped by retain or scale-down are not
charged, since their nodes are freeing up for the replicas that replace
them. The fill phase decrements the ledger as it places each replica, so a
single pass can't overcommit a cluster.
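The fill phase and its ledger can be sketched roughly as follows. This is a simplified model under stated assumptions, not the scheduler's code: the function name and arguments are hypothetical, the ledger is collapsed to one free-node count per cluster, every replica costs the same number of nodes, and ties are broken by cluster name.

```python
def fill(
    shortfall: int,
    hosted: dict[str, int],
    free_nodes: dict[str, int],
    nodes_per_replica: int,
) -> list[str]:
    """Place replicas one at a time onto the least-loaded eligible cluster.

    hosted maps cluster -> this deployment's retained replica count.
    free_nodes maps cluster -> nodes left after committed replicas are charged.
    """
    hosted = dict(hosted)
    ledger = dict(free_nodes)
    placements: list[str] = []
    for _ in range(shortfall):
        eligible = [c for c, free in ledger.items() if free >= nodes_per_replica]
        if not eligible:
            break  # No capacity left; the remaining shortfall stays unplaced.
        # Fewest of this deployment's replicas first; name breaks ties (assumption).
        cluster = min(eligible, key=lambda c: (hosted.get(c, 0), c))
        ledger[cluster] -= nodes_per_replica  # Charge the ledger immediately.
        hosted[cluster] = hosted.get(cluster, 0) + 1
        placements.append(cluster)
    return placements


# Three replicas over a roomy cluster and a nearly full one.
placed = fill(3, {}, {"cluster-a": 4, "cluster-b": 1}, 1)
```

Because the ledger is decremented inside the loop, the third replica sees cluster-b as full and packs onto cluster-a instead of overcommitting.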
Cross-deployment device-count contention remains the workload cluster's
DRA admission to resolve; the control-plane scheduler stays coarse and
gates only on nodes.
Signed-off-by: Nic Cope <nicc@rk0n.org>
DRA names a device attribute's typed value bool and int, but the InferenceClass and InferenceCluster schemas used boolean and integer instead. This worked around crossplane/cli#63, where the project build generated broken Python schemas for OpenAPI fields named int or bool. That issue is now fixed, and the pinned Crossplane CLI commit carries the fix, so the workaround is no longer needed. This change renames the attribute one-of fields back to DRA's bool and int across both XRDs, the exactly-one validation rule, and the CEL activation that reads them, and regenerates the models. The generated models can't use bool and int as Python attribute names, so the code generator emits bool_ and int_ aliased to the bool and int wire names. model_dump defaults to the Python attribute names, which would hide the values from the CEL selector (it reads the wire names). The two device model_dump calls that feed published capacity and selector matching now pass by_alias=True so the wire shape keeps DRA's bool and int. Signed-off-by: Nic Cope <nicc@rk0n.org>
The cluster's published capacity lived under status.capacity, whose only member was gpuPools. The wrapper added a redundant level: callers wrote and read status.capacity.gpuPools where status.gpuPools would do. It also collided with the DRA device capacity field added by the device redesign. A device's typed capacity quantities live under device.capacity, which the model generator names Capacity. Two schemas can't share a class name, so the status wrapper was generated as CapacityModel, an artifact leaking into every caller that constructed it. This change removes the wrapper and moves gpuPools to the top of the InferenceCluster status. The generated status type now references GpuPool directly, and Capacity is free for the DRA device field it belongs to. The wrapper isn't worth keeping for future status fields: the obvious candidate, observed capacity, would replace the declared pools as a higher-fidelity signal rather than sit beside them, so it needs no grouping object. Signed-off-by: Nic Cope <nicc@rk0n.org>
The functions pinned function-sdk-python to a git branch carrying two unreleased serialization fixes: exclude_unset instead of exclude_defaults, and by_alias so keyword-named fields like bool and int serialize under their wire names. The git pin was built from an sdist, so nix/checks.nix and nix/functions.nix injected hatchling as a build-system input that the wheel overlay doesn't provide. Both fixes shipped in v0.13.0 on PyPI. This drops the git source pin, bumps the dependency to >=0.13.0 across the workspace, removes the hatchling build-system overlays, and relocks. The SDK now resolves as a PyPI wheel like every other dependency. Signed-off-by: Nic Cope <nicc@rk0n.org>
When an EKS-backed InferenceCluster became ready, compose-inference-cluster called self.compose_kserve_backend, a method that no longer exists after the KServe backend was replaced with the ServingStack. The function raised AttributeError and the InferenceCluster never composed its backend, so it never reached Ready. The EKS Usage that blocks cluster deletion until the backend is gone also still referenced the old KServeBackend kind. The GKE and Existing paths were already updated to compose a ServingStack; the EKS path was missed. This change points the EKS path at compose_serving_stack and the Usage's by reference at ServingStack, matching the other two. Signed-off-by: Nic Cope <nicc@rk0n.org>
A serving workload's provider-kubernetes Object used the DeriveFromObject readiness policy, which mirrors the wrapped resource's Ready condition. A Deployment and a LeaderWorkerSet publish Available, not Ready, so the Object never became ready: the ModelReplica stayed at ModelReady=False even while the model was serving, and the ModelDeployment's replica count never caught up. The device-plugin DaemonSet compose-eks-cluster installed had the same problem - a DaemonSet has no Ready condition either - and only looked ready because it fell through to the provider's SuccessfulCreate default. The workload Objects (the native Deployment and the llm-d LeaderWorkerSet) now derive readiness from a CEL query over their Available condition. The Service, HTTPRoute, and ResourceClaimTemplate Objects have no runtime readiness and are explicitly ready on create. The device plugin goes away entirely. It existed to advertise the legacy nvidia.com/gpu extended resource for replicas that bind GPUs through a device-plugin limit rather than DRA. The design binds GPUs via DRA on every cluster, so that fallback shouldn't exist: engine_resources references the pod's DRA claim when the replica has device requests and claims nothing otherwise, and compose-eks-cluster no longer installs the DaemonSet or its deletion-ordering Usages. Signed-off-by: Nic Cope <nicc@rk0n.org>
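The readiness rule for workloads amounts to checking the Available condition rather than Ready. The sketch below expresses that check in plain Python over an observed resource dict; it is not the CEL query the function actually configures, and is_available is a hypothetical helper name.

```python
from typing import Any


def is_available(observed: dict[str, Any]) -> bool:
    """Return True when the resource reports condition Available=True."""
    conditions = observed.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Available" and c.get("status") == "True"
        for c in conditions
    )


deployment = {
    "kind": "Deployment",
    "status": {"conditions": [{"type": "Available", "status": "True"}]},
}
# A resource that only ever publishes Progressing never looks ready.
rolling = {
    "kind": "Deployment",
    "status": {"conditions": [{"type": "Progressing", "status": "True"}]},
}
```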
GPUs bind to pods only through DRA: each claim: DRA request in a ModelDeployment's nodeSelector becomes a DeviceRequest in the ResourceClaim the serving pods claim GPUs through. A deployment with no nodeSelector produced a replica with no device requests, and so no GPU. nodeSelector was optional, defaulting to "match any pool", which only ever worked because GPUs could also bind via a device plugin. Modelplane won't infer a request. A request's selectors are how the ML team says what the model needs - a 0.5B model and a 70B model want very different GPUs - and an inferred "any GPU" request would schedule a model onto whatever pool has a free device and hope it fits. This makes nodeSelector required on the XRD (at least one device, each with at least one selector) and removes the scheduler's no-nodeSelector path: compile_requests always returns the compiled requests, and every placed replica carries the device requests its pool resolved. The example deployments gain a GPU nodeSelector so they keep working. Signed-off-by: Nic Cope <nicc@rk0n.org>
GPUs bind via Dynamic Resource Allocation, which is off by default (alpha, then beta) until Kubernetes 1.34, where the core APIs went GA. EKS clusters defaulted to 1.31, so DRA wasn't available and managed EKS gives no way to enable an alpha feature gate on the control plane. This defaults both the EKSCluster and InferenceCluster EKS version to 1.36, the latest EKS supports, where DRA is on by default. The version is still overridable. Signed-off-by: Nic Cope <nicc@rk0n.org>
The serving stack provisions everything a model needs to run on a workload cluster, but nothing published GPUs as DRA devices, so a ResourceClaim had no ResourceSlice to bind against and GPU pods stayed Pending. This adds two Helm releases to the serving stack. Node Feature Discovery labels GPU nodes. The NVIDIA DRA driver runs a kubelet plugin on those nodes that publishes each node's GPUs as ResourceSlices and registers the gpu.nvidia.com DeviceClass that ModelReplica ResourceClaimTemplates request through. GPU allocation is opt-in via gpuResourcesEnabledOverride; the driver's ComputeDomains support (Multi-Node NVLink) is disabled, since Modelplane doesn't use it and enabling it would pull in GPU Feature Discovery. Signed-off-by: Nic Cope <nicc@rk0n.org>
GPU node groups carry an nvidia.com/gpu:NoSchedule taint so non-GPU pods don't land on expensive GPU nodes. Pods used to schedule there because the device plugin made them request the nvidia.com/gpu extended resource, which EKS's ExtendedResourceToleration admission controller turns into a matching toleration. Binding GPUs via DRA instead, the serving pods make no such request, so nothing injects the toleration and they stay Pending off the GPU nodes. This adds the toleration to a serving pod when - and only when - the replica claims a GPU through DRA, alongside wiring up its ResourceClaimTemplate. A pod with no device requests claims no GPU and gets no toleration, so it can't land on a GPU node it has no reason to use. Signed-off-by: Nic Cope <nicc@rk0n.org>
A ModelReplica never became Ready even once its model was serving: ModelReady went True, but the XR stayed "Creating", reporting its Service, HTTPRoute, and ResourceClaimTemplate as unready. So the ModelDeployment never counted the replica and never became Ready either. The function only marked the workload (model-serving) ready. Crossplane gates an XR's Ready on every composed resource being ready, and a composed resource isn't ready just because provider-kubernetes set its own Object's Ready condition - the function has to mark it in the response. This marks every composed resource ready. The workload still gates on the model actually serving; the Service, HTTPRoute, and ResourceClaimTemplate have no runtime readiness to wait on, so they're ready once composed. Signed-off-by: Nic Cope <nicc@rk0n.org>
A ModelReplica's name was built with the SDK's resource.child_name, joining the deployment name, cluster name, and replica index into the readable prefix - e.g. my-model-cluster-a-0-5ab63. That leaks a replica's placement (which cluster, which index) into its name, when a replica's identity is meant to live in its labels. The name should be an opaque, stable handle, the way a Pod's name doesn't encode its node. child_name can't produce this: it joins every part into both the hashed value and the visible prefix, so it has no way to fold a discriminator into the hash without also showing it. This change adds a name module with an opaque_name helper that hashes the visible name together with its discriminators but keeps only the visible name in the prefix, then derives a replica's name as the deployment name plus a hash of (deployment, cluster, index). Co-located replicas still get distinct, stable names - my-model-5ab63, my-model-609c5 - and the endpoint reuses the replica name so routing still lands on the right backend. The new module also takes over the replica_key and endpoint_key desired-resource handles, so one place owns every identifier a ModelDeployment derives for its replicas. Signed-off-by: Nic Cope <nicc@rk0n.org>
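The opaque-name idea can be sketched as below. The hash function, separator, and suffix length are assumptions chosen for the example, so the suffixes it produces will not match the ones quoted above; only the shape (visible prefix plus a hash over the discriminators) follows the description.

```python
import hashlib


def opaque_name(visible: str, *discriminators: str, length: int = 5) -> str:
    """Hash the visible name with its discriminators, but show only the visible name."""
    material = "\x00".join((visible, *discriminators)).encode()
    digest = hashlib.sha256(material).hexdigest()
    return f"{visible}-{digest[:length]}"


def replica_name(deployment: str, cluster: str, index: int) -> str:
    """Derive a replica name from (deployment, cluster, index) without exposing placement."""
    return opaque_name(deployment, cluster, str(index))


first = replica_name("my-model", "cluster-a", 0)
second = replica_name("my-model", "cluster-a", 1)
```

The name is stable for a given (deployment, cluster, index) and differs between co-located replicas, while the cluster and index never appear in it.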
A ModelDeployment's nodeSelector matches DRA-style device requests against an InferenceCluster pool's devices. A pool device is either claim: DRA (a real device bound through a DRA ResourceClaim) or claim: Synthetic (matched for fleet scheduling but never claimed - hardware with no DRA driver). The serving workload binds its GPUs through a ResourceClaim built from the matched DRA devices. The scheduler matched a pool whenever every request matched some device, including a pool where the matched devices were all synthetic. Such a pool yields no DRA requests, so the composed ModelReplica carried an empty deviceRequests and its workload got no ResourceClaim - it would schedule with no GPU binding at all, defeating the point of selecting that hardware. A nodeSelector is required precisely so the workload can form a ResourceClaim, so a selector that resolves only to synthetic devices must not schedule. The claim kind lives on the InferenceClass/InferenceCluster device, not the ModelDeployment, so this can't be rejected when the deployment is admitted. This change enforces it where the information converges: _match_pool now treats a pool that resolves to zero DRA requests as a non-match, the same as a pool that fails a selector. A deployment whose nodeSelector matches only synthetic devices finds no eligible pool and reports InsufficientCapacity. Synthetic devices remain co-selectors that refine placement alongside a claimable device. The ModelDeployment XRD documents the requirement. Because the scheduler now only ever pins a replica to a named pool that yields at least one claimable device, nodePoolName and deviceRequests are made required on the ModelReplica. compose-model-deployment always stamps both, and compose-model-replica trusts them: the backends drop the branches that handled a replica with no device requests, and resource_claim_template always composes a ResourceClaimTemplate. Signed-off-by: Nic Cope <nicc@rk0n.org>
semver.py reimplements blang/semver's parsing and precedence ordering for DRA CEL selectors. Several lines mirror a quirk of that upstream implementation with no local cue: the patch/prerelease/build split peels build before prerelease (because build metadata may itself contain '-'), zip uses strict=False because prerelease lists are expected to differ in length, and the character-class checks reproduce blang's specific error cases rather than a regex. This adds commentary to those spots - why a line does what it does, tied back to the upstream behaviour it mirrors - while leaving the precedence contract in the class docstrings. Signed-off-by: Nic Cope <nicc@rk0n.org>
These needed no forward reference; the floor is Python 3.11. Signed-off-by: Nic Cope <nicc@rk0n.org>
test_fn.py built req1..req7/want1..want7 imperatively in one ~780-line method (suppressing a too-many-statements lint) and only assembled them into a case list at the end, with a separate co-located-replicas test asserting dynamically. It read nothing like the table-driven tests elsewhere in the package. This inlines each case's request and response into the cases table, folds the co-located-replicas scenario in as an eighth full-response case, and drops the lint suppression. Input fixtures (the deployment, the clusters, the observed replica) are now built as schema-validated pydantic models and dumped, like the rest of the package; expected responses stay as hand-written dict literals so each assertion is an independent oracle rather than a round-trip through the code under test. Signed-off-by: Nic Cope <nicc@rk0n.org>
  if backend_secrets or backend_exists:
      if backend_secrets:
-         self.compose_kserve_backend(backend_secrets)
+         self.compose_serving_stack(backend_secrets)
Not strictly related to this change, but it was broken in main. I'd like to get type checking working as part of nix flake check - I think it'd have caught this since self.compose_kserve_backend was no longer defined.
A request matches a pool device when the device has enough UNCONSUMED count
to cover the request and every selector evaluates true against that device.
Each resolved DRA request becomes a distinct DeviceRequest in one
ResourceClaim, and DRA allocates distinct devices per request, so a device's
count is consumed as requests claim it: two requests cannot both be satisfied
by the same single-count device, and N requests against one device must fit
within that device's count. Without this accounting the scheduler would place
a replica onto a node DRA can't actually satisfy.
Our scheduler can race if two MDs are being scheduled at the same time. The failure mode is that it could overcommit an IC. We build the capacity ledger from observed state, which could be stale if we're racing with the same function serving another MD. That'll only be a problem if the IC is close to full.
I think we should fix this post-v0.1, so I'll raise a tracking issue. I'm thinking we could have the retain phase notice oversubscribed clusters and drop non-running replicas so they get rescheduled.
Pull request overview
This PR aligns Modelplane’s hardware description, scheduling, and GPU binding with Kubernetes Dynamic Resource Allocation (DRA). It introduces DRA-style device modeling in InferenceClass, device-request-based nodeSelector in ModelDeployment, schedules replicas against per-pool devices surfaced on InferenceCluster.status.gpuPools, and binds GPUs via DRA ResourceClaimTemplate instead of the legacy nvidia.com/gpu device plugin path. It also updates the serving stack to install Node Feature Discovery and the NVIDIA DRA driver, plus updates generated schemas, examples, and docs accordingly.
Changes:
- Add DRA-style device modeling end-to-end: InferenceClass.spec.devices[] → InferenceCluster.status.gpuPools[].devices → ModelDeployment.spec.nodeSelector.devices[] → ModelReplica.spec.deviceRequests[].
- Switch GPU binding from nvidia.com/gpu limits to DRA claims by composing a ResourceClaimTemplate and wiring pod/container resourceClaims/resources.claims.
- Update EKS defaults and serving stack to support DRA (Kubernetes 1.36 default, install NFD + NVIDIA DRA driver), and add CEL/quantity/semver selector evaluation + tests.
Reviewed changes
Copilot reviewed 65 out of 67 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| schemas/python/models/ai/modelplane/modelservice/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/modelreplica/v1alpha1.py | Add deviceRequests + nodePoolName to ModelReplica schema. |
| schemas/python/models/ai/modelplane/modelendpoint/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/modeldeployment/v1alpha1.py | Add nodeSelector.devices[] schema for device-request scheduling. |
| schemas/python/models/ai/modelplane/modelcache/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/infrastructure/servingstack/v1alpha1.py | Add version pins for NFD + NVIDIA DRA driver. |
| schemas/python/models/ai/modelplane/infrastructure/gkecluster/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/infrastructure/ekscluster/v1alpha1.py | Default EKS Kubernetes version to 1.36 for GA DRA support. |
| schemas/python/models/ai/modelplane/inferencegateway/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/inferencecluster/v1alpha1.py | Replace legacy capacity shape with status.gpuPools[] + devices. |
| schemas/python/models/ai/modelplane/inferenceclass/v1alpha1.py | Replace resources.gpu with DRA-style spec.devices[]. |
| schemas/.lock.json | Update schema lock hash for regenerated models. |
| pyproject.toml | Bump crossplane-function-sdk-python dependency group to >=0.13.0. |
| functions/compose-serving-stack/tests/test_fn.py | Expect NFD + NVIDIA DRA driver Helm releases. |
| functions/compose-serving-stack/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-serving-stack/function/fn.py | Compose NFD + NVIDIA DRA driver; avoid emitting null metadata under exclude_unset. |
| functions/compose-model-service/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-model-replica/tests/test_fn.py | Validate DRA claim wiring + readiness policy changes. |
| functions/compose-model-replica/tests/test_backends.py | Validate ResourceClaimTemplate creation and pod claim wiring across backends. |
| functions/compose-model-replica/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-model-replica/function/fn.py | Update inferencecluster capacity assumptions and XR readiness handling. |
| functions/compose-model-replica/function/backends/native.py | Switch native backend to DRA claim-based GPU binding + wrap_object readiness. |
| functions/compose-model-replica/function/backends/llmd.py | Switch llm-d backend to DRA claim-based GPU binding + wrap_object readiness. |
| functions/compose-model-replica/function/backends/base.py | Add shared helpers for readiness policies and ResourceClaimTemplate composition. |
| functions/compose-model-endpoint/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-model-deployment/tests/test_semver.py | Add parity tests for semver CEL surface. |
| functions/compose-model-deployment/tests/test_quantity.py | Add parity tests for quantity CEL surface. |
| functions/compose-model-deployment/tests/test_cel.py | Add parity tests for DRA-style selector evaluation over devices. |
| functions/compose-model-deployment/pyproject.toml | Add cel-python dependency and bump function SDK version. |
| functions/compose-model-deployment/function/semver.py | Implement semver CEL library parity with upstream behavior. |
| functions/compose-model-deployment/function/quantity.py | Implement quantity CEL library parity with upstream behavior. |
| functions/compose-model-deployment/function/name.py | Add replica naming keyed by (cluster, index) to avoid collisions. |
| functions/compose-model-deployment/function/fn.py | Handle invalid CEL selectors; compose per-replica deviceRequests + pool pins. |
| functions/compose-model-deployment/function/cel.py | Implement DRA-style CEL device activation + evaluation via celpy. |
| functions/compose-inference-gateway/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-inference-cluster/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-inference-cluster/function/fn.py | Publish status.gpuPools devices; switch backend XR from KServeBackend to ServingStack. |
| functions/compose-inference-class/tests/test_fn.py | Update tests for spec.devices[] schema. |
| functions/compose-inference-class/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-inference-class/function/fn.py | Update docs/comments to reflect “devices” instead of “resources”. |
| functions/compose-gke-cluster/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-eks-cluster/tests/test_fn.py | Update expected EKS version; remove device-plugin expectations. |
| functions/compose-eks-cluster/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-eks-cluster/function/fn.py | Remove NVIDIA device plugin composition; rely on DRA stack. |
| flake.nix | Repin crossplane CLI to upstream main for bool/int field-name generation fix. |
| flake.lock | Update flake lock for the crossplane CLI repin. |
| examples/qwen-demo/02-class.yaml | Update example InferenceClass to devices[]. |
| examples/platform/inference-class-h100-byo.yaml | Update BYO class example to DRA-style devices (incl. synthetic NIC). |
| examples/platform/inference-class-gke-l4.yaml | Update GKE class example to DRA-style devices. |
| examples/platform/inference-class-eks-l4.yaml | Update EKS class example to DRA-style devices. |
| examples/deployment/model-deployment.yaml | Add required nodeSelector.devices[] example with DRA CEL. |
| examples/deployment/model-deployment-multinode.yaml | Add multi-node deployment nodeSelector.devices[] example. |
| docs/content/getting-started.md | Update narrative to describe nodeSelector + DRA ResourceClaim binding. |
| docs/content/concepts.md | Update concepts to describe device-based scheduling and DRA claiming. |
| design/design.md | Update design doc to device-request nodeSelector and DRA claim formation. |
| apis/servingstacks/definition.yaml | Add schema fields for NFD + NVIDIA DRA driver versions. |
| apis/modelreplicas/definition.yaml | Require nodePoolName + deviceRequests; define schema for requests/selectors. |
| apis/modeldeployments/definition.yaml | Require nodeSelector; define device-request schema and selector constraints. |
| apis/inferenceclusters/definition.yaml | Replace status.capacity.gpuPools with status.gpuPools[].devices. |
| apis/inferenceclasses/definition.yaml | Replace spec.resources with spec.devices (DRA-style). |
| apis/eksclusters/definition.yaml | Default EKS Kubernetes version to 1.36 for GA DRA support. |
compose-model-replica marked the Service, HTTPRoute, and ResourceClaimTemplate ready as soon as it composed them, before the resources existed on the workload cluster. Crossplane takes a function's per-resource readiness at face value when it computes the XR's Ready condition, so marking a resource ready is the function asserting it observed that resource ready - not a standing "treat it as ready once it exists" instruction. Asserting readiness for a resource that hasn't been applied yet claims something the function never observed. This gates every per-resource readiness mark on the resource being present in observed state. The Service, HTTPRoute, and ResourceClaimTemplate have no runtime readiness to wait on, so observing them is enough; the workload still additionally gates on the model serving. A freshly composed resource isn't observed yet, so it stays unready for one reconcile until Crossplane applies it and observes it back. Signed-off-by: Nic Cope <nicc@rk0n.org>
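The gating rule can be sketched as a pure function over resource names. The function and its arguments are hypothetical; model-serving is the workload name used earlier in this PR, and the other names are placeholders.

```python
def ready_resources(
    desired: set[str],
    observed: set[str],
    workload: str,
    model_ready: bool,
) -> set[str]:
    """Return the composed resources the function may mark ready this reconcile."""
    ready: set[str] = set()
    for name in desired:
        if name not in observed:
            continue  # Not applied and observed back yet: never assert readiness.
        if name == workload and not model_ready:
            continue  # The workload additionally waits for the model to serve.
        ready.add(name)
    return ready


desired = {"model-serving", "service", "http-route"}
# First reconcile: nothing has been observed yet, so nothing is ready.
first_pass = ready_resources(desired, set(), "model-serving", model_ready=False)
# Later: everything exists, but the model isn't serving yet.
second_pass = ready_resources(desired, desired, "model-serving", model_ready=False)
```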
The control-plane scheduler matches a ModelDeployment's nodeSelector against an InferenceCluster pool's devices, picks a pool, and stamps it onto the ModelReplica as spec.nodePoolName. But the serving pod carried only its DRA ResourceClaim, with nothing tying it to that pool, so the workload cluster's scheduler was free to place it on any pool whose devices satisfied the claim. Two things broke as a result. The control plane's per-pool node-capacity accounting keys on nodePoolName, so it charged the scheduled pool while the pod ran elsewhere - and nothing ever reconciled the two, since the control plane never observes where a pod actually lands. The drift is permanent, not transient. Worse, a claim: Synthetic device (matched for placement but never turned into a DRA request - hardware with no driver, like an InfiniBand fabric) has no claim binding it, so pool selection is its only enforcement. Without a pin a pod could schedule onto a pool that lacks the synthetic hardware the model was placed for, silently, while serving. This pins every serving pod - the native Deployment pod and both llm-d LeaderWorkerSet templates - to its pool with a nodeSelector on the modelplane.ai/pool node label, valued at spec.nodePoolName. The EKS and GKE provisioning functions already stamp this label on every node group. On a BYO cluster Modelplane doesn't provision the nodes, so the operator must label them to match; the InferenceClass XRD documents this, and a missing label leaves the pod Pending rather than misplaced. The pinning joins the DRA claim wiring in one place, renamed place_pod. Signed-off-by: Nic Cope <nicc@rk0n.org>
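A rough sketch of the pinning step, assuming pod specs are handled as plain dicts. The label key and `spec.nodePoolName` come from the commit message; the `pin_to_pool` helper and its signature are hypothetical and are not the real `place_pod`.

```python
import copy
from typing import Any

# Node label named in the commit message above.
POOL_LABEL = "modelplane.ai/pool"


def pin_to_pool(pod_spec: dict[str, Any], node_pool_name: str) -> dict[str, Any]:
    """Return a copy of pod_spec pinned to the scheduled pool.

    Any existing nodeSelector entries are kept; only the pool label is set.
    """
    pinned = copy.deepcopy(pod_spec)
    pinned.setdefault("nodeSelector", {})[POOL_LABEL] = node_pool_name
    return pinned


original = {"containers": [], "nodeSelector": {"kubernetes.io/arch": "amd64"}}
pinned = pin_to_pool(original, "gpu-pool-a")
```

Returning a copy keeps the caller's template untouched, which is convenient when one template feeds several pod specs.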
The compose-model-cache function is declared in crossplane-project.yaml and is a member of the uv workspace, but it was never added to the functionNames list in flake.nix. That list drives two things: the OCI images nix builds for each function, and the per-function unit test checks. So the function's image was never built and its tests never ran. crossplane project build resolves every function in the project file against _output/functions, so a missing image fails the whole package build - the function can't be packaged or installed at all. CI's push-package job has been silently skipping for the same reason. Add compose-model-cache to functionNames so nix builds its image and runs its tests alongside the others. Signed-off-by: Nic Cope <nicc@rk0n.org>
dennis-upbound
left a comment
Went deep on the DRA matching path — ran the compose-model-deployment tests and reproduced the two correctness items below against celpy. Solid overall; notes inline. The quantity and match-pool ones are the substantive ones, the rest are nits / a doc-vs-code mismatch.
dennis-upbound
left a comment
approved with the comments
The quantity, semver, and cel modules reimplement Kubernetes' resource.Quantity and DRA device-selector CEL by hand, so it's easy to drift from upstream in ways tests that share the same assumptions won't catch. To check, this adds a parity oracle (tests/oracle) that runs inputs through the real apimachinery and dynamic-resource-allocation code and prints what upstream actually does. It surfaced several places where the reimplementation was wrong. quantity crashed on large binary-suffix values like 10Ei (decimal.InvalidOperation under Python's default 28-digit context), and where it did parse, it computed a value Kubernetes does not: resource.Quantity saturates binary-suffix overflow to int64-max, so 8Ei, 10Ei, and 100Ei all compare equal. cel exposed allowMultipleAllocations, absent from the DRA surface this targets, and a couple of tests asserted expression forms a real cluster rejects at compile time. Round to nano scale under a wide local precision so large decimal-path quantities parse, and saturate binary-suffix overflow to int64-max to match resource.Quantity.Cmp. Drop allowMultipleAllocations. Move the affected tests to the forms upstream accepts, and document the call-style and has()-index leniencies celpy can't avoid, plus the bare-suffix and decimal-exponent corners deliberately not reproduced. The oracle is a developer tool, not part of the suite: run it on demand via nix shell when changing these modules or bumping the target Kubernetes version, then transcribe its answers into the tests, which stay the regression guard. Signed-off-by: Nic Cope <nicc@rk0n.org>
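The saturation behaviour described above can be illustrated with a small sketch. It covers only non-negative integer mantissas with a binary suffix, and `parse_binary_quantity` is a hypothetical helper, not the repo's `quantity` module.

```python
INT64_MAX = 2**63 - 1

# Binary (power-of-two) suffixes and their multipliers.
_BINARY_SUFFIXES = {
    "Ki": 2**10,
    "Mi": 2**20,
    "Gi": 2**30,
    "Ti": 2**40,
    "Pi": 2**50,
    "Ei": 2**60,
}


def parse_binary_quantity(text: str) -> int:
    """Parse e.g. '10Ei', saturating overflow to int64-max.

    Assumes a non-negative integer mantissa; anything else raises ValueError.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            mantissa = int(text[: -len(suffix)])
            if mantissa < 0:
                raise ValueError(f"negative quantities not handled: {text!r}")
            # Python ints are unbounded, so clamp explicitly.
            return min(mantissa * factor, INT64_MAX)
    raise ValueError(f"no binary suffix in {text!r}")
```

Because 8Ei is exactly 2^63, one past int64-max, it already saturates, which is why 8Ei, 10Ei, and 100Ei all compare equal in this sketch.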
The docstring described the count accounting as if it made pool selection sound, but assignment is greedy in request order, so it can falsely reject a pool when two requests' selectors overlap on a shared device kind. Spell out the greedy behaviour and why we accept it: the case needs overlapping rather than disjoint selectors (a shape no real workload writes) and fails safe as InsufficientCapacity, never an overcommit or a bad placement. Signed-off-by: Nic Cope <nicc@rk0n.org>
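To make the false rejection concrete, here is a toy model of greedy, in-order assignment. It is an illustration under stated assumptions: selectors are reduced to sets of device kinds rather than CEL, and `greedy_fits` is a hypothetical stand-in for the real pool-matching code.

```python
def greedy_fits(pool: dict[str, int], requests: list[tuple[set[str], int]]) -> bool:
    """Assign each request, in order, from the first matching device kinds.

    pool maps a device kind to how many a node has; each request is
    (kinds its selector matches, count needed).
    """
    remaining = dict(pool)
    for matching_kinds, count in requests:
        needed = count
        for kind in remaining:
            if kind in matching_kinds and needed > 0:
                taken = min(remaining[kind], needed)
                remaining[kind] -= taken
                needed -= taken
        if needed > 0:
            return False  # surfaces as a rejection, never an overcommit
    return True


pool = {"a": 1, "b": 1}

# The broad request grabs "a" first, leaving nothing for the narrow one.
overlapping = greedy_fits(pool, [({"a", "b"}, 1), ({"a"}, 1)])

# Same requests, narrow one first: a valid assignment is found.
reordered = greedy_fits(pool, [({"a"}, 1), ({"a", "b"}, 1)])

# Disjoint selectors, the shape real workloads write, are unaffected.
disjoint = greedy_fits(pool, [({"a"}, 1), ({"b"}, 1)])
```

The first case is rejected even though a satisfying assignment exists, and the failure mode is a refusal rather than a bad placement.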
The native and llm-d backends each defined their own _ENGINE_PORT and _LABEL_SERVING with identical values, and each aliased base.REMOTE_NAMESPACE to a local _REMOTE_NAMESPACE that added nothing. The engine port especially is a contract - the ModelEndpoint URLs assume it - so two copies that can silently disagree is a hazard. Move ENGINE_PORT and LABEL_SERVING to base alongside REMOTE_NAMESPACE, and reference base directly from both backends instead of re-declaring or aliasing them. Signed-off-by: Nic Cope <nicc@rk0n.org>
The design doc gave nodes-per-replica as pipeline * data / dataLocal, but that's nodes-per-worker. A replica has workers.count workers, so the scheduler gates on nodes-per-worker times workers.count - which is what topology_shape actually computes. For a disaggregated decode role with count 3 and pipeline 2 the doc implied 2 nodes where the scheduler consumes 6. Multiply by workers.count in both the topology section and the scheduler steps. Signed-off-by: Nic Cope <nicc@rk0n.org>
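The corrected arithmetic, as a short sketch. The formula comes from the commit message; the `nodes_per_replica` helper, the integer division, and the `data = dataLocal = 1` values in the example are assumptions, and this is not the actual `topology_shape`.

```python
def nodes_per_replica(pipeline: int, data: int, data_local: int, workers_count: int) -> int:
    """Nodes a replica consumes: nodes-per-worker times workers.count."""
    # Assumes data is a multiple of data_local, so integer division is exact.
    nodes_per_worker = pipeline * data // data_local
    return nodes_per_worker * workers_count


# Decode role from the commit message: count 3, pipeline 2.
# data and dataLocal are assumed equal here, so nodes-per-worker is 2.
decode_nodes = nodes_per_replica(pipeline=2, data=1, data_local=1, workers_count=3)
```

Dropping the `workers_count` factor gives 2 for this role, which is the figure the old doc implied.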
The example predated #101 merging, where nodeSelector became required: show it on both the decode and prefill roles, each with its own device selectors (illustrating distinct per-role hardware), and model the InfiniBand fabric as a Synthetic device. Fold the operator "when to use" guidance into the summary as background. Give the routing-discriminator alternative an API sketch so the discriminator-vs-template tradeoff is concrete, and add the "two ModelDeployments" alternative with why a single MD is better (co-location and that a prefill-only MD isn't conceptually a model deployment). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Fixes #56.
Fixes #103.
A `ModelDeployment` could pick a cluster by label but couldn't steer a replica to a node pool with the right hardware. Its `nodeSelector.cel` matched against a pool's flattened attributes in a dialect that resembled DRA CEL but didn't behave like it, and couldn't describe a node with more than one device. `InferenceClass` had no DRA-style way to declare a node's hardware, and worker pods bound GPUs through the legacy `nvidia.com/gpu` device plugin rather than DRA.

This makes hardware matching look and feel like DRA, end to end:

- `InferenceClass` advertises a pool's node's list of devices, like a DRA `ResourceSlice`
- `ModelDeployment` specifies a list of device requests, like a DRA `ResourceClaim`

An `InferenceClass` describes a pool's hardware as DRA devices. Each has a `driver`, a `count`, typed `attributes` and `capacity`, and a `claim` discriminator: `DRA` for hardware a real driver exposes and claims at admission, `Synthetic` for hardware that matters for placement but has no driver yet (an InfiniBand fabric, say).

A `ModelDeployment` requests devices in DRA CEL. The scheduler matches each request against a pool's devices and pins the replica to one that satisfies them. When a `ModelDeployment` requests a device the `InferenceClass` marks as `claim: DRA`, that request is passed through as the real workload's (i.e. Deployment's or LWS's) DRA `ResourceClaim`. `nodeSelector` is required, and at least one request must resolve to a claimable device.

A `ModelDeployment` fans out to `spec.replicas`, spread across clusters and packed onto fewer only when capacity forces it. A replica's identity is a `(cluster, index)` pair so co-located replicas don't collide. A replica is never re-homed: if its cluster goes away it's replaced, like a Pod whose node is gone. Editing the `nodeSelector` rolls replicas whose pinned pool no longer matches.

EKS is bumped to 1.36, where DRA is GA.

The `cel`, `quantity`, and `semver` modules reimplement the DRA device-selector CEL surface on the pure-Python celpy evaluator, since base CEL and celpy don't ship the Kubernetes extensions. They're matched against the upstream libraries (`k8s.io/apiserver/pkg/cel/library`, `blang/semver`, `resource.Quantity`), with test tables mirroring the upstream suites. The divergences celpy can't reach are documented at the top of `cel.py`.

Validated end to end on a provisioned EKS 1.36 cluster: the DRA driver published an L4 as a ResourceSlice, a `ModelDeployment` with a GPU `nodeSelector` scheduled a replica, DRA allocated the L4 to the serving pod, and the pod served Qwen2.5-0.5B through the gateway.

I have:

- `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- `git commit -s`.