Match hardware as DRA-style devices and form DRA ResourceClaims #101

Merged
negz merged 26 commits into main from celular
Jun 10, 2026

Conversation

negz commented Jun 3, 2026

Collaborator

Fixes #56.
Fixes #103.

A ModelDeployment could pick a cluster by label but couldn't steer a replica to a node pool with the right hardware. Its nodeSelector.cel matched against a pool's flattened attributes in a dialect that resembled DRA CEL but didn't behave like it, and couldn't describe a node with more than one device. InferenceClass had no DRA-style way to declare a node's hardware, and worker pods bound GPUs through the legacy nvidia.com/gpu device plugin rather than DRA.

This makes hardware matching look and feel like DRA, end to end:

  • InferenceClass advertises the list of devices on each node of a pool, like a DRA ResourceSlice
  • ModelDeployment specifies a list of device requests, like a DRA ResourceClaim
  • The scheduler uses these to match deployments to pools

An InferenceClass describes a pool's hardware as DRA devices. Each has a driver, a count, typed attributes and capacity, and a claim discriminator: DRA for hardware a real driver exposes and claims at admission, Synthetic for hardware that matters for placement but has no driver yet (an InfiniBand fabric, say).

apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  provisioning:
    provider: EKS
    eks: { instanceType: g6.xlarge, accelerator: { type: nvidia-l4, count: 1 } }
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      memory: { value: "23034Mi" }   # real usable VRAM, not the nominal 24GB

A ModelDeployment requests devices in DRA CEL. The scheduler matches each request against a pool's devices and pins the replica to a pool whose devices satisfy them. When a ModelDeployment requests a device the InferenceClass marks as claim: DRA, that request is passed through into the DRA ResourceClaim of the real workload (a Deployment or LeaderWorkerSet).

apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata: { name: qwen-demo, namespace: ml-team }
spec:
  replicas: 1
  nodeSelector:
    devices:
    - name: gpu
      count: 1
      selectors:
      - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("4Gi")) >= 0
  workers:
    topology: { tensor: 1 }
    template:
      spec:
        containers:
        - name: engine
          image: vllm/vllm-openai:v0.7.3
          args: ["--model=Qwen/Qwen2.5-0.5B-Instruct"]

nodeSelector is required, and at least one request must resolve to a claimable device.

A ModelDeployment fans out to spec.replicas replicas, spread across clusters and packed onto fewer only when capacity forces it. A replica's identity is a (cluster, index) pair so co-located replicas don't collide. A replica is never re-homed: if its cluster goes away it's replaced, like a Pod whose node is gone. Editing the nodeSelector rolls replicas whose pinned pool no longer matches.

EKS is bumped to 1.36, where DRA is GA.

The cel, quantity, and semver modules reimplement the DRA device-selector CEL surface on the pure-Python celpy evaluator, since base CEL and celpy don't ship the Kubernetes extensions. They're matched against the upstream libraries (k8s.io/apiserver/pkg/cel/library, blang/semver, resource.Quantity), with test tables mirroring the upstream suites. The divergences celpy can't reach are documented at the top of cel.py.

Validated end to end on a provisioned EKS 1.36 cluster: the DRA driver published an L4 as a ResourceSlice, a ModelDeployment with a GPU nodeSelector scheduled a replica, DRA allocated the L4 to the serving pod, and the pod served Qwen2.5-0.5B through the gateway.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

negz changed the title from "Implement CEL-based nodeSelector matching for ModelDeployment" to "Match hardware as DRA-style devices and form DRA ResourceClaims" Jun 6, 2026
negz force-pushed the celular branch 3 times, most recently from c364b46 to 9417d10 June 9, 2026 06:10
negz added 14 commits June 9, 2026 17:26
ModelDeployment's nodeSelector was a single CEL expression matched
against a pool's merged attributes and capacity, in a flat dialect that
resembled DRA CEL but didn't behave like it (attributes[...]/capacity[...]
indexing, version() rather than semver()). It also couldn't describe a
node with more than one kind of device: a GPU and a NIC flattened onto
one synthetic device, so a ModelDeployment couldn't filter on both, and
the ResourceClaim translation wasn't mechanical for multi-device pools.

This change reworks the design so InferenceClass and nodeSelector
describe hardware as a list of DRA-style devices. InferenceClass gains a
devices[] list, each with a driver, count, typed attributes, capacity,
and a claim discriminator (DRA or Synthetic) that says whether the
device is claimed via a ResourceClaim or described for scheduling only.
This replaces resources.gpu and the modelplane.ai/* attribute-prefix
convention. nodeSelector becomes devices[], each request carrying a
name, count, and a selectors[] list whose entries are real DRA CEL
(device.attributes["domain"].name, quantity(), semver()).

Scheduling no longer derives a physical shape from topology. A device
request's count subsumes the old GPUs-per-node check, so the scheduler
evaluates the CEL requests and gates only on available nodes. Topology
becomes purely a provisioning concern. ResourceClaim translation is now
one DeviceRequest per claim: DRA request, with synthetic devices
dropped.

For #103.

Signed-off-by: Nic Cope <nicc@rk0n.org>
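
The "one DeviceRequest per claim: DRA request, with synthetic devices dropped" translation can be sketched in a few lines of Python. This is an illustration, not the function's code: the input dicts (name, claim, deviceClassName, count, selectors) are a hypothetical shape for a resolved request, and the `exactly` sub-request layout is my reading of the resource.k8s.io/v1 API, so check it against the API reference before relying on it.

```python
from typing import Any


def resource_claim_template(name: str, requests: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a ResourceClaimTemplate manifest from resolved device requests.

    Each request is a hypothetical dict carrying the claim kind of the pool
    device it matched, plus the CEL selector strings from the nodeSelector.
    """
    device_requests = []
    for req in requests:
        if req["claim"] != "DRA":
            continue  # Synthetic devices steer placement but are never claimed.
        device_requests.append(
            {
                "name": req["name"],
                "exactly": {
                    "deviceClassName": req["deviceClassName"],
                    "allocationMode": "ExactCount",
                    "count": req["count"],
                    "selectors": [{"cel": {"expression": s}} for s in req["selectors"]],
                },
            }
        )
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        # The template wraps a ResourceClaim spec, hence the nested spec.
        "spec": {"spec": {"devices": {"requests": device_requests}}},
    }
```

A pool with a claimable GPU and a synthetic fabric device would yield a template with a single request for the GPU; the fabric request only influenced which pool was picked.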
The repo pinned the Crossplane CLI to negz/cli:diy, a fork branch carrying
an unreleased datamodel-code-generator bump. That bump (crossplane/cli#24
and #64) has since merged to crossplane/cli main, so this repins the CLI to
crossplane/cli directly and regenerates the Python models. The regen reflows
the affected models with the newer generator (mostly Optional[X] -> X | None).

The newer generator (datamodel-code-generator 0.59.0) emits object-typed
field defaults as a default_factory rather than a plain value. The Crossplane
SDK's resource.update serializes composed resources with
model_dump(exclude_defaults=True), which no longer recognizes the
factory-built default as equal to the declared default, so unset fields leak
into composed resources. This keeps crossplane-function-sdk-python pinned to
a git branch that serializes with exclude_unset instead, which asks "did the
function set this
field?" rather than "is it different from its default?" - which is the correct
question under server-side apply and immune to how a default is represented.

Switching the whole repo to exclude_unset surfaces a few places that
explicitly set fields to None or to a defaulted value, which exclude_defaults
previously dropped. compose-serving-stack built provider-kubernetes Objects
and Helm Releases with metadata=None and ObjectMeta(namespace=None); those now
only set the field when it's present. The compose-inference-cluster and
compose-model-deployment test fixtures are updated to reflect that explicitly
set values (a node pool's kubernetesVersion and diskSizeGb, a replica's worker
count and pipeline) now appear in composed resources.

Signed-off-by: Nic Cope <nicc@rk0n.org>
ModelDeployment matched hardware with a single nodeSelector.cel expression
over a pool's flat, merged attributes and capacity. It resembled DRA CEL but
didn't behave like it, and it couldn't describe a node with more than one kind
of device: a GPU and a NIC flattened onto one synthetic device, so a deployment
couldn't filter on both and the translation to a DRA ResourceClaim wasn't
mechanical. Serving pods bound GPUs through the legacy nvidia.com/gpu device
plugin, not DRA.

This reworks hardware as a list of DRA-style devices, end to end:

InferenceClass declares spec.devices[]: each device has a driver, a count, a
deviceClassName, typed attributes and capacity, and a claim discriminator (DRA
or Synthetic). InferenceCluster copies a pool's devices verbatim into
status.capacity.gpuPools[].devices for the scheduler to match against.

ModelDeployment.nodeSelector becomes devices[]: each request carries a name, a
count, and a list of CEL selectors that are real DRA CEL evaluated against one
device (device.driver, device.attributes["<driver>"].<name>,
device.capacity["<driver>"].<name>, with quantity() and semver()). The
scheduler matches each request against a pool's devices, consuming device count
across requests so it never places a replica onto a node DRA can't satisfy, and
gates only on available nodes. It resolves the matched pool's claim: DRA devices
into ModelReplica.spec.deviceRequests.

compose-model-replica forms a DRA ResourceClaimTemplate (resource.k8s.io/v1)
from those requests and wires every serving pod - the native Deployment pod and
both the llm-d LeaderWorkerSet leader and worker - to claim through it, in place
of the nvidia.com/gpu limit. A replica with no device requests (no nodeSelector)
falls back to the device-plugin limit, so existing deployments are unaffected.

The CEL extensions (quantity(), semver(), compareTo/isGreaterThan/isLessThan)
are reimplemented for the pure-Python celpy evaluator; quantity arithmetic uses
Decimal for exact ordering, and any evaluation error is treated as a non-match
so arbitrary user CEL can't crash a reconcile.

For #103.

Signed-off-by: Nic Cope <nicc@rk0n.org>
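
To make the last paragraph of the commit above concrete, here is a minimal sketch of exact quantity comparison with Decimal, where any evaluation error counts as a non-match. It is not the quantity module from this PR: the function names are hypothetical and only a subset of Kubernetes quantity suffixes is handled (no exponent notation, no canonical-form rules).

```python
from decimal import Decimal, InvalidOperation

# Multipliers for a subset of Kubernetes quantity suffixes.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal(10) ** -3,
}


def parse_quantity(text: str) -> Decimal:
    """Parse a quantity string into an exact Decimal number of base units."""
    # Try two-character binary suffixes before one-character decimal ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def compare_to(a: str, b: str) -> int:
    """Return -1, 0, or 1, mirroring the shape of CEL's compareTo."""
    left, right = parse_quantity(a), parse_quantity(b)
    return (left > right) - (left < right)


def selector_matches(capacity: str, minimum: str) -> bool:
    """Evaluate capacity >= minimum; an unparseable quantity is a non-match."""
    try:
        return compare_to(capacity, minimum) >= 0
    except (InvalidOperation, ValueError):
        return False
```

With these definitions, selector_matches("23034Mi", "4Gi") is true, selector_matches("23034Mi", "24Gi") is false, and a malformed capacity such as "lots" is false rather than an exception.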
A ModelDeployment couldn't place more than one ModelReplica on the same
InferenceCluster. The scheduler keyed every replica, endpoint, and
retention decision on the cluster name alone, so a second replica on a
cluster would collide with the first. A deployment wanting three replicas
across two clusters could only ever fill two, even when a cluster had
ample spare nodes.

This change makes a replica's identity the pair (cluster, index), where
index is a per-cluster-local integer that distinguishes co-located
replicas. The index is a collision breaker, not an ordering: replicas are
fungible, and a replica never moves cluster. Desired-resource keys become
replica-{cluster}-{index} and endpoint-{cluster}-{index}; names are
hashed from (deployment, cluster, index) so co-located replicas don't
collide. A new modelplane.ai/replica-index label carries the index, which
the scheduler reads back to reconstruct identity from observed state.

The scheduler is restructured as two phases over the observed state.
Retain keeps each existing replica on its (cluster, index) when the
cluster still exists and its pinned pool still matches the nodeSelector,
and never moves a healthy replica to rebalance. Fill places any shortfall
one replica at a time onto the eligible cluster hosting the fewest of this
deployment's replicas, so replicas spread across clusters and pack onto
fewer only when capacity forces it. Scale-down drops the highest-index
replica on the most-loaded cluster first, consolidating without emptying a
cluster that still holds a sole replica.

Placement runs against a node-capacity ledger built from each pool's
published nodes minus the replicas committed to it: other deployments'
replicas and this deployment's retained replicas, each charged at its own
observed node cost. Replicas dropped by retain or scale-down are not
charged, since their nodes are freeing up for the replicas that replace
them. The fill phase decrements the ledger as it places each replica, so a
single pass can't overcommit a cluster.

Cross-deployment device-count contention remains the workload cluster's
DRA admission to resolve; the control-plane scheduler stays coarse and
gates only on nodes.

Signed-off-by: Nic Cope <nicc@rk0n.org>
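
The fill phase described in the commit above can be sketched as follows. This is a simplified illustration under stated assumptions: the ledger is a plain dict of free nodes per cluster, every replica costs the same number of nodes, ties between equally loaded clusters break by name, and a new replica takes the lowest unused index. None of those details are confirmed by the commit message.

```python
def fill(
    shortfall: int,
    free_nodes: dict[str, int],
    retained: dict[str, set[int]],
    node_cost: int = 1,
) -> list[tuple[str, int]]:
    """Place up to `shortfall` replicas and return their (cluster, index) identities.

    free_nodes is the ledger: nodes left per cluster after committed replicas.
    retained maps a cluster to the indices this deployment already holds there.
    """
    free = dict(free_nodes)
    used = {cluster: set(retained.get(cluster, set())) for cluster in free}
    placed = []
    for _ in range(shortfall):
        eligible = [c for c in free if free[c] >= node_cost]
        if not eligible:
            break  # Not enough capacity for the remaining replicas.
        # Spread: prefer the cluster hosting the fewest of this deployment's replicas.
        cluster = min(eligible, key=lambda c: (len(used[c]), c))
        # The index only breaks collisions, so any unused integer works.
        index = next(i for i in range(len(used[cluster]) + 1) if i not in used[cluster])
        used[cluster].add(index)
        free[cluster] -= node_cost  # Decrement so one pass can't overcommit.
        placed.append((cluster, index))
    return placed
```

For three replicas over clusters a (two free nodes) and b (one free node), this places one replica on each cluster first and packs the third onto a only because b is full.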
DRA names a device attribute's typed value bool and int, but the
InferenceClass and InferenceCluster schemas used boolean and integer
instead. This worked around crossplane/cli#63, where the project build
generated broken Python schemas for OpenAPI fields named int or bool.

That issue is now fixed, and the pinned Crossplane CLI commit carries the
fix, so the workaround is no longer needed. This change renames the
attribute one-of fields back to DRA's bool and int across both XRDs, the
exactly-one validation rule, and the CEL activation that reads them, and
regenerates the models.

The generated models can't use bool and int as Python attribute names, so
the code generator emits bool_ and int_ aliased to the bool and int wire
names. model_dump defaults to the Python attribute names, which would hide
the values from the CEL selector (it reads the wire names). The two device
model_dump calls that feed published capacity and selector matching now
pass by_alias=True so the wire shape keeps DRA's bool and int.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The cluster's published capacity lived under status.capacity, whose only
member was gpuPools. The wrapper added a redundant level: callers wrote
and read status.capacity.gpuPools where status.gpuPools would do.

It also collided with the DRA device capacity field added by the device
redesign. A device's typed capacity quantities live under
device.capacity, which the model generator names Capacity. Two schemas
can't share a class name, so the status wrapper was generated as
CapacityModel, an artifact leaking into every caller that constructed it.

This change removes the wrapper and moves gpuPools to the top of the
InferenceCluster status. The generated status type now references GpuPool
directly, and Capacity is free for the DRA device field it belongs to.
The wrapper isn't worth keeping for future status fields: the obvious
candidate, observed capacity, would replace the declared pools as a
higher-fidelity signal rather than sit beside them, so it needs no
grouping object.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The functions pinned function-sdk-python to a git branch carrying two
unreleased serialization fixes: exclude_unset instead of exclude_defaults,
and by_alias so keyword-named fields like bool and int serialize under
their wire names. The git pin was built from an sdist, so nix/checks.nix
and nix/functions.nix injected hatchling as a build-system input that the
wheel overlay doesn't provide.

Both fixes shipped in v0.13.0 on PyPI. This drops the git source pin,
bumps the dependency to >=0.13.0 across the workspace, removes the
hatchling build-system overlays, and relocks. The SDK now resolves as a
PyPI wheel like every other dependency.

Signed-off-by: Nic Cope <nicc@rk0n.org>
When an EKS-backed InferenceCluster became ready, compose-inference-cluster
called self.compose_kserve_backend, a method that no longer exists after
the KServe backend was replaced with the ServingStack. The function raised
AttributeError and the InferenceCluster never composed its backend, so it
never reached Ready. The EKS Usage that blocks cluster deletion until the
backend is gone also still referenced the old KServeBackend kind.

The GKE and Existing paths were already updated to compose a ServingStack;
the EKS path was missed. This change points the EKS path at
compose_serving_stack and the Usage's by reference at ServingStack, matching
the other two.

Signed-off-by: Nic Cope <nicc@rk0n.org>
A serving workload's provider-kubernetes Object used the DeriveFromObject
readiness policy, which mirrors the wrapped resource's Ready condition. A
Deployment and a LeaderWorkerSet publish Available, not Ready, so the Object
never became ready: the ModelReplica stayed at ModelReady=False even while
the model was serving, and the ModelDeployment's replica count never caught
up. The device-plugin DaemonSet compose-eks-cluster installed had the same
problem - a DaemonSet has no Ready condition either - and only looked ready
because it fell through to the provider's SuccessfulCreate default.

The workload Objects (the native Deployment and the llm-d LeaderWorkerSet)
now derive readiness from a CEL query over their Available condition. The
Service, HTTPRoute, and ResourceClaimTemplate Objects have no runtime
readiness and are explicitly ready on create.

The device plugin goes away entirely. It existed to advertise the legacy
nvidia.com/gpu extended resource for replicas that bind GPUs through a
device-plugin limit rather than DRA. The design binds GPUs via DRA on every
cluster, so that fallback shouldn't exist: engine_resources references the
pod's DRA claim when the replica has device requests and claims nothing
otherwise, and compose-eks-cluster no longer installs the DaemonSet or its
deletion-ordering Usages.

Signed-off-by: Nic Cope <nicc@rk0n.org>
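
The Available-based readiness rule from the commit above reduces to one predicate over status.conditions. The change itself expresses it as a CEL query in the provider-kubernetes readiness policy, which isn't reproduced here; this is the same check in Python, assuming the standard Kubernetes condition shape and a hypothetical function name.

```python
def is_available(observed: dict) -> bool:
    """Return True when the observed workload reports Available=True.

    A Deployment or LeaderWorkerSet publishes Available rather than Ready,
    so a check for a Ready condition would never succeed.
    """
    conditions = observed.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Available" and c.get("status") == "True"
        for c in conditions
    )
```

An object with no status yet, or with Available=False while it rolls out, is simply not ready.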
GPUs bind to pods only through DRA: each claim: DRA request in a
ModelDeployment's nodeSelector becomes a DeviceRequest in the ResourceClaim
the serving pods claim GPUs through. A deployment with no nodeSelector
produced a replica with no device requests, and so no GPU. nodeSelector was
optional, defaulting to "match any pool", which only ever worked because GPUs
could also bind via a device plugin.

Modelplane won't infer a request. A request's selectors are how the ML team
says what the model needs - a 0.5B model and a 70B model want very different
GPUs - and an inferred "any GPU" request would schedule a model onto whatever
pool has a free device and hope it fits.

This makes nodeSelector required on the XRD (at least one device, each with at
least one selector) and removes the scheduler's no-nodeSelector path:
compile_requests always returns the compiled requests, and every placed
replica carries the device requests its pool resolved. The example
deployments gain a GPU nodeSelector so they keep working.

Signed-off-by: Nic Cope <nicc@rk0n.org>
GPUs bind via Dynamic Resource Allocation, which was off by default (alpha,
then beta) until Kubernetes 1.34, where the core APIs went GA. EKS clusters
defaulted to 1.31, so DRA wasn't available, and managed EKS gives no way to
enable an alpha feature gate on the control plane.

This defaults both the EKSCluster and InferenceCluster EKS version to 1.36,
the latest EKS supports, where DRA is on by default. The version is still
overridable.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The serving stack provisions everything a model needs to run on a workload
cluster, but nothing published GPUs as DRA devices, so a ResourceClaim had no
ResourceSlice to bind against and GPU pods stayed Pending.

This adds two Helm releases to the serving stack. Node Feature Discovery
labels GPU nodes. The NVIDIA DRA driver runs a kubelet plugin on those nodes
that publishes each node's GPUs as ResourceSlices and registers the
gpu.nvidia.com DeviceClass that ModelReplica ResourceClaimTemplates request
through. GPU allocation is opt-in via gpuResourcesEnabledOverride; the driver's
ComputeDomains support (Multi-Node NVLink) is disabled, since Modelplane
doesn't use it and enabling it would pull in GPU Feature Discovery.

Signed-off-by: Nic Cope <nicc@rk0n.org>
GPU node groups carry an nvidia.com/gpu:NoSchedule taint so non-GPU pods
don't land on expensive GPU nodes. Pods used to schedule there because the
device plugin made them request the nvidia.com/gpu extended resource, which
EKS's ExtendedResourceToleration admission controller turns into a matching
toleration. Binding GPUs via DRA instead, the serving pods make no such
request, so nothing injects the toleration and they stay Pending off the GPU
nodes.

This adds the toleration to a serving pod when - and only when - the replica
claims a GPU through DRA, alongside wiring up its ResourceClaimTemplate. A pod
with no device requests claims no GPU and gets no toleration, so it can't land
on a GPU node it has no reason to use.

Signed-off-by: Nic Cope <nicc@rk0n.org>
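
A sketch of the "toleration when, and only when, the replica claims a GPU" rule from the commit above. The function name, the "devices" claim name, and the plain-dict pod spec are assumptions for illustration; the pod fields used (resourceClaims, resources.claims, tolerations) are the standard Pod API ones.

```python
import copy

# Matches the nvidia.com/gpu:NoSchedule taint on GPU node groups.
GPU_TOLERATION = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}


def wire_gpu_claim(pod_spec: dict, device_requests: list, template_name: str) -> dict:
    """Return a copy of pod_spec wired to a ResourceClaimTemplate, if it claims devices."""
    spec = copy.deepcopy(pod_spec)
    if not device_requests:
        return spec  # No claim, so no toleration: the pod stays off GPU nodes.
    # Pod-level claim, created per pod from the template.
    spec.setdefault("resourceClaims", []).append(
        {"name": "devices", "resourceClaimTemplateName": template_name}
    )
    # Each container opts in to the claim by name.
    for container in spec.get("containers", []):
        container.setdefault("resources", {}).setdefault("claims", []).append(
            {"name": "devices"}
        )
    spec.setdefault("tolerations", []).append(dict(GPU_TOLERATION))
    return spec
```

Keeping the toleration and the claim in one branch means a pod can't end up tolerating the GPU taint without also claiming a device.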
A ModelReplica never became Ready even once its model was serving:
ModelReady went True, but the XR stayed "Creating", reporting its Service,
HTTPRoute, and ResourceClaimTemplate as unready. So the ModelDeployment never
counted the replica and never became Ready either.

The function only marked the workload (model-serving) ready. Crossplane gates
an XR's Ready on every composed resource being ready, and a composed resource
isn't ready just because provider-kubernetes set its own Object's Ready
condition - the function has to mark it in the response.

This marks every composed resource ready. The workload still gates on the
model actually serving; the Service, HTTPRoute, and ResourceClaimTemplate have
no runtime readiness to wait on, so they're ready once composed.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@vercel

vercel Bot commented Jun 10, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| modelplane-docs | Ready | Preview, Comment | Jun 10, 2026 12:58am |

A ModelReplica's name was built with the SDK's resource.child_name,
joining the deployment name, cluster name, and replica index into the
readable prefix - e.g. my-model-cluster-a-0-5ab63. That leaks a
replica's placement (which cluster, which index) into its name, when a
replica's identity is meant to live in its labels. The name should be an
opaque, stable handle, the way a Pod's name doesn't encode its node.

child_name can't produce this: it joins every part into both the hashed
value and the visible prefix, so it has no way to fold a discriminator
into the hash without also showing it. This change adds a name module
with an opaque_name helper that hashes the visible name together with
its discriminators but keeps only the visible name in the prefix, then
derives a replica's name as the deployment name plus a hash of
(deployment, cluster, index). Co-located replicas still get distinct,
stable names - my-model-5ab63, my-model-609c5 - and the endpoint reuses
the replica name so routing still lands on the right backend.

The new module also takes over the replica_key and endpoint_key
desired-resource handles, so one place owns every identifier a
ModelDeployment derives for its replicas.

Signed-off-by: Nic Cope <nicc@rk0n.org>
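
The idea behind opaque_name, as described in the commit above, is to hash the discriminators without showing them. A minimal sketch follows; the hash algorithm, separator, and suffix length are my assumptions, so it will not reproduce the my-model-5ab63 style names quoted above, and it ignores Kubernetes name-length limits.

```python
import hashlib


def opaque_name(visible: str, *discriminators: str, length: int = 5) -> str:
    """Return "<visible>-<hash>" where the hash covers visible and the discriminators.

    Only `visible` appears in the prefix, so placement details stay hidden.
    """
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    material = "\x00".join((visible, *discriminators))
    digest = hashlib.sha256(material.encode()).hexdigest()
    return f"{visible}-{digest[:length]}"


def replica_name(deployment: str, cluster: str, index: int) -> str:
    """Derive a stable replica name from (deployment, cluster, index)."""
    return opaque_name(deployment, cluster, str(index))
```

Calling replica_name("my-model", "cluster-a", 0) twice gives the same name, and the cluster name never appears in it.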
negz added 4 commits June 9, 2026 18:40
A ModelDeployment's nodeSelector matches DRA-style device requests against
an InferenceCluster pool's devices. A pool device is either claim: DRA (a
real device bound through a DRA ResourceClaim) or claim: Synthetic (matched
for fleet scheduling but never claimed - hardware with no DRA driver). The
serving workload binds its GPUs through a ResourceClaim built from the
matched DRA devices.

The scheduler matched a pool whenever every request matched some device,
including a pool where the matched devices were all synthetic. Such a pool
yields no DRA requests, so the composed ModelReplica carried an empty
deviceRequests and its workload got no ResourceClaim - it would schedule
with no GPU binding at all, defeating the point of selecting that hardware.
A nodeSelector is required precisely so the workload can form a
ResourceClaim, so a selector that resolves only to synthetic devices must
not schedule.

The claim kind lives on the InferenceClass/InferenceCluster device, not the
ModelDeployment, so this can't be rejected when the deployment is admitted.
This change enforces it where the information converges: _match_pool now
treats a pool that resolves to zero DRA requests as a non-match, the same as
a pool that fails a selector. A deployment whose nodeSelector matches only
synthetic devices finds no eligible pool and reports InsufficientCapacity.
Synthetic devices remain co-selectors that refine placement alongside a
claimable device. The ModelDeployment XRD documents the requirement.

Because the scheduler now only ever pins a replica to a named pool that
yields at least one claimable device, nodePoolName and deviceRequests are
made required on the ModelReplica. compose-model-deployment always stamps
both, and compose-model-replica trusts them: the backends drop the branches
that handled a replica with no device requests, and resource_claim_template
always composes a ResourceClaimTemplate.

Signed-off-by: Nic Cope <nicc@rk0n.org>
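
The rule that a pool resolving to zero DRA requests is a non-match can be sketched like this. It is not the _match_pool from the PR: selectors are stood in for by plain Python predicates, device count accounting is left out, and the dict shapes are hypothetical.

```python
from typing import Callable, Optional

Device = dict
Request = tuple[str, Callable[[Device], bool]]  # (request name, compiled selector)


def match_pool(requests: list[Request], devices: list[Device]) -> Optional[list[dict]]:
    """Return the DRA requests a pool resolves, or None if the pool doesn't match."""
    resolved = []
    for name, predicate in requests:
        device = next((d for d in devices if predicate(d)), None)
        if device is None:
            return None  # A request no device satisfies fails the whole pool.
        if device["claim"] == "DRA":
            resolved.append({"name": name, "deviceClassName": device["deviceClassName"]})
        # A Synthetic match refines placement but contributes no claim.
    # Only synthetic matches: nothing to claim, so treat the pool as a non-match.
    return resolved or None
```

A selector that matches only a synthetic fabric device therefore returns None, the same as a selector that matches nothing, while a GPU request plus a fabric co-selector returns just the GPU request.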
semver.py reimplements blang/semver's parsing and precedence ordering for
DRA CEL selectors. Several lines mirror a quirk of that upstream
implementation with no local cue: the patch/prerelease/build split peels
build before prerelease (because build metadata may itself contain '-'),
zip uses strict=False because prerelease lists are expected to differ in
length, and the character-class checks reproduce blang's specific error
cases rather than a regex.

This adds commentary to those spots - why a line does what it does, tied
back to the upstream behaviour it mirrors - while leaving the precedence
contract in the class docstrings.

Signed-off-by: Nic Cope <nicc@rk0n.org>
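
The split-order quirk mentioned in the commit above is easy to see in isolation. This sketch is not semver.py; the function name is hypothetical and it does no validation. It only shows why build metadata has to be peeled off before the prerelease.

```python
def split_patch(rest: str) -> tuple[str, str, str]:
    """Split "PATCH[-PRERELEASE][+BUILD]" into (patch, prerelease, build)."""
    # Peel build first: build metadata may itself contain '-'.
    rest, _, build = rest.partition("+")
    # The first '-' that remains starts the prerelease, which may hold more '-'.
    patch, _, prerelease = rest.partition("-")
    return patch, prerelease, build
```

Splitting on '-' first would misread "3+build-5" as patch "3+build" with prerelease "5"; peeling the build first yields patch "3", no prerelease, and build "build-5".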
These needed no forward reference; the floor is Python 3.11.

Signed-off-by: Nic Cope <nicc@rk0n.org>
test_fn.py built req1..req7/want1..want7 imperatively in one ~780-line
method (suppressing a too-many-statements lint) and only assembled them
into a case list at the end, with a separate co-located-replicas test
asserting dynamically. It read nothing like the table-driven tests
elsewhere in the package.

This inlines each case's request and response into the cases table, folds
the co-located-replicas scenario in as an eighth full-response case, and
drops the lint suppression. Input fixtures (the deployment, the clusters,
the observed replica) are now built as schema-validated pydantic models
and dumped, like the rest of the package; expected responses stay as
hand-written dict literals so each assertion is an independent oracle
rather than a round-trip through the code under test.

Signed-off-by: Nic Cope <nicc@rk0n.org>
  if backend_secrets or backend_exists:
      if backend_secrets:
-         self.compose_kserve_backend(backend_secrets)
+         self.compose_serving_stack(backend_secrets)

Collaborator Author

Not strictly related to this change, but it was broken in main. I'd like to get type checking working as part of nix flake check - I think it'd have caught this since self.compose_kserve_backend was no longer defined.

Comment on lines +200 to +207
A request matches a pool device when the device has enough UNCONSUMED count
to cover the request and every selector evaluates true against that device.
Each resolved DRA request becomes a distinct DeviceRequest in one
ResourceClaim, and DRA allocates distinct devices per request, so a device's
count is consumed as requests claim it: two requests cannot both be satisfied
by the same single-count device, and N requests against one device must fit
within that device's count. Without this accounting the scheduler would place
a replica onto a node DRA can't actually satisfy.

Collaborator Author

Our scheduler can race if two MDs (ModelDeployments) are being scheduled at the same time. The failure mode is that it could overcommit an IC (InferenceCluster). We build the capacity ledger from observed state, which could be stale if we're racing with the same function serving another MD. That'll only be a problem if the IC is close to full.

I think we should fix this post-v0.1, so I'll raise a tracking issue. I'm thinking we could have the retain phase notice oversubscribed clusters and drop non-running replicas so they get rescheduled.
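
For readers following this thread, the count accounting in the quoted docstring can be sketched as below. It is a simplified first-fit pass with hypothetical dict shapes and predicates standing in for CEL selectors, not the scheduler's code; a first-fit pass can miss an assignment that backtracking would find.

```python
from typing import Optional


def assign(requests: list[dict], devices: list[dict]) -> Optional[dict[str, str]]:
    """Assign each request to a device with enough unconsumed count.

    Returns {request name: device name}, or None if some request can't be met.
    """
    remaining = {d["name"]: d["count"] for d in devices}
    assignment = {}
    for req in requests:
        for dev in devices:
            if remaining[dev["name"]] >= req["count"] and req["matches"](dev):
                remaining[dev["name"]] -= req["count"]  # Consume what this request claims.
                assignment[req["name"]] = dev["name"]
                break
        else:
            return None  # No device has enough unconsumed count left.
    return assignment
```

Two single-GPU requests against a pool device with count 1 return None, while the same requests against a device with count 2 both resolve to it.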

@negz negz marked this pull request as ready for review June 10, 2026 04:31
Copilot AI review requested due to automatic review settings June 10, 2026 04:31

Copilot AI left a comment

Pull request overview

This PR aligns Modelplane’s hardware description, scheduling, and GPU binding with Kubernetes Dynamic Resource Allocation (DRA). It introduces DRA-style device modeling in InferenceClass, device-request-based nodeSelector in ModelDeployment, schedules replicas against per-pool devices surfaced on InferenceCluster.status.gpuPools, and binds GPUs via DRA ResourceClaimTemplate instead of the legacy nvidia.com/gpu device plugin path. It also updates the serving stack to install Node Feature Discovery and the NVIDIA DRA driver, plus updates generated schemas, examples, and docs accordingly.

Changes:

  • Add DRA-style device modeling end-to-end: InferenceClass.spec.devices[] → InferenceCluster.status.gpuPools[].devices → ModelDeployment.spec.nodeSelector.devices[] → ModelReplica.spec.deviceRequests[].
  • Switch GPU binding from nvidia.com/gpu limits to DRA claims by composing a ResourceClaimTemplate and wiring pod/container resourceClaims/resources.claims.
  • Update EKS defaults and serving stack to support DRA (Kubernetes 1.36 default, install NFD + NVIDIA DRA driver), and add CEL/quantity/semver selector evaluation + tests.

Reviewed changes

Copilot reviewed 65 out of 67 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| schemas/python/models/ai/modelplane/modelservice/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/modelreplica/v1alpha1.py | Add deviceRequests + nodePoolName to ModelReplica schema. |
| schemas/python/models/ai/modelplane/modelendpoint/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/modeldeployment/v1alpha1.py | Add nodeSelector.devices[] schema for device-request scheduling. |
| schemas/python/models/ai/modelplane/modelcache/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/infrastructure/servingstack/v1alpha1.py | Add version pins for NFD + NVIDIA DRA driver. |
| schemas/python/models/ai/modelplane/infrastructure/gkecluster/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/infrastructure/ekscluster/v1alpha1.py | Default EKS Kubernetes version to 1.36 for GA DRA support. |
| schemas/python/models/ai/modelplane/inferencegateway/v1alpha1.py | Pydantic model modernization (typing, aware datetimes). |
| schemas/python/models/ai/modelplane/inferencecluster/v1alpha1.py | Replace legacy capacity shape with status.gpuPools[] + devices. |
| schemas/python/models/ai/modelplane/inferenceclass/v1alpha1.py | Replace resources.gpu with DRA-style spec.devices[]. |
| schemas/.lock.json | Update schema lock hash for regenerated models. |
| pyproject.toml | Bump crossplane-function-sdk-python dependency group to >=0.13.0. |
| functions/compose-serving-stack/tests/test_fn.py | Expect NFD + NVIDIA DRA driver Helm releases. |
| functions/compose-serving-stack/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-serving-stack/function/fn.py | Compose NFD + NVIDIA DRA driver; avoid emitting null metadata under exclude_unset. |
| functions/compose-model-service/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-model-replica/tests/test_fn.py | Validate DRA claim wiring + readiness policy changes. |
| functions/compose-model-replica/tests/test_backends.py | Validate ResourceClaimTemplate creation and pod claim wiring across backends. |
| functions/compose-model-replica/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-model-replica/function/fn.py | Update inferencecluster capacity assumptions and XR readiness handling. |
| functions/compose-model-replica/function/backends/native.py | Switch native backend to DRA claim-based GPU binding + wrap_object readiness. |
| functions/compose-model-replica/function/backends/llmd.py | Switch llm-d backend to DRA claim-based GPU binding + wrap_object readiness. |
| functions/compose-model-replica/function/backends/base.py | Add shared helpers for readiness policies and ResourceClaimTemplate composition. |
| functions/compose-model-endpoint/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-model-deployment/tests/test_semver.py | Add parity tests for semver CEL surface. |
| functions/compose-model-deployment/tests/test_quantity.py | Add parity tests for quantity CEL surface. |
| functions/compose-model-deployment/tests/test_cel.py | Add parity tests for DRA-style selector evaluation over devices. |
| functions/compose-model-deployment/pyproject.toml | Add cel-python dependency and bump function SDK version. |
| functions/compose-model-deployment/function/semver.py | Implement semver CEL library parity with upstream behavior. |
| functions/compose-model-deployment/function/quantity.py | Implement quantity CEL library parity with upstream behavior. |
| functions/compose-model-deployment/function/name.py | Add replica naming keyed by (cluster, index) to avoid collisions. |
| functions/compose-model-deployment/function/fn.py | Handle invalid CEL selectors; compose per-replica deviceRequests + pool pins. |
| functions/compose-model-deployment/function/cel.py | Implement DRA-style CEL device activation + evaluation via celpy. |
| functions/compose-inference-gateway/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-inference-cluster/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-inference-cluster/function/fn.py | Publish status.gpuPools devices; switch backend XR from KServeBackend to ServingStack. |
| functions/compose-inference-class/tests/test_fn.py | Update tests for spec.devices[] schema. |
| functions/compose-inference-class/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
| functions/compose-inference-class/function/fn.py | Update docs/comments to reflect “devices” instead of “resources”. |
| functions/compose-gke-cluster/pyproject.toml | Bump function dependency to crossplane-function-sdk-python>=0.13.0. |
functions/compose-eks-cluster/tests/test_fn.py Update expected EKS version; remove device-plugin expectations.
functions/compose-eks-cluster/pyproject.toml Bump function dependency to crossplane-function-sdk-python>=0.13.0.
functions/compose-eks-cluster/function/fn.py Remove NVIDIA device plugin composition; rely on DRA stack.
flake.nix Repin crossplane CLI to upstream main for bool/int field-name generation fix.
flake.lock Update flake lock for the crossplane CLI repin.
examples/qwen-demo/02-class.yaml Update example InferenceClass to devices[].
examples/platform/inference-class-h100-byo.yaml Update BYO class example to DRA-style devices (incl. synthetic NIC).
examples/platform/inference-class-gke-l4.yaml Update GKE class example to DRA-style devices.
examples/platform/inference-class-eks-l4.yaml Update EKS class example to DRA-style devices.
examples/deployment/model-deployment.yaml Add required nodeSelector.devices[] example with DRA CEL.
examples/deployment/model-deployment-multinode.yaml Add multi-node deployment nodeSelector.devices[] example.
docs/content/getting-started.md Update narrative to describe nodeSelector + DRA ResourceClaim binding.
docs/content/concepts.md Update concepts to describe device-based scheduling and DRA claiming.
design/design.md Update design doc to device-request nodeSelector and DRA claim formation.
apis/servingstacks/definition.yaml Add schema fields for NFD + NVIDIA DRA driver versions.
apis/modelreplicas/definition.yaml Require nodePoolName + deviceRequests; define schema for requests/selectors.
apis/modeldeployments/definition.yaml Require nodeSelector; define device-request schema and selector constraints.
apis/inferenceclusters/definition.yaml Replace status.capacity.gpuPools with status.gpuPools[].devices.
apis/inferenceclasses/definition.yaml Replace spec.resources with spec.devices (DRA-style).
apis/eksclusters/definition.yaml Default EKS Kubernetes version to 1.36 for GA DRA support.


Comment thread functions/compose-model-replica/function/backends/base.py Outdated
Comment thread functions/compose-model-replica/function/fn.py
negz added 3 commits June 9, 2026 22:12
compose-model-replica marked the Service, HTTPRoute, and
ResourceClaimTemplate ready as soon as it composed them, before the
resources existed on the workload cluster. Crossplane takes a function's
per-resource readiness at face value when it computes the XR's Ready
condition, so marking a resource ready is the function asserting it
observed that resource ready - not a standing "treat it as ready once it
exists" instruction. Asserting readiness for a resource that hasn't been
applied yet claims something the function never observed.

This gates every per-resource readiness mark on the resource being
present in observed state. The Service, HTTPRoute, and
ResourceClaimTemplate have no runtime readiness to wait on, so observing
them is enough; the workload still additionally gates on the model
serving. A freshly composed resource isn't observed yet, so it stays
unready for one reconcile until Crossplane applies it and observes it
back.

Signed-off-by: Nic Cope <nicc@rk0n.org>
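
As a rough illustration of the gating rule in this commit, the sketch below marks a composed resource ready only when its name also appears in observed state. It uses plain dictionaries rather than the Crossplane function SDK, and the names `gate_readiness`, `desired`, and `observed` are hypothetical, not taken from compose-model-replica.

```python
def gate_readiness(desired: dict[str, dict], observed: dict[str, dict]) -> dict[str, bool]:
    """Return a readiness flag per composed resource name.

    Assumption: these resources (Service, HTTPRoute, ResourceClaimTemplate)
    have no runtime readiness of their own, so being observed is enough.
    """
    return {name: name in observed for name in desired}


# First reconcile: everything was just composed, nothing is observed yet.
desired = {"service": {}, "httproute": {}, "claim-template": {}}
first = gate_readiness(desired, observed={})

# Next reconcile: the Service has been applied and observed back.
second = gate_readiness(desired, observed={"service": {}})
```

On the first pass every flag is false, which matches the "unready for one reconcile" behaviour the commit describes. The workload's own readiness would still need a separate check on the model serving.
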
The control-plane scheduler matches a ModelDeployment's nodeSelector
against an InferenceCluster pool's devices, picks a pool, and stamps it
onto the ModelReplica as spec.nodePoolName. But the serving pod carried
only its DRA ResourceClaim, with nothing tying it to that pool, so the
workload cluster's scheduler was free to place it on any pool whose
devices satisfied the claim.

Two things broke as a result. The control plane's per-pool node-capacity
accounting keys on nodePoolName, so it charged the scheduled pool while
the pod ran elsewhere - and nothing ever reconciled the two, since the
control plane never observes where a pod actually lands. The drift is
permanent, not transient. Worse, a claim: Synthetic device (matched for
placement but never turned into a DRA request - hardware with no driver,
like an InfiniBand fabric) has no claim binding it, so pool selection is
its only enforcement. Without a pin a pod could schedule onto a pool that
lacks the synthetic hardware the model was placed for, silently, while
serving.

This pins every serving pod - the native Deployment pod and both llm-d
LeaderWorkerSet templates - to its pool with a nodeSelector on the
modelplane.ai/pool node label, valued at spec.nodePoolName. The EKS and
GKE provisioning functions already stamp this label on every node group.
On a BYO cluster Modelplane doesn't provision the nodes, so the operator
must label them to match; the InferenceClass XRD documents this, and a
missing label leaves the pod Pending rather than misplaced. The pinning
joins the DRA claim wiring in one place, renamed place_pod.

Signed-off-by: Nic Cope <nicc@rk0n.org>
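
A minimal sketch of the pinning step, assuming the pod spec is handled as a plain dictionary. The label key `modelplane.ai/pool` comes from the commit message; the helper name `pin_to_pool`, the pool name `l4-pool`, and the container image are made up for illustration and are not the actual `place_pod` implementation.

```python
import copy

POOL_LABEL = "modelplane.ai/pool"  # node label named in the commit message


def pin_to_pool(pod_spec: dict, node_pool_name: str) -> dict:
    """Return a copy of pod_spec with a nodeSelector pinning it to one pool."""
    pinned = copy.deepcopy(pod_spec)
    # Keep any nodeSelector entries already present and add the pool label.
    pinned.setdefault("nodeSelector", {})[POOL_LABEL] = node_pool_name
    return pinned


# Hypothetical pod spec and pool name.
pod_spec = {"containers": [{"name": "engine", "image": "example.invalid/engine:latest"}]}
pinned = pin_to_pool(pod_spec, "l4-pool")
```

Because a nodeSelector is a hard constraint, a node without the label can never host the pod, which is why a missing label on a BYO cluster shows up as a Pending pod rather than a misplaced one.
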
The compose-model-cache function is declared in crossplane-project.yaml
and is a member of the uv workspace, but it was never added to the
functionNames list in flake.nix. That list drives two things: the OCI
images nix builds for each function, and the per-function unit test
checks. So the function's image was never built and its tests never ran.

crossplane project build resolves every function in the project file
against _output/functions, so a missing image fails the whole package
build - the function can't be packaged or installed at all. CI's
push-package job has been silently skipping for the same reason.

Add compose-model-cache to functionNames so nix builds its image and
runs its tests alongside the others.

Signed-off-by: Nic Cope <nicc@rk0n.org>

@dennis-upbound dennis-upbound left a comment

Collaborator


Went deep on the DRA matching path: ran the compose-model-deployment tests and reproduced the two correctness items below against celpy. Solid overall; notes inline. The quantity and match-pool ones are the substantive ones; the rest are nits or a doc-vs-code mismatch.

Comment thread functions/compose-model-deployment/function/quantity.py Outdated
Comment thread functions/compose-model-deployment/function/scheduling.py
Comment thread functions/compose-model-deployment/function/fn.py
Comment thread functions/compose-model-replica/function/backends/native.py Outdated
Comment thread functions/compose-model-replica/function/backends/base.py
Comment thread design/design.md Outdated
@dennis-upbound dennis-upbound self-requested a review June 10, 2026 14:42

@dennis-upbound dennis-upbound left a comment

Collaborator


Approved with the comments.

negz added 4 commits June 10, 2026 11:12
The quantity, semver, and cel modules reimplement Kubernetes' resource.Quantity
and DRA device-selector CEL by hand, so it's easy to drift from upstream in ways
tests that share the same assumptions won't catch. To check, this adds a parity
oracle (tests/oracle) that runs inputs through the real apimachinery and
dynamic-resource-allocation code and prints what upstream actually does. It
surfaced several places where the reimplementation was wrong.

quantity crashed on large binary-suffix values like 10Ei (decimal.InvalidOperation
under Python's default 28-digit context), and where it did parse, it computed a
value Kubernetes does not: resource.Quantity saturates binary-suffix overflow to
int64-max, so 8Ei, 10Ei, and 100Ei all compare equal. cel exposed
allowMultipleAllocations, absent from the DRA surface this targets, and a couple
of tests asserted expression forms a real cluster rejects at compile time.

Round to nano scale under a wide local precision so large decimal-path quantities
parse, and saturate binary-suffix overflow to int64-max to match
resource.Quantity.Cmp. Drop allowMultipleAllocations. Move the affected tests to
the forms upstream accepts, and document the call-style and has()-index leniencies
celpy can't avoid, plus the bare-suffix and decimal-exponent corners deliberately
not reproduced.

The oracle is a developer tool, not part of the suite: run it on demand via
nix shell when changing these modules or bumping the target Kubernetes version,
then transcribe its answers into the tests, which stay the regression guard.

Signed-off-by: Nic Cope <nicc@rk0n.org>
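
The saturation behaviour described above can be sketched as follows. This is not the quantity module from the PR: it handles only non-negative, integer-mantissa values with a binary suffix, and the name `parse_binary_quantity` is hypothetical. The one behaviour it aims to show is that binary-suffix overflow clamps to int64-max.

```python
INT64_MAX = 2**63 - 1

# Binary suffixes and their power-of-two exponents.
_BINARY_EXPONENTS = {"Ki": 10, "Mi": 20, "Gi": 30, "Ti": 40, "Pi": 50, "Ei": 60}


def parse_binary_quantity(text: str) -> int:
    """Parse a quantity such as '16Gi' into an integer number of base units.

    Assumption: integer mantissa, binary suffix, non-negative. Values past
    int64-max saturate instead of raising, as the commit message describes.
    """
    suffix = text[-2:]
    if suffix not in _BINARY_EXPONENTS:
        raise ValueError(f"not a binary-suffix quantity: {text!r}")
    mantissa = int(text[:-2])
    if mantissa < 0:
        raise ValueError("negative quantities are outside this sketch")
    return min(mantissa * 2 ** _BINARY_EXPONENTS[suffix], INT64_MAX)
```

Since 8Ei is exactly 2^63, one past int64-max, it and every larger binary-suffix value clamp to the same number, so `parse_binary_quantity("8Ei")`, `parse_binary_quantity("10Ei")`, and `parse_binary_quantity("100Ei")` compare equal here.
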
The docstring described the count accounting as if it made pool selection sound,
but assignment is greedy in request order, so it can falsely reject a pool when
two requests' selectors overlap on a shared device kind. Spell out the greedy
behaviour and why we accept it: the case needs overlapping rather than disjoint
selectors (a shape no real workload writes) and fails safe as
InsufficientCapacity, never an overcommit or a bad placement.

Signed-off-by: Nic Cope <nicc@rk0n.org>
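
To make the false rejection concrete, here is a small sketch of greedy, in-request-order assignment over device counts. It is an illustration under assumed data shapes, not the scheduler's code: the device dictionaries, the `kind` field, and the names `greedy_fits`, `any_gpu`, and `only_a` are all hypothetical.

```python
from typing import Callable

Device = dict  # e.g. {"name": "gpu-a", "kind": "a", "count": 1}
Selector = Callable[[Device], bool]


def greedy_fits(devices: list[Device], requests: list[Selector]) -> bool:
    """Give each request, in order, the first matching device with count left."""
    remaining = {d["name"]: d["count"] for d in devices}
    for selector in requests:
        for device in devices:
            if remaining[device["name"]] > 0 and selector(device):
                remaining[device["name"]] -= 1
                break
        else:
            return False  # no device left for this request: pool is rejected
    return True


devices = [
    {"name": "gpu-a", "kind": "a", "count": 1},
    {"name": "gpu-b", "kind": "b", "count": 1},
]
any_gpu = lambda d: d["kind"] in ("a", "b")  # overlaps with only_a on gpu-a
only_a = lambda d: d["kind"] == "a"

rejected = greedy_fits(devices, [any_gpu, only_a])  # any_gpu takes gpu-a first
accepted = greedy_fits(devices, [only_a, any_gpu])  # same pool, other order
```

The pool can satisfy both requests, yet the first ordering is rejected because the broader selector consumes the only device the narrower one accepts. The failure is a rejection, never an overcommit, which matches the fail-safe argument in the commit message.
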
The native and llm-d backends each defined their own _ENGINE_PORT and
_LABEL_SERVING with identical values, and each aliased base.REMOTE_NAMESPACE to
a local _REMOTE_NAMESPACE that added nothing. The engine port especially is a
contract - the ModelEndpoint URLs assume it - so having two copies that can
silently disagree is a hazard.

Move ENGINE_PORT and LABEL_SERVING to base alongside REMOTE_NAMESPACE, and
reference base directly from both backends instead of re-declaring or aliasing
them.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The design doc gave nodes-per-replica as pipeline * data / dataLocal, but that's
nodes-per-worker. A replica has workers.count workers, so the scheduler gates on
nodes-per-worker times workers.count - which is what topology_shape actually
computes. For a disaggregated decode role with count 3 and pipeline 2 the doc
implied 2 nodes where the scheduler consumes 6.

Multiply by workers.count in both the topology section and the scheduler steps.

Signed-off-by: Nic Cope <nicc@rk0n.org>
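
The corrected arithmetic, worked through as a short sketch. The function names are hypothetical stand-ins rather than the real `topology_shape`, and the example assumes `data` and `dataLocal` are both 1, since the commit message states only the worker count and pipeline value.

```python
def nodes_per_worker(pipeline: int, data: int, data_local: int) -> int:
    """Nodes one worker spans: pipeline * data / dataLocal.

    Assumption: data is an exact multiple of data_local.
    """
    return pipeline * data // data_local


def nodes_per_replica(pipeline: int, data: int, data_local: int, workers_count: int) -> int:
    """Nodes a whole replica consumes: nodes-per-worker times workers.count."""
    return nodes_per_worker(pipeline, data, data_local) * workers_count


# Disaggregated decode role from the commit message: count 3, pipeline 2.
per_worker = nodes_per_worker(pipeline=2, data=1, data_local=1)
per_replica = nodes_per_replica(pipeline=2, data=1, data_local=1, workers_count=3)
```

Under these assumptions `per_worker` is 2, the figure the old doc implied for the whole replica, while `per_replica` is 6, the figure the scheduler actually gates on.
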
@negz negz merged commit d864409 into main Jun 10, 2026
3 checks passed
@negz negz mentioned this pull request Jun 10, 2026
4 tasks
dennis-upbound added a commit that referenced this pull request Jun 10, 2026
The example predated #101 merging, where nodeSelector became required: show it
on both the decode and prefill roles, each with its own device selectors
(illustrating distinct per-role hardware), and model the InfiniBand fabric as a
Synthetic device. Fold the operator "when to use" guidance into the summary as
background. Give the routing-discriminator alternative an API sketch so the
discriminator-vs-template tradeoff is concrete, and add the "two
ModelDeployments" alternative with why a single MD is better (co-location and
that a prefill-only MD isn't conceptually a model deployment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@negz negz deleted the celular branch June 16, 2026 16:56


Development

Successfully merging this pull request may close these issues.

nodeSelector should describe hardware as a list of DRA devices
Align hardware capabilities design with Kubernetes Dynamic Resource Allocation

3 participants