
Update design to reflect latest thinking#64

Merged
negz merged 17 commits into main from pages
May 16, 2026

Conversation

@negz

@negz negz commented May 7, 2026

Collaborator

The team spent a lot of time on Modelplane and its API last week. The most significant change we've aligned on is to dial down the separation of concerns on deploying a model. Specifically we've dropped the model catalog and we're going to expose more knobs to the ML teams authoring a ModelDeployment.

We've made ModelDeployment much more expressive in order to make sure it could deploy a frontier (open weight) model. We've adopted a subset of DRA to express scheduling constraints. We've also decided to scale only at the replica level - i.e. a ModelDeployment can scale the number of ModelReplicas but there's no scaling within one ModelReplica.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Comment thread design/api-update.md Outdated
negz added 3 commits May 7, 2026 22:18
Refine the design/api-update.md sketch following discussion of the
scaling model, disaggregated serving, and the routing surface:

- Rename ModelPlacement to ModelReplica throughout. Each replica is one
  complete serving instance — single-node, multi-node via LWS, or full
  prefill/decode disagg. Mirrors Deployment -> Pod naming.

- Drop spec.scaling from ModelDeployment. Autoscaling is opt-in via a
  separate KEDA ScaledObject targeting the MD's scale subresource —
  same pattern as Deployment + HPA. Add a worked Mixtral example with
  ScaledObject alongside.

- Add a discriminated-union pattern for disaggregated prefill/decode.
  A serving profile is either unified (root poolSelector / parallelism
  / engine) or disagg (explicit decode and prefill blocks, each
  self-contained — no inheritance from the root). Decode and prefill
  must land on the same InferenceCluster but can target different pools.

- Move inter-node networking onto InferenceClass instead of cluster-
  level capabilities. Different networking implies a different class
  (h200-nvl-8x-ib vs h200-nvl-8x); networking belongs to the pool that
  uses it. Drop spec.capabilities from InferenceCluster — cluster-level
  metadata is captured as standard Kubernetes labels.

- Lift clusterSelector to deployment level. Profiles only carry pool
  selection and per-pool composition, since the cluster intent doesn't
  change between fallback profiles.

- Switch ModelService routing to a single spec.endpoints[] pattern (was
  separate selector vs routes paths). One mechanism for both simple and
  weighted routing.

- Drop spec.model.name in favor of metadata.name as the served model
  identifier. The HuggingFace repo (or other source) is purely where
  weights come from, not the model's identity.

- Add YAML comments throughout the examples explaining what each field
  does — what gets matched, what gets composed, what's optional.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Different hardware targets typically require different model weight
checkpoints (FP8 vs BF16 are different HuggingFace repos). That makes
fallback profiles within one deployment the wrong abstraction — they're
genuinely different deployments. Silent degradation (falling back to a
config with lower context length or different quantization) is also
arguably worse than explicit failure.

This commit flattens the serving profile array. poolSelector,
parallelism, and engine are now top-level fields on
ModelDeployment.spec. Different hardware configurations are separate
ModelDeployments behind one ModelService. The deployer makes explicit
decisions about which configurations to run.

The disaggregated prefill/decode pattern is now a discriminated union
on ModelDeployment itself: either root-level poolSelector/parallelism/
engine (unified) or explicit decode/prefill blocks (disaggregated).

If preferential scheduling is needed later, it would be a coordination
mechanism between ModelDeployments, not inline profiles.
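
The unified-versus-disaggregated union described above can be sketched as a small validation function. This is only an illustration, not the project's code: `serving_mode` is a hypothetical helper, and the field names are the ones used at this point in the design (`poolSelector`, `parallelism`, `engine`, `decode`, `prefill`).

```python
from typing import Any, Mapping

# Field names as they appear at this stage of the design sketch.
UNIFIED_FIELDS = ("poolSelector", "parallelism", "engine")
DISAGG_FIELDS = ("decode", "prefill")


def serving_mode(spec: Mapping[str, Any]) -> str:
    """Return "unified" or "disaggregated"; raise ValueError otherwise.

    Hypothetical helper: a spec must use exactly one variant of the union.
    """
    unified = [f for f in UNIFIED_FIELDS if f in spec]
    disagg = [f for f in DISAGG_FIELDS if f in spec]

    if unified and disagg:
        raise ValueError(f"cannot mix root-level {unified} with {disagg}")

    if disagg:
        if len(disagg) != len(DISAGG_FIELDS):
            raise ValueError("disaggregated serving needs both decode and prefill")
        # Each role is self-contained: nothing is inherited from the root.
        for role in DISAGG_FIELDS:
            missing = [f for f in UNIFIED_FIELDS if f not in spec[role]]
            if missing:
                raise ValueError(f"{role} is missing {missing}")
        return "disaggregated"

    if len(unified) == len(UNIFIED_FIELDS):
        return "unified"
    raise ValueError("spec sets neither a complete unified nor disaggregated shape")
```

Rejecting a mixed spec outright, rather than letting one variant win, keeps the failure explicit, which matches the commit's preference for explicit failure over silent degradation.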

Signed-off-by: Nic Cope <nicc@rk0n.org>
The parallelism block describes more than just parallelism — it's the
complete compute topology of a role within one ModelReplica. Renaming
to topology makes room for the per-role instance count (instances)
which describes replication rather than sharding.

For disaggregated prefill/decode, the P:D ratio (e.g., 5P3D) is the
number of independent instances per role within one ModelReplica. This
is a topology parameter — fixed per deployment, not a scaling knob. It
maps to KServe's LLMInferenceService.spec.replicas (decode) and
spec.prefill.replicas (prefill). For unified serving, instances
defaults to 1 and can be omitted.

Other changes in this commit:

- Require DRA on all InferenceClusters. Drop nodeSelector from pool
  declarations — DRA handles device-to-node binding. Pools are now
  just name, class, and maxNodes.

- Rename poolSelector to nodeSelector on ModelDeployment. With DRA
  required and nodeSelector gone from pools, the naming collision is
  resolved.

- Replace driver.version with cuda.toolkit as the typed capability
  example on InferenceClass. Driver version is a runtime property of
  the cluster, not the hardware SKU. CUDA toolkit version is a better
  example of where {type: version} decoration matters.

Signed-off-by: Nic Cope <nicc@rk0n.org>
dennis-upbound pushed a commit that referenced this pull request May 8, 2026
…shape to #64

- Rename ModelPlacement → ModelReplica everywhere (XRD, status fields,
  printer columns, doc, examples). Aligns with the "replica == placement"
  mental model. Pure rename + role expansion.
- Add Federation-layer scheduling section: composer / matcher / backend
  adapter / capacity adapter contracts, the actual matcher pseudocode,
  out-of-scope items with effort + uncertainty, ordering reasoning.
- Add Plugin/adapter system section: six adapter axes, version-pinned
  KServe absorbing schema churn, end-to-end managed + BYOC examples.
- Add BYOC scheduling section: onboarding flow, what "managed" means
  axis-by-axis, edge cases (no DRA, multiple schedulers, RBAC).
- Expand autoscaling walkthrough: KEDA / composer / matcher / backend
  loop with concrete scale-up + scale-down sequences.
- Replace v1/v2 framing with effort sizes + ordering + uncertainty.
- Defer API shape, hardware taxonomy, engine-features detail to #64.
dennis-upbound pushed a commit that referenced this pull request May 8, 2026
- Remove xrds/ (and lint.py / LINT.md) — API shape lives in #64.
  Examples stay as illustrative scheduling-relevant YAML.
- New section: "What we treat as IR" — three IRs (ModelReplica explicit;
  cluster substrate + endpoint binding implicit today). Argues why naming
  IR seams is what makes BYO-* cheap.
- New section: "Crossplane lifecycle layers" — per-layer XR ownership is
  what enables pause/resume, GitOps drift, RBAC boundaries, version-skew
  handling per cluster.
- Reframe plugin/adapter axes: be honest the count is contingent. Two
  user-visible axes (scheduler, backend); the other four are internal /
  collapsible. Don't read into the number.
- New section: "User-facing surface preview" with Quickstart (4 CRs,
  ~60 lines of YAML to a working curl) + 5 Advanced scenarios as deltas
  (multi-region, BYOC+KAI, P/D disagg, custom InferenceClass, spillover).
  Goal: gauge complexity of the proposed scheduling design from the
  user's seat.
dennis-upbound pushed a commit that referenced this pull request May 8, 2026
…nter

Pivots this PR from design preview to implementation sketch. The code under
functions/ doesn't run (Nic's #64 protos aren't generated yet) but the
shape, dependencies, and use cases are real.

Implementation:
- functions/compose-model-deployment/scheduling.py — federation matcher.
  Plain Python, no Crossplane imports — testable in isolation. Filters ICs
  by clusterSelector.matchLabels, filters pools by nodeSelector.cel against
  class capabilities, capacity check with sticky-placement accounting,
  scores and picks per replica. Topology strategies map to (nodes_per_inst,
  gpus_per_node). Disagg requires same-cluster decode + prefill pools.
- functions/compose-model-deployment/main.py — Crossplane glue. Required-
  resources for clusters/classes/owned MRs, calls scheduling.match(),
  emits ModelReplica + ModelEndpoint per spec.replicas, sets MD conditions.
- functions/compose-model-placement/main.py — renderer. Reads MR + matched
  IC + class(es), composes KServe LLMInferenceService (decode + optional
  prefill) + DRA ResourceClaim(s) on the target cluster via remote-object
  provider. Lifts cold-start conditions back as MR.status.
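
A rough sketch of the matcher's shape, under simplifying assumptions: it only covers the `matchLabels` filter and a node-count capacity check, and it omits CEL evaluation, sticky-placement accounting, and disaggregated serving. The names `Pool`, `Cluster`, and `match`, and the "most free nodes first" scoring rule, are hypothetical and are not taken from `scheduling.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    free_nodes: int


@dataclass
class Cluster:
    name: str
    labels: dict
    pools: list = field(default_factory=list)


def matches_labels(labels: dict, match_labels: dict) -> bool:
    """True when every selector key is present with an equal value."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def match(clusters, match_labels, nodes_needed, replicas):
    """Pick one (cluster, pool) per replica; stop early when capacity runs out."""
    # Free-node budget for every pool on a cluster that passes the label filter.
    free = {
        (c.name, p.name): p.free_nodes
        for c in clusters
        if matches_labels(c.labels, match_labels)
        for p in c.pools
    }
    placements = []
    for _ in range(replicas):
        candidates = [(n, key) for key, n in free.items() if n >= nodes_needed]
        if not candidates:
            break  # Not enough capacity for the remaining replicas.
        # Assumed scoring: most free nodes first, names break ties.
        _, key = sorted(candidates, key=lambda c: (-c[0], c[1]))[0]
        free[key] -= nodes_needed
        placements.append(key)
    return placements
```

Because the function takes plain data and returns plain tuples, it can be unit tested without any Crossplane machinery, which is the property the commit message calls out.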

Docs:
- design.md compressed to ~180 lines: architecture diagram, what-lives-where
  table, dependencies per function, use cases traced through the code.
- README.md is now a 2-section pointer at design.md + the code.
- Deleted quickstart.md / advanced.md / scheduling.md — served the design
  phase; the code is the new source of truth.
- examples/README.md maps each example to the matcher/renderer code path
  it exercises.

Adapter functions (_load_md, _resolve_clusters, _load_mr, _cel_from_capabilities)
raise NotImplementedError — they're the wiring points that fill in once #64
lands and protos are generated.
dennis-upbound pushed a commit that referenced this pull request May 8, 2026
Adds stage-2 scheduler integration to the renderer + a sketch of the
capacity feedback loop. Same sketch quality as the rest of the PR —
doesn't run, but the dispatch shape, per-scheduler differences, and
capacity-status pipeline are honest.

- functions/compose-model-placement/scheduler.py — per-scheduler wrap
  functions. KAI: schedulerName + PodGroup CRD wrapping the LWS gang
  (minMember = total pods). Kueue: kueue.x-k8s.io/queue-name label +
  suspend gate (Kueue's webhook creates the Workload). none: pass-through.
  Single dispatch table; new schedulers (Volcano, etc.) plug in here.

- functions/compose-model-placement/main.py — wired to call scheduler.wrap()
  after building the base LLM-IS spec. Emits the wrapped spec + any
  scheduler-companion objects (PodGroup) onto the same target cluster
  via the existing remote-object provider.

- lib/capacity_adapter/{__init__,common,kai,kueue}.py — sketch of the
  per-scheduler status pullers. Reads the scheduler's status CRDs
  (KAI Queue/ResourcePool, Kueue ClusterQueue.flavorsUsage[]),
  normalizes into the shared CapacitySnapshot type, writes to
  IC.status.capacity. NOT a Crossplane composition function — runs as
  a separate controller, one per IC. Sketch shows the projection logic;
  K8s client wiring is NotImplementedError stubs.

- design.md updated with a KAI/Kueue section: per-scheduler differences
  table, dispatch wiring diagram, capacity feedback loop diagram, how
  to add a new scheduler. Notes the small API extension needed on
  Nic's #64 (IC.spec.scheduler.type).
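
The single dispatch table could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the placement of `schedulerName` and `suspend` inside the workload spec, the `default` queue name, the `kai-scheduler` scheduler name, and the PodGroup `apiVersion` are placeholders rather than the contents of `scheduler.py`.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Spec = Dict[str, Any]
Wrapped = Tuple[Spec, List[Spec]]  # (wrapped workload, companion objects)


def wrap_none(workload: Spec, total_pods: int) -> Wrapped:
    """Pass-through: no scheduler-specific decoration."""
    return copy.deepcopy(workload), []


def wrap_kueue(workload: Spec, total_pods: int) -> Wrapped:
    """Add the Kueue queue label and a suspend gate (field placement assumed)."""
    out = copy.deepcopy(workload)
    labels = out.setdefault("metadata", {}).setdefault("labels", {})
    labels["kueue.x-k8s.io/queue-name"] = "default"  # assumed queue name
    out.setdefault("spec", {})["suspend"] = True
    return out, []


def wrap_kai(workload: Spec, total_pods: int) -> Wrapped:
    """Set schedulerName and emit a PodGroup sized to the whole gang."""
    out = copy.deepcopy(workload)
    out.setdefault("spec", {})["schedulerName"] = "kai-scheduler"  # assumed
    pod_group = {
        "apiVersion": "scheduling.run.ai/v2alpha2",  # assumed; verify before use
        "kind": "PodGroup",
        "metadata": {"name": out.get("metadata", {}).get("name", "workload")},
        "spec": {"minMember": total_pods},
    }
    return out, [pod_group]


# New schedulers plug in by adding one entry here.
WRAPPERS: Dict[str, Callable[[Spec, int], Wrapped]] = {
    "none": wrap_none,
    "Kueue": wrap_kueue,
    "KAI": wrap_kai,
}


def wrap(scheduler_type: str, workload: Spec, total_pods: int) -> Wrapped:
    try:
        return WRAPPERS[scheduler_type](workload, total_pods)
    except KeyError:
        raise ValueError(f"unsupported scheduler type: {scheduler_type}") from None
```

Each wrapper copies its input, so the base spec built by the renderer is never mutated by a scheduler-specific step.
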
dennis-upbound pushed a commit that referenced this pull request May 8, 2026
Won't merge until after #64 (or later) lands, so the "delta from main"
table was documenting terminology nobody will see by then. Replaces it
with a "Scheduler properties" section that pins down the load-bearing
behavior in K8s SIG-Scheduling terms — what the scheduler actually
does, no comparison column.

No code changes; doc only.
negz added 2 commits May 8, 2026 16:55
The topology block describes the shape of each worker, but workers.count
(how many of that shape) is a sibling concern at the same level — not
a property of the topology itself. Group them together under workers:

  workers:
    count: 3
    topology:
      strategy: TensorPipeline
      tensor: 8
      pipeline: 2

This reads as "3 workers, each TensorPipeline TP=8 PP=2." The topology
describes one worker. The count says how many. nodeSelector and engine
stay alongside workers as separate concerns — what hardware each worker
needs and what engine it runs.

For unified serving, workers just contains topology (count defaults to
1). For disaggregated P/D, workers.count on each role is the P:D
ratio — the "5" and "3" in 5P3D. It's a topology parameter (fixed per
deployment), not a scaling knob.
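
To make the arithmetic concrete, here is a minimal sketch of how `count` and `topology` combine into a GPU total for one ModelReplica. It assumes `tensor` is GPUs per node and `pipeline` is nodes per worker; the `Workers` class is a hypothetical stand-in, and the prefill and decode shapes in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workers:
    """Hypothetical stand-in for one role's workers block."""

    tensor: int
    pipeline: int = 1
    count: int = 1  # Omitted for unified serving, so it defaults to 1.

    @property
    def gpus_per_worker(self) -> int:
        # Assumes tensor = GPUs per node and pipeline = nodes per worker.
        return self.tensor * self.pipeline

    @property
    def total_gpus(self) -> int:
        return self.count * self.gpus_per_worker


# "3 workers, each TensorPipeline TP=8 PP=2."
unified = Workers(tensor=8, pipeline=2, count=3)
print(unified.gpus_per_worker, unified.total_gpus)  # 16 48

# A 5P3D replica with made-up shapes: count carries the P:D ratio.
prefill = Workers(tensor=4, count=5)
decode = Workers(tensor=8, count=3)
print(prefill.total_gpus + decode.total_gpus)  # 44
```

The total is per ModelReplica. Scaling the ModelDeployment multiplies it by the number of replicas and leaves `count` unchanged.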

Signed-off-by: Nic Cope <nicc@rk0n.org>
The InferenceClass becomes a tested recipe that bundles both
capabilities (for scheduling) and optionally cloud-specific
provisioning config (for cluster composition). When Modelplane
provisions a GKE cluster, the composition function reads
class.provisioning.gke to get the machineType, accelerator config,
and networking — guaranteed consistent with the capabilities the
scheduler uses for matching.

The provisioning block is optional. Classes without it are
capabilities-only, used for BYO clusters where the pool already
exists. The provisioning.provider discriminator selects the
cloud-specific sibling block (gke, eks, aks).

Modelplane ships a default catalog: cloud-specific classes for
provisioned clusters (gke-h200-8x-a3-ib, gke-l4-1x-g2) and
cloud-agnostic classes for BYO (h200-8x-ib, l4-1x). The
InferenceCluster section now shows both a GKE-provisioned cluster
(pools reference cloud-specific classes) and a BYO cluster (pools
reference capabilities-only classes).

Cluster-level config (project, region, K8s version) stays on the
InferenceCluster. Pool-level config (machineType, GPU, networking)
moves to the class. Pool sizing (maxNodes, nodeCount) stays on the
InferenceCluster pool — it's a per-cluster capacity decision, not a
property of the hardware SKU.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@bassam

bassam commented May 10, 2026

Collaborator

I think we should drop the source discriminator and huggingFace: block on ModelDeployment. I don't think they represent how LLMs are actually packaged today, and they get in the way of future caching features.

LLM serving has settled into three distinct factorings of (runtime, weights, optional compiled engine):

Pattern 1 — engine fetches weights at startup. Generic engine image (vLLM, SGLang, TGI upstream); engine pulls weights from HuggingFace, S3, or NGC via its native mechanism. The deployment specifies engine.image and engine args; the engine handles the rest. This is the dominant pattern for vLLM/SGLang/TGI workloads.

Pattern 2 — engine image includes weights. NIM is the canonical case. Runtime, optimization metadata, and weights are baked into one OCI image. Pull, run, serve. No separate fetch step.

Pattern 3 — runtime and artifacts stored separately. Generic runtime image plus separately-stored artifacts: weights the platform fetches and stages, compiled TensorRT-LLM .engine files in object storage, weights in an internal registry, or weights pre-staged on a PVC. Runtime mounts the artifacts at a known path and reads from there.

For pattern 1, source is redundant with what the engine arg already says. If the user writes engine.args: ["--model=meta-llama/Llama-3.1-70B-Instruct"], having a parallel source: { huggingFace: { repo: meta-llama/Llama-3.1-70B-Instruct } } field means the user has to keep two fields in sync, and Modelplane has to decide whether to download via storage initializer (and then have the engine read from a mount) or let the engine fetch directly. Either way, one of the two configurations is doing nothing.

For pattern 2, source is meaningless. The image is the source. There's nothing external to point at.

For pattern 3, source could point at the external artifacts and have Modelplane stage them. This works, but it's the case where I believe we want the primitive to be at fleet level: a future separate ModelCache resource that names artifacts (see #66). Putting source on ModelDeployment scopes the staging to a single deployment, which is exactly the wrong scope for the cases where staging matters.

I propose we drop source and the huggingFace: block from ModelDeployment for v0.1. Patterns 1 and 2 work entirely from engine.image and engine.args. Pattern 3 waits for ModelCache in v0.2, which is also the right layer for fleet-aware caching optimizations.

For private repo access (HF tokens, NGC API keys, S3 credentials), engines already accept env vars. Add an env field on engine mirroring standard PodSpec env, with valueFrom: secretKeyRef for pull tokens:

spec:
  engine:
    name: vLLM
    version: "0.8.5"
    image: vllm/vllm-openai:v0.8.5
    imagePullSecrets:
    - name: nvcr-creds
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token
          key: token
    args:
    - "--model=moonshotai/Kimi-K2-Instruct"
    - "--tensor-parallel-size=8"
    - "--pipeline-parallel-size=2"

The Kimi K2 example becomes simpler — no source: block, no huggingFace.secretRef, just the engine config. Same for Mixtral, Qwen3-Coder, and the disagg example.

negz added 2 commits May 11, 2026 16:54
Model fetching is the engine's concern, not Modelplane's. All major
engines (vLLM, SGLang, TGI) accept the model name as a CLI arg
(--model=...) and handle downloading natively. KServe's storage
initializer is KServe-specific — neither Dynamo nor llm-d uses it.

This commit removes spec.source and spec.huggingFace from all
ModelDeployment examples. The model repo moves into engine.args as
--model=<repo>. For gated models requiring authentication, engine.env
injects credentials (HF_TOKEN via secretKeyRef). Fleet-level weight
staging (pre-caching to nodes) is a separate concern addressed by a
future ModelCache resource (#66), not by fields on ModelDeployment.

Signed-off-by: Nic Cope <nicc@rk0n.org>
KEDA's ScaledObject depends on the ModelDeployment XRD declaring a
Kubernetes scale subresource. Call out the specReplicasPath and
statusReplicasPath explicitly so it's clear this is an XRD-level
configuration, not hand-waving.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz changed the title WIP: Nic's API sketch Update design to reflect latest thinking May 12, 2026
@negz negz marked this pull request as ready for review May 12, 2026 06:49
Copilot AI review requested due to automatic review settings May 12, 2026 06:49

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Modelplane v0.1 design to reflect the latest API direction: removing the model catalog split, making ModelDeployment self-contained and expressive enough for frontier models, and shifting to replica-level fleet scheduling and scaling.

Changes:

  • Replaces InferenceEnvironment/ModelPlacement/ClusterModel with a new resource model (InferenceCluster, InferenceClass, ModelReplica, ModelService, ModelEndpoint).
  • Defines two-level scheduling (cluster label selection + node/pool CEL matching) and documents fleet scheduling + replica-only autoscaling.
  • Updates examples and diagrams to reflect the new deployment and routing model.


Comment thread design/design.md Outdated
negz added a commit that referenced this pull request May 13, 2026
InferenceClass is the bridge between hardware capabilities and
provisioning recipes. Platform teams author InferenceClass resources
describing the shape of a GPU node pool (resources block: GPUs per
node, per-GPU memory) and optionally how to provision one on a cloud
(provisioning block: machine type, accelerator, disk size). Each
InferenceCluster.nodePools[] entry references a class by name and
declares only cluster-specific counts (nodeCount, minNodeCount,
maxNodeCount, zones).

This replaces per-pool inline hardware fields and converges the GKE
and BYO (Existing) cluster shapes. The same node pool schema works for
both: classes with provisioning describe pools Modelplane creates,
classes without provisioning describe pools that already exist on a
BYO cluster.

The system node pool that hosts control-plane components (Envoy
Gateway, KEDA, etc.) is no longer in the user-facing API. The
composition function injects it automatically for GKE clusters
(e2-standard-4, 1-2 nodes). Users only declare GPU pools.

PR #64's design used DRA-shaped attributes and capacity on the class
specifically so that ModelDeployment.spec.nodeSelector CEL could
evaluate against them. With nodeSelector dropped from this branch and
pod-shape moved to workers.resources, the DRA shape adds verbosity
without a consumer. spec.resources.gpu carries the count and per-GPU
memory the scheduler and composition function actually use. The
nvidia.com/gpu device plugin name remains an internal detail of the
composition function rather than a user-facing key.

The scheduler is untouched: compose-inference-cluster still populates
status.capacity.gpuPools[] in the same shape, just sourced from the
referenced classes instead of inline pool config.

InferenceClass itself has no composed children. compose-inference-class
just marks the XR Ready.

lib/resource.py now serialises with by_alias=True so the generated
class_ alias field renders as "class" in YAML.

14 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request May 13, 2026
PR #64 splits routing apart from deployment so that fan-out (replicas
on clusters) and exposure (where requests land) can evolve
independently. ModelEndpoint is a reachable inference endpoint;
ModelService selects endpoints by label and composes the Gateway-API
HTTPRoute that exposes them.

This commit introduces both kinds, moves the Envoy Backend composition
out of ModelReplica and into ModelEndpoint, and moves the HTTPRoute
composition out of ModelDeployment and into ModelService. The pattern
mirrors Kubernetes Deployment + Service: applying a ModelDeployment
alone gets you running replicas; you author a ModelService to make
them reachable.

ModelEndpoint (namespaced, short me): carries the informational URL,
the api protocol, and the rewritePath that ModelService consumes when
composing the URLRewrite filter. compose-model-endpoint parses
spec.url, composes an Envoy Backend on the control plane, and
surfaces the Backend's name in status.routing.backendName.
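
A minimal sketch of the URL-to-Backend step, assuming `spec.url` has the `http://<ip>:<port>/...` shape mentioned later in this commit. `backend_from_url` is a hypothetical helper, and the Backend layout (`gateway.envoyproxy.io/v1alpha1`, `spec.endpoints[].ip`) is written from memory of Envoy Gateway's API, so it should be checked against the installed version.

```python
import ipaddress
from urllib.parse import urlsplit


def backend_from_url(name: str, url: str) -> dict:
    """Build an Envoy Gateway Backend manifest from an http://<ip>:<port>/... URL."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError("url must include both an IP address and a port")
    # Raises ValueError for hostnames; FQDN-style Backends are deferred.
    ipaddress.ip_address(parts.hostname)
    return {
        "apiVersion": "gateway.envoyproxy.io/v1alpha1",
        "kind": "Backend",
        "metadata": {"name": name},
        "spec": {
            "endpoints": [
                {"ip": {"address": parts.hostname, "port": parts.port}},
            ],
        },
    }
```

Validating the host with `ipaddress` is one way to enforce in code the shape that the schema does not yet enforce.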

ModelService (namespaced, short ms): carries spec.endpoints, each a
label selector. compose-model-service fetches the InferenceGateway and
all matching ModelEndpoints, then composes an HTTPRoute that matches
the service's namespace/name path prefix and rewrites to the first
matched endpoint's rewritePath, with all matched endpoints as
backendRefs (equal weighting; weight as a field is deferred). The
service's public address surfaces on status.address.
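
The HTTPRoute that results might be assembled along these lines. This is a sketch under assumptions: `compose_http_route` and the endpoint dictionary keys are hypothetical, and the `/<namespace>/<name>` prefix format is a guess at what "namespace/name path prefix" means. The Gateway API field names are the standard HTTPRoute ones.

```python
def compose_http_route(namespace: str, name: str, gateway: str, endpoints: list) -> dict:
    """Compose one HTTPRoute that fans out to every matched endpoint.

    Each endpoint is a dict with "backendName" and "rewritePath" keys
    (hypothetical shape for this sketch).
    """
    if not endpoints:
        raise ValueError("no ModelEndpoints matched the service's selectors")
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    "matches": [
                        {"path": {"type": "PathPrefix", "value": f"/{namespace}/{name}"}}
                    ],
                    "filters": [
                        {
                            "type": "URLRewrite",
                            "urlRewrite": {
                                "path": {
                                    "type": "ReplacePrefixMatch",
                                    # The first matched endpoint's path wins.
                                    "replacePrefixMatch": endpoints[0]["rewritePath"],
                                }
                            },
                        }
                    ],
                    # No explicit weight: every backend gets the default weight.
                    "backendRefs": [
                        {
                            "group": "gateway.envoyproxy.io",
                            "kind": "Backend",
                            "name": e["backendName"],
                        }
                        for e in endpoints
                    ],
                }
            ],
        },
    }
```

Using only the first endpoint's `rewritePath` means endpoints with different paths would be rewritten incorrectly, so this shape assumes all matched endpoints serve the same path.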

ModelDeployment changes: stops composing the HTTPRoute, composes one
ModelEndpoint per matched cluster (labeled
modelplane.ai/deployment: <name>, with rewritePath pointing at the
remote LLMInferenceService path), and drops status.endpoint.url. The
URL surface lives on ModelService now.

ModelReplica changes: stops composing the Envoy Backend (that moves
to ModelEndpoint) and drops both status.endpoint.url and
status.routing.backendName. The replica becomes purely about
composing the LLMInferenceService on the remote cluster.

External / SaaS endpoint support (fqdn-style Backends) is deferred.
spec.url is expected to be an http://<ip>:<port>/... shape today; the
schema doesn't enforce that yet.

16 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
dennis-upbound pushed a commit that referenced this pull request May 13, 2026
Fold in Bassam's "three packaging patterns" framing from PR #64 review
comment 4414021192 (engine-fetches-weights / engine-image-bakes-weights /
runtime-and-artifacts-separate). Grounds ModelCache as the Pattern 3
primitive that also accelerates Pattern 1; clarifies why Pattern 2 (NIM)
doesn't need it.

Add Locality routing subsection in v0.3 substrate unification connecting
the three primitives to #71 ModelService routing affinity. Cold-start
pipeline covers what new replicas need; locality routing covers where
existing requests go. ModelCache feeds both — status.clusters[] is the
eligibility signal for fleet routing.

v0.1 mechanism: tighten the scheduling-gating bullet to explicitly call
out status.clusters[] as the eligibility signal the fleet matcher reads,
not just an implicit scheduler hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dennis-upbound pushed a commit that referenced this pull request May 13, 2026
Status header now explicitly notes ModelCache advances from v0.2 (per
Bassam's PR #64 review framing) to v0.1, driven by multi-node serving
requirements (#61 closure) and DRA landing in v0.1 (#56). Flagged for
team alignment since this is a deliberate timeline shift from the
earlier framing.

Roadmap #66 line: tighten the awkward kind/source bundling
("Weights/Tokenizer/Bytes/inline/configMap" mixed kinds with sources)
into separate categories with backtick-normalized backend / kind names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
negz added 5 commits May 15, 2026 11:45
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
An MD's spec.engine was a tiny subset of a podSpecTemplate, containing
only the image and args. Design feedback pointed out that we'd need
imagePullSecrets, env, and a configurable /dev/shm size as well.

This got me thinking: is this just going to grow into podSpecTemplate
over time? I think that's likely. Dynamo hit it and corrected it in
ai-dynamo/dynamo#8069.

A full podSpecTemplate has some quirks at the cluster scheduler level
though. For example when we reference pullSecrets or envFrom - are those
resolved at the modelplane control plane level or at the inference
cluster? I think mostly it has to be at the inference cluster level.
Ultimately it'll be a node on the inference cluster pulling the worker
pod, so it needs pull secrets.

As a compromise I think we should start with a subset of
podSpecTemplate. Just the things we know we need today. We can grow into
adding more fields without a breaking change.

One important side effect of this is we now have a containers array -
not just the engine container. So spec.engine feels like the wrong
place. The user _could_ add a sidecar container too. I doubt they would
at first, but I could see this becoming the case. With that in mind I'm
proposing we move this to spec.workers.template - i.e. a worker
template.

Signed-off-by: Nic Cope <nicc@rk0n.org>
InferenceClass attributes and capacity were framed as though
Modelplane defines a vocabulary -- "gpu.nvidia.com/* for what NVIDIA
publishes, modelplane.ai/* for what Modelplane defines." But
Modelplane is not autodiscovering these from a DRA driver. They are
authored by the platform team and matched against by the ML team's
CEL expressions. They are a contract between those two teams.

The contract does have a meaningful boundary though: when the
composition function forms DRA ResourceClaims on the workload cluster,
it passes through keys that match real DRA device attributes (e.g.
gpu.nvidia.com/architecture) and filters out modelplane.ai/* keys.
So gpu.nvidia.com/* keys should match what the DRA driver actually
publishes in ResourceSlices, while modelplane.ai/* keys are for
fleet-scheduling properties that don't exist as per-device DRA
attributes (GPU count per node, inter-node networking).

This revealed two misplaced keys. gpu.nvidia.com/features was
invented -- the NVIDIA DRA driver doesn't publish a features list.
The Kimi K2 CEL example now matches on cudaComputeCapability instead,
which is a real DRA device attribute. gpu.nvidia.com/gpuCount was also
invented -- DRA models individual devices, not "how many GPUs per
node." Both are pool-level scheduling properties, not per-device DRA
attributes, so gpuCount moves to modelplane.ai/ and features is
dropped.
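
The key boundary can be illustrated with a short filter. This is a sketch only: `split_attributes` is a hypothetical name, and the attribute values in the usage lines are made up.

```python
MODELPLANE_PREFIX = "modelplane.ai/"


def split_attributes(attrs: dict) -> tuple:
    """Split class attributes into (DRA pass-through, fleet-scheduling only)."""
    dra = {k: v for k, v in attrs.items() if not k.startswith(MODELPLANE_PREFIX)}
    fleet = {k: v for k, v in attrs.items() if k.startswith(MODELPLANE_PREFIX)}
    return dra, fleet


# Example values are illustrative, not taken from a real InferenceClass.
dra, fleet = split_attributes(
    {
        "gpu.nvidia.com/architecture": "Hopper",
        "modelplane.ai/gpuCount": 8,
    }
)
print(dra)    # {'gpu.nvidia.com/architecture': 'Hopper'}
print(fleet)  # {'modelplane.ai/gpuCount': 8}
```

Only the first dictionary would feed a ResourceClaim on the workload cluster. The second stays with the fleet scheduler.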

Signed-off-by: Nic Cope <nicc@rk0n.org>
The topology block had a strategy discriminator (Tensor,
TensorPipeline, DataExpert) selecting which parallelism fields were
relevant. But TP, PP, and DP/EP are independent axes that compose
multiplicatively through a single universal formula. The discriminator
was not selecting between different derivation rules or different field
sets -- it was naming which axes happened to be non-default.

That is not what a discriminated union is for. In a real discriminated
union (Service.type, VolumeSource) the variants have disjoint fields
with different semantics. Here every field participates in the same
formula regardless of which others are set. Setting pipeline=2 does
not change what tensor=8 means.

The practical cost of the discriminator is that every new axis
combination needs a new enum value. Models like DeepSeek V3
(TP+EP+PP) do not fit any of the three existing strategies.

topology now has four flat fields: tensor (required), pipeline, data,
and dataLocal (all defaulting to 1). The derivation is always:

  Nodes per worker = pipeline * (data / dataLocal)
  GPUs per node    = tensor * dataLocal
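
The derivation can be checked with a few lines of Python. The `Topology` class is a hypothetical stand-in for the topology block, and the requirement that `data` be a multiple of `dataLocal` is an assumption added here so the division is exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    tensor: int          # required
    pipeline: int = 1
    data: int = 1
    data_local: int = 1  # dataLocal in the API

    def __post_init__(self) -> None:
        for name in ("tensor", "pipeline", "data", "data_local"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        # Assumption: data must divide evenly across nodes.
        if self.data % self.data_local:
            raise ValueError("data must be a multiple of data_local")

    @property
    def nodes_per_worker(self) -> int:
        return self.pipeline * (self.data // self.data_local)

    @property
    def gpus_per_node(self) -> int:
        return self.tensor * self.data_local


# TP=8, PP=2: two nodes of eight GPUs each.
t = Topology(tensor=8, pipeline=2)
print(t.nodes_per_worker, t.gpus_per_node)  # 2 8

# DP=16 with 8 data-parallel ranks per node: also two nodes of eight GPUs.
d = Topology(tensor=1, data=16, data_local=8)
print(d.nodes_per_worker, d.gpus_per_node)  # 2 8
```

Because every field defaults to 1, combining axes never needs a new enum value: setting an extra field simply multiplies into the same two formulas.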

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz requested review from bassam and dennis-upbound May 16, 2026 01:33
Signed-off-by: Nic Cope <nicc@rk0n.org>

@dennis-upbound dennis-upbound left a comment

Collaborator


Yay!!!

@negz negz merged commit 32bc956 into main May 16, 2026
4 checks passed
negz added a commit that referenced this pull request May 18, 2026
… ModelReplica

The PR #64 design replaces InferenceEnvironment with InferenceCluster and
ModelPlacement with ModelReplica. The new vocabulary lines up with how
the rest of the design is shifting: ModelDeployment fans out to
ModelReplicas (one per cluster), the way a Kubernetes Deployment fans
out to Pods. "Environment" was always overloaded - it meant both "GPU
cluster" and "organizational stage" - so dropping it in favour of
"cluster" tightens the model.

This commit applies both renames mechanically:

- apis/inferenceenvironments  -> apis/inferenceclusters (kind, plural,
  short name ic)
- apis/modelplacements        -> apis/modelreplicas (kind, plural,
  short name mr)
- functions/compose-inference-env   -> compose-inference-cluster
- functions/compose-model-placement -> compose-model-replica
- tests/test-inference-env{,-existing} -> tests/test-inference-cluster{,-existing}
- tests/test-model-placement{,-autoscaling,-multinode}
  -> tests/test-model-replica{,-autoscaling,-multinode}
- tests/test-model-deployment-incompatible-env
  -> tests/test-model-deployment-incompatible-cluster
- examples/platform/inference-environment-*.yaml
  -> inference-cluster-*.yaml

Field renames that follow from the kind renames:

- ModelReplica.spec.inferenceEnvironmentRef -> inferenceClusterRef
- ModelDeployment.spec.environmentSelector  -> clusterSelector
- ModelDeployment.spec.environments         -> clusters
- ModelDeployment.status.placements         -> replicas
- ModelDeployment printer column ENVS       -> CLUSTERS
- ModelReplica printer column ENVIRONMENT   -> CLUSTER

Label renames:

- modelplane.ai/environment -> modelplane.ai/cluster
- modelplane.ai/placement   -> modelplane.ai/replica

Condition type / reason renames track the same vocabulary (e.g.
PlacementsScheduled -> ReplicasScheduled, NoEnvironments -> NoClusters).
The composed resource keys in compose-model-deployment also move from
"placement-<name>" to "replica-<name>".

The ClusterModel/Model serving[].environmentSelector field is left
alone here - those resources are being removed entirely in the next
commit.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request May 18, 2026
The catalog split between ClusterModel/Model and ModelDeployment didn't
hold up in practice. Engine args are inherently model-specific (the
model name lives in --model=...), different quantization variants
reference different weight checkpoints, and the "platform team curates
a catalog" responsibility is real ongoing engineering work that most
organizations don't have a team for. This commit folds engine and
topology config inline on ModelDeployment, following the PR #64 design.

ClusterModel and Model are removed entirely along with their
composition function (compose-model). Source / huggingFace blocks are
gone too - engines fetch their own weights via their native
--model=<repo> argument. Auth for private repos goes through standard
PodSpec env (HF_TOKEN, NGC_API_KEY).

The new ModelDeployment spec:

- spec.replicas        How many ModelReplicas to fan out to.
- spec.clusterSelector Label selector against InferenceCluster.
- spec.workers         Compute shape of one worker:
    .count               (default 1)
    .topology.strategy   Tensor | TensorPipeline
    .topology.tensor     GPUs per node
    .topology.pipeline   Nodes per worker (TensorPipeline only)
    .resources.cpu       Required, no default
    .resources.memory    Required, no default
- spec.engine          Engine config:
    .image               Container image
    .args                Engine args (opaque)
    .env                 PodSpec-shaped env vars (HF_TOKEN etc.)
    .imagePullSecrets    For NGC and similar registries.

ModelReplica mirrors this shape minus spec.replicas and plus
spec.inferenceClusterRef.

DataExpert topology, disaggregated prefill/decode (spec.prefill), and
real CEL evaluation on nodeSelector are deferred. The scheduler reads
the topology shape directly (no more VRAM math): a pool fits when its
countPerNode >= topology.tensor and its nodes >= topology.pipeline.
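
The fit rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GpuPool` and `Topology` types, the `pool_fits` helper, and the pool names are hypothetical and are not taken from compose-model-deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical stand-in for spec.workers.topology."""
    tensor: int        # GPUs per node
    pipeline: int = 1  # nodes per worker (TensorPipeline only)


@dataclass(frozen=True)
class GpuPool:
    """Hypothetical stand-in for one status.capacity.gpuPools[] entry."""
    name: str
    count_per_node: int  # GPUs on each node in the pool
    nodes: int           # nodes in the pool


def pool_fits(pool: GpuPool, topology: Topology) -> bool:
    """A pool fits when it has enough GPUs per node and enough nodes."""
    return (pool.count_per_node >= topology.tensor
            and pool.nodes >= topology.pipeline)


pools = [GpuPool("a100-x4", 4, 2), GpuPool("h100-x8", 8, 4)]
topology = Topology(tensor=8, pipeline=2)
fitting = [p.name for p in pools if pool_fits(p, topology)]
print(fitting)
```

Only the shape is compared here; no per-GPU memory arithmetic is involved, which is the point of dropping the VRAM math.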

Autoscaling drops out. The XRD declares the standard /scale subresource
so kubectl scale works, but the KEDA ScaledObject and Prometheus query
plumbing are removed. KEDA-via-scale-subresource opt-in lands later.

The status.model.name field is gone - the model identity now lives in
opaque engine args, and best-effort parsing would be brittle.

Naming on the remote cluster shifts from "model name sanitized to a DNS
label" to just the ModelDeployment name. Each remote cluster gets one
LLMInferenceService per deployment with this name, so the control plane
HTTPRoute can rewrite to a uniform path on every backend.

The compose-model-deployment scheduler still prefers clusters that
already have a replica for this deployment (stability). Capacity
accounting subtracts GPUs consumed by other deployments' replicas
based on each replica's own workers.topology.
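
A rough sketch of the stability preference and capacity accounting described above. It makes several assumptions that are not stated in this commit: capacity is tracked per cluster rather than per pool, a replica consumes `workers * tensor * pipeline` GPUs, and the `Replica`, `free_gpus`, and `rank_clusters` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of a ModelReplica: owner, cluster, worker shape."""
    deployment: str
    cluster: str
    workers: int = 1
    tensor: int = 1
    pipeline: int = 1

    def gpus(self) -> int:
        # Assumption: GPUs consumed = workers * GPUs per node * nodes per worker.
        return self.workers * self.tensor * self.pipeline


def free_gpus(total: dict, replicas: list, deployment: str) -> dict:
    """Subtract GPUs held by other deployments' replicas from each cluster."""
    free = dict(total)
    for r in replicas:
        if r.deployment != deployment and r.cluster in free:
            free[r.cluster] -= r.gpus()
    return free


def rank_clusters(total: dict, replicas: list, deployment: str, needed: int) -> list:
    """Clusters with room, preferring ones that already host this deployment."""
    free = free_gpus(total, replicas, deployment)
    existing = {r.cluster for r in replicas if r.deployment == deployment}
    candidates = [c for c, g in free.items() if g >= needed]
    return sorted(candidates, key=lambda c: (c not in existing, c))


total = {"east": 16, "west": 16, "north": 8}
replicas = [
    Replica("other", "east", workers=1, tensor=8, pipeline=2),  # holds 16 GPUs
    Replica("llama", "west", tensor=8),                         # our own replica
]
ranked = rank_clusters(total, replicas, "llama", needed=8)
print(ranked)
```

The deployment's own replicas are not subtracted, so a cluster that already hosts it stays eligible and sorts first.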

13 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request May 18, 2026
InferenceClass is the bridge between hardware capabilities and
provisioning recipes. Platform teams author InferenceClass resources
describing the shape of a GPU node pool (resources block: GPUs per
node, per-GPU memory) and optionally how to provision one on a cloud
(provisioning block: machine type, accelerator, disk size). Each
InferenceCluster.nodePools[] entry references a class by name and
declares only cluster-specific counts (nodeCount, minNodeCount,
maxNodeCount, zones).

This replaces per-pool inline hardware fields and converges the GKE
and BYO (Existing) cluster shapes. The same node pool schema works for
both: classes with provisioning describe pools Modelplane creates,
classes without provisioning describe pools that already exist on a
BYO cluster.

The system node pool that hosts control-plane components (Envoy
Gateway, KEDA, etc.) is no longer in the user-facing API. The
composition function injects it automatically for GKE clusters
(e2-standard-4, 1-2 nodes). Users only declare GPU pools.

PR #64's design used DRA-shaped attributes and capacity on the class
specifically so that ModelDeployment.spec.nodeSelector CEL could
evaluate against them. With nodeSelector dropped from this branch and
pod-shape moved to workers.resources, the DRA shape adds verbosity
without a consumer. spec.resources.gpu carries the count and per-GPU
memory the scheduler and composition function actually use. The
nvidia.com/gpu device plugin name remains an internal detail of the
composition function rather than a user-facing key.

The scheduler is untouched: compose-inference-cluster still populates
status.capacity.gpuPools[] in the same shape, just sourced from the
referenced classes instead of inline pool config.
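
As an illustration of sourcing pool capacity from referenced classes, the sketch below joins node pools to classes by name. The `gpu_pools` helper and the `count` and `memory` key names are assumptions made for the example; only `countPerNode`, `nodes`, `nodeCount`, and `class` echo names used in these commits.

```python
def gpu_pools(node_pools: list, classes: dict) -> list:
    """Build gpuPools-style entries by looking up each pool's class."""
    pools = []
    for pool in node_pools:
        class_name = pool["class"]
        if class_name not in classes:
            raise ValueError(f"node pool {pool['name']!r} references unknown class {class_name!r}")
        gpu = classes[class_name]["resources"]["gpu"]
        pools.append({
            "name": pool["name"],
            "countPerNode": gpu["count"],  # hardware shape comes from the class
            "memory": gpu["memory"],
            "nodes": pool["nodeCount"],    # counts stay cluster-specific
        })
    return pools


classes = {"h100-x8": {"resources": {"gpu": {"count": 8, "memory": "80Gi"}}}}
node_pools = [{"name": "large", "class": "h100-x8", "nodeCount": 4}]
capacity = gpu_pools(node_pools, classes)
print(capacity)
```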

InferenceClass itself has no composed children. compose-inference-class
just marks the XR Ready.

lib/resource.py now serialises with by_alias=True so the generated
class_ alias field renders as "class" in YAML.
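
A minimal sketch of the aliasing behaviour, assuming Pydantic v2. The `NodePool` model here is hand-written for illustration and is not the generated model.

```python
from pydantic import BaseModel, ConfigDict, Field


class NodePool(BaseModel):
    """Illustrative model: 'class' is a Python keyword, so the attribute
    is class_ and the wire name is kept as an alias."""
    model_config = ConfigDict(populate_by_name=True)

    name: str
    class_: str = Field(alias="class")


pool = NodePool(name="gpu", class_="h100-x8")
plain = pool.model_dump()                # keys use the Python attribute names
aliased = pool.model_dump(by_alias=True)  # keys use the wire names
print(plain)
print(aliased)
```

Without `by_alias=True` the serialised key would be `class_`, which is not the name the API schema uses.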

14 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request May 18, 2026
PR #64 splits routing apart from deployment so that fan-out (replicas
on clusters) and exposure (where requests land) can evolve
independently. ModelEndpoint is a reachable inference endpoint;
ModelService selects endpoints by label and composes the Gateway-API
HTTPRoute that exposes them.

This commit introduces both kinds, moves the Envoy Backend composition
out of ModelReplica and into ModelEndpoint, and moves the HTTPRoute
composition out of ModelDeployment and into ModelService. The pattern
mirrors Kubernetes Deployment + Service: applying a ModelDeployment
alone gets you running replicas; you author a ModelService to make
them reachable.

ModelEndpoint (namespaced, short me): carries the informational URL,
the api protocol, and the rewritePath that ModelService consumes when
composing the URLRewrite filter. compose-model-endpoint parses
spec.url, composes an Envoy Backend on the control plane, and
surfaces the Backend's name in status.routing.backendName.
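
A sketch of the URL parsing step, using only the standard library. The `backend_endpoint` helper is hypothetical, and the returned dictionary is assumed to mirror an IP-style endpoint entry on an Envoy Gateway Backend; the actual compose-model-endpoint code may differ.

```python
import ipaddress
from urllib.parse import urlsplit


def backend_endpoint(url: str) -> dict:
    """Turn an http://<ip>:<port>/... URL into an IP endpoint entry."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.hostname is None:
        raise ValueError(f"expected an http://<ip>:<port>/... URL, got {url!r}")
    # ip_address raises ValueError for hostnames, which this sketch rejects.
    address = str(ipaddress.ip_address(parts.hostname))
    port = parts.port or 80  # default HTTP port when none is given
    return {"ip": {"address": address, "port": port}}


endpoint = backend_endpoint("http://10.0.0.5:8000/v1")
print(endpoint)
```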

ModelService (namespaced, short ms): carries spec.endpoints, each a
label selector. compose-model-service fetches the InferenceGateway and
all matching ModelEndpoints, then composes an HTTPRoute that matches
the service's namespace/name path prefix and rewrites to the first
matched endpoint's rewritePath, with all matched endpoints as
backendRefs (equal weighting; weight as a field is deferred). The
service's public address surfaces on status.address.
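
The route composition described above might look roughly like the following. This is a sketch, not the compose-model-service implementation: the `http_route` helper, the `/<namespace>/<name>` prefix format, and the input dictionary keys are assumptions, and the Backend group is assumed to be Envoy Gateway's `gateway.envoyproxy.io`.

```python
def http_route(namespace: str, name: str, gateway: dict, endpoints: list) -> dict:
    """Compose a Gateway API HTTPRoute for one ModelService."""
    if not endpoints:
        raise ValueError("no matching ModelEndpoints")
    prefix = f"/{namespace}/{name}"  # assumed path layout
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{"name": gateway["name"], "namespace": gateway["namespace"]}],
            "rules": [{
                "matches": [{"path": {"type": "PathPrefix", "value": prefix}}],
                # Rewrite to the first matched endpoint's rewritePath.
                "filters": [{
                    "type": "URLRewrite",
                    "urlRewrite": {"path": {
                        "type": "ReplacePrefixMatch",
                        "replacePrefixMatch": endpoints[0]["rewritePath"],
                    }},
                }],
                # No explicit weights, so backends share traffic equally.
                "backendRefs": [
                    {"group": "gateway.envoyproxy.io", "kind": "Backend", "name": e["backendName"]}
                    for e in endpoints
                ],
            }],
        },
    }


route = http_route(
    "ml-team", "mixtral",
    {"name": "inference", "namespace": "modelplane-system"},
    [{"backendName": "mixtral-east", "rewritePath": "/ml-team/mixtral"},
     {"backendName": "mixtral-west", "rewritePath": "/ml-team/mixtral"}],
)
rule = route["spec"]["rules"][0]
print(rule["matches"][0]["path"]["value"], [b["name"] for b in rule["backendRefs"]])
```

Because a single rewrite filter applies to every backend in the rule, this only works when all endpoints accept the same path, which is why the remote resource is named after the ModelDeployment.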

ModelDeployment changes: stops composing the HTTPRoute, composes one
ModelEndpoint per matched cluster (labeled
modelplane.ai/deployment: <name>, with rewritePath pointing at the
remote LLMInferenceService path), and drops status.endpoint.url. The
URL surface lives on ModelService now.

ModelReplica changes: stops composing the Envoy Backend (that moves
to ModelEndpoint) and drops both status.endpoint.url and
status.routing.backendName. The replica becomes purely about
composing the LLMInferenceService on the remote cluster.

External / SaaS endpoint support (fqdn-style Backends) is deferred.
spec.url is expected to be an http://<ip>:<port>/... shape today; the
schema doesn't enforce that yet.

16 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request Jun 4, 2026
The previous CLI (negz/cli:diy) pinned datamodel-code-generator 0.31.2,
which generated broken Python models for fields named int/bool - it
emitted undefined int_aliased/bool_aliased type references across every
model file. This forced workarounds like naming DRA attribute fields
boolean/integer instead of their wire names bool/int.

This pins the CLI to negz/cli:mp (crossplane/cli#24 and #64 cherry-picked
onto main), which bumps datamodel-code-generator past the fix in 0.54.0.
Builtin-conflicting field names now generate a trailing-underscore Python
attribute with the original name preserved as a Pydantic alias.

This commit only bumps the CLI and regenerates schemas/python/models. The
regen reflows every model with the newer generator (mostly Optional[X] ->
X | None), so the diff is large but mechanical.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request Jun 6, 2026
The repo pinned the Crossplane CLI to negz/cli:diy, a fork branch carrying
an unreleased datamodel-code-generator bump. That bump (crossplane/cli#24
and #64) has since merged to crossplane/cli main, so this repins the CLI to
crossplane/cli directly and regenerates the Python models. The regen reflows
the affected models with the newer generator (mostly Optional[X] -> X | None).

The newer generator (datamodel-code-generator 0.59.0) emits object-typed
field defaults as a default_factory rather than a plain value. The Crossplane
SDK's resource.update serializes composed resources with
model_dump(exclude_defaults=True), which no longer recognizes the
factory-built default as equal to the declared default, so unset fields leak
into composed resources. This keeps crossplane-function-sdk-python pinned to
#208, which serializes with exclude_unset instead - "did the caller set this
field?" rather than "is it different from its default?" - which is the correct
question under server-side apply and immune to how a default is represented.

Switching the whole repo to exclude_unset surfaces a few places that
explicitly set fields to None or to a defaulted value, which exclude_defaults
previously dropped. compose-serving-stack built provider-kubernetes Objects
and Helm Releases with metadata=None and ObjectMeta(namespace=None); those now
only set the field when it's present. The compose-inference-cluster and
compose-model-deployment test fixtures are updated to reflect that explicitly
set values (a node pool's kubernetesVersion and diskSizeGb, a replica's worker
count and pipeline) now appear in composed resources.
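
The difference between the two serialisation modes for explicitly set values can be seen in a small example, assuming Pydantic v2. The `ObjectMeta` model below is a hand-written stand-in, not the generated one.

```python
from typing import Optional

from pydantic import BaseModel


class ObjectMeta(BaseModel):
    """Illustrative stand-in for a generated metadata model."""
    name: Optional[str] = None
    namespace: Optional[str] = None


# Explicitly passing namespace=None marks the field as set.
meta = ObjectMeta(name="stack", namespace=None)
by_defaults = meta.model_dump(exclude_defaults=True)  # drops values equal to the default
by_unset = meta.model_dump(exclude_unset=True)        # keeps anything the caller set

# The fix pattern: only pass the field when there is a value for it.
namespace = None
kwargs = {"name": "stack"}
if namespace is not None:
    kwargs["namespace"] = namespace
fixed = ObjectMeta(**kwargs).model_dump(exclude_unset=True)

print(by_defaults, by_unset, fixed)
```

Under `exclude_unset` the explicit `None` is emitted, so call sites have to stop setting fields they do not mean to send.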

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz added a commit that referenced this pull request Jun 9, 2026
negz added a commit that referenced this pull request Jun 10, 2026
@negz negz deleted the pages branch June 16, 2026 16:56
negz added a commit that referenced this pull request Jun 20, 2026
The flake pinned crossplane-cli to the negz/cli default-to-go branch
because the CLI changes modelplane depends on weren't yet merged
upstream. They now are: #126 (host-native default flake package) has
merged, joining the already-merged #24, #64, and #119, and #127
(decompress function runtime tarballs once when loading) has merged on top.

This change repoints the input at crossplane/cli main and bumps the lock
to that commit, so we no longer depend on a personal fork. It stays on
main rather than a tag because the fixes aren't in a release yet.

Signed-off-by: Nic Cope <nicc@rk0n.org>
4 participants