Update design to reflect latest thinking by negz · Pull Request #64 · modelplaneai/modelplane

negz · 2026-05-07T23:56:52Z

The team spent a lot of time on Modelplane and its API last week. The most significant change we've aligned on is to dial down the separation of concerns on deploying a model. Specifically we've dropped the model catalog and we're going to expose more knobs to the ML teams authoring a ModelDeployment.

We've made ModelDeployment much more expressive in order to make sure it could deploy a frontier (open weight) model. We've adopted a subset of DRA to express scheduling constraints. We've also decided to scale only at the replica level - i.e. a ModelDeployment can scale the number of ModelReplicas but there's no scaling within one ModelReplica.

Signed-off-by: Nic Cope <nicc@rk0n.org>

Refine the design/api-update.md sketch following discussion of the scaling model, disaggregated serving, and the routing surface:

- Rename ModelPlacement to ModelReplica throughout. Each replica is one complete serving instance - single-node, multi-node via LWS, or full prefill/decode disagg. Mirrors Deployment -> Pod naming.
- Drop spec.scaling from ModelDeployment. Autoscaling is opt-in via a separate KEDA ScaledObject targeting the MD's scale subresource - same pattern as Deployment + HPA. Add a worked Mixtral example with ScaledObject alongside.
- Add a discriminated-union pattern for disaggregated prefill/decode. A serving profile is either unified (root poolSelector / parallelism / engine) or disagg (explicit decode and prefill blocks, each self-contained - no inheritance from the root). Decode and prefill must land on the same InferenceCluster but can target different pools.
- Move inter-node networking onto InferenceClass instead of cluster-level capabilities. Different networking implies a different class (h200-nvl-8x-ib vs h200-nvl-8x); networking belongs to the pool that uses it. Drop spec.capabilities from InferenceCluster - cluster-level metadata is captured as standard Kubernetes labels.
- Lift clusterSelector to deployment level. Profiles only carry pool selection and per-pool composition, since the cluster intent doesn't change between fallback profiles.
- Switch ModelService routing to a single spec.endpoints[] pattern (was separate selector vs routes paths). One mechanism for both simple and weighted routing.
- Drop spec.model.name in favor of metadata.name as the served model identifier. The HuggingFace repo (or other source) is purely where weights come from, not the model's identity.
- Add YAML comments throughout the examples explaining what each field does - what gets matched, what gets composed, what's optional.

Signed-off-by: Nic Cope <nicc@rk0n.org>

Different hardware targets typically require different model weight checkpoints (FP8 vs BF16 are different HuggingFace repos). That makes fallback profiles within one deployment the wrong abstraction - they're genuinely different deployments. Silent degradation (falling back to a config with lower context length or different quantization) is also arguably worse than explicit failure.

This commit flattens the serving profile array. poolSelector, parallelism, and engine are now top-level fields on ModelDeployment.spec. Different hardware configurations are separate ModelDeployments behind one ModelService. The deployer makes explicit decisions about which configurations to run.

The disaggregated prefill/decode pattern is now a discriminated union on ModelDeployment itself: either root-level poolSelector/parallelism/engine (unified) or explicit decode/prefill blocks (disaggregated).

If preferential scheduling is needed later, it would be a coordination mechanism between ModelDeployments, not inline profiles.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The parallelism block describes more than just parallelism - it's the complete compute topology of a role within one ModelReplica. Renaming to topology makes room for the per-role instance count (instances) which describes replication rather than sharding.

For disaggregated prefill/decode, the P:D ratio (e.g., 5P3D) is the number of independent instances per role within one ModelReplica. This is a topology parameter - fixed per deployment, not a scaling knob. It maps to KServe's LLMInferenceService.spec.replicas (decode) and spec.prefill.replicas (prefill). For unified serving, instances defaults to 1 and can be omitted.

Other changes in this commit:

- Require DRA on all InferenceClusters. Drop nodeSelector from pool declarations - DRA handles device-to-node binding. Pools are now just name, class, and maxNodes.
- Rename poolSelector to nodeSelector on ModelDeployment. With DRA required and nodeSelector gone from pools, the naming collision is resolved.
- Replace driver.version with cuda.toolkit as the typed capability example on InferenceClass. Driver version is a runtime property of the cluster, not the hardware SKU. CUDA toolkit version is a better example of where {type: version} decoration matters.

Signed-off-by: Nic Cope <nicc@rk0n.org>

…shape to #64

- Rename ModelPlacement → ModelReplica everywhere (XRD, status fields, printer columns, doc, examples). Aligns with the "replica == placement" mental model. Pure rename + role expansion.
- Add Federation-layer scheduling section: composer / matcher / backend adapter / capacity adapter contracts, the actual matcher pseudocode, out-of-scope items with effort + uncertainty, ordering reasoning.
- Add Plugin/adapter system section: six adapter axes, version-pinned KServe absorbing schema churn, end-to-end managed + BYOC examples.
- Add BYOC scheduling section: onboarding flow, what "managed" means axis-by-axis, edge cases (no DRA, multiple schedulers, RBAC).
- Expand autoscaling walkthrough: KEDA / composer / matcher / backend loop with concrete scale-up + scale-down sequences.
- Replace v1/v2 framing with effort sizes + ordering + uncertainty.
- Defer API shape, hardware taxonomy, engine-features detail to #64.

- Remove xrds/ (and lint.py / LINT.md) - API shape lives in #64. Examples stay as illustrative scheduling-relevant YAML.
- New section: "What we treat as IR" - three IRs (ModelReplica explicit; cluster substrate + endpoint binding implicit today). Argues why naming IR seams is what makes BYO-* cheap.
- New section: "Crossplane lifecycle layers" - per-layer XR ownership is what enables pause/resume, GitOps drift, RBAC boundaries, version-skew handling per cluster.
- Reframe plugin/adapter axes: be honest the count is contingent. Two user-visible axes (scheduler, backend); the other four are internal / collapsible. Don't read into the number.
- New section: "User-facing surface preview" with Quickstart (4 CRs, ~60 lines of YAML to a working curl) + 5 Advanced scenarios as deltas (multi-region, BYOC+KAI, P/D disagg, custom InferenceClass, spillover). Goal: gauge complexity of the proposed scheduling design from the user's seat.

…nter

Pivots this PR from design preview to implementation sketch. The code under functions/ doesn't run (Nic's #64 protos aren't generated yet) but the shape, dependencies, and use cases are real.

Implementation:

- functions/compose-model-deployment/scheduling.py - federation matcher. Plain Python, no Crossplane imports - testable in isolation. Filters ICs by clusterSelector.matchLabels, filters pools by nodeSelector.cel against class capabilities, capacity check with sticky-placement accounting, scores and picks per replica. Topology strategies map to (nodes_per_inst, gpus_per_node). Disagg requires same-cluster decode + prefill pools.
- functions/compose-model-deployment/main.py - Crossplane glue. Required-resources for clusters/classes/owned MRs, calls scheduling.match(), emits ModelReplica + ModelEndpoint per spec.replicas, sets MD conditions.
- functions/compose-model-placement/main.py - renderer. Reads MR + matched IC + class(es), composes KServe LLMInferenceService (decode + optional prefill) + DRA ResourceClaim(s) on the target cluster via remote-object provider. Lifts cold-start conditions back as MR.status.

Docs:

- design.md compressed to ~180 lines: architecture diagram, what-lives-where table, dependencies per function, use cases traced through the code.
- README.md is now a 2-section pointer at design.md + the code.
- Deleted quickstart.md / advanced.md / scheduling.md - served the design phase; the code is the new source of truth.
- examples/README.md maps each example to the matcher/renderer code path it exercises.

Adapter functions (_load_md, _resolve_clusters, _load_mr, _cel_from_capabilities) raise NotImplementedError - they're the wiring points that fill in once #64 lands and protos are generated.
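To make the matcher description concrete, here is a minimal sketch of the filter, capacity-check, and pick loop. It is not the code in scheduling.py: the `Pool` and `Cluster` types, the `match` signature, and the scoring (sticky clusters first, then most free nodes) are assumptions for illustration, and the nodeSelector.cel evaluation is swapped for a plain GPUs-per-node comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    gpus_per_node: int
    free_nodes: int


@dataclass
class Cluster:
    name: str
    labels: dict
    pools: list = field(default_factory=list)


def labels_match(labels, match_labels):
    """True when every matchLabels entry is present with the same value."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def match(clusters, match_labels, nodes_per_inst, gpus_per_node, replicas,
          sticky=frozenset()):
    """Pick one (cluster, pool) per replica, or fewer if capacity runs out.

    `sticky` names clusters that already host a replica of this deployment;
    they are preferred so existing placements stay put (assumed scoring).
    """
    # Track remaining free nodes so replicas placed in this call are counted.
    free = {(c.name, p.name): p.free_nodes for c in clusters for p in c.pools}
    placements = []
    for _ in range(replicas):
        candidates = []
        for c in clusters:
            if not labels_match(c.labels, match_labels):
                continue
            for p in c.pools:
                remaining = free[(c.name, p.name)]
                if p.gpus_per_node >= gpus_per_node and remaining >= nodes_per_inst:
                    candidates.append((c.name in sticky, remaining, c.name, p.name))
        if not candidates:
            break  # unschedulable replicas are simply left unplaced here
        _, _, cluster, pool = max(candidates)
        free[(cluster, pool)] -= nodes_per_inst
        placements.append((cluster, pool))
    return placements
```

Keeping the function free of Crossplane imports, as the commit describes, means it can be exercised with plain objects in a unit test.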

Adds stage-2 scheduler integration to the renderer + a sketch of the capacity feedback loop. Same sketch quality as the rest of the PR - doesn't run, but the dispatch shape, per-scheduler differences, and capacity-status pipeline are honest.

- functions/compose-model-placement/scheduler.py - per-scheduler wrap functions. KAI: schedulerName + PodGroup CRD wrapping the LWS gang (minMember = total pods). Kueue: kueue.x-k8s.io/queue-name label + suspend gate (Kueue's webhook creates the Workload). none: pass-through. Single dispatch table; new schedulers (Volcano, etc.) plug in here.
- functions/compose-model-placement/main.py - wired to call scheduler.wrap() after building the base LLM-IS spec. Emits the wrapped spec + any scheduler-companion objects (PodGroup) onto the same target cluster via the existing remote-object provider.
- lib/capacity_adapter/{__init__,common,kai,kueue}.py - sketch of the per-scheduler status pullers. Reads the scheduler's status CRDs (KAI Queue/ResourcePool, Kueue ClusterQueue.flavorsUsage[]), normalizes into the shared CapacitySnapshot type, writes to IC.status.capacity. NOT a Crossplane composition function - runs as a separate controller, one per IC. Sketch shows the projection logic; K8s client wiring is NotImplementedError stubs.
- design.md updated with a KAI/Kueue section: per-scheduler differences table, dispatch wiring diagram, capacity feedback loop diagram, how to add a new scheduler. Notes the small API extension needed on Nic's #64 (IC.spec.scheduler.type).
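The single dispatch table could look roughly like the sketch below. This is an illustration rather than the contents of scheduler.py: the dict layout of the spec, the `kai-scheduler` scheduler name, the `default` queue name, and the placement of the `suspend` field are all assumptions, and the PodGroup deliberately omits its apiVersion.

```python
import copy

KUEUE_QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def wrap_none(spec, total_pods):
    """Pass-through: no scheduler-specific decoration."""
    return spec, []


def wrap_kai(spec, total_pods):
    """Set schedulerName and emit a PodGroup that gangs all pods."""
    spec["spec"]["schedulerName"] = "kai-scheduler"  # assumed value
    pod_group = {
        "kind": "PodGroup",  # apiVersion left out on purpose
        "metadata": {"name": spec["metadata"]["name"]},
        "spec": {"minMember": total_pods},
    }
    return spec, [pod_group]


def wrap_kueue(spec, total_pods):
    """Label the workload with a queue and start it suspended."""
    labels = spec["metadata"].setdefault("labels", {})
    labels[KUEUE_QUEUE_LABEL] = "default"  # hypothetical queue name
    spec["spec"]["suspend"] = True  # assumed field placement
    return spec, []


# New schedulers (Volcano, etc.) would register another entry here.
WRAPPERS = {"none": wrap_none, "kai": wrap_kai, "kueue": wrap_kueue}


def wrap(scheduler_type, spec, total_pods):
    """Return (wrapped_spec, companion_objects) without mutating the input."""
    if scheduler_type not in WRAPPERS:
        raise ValueError(f"unsupported scheduler type: {scheduler_type!r}")
    return WRAPPERS[scheduler_type](copy.deepcopy(spec), total_pods)
```

Returning companion objects alongside the wrapped spec lets the renderer emit both onto the target cluster in one pass, which matches the flow the commit describes.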

Won't merge until after #64 (or later) lands, so the "delta from main" table was documenting terminology nobody will see by then. Replaces it with a "Scheduler properties" section that pins down the load-bearing behavior in K8s SIG-Scheduling terms — what the scheduler actually does, no comparison column. No code changes; doc only.

The topology block describes the shape of each worker, but workers.count (how many of that shape) is a sibling concern at the same level - not a property of the topology itself. Group them together under workers:

  workers:
    count: 3
    topology:
      strategy: TensorPipeline
      tensor: 8
      pipeline: 2

This reads as "3 workers, each TensorPipeline TP=8 PP=2." The topology describes one worker. The count says how many. nodeSelector and engine stay alongside workers as separate concerns - what hardware each worker needs and what engine it runs.

For unified serving, workers just contains topology (count defaults to 1). For disaggregated P/D, workers.count on each role is the P:D ratio - the "5" and "3" in 5P3D. It's a topology parameter (fixed per deployment), not a scaling knob.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The InferenceClass becomes a tested recipe that bundles both capabilities (for scheduling) and optionally cloud-specific provisioning config (for cluster composition). When Modelplane provisions a GKE cluster, the composition function reads class.provisioning.gke to get the machineType, accelerator config, and networking — guaranteed consistent with the capabilities the scheduler uses for matching. The provisioning block is optional. Classes without it are capabilities-only, used for BYO clusters where the pool already exists. The provisioning.provider discriminator selects the cloud-specific sibling block (gke, eks, aks). Modelplane ships a default catalog: cloud-specific classes for provisioned clusters (gke-h200-8x-a3-ib, gke-l4-1x-g2) and cloud-agnostic classes for BYO (h200-8x-ib, l4-1x). The InferenceCluster section now shows both a GKE-provisioned cluster (pools reference cloud-specific classes) and a BYO cluster (pools reference capabilities-only classes). Cluster-level config (project, region, K8s version) stays on the InferenceCluster. Pool-level config (machineType, GPU, networking) moves to the class. Pool sizing (maxNodes, nodeCount) stays on the InferenceCluster pool — it's a per-cluster capacity decision, not a property of the hardware SKU. Signed-off-by: Nic Cope <nicc@rk0n.org>

bassam · 2026-05-10T00:16:33Z

I think we should drop the source discriminator and huggingFace: block on ModelDeployment. I don't think they represent how LLMs are actually packaged today, and they get in the way of future caching features.

LLM serving has settled into three distinct factorings of (runtime, weights, optional compiled engine):

Pattern 1 — engine fetches weights at startup. Generic engine image (vLLM, SGLang, TGI upstream); engine pulls weights from HuggingFace, S3, or NGC via its native mechanism. The deployment specifies engine.image and engine args; the engine handles the rest. This is the dominant pattern for vLLM/SGLang/TGI workloads.

Pattern 2 — engine image includes weights. NIM is the canonical case. Runtime, optimization metadata, and weights are baked into one OCI image. Pull, run, serve. No separate fetch step.

Pattern 3 — runtime and artifacts stored separately. Generic runtime image plus separately-stored artifacts: weights the platform fetches and stages, compiled TensorRT-LLM .engine files in object storage, weights in an internal registry, or weights pre-staged on a PVC. Runtime mounts the artifacts at a known path and reads from there.

For pattern 1, source is redundant with what the engine arg already says. If the user writes engine.args: ["--model=meta-llama/Llama-3.1-70B-Instruct"], having a parallel source: { huggingFace: { repo: meta-llama/Llama-3.1-70B-Instruct } } field means the user has to keep two fields in sync, and Modelplane has to decide whether to download via storage initializer (and then have the engine read from a mount) or let the engine fetch directly. Either way, one of the two configurations is doing nothing.

For pattern 2, source is meaningless. The image is the source. There's nothing external to point at.

For pattern 3, source could point at the external artifacts and have Modelplane stage them. This works, but it's the case where I believe we want the primitive to be at fleet-level: a future separate ModelCache resource that names artifacts (see #66). Putting source on ModelDeployment scopes the staging to a single deployment, which is exactly the wrong scope for the cases where staging matters.

I propose we drop source and the huggingFace: block from ModelDeployment for v0.1. Patterns 1 and 2 work entirely from engine.image and engine.args. Pattern 3 waits for ModelCache in v0.2, which is also the right layer for fleet-aware caching optimizations.

For private repo access (HF tokens, NGC API keys, S3 credentials), engines already accept env vars. Add an env field on engine mirroring standard PodSpec env, with valueFrom: secretKeyRef for pull tokens:

spec:
  engine:
    name: vLLM
    version: "0.8.5"
    image: vllm/vllm-openai:v0.8.5
    imagePullSecrets:
    - name: nvcr-creds
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token
          key: token
    args:
    - "--model=moonshotai/Kimi-K2-Instruct"
    - "--tensor-parallel-size=8"
    - "--pipeline-parallel-size=2"

The Kimi K2 example becomes simpler — no source: block, no huggingFace.secretRef, just the engine config. Same for Mixtral, Qwen3-Coder, and the disagg example.

Model fetching is the engine's concern, not Modelplane's. All major engines (vLLM, SGLang, TGI) accept the model name as a CLI arg (--model=...) and handle downloading natively. KServe's storage initializer is KServe-specific — neither Dynamo nor llm-d uses it. This commit removes spec.source and spec.huggingFace from all ModelDeployment examples. The model repo moves into engine.args as --model=<repo>. For gated models requiring authentication, engine.env injects credentials (HF_TOKEN via secretKeyRef). Fleet-level weight staging (pre-caching to nodes) is a separate concern addressed by a future ModelCache resource (#66), not by fields on ModelDeployment. Signed-off-by: Nic Cope <nicc@rk0n.org>

KEDA's ScaledObject depends on the ModelDeployment XRD declaring a Kubernetes scale subresource. Call out the specReplicasPath and statusReplicasPath explicitly so it's clear this is an XRD-level configuration, not hand-waving. Signed-off-by: Nic Cope <nicc@rk0n.org>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Modelplane v0.1 design to reflect the latest API direction: removing the model catalog split, making ModelDeployment self-contained and expressive enough for frontier models, and shifting to replica-level fleet scheduling and scaling.

Changes:

Replaces InferenceEnvironment/ModelPlacement/ClusterModel with a new resource model (InferenceCluster, InferenceClass, ModelReplica, ModelService, ModelEndpoint).
Defines two-level scheduling (cluster label selection + node/pool CEL matching) and documents fleet scheduling + replica-only autoscaling.
Updates examples and diagrams to reflect the new deployment and routing model.


InferenceClass is the bridge between hardware capabilities and provisioning recipes. Platform teams author InferenceClass resources describing the shape of a GPU node pool (resources block: GPUs per node, per-GPU memory) and optionally how to provision one on a cloud (provisioning block: machine type, accelerator, disk size). Each InferenceCluster.nodePools[] entry references a class by name and declares only cluster-specific counts (nodeCount, minNodeCount, maxNodeCount, zones). This replaces per-pool inline hardware fields and converges the GKE and BYO (Existing) cluster shapes. The same node pool schema works for both: classes with provisioning describe pools Modelplane creates, classes without provisioning describe pools that already exist on a BYO cluster. The system node pool that hosts control-plane components (Envoy Gateway, KEDA, etc.) is no longer in the user-facing API. The composition function injects it automatically for GKE clusters (e2-standard-4, 1-2 nodes). Users only declare GPU pools. PR #64's design used DRA-shaped attributes and capacity on the class specifically so that ModelDeployment.spec.nodeSelector CEL could evaluate against them. With nodeSelector dropped from this branch and pod-shape moved to workers.resources, the DRA shape adds verbosity without a consumer. spec.resources.gpu carries the count and per-GPU memory the scheduler and composition function actually use. The nvidia.com/gpu device plugin name remains an internal detail of the composition function rather than a user-facing key. The scheduler is untouched: compose-inference-cluster still populates status.capacity.gpuPools[] in the same shape, just sourced from the referenced classes instead of inline pool config. InferenceClass itself has no composed children. compose-inference-class just marks the XR Ready. lib/resource.py now serialises with by_alias=True so the generated class_ alias field renders as "class" in YAML. 14 composition tests pass. 
Signed-off-by: Nic Cope <nicc@rk0n.org>

PR #64 splits routing apart from deployment so that fan-out (replicas on clusters) and exposure (where requests land) can evolve independently. ModelEndpoint is a reachable inference endpoint; ModelService selects endpoints by label and composes the Gateway-API HTTPRoute that exposes them. This commit introduces both kinds, moves the Envoy Backend composition out of ModelReplica and into ModelEndpoint, and moves the HTTPRoute composition out of ModelDeployment and into ModelService. The pattern mirrors Kubernetes Deployment + Service: applying a ModelDeployment alone gets you running replicas; you author a ModelService to make them reachable. ModelEndpoint (namespaced, short me): carries the informational URL, the api protocol, and the rewritePath that ModelService consumes when composing the URLRewrite filter. compose-model-endpoint parses spec.url, composes an Envoy Backend on the control plane, and surfaces the Backend's name in status.routing.backendName. ModelService (namespaced, short ms): carries spec.endpoints, each a label selector. compose-model-service fetches the InferenceGateway and all matching ModelEndpoints, then composes an HTTPRoute that matches the service's namespace/name path prefix and rewrites to the first matched endpoint's rewritePath, with all matched endpoints as backendRefs (equal weighting; weight as a field is deferred). The service's public address surfaces on status.address. ModelDeployment changes: stops composing the HTTPRoute, composes one ModelEndpoint per matched cluster (labeled modelplane.ai/deployment: <name>, with rewritePath pointing at the remote LLMInferenceService path), and drops status.endpoint.url. The URL surface lives on ModelService now. ModelReplica changes: stops composing the Envoy Backend (that moves to ModelEndpoint) and drops both status.endpoint.url and status.routing.backendName. The replica becomes purely about composing the LLMInferenceService on the remote cluster. 
External / SaaS endpoint support (fqdn-style Backends) is deferred. spec.url is expected to be an http://<ip>:<port>/... shape today; the schema doesn't enforce that yet. 16 composition tests pass. Signed-off-by: Nic Cope <nicc@rk0n.org>
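The selection and route composition described in this commit can be sketched as follows. This is a simplified illustration, not compose-model-service: the selectors are assumed to be plain matchLabels dicts, the `/<namespace>/<name>` prefix format is an assumption, and the returned dict is a summary of the route rather than a real Gateway API HTTPRoute.

```python
def select_endpoints(endpoints, selectors):
    """Return endpoints whose labels satisfy at least one selector."""
    matched = []
    for ep in endpoints:
        labels = ep["metadata"].get("labels", {})
        if any(all(labels.get(k) == v for k, v in sel.items()) for sel in selectors):
            matched.append(ep)
    return matched


def build_route(namespace, name, endpoints, selectors):
    """Summarise the route a ModelService would compose, or None if nothing matches."""
    matched = select_endpoints(endpoints, selectors)
    if not matched:
        return None
    return {
        # Requests are matched on the service's namespace/name prefix.
        "pathPrefix": f"/{namespace}/{name}",
        # The rewrite target comes from the first matched endpoint.
        "rewritePath": matched[0]["spec"]["rewritePath"],
        # Every matched endpoint becomes a backend with equal weight.
        "backendRefs": [
            {"name": ep["status"]["routing"]["backendName"], "weight": 1}
            for ep in matched
        ],
    }
```

Because the rewrite path is taken from the first match only, this sketch relies on every backend serving the same path, which is what naming the remote LLMInferenceService after the ModelDeployment is meant to provide.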

Fold in Bassam's "three packaging patterns" framing from PR #64 review comment 4414021192 (engine-fetches-weights / engine-image-bakes-weights / runtime-and-artifacts-separate). Grounds ModelCache as the Pattern 3 primitive that also accelerates Pattern 1; clarifies why Pattern 2 (NIM) doesn't need it. Add Locality routing subsection in v0.3 substrate unification connecting the three primitives to #71 ModelService routing affinity. Cold-start pipeline covers what new replicas need; locality routing covers where existing requests go. ModelCache feeds both — status.clusters[] is the eligibility signal for fleet routing. v0.1 mechanism: tighten the scheduling-gating bullet to explicitly call out status.clusters[] as the eligibility signal the fleet matcher reads, not just an implicit scheduler hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Status header now explicitly notes ModelCache advances from v0.2 (per Bassam's PR #64 review framing) to v0.1, driven by multi-node serving requirements (#61 closure) and DRA landing in v0.1 (#56). Flagged for team alignment since this is a deliberate timeline shift from the earlier framing. Roadmap #66 line: tighten the awkward kind/source bundling ("Weights/Tokenizer/Bytes/inline/configMap" mixed kinds with sources) into separate categories with backtick-normalized backend / kind names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Nic Cope <nicc@rk0n.org>

An MD's spec.engine was a tiny subset of a podSpecTemplate, containing only the image and args. Design feedback pointed out that we'd need imagePullSecrets, env, and a configurable /dev/shm size as well. This got me thinking: is this just going to grow into podSpecTemplate over time? I think that's likely. Dynamo hit it and corrected it in ai-dynamo/dynamo#8069.

A full podSpecTemplate has some quirks at the cluster scheduler level though. For example, when we reference pullSecrets or envFrom, are those resolved at the modelplane control plane level or at the inference cluster? I think mostly it has to be at the inference cluster level. Ultimately it'll be a node on the inference cluster pulling the worker pod, so it needs pull secrets.

As a compromise I think we should start with a subset of podSpecTemplate. Just the things we know we need today. We can grow into adding more fields without a breaking change.

One important side effect of this is we now have a containers array - not just the engine container. So spec.engine feels like the wrong place. The user _could_ add a sidecar container too. I doubt they would at first, but I could see this becoming the case. With that in mind I'm proposing we move this to spec.workers.template - i.e. a worker template.

Signed-off-by: Nic Cope <nicc@rk0n.org>

InferenceClass attributes and capacity were framed as though Modelplane defines a vocabulary -- "gpu.nvidia.com/* for what NVIDIA publishes, modelplane.ai/* for what Modelplane defines." But Modelplane is not autodiscovering these from a DRA driver. They are authored by the platform team and matched against by the ML team's CEL expressions. They are a contract between those two teams. The contract does have a meaningful boundary though: when the composition function forms DRA ResourceClaims on the workload cluster, it passes through keys that match real DRA device attributes (e.g. gpu.nvidia.com/architecture) and filters out modelplane.ai/* keys. So gpu.nvidia.com/* keys should match what the DRA driver actually publishes in ResourceSlices, while modelplane.ai/* keys are for fleet-scheduling properties that don't exist as per-device DRA attributes (GPU count per node, inter-node networking). This revealed two misplaced keys. gpu.nvidia.com/features was invented -- the NVIDIA DRA driver doesn't publish a features list. The Kimi K2 CEL example now matches on cudaComputeCapability instead, which is a real DRA device attribute. gpu.nvidia.com/gpuCount was also invented -- DRA models individual devices, not "how many GPUs per node." Both are pool-level scheduling properties, not per-device DRA attributes, so gpuCount moves to modelplane.ai/ and features is dropped. Signed-off-by: Nic Cope <nicc@rk0n.org>
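The key boundary described above amounts to a prefix filter. A minimal sketch, assuming the class attributes arrive as a flat dict; the function name and the example values are hypothetical, and only the `modelplane.ai/` prefix rule comes from the commit message:

```python
MODELPLANE_PREFIX = "modelplane.ai/"


def split_attributes(attributes):
    """Split class attributes into (DRA pass-through, fleet-scheduling only).

    Keys under modelplane.ai/ never reach a DRA ResourceClaim; everything
    else is assumed to match an attribute the DRA driver really publishes.
    """
    dra = {k: v for k, v in attributes.items() if not k.startswith(MODELPLANE_PREFIX)}
    fleet = {k: v for k, v in attributes.items() if k.startswith(MODELPLANE_PREFIX)}
    return dra, fleet


# Example values are illustrative, not taken from a real InferenceClass.
dra_attrs, fleet_attrs = split_attributes({
    "gpu.nvidia.com/architecture": "Hopper",
    "modelplane.ai/gpuCount": 8,
})
```

Here `dra_attrs` keeps only the `gpu.nvidia.com/architecture` entry, while `fleet_attrs` holds `modelplane.ai/gpuCount` for the fleet scheduler.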

The topology block had a strategy discriminator (Tensor, TensorPipeline, DataExpert) selecting which parallelism fields were relevant. But TP, PP, and DP/EP are independent axes that compose multiplicatively through a single universal formula. The discriminator was not selecting between different derivation rules or different field sets -- it was naming which axes happened to be non-default. That is not what a discriminated union is for.

In a real discriminated union (Service.type, VolumeSource) the variants have disjoint fields with different semantics. Here every field participates in the same formula regardless of which others are set. Setting pipeline=2 does not change what tensor=8 means.

The practical cost of the discriminator is that every new axis combination needs a new enum value. Models like DeepSeek V3 (TP+EP+PP) do not fit any of the three existing strategies.

topology now has four flat fields: tensor (required), pipeline, data, and dataLocal (all defaulting to 1). The derivation is always:

  Nodes per worker = pipeline * (data / dataLocal)
  GPUs per node    = tensor * dataLocal

Signed-off-by: Nic Cope <nicc@rk0n.org>
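The derivation is small enough to write down directly. A sketch, assuming snake_case field names and assuming data must be an exact multiple of dataLocal (the commit message does not state that constraint):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    tensor: int            # required
    pipeline: int = 1
    data: int = 1
    data_local: int = 1    # dataLocal in the API

    def __post_init__(self):
        if min(self.tensor, self.pipeline, self.data, self.data_local) < 1:
            raise ValueError("all topology fields must be >= 1")
        # Assumption: data ranks split evenly across nodes.
        if self.data % self.data_local != 0:
            raise ValueError("data must be a multiple of data_local")

    @property
    def nodes_per_worker(self) -> int:
        return self.pipeline * (self.data // self.data_local)

    @property
    def gpus_per_node(self) -> int:
        return self.tensor * self.data_local


# TP=8, PP=2: two nodes of eight GPUs each.
tp_pp = Topology(tensor=8, pipeline=2)
# TP=1, DP=16 with 8 data ranks per node: also two nodes of eight GPUs.
dp = Topology(tensor=1, data=16, data_local=8)
```

Both examples resolve to the same node shape through the one formula, which is the point of dropping the strategy enum.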

Signed-off-by: Nic Cope <nicc@rk0n.org>

dennis-upbound

Yay!!!

… ModelReplica

The PR #64 design replaces InferenceEnvironment with InferenceCluster and ModelPlacement with ModelReplica. The new vocabulary lines up with how the rest of the design is shifting: ModelDeployment fans out to ModelReplicas (one per cluster), the way a Kubernetes Deployment fans out to Pods. "Environment" was always overloaded - it meant both "GPU cluster" and "organizational stage" - so dropping it in favour of "cluster" tightens the model.

This commit applies both renames mechanically:

- apis/inferenceenvironments -> apis/inferenceclusters (kind, plural, short name ic)
- apis/modelplacements -> apis/modelreplicas (kind, plural, short name mr)
- functions/compose-inference-env -> compose-inference-cluster
- functions/compose-model-placement -> compose-model-replica
- tests/test-inference-env{,-existing} -> tests/test-inference-cluster{,-existing}
- tests/test-model-placement{,-autoscaling,-multinode} -> tests/test-model-replica{,-autoscaling,-multinode}
- tests/test-model-deployment-incompatible-env -> tests/test-model-deployment-incompatible-cluster
- examples/platform/inference-environment-*.yaml -> inference-cluster-*.yaml

Field renames that follow from the kind renames:

- ModelReplica.spec.inferenceEnvironmentRef -> inferenceClusterRef
- ModelDeployment.spec.environmentSelector -> clusterSelector
- ModelDeployment.spec.environments -> clusters
- ModelDeployment.status.placements -> replicas
- ModelDeployment printer column ENVS -> CLUSTERS
- ModelReplica printer column ENVIRONMENT -> CLUSTER

Label renames:

- modelplane.ai/environment -> modelplane.ai/cluster
- modelplane.ai/placement -> modelplane.ai/replica

Condition type / reason renames track the same vocabulary (e.g. PlacementsScheduled -> ReplicasScheduled, NoEnvironments -> NoClusters). The composed resource keys in compose-model-deployment also move from "placement-<name>" to "replica-<name>".

The ClusterModel/Model serving[].environmentSelector field is left alone here - those resources are being removed entirely in the next commit.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The catalog split between ClusterModel/Model and ModelDeployment didn't hold up in practice. Engine args are inherently model-specific (the model name lives in --model=...), different quantization variants reference different weight checkpoints, and the "platform team curates a catalog" responsibility is real ongoing engineering work that most organizations don't have a team for.

This commit folds engine and topology config inline on ModelDeployment, following the PR #64 design. ClusterModel and Model are removed entirely along with their composition function (compose-model). Source / huggingFace blocks are gone too - engines fetch their own weights via their native --model=<repo> argument. Auth for private repos goes through standard PodSpec env (HF_TOKEN, NGC_API_KEY).

The new ModelDeployment spec:

- spec.replicas           How many ModelReplicas to fan out to.
- spec.clusterSelector    Label selector against InferenceCluster.
- spec.workers            Compute shape of one worker:
    .count                  (default 1)
    .topology.strategy      Tensor | TensorPipeline
    .topology.tensor        GPUs per node
    .topology.pipeline      Nodes per worker (TensorPipeline only)
    .resources.cpu          Required, no default
    .resources.memory       Required, no default
- spec.engine             Engine config:
    .image                  Container image
    .args                   Engine args (opaque)
    .env                    PodSpec-shaped env vars (HF_TOKEN etc.)
    .imagePullSecrets       For NGC and similar registries.

ModelReplica mirrors this shape minus spec.replicas and plus spec.inferenceClusterRef.

DataExpert topology, disaggregated prefill/decode (spec.prefill), and real CEL evaluation on nodeSelector are deferred. The scheduler reads the topology shape directly (no more VRAM math): a pool fits when its countPerNode >= topology.tensor and its nodes >= topology.pipeline.

Autoscaling drops out. The XRD declares the standard /scale subresource so kubectl scale works, but the KEDA ScaledObject and Prometheus query plumbing are removed. KEDA-via-scale-subresource opt-in lands later.

The status.model.name field is gone - the model identity now lives in opaque engine args, and best-effort parsing would be brittle.

Naming on the remote cluster shifts from "model name sanitized to a DNS label" to just the ModelDeployment name. Each remote cluster gets one LLMInferenceService per deployment with this name, so the control plane HTTPRoute can rewrite to a uniform path on every backend.

The compose-model-deployment scheduler still prefers clusters that already have a replica for this deployment (stability). Capacity accounting subtracts GPUs consumed by other deployments' replicas based on each replica's own workers.topology.

13 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
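The fit rule and the capacity accounting in this commit can be sketched in a few lines. The pool and workers dicts below are assumed shapes for illustration; only the two comparisons (countPerNode >= tensor, nodes >= pipeline) and the idea of subtracting GPUs per replica topology come from the commit message, and how the real scheduler combines them is not shown here.

```python
def pool_fits(pool, tensor, pipeline=1):
    """A pool fits when each node has enough GPUs and there are enough nodes."""
    return pool["countPerNode"] >= tensor and pool["nodes"] >= pipeline


def gpus_used(workers):
    """GPUs one replica consumes, derived from its own workers block."""
    topology = workers["topology"]
    per_worker = topology["tensor"] * topology.get("pipeline", 1)
    return workers.get("count", 1) * per_worker


def free_gpus(pool, other_replicas):
    """Pool GPUs left after subtracting other deployments' replicas."""
    total = pool["countPerNode"] * pool["nodes"]
    return total - sum(gpus_used(w) for w in other_replicas)


pool = {"countPerNode": 8, "nodes": 4}
# One existing replica: a single TP=8 PP=2 worker, i.e. 16 GPUs.
existing = [{"count": 1, "topology": {"tensor": 8, "pipeline": 2}}]
remaining = free_gpus(pool, existing)
```

With these inputs the pool starts at 32 GPUs and `remaining` is 16, so one more TP=8 PP=2 worker would still fit.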

InferenceClass is the bridge between hardware capabilities and provisioning recipes. Platform teams author InferenceClass resources describing the shape of a GPU node pool (resources block: GPUs per node, per-GPU memory) and optionally how to provision one on a cloud (provisioning block: machine type, accelerator, disk size). Each InferenceCluster.nodePools[] entry references a class by name and declares only cluster-specific counts (nodeCount, minNodeCount, maxNodeCount, zones).

This replaces per-pool inline hardware fields and converges the GKE and BYO (Existing) cluster shapes. The same node pool schema works for both: classes with provisioning describe pools Modelplane creates, classes without provisioning describe pools that already exist on a BYO cluster.

The system node pool that hosts control-plane components (Envoy Gateway, KEDA, etc.) is no longer in the user-facing API. The composition function injects it automatically for GKE clusters (e2-standard-4, 1-2 nodes). Users only declare GPU pools.

PR #64's design used DRA-shaped attributes and capacity on the class specifically so that ModelDeployment.spec.nodeSelector CEL could evaluate against them. With nodeSelector dropped from this branch and pod-shape moved to workers.resources, the DRA shape adds verbosity without a consumer. spec.resources.gpu carries the count and per-GPU memory the scheduler and composition function actually use. The nvidia.com/gpu device plugin name remains an internal detail of the composition function rather than a user-facing key.

The scheduler is untouched: compose-inference-cluster still populates status.capacity.gpuPools[] in the same shape, just sourced from the referenced classes instead of inline pool config. InferenceClass itself has no composed children. compose-inference-class just marks the XR Ready.

lib/resource.py now serialises with by_alias=True so the generated class_ alias field renders as "class" in YAML.

14 composition tests pass.
Signed-off-by: Nic Cope <nicc@rk0n.org>
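
As a rough illustration of how node pools that reference a class by name could be resolved into status.capacity.gpuPools[] entries, consider the sketch below. It is not the compose-inference-cluster code. The class names, the "count" and "memory" keys under resources.gpu, the memory values, and the placeholder machine type are all assumptions; only "class", nodeCount, countPerNode, and nodes appear in the commit messages.

```python
def gpu_pools(classes: dict[str, dict], node_pools: list[dict]) -> list[dict]:
    """Resolve each node pool's class reference into a capacity entry."""
    out = []
    for pool in node_pools:
        cls = classes[pool["class"]]   # KeyError if the class does not exist
        gpu = cls["resources"]["gpu"]  # hardware shape comes from the class
        out.append({
            "class": pool["class"],
            "countPerNode": gpu["count"],
            "memory": gpu["memory"],
            "nodes": pool["nodeCount"],  # counts stay cluster-specific
        })
    return out


# Hypothetical classes: one with a provisioning block, one describing
# a pool that already exists on a BYO cluster.
classes = {
    "h200-8x": {
        "resources": {"gpu": {"count": 8, "memory": "141Gi"}},
        "provisioning": {"machineType": "example-machine-type"},
    },
    "byo-l4": {"resources": {"gpu": {"count": 1, "memory": "24Gi"}}},
}
node_pools = [
    {"class": "h200-8x", "nodeCount": 2},
    {"class": "byo-l4", "nodeCount": 4},
]

for entry in gpu_pools(classes, node_pools):
    print(entry)
```

Both pools go through the same code path regardless of whether their class has a provisioning block, which is the convergence between GKE and BYO clusters that the commit describes.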

PR #64 splits routing apart from deployment so that fan-out (replicas on clusters) and exposure (where requests land) can evolve independently. ModelEndpoint is a reachable inference endpoint; ModelService selects endpoints by label and composes the Gateway-API HTTPRoute that exposes them.

This commit introduces both kinds, moves the Envoy Backend composition out of ModelReplica and into ModelEndpoint, and moves the HTTPRoute composition out of ModelDeployment and into ModelService. The pattern mirrors Kubernetes Deployment + Service: applying a ModelDeployment alone gets you running replicas; you author a ModelService to make them reachable.

ModelEndpoint (namespaced, short me): carries the informational URL, the api protocol, and the rewritePath that ModelService consumes when composing the URLRewrite filter. compose-model-endpoint parses spec.url, composes an Envoy Backend on the control plane, and surfaces the Backend's name in status.routing.backendName.

ModelService (namespaced, short ms): carries spec.endpoints, each a label selector. compose-model-service fetches the InferenceGateway and all matching ModelEndpoints, then composes an HTTPRoute that matches the service's namespace/name path prefix and rewrites to the first matched endpoint's rewritePath, with all matched endpoints as backendRefs (equal weighting; weight as a field is deferred). The service's public address surfaces on status.address.

ModelDeployment changes: stops composing the HTTPRoute, composes one ModelEndpoint per matched cluster (labeled modelplane.ai/deployment: <name>, with rewritePath pointing at the remote LLMInferenceService path), and drops status.endpoint.url. The URL surface lives on ModelService now.

ModelReplica changes: stops composing the Envoy Backend (that moves to ModelEndpoint) and drops both status.endpoint.url and status.routing.backendName. The replica becomes purely about composing the LLMInferenceService on the remote cluster.

External / SaaS endpoint support (fqdn-style Backends) is deferred. spec.url is expected to be an http://<ip>:<port>/... shape today; the schema doesn't enforce that yet.

16 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
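
The routing composition described above can be sketched as plain dictionary construction. This is a hedged illustration rather than the compose-model-service code. The function names, the gateway name, the endpoint names, and the exact "/<namespace>/<name>" prefix format are assumptions. The HTTPRoute field names follow the Gateway API, and the Backend group is the one Envoy Gateway uses.

```python
from urllib.parse import urlsplit


def backend_address(url: str) -> tuple[str, int]:
    """Split an http://<ip>:<port>/... URL into the host and port of a Backend."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected http://<ip>:<port>/..., got {url!r}")
    return parts.hostname, parts.port


def compose_http_route(namespace: str, name: str, gateway: str, endpoints: list[dict]) -> dict:
    """Build one HTTPRoute that fans out to every matched endpoint's Backend."""
    if not endpoints:
        raise ValueError("no ModelEndpoints matched")
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"namespace": namespace, "name": name},
        "spec": {
            "parentRefs": [{"name": gateway}],
            "rules": [{
                # Match the service's namespace/name path prefix.
                "matches": [{"path": {"type": "PathPrefix", "value": f"/{namespace}/{name}"}}],
                # Rewrite to the first matched endpoint's rewritePath.
                "filters": [{
                    "type": "URLRewrite",
                    "urlRewrite": {"path": {
                        "type": "ReplacePrefixMatch",
                        "replacePrefixMatch": endpoints[0]["rewritePath"],
                    }},
                }],
                # No weight field, so every backend is weighted equally.
                "backendRefs": [
                    {"group": "gateway.envoyproxy.io", "kind": "Backend", "name": e["backendName"]}
                    for e in endpoints
                ],
            }],
        },
    }


# Hypothetical endpoints, for illustration only.
endpoints = [
    {"backendName": "llama-cluster-a", "rewritePath": "/serving/llama"},
    {"backendName": "llama-cluster-b", "rewritePath": "/serving/llama"},
]
route = compose_http_route("ml", "llama", "inference-gateway", endpoints)
print(route["spec"]["rules"][0]["matches"][0]["path"]["value"])  # /ml/llama
print(backend_address("http://10.0.0.5:8080/v1"))                # ('10.0.0.5', 8080)
```

Rewriting to the first endpoint's rewritePath only works because every remote LLMInferenceService shares the ModelDeployment's name, so the path is uniform across backends.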

The previous CLI (negz/cli:diy) pinned datamodel-code-generator 0.31.2, which generated broken Python models for fields named int/bool - it emitted undefined int_aliased/bool_aliased type references across every model file. This forced workarounds like naming DRA attribute fields boolean/integer instead of their wire names bool/int.

This pins the CLI to negz/cli:mp (crossplane/cli#24 and #64 cherry-picked onto main), which bumps datamodel-code-generator past the fix in 0.54.0. Builtin-conflicting field names now generate a trailing-underscore Python attribute with the original name preserved as a Pydantic alias.

This commit only bumps the CLI and regenerates schemas/python/models. The regen reflows every model with the newer generator (mostly Optional[X] -> X | None), so the diff is large but mechanical.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The repo pinned the Crossplane CLI to negz/cli:diy, a fork branch carrying an unreleased datamodel-code-generator bump. That bump (crossplane/cli#24 and #64) has since merged to crossplane/cli main, so this repins the CLI to crossplane/cli directly and regenerates the Python models. The regen reflows the affected models with the newer generator (mostly Optional[X] -> X | None).

The newer generator (datamodel-code-generator 0.59.0) emits object-typed field defaults as a default_factory rather than a plain value. The Crossplane SDK's resource.update serializes composed resources with model_dump(exclude_defaults=True), which no longer recognizes the factory-built default as equal to the declared default, so unset fields leak into composed resources.

This keeps crossplane-function-sdk-python pinned to #208, which serializes with exclude_unset instead - "did the caller set this field?" rather than "is it different from its default?" - which is the correct question under server-side apply and immune to how a default is represented.

Switching the whole repo to exclude_unset surfaces a few places that explicitly set fields to None or to a defaulted value, which exclude_defaults previously dropped. compose-serving-stack built provider-kubernetes Objects and Helm Releases with metadata=None and ObjectMeta(namespace=None); those now only set the field when it's present. The compose-inference-cluster and compose-model-deployment test fixtures are updated to reflect that explicitly set values (a node pool's kubernetesVersion and diskSizeGb, a replica's worker count and pipeline) now appear in composed resources.

Signed-off-by: Nic Cope <nicc@rk0n.org>
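
The two questions contrasted in this commit message ("is it different from its default?" versus "did the caller set this field?") can be shown with a toy model in plain Python. This is deliberately not Pydantic and not the SDK's code. WorkerSpec, its fields, and both dump methods are hypothetical stand-ins that only mimic the two exclusion rules.

```python
class WorkerSpec:
    """Toy stand-in for a generated model. Not Pydantic."""

    DEFAULTS = {"count": 1, "pipeline": None}

    def __init__(self, **explicit):
        unknown = set(explicit) - set(self.DEFAULTS)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        self.fields_set = set(explicit)              # what the caller wrote
        self.values = {**self.DEFAULTS, **explicit}  # defaults filled in

    def dump_exclude_defaults(self) -> dict:
        """Keep a field only if its value differs from the declared default."""
        return {k: v for k, v in self.values.items() if v != self.DEFAULTS[k]}

    def dump_exclude_unset(self) -> dict:
        """Keep a field only if the caller explicitly set it."""
        return {k: v for k, v in self.values.items() if k in self.fields_set}


# A field explicitly set to its default value.
explicit_default = WorkerSpec(count=1)
print(explicit_default.dump_exclude_defaults())  # {}
print(explicit_default.dump_exclude_unset())     # {'count': 1}

# A field explicitly set to None.
explicit_none = WorkerSpec(pipeline=None)
print(explicit_none.dump_exclude_unset())        # {'pipeline': None}
```

The second case mirrors why call sites that passed metadata=None had to change: under the "did the caller set this field?" rule, an explicit None is emitted rather than dropped.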

The flake pinned crossplane-cli to the negz/cli default-to-go branch because the CLI changes modelplane depends on weren't yet merged upstream. They now are: #126 (host-native default flake package) merged, joining the already-merged #24, #64, and #119, and #127 (decompress function runtime tarballs once when loading) merged on top. This change repoints the input at crossplane/cli main and bumps the lock to that commit, so we no longer depend on a personal fork. It stays on main rather than a tag because the fixes aren't in a release yet. Signed-off-by: Nic Cope <nicc@rk0n.org>

WIP: Nic's API sketch

e23bf1d

Signed-off-by: Nic Cope <nicc@rk0n.org>

dennis-upbound reviewed May 8, 2026

Comment thread design/api-update.md Outdated

bassam reviewed May 8, 2026

negz added 3 commits May 7, 2026 22:18

dennis-upbound mentioned this pull request May 8, 2026

Federation scheduler + KServe renderer (managed-kai) #63

Closed

negz added 2 commits May 8, 2026 16:55

negz added 2 commits May 11, 2026 16:54

negz changed the title ~~WIP: Nic's API sketch~~ Update design to reflect latest thinking May 12, 2026

negz marked this pull request as ready for review May 12, 2026 06:49

Copilot AI review requested due to automatic review settings May 12, 2026 06:49

Copilot AI reviewed May 12, 2026

Comment thread design/design.md Outdated

Comment thread design/design.md Outdated

Comment thread design/design.md

Comment thread design/design.md Outdated

Comment thread design/design.md Outdated