Federation scheduler + KServe renderer (managed-kai) by dennis-upbound · Pull Request #63 · modelplaneai/modelplane

dennis-upbound · 2026-05-06T22:25:59Z

What this is

The federation scheduling algorithm + KServe rendering composition functions, scoped to managed-kai as the in-cluster scheduler. The plugin/dispatch system (Kueue, Volcano, none) and per-scheduler capacity adapters land in a follow-up MR — kept out of here so the algorithm + IR boundary are the focus of review.

API shape is owned by #64; this branch implements against that shape.

Read order

1. design/proposed-modelplane-api/design.md: architecture, what-lives-where, dependencies, scheduler properties, KAI integration, use-case traces.
2. functions/compose-model-deployment/scheduling.py: federation scheduler. Plain Python, no Crossplane imports. schedule(md, clusters, existing) → ScheduleResult. Filter → Score → Bind.
3. functions/compose-model-deployment/main.py: composer. Required-resources → scheduling.schedule() → emit ModelReplica × spec.replicas + ModelEndpoint × spec.replicas.
4. functions/compose-model-placement/rendering.py: pure builders for KServe LLM-IS, DRA ResourceClaim, KAI PodGroup.
5. functions/compose-model-placement/main.py: renderer. ModelReplica → LLMInferenceService + PodGroup + ResourceClaim(s) on the target cluster.
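To make the Filter → Score → Bind shape concrete, here is a minimal sketch, not the code in scheduling.py. The `Cluster` and `ScheduleResult` fields, the simplified `schedule` arguments, the label-equality filter, and the "most free GPUs" score are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    free_gpus: int
    labels: dict = field(default_factory=dict)


@dataclass
class ScheduleResult:
    bindings: list = field(default_factory=list)  # cluster name per placed replica
    unschedulable: int = 0                        # replicas with no feasible cluster


def schedule(replicas, gpus_per_replica, selector, clusters, existing=()):
    """Place each replica via Filter -> Score -> Bind (hypothetical signature)."""
    by_name = {c.name: c for c in clusters}
    free = {c.name: c.free_gpus for c in clusters}  # running capacity ledger
    result = ScheduleResult()

    def matches(cluster):
        return all(cluster.labels.get(k) == v for k, v in selector.items())

    for i in range(replicas):
        # Sticky placement: a replica keeps its prior cluster while it still matches.
        # Assumption: free_gpus already excludes capacity held by existing replicas.
        prior = existing[i] if i < len(existing) else None
        if prior in by_name and matches(by_name[prior]):
            result.bindings.append(prior)
            continue
        # Filter: selector match plus enough unreserved capacity.
        feasible = [c for c in clusters
                    if matches(c) and free[c.name] >= gpus_per_replica]
        if not feasible:
            result.unschedulable += 1
            continue
        # Score: prefer the most free capacity; the name breaks ties deterministically.
        best = min(feasible, key=lambda c: (-free[c.name], c.name))
        # Bind: reserve capacity so later replicas see the reduced headroom.
        free[best.name] -= gpus_per_replica
        result.bindings.append(best.name)
    return result
```

Keeping the capacity ledger local to one call is one way to model multi-replica capacity reservation: each bind lowers the headroom seen by the next replica without mutating the input clusters.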

What's in vs out

In: federation scheduler (Filter → Score → Bind), ModelReplica IR, sticky placement, multi-replica capacity reservation, disaggregation (decode + prefill same-cluster), KServe v0.18 LLM-IS rendering, DRA selector CEL derivation, managed-kai wrap (schedulerName + PodGroup), 49 unit tests.

Out (separate MRs):

Per-scheduler dispatch (Kueue/Volcano/none) + IC.spec.scheduler.type enum
Per-scheduler capacity adapter controllers
Cluster onboarding controller (auto-detect installed scheduler)
Eviction controller (cluster-degraded re-placement)
Per-version KServe adapters (v0.16/v0.17 dispatch)
Real CEL evaluator (placeholder + monkeypatched in tests today)

Tests

uv venv .venv-test
uv pip install --python .venv-test/bin/python pytest ruff pyright
.venv-test/bin/python -m pytest tests/unit -v
.venv-test/bin/ruff check functions/ lib/

49/49 tests pass in ~20ms. Ruff + pyright clean. Doesn't run end-to-end yet — adapters raise NotImplementedError until #64's protos are generated; algorithm and shape are testable in isolation.

🤖 Generated with Claude Code

A 1-pager + design-time preview of the CRDs and example resources for the scheduler & capability model. The 1-pager (design/scheduler-1pager.md) is the source of truth; the deliverables directory is a copy of what the API would look like once aligned. Nothing here is wired up — no CRDs installed, no controller code, no CI hooks. The 1-pager covers architecture (control plane + workload planes), who owns what (ML/App team, Platform team, Modelplane), API shape, capability vocabulary tiers, risks, and v1/v2 themes. The deliverables directory includes proposed XRDs for InferenceCluster, InferenceProvider, CapabilityVocabulary, ModelDeployment, ModelEndpoint, and ModelPlacement (the IR), plus example resources covering platform substrate (Coreweave cluster, Together provider, default vocabulary) and workloads (Kimi K2 5P3D disaggregation, Qwen3-Coder n-gram + multi-LoRA, gpt-oss-20b scale-to-zero, weighted ModelEndpoint). Once the API is finalized, XRDs move into apis/ and examples into the repo-root examples/. Nic owns the final API design; this PR is meant to support the scheduler discussion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three fixes from PR #63 review:

1. Drop `engine` from InferenceCluster XRD + example. Engine + features live on KServeBackend (apis/kservebackends/), the existing internal XR that represents the inference stack on a cluster. Substrate / runtime split: InferenceCluster is hardware; KServeBackend is runtime. Allows multiple engines per cluster.
2. Drop `engine` from InferenceProvider XRD + example. Providers are opaque routes; declaring features pretends we know what's inside. Match is now supportedModels[] + env-level attributes. Workloads requiring engine features are excluded automatically (matchTrace surfaces "skipped: provider doesn't expose engine features").
3. Add `nodeSelector` to InferenceCluster.nodePools[] so the composer can constrain pods to a pool's nodes. Convention: `modelplane.ai/pool: <pool-name>`, auto-applied on Modelplane-provisioned pools, set by operators on BYO pools.

Side effect: `requires.engineFeatures` becomes implicitly cluster-only (matched against the cluster's KServeBackend, never against providers). Documented in the ModelDeployment XRD comment. Renaming to make this explicit deferred until Nic weighs in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Service

Major rev landing the alignment from PR #63 review + slack thread.

API shape changes:

1. Three-level claim cascade per #56: clusterClaim / nodeClaim / deviceClaim. Replaces the flat `requires` and `topology.{nodes, devicesPerNode}`. deviceClaim is DRA-shaped (count, perNode, selector, constraints).
2. Replica == placement. ModelDeployment carries spec.replicas + the K8s scale subresource. Each MP composes one LLMInferenceService with spec.replicas: 1. KEDA writes the ModelDeployment's spec.replicas via a stock ScaledObject, with no custom scaler. v1: same-cluster constraint (matcher decides on first MP, reuses). v2: cross-cluster spread.
3. InferenceProvider renamed to ModelService. Namespace-scoped. Routing-only, never a placement target. Matcher considers only InferenceCluster candidates. (Dedicated-SaaS placement is a separate concept Nic owns.)
4. ModelEndpoint route discriminator: Deployment | ModelService | External. Routes target Deployment by ref/selector and fan across all its placements.
5. CapabilityVocabulary scoped Namespaced. Cluster default ships in modelplane-system; per-namespace overrides in user namespaces.
6. environmentClaim → clusterClaim (DRA naming consistency).
7. Drop fan-out (`environments: N` removed); multi-region = multiple MDs + ModelEndpoint route entries.

Lifecycle: namespace = environment. Each namespace holds one ModelEndpoint, 0..N ModelDeployment / ModelService / ModelPlacement, optional vocab override. Pushing an MD revision triggers lifecycle reconciliation in that namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Stale references cleaned across XRDs, examples, README, and 1-pager:

- ModelDeployment/ModelPlacement comments still framed placements as "per fleet member"; corrected to per-replica
- ModelService comments still said "matcher matches against providers" / "skipped for ModelService"; both are stale, since the matcher considers only InferenceCluster candidates now
- Workload examples and 1-pager wording about "fleet member" / "(clusters + providers)" tightened where ModelService is no longer a placement target

Stripped v1 vs v2 implementation hedges from API surface (XRDs, examples, 1-pager architectural decisions). v1/v2 milestone language stays confined to the explicit "what ships v1 vs v2" / fleet-level capabilities sections; API design docs describe the abstraction without prescribing implementation phasing.

Reverted CapabilityVocabulary to cluster-scoped singleton. Per Bassam's Slack feedback: namespace-scoped vocab overrides create a coordination problem because InferenceCluster (cluster-scoped) declares attributes against a vocabulary, and per-namespace overrides would evaluate the same cluster's hardware semantics differently from each namespace. Namespaces customize via Compositions and pass-through user-defined keys (acme.example/*), not vocab redefinition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cleanup

Per Bassam's review:

1. Combined nodeClaim + deviceClaim into one DRA-shaped deviceClaim. Device attributes are uniform across devices on a node, so the conceptual node/device split wasn't load-bearing. The deviceClaim selector now matches both node-level (modelplane.ai/interNodeFabric) and device-level (gpu.nvidia.com/architecture) attributes uniformly. Cluster cascade: clusterClaim + deviceClaim (was three-level).
2. Made in-cluster scheduler delegation explicit. Modelplane decides which cluster a workload runs on; bin-packing, gang scheduling, fractional GPU, NVLink-aware placement, and capacity tracking are delegated to whatever in-cluster scheduler is installed (KAI, Kueue, Volcano, vanilla K8s scheduler). Modelplane reads cluster-level capacity signal where present; never replaces in-cluster scheduling logic.
3. Stripped v1/v2 implementation hedging from the API design surface. Dropped "When" column from fleet-level capabilities table; those are design-level capabilities the architecture supports, not phasing commitments. Cleared v1/v2 markers from API skeleton comments and risk mitigation prose. Project-plan section at the bottom retains v1/v2 themes as the explicit phasing artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…questions

1. In-cluster scheduler stance: Modelplane ships Kueue as the default substrate (`managed-kueue` mode, mirroring `managed-kserve`); BYO schedulers (KAI, Volcano, existing Kueue installs) are supported via a capacity-signal contract (ClusterQueue.status or equivalent). Replaced the prior "agnostic about scheduler" framing with this opinionated-default-with-BYO-escape position.
2. Dual-path matching: deviceClaim.selector now supports both `matchLabels` (plain node-label matching, no DRA required) and `matchAttributes` (DRA-typed; richer constraints like NVLink-domain co-location). The composer picks output based on the cluster's provisioning.mode. Customers who don't want DRA complication use labels; DRA stays optional.
3. Updated gpt-oss-20b example to demonstrate the simpler matchLabels path (single GPU, hopper family). kimi-k2 + qwen3-coder remain on the matchAttributes / DRA path showing typed constraints.
4. New "Open questions (Nic to call)" section in the 1-pager collecting design decisions still up for alignment: scheduler default, label-vs-DRA dual path, requires.engineFeatures rename, dedicated-SaaS placement, ModelObjective intent layer, vLLM recipe consumption, WG-DM engagement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
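The dual-path matching in item 2 can be sketched as a single predicate that checks both selector paths. This is an illustration only: the `{"op": ..., "value": ...}` predicate shape, the operator names, and the pool dictionary layout are assumptions, not the proposed API.

```python
import operator

# Hypothetical predicate operators for matchAttributes.
OPS = {
    "eq": operator.eq,
    "gte": operator.ge,
    "contains": lambda have, want: want in have,
}


def selector_matches(selector, pool):
    """Return True when a pool satisfies both matching paths of a selector."""
    labels = pool.get("labels", {})
    attributes = pool.get("attributes", {})
    # Path 1: matchLabels is plain equality on node labels, no DRA required.
    for key, want in selector.get("matchLabels", {}).items():
        if labels.get(key) != want:
            return False
    # Path 2: matchAttributes applies typed predicates to declared attributes.
    for key, pred in selector.get("matchAttributes", {}).items():
        if key not in attributes:
            return False
        if not OPS[pred["op"]](attributes[key], pred["value"]):
            return False
    return True
```

A selector that only sets `matchLabels` never touches the attribute path, which is the property that keeps DRA optional for simple workloads.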

Both the in-cluster scheduler and the inference backend are pluggable under Modelplane. Spelled out the contracts:

1. InferenceCluster.spec.scheduler.type: managed-kueue | kueue | kai | volcano | none. Modelplane composes admission CRs per scheduler (Workload for Kueue, PodGroup for KAI / Volcano) and reads capacity-signal status fields.
2. InferenceCluster.spec.backend.{type, version}: managed-kserve | kserve | dynamo | raw-vllm + version pin. A backend adapter watches ModelPlacement (the IR) and renders backend-specific upstream objects: LLMInferenceService for KServe, DynamoGraphDeployment for Dynamo, Deployment+Service for raw-vllm. Adapter writes back to ModelPlacement.status.rendered.

Both follow the same pattern: opinionated default install (managed-X) + BYO contract for customers with existing investments. v1 ships Kueue + KServe adapters; KAI / Volcano / Dynamo are future contributions.

Doc updates:

- New "Pluggable substrate" section in 1-pager (between Architecture and Fleet-level capabilities) with the symmetry table
- New scheduler / backend fields on InferenceCluster XRD + example
- ModelPlacement.status.rendered docstring expanded; clarifies it's the seam between IR and backend adapters
- README claim-cascade section updated with the pluggable-substrate framing
- Open questions: added "BYO contract details" entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
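The two contracts above amount to a pair of lookup tables keyed by the InferenceCluster enums. The sketch below uses the enum values and Kinds named in this commit message; the function name `planned_kinds` and the return shape are hypothetical.

```python
# Admission CR composed per scheduler type (None means no admission CR).
ADMISSION_KIND = {
    "managed-kueue": "Workload",
    "kueue": "Workload",
    "kai": "PodGroup",
    "volcano": "PodGroup",
    "none": None,
}

# Upstream objects a backend adapter renders from the IR.
BACKEND_KINDS = {
    "managed-kserve": ["LLMInferenceService"],
    "kserve": ["LLMInferenceService"],
    "dynamo": ["DynamoGraphDeployment"],
    "raw-vllm": ["Deployment", "Service"],
}


def planned_kinds(scheduler_type, backend_type):
    """List the Kinds that would be composed for one placement (illustrative)."""
    if scheduler_type not in ADMISSION_KIND:
        raise ValueError(f"unknown scheduler.type: {scheduler_type}")
    if backend_type not in BACKEND_KINDS:
        raise ValueError(f"unknown backend.type: {backend_type}")
    kinds = list(BACKEND_KINDS[backend_type])
    admission = ADMISSION_KIND[scheduler_type]
    if admission is not None:
        kinds.append(admission)
    return kinds
```

Treating unknown enum values as errors, rather than falling back to a default, keeps a typo in the InferenceCluster spec from silently selecting the wrong adapter.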

…presentation'

The IR isn't a new abstraction we invented for the scheduler design. It's the role the existing ModelPlacement CRD (already in apis/modelplacements/) plays: the seam between the matcher's output and the version-pinned backend adapter's input. Made this explicit across the doc:

- modelplacement.yaml header: "existing CRD ... expanded here to play the role of the **intermediate representation (IR)** ... isn't a new abstraction — it's the role this existing CRD plays"
- modeldeployment.yaml header: notes ModelPlacement is the existing CRD playing the IR role
- 1-pager architectural decisions: spells out "intermediate representation (IR) — the seam between the matcher and the version-pinned backend adapter" with attribution to apis/modelplacements/
- 1-pager Modelplane-ships list: notes ModelPlacement is existing in apis/modelplacements/
- 1-pager pluggable substrate: spells out "intermediate representation" on first use in this section
- README claim-cascade section: spells out IR + notes the existing CRD
- README directory listing: notes "existing CRD; plays the IR role"
- inferencecluster.yaml backend comment: spells out "intermediate representation"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… notes

Bassam's three latest comments (on gpt-oss-20b.yaml):

1. Drop spec.replicas: 0. It is KEDA-managed via the scale subresource; user authors scaling block only.
2. Confirm labels-first / DRA-as-break-glass framing. Already demonstrated in this example; reword the doc to match.
3. Drop requiredEngineFeatures. Not needed for the simple case.

Broader refactor: doc is now framed as overall Modelplane API design (scheduler still the heaviest piece) rather than scheduler-only:

- Renamed design/scheduler-deliverables/ → design/modelplane-api/
- Renamed design/scheduler-1pager.md → design/modelplane-api.md
- Title: "Modelplane API Design — 1-pager"
- TL;DR expanded with adapter/plugin substrate + CapabilityVocab managed-catalog as first-class points
- Labels-first / DRA-break-glass made explicit throughout (XRD selector docstring, 1-pager architectural decisions, README)
- Capability vocabulary section expanded: Modelplane ships the canonical catalog (chip generations, engine versions, quantization, KV tiers, fabric ordering); customers override per-cluster for bespoke. Flagged as a candidate for an Upbound-managed commercial offering (keeping the catalog current is bounded high-leverage work)
- Pluggable substrate section renamed "Adapter / plugin substrate" and tightened: managed defaults (managed-kserve, managed-kueue) ship with Modelplane; BYO contracts let customers plug in KAI / Volcano / Dynamo via the IR seam. The IR (ModelPlacement, the existing apis/modelplacements/ CRD) is the contract between matcher and adapter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… appendix The README inside modelplane-api/ duplicated the design doc's framing. Absorb the still-useful bits ("what's deliberately incomplete" and the "where each XRD lands" mapping) into the design appendix so the design doc is the single source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@bassam

…iagram

Doc cleanup:

- Drop redundant 'Why this matters for BYO' paragraph; fold into prior two.
- Architecture box: surface scheduler/backend defaults, drop the hard-coded 'composes LLMIS' line (that's a backend concern).
- Open questions: drop two that are already resolved (BYO contract shape is in the substrate table; the requires.engineFeatures rename happened, and the field is already requiredEngineFeatures).
- Risks: byo-kserve -> kserve to match the InferenceCluster.spec.backend.type enum.

XRD cleanup:

- capabilityvocabulary.yaml: requires.engineFeatures -> requiredEngineFeatures (matches the ModelDeployment field name).

Diagram:

- diagram.excalidraw adapted from @bassam's whiteboard. Same overall layout (APIs / Matching / Example fleet topology) with current naming applied: drop nodeClaim (collapsed to two-level cascade); CapabilityVocabulary cluster-only (not namespace); add scheduler.type and backend.{type,version} on InferenceCluster; flag dual-path matching (matchLabels primary, matchAttributes DRA break-glass); ModelService labelled routing-only. Credit attribution at top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Directory rename to make the 'this is a proposal, not the API yet' framing explicit. All links in the design doc updated. - Diagram: rebuilt from Bassam's whiteboard with much tighter text in the API/Matching boxes (fits the original rectangle widths now, smaller font on the YAML), CamelCase on the example resource labels to match our actual CRD names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reverting my edits to the excalidraw — putting Bassam's original in place of my fiddled version. Doc reference updated to credit him as the source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…acro

Make the DRA distinction explicit. Modelplane is a federation planner: it evaluates predicates against declared pool capacity before any nodes exist. DRA is a runtime allocator: drivers introspect real hardware post-provisioning. We borrow DRA's vocabulary (typed attributes, domain-prefixed keys, CEL, the device.attributes[domain].name access pattern) but not its Kinds (DeviceClass / ResourceSlice / ResourceClaim) at the federation layer.

Specifically:

- New "Two-stage scheduling" section in the design doc, with a borrow/drop table and the BYOC-vs-Modelplane-provisioned grounding contract for when the backend adapter emits real ResourceClaims.
- Rename clusterClaim -> clusterSelector and deviceClaim -> deviceSelector across the doc, XRDs, and examples. "Claim" implies allocation, which is wrong for what we do pre-provisioning.
- Flatten the redundant inner `selector:` nesting on both selectors; matchLabels / matchAttributes / cel are now top-level fields on the selectors. Cleaner reads after the rename.
- Drop "DRA-shaped" / "DRA-typed" framing. Replace with "typed attribute predicates evaluated against declared pool attributes."
- Open questions: re-pose label-vs-attribute matching path (was DRA-vs-label); add the DRA grounding contract question.

Instance-type macro:

- New `instanceTypes` field on CapabilityVocabulary. Each entry has a canonical name (H100-NVL-8x-IB400), `expands` to a set of typed attributes, plus `aliases` for per-cloud SKU strings (aws:p5.48xlarge).
- Customers match on either dimension: high-level string for the common case, unpacked attributes for unusual constraints. Same predicate engine, vocab does the unpacking.
- Default catalog seeded with H100/H200/B200/MI300X. Per-cloud SKU taxonomy is exactly the bounded ongoing work that fits the managed catalog offering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
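As a rough illustration of the instance-type macro, the sketch below resolves an alias to its canonical name and matches on either dimension. The canonical name and alias come from the commit message; the `expands` attribute keys and values, and all function names, are assumptions for the example.

```python
# One hypothetical CapabilityVocabulary.instanceTypes entry.
VOCABULARY = {
    "H100-NVL-8x-IB400": {
        "aliases": ["aws:p5.48xlarge"],
        "expands": {"product": "H100", "gpuCount": 8},  # illustrative attributes
    },
}


def canonical_name(name, vocabulary=VOCABULARY):
    """Resolve a canonical name or per-cloud alias to the canonical name."""
    if name in vocabulary:
        return name
    for canonical, entry in vocabulary.items():
        if name in entry["aliases"]:
            return canonical
    raise KeyError(name)


def pool_attributes(instance_type, vocabulary=VOCABULARY):
    """Unpack an instance type into its typed attributes."""
    return dict(vocabulary[canonical_name(instance_type, vocabulary)]["expands"])


def matches(pool_instance_type, instance_type=None, attributes=None):
    """Match on the high-level string, on unpacked attributes, or on both."""
    if instance_type is not None:
        if canonical_name(pool_instance_type) != canonical_name(instance_type):
            return False
    have = pool_attributes(pool_instance_type)
    return all(have.get(k) == v for k, v in (attributes or {}).items())
```

Both match styles go through the same expansion step, which mirrors the "same predicate engine, vocab does the unpacking" point in the commit message.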

…ndex discipline

Every field on the user-facing API now has at least one named consumer (matcher / composer / backend adapter / gateway). Where it wasn't obvious before, comments now spell out who reads what for what purpose. Where there's no consumer, the field's gone.

Concrete changes:

- Drop ModelDeployment.spec.requiredEngineFeatures. Required features are derived from a more concrete declaration:

    roles present -> prefill-decode-disagg
    engine.quantization.target=kv -> fp8-kv-cache
    engine.speculation.type=NGram -> ngram-speculation
    engine.optimizations.* -> chunked-prefill / prefix-caching / kv-cache-routing
    adapters[] non-empty -> multi-lora
    parallelism.expert: enabled -> expert-parallelism

  Matcher unions these at federation time and matches against each InferenceCluster's KServeBackend.spec.engine.features. Single source of truth: declare what you want; matcher derives what backend features that requires.
- New typed engine.optimizations field: chunkedPrefill, prefixCaching, kvCacheRouting. Promotes commonly-used names from the advanced[] break-glass to a typed shape. Backend adapter translates to engine flags (vLLM --enable-chunked-prefill, etc.).
- ModelDeployment XRD: top-level Field-level consumer index spelling out who reads each field. parallelism, roles, adapters get explicit "Modelplane-canonical, backend adapter translates" framing.
- ModelService.spec.supportedModels: documented as consumed by ModelEndpoint route filtering, with an auto-refresh plan (controller polls the SaaS provider's /v1/models catalog API).
- ModelDeployment.spec.adapters: documented two consumers: backend adapter (engine LoRA load via vLLM --lora-modules) and gateway (LoRA-aware request routing).
- ModelEndpoint XRD: explicit framing that a Deployment route covers ALL of an MD's ModelPlacements transitively (every replica on every cluster the matcher placed them on). Cross-cluster spread = multiple MDs + multiple route entries.
- ModelPlacement.spec.requiredEngineFeatures -> derivedFeatures (carried on the IR for the backend adapter to verify support before rendering; not user-authored).
- Examples (kimi-k2, qwen3-coder, gpt-oss-20b, modelservice-together) re-commented inline pointing at consumers; required-feature derivation spelled out in kimi-k2 and qwen3-coder headers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
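The feature-derivation rules listed in this commit can be sketched as a pure function over a ModelDeployment spec. The feature names are taken from the commit message; the dictionary shape of the spec, the reading of `parallelism.expert` as the literal value `enabled`, and the function names are assumptions for illustration.

```python
# Typed engine.optimizations fields and the backend feature each one implies.
OPTIMIZATION_FEATURES = {
    "chunkedPrefill": "chunked-prefill",
    "prefixCaching": "prefix-caching",
    "kvCacheRouting": "kv-cache-routing",
}


def derive_features(spec):
    """Union the backend features implied by what the spec declares."""
    engine = spec.get("engine", {})
    features = set()
    if spec.get("roles"):  # roles present
        features.add("prefill-decode-disagg")
    if engine.get("quantization", {}).get("target") == "kv":
        features.add("fp8-kv-cache")
    if engine.get("speculation", {}).get("type") == "NGram":
        features.add("ngram-speculation")
    for field_name, feature in OPTIMIZATION_FEATURES.items():
        if engine.get("optimizations", {}).get(field_name):
            features.add(feature)
    if spec.get("adapters"):  # adapters[] non-empty
        features.add("multi-lora")
    if spec.get("parallelism", {}).get("expert") == "enabled":
        features.add("expert-parallelism")
    return features


def missing_features(spec, backend_features):
    """Features the workload needs that a cluster's backend does not expose."""
    return derive_features(spec) - set(backend_features)
```

A cluster is a candidate only when `missing_features` is empty, which is one way to express matching the union against KServeBackend.spec.engine.features.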

…loud SKUs

Bassam's "GPU hardware survey and unified taxonomy" (Notion, 2026-05-07) proposes a four-layer model (Cluster / Pool / Device / dynamic state) with capability sets (not boolean columns), predicates over equality, architecture as metadata while capability flags do the matching work, and rack-scale (NVL72) as its own addressable unit.

This patch lands the taxonomy in the default CapabilityVocabulary and in the existing examples, adds a reference-clusters/ directory with pre-generated InferenceClusters for known cloud SKUs, and reframes the canonical-catalog work as the wedge for a managed-catalog commercial offering layered with continuous testing & benchmarking.

Vocabulary changes (default CapabilityVocabulary):

  Cluster layer: cloud.{provider, region}, network.{fabric, bandwidthGbps, airgapped}, cluster.scaleUnit (independent-nodes | superpod | nvl72)
  Pool layer: cloud.instanceType, gpuCount, interconnect.{type, bandwidthGBs}, cpu.{vendor, cores, platform}, memoryGiB, nics.{count, bandwidthGbps}, host.virtualization
  Device layer: vendor, product, architecture, formFactor, vramGiB, mig (bool), capabilities (set), parentProduct (for fractional GPUs / MIG entries)

Drop the conflated `interNodeFabric` ordered-string in favor of network.fabric + network.bandwidthGbps (RoCE vs IB are distinct protocols on the same hardware; OCI's pattern shows why). Add `set` and `bool` types to attributeKeys.type enum.

Instance-type macros reseeded with Bassam's catalog rows: H100-NVL-8x, H200-NVL-8x, B200-NVL-8x, B300-NVL72, GB200-NVL72, MI300X-8x, L40S-8x, A100-80GB-8x. Each includes per-cloud SKU aliases (aws/gcp/azure/oci/coreweave/lambda/dgx).

Reference clusters (new):

- aws-p5-48xlarge.yaml: AWS H100, EFA
- gke-a3-mega-8g.yaml: GCP H100, RoCE
- oci-bm-gpu-mi300x-8.yaml: OCI MI300X bare metal, RoCE
- coreweave-gb300-nvl72.yaml: rack-scale Blackwell Ultra

Customers copy-paste or compose; updated as new SKUs land. Anchors the managed-catalog commercial offering. Follow-up: a Crossplane provider that polls cloud SKU APIs and generates these programmatically (removes the "keep labels up to date by hand" burden).

Existing examples updated to use the new vocab:

- kimi-k2 / qwen3-coder: matchAttributes use vramGiB predicates + capabilities set instead of architecture enum (keeps AMD eligible where the workload doesn't actually depend on Hopper specifically)
- gpt-oss-20b: --enable-prefix-caching pulled from engine.args into engine.optimizations.prefixCaching for consistency
- inferencecluster-prod-coreweave: full 4-layer attribute split

Doc updates:

- New "Hardware taxonomy & reference clusters" section folding in Bassam's survey, the four-layer model, key design choices (capability sets, predicates over equality, rack-scale as own unit, RoCE vs IB), the instance-type macro pattern, the static-vs-provider rollout for reference clusters.
- Commercial-offering framing extended: tracking + reference clusters + continuous testing & benchmarking. Each reference cluster paired with a tested-and-benchmarked workload run on every supported model family; costly to maintain, exactly what customers will pay for.
- Open Qs: rack-scale capacity-unit question, reference-cluster rollout ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l errors

Cross-checked every example end-to-end against the unified taxonomy. The new vocab landed cleanly but the existing examples weren't fully re-aligned. Fixed all the friction points before another round of review:

clusterSelector / deviceSelector layering:

- kimi-k2: network.fabric and network.bandwidthGbps are Cluster-layer attributes (live on InferenceCluster.spec.attributes), not Pool/Device. Moved from deviceSelector → clusterSelector.
- kimi-k2 / qwen3-coder / gpt-oss-20b: tier was matched via matchLabels, but it's an attribute (lives in spec.attributes, not metadata.labels). Switched to matchAttributes.

Misleading or wrong constraints:

- kimi-k2 / qwen3-coder: dropped the gpu.nvidia.com/nvlinkDomain constraint. For HGX baseboards (8 GPUs / node), NVLink-domain co-location is implied by interconnect.type: nvswitch + perNode: 8; the explicit constraint was redundant. Also fixed the misleading "keeps AMD MI300X eligible" comment that was contradicted by the NVIDIA-specific constraint key.

Stale vocab keys:

- modelservice-together: replaced modelplane.ai/region → cloud.region, modelplane.ai/provider → cloud.provider, dropped modelplane.ai/networkAccess (not in vocab).
- inferencecluster-prod-coreweave: dropped modelplane.ai/failureDomain (not in vocab and not used by any matcher).
- cloud.provider enum extended with SaaS providers (togetherai, baseten, bedrock, fireworks, modal) so ModelService.attributes validates.

Factual errors in reference clusters:

- coreweave-gb300-nvl72: gpuCount 2 → 4 (gb300-4x is "2 superchips, 1 Grace + 2 B300 each" = 2 Grace + 4 GPUs per Bassam's survey). Fixed interconnect.type from nvlink-c2c (Grace↔Hopper coherent memory only) to nvswitch (NVLink Switch, GPU↔GPU at rack scale). Added cpu.cores: 144 (2× 72-core Grace) and memoryGiB: 960.

Cross-namespace routing:

- assistant-endpoint referenced MDs in three different namespaces (research, dev-tools, app-team) without explicit namespace on the refs. Per "namespace = environment / lifecycle scope", routing across namespaces breaks the model. Consolidated all MDs + ModelService + ModelEndpoint into the same namespace (app-team).

Vocab macro fixes:

- B300-NVL72 / GB200-NVL72 macros conflated per-instance and per-rack (had cluster.scaleUnit: nvl72 inside Pool-layer macros). Macros are Pool-layer (per-host); rack-scale belongs on Cluster-layer (cluster.scaleUnit on InferenceCluster.attributes). Renamed to B300-Grace-4x / B200-Grace-4x with gpuCount: 4. The fact they sit in NVL72 racks is captured in the InferenceCluster's cluster.scaleUnit attribute, not the macro.

InferenceCluster XRD comment cleanup:

- "level 1 of three-level cascade" → "Cluster-layer attributes"; same for Pool / Device layers. Aligned with the unified taxonomy framing. Examples in description text updated to current vocab keys.

Doc:

- Appendix example descriptions updated to match the new selectors (Kimi K2 demonstrates predicates not "DRA break-glass"; qwen3-coder headlines its acme.example/* user-defined attributes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Doc:

- Trim wordiness throughout. Down from 432 → ~385 lines without losing substance. Mostly cuts of repetition between TL;DR and body, collapsing prose to tables, removing orphan bullets.
- Promote ModelService-is-Nic-sketch framing. Was buried in parentheticals; now a named "ModelService is a sketch" subsection in the API shape and a TL;DR bullet.

Examples (covering gaps the existing set didn't):

- examples/reference-clusters/eks-h100-no-dra.yaml: BYOC EKS on K8s 1.31 without a DRA driver. provisioning.mode: device-plugin; backend adapter constrains pods via nodeSelector + nvidia.com/gpu, no ResourceClaim emission. Demonstrates the labels-first path concretely and pairs with gpt-oss-20b.yaml's matchLabels usage.
- examples/kimi-k2-eu.yaml: EU-region sibling of kimi-k2.yaml. Pinned via cloud.region + modelplane.ai/compliance: [gdpr]. Concrete multi-region pattern.
- examples/multi-region-endpoint.yaml: ME routing across kimi-k2 + kimi-k2-eu + together-prod with weighted SaaS spillover.

Vocab:

- New L40S-4x macro for SKUs with 4 L40S (oci:BM.GPU.L40S.4, aws:g6e.{12,24}xlarge). Was incorrectly aliased under L40S-8x before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Nic Cope <nicc@rk0n.org>

Refine the design/api-update.md sketch following discussion of the scaling model, disaggregated serving, and the routing surface:

- Rename ModelPlacement to ModelReplica throughout. Each replica is one complete serving instance: single-node, multi-node via LWS, or full prefill/decode disagg. Mirrors Deployment -> Pod naming.
- Drop spec.scaling from ModelDeployment. Autoscaling is opt-in via a separate KEDA ScaledObject targeting the MD's scale subresource, the same pattern as Deployment + HPA. Add a worked Mixtral example with ScaledObject alongside.
- Add a discriminated-union pattern for disaggregated prefill/decode. A serving profile is either unified (root poolSelector / parallelism / engine) or disagg (explicit decode and prefill blocks, each self-contained, with no inheritance from the root). Decode and prefill must land on the same InferenceCluster but can target different pools.
- Move inter-node networking onto InferenceClass instead of cluster-level capabilities. Different networking implies a different class (h200-nvl-8x-ib vs h200-nvl-8x); networking belongs to the pool that uses it. Drop spec.capabilities from InferenceCluster; cluster-level metadata is captured as standard Kubernetes labels.
- Lift clusterSelector to deployment level. Profiles only carry pool selection and per-pool composition, since the cluster intent doesn't change between fallback profiles.
- Switch ModelService routing to a single spec.endpoints[] pattern (was separate selector vs routes paths). One mechanism for both simple and weighted routing.
- Drop spec.model.name in favor of metadata.name as the served model identifier. The HuggingFace repo (or other source) is purely where weights come from, not the model's identity.
- Add YAML comments throughout the examples explaining what each field does: what gets matched, what gets composed, what's optional.

Signed-off-by: Nic Cope <nicc@rk0n.org>

Different hardware targets typically require different model weight checkpoints (FP8 vs BF16 are different HuggingFace repos). That makes fallback profiles within one deployment the wrong abstraction — they're genuinely different deployments. Silent degradation (falling back to a config with lower context length or different quantization) is also arguably worse than explicit failure. This commit flattens the serving profile array. poolSelector, parallelism, and engine are now top-level fields on ModelDeployment.spec. Different hardware configurations are separate ModelDeployments behind one ModelService. The deployer makes explicit decisions about which configurations to run. The disaggregated prefill/decode pattern is now a discriminated union on ModelDeployment itself: either root-level poolSelector/parallelism/ engine (unified) or explicit decode/prefill blocks (disaggregated). If preferential scheduling is needed later, it would be a coordination mechanism between ModelDeployments, not inline profiles. Signed-off-by: Nic Cope <nicc@rk0n.org>

The parallelism block describes more than just parallelism: it's the complete compute topology of a role within one ModelReplica. Renaming to topology makes room for the per-role instance count (instances) which describes replication rather than sharding.

For disaggregated prefill/decode, the P:D ratio (e.g., 5P3D) is the number of independent instances per role within one ModelReplica. This is a topology parameter: fixed per deployment, not a scaling knob. It maps to KServe's LLMInferenceService.spec.replicas (decode) and spec.prefill.replicas (prefill). For unified serving, instances defaults to 1 and can be omitted.

Other changes in this commit:

- Require DRA on all InferenceClusters. Drop nodeSelector from pool declarations; DRA handles device-to-node binding. Pools are now just name, class, and maxNodes.
- Rename poolSelector to nodeSelector on ModelDeployment. With DRA required and nodeSelector gone from pools, the naming collision is resolved.
- Replace driver.version with cuda.toolkit as the typed capability example on InferenceClass. Driver version is a runtime property of the cluster, not the hardware SKU. CUDA toolkit version is a better example of where {type: version} decoration matters.

Signed-off-by: Nic Cope <nicc@rk0n.org>
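The mapping from per-role `instances` to the KServe replica fields might look like the sketch below. The `decode` / `prefill` / `topology.instances` paths follow the naming in these commit messages, but the exact spec layout, the function name, and the assumption that unified `instances` also lands in `spec.replicas` are illustrative guesses.

```python
def llmis_replica_fields(md_spec):
    """Map per-role instance counts onto LLMInferenceService replica fields."""

    def instances(block):
        # instances defaults to 1 when the topology omits it.
        return block.get("topology", {}).get("instances", 1)

    has_decode = "decode" in md_spec
    has_prefill = "prefill" in md_spec
    if has_decode != has_prefill:
        # Discriminated union: disaggregated serving needs both role blocks.
        raise ValueError("decode and prefill must be declared together")
    if has_decode:
        return {
            "replicas": instances(md_spec["decode"]),
            "prefill": {"replicas": instances(md_spec["prefill"])},
        }
    # Unified serving (assumption: root-level instances maps to spec.replicas).
    return {"replicas": instances(md_spec)}
```

Under these assumptions a 5P3D deployment, with five prefill instances and three decode instances, yields `spec.replicas: 3` and `spec.prefill.replicas: 5`.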

…nancy

- scheduler.type: auto (default), managed-kai, managed-kueue added to InferenceCluster. NVIDIA pools auto-resolve to managed-kai; non-NVIDIA to managed-kueue. BYOC detects existing install.
- New section: ModelDeployment placement walkthroughs — single-node TP, multi-node TP+PP via LWS, P/D disaggregation, KEDA + matcher loop.
- New section: Multi-tenancy — bin-packing, MIG, time-slicing all enabled at the pool layer; MD spec stays portable across sharing modes.
- New example: managed-gke-a3-kai.yaml (explicit managed-kai pin).
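The `auto` resolution rule might look roughly like the sketch below. The `gpu_vendor` argument is an assumed input (the commit does not say where the vendor is read from), and BYOC detection of an existing install is deliberately left out.

```python
ALLOWED_TYPES = {"auto", "managed-kai", "managed-kueue"}


def resolve_scheduler(scheduler_type: str = "auto", gpu_vendor: str = "nvidia") -> str:
    """Resolve InferenceCluster scheduler.type to a concrete scheduler."""
    if scheduler_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown scheduler.type: {scheduler_type!r}")
    if scheduler_type != "auto":
        # An explicit pin always wins over auto-resolution.
        return scheduler_type
    # Rule from the commit: NVIDIA pools get KAI, everything else gets Kueue.
    return "managed-kai" if gpu_vendor.lower() == "nvidia" else "managed-kueue"
```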

…shape to #64

- Rename ModelPlacement → ModelReplica everywhere (XRD, status fields, printer columns, doc, examples). Aligns with the "replica == placement" mental model. Pure rename + role expansion.
- Add Federation-layer scheduling section: composer / matcher / backend adapter / capacity adapter contracts, the actual matcher pseudocode, out-of-scope items with effort + uncertainty, ordering reasoning.
- Add Plugin/adapter system section: six adapter axes, version-pinned KServe absorbing schema churn, end-to-end managed + BYOC examples.
- Add BYOC scheduling section: onboarding flow, what "managed" means axis-by-axis, edge cases (no DRA, multiple schedulers, RBAC).
- Expand autoscaling walkthrough: KEDA / composer / matcher / backend loop with concrete scale-up + scale-down sequences.
- Replace v1/v2 framing with effort sizes + ordering + uncertainty.
- Defer API shape, hardware taxonomy, engine-features detail to #64.

- Remove xrds/ (and lint.py / LINT.md) — API shape lives in #64. Examples stay as illustrative scheduling-relevant YAML.
- New section: "What we treat as IR" — three IRs (ModelReplica explicit; cluster substrate + endpoint binding implicit today). Argues why naming IR seams is what makes BYO-* cheap.
- New section: "Crossplane lifecycle layers" — per-layer XR ownership is what enables pause/resume, GitOps drift, RBAC boundaries, version-skew handling per cluster.
- Reframe plugin/adapter axes: be honest the count is contingent. Two user-visible axes (scheduler, backend); the other four are internal / collapsible. Don't read into the number.
- New section: "User-facing surface preview" with Quickstart (4 CRs, ~60 lines of YAML to a working curl) + 5 Advanced scenarios as deltas (multi-region, BYOC+KAI, P/D disagg, custom InferenceClass, spillover). Goal: gauge complexity of the proposed scheduling design from the user's seat.

Replaces single 1291-line modelplane-api.md with four focused docs:

- README.md — index / TOC pointing at the right doc per audience
- quickstart.md — 4 CRs to a working curl (~120 lines)
- advanced.md — 5 common scenarios as deltas (~210 lines)
- scheduling.md — operator's reference: two-stage scheduling, federation matcher, in-cluster KAI/Kueue, multi-tenancy, BYOC behavior, placement walkthroughs (~500 lines)
- design.md — architectural decisions: principles, plugin/adapter system, IRs, Crossplane lifecycle layers, risks, open questions, roadmap (~490 lines)

User docs link to design.md for "why"; design.md links to scheduling.md for "what shows up to users". Same content as before, organized by audience instead of one monolith.

…nter

Pivots this PR from design preview to implementation sketch. The code under functions/ doesn't run (Nic's #64 protos aren't generated yet) but the shape, dependencies, and use cases are real.

Implementation:

- functions/compose-model-deployment/scheduling.py — federation matcher. Plain Python, no Crossplane imports — testable in isolation. Filters ICs by clusterSelector.matchLabels, filters pools by nodeSelector.cel against class capabilities, capacity check with sticky-placement accounting, scores and picks per replica. Topology strategies map to (nodes_per_inst, gpus_per_node). Disagg requires same-cluster decode + prefill pools.
- functions/compose-model-deployment/main.py — Crossplane glue. Required-resources for clusters/classes/owned MRs, calls scheduling.match(), emits ModelReplica + ModelEndpoint per spec.replicas, sets MD conditions.
- functions/compose-model-placement/main.py — renderer. Reads MR + matched IC + class(es), composes KServe LLMInferenceService (decode + optional prefill) + DRA ResourceClaim(s) on the target cluster via remote-object provider. Lifts cold-start conditions back as MR.status.

Docs:

- design.md compressed to ~180 lines: architecture diagram, what-lives-where table, dependencies per function, use cases traced through the code.
- README.md is now a 2-section pointer at design.md + the code.
- Deleted quickstart.md / advanced.md / scheduling.md — served the design phase; the code is the new source of truth.
- examples/README.md maps each example to the matcher/renderer code path it exercises.

Adapter functions (_load_md, _resolve_clusters, _load_mr, _cel_from_capabilities) raise NotImplementedError — they're the wiring points that fill in once #64 lands and protos are generated.
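One plausible reading of "topology strategies map to (nodes_per_inst, gpus_per_node)" is sketched below. It assumes tensor parallelism stays inside a node and each pipeline stage occupies its own node; the function name `topology_shape` is hypothetical and the real mapping in scheduling.py may differ.

```python
def topology_shape(strategy: str, tensor: int, pipeline: int = 1) -> tuple[int, int]:
    """Return (nodes_per_inst, gpus_per_node) for one instance of a topology."""
    if tensor < 1 or pipeline < 1:
        raise ValueError("tensor and pipeline must be positive")
    if strategy == "Tensor":
        # Single-node tensor parallelism: all GPUs live on one node.
        return (1, tensor)
    if strategy == "TensorPipeline":
        # Assumed layout: one node per pipeline stage, TP inside each node.
        return (pipeline, tensor)
    raise ValueError(f"unknown topology strategy: {strategy!r}")


nodes, gpus = topology_shape("TensorPipeline", tensor=8, pipeline=2)
```

Under these assumptions a TP=8, PP=2 instance needs two 8-GPU nodes, which is the quantity the capacity check would compare against a pool's free nodes.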

Adds stage-2 scheduler integration to the renderer + a sketch of the capacity feedback loop. Same sketch quality as the rest of the PR — doesn't run, but the dispatch shape, per-scheduler differences, and capacity-status pipeline are honest.

- functions/compose-model-placement/scheduler.py — per-scheduler wrap functions. KAI: schedulerName + PodGroup CRD wrapping the LWS gang (minMember = total pods). Kueue: kueue.x-k8s.io/queue-name label + suspend gate (Kueue's webhook creates the Workload). none: pass-through. Single dispatch table; new schedulers (Volcano, etc.) plug in here.
- functions/compose-model-placement/main.py — wired to call scheduler.wrap() after building the base LLM-IS spec. Emits the wrapped spec + any scheduler-companion objects (PodGroup) onto the same target cluster via the existing remote-object provider.
- lib/capacity_adapter/{__init__,common,kai,kueue}.py — sketch of the per-scheduler status pullers. Reads the scheduler's status CRDs (KAI Queue/ResourcePool, Kueue ClusterQueue.flavorsUsage[]), normalizes into the shared CapacitySnapshot type, writes to IC.status.capacity. NOT a Crossplane composition function — runs as a separate controller, one per IC. Sketch shows the projection logic; K8s client wiring is NotImplementedError stubs.
- design.md updated with a KAI/Kueue section: per-scheduler differences table, dispatch wiring diagram, capacity feedback loop diagram, how to add a new scheduler. Notes the small API extension needed on Nic's #64 (IC.spec.scheduler.type).

Splits each composition function into pure modules (algorithm, dict builders, dispatch tables — no Crossplane imports) and an orchestrator main.py with phase-banner comments distinguishing scheduling logic from Crossplane glue from status / error handling.

compose-model-deployment/:
    scheduling.py   federation matcher (pure)
    adapters.py     proto ⇄ scheduling types (boundary)
    emitters.py     pure dict builders for ModelReplica / ModelEndpoint
    main.py         orchestrator — six labeled phases: REQUIRE → LOAD → MATCH → BUILD → EMIT → STATUS

compose-model-placement/:
    rendering.py    pure LLM-IS / DRA / selector-CEL builders
    scheduler.py    per-scheduler wrap dispatch (KAI / Kueue / none)
    adapters.py     proto ⇄ rendering types (boundary)
    main.py         orchestrator — seven labeled phases: REQUIRE-cluster → REQUIRE-classes → LOAD → RENDER → WRAP → EMIT → STATUS

Each main.py opens with the lifecycle diagram, condition state machines (Scheduled / ReplicasReady / Ready+cold-start sub-states), and the error-handling table per phase.

Tests (tests/unit/, runs in ~20ms):
    test_scheduling.py         topology shapes, cluster/pool filters, capacity reservation across replicas, sticky placement, disagg same-cluster, matchTrace shape (28 cases)
    test_scheduler.py          KAI / Kueue / none dispatch, PodGroup minMember math across single-node / multi-node / disagg, queue label, suspend gate (16 cases)
    test_rendering.py          LLM-IS shape (Tensor, TensorPipeline, disagg), DRA selector CEL derivation from class capabilities (15 cases)
    test_capacity_adapter.py   ResourceCount math, write_status round-trip, KAI / Kueue projections, NotImplementedError sanity (10 cases)

Plus pyproject.toml gets pytest config + per-file ARG001 ignores for intentional dispatch-contract signatures.

Static health: ruff + pyright clean over functions/ + lib/. 69/69 unit tests pass.

…uler

Renames + minor refactor for clearer scheduling semantics:

- match() → schedule(), MatchResult → ScheduleResult. Both align with K8s SIG-Scheduling's `Schedule(workload) -> ScheduleResult` contract and with the existing repo's schedule() name on main.
- _candidates_for_replica → _filter (Filter phase)
- _pick → _score_and_select (Score phase)
- _to_placement → _bind (Bind phase)
- Added explicit "── Filter ──" / "── Score ──" / "── Bind ──" comment banners in schedule() so the K8s parallel is visible.
- Strengthened the module docstring to lead with "META-FLEET SCHEDULER — fleet-level placement, NOT cluster-level scheduling" so the meta-fleet positioning is the first thing a reader sees.
- main.py orchestrator: phase_match → phase_schedule, Phase 3 banner renamed MATCH → SCHEDULE.

Tests: 9 occurrences of scheduling.match() → scheduling.schedule(). All 69 still pass.

Plus a new "Delta from existing scheduling on main" section in design.md — concept-by-concept comparison table covering mental model, unit of placement, capacity input, pool eligibility, topology, disaggregation, engine matching, scaling, stickiness, multi-replica accounting, algorithm structure, result shape, matchTrace, cluster source, in-cluster integration, lines of code.

Also cleans up __pycache__/ that slipped through git tracking before the gitignore landed.
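The Filter → Score → Bind loop named in this commit can be illustrated with a heavily simplified sketch. The `Pool` and `Placement` types, the scoring rule (sticky first, then most free nodes, then name) and the meaning of `free_nodes` (capacity not already counted against this deployment) are all assumptions; the real schedule(md, clusters, existing) takes richer inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pool:
    cluster: str
    name: str
    free_nodes: int


@dataclass(frozen=True)
class Placement:
    replica: int
    cluster: str
    pool: str


def schedule(replicas: int, nodes_per_replica: int, pools: list[Pool],
             existing: Optional[dict[int, tuple[str, str]]] = None):
    """Place each replica on a (cluster, pool); returns (placements, unscheduled)."""
    existing = existing or {}
    free = {(p.cluster, p.name): p.free_nodes for p in pools}
    placed: list[Placement] = []
    unscheduled: list[int] = []
    for idx in range(replicas):
        # Filter: keep pools that still have room after earlier reservations.
        candidates = [key for key, n in free.items() if n >= nodes_per_replica]
        if not candidates:
            unscheduled.append(idx)
            continue
        # Score: prefer the replica's previous pool, then the emptiest pool.
        sticky = existing.get(idx)
        best = min(candidates, key=lambda key: (key != sticky, -free[key], key))
        # Bind: reserve capacity so later replicas see the reduced pool.
        free[best] -= nodes_per_replica
        placed.append(Placement(idx, best[0], best[1]))
    return placed, unscheduled
```

Reserving capacity inside the loop is what keeps N replicas from all landing on a pool that only has room for one.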

Won't merge until after #64 (or later) lands, so the "delta from main" table was documenting terminology nobody will see by then. Replaces it with a "Scheduler properties" section that pins down the load-bearing behavior in K8s SIG-Scheduling terms — what the scheduler actually does, no comparison column. No code changes; doc only.

Drops the banner-heavy phase orchestration in favor of Nic's existing style on main (compose-model-deployment is the gold standard):

- Short, declarative module docstrings — no ═══ banner blocks, no big lifecycle diagrams.
- `Composer` / `Renderer` classes mirror Nic's shape: __init__ parses XR via adapters; compose()/render() chains verb-named methods directly; no bool-return phase pattern.
- Methods named after what they do: resolve_inputs, schedule, compose_replicas, compose_endpoints, write_status, derive_conditions (composer); resolve_inputs, compose_llmis, compose_resource_claims, derive_conditions (renderer).
- conditions.set_condition takes a bool (was passing strings — actual bug; "False" is truthy and would have set TRUE).
- response.warning / response.normal events on transitions:
    "Scheduled N replica(s): r0→east/h200, r1→west/h200"
    "No InferenceClusters in the fleet"
    "Cluster fleet config invalid: ..."
- libresource.update_status writes MD.status.modelReplicas + matchTrace.
- self.rsp.desired.composite.ready = fnv1.READY_FALSE when nothing scheduled (matches Nic's pattern of explicit unready).
- Drops manual ownerReferences from emitters — Crossplane sets them automatically.
- Drops md_uid / engine / source extra params from build_replica; emitters take just (md, placement) now.
- Pure modules (scheduling.py, scheduler.py, rendering.py) get tighter docstrings without ═══ banners. K8s SIG-Scheduling vocabulary + "fleet" terminology preserved.
- Adds naming.llmis_name / claim_name to lib/naming.py.

Algorithm + behavior unchanged. 69/69 tests still pass. ruff/pyright clean.
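The string-versus-bool bug mentioned above is easy to reproduce in isolation. The two helpers below are hypothetical stand-ins, not the real conditions.set_condition; they only show why a non-empty string like "False" ends up setting the condition to true.

```python
def condition_status_buggy(status) -> str:
    """Accepts anything truthy, so the string "False" maps to "True"."""
    return "True" if status else "False"


def condition_status(status: bool) -> str:
    """Rejects non-bool input instead of silently coercing it."""
    if not isinstance(status, bool):
        raise TypeError(f"status must be bool, got {type(status).__name__}")
    return "True" if status else "False"


# Any non-empty string is truthy in Python, including "False".
assert bool("False") is True
wrong = condition_status_buggy("False")   # condition silently flips to True
right = condition_status(False)
```

Rejecting non-bool input at the boundary is one way to make this class of mistake fail loudly; the type checker would flag the same call once the parameter is annotated as `bool`.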

Pulls the per-scheduler dispatch system + capacity adapters out of this MR so the federation algorithm + IR boundary are the focus of review.

Removed:

- functions/compose-model-placement/scheduler.py (KAI/Kueue/none dispatch)
- lib/capacity_adapter/{kai,kueue,common,__init__}.py (status pullers)
- tests/unit/test_scheduler.py
- tests/unit/test_capacity_adapter.py
- ClusterView.scheduler_type (no dispatch needed)

Added (inline, in rendering.py):

- rendering.with_kai_gang(spec, mr_name, ns) → KaiBundle. Stamps schedulerName + pod-group label, emits the PodGroup with minMember = total pod count (decode + prefill if disagg).
- rendering.gang_size(spec) → int (extracted from the inline helper)
- KAI tests folded into tests/unit/test_rendering.py.

Renderer's main.py drops scheduler import; calls rendering.with_kai_gang directly. POD_GROUP_KEY composed alongside LLMIS_KEY and DRA ResourceClaim(s) on the target cluster.

design.md: replaced "KAI / Kueue integration" section with "KAI integration (in-cluster, this MR)" + an explicit "Follow-up MR (plugin/dispatch system)" callout listing what comes next.

49/49 tests still green. Ruff + pyright clean.
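The minMember arithmetic could be sketched as follows. The `Role` type and the assumption of one pod per node are illustrative only; the real rendering.gang_size takes the LLM-IS spec and may count pods differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Role:
    workers: int               # independent instances of this role
    nodes_per_worker: int = 1  # assumed: one pod per node


def gang_size(decode: Role, prefill: Optional[Role] = None) -> int:
    """Total pods that must be admitted together (PodGroup minMember)."""
    total = decode.workers * decode.nodes_per_worker
    if prefill is not None:
        # Disaggregated serving gangs decode and prefill pods together.
        total += prefill.workers * prefill.nodes_per_worker
    return total


single_node = gang_size(Role(workers=1))
multi_node = gang_size(Role(workers=1, nodes_per_worker=2))
disagg = gang_size(Role(workers=3), prefill=Role(workers=5))
```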

The topology block describes the shape of each worker, but workers.count (how many of that shape) is a sibling concern at the same level — not a property of the topology itself. Group them together under workers:

    workers:
      count: 3
      topology:
        strategy: TensorPipeline
        tensor: 8
        pipeline: 2

This reads as "3 workers, each TensorPipeline TP=8 PP=2." The topology describes one worker. The count says how many. nodeSelector and engine stay alongside workers as separate concerns — what hardware each worker needs and what engine it runs.

For unified serving, workers just contains topology (count defaults to 1). For disaggregated P/D, workers.count on each role is the P:D ratio — the "5" and "3" in 5P3D. It's a topology parameter (fixed per deployment), not a scaling knob.

Signed-off-by: Nic Cope <nicc@rk0n.org>
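As a rough illustration of the workers block, the sketch below parses it into dataclasses and computes the GPU footprint. The dataclass names echo the Workers(topology=..., count=N) pattern mentioned elsewhere in this PR, but the fields, the `parse_workers` helper and the `total_gpus` property are assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Topology:
    strategy: str
    tensor: int = 1
    pipeline: int = 1


@dataclass(frozen=True)
class Workers:
    topology: Topology
    count: int = 1  # defaults to 1 for unified serving

    @property
    def total_gpus(self) -> int:
        # Each worker shards across tensor * pipeline GPUs.
        return self.count * self.topology.tensor * self.topology.pipeline


def parse_workers(block: dict[str, Any]) -> Workers:
    """Build Workers from the dict form of the YAML block."""
    return Workers(topology=Topology(**block["topology"]), count=block.get("count", 1))


workers = parse_workers({
    "count": 3,
    "topology": {"strategy": "TensorPipeline", "tensor": 8, "pipeline": 2},
})
```

With the example from the commit, three TP=8 PP=2 workers come to 48 GPUs.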

The InferenceClass becomes a tested recipe that bundles both capabilities (for scheduling) and optionally cloud-specific provisioning config (for cluster composition). When Modelplane provisions a GKE cluster, the composition function reads class.provisioning.gke to get the machineType, accelerator config, and networking — guaranteed consistent with the capabilities the scheduler uses for matching.

The provisioning block is optional. Classes without it are capabilities-only, used for BYO clusters where the pool already exists. The provisioning.provider discriminator selects the cloud-specific sibling block (gke, eks, aks).

Modelplane ships a default catalog: cloud-specific classes for provisioned clusters (gke-h200-8x-a3-ib, gke-l4-1x-g2) and cloud-agnostic classes for BYO (h200-8x-ib, l4-1x).

The InferenceCluster section now shows both a GKE-provisioned cluster (pools reference cloud-specific classes) and a BYO cluster (pools reference capabilities-only classes). Cluster-level config (project, region, K8s version) stays on the InferenceCluster. Pool-level config (machineType, GPU, networking) moves to the class. Pool sizing (maxNodes, nodeCount) stays on the InferenceCluster pool — it's a per-cluster capacity decision, not a property of the hardware SKU.

Signed-off-by: Nic Cope <nicc@rk0n.org>
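The provider discriminator could be resolved along these lines. The helper name `provisioning_config` and the example machineType value are assumptions; only the `provisioning.provider` field and the gke/eks/aks sibling blocks come from the commit message.

```python
from typing import Any, Optional

PROVIDERS = {"gke", "eks", "aks"}


def provisioning_config(inference_class: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the cloud-specific block selected by provisioning.provider."""
    provisioning = inference_class.get("provisioning")
    if provisioning is None:
        # Capabilities-only class: used for BYO clusters, nothing to provision.
        return None
    provider = provisioning.get("provider")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provisioning.provider: {provider!r}")
    if provider not in provisioning:
        raise ValueError(f"provider is {provider!r} but no {provider!r} block is set")
    return provisioning[provider]


gke_class = {
    "capabilities": {},
    # machineType value is a made-up placeholder for illustration.
    "provisioning": {"provider": "gke", "gke": {"machineType": "example-machine"}},
}
config = provisioning_config(gke_class)
```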

Pulled origin/pages and adjusted code for Nic's new shape:

    Old: spec.topology = {strategy, tensor, pipeline, instances}
    New: spec.workers = {count, topology: {strategy, tensor, pipeline}}

Mechanical rename across the algorithm + IR + tests:

- scheduling.py: split Topology dataclass; new Workers dataclass holding topology + count. RoleSpec now carries `workers` (was `topology`). RolePlacement.instances → RolePlacement.workers.
- emitters.py: ModelReplica spec writes spec.{decode,prefill}.workers.{count,topology} instead of nested instances+topology.
- rendering.py: RoleView.workers replaces .topology + .instances. Drops the unused `image` carry-through that was inside topology.
- 49 unit tests updated to use Workers(topology=..., count=N) construction pattern.

Algorithm semantics unchanged — workers.count plays the role topology.instances did.

Also includes the merge of origin/pages bringing in design/api-update.md (Nic's design doc, untouched).
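The old-to-new shape change amounts to lifting `instances` out of the topology block. A minimal sketch of that transformation on plain dicts, with the hypothetical helper name `to_workers` (the PR itself changes dataclasses rather than converting dicts):

```python
from typing import Any


def to_workers(old_topology: dict[str, Any]) -> dict[str, Any]:
    """Convert old spec.topology into the new spec.workers shape."""
    topology = dict(old_topology)             # copy so the input is not mutated
    count = topology.pop("instances", 1)      # instances used to default to 1
    return {"count": count, "topology": topology}


old = {"strategy": "TensorPipeline", "tensor": 8, "pipeline": 2, "instances": 3}
new = to_workers(old)
```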

bassam reviewed May 6, 2026

Comment thread design/scheduler-deliverables/examples/inferencecluster-prod-coreweave.yaml Outdated

bassam reviewed May 6, 2026

Comment thread design/proposed-modelplane-api/examples/clusters/byoc-coreweave-h200-dra.yaml

bassam reviewed May 6, 2026

Comment thread design/scheduler-deliverables/examples/inferenceprovider-together.yaml Outdated