Federation scheduler + KServe renderer (managed-kai)#63

Closed
dennis-upbound wants to merge 56 commits into main from dennis/scheduler-1pager

Conversation

dennis-upbound commented May 6, 2026

Collaborator

What this is

The federation scheduling algorithm + KServe rendering composition functions, scoped to managed-kai as the in-cluster scheduler. The plugin/dispatch system (Kueue, Volcano, none) and per-scheduler capacity adapters land in a follow-up PR; they are kept out of this one so that the algorithm and the IR boundary are the focus of review.

API shape is owned by #64; this branch implements against that shape.

Read order

  1. design/proposed-modelplane-api/design.md — architecture, what-lives-where, dependencies, scheduler properties, KAI integration, use-case traces.
  2. functions/compose-model-deployment/scheduling.py — federation scheduler. Plain Python, no Crossplane imports. schedule(md, clusters, existing) → ScheduleResult. Filter → Score → Bind.
  3. functions/compose-model-deployment/main.py — composer. Required-resources → scheduling.schedule() → emit ModelReplica × spec.replicas + ModelEndpoint × spec.replicas.
  4. functions/compose-model-placement/rendering.py — pure builders for KServe LLM-IS, DRA ResourceClaim, KAI PodGroup.
  5. functions/compose-model-placement/main.py — renderer. MR → LLMInferenceService + PodGroup + ResourceClaim(s) on the target cluster.
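
For reviewers who want the shape of Filter → Score → Bind before opening scheduling.py, here is a minimal sketch. Only the `schedule(md, clusters, existing)` signature and the `ScheduleResult` name come from this PR; the `Cluster` fields, the dict layout of `md`, and the scoring order are assumptions made for illustration, not the code in this branch.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical candidate shape; free_gpus is the capacity this
    # deployment is allowed to reserve on the cluster.
    name: str
    labels: dict[str, str]
    free_gpus: int


@dataclass
class ScheduleResult:
    bindings: list[str] = field(default_factory=list)  # cluster per placed replica
    unschedulable: int = 0                              # replicas left unplaced


def schedule(md: dict, clusters: list[Cluster], existing: dict[int, str]) -> ScheduleResult:
    """Place md["replicas"] replicas, each needing md["gpus"] GPUs."""
    free = {c.name: c.free_gpus for c in clusters}
    result = ScheduleResult()
    for index in range(md["replicas"]):
        # Filter: label match plus enough unreserved capacity.
        feasible = [
            c for c in clusters
            if all(c.labels.get(k) == v for k, v in md["selector"].items())
            and free[c.name] >= md["gpus"]
        ]
        if not feasible:
            result.unschedulable += 1
            continue
        # Score: sticky placement first, then most free capacity, then name.
        sticky = existing.get(index)
        best = max(feasible, key=lambda c: (c.name == sticky, free[c.name], c.name))
        # Bind: reserve capacity so later replicas see the reduced headroom.
        free[best.name] -= md["gpus"]
        result.bindings.append(best.name)
    return result
```

Keeping the function free of Crossplane imports, as the read order describes, is what lets a sketch like this run under plain pytest.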

What's in vs out

In: federation scheduler (Filter → Score → Bind), ModelReplica IR, sticky placement, multi-replica capacity reservation, disaggregation (decode + prefill same-cluster), KServe v0.18 LLM-IS rendering, DRA selector CEL derivation, managed-kai wrap (schedulerName + PodGroup), 49 unit tests.

Out (separate PRs):

  • Per-scheduler dispatch (Kueue/Volcano/none) + IC.spec.scheduler.type enum
  • Per-scheduler capacity adapter controllers
  • Cluster onboarding controller (auto-detect installed scheduler)
  • Eviction controller (cluster-degraded re-placement)
  • Per-version KServe adapters (v0.16/v0.17 dispatch)
  • Real CEL evaluator (placeholder + monkeypatched in tests today)

Tests

uv venv .venv-test
uv pip install --python .venv-test/bin/python pytest ruff pyright
.venv-test/bin/python -m pytest tests/unit -v
.venv-test/bin/ruff check functions/ lib/

49/49 tests pass in ~20ms. Ruff + pyright clean. Doesn't run end-to-end yet — adapters raise NotImplementedError until #64's protos are generated; algorithm and shape are testable in isolation.

🤖 Generated with Claude Code

A 1-pager + design-time preview of the CRDs and example resources for the
scheduler & capability model. The 1-pager (design/scheduler-1pager.md) is the
source of truth; the deliverables directory is a copy of what the API would
look like once aligned. Nothing here is wired up — no CRDs installed, no
controller code, no CI hooks.

The 1-pager covers architecture (control plane + workload planes), who owns
what (ML/App team, Platform team, Modelplane), API shape, capability
vocabulary tiers, risks, and v1/v2 themes. The deliverables directory
includes proposed XRDs for InferenceCluster, InferenceProvider,
CapabilityVocabulary, ModelDeployment, ModelEndpoint, and ModelPlacement
(the IR), plus example resources covering platform substrate (Coreweave
cluster, Together provider, default vocabulary) and workloads (Kimi K2 5P3D
disaggregation, Qwen3-Coder n-gram + multi-LoRA, gpt-oss-20b scale-to-zero,
weighted ModelEndpoint).

Once the API is finalized, XRDs move into apis/ and examples into the
repo-root examples/. Nic owns the final API design; this PR is meant to
support the scheduler discussion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread design/scheduler-deliverables/examples/inferencecluster-prod-coreweave.yaml Outdated
Comment thread design/scheduler-deliverables/examples/inferenceprovider-together.yaml Outdated
Three fixes from PR #63 review:

1. Drop `engine` from InferenceCluster XRD + example. Engine + features
   live on KServeBackend (apis/kservebackends/), the existing internal XR
   that represents the inference stack on a cluster. Substrate / runtime
   split — InferenceCluster is hardware; KServeBackend is runtime. Allows
   multiple engines per cluster.

2. Drop `engine` from InferenceProvider XRD + example. Providers are
   opaque routes; declaring features pretends we know what's inside.
   Match is now supportedModels[] + env-level attributes. Workloads
   requiring engine features are excluded automatically (matchTrace
   surfaces "skipped: provider doesn't expose engine features").

3. Add `nodeSelector` to InferenceCluster.nodePools[] so the composer
   can constrain pods to a pool's nodes. Convention:
   `modelplane.ai/pool: <pool-name>` — auto-applied on Modelplane-
   provisioned pools, set by operators on BYO pools.

Side effect: `requires.engineFeatures` becomes implicitly cluster-only
(matched against the cluster's KServeBackend, never against providers).
Documented in the ModelDeployment XRD comment. Renaming to make this
explicit deferred until Nic weighs in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread design/scheduler-deliverables/examples/gpt-oss-20b.yaml Outdated
Comment thread design/scheduler-deliverables/examples/gpt-oss-20b.yaml Outdated
Comment thread design/scheduler-deliverables/examples/inferenceprovider-together.yaml Outdated
Comment thread design/scheduler-deliverables/examples/kimi-k2.yaml Outdated
Comment thread design/scheduler-deliverables/examples/kimi-k2.yaml Outdated
Comment thread design/scheduler-deliverables/examples/assistant-endpoint.yaml Outdated
Dennis Ramdass and others added 2 commits May 6, 2026 19:10
…Service

Major rev landing the alignment from PR #63 review + slack thread.

API shape changes:
1. Three-level claim cascade per #56: clusterClaim / nodeClaim / deviceClaim.
   Replaces the flat `requires` and `topology.{nodes, devicesPerNode}`.
   deviceClaim is DRA-shaped (count, perNode, selector, constraints).
2. Replica == placement. ModelDeployment carries spec.replicas + the K8s
   scale subresource. Each MP composes one LLMInferenceService.spec.replicas: 1.
   KEDA writes spec.replicas via a stock ScaledObject — no custom scaler.
   v1: same-cluster constraint (matcher decides on first MP, reuses).
   v2: cross-cluster spread.
3. InferenceProvider renamed to ModelService. Namespace-scoped. Routing-only
   — never a placement target. Matcher considers only InferenceCluster
   candidates. (Dedicated-SaaS placement is a separate concept Nic owns.)
4. ModelEndpoint route discriminator: Deployment | ModelService | External.
   Routes target Deployment by ref/selector and fan across all its placements.
5. CapabilityVocabulary scoped Namespaced. Cluster default ships in
   modelplane-system; per-namespace overrides in user namespaces.
6. environmentClaim → clusterClaim (DRA naming consistency).
7. Drop fan-out (`environments: N` removed); multi-region = multiple MDs +
   ModelEndpoint route entries.
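
A rough sketch of item 2 ("Replica == placement") under the v1 same-cluster constraint. The field names inside `spec`, the `pick_cluster` callback, and the `<md>-<index>` naming are hypothetical; the only parts taken from the commit message are one ModelPlacement per replica, each composing an LLMInferenceService with replicas: 1, and the matcher deciding once and reusing the result.

```python
from typing import Callable, Optional


def compose_placements(
    md_name: str,
    replicas: int,
    pick_cluster: Callable[[], str],
    current_cluster: Optional[str] = None,
) -> list[dict]:
    """Return one ModelPlacement per replica, all on the same cluster."""
    # v1 same-cluster constraint: the matcher runs only when no placement
    # exists yet; every later replica reuses that decision.
    cluster = current_cluster if current_cluster is not None else pick_cluster()
    return [
        {
            "kind": "ModelPlacement",
            "metadata": {"name": f"{md_name}-{index}"},  # hypothetical naming
            "spec": {
                "cluster": cluster,  # hypothetical field name
                # Each placement renders exactly one serving instance.
                "llmInferenceService": {"replicas": 1},
            },
        }
        for index in range(replicas)
    ]
```

When KEDA raises spec.replicas through the scale subresource, the composer would call this again with `current_cluster` set, so scaling never re-runs the matcher.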

Lifecycle: namespace = environment. Each namespace holds one ModelEndpoint,
0..N ModelDeployment / ModelService / ModelPlacement, optional vocab override.
Pushing an MD revision triggers lifecycle reconciliation in that namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stale references cleaned across XRDs, examples, README, and 1-pager:
- ModelDeployment/ModelPlacement comments still framed placements as
  "per fleet member" — corrected to per-replica
- ModelService comments still said "matcher matches against providers"
  / "skipped for ModelService" — stale; matcher considers only
  InferenceCluster candidates now
- Workload examples and 1-pager wording about "fleet member" /
  "(clusters + providers)" tightened where ModelService is no longer
  a placement target

Stripped v1 vs v2 implementation hedges from API surface (XRDs,
examples, 1-pager architectural decisions). v1/v2 milestone language
stays confined to the explicit "what ships v1 vs v2" / fleet-level
capabilities sections; API design docs describe the abstraction
without prescribing implementation phasing.

Reverted CapabilityVocabulary to cluster-scoped singleton. Per
Bassam's Slack feedback: namespace-scoped vocab overrides create a
coordination problem because InferenceCluster (cluster-scoped) declares
attributes against a vocabulary, and per-namespace overrides would
evaluate the same cluster's hardware semantics differently from each
namespace. Namespaces customize via Compositions and pass-through
user-defined keys (acme.example/*), not vocab redefinition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread design/scheduler-deliverables/examples/gpt-oss-20b.yaml Outdated
Comment thread design/scheduler-deliverables/examples/gpt-oss-20b.yaml Outdated
Comment thread design/scheduler-deliverables/examples/gpt-oss-20b.yaml Outdated
Dennis Ramdass and others added 6 commits May 6, 2026 22:15
…cleanup

Per Bassam's review:

1. Combined nodeClaim + deviceClaim into one DRA-shaped deviceClaim.
   Device attributes are uniform across devices on a node, so the
   conceptual node/device split wasn't load-bearing. The deviceClaim
   selector now matches both node-level (modelplane.ai/interNodeFabric)
   and device-level (gpu.nvidia.com/architecture) attributes uniformly.
   Cluster cascade: clusterClaim + deviceClaim (was three-level).

2. Made in-cluster scheduler delegation explicit. Modelplane decides
   which cluster a workload runs on; bin-packing, gang scheduling,
   fractional GPU, NVLink-aware placement, and capacity tracking are
   delegated to whatever in-cluster scheduler is installed (KAI, Kueue,
   Volcano, vanilla K8s scheduler). Modelplane reads cluster-level
   capacity signal where present; never replaces in-cluster scheduling
   logic.

3. Stripped v1/v2 implementation hedging from the API design surface.
   Dropped "When" column from fleet-level capabilities table — those
   are design-level capabilities the architecture supports, not phasing
   commitments. Cleared v1/v2 markers from API skeleton comments and
   risk mitigation prose. Project-plan section at the bottom retains
   v1/v2 themes as the explicit phasing artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…questions

1. In-cluster scheduler stance: Modelplane ships Kueue as the default
   substrate (`managed-kueue` mode, mirroring `managed-kserve`); BYO
   schedulers (KAI, Volcano, existing Kueue installs) are supported via
   a capacity-signal contract (ClusterQueue.status or equivalent).
   Replaced the prior "agnostic about scheduler" framing with this
   opinionated-default-with-BYO-escape position.

2. Dual-path matching: deviceClaim.selector now supports both
   `matchLabels` (plain node-label matching, no DRA required) and
   `matchAttributes` (DRA-typed; richer constraints like NVLink-domain
   co-location). The composer picks output based on the cluster's
   provisioning.mode. Customers who don't want DRA complication use
   labels; DRA stays optional.

3. Updated gpt-oss-20b example to demonstrate the simpler matchLabels
   path (single GPU, hopper family). kimi-k2 + qwen3-coder remain on
   the matchAttributes / DRA path showing typed constraints.

4. New "Open questions (Nic to call)" section in the 1-pager
   collecting design decisions still up for alignment: scheduler
   default, label-vs-DRA dual path, requires.engineFeatures rename,
   dedicated-SaaS placement, ModelObjective intent layer, vLLM
   recipe consumption, WG-DM engagement.
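
To make the dual-path matching in item 2 concrete, here is a sketch of how a composer might pick its output from the cluster's provisioning.mode. The mode value `"dra"`, the returned dict shapes, and the decision to reject matchAttributes on a non-DRA cluster are assumptions; `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name. The `deviceRequest` shape is a simplified stand-in, not the real ResourceClaim schema.

```python
def render_device_constraints(selector: dict, provisioning_mode: str, gpu_count: int) -> dict:
    """Translate a deviceClaim selector into pod-level constraints."""
    labels = dict(selector.get("matchLabels", {}))
    attributes = dict(selector.get("matchAttributes", {}))

    if provisioning_mode == "device-plugin":
        # Labels-only path: plain nodeSelector plus an extended-resource limit.
        if attributes:
            raise ValueError("matchAttributes requires a DRA-capable cluster")
        return {
            "nodeSelector": labels,
            "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
        }

    if provisioning_mode == "dra":  # hypothetical mode name
        # DRA path: typed attributes travel with the device request.
        return {
            "nodeSelector": labels,
            "deviceRequest": {"count": gpu_count, "attributes": attributes},
        }

    raise ValueError(f"unknown provisioning mode: {provisioning_mode!r}")
```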

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both the in-cluster scheduler and the inference backend are pluggable
under Modelplane. Spelled out the contracts:

1. InferenceCluster.spec.scheduler.type:
   managed-kueue | kueue | kai | volcano | none.
   Modelplane composes admission CRs per scheduler (Workload for Kueue,
   PodGroup for KAI / Volcano) and reads capacity-signal status fields.

2. InferenceCluster.spec.backend.{type, version}:
   managed-kserve | kserve | dynamo | raw-vllm + version pin.
   A backend adapter watches ModelPlacement (the IR) and renders
   backend-specific upstream objects: LLMInferenceService for KServe,
   DynamoGraphDeployment for Dynamo, Deployment+Service for raw-vllm.
   Adapter writes back to ModelPlacement.status.rendered.

Both follow the same pattern: opinionated default install (managed-X)
+ BYO contract for customers with existing investments. v1 ships Kueue
+ KServe adapters; KAI / Volcano / Dynamo are future contributions.
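
The two contracts above amount to a pair of lookup tables. A sketch, using the enum values and object kinds listed in this commit message; the `plan_objects` helper, the nested-dict layout of the spec, and treating `none` as "emit no admission CR" are assumptions.

```python
from typing import Optional

# scheduler.type -> admission CR kind (None means no admission CR; assumed).
ADMISSION_KIND: dict[str, Optional[str]] = {
    "managed-kueue": "Workload",
    "kueue": "Workload",
    "kai": "PodGroup",
    "volcano": "PodGroup",
    "none": None,
}

# backend.type -> upstream kinds the backend adapter renders.
BACKEND_KINDS: dict[str, list[str]] = {
    "managed-kserve": ["LLMInferenceService"],
    "kserve": ["LLMInferenceService"],
    "dynamo": ["DynamoGraphDeployment"],
    "raw-vllm": ["Deployment", "Service"],
}


def plan_objects(cluster_spec: dict) -> list[str]:
    """List the object kinds composed for one placement on this cluster."""
    scheduler = cluster_spec["scheduler"]["type"]
    backend = cluster_spec["backend"]["type"]
    if scheduler not in ADMISSION_KIND:
        raise ValueError(f"unsupported scheduler.type: {scheduler!r}")
    if backend not in BACKEND_KINDS:
        raise ValueError(f"unsupported backend.type: {backend!r}")
    kinds = list(BACKEND_KINDS[backend])
    admission = ADMISSION_KIND[scheduler]
    if admission is not None:
        kinds.append(admission)
    return kinds
```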

Doc updates:
- New "Pluggable substrate" section in 1-pager (between Architecture
  and Fleet-level capabilities) with the symmetry table
- New scheduler / backend fields on InferenceCluster XRD + example
- ModelPlacement.status.rendered docstring expanded — clarifies it's
  the seam between IR and backend adapters
- README claim-cascade section updated with the pluggable-substrate
  framing
- Open questions: added "BYO contract details" entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…presentation'

The IR isn't a new abstraction we invented for the scheduler design. It's
the role the existing ModelPlacement CRD (already in apis/modelplacements/)
plays — the seam between the matcher's output and the version-pinned
backend adapter's input. Made this explicit across the doc:

- modelplacement.yaml header: "existing CRD ... expanded here to play the
  role of the **intermediate representation (IR)** ... isn't a new
  abstraction — it's the role this existing CRD plays"
- modeldeployment.yaml header: notes ModelPlacement is the existing CRD
  playing the IR role
- 1-pager architectural decisions: spells out "intermediate representation
  (IR) — the seam between the matcher and the version-pinned backend
  adapter" with attribution to apis/modelplacements/
- 1-pager Modelplane-ships list: notes ModelPlacement is existing in
  apis/modelplacements/
- 1-pager pluggable substrate: spells out "intermediate representation"
  on first use in this section
- README claim-cascade section: spells out IR + notes the existing CRD
- README directory listing: notes "existing CRD; plays the IR role"
- inferencecluster.yaml backend comment: spells out "intermediate
  representation"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… notes

Bassam's three latest comments (on gpt-oss-20b.yaml):

1. Drop spec.replicas: 0 — KEDA-managed via the scale subresource;
   user authors scaling block only.
2. Confirm labels-first / DRA-as-break-glass framing — already
   demonstrated in this example; reword the doc to match.
3. Drop requiredEngineFeatures — not needed for the simple case.

Broader refactor — doc is now framed as overall Modelplane API design
(scheduler still the heaviest piece) rather than scheduler-only:

- Renamed design/scheduler-deliverables/ → design/modelplane-api/
- Renamed design/scheduler-1pager.md → design/modelplane-api.md
- Title: "Modelplane API Design — 1-pager"
- TL;DR expanded with adapter/plugin substrate + CapabilityVocab
  managed-catalog as first-class points
- Labels-first / DRA-break-glass made explicit throughout (XRD
  selector docstring, 1-pager architectural decisions, README)
- Capability vocabulary section expanded — Modelplane ships the
  canonical catalog (chip generations, engine versions, quantization,
  KV tiers, fabric ordering); customers override per-cluster for
  bespoke. Flagged as a candidate for an Upbound-managed commercial
  offering (keeping the catalog current is bounded high-leverage work)
- Pluggable substrate section renamed "Adapter / plugin substrate"
  and tightened — managed defaults (managed-kserve, managed-kueue)
  ship with Modelplane; BYO contracts let customers plug in KAI /
  Volcano / Dynamo via the IR seam. The IR (ModelPlacement, the
  existing apis/modelplacements/ CRD) is the contract between
  matcher and adapter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… appendix

The README inside modelplane-api/ duplicated the design doc's framing.
Absorb the still-useful bits ("what's deliberately incomplete" and the
"where each XRD lands" mapping) into the design appendix so the design
doc is the single source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dennis-upbound changed the title from "design: scheduler & capability model 1-pager + deliverables (preview)" to "design: Modelplane API 1-pager + design-time preview" on May 7, 2026
Dennis Ramdass and others added 3 commits May 7, 2026 07:48
…iagram

Doc cleanup:
- Drop redundant 'Why this matters for BYO' paragraph; fold into prior two.
- Architecture box: surface scheduler/backend defaults, drop the
  hard-coded 'composes LLMIS' line (that's a backend concern).
- Open questions: drop two that are already resolved (BYO contract shape
  is in the substrate table; requires.engineFeatures rename happened —
  field is already requiredEngineFeatures).
- Risks: byo-kserve -> kserve to match the InferenceCluster.spec.backend.type
  enum.

XRD cleanup:
- capabilityvocabulary.yaml: requires.engineFeatures -> requiredEngineFeatures
  (matches the ModelDeployment field name).

Diagram:
- diagram.excalidraw adapted from @bassam's whiteboard. Same overall
  layout (APIs / Matching / Example fleet topology) with current naming
  applied: drop nodeClaim (collapsed to two-level cascade); CapabilityVocabulary
  cluster-only (not namespace); add scheduler.type and backend.{type,version}
  on InferenceCluster; flag dual-path matching (matchLabels primary,
  matchAttributes DRA break-glass); ModelService labelled routing-only.
  Credit attribution at top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Directory rename to make the 'this is a proposal, not the API yet'
  framing explicit. All links in the design doc updated.
- Diagram: rebuilt from Bassam's whiteboard with much tighter text in
  the API/Matching boxes (fits the original rectangle widths now,
  smaller font on the YAML), CamelCase on the example resource labels
  to match our actual CRD names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverting my edits to the excalidraw — putting Bassam's original in
place of my fiddled version. Doc reference updated to credit him as
the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread design/proposed-modelplane-api/examples/kimi-k2.yaml Outdated
Comment thread design/proposed-modelplane-api/examples/kimi-k2.yaml Outdated
Comment thread design/proposed-modelplane-api/examples/modelservice-together.yaml Outdated
Comment thread design/proposed-modelplane-api/examples/qwen3-coder.yaml Outdated
Dennis Ramdass and others added 5 commits May 7, 2026 09:41
…acro

Make the DRA distinction explicit. Modelplane is a federation planner —
it evaluates predicates against declared pool capacity before any nodes
exist. DRA is a runtime allocator — drivers introspect real hardware
post-provisioning. We borrow DRA's vocabulary (typed attributes, domain-
prefixed keys, CEL, the device.attributes[domain].name access pattern)
but not its Kinds (DeviceClass / ResourceSlice / ResourceClaim) at the
federation layer.

Specifically:
- New "Two-stage scheduling" section in the design doc, with a
  borrow/drop table and the BYOC-vs-Modelplane-provisioned grounding
  contract for when the backend adapter emits real ResourceClaims.
- Rename clusterClaim -> clusterSelector and deviceClaim -> deviceSelector
  across the doc, XRDs, and examples. "Claim" implies allocation, which
  is wrong for what we do pre-provisioning.
- Flatten the redundant inner `selector:` nesting on both selectors —
  matchLabels / matchAttributes / cel are now top-level fields on the
  selectors. Cleaner reads after the rename.
- Drop "DRA-shaped" / "DRA-typed" framing. Replace with "typed attribute
  predicates evaluated against declared pool attributes."
- Open questions: re-pose label-vs-attribute matching path (was DRA-vs-
  label); add the DRA grounding contract question.
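
When the backend adapter does emit a real ResourceClaim, the typed predicates have to become a CEL expression that uses the `device.attributes[domain].name` access pattern mentioned above. A minimal sketch of that derivation for equality predicates only; the `to_cel` name and the flat `domain/name -> value` input shape are assumptions, and richer operators are left out.

```python
import json


def to_cel(match_attributes: dict[str, object]) -> str:
    """Build a CEL conjunction from domain-prefixed attribute equalities."""
    clauses = []
    for key in sorted(match_attributes):  # sorted for stable output
        domain, _, name = key.rpartition("/")
        if not domain or not name:
            raise ValueError(f"attribute key {key!r} needs a domain prefix")
        # json.dumps yields literals CEL accepts: "str", true/false, integers.
        literal = json.dumps(match_attributes[key])
        clauses.append(f'device.attributes["{domain}"].{name} == {literal}')
    return " && ".join(clauses)
```

For example, `{"gpu.nvidia.com/architecture": "hopper"}` becomes `device.attributes["gpu.nvidia.com"].architecture == "hopper"`.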

Instance-type macro:
- New `instanceTypes` field on CapabilityVocabulary. Each entry has a
  canonical name (H100-NVL-8x-IB400), `expands` to a set of typed
  attributes, plus `aliases` for per-cloud SKU strings (aws:p5.48xlarge).
- Customers match on either dimension — high-level string for the common
  case, unpacked attributes for unusual constraints. Same predicate
  engine, vocab does the unpacking.
- Default catalog seeded with H100/H200/B200/MI300X. Per-cloud SKU
  taxonomy is exactly the bounded ongoing work that fits the managed
  catalog offering.
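
A sketch of how the vocabulary could do the unpacking. The canonical name and the AWS alias come from the text above; the attribute values under `expands` and both helper names are illustrative assumptions, not the seeded catalog.

```python
# Hypothetical one-entry vocabulary; attribute values are illustrative.
VOCAB = {
    "H100-NVL-8x-IB400": {
        "expands": {"vendor": "nvidia", "product": "H100", "gpuCount": 8},
        "aliases": ["aws:p5.48xlarge"],
    },
}


def canonical_name(instance_type: str, vocab: dict) -> str:
    """Resolve a canonical name or per-cloud SKU alias to the canonical name."""
    for name, entry in vocab.items():
        if instance_type == name or instance_type in entry["aliases"]:
            return name
    raise KeyError(f"unknown instance type: {instance_type!r}")


def expand_instance_type(instance_type: str, vocab: dict) -> dict:
    """Return the typed attributes a macro stands for, plus the macro name."""
    name = canonical_name(instance_type, vocab)
    attributes = dict(vocab[name]["expands"])
    # Keep the high-level string matchable alongside the unpacked attributes.
    attributes["instanceType"] = name
    return attributes
```

Because the expansion keeps both the macro name and the unpacked attributes, the same predicate engine can serve a selector on `instanceType` and a selector on `gpuCount`.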

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex discipline

Every field on the user-facing API now has at least one named consumer
(matcher / composer / backend adapter / gateway). Where it wasn't
obvious before, comments now spell out who reads what for what purpose.
Where there's no consumer, the field's gone.

Concrete changes:

- Drop ModelDeployment.spec.requiredEngineFeatures. Required features
  are derived from a more concrete declaration:
    roles present                  -> prefill-decode-disagg
    engine.quantization.target=kv  -> fp8-kv-cache
    engine.speculation.type=NGram  -> ngram-speculation
    engine.optimizations.*         -> chunked-prefill / prefix-caching /
                                      kv-cache-routing
    adapters[] non-empty           -> multi-lora
    parallelism.expert: enabled    -> expert-parallelism
  Matcher unions these at federation time and matches against each
  InferenceCluster's KServeBackend.spec.engine.features. Single source
  of truth: declare what you want; matcher derives what backend features
  that requires.

- New typed engine.optimizations field — chunkedPrefill, prefixCaching,
  kvCacheRouting. Promotes commonly-used names from the advanced[]
  break-glass to a typed shape. Backend adapter translates to engine
  flags (vLLM --enable-chunked-prefill, etc.).

- ModelDeployment XRD: top-level Field-level consumer index spelling
  out who reads each field. parallelism, roles, adapters get explicit
  "Modelplane-canonical, backend adapter translates" framing.

- ModelService.spec.supportedModels: documented as consumed by
  ModelEndpoint route filtering, with an auto-refresh plan (controller
  polls the SaaS provider's /v1/models catalog API).

- ModelDeployment.spec.adapters: documented two consumers — backend
  adapter (engine LoRA load via vLLM --lora-modules) and gateway
  (LoRA-aware request routing).

- ModelEndpoint XRD: explicit framing that a Deployment route covers
  ALL of an MD's ModelPlacements transitively (every replica on every
  cluster the matcher placed them on). Cross-cluster spread = multiple
  MDs + multiple route entries.

- ModelPlacement.spec.requiredEngineFeatures -> derivedFeatures (carried
  on the IR for the backend adapter to verify support before rendering;
  not user-authored).

- Examples (kimi-k2, qwen3-coder, gpt-oss-20b, modelservice-together)
  re-commented inline pointing at consumers; required-feature derivation
  spelled out in kimi-k2 and qwen3-coder headers.
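
The feature derivation in the first bullet can be sketched as a pure function over the ModelDeployment spec. The feature names and trigger fields are the ones listed in that bullet; the nested-dict spec layout, the reading of `parallelism.expert: enabled` as a string value, and the function name are assumptions.

```python
# engine.optimizations field -> backend feature it requires.
OPTIMIZATION_FEATURES = {
    "chunkedPrefill": "chunked-prefill",
    "prefixCaching": "prefix-caching",
    "kvCacheRouting": "kv-cache-routing",
}


def derive_features(spec: dict) -> set[str]:
    """Union the backend features implied by what the deployment declares."""
    engine = spec.get("engine", {})
    features: set[str] = set()
    if spec.get("roles"):
        features.add("prefill-decode-disagg")
    if engine.get("quantization", {}).get("target") == "kv":
        features.add("fp8-kv-cache")
    if engine.get("speculation", {}).get("type") == "NGram":
        features.add("ngram-speculation")
    optimizations = engine.get("optimizations", {})
    for field_name, feature in OPTIMIZATION_FEATURES.items():
        if optimizations.get(field_name):
            features.add(feature)
    if spec.get("adapters"):
        features.add("multi-lora")
    if spec.get("parallelism", {}).get("expert") == "enabled":  # assumed encoding
        features.add("expert-parallelism")
    return features
```

A cluster would then be eligible when `derive_features(spec)` is a subset of its KServeBackend's declared engine features.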

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loud SKUs

Bassam's "GPU hardware survey and unified taxonomy" (Notion, 2026-05-07)
proposes a four-layer model — Cluster / Pool / Device / dynamic state —
with capability sets (not boolean columns), predicates over equality,
architecture as metadata while capability flags do the matching work,
and rack-scale (NVL72) as its own addressable unit.

This patch lands the taxonomy in the default CapabilityVocabulary and
in the existing examples, adds a reference-clusters/ directory with
pre-generated InferenceClusters for known cloud SKUs, and reframes the
canonical-catalog work as the wedge for a managed-catalog commercial
offering layered with continuous testing & benchmarking.

Vocabulary changes (default CapabilityVocabulary):

  Cluster layer:  cloud.{provider, region}, network.{fabric,
                  bandwidthGbps, airgapped}, cluster.scaleUnit
                  (independent-nodes | superpod | nvl72)
  Pool layer:     cloud.instanceType, gpuCount, interconnect.{type,
                  bandwidthGBs}, cpu.{vendor, cores, platform},
                  memoryGiB, nics.{count, bandwidthGbps},
                  host.virtualization
  Device layer:   vendor, product, architecture, formFactor, vramGiB,
                  mig (bool), capabilities (set), parentProduct
                  (for fractional GPUs / MIG entries)

Drop the conflated `interNodeFabric` ordered-string in favor of
network.fabric + network.bandwidthGbps (RoCE vs IB are distinct
protocols on the same hardware — OCI's pattern shows why).

Add `set` and `bool` types to attributeKeys.type enum.

Instance-type macros reseeded with Bassam's catalog rows:
H100-NVL-8x, H200-NVL-8x, B200-NVL-8x, B300-NVL72, GB200-NVL72,
MI300X-8x, L40S-8x, A100-80GB-8x. Each includes per-cloud SKU aliases
(aws/gcp/azure/oci/coreweave/lambda/dgx).

Reference clusters (new):
  - aws-p5-48xlarge.yaml          AWS H100, EFA
  - gke-a3-mega-8g.yaml           GCP H100, RoCE
  - oci-bm-gpu-mi300x-8.yaml      OCI MI300X bare metal, RoCE
  - coreweave-gb300-nvl72.yaml    rack-scale Blackwell Ultra
Customers copy-paste or compose; updated as new SKUs land. Anchors
the managed-catalog commercial offering. Follow-up: a Crossplane
provider that polls cloud SKU APIs and generates these programmatically
(removes the "keep labels up to date by hand" burden).

Existing examples updated to use the new vocab:
  - kimi-k2 / qwen3-coder: matchAttributes use vramGiB predicates +
    capabilities set instead of architecture enum (keeps AMD eligible
    where the workload doesn't actually depend on Hopper specifically)
  - gpt-oss-20b: --enable-prefix-caching pulled from engine.args into
    engine.optimizations.prefixCaching for consistency
  - inferencecluster-prod-coreweave: full 4-layer attribute split
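
To show why vramGiB predicates plus a capabilities set keep AMD eligible where an architecture enum would not, here is a small predicate evaluator. The operator names (`eq`, `gte`, `containsAll`), the predicate dict shape, and the device attribute values are all assumptions for illustration; the real selector schema lives in the XRDs.

```python
def matches(attributes: dict, predicates: dict) -> bool:
    """Evaluate typed predicates against one device's declared attributes."""
    for key, predicate in predicates.items():
        if key not in attributes:
            return False  # undeclared attributes never match
        value = attributes[key]
        for op, operand in predicate.items():
            if op == "eq":
                ok = value == operand
            elif op == "gte":
                ok = value >= operand
            elif op == "containsAll":
                ok = set(operand) <= set(value)  # set-typed attribute
            else:
                raise ValueError(f"unknown operator: {op!r}")
            if not ok:
                return False
    return True


# Illustrative device attributes, not the shipped catalog.
H100 = {"vendor": "nvidia", "architecture": "hopper", "vramGiB": 80, "capabilities": {"fp8", "bf16"}}
MI300X = {"vendor": "amd", "architecture": "cdna3", "vramGiB": 192, "capabilities": {"fp8", "bf16"}}
L40S = {"vendor": "nvidia", "architecture": "ada", "vramGiB": 48, "capabilities": {"fp8", "bf16"}}

WORKLOAD = {"vramGiB": {"gte": 80}, "capabilities": {"containsAll": ["fp8"]}}
```

With these inputs the workload matches both the H100 and the MI300X and rejects the L40S on memory, while an `architecture: {eq: hopper}` predicate would have excluded the MI300X.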

Doc updates:
  - New "Hardware taxonomy & reference clusters" section folding in
    Bassam's survey, the four-layer model, key design choices
    (capability sets, predicates over equality, rack-scale as own unit,
    RoCE vs IB), the instance-type macro pattern, the static-vs-provider
    rollout for reference clusters.
  - Commercial-offering framing extended: tracking + reference clusters
    + continuous testing & benchmarking. Each reference cluster paired
    with a tested-and-benchmarked workload run on every supported model
    family — costly to maintain, exactly what customers will pay for.
  - Open Qs: rack-scale capacity-unit question, reference-cluster
    rollout ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l errors

Cross-checked every example end-to-end against the unified taxonomy.
The new vocab landed cleanly but the existing examples weren't fully
re-aligned. Fixed all the friction points before another round of
review:

clusterSelector / deviceSelector layering:
  - kimi-k2: network.fabric and network.bandwidthGbps are Cluster-layer
    attributes (live on InferenceCluster.spec.attributes), not Pool/Device.
    Moved from deviceSelector → clusterSelector.
  - kimi-k2 / qwen3-coder / gpt-oss-20b: tier was matched via matchLabels,
    but it's an attribute (lives in spec.attributes, not metadata.labels).
    Switched to matchAttributes.

Misleading or wrong constraints:
  - kimi-k2 / qwen3-coder: dropped the gpu.nvidia.com/nvlinkDomain
    constraint. For HGX baseboards (8 GPUs / node), NVLink-domain
    co-location is implied by interconnect.type: nvswitch + perNode: 8;
    the explicit constraint was redundant. Also fixed the misleading
    "keeps AMD MI300X eligible" comment that was contradicted by the
    NVIDIA-specific constraint key.

Stale vocab keys:
  - modelservice-together: replaced modelplane.ai/region →
    cloud.region, modelplane.ai/provider → cloud.provider, dropped
    modelplane.ai/networkAccess (not in vocab).
  - inferencecluster-prod-coreweave: dropped modelplane.ai/failureDomain
    (not in vocab and not used by any matcher).
  - cloud.provider enum extended with SaaS providers (togetherai,
    baseten, bedrock, fireworks, modal) so ModelService.attributes
    validates.

Factual errors in reference clusters:
  - coreweave-gb300-nvl72: gpuCount 2 → 4 (gb300-4x is "2 superchips,
    1 Grace + 2 B300 each" = 2 Grace + 4 GPUs per Bassam's survey).
    Fixed interconnect.type from nvlink-c2c (Grace↔GPU coherent
    memory only) to nvswitch (NVLink Switch, GPU↔GPU at rack scale).
    Added cpu.cores: 144 (2× 72-core Grace) and memoryGiB: 960.

Cross-namespace routing:
  - assistant-endpoint referenced MDs in three different namespaces
    (research, dev-tools, app-team) without explicit namespace on the
    refs. Per "namespace = environment / lifecycle scope", routing
    across namespaces breaks the model. Consolidated all MDs +
    ModelService + ModelEndpoint into the same namespace (app-team).

Vocab macro fixes:
  - B300-NVL72 / GB200-NVL72 macros conflated per-instance and per-rack
    (had cluster.scaleUnit: nvl72 inside Pool-layer macros). Macros are
    Pool-layer (per-host); rack-scale belongs on Cluster-layer
    (cluster.scaleUnit on InferenceCluster.attributes). Renamed to
    B300-Grace-4x / B200-Grace-4x with gpuCount: 4. The fact they
    sit in NVL72 racks is captured in the InferenceCluster's
    cluster.scaleUnit attribute, not the macro.

InferenceCluster XRD comment cleanup:
  - "level 1 of three-level cascade" → "Cluster-layer attributes"; same
    for Pool / Device layers. Aligned with the unified taxonomy framing.
    Examples in description text updated to current vocab keys.

Doc:
  - Appendix example descriptions updated to match the new selectors
    (Kimi K2 demonstrates predicates not "DRA break-glass"; qwen3-coder
    headlines its acme.example/* user-defined attributes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc:
- Trim wordiness throughout. Down from 432 to ~385 lines without
  losing substance. Mostly cuts of repetition between TL;DR and body,
  collapsing prose to tables, removing orphan bullets.
- Promote ModelService-is-Nic-sketch framing. Was buried in
  parentheticals; now a named "ModelService is a sketch" subsection
  in the API shape and a TL;DR bullet.

Examples (covering gaps the existing set didn't):
- examples/reference-clusters/eks-h100-no-dra.yaml: BYOC EKS on K8s 1.31
  without a DRA driver. provisioning.mode: device-plugin; backend
  adapter constrains pods via nodeSelector + nvidia.com/gpu, no
  ResourceClaim emission. Demonstrates the labels-first path concretely
  and pairs with gpt-oss-20b.yaml's matchLabels usage.
- examples/kimi-k2-eu.yaml: EU-region sibling of kimi-k2.yaml. Pinned
  via cloud.region + modelplane.ai/compliance: [gdpr]. Concrete
  multi-region pattern.
- examples/multi-region-endpoint.yaml: ME routing across kimi-k2 +
  kimi-k2-eu + together-prod with weighted SaaS spillover.
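
Weighted routing across the three targets can be sketched as a cumulative-weight lookup. The route names come from the example above; the weights, the `pick_route` helper, and the idea that the gateway draws a uniform point per request are assumptions, not what multi-region-endpoint.yaml specifies.

```python
import bisect
import itertools

# Hypothetical weights; the SaaS route gets a small spillover share.
ROUTES = [("kimi-k2", 60), ("kimi-k2-eu", 30), ("together-prod", 10)]


def pick_route(routes: list[tuple[str, int]], point: float) -> str:
    """Map a uniform point in [0, 1) to a route in proportion to its weight."""
    if not 0.0 <= point < 1.0:
        raise ValueError("point must be in [0, 1)")
    cumulative = list(itertools.accumulate(weight for _, weight in routes))
    index = bisect.bisect_right(cumulative, point * cumulative[-1])
    # Guard against floating-point rounding at the top of the range.
    return routes[min(index, len(routes) - 1)][0]
```

In use, a caller would pass `random.random()` as `point`; taking the point as an argument keeps the selection deterministic under test.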

Vocab:
- New L40S-4x macro for SKUs with 4 L40S (oci:BM.GPU.L40S.4,
  aws:g6e.{12,24}xlarge). Was incorrectly aliased under L40S-8x before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
negz and others added 8 commits May 7, 2026 16:56
Signed-off-by: Nic Cope <nicc@rk0n.org>
Refine the design/api-update.md sketch following discussion of the
scaling model, disaggregated serving, and the routing surface:

- Rename ModelPlacement to ModelReplica throughout. Each replica is one
  complete serving instance — single-node, multi-node via LWS, or full
  prefill/decode disagg. Mirrors Deployment -> Pod naming.

- Drop spec.scaling from ModelDeployment. Autoscaling is opt-in via a
  separate KEDA ScaledObject targeting the MD's scale subresource —
  same pattern as Deployment + HPA. Add a worked Mixtral example with
  ScaledObject alongside.

- Add a discriminated-union pattern for disaggregated prefill/decode.
  A serving profile is either unified (root poolSelector / parallelism
  / engine) or disagg (explicit decode and prefill blocks, each
  self-contained — no inheritance from the root). Decode and prefill
  must land on the same InferenceCluster but can target different pools.

- Move inter-node networking onto InferenceClass instead of cluster-
  level capabilities. Different networking implies a different class
  (h200-nvl-8x-ib vs h200-nvl-8x); networking belongs to the pool that
  uses it. Drop spec.capabilities from InferenceCluster — cluster-level
  metadata is captured as standard Kubernetes labels.

- Lift clusterSelector to deployment level. Profiles only carry pool
  selection and per-pool composition, since the cluster intent doesn't
  change between fallback profiles.

- Switch ModelService routing to a single spec.endpoints[] pattern (was
  separate selector vs routes paths). One mechanism for both simple and
  weighted routing.

- Drop spec.model.name in favor of metadata.name as the served model
  identifier. The HuggingFace repo (or other source) is purely where
  weights come from, not the model's identity.

- Add YAML comments throughout the examples explaining what each field
  does — what gets matched, what gets composed, what's optional.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Different hardware targets typically require different model weight
checkpoints (FP8 vs BF16 are different HuggingFace repos). That makes
fallback profiles within one deployment the wrong abstraction — they're
genuinely different deployments. Silent degradation (falling back to a
config with lower context length or different quantization) is also
arguably worse than explicit failure.

This commit flattens the serving profile array. poolSelector,
parallelism, and engine are now top-level fields on
ModelDeployment.spec. Different hardware configurations are separate
ModelDeployments behind one ModelService. The deployer makes explicit
decisions about which configurations to run.

The disaggregated prefill/decode pattern is now a discriminated union
on ModelDeployment itself: either root-level poolSelector/parallelism/
engine (unified) or explicit decode/prefill blocks (disaggregated).
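
A minimal sketch of how such a discriminated union could be checked, assuming the spec arrives as a plain dict. The function name `serving_mode` and the rule that a disaggregated spec needs both `decode` and `prefill` are illustrative assumptions, not the validation shipped in the XRD:

```python
# Field names come from the commit message; everything else is hypothetical.
UNIFIED_KEYS = {"poolSelector", "parallelism", "engine"}
DISAGG_KEYS = {"decode", "prefill"}


def serving_mode(spec: dict) -> str:
    """Classify a ModelDeployment-like spec as 'unified' or 'disaggregated'."""
    has_unified = bool(UNIFIED_KEYS & spec.keys())
    has_disagg = bool(DISAGG_KEYS & spec.keys())
    if has_unified and has_disagg:
        # The two arms are mutually exclusive: no inheritance from the root.
        raise ValueError("set root-level fields or decode/prefill, not both")
    if has_disagg:
        missing = DISAGG_KEYS - spec.keys()
        if missing:
            # Assumption: a disaggregated spec must carry both roles.
            raise ValueError(f"missing role block(s): {sorted(missing)}")
        return "disaggregated"
    if has_unified:
        return "unified"
    raise ValueError("spec sets neither unified nor disaggregated fields")
```

Rejecting the mixed case outright keeps the two arms self-contained, which matches the "no inheritance from the root" intent.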

If preferential scheduling is needed later, it would be a coordination
mechanism between ModelDeployments, not inline profiles.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The parallelism block describes more than just parallelism — it's the
complete compute topology of a role within one ModelReplica. Renaming
to topology makes room for the per-role instance count (instances)
which describes replication rather than sharding.

For disaggregated prefill/decode, the P:D ratio (e.g., 5P3D) is the
number of independent instances per role within one ModelReplica. This
is a topology parameter — fixed per deployment, not a scaling knob. It
maps to KServe's LLMInferenceService.spec.replicas (decode) and
spec.prefill.replicas (prefill). For unified serving, instances
defaults to 1 and can be omitted.
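
As a rough illustration of that mapping, the sketch below builds only the replica-count fields of an LLMInferenceService-shaped dict. The helper name `llmis_replica_fields` is hypothetical, and the real renderer emits far more than these two fields:

```python
from typing import Optional


def llmis_replica_fields(decode_instances: int = 1,
                         prefill_instances: Optional[int] = None) -> dict:
    """Map per-role instance counts onto spec.replicas / spec.prefill.replicas."""
    spec: dict = {"replicas": decode_instances}  # decode count
    if prefill_instances is not None:
        # Only disaggregated serving carries a prefill block.
        spec["prefill"] = {"replicas": prefill_instances}
    return {"spec": spec}


# 5P3D: five prefill instances, three decode instances.
five_p_three_d = llmis_replica_fields(decode_instances=3, prefill_instances=5)
```

For unified serving the call with no arguments yields a single replica and no prefill block.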

Other changes in this commit:

- Require DRA on all InferenceClusters. Drop nodeSelector from pool
  declarations — DRA handles device-to-node binding. Pools are now
  just name, class, and maxNodes.

- Rename poolSelector to nodeSelector on ModelDeployment. With DRA
  required and nodeSelector gone from pools, the naming collision is
  resolved.

- Replace driver.version with cuda.toolkit as the typed capability
  example on InferenceClass. Driver version is a runtime property of
  the cluster, not the hardware SKU. CUDA toolkit version is a better
  example of where {type: version} decoration matters.

Signed-off-by: Nic Cope <nicc@rk0n.org>
…nancy

- scheduler.type: auto (default), managed-kai, managed-kueue added to
  InferenceCluster. NVIDIA pools auto-resolve to managed-kai; non-NVIDIA
  to managed-kueue. BYOC detects existing install.
- New section: ModelDeployment placement walkthroughs — single-node TP,
  multi-node TP+PP via LWS, P/D disaggregation, KEDA + matcher loop.
- New section: Multi-tenancy — bin-packing, MIG, time-slicing all enabled
  at the pool layer; MD spec stays portable across sharing modes.
- New example: managed-gke-a3-kai.yaml (explicit managed-kai pin).
…shape to #64

- Rename ModelPlacement → ModelReplica everywhere (XRD, status fields,
  printer columns, doc, examples). Aligns with the "replica == placement"
  mental model. Pure rename + role expansion.
- Add Federation-layer scheduling section: composer / matcher / backend
  adapter / capacity adapter contracts, the actual matcher pseudocode,
  out-of-scope items with effort + uncertainty, ordering reasoning.
- Add Plugin/adapter system section: six adapter axes, version-pinned
  KServe absorbing schema churn, end-to-end managed + BYOC examples.
- Add BYOC scheduling section: onboarding flow, what "managed" means
  axis-by-axis, edge cases (no DRA, multiple schedulers, RBAC).
- Expand autoscaling walkthrough: KEDA / composer / matcher / backend
  loop with concrete scale-up + scale-down sequences.
- Replace v1/v2 framing with effort sizes + ordering + uncertainty.
- Defer API shape, hardware taxonomy, engine-features detail to #64.
- Remove xrds/ (and lint.py / LINT.md) — API shape lives in #64.
  Examples stay as illustrative scheduling-relevant YAML.
- New section: "What we treat as IR" — three IRs (ModelReplica explicit;
  cluster substrate + endpoint binding implicit today). Argues why naming
  IR seams is what makes BYO-* cheap.
- New section: "Crossplane lifecycle layers" — per-layer XR ownership is
  what enables pause/resume, GitOps drift, RBAC boundaries, version-skew
  handling per cluster.
- Reframe plugin/adapter axes: be honest the count is contingent. Two
  user-visible axes (scheduler, backend); the other four are internal /
  collapsible. Don't read into the number.
- New section: "User-facing surface preview" with Quickstart (4 CRs,
  ~60 lines of YAML to a working curl) + 5 Advanced scenarios as deltas
  (multi-region, BYOC+KAI, P/D disagg, custom InferenceClass, spillover).
  Goal: gauge complexity of the proposed scheduling design from the
  user's seat.
Replaces the single 1291-line modelplane-api.md with a README index and four focused docs:

- README.md — index / TOC pointing at the right doc per audience
- quickstart.md — 4 CRs to a working curl (~120 lines)
- advanced.md — 5 common scenarios as deltas (~210 lines)
- scheduling.md — operator's reference: two-stage scheduling, federation
  matcher, in-cluster KAI/Kueue, multi-tenancy, BYOC behavior, placement
  walkthroughs (~500 lines)
- design.md — architectural decisions: principles, plugin/adapter system,
  IRs, Crossplane lifecycle layers, risks, open questions, roadmap (~490 lines)

User docs link to design.md for "why"; design.md links to scheduling.md
for "what shows up to users". Same content as before, organized by
audience instead of one monolith.
@dennis-upbound dennis-upbound changed the title design: Modelplane API 1-pager + design-time preview design: Modelplane scheduling & placement (split into quickstart / advanced / scheduling / design) May 8, 2026
…nter

Pivots this PR from design preview to implementation sketch. The code under
functions/ doesn't run (Nic's #64 protos aren't generated yet) but the
shape, dependencies, and use cases are real.

Implementation:
- functions/compose-model-deployment/scheduling.py — federation matcher.
  Plain Python, no Crossplane imports — testable in isolation. Filters ICs
  by clusterSelector.matchLabels, filters pools by nodeSelector.cel against
  class capabilities, capacity check with sticky-placement accounting,
  scores and picks per replica. Topology strategies map to (nodes_per_inst,
  gpus_per_node). Disagg requires same-cluster decode + prefill pools.
- functions/compose-model-deployment/main.py — Crossplane glue. Required-
  resources for clusters/classes/owned MRs, calls scheduling.match(),
  emits ModelReplica + ModelEndpoint per spec.replicas, sets MD conditions.
- functions/compose-model-placement/main.py — renderer. Reads MR + matched
  IC + class(es), composes KServe LLMInferenceService (decode + optional
  prefill) + DRA ResourceClaim(s) on the target cluster via remote-object
  provider. Lifts cold-start conditions back as MR.status.
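
The matcher flow in the first bullet (filter clusters by labels, check capacity, score, pick per replica) can be sketched roughly as below. This is a simplified illustration, not the code in scheduling.py: the `Pool`, `Cluster`, and `place` names are hypothetical, CEL pool filtering is left out, and `free_nodes` is assumed to already include nodes held by this deployment's existing replicas.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    free_nodes: int  # assumed capacity figure, in nodes


@dataclass
class Cluster:
    name: str
    labels: dict
    pools: list = field(default_factory=list)


def place(clusters, match_labels, nodes_needed, replicas, existing=None):
    """Return {replica_index: (cluster, pool)}; unplaceable replicas are omitted."""
    existing = existing or {}
    # Cluster filter: every requested label must match exactly.
    eligible = [c for c in clusters
                if all(c.labels.get(k) == v for k, v in match_labels.items())]
    # Working copy of capacity so earlier replicas reserve nodes for later ones.
    free = {(c.name, p.name): p.free_nodes for c in eligible for p in c.pools}
    placements = {}
    for i in range(replicas):
        # Capacity filter: pools that can still fit one replica.
        candidates = [key for key, n in free.items() if n >= nodes_needed]
        if not candidates:
            continue
        # Score: previous placement first (sticky), then most free nodes,
        # then name order as a deterministic tie-break.
        prev = existing.get(i)
        best = min(candidates, key=lambda k: (k != prev, -free[k], k))
        free[best] -= nodes_needed
        placements[i] = best
    return placements
```

Keeping the function free of Crossplane types is what makes it testable in isolation, which is the property the bullet above calls out.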

Docs:
- design.md compressed to ~180 lines: architecture diagram, what-lives-where
  table, dependencies per function, use cases traced through the code.
- README.md is now a 2-section pointer at design.md + the code.
- Deleted quickstart.md / advanced.md / scheduling.md — served the design
  phase; the code is the new source of truth.
- examples/README.md maps each example to the matcher/renderer code path
  it exercises.

Adapter functions (_load_md, _resolve_clusters, _load_mr, _cel_from_capabilities)
raise NotImplementedError — they're the wiring points that fill in once #64
lands and protos are generated.
@dennis-upbound dennis-upbound changed the title design: Modelplane scheduling & placement (split into quickstart / advanced / scheduling / design) Federation matcher + renderer composition function sketch (against #64 API) May 8, 2026
Adds stage-2 scheduler integration to the renderer + a sketch of the
capacity feedback loop. Same sketch quality as the rest of the PR —
doesn't run, but the dispatch shape, per-scheduler differences, and
capacity-status pipeline are honest.

- functions/compose-model-placement/scheduler.py — per-scheduler wrap
  functions. KAI: schedulerName + PodGroup CRD wrapping the LWS gang
  (minMember = total pods). Kueue: kueue.x-k8s.io/queue-name label +
  suspend gate (Kueue's webhook creates the Workload). none: pass-through.
  Single dispatch table; new schedulers (Volcano, etc.) plug in here.

- functions/compose-model-placement/main.py — wired to call scheduler.wrap()
  after building the base LLM-IS spec. Emits the wrapped spec + any
  scheduler-companion objects (PodGroup) onto the same target cluster
  via the existing remote-object provider.

- lib/capacity_adapter/{__init__,common,kai,kueue}.py — sketch of the
  per-scheduler status pullers. Reads the scheduler's status CRDs
  (KAI Queue/ResourcePool, Kueue ClusterQueue.flavorsUsage[]),
  normalizes into the shared CapacitySnapshot type, writes to
  IC.status.capacity. NOT a Crossplane composition function — runs as
  a separate controller, one per IC. Sketch shows the projection logic;
  K8s client wiring is NotImplementedError stubs.

- design.md updated with a KAI/Kueue section: per-scheduler differences
  table, dispatch wiring diagram, capacity feedback loop diagram, how
  to add a new scheduler. Notes the small API extension needed on
  Nic's #64 (IC.spec.scheduler.type).
Dennis Ramdass added 2 commits May 8, 2026 15:26
Splits each composition function into pure modules (algorithm, dict
builders, dispatch tables — no Crossplane imports) and an orchestrator
main.py with phase-banner comments distinguishing scheduling logic from
Crossplane glue from status / error handling.

compose-model-deployment/:
  scheduling.py    federation matcher (pure)
  adapters.py      proto ⇄ scheduling types (boundary)
  emitters.py      pure dict builders for ModelReplica / ModelEndpoint
  main.py          orchestrator — six labeled phases:
                   REQUIRE → LOAD → MATCH → BUILD → EMIT → STATUS

compose-model-placement/:
  rendering.py     pure LLM-IS / DRA / selector-CEL builders
  scheduler.py     per-scheduler wrap dispatch (KAI / Kueue / none)
  adapters.py      proto ⇄ rendering types (boundary)
  main.py          orchestrator — seven labeled phases:
                   REQUIRE-cluster → REQUIRE-classes → LOAD → RENDER →
                   WRAP → EMIT → STATUS

Each main.py opens with the lifecycle diagram, condition state machines
(Scheduled / ReplicasReady / Ready+cold-start sub-states), and the
error-handling table per phase.

Tests (tests/unit/, runs in ~20ms):
  test_scheduling.py        topology shapes, cluster/pool filters,
                            capacity reservation across replicas,
                            sticky placement, disagg same-cluster,
                            matchTrace shape (28 cases)
  test_scheduler.py         KAI / Kueue / none dispatch, PodGroup
                            minMember math across single-node /
                            multi-node / disagg, queue label, suspend
                            gate (16 cases)
  test_rendering.py         LLM-IS shape (Tensor, TensorPipeline, disagg),
                            DRA selector CEL derivation from class
                            capabilities (15 cases)
  test_capacity_adapter.py  ResourceCount math, write_status round-trip,
                            KAI / Kueue projections, NotImplementedError
                            sanity (10 cases)

Plus pyproject.toml gets pytest config + per-file ARG001 ignores for
intentional dispatch-contract signatures.

Static health: ruff + pyright clean over functions/ + lib/.
69/69 unit tests pass.
@dennis-upbound dennis-upbound changed the title Federation matcher + renderer composition function sketch (against #64 API) WIP: Federation matcher + renderer composition function sketch (against #64 API) May 8, 2026
@dennis-upbound dennis-upbound marked this pull request as draft May 8, 2026 22:38
Dennis Ramdass added 4 commits May 8, 2026 15:50
…uler

Renames + minor refactor for clearer scheduling semantics:

- match() → schedule(), MatchResult → ScheduleResult.
  Both align with K8s SIG-Scheduling's `Schedule(workload) -> ScheduleResult`
  contract and with the existing repo's schedule() name on main.
- _candidates_for_replica → _filter (Filter phase)
- _pick → _score_and_select (Score phase)
- _to_placement → _bind (Bind phase)
- Added explicit "── Filter ──" / "── Score ──" / "── Bind ──" comment
  banners in schedule() so the K8s parallel is visible.
- Strengthened the module docstring to lead with "META-FLEET SCHEDULER —
  fleet-level placement, NOT cluster-level scheduling" so the meta-fleet
  positioning is the first thing a reader sees.
- main.py orchestrator: phase_match → phase_schedule, Phase 3 banner
  renamed MATCH → SCHEDULE.

Tests: 9 occurrences of scheduling.match() → scheduling.schedule(). All
69 still pass.

Plus a new "Delta from existing scheduling on main" section in
design.md — concept-by-concept comparison table covering mental model,
unit of placement, capacity input, pool eligibility, topology,
disaggregation, engine matching, scaling, stickiness, multi-replica
accounting, algorithm structure, result shape, matchTrace, cluster
source, in-cluster integration, lines of code.

Also cleans up __pycache__/ that slipped through git tracking before
the gitignore landed.
This PR won't merge until #64 (or later) lands, so the "delta from main"
table documented terminology nobody will see by then. Replace it
with a "Scheduler properties" section that pins down the load-bearing
behavior in K8s SIG-Scheduling terms: what the scheduler actually
does, with no comparison column.

No code changes; doc only.
Drops the banner-heavy phase orchestration in favor of Nic's existing
style on main (compose-model-deployment is the gold standard):

- Short, declarative module docstrings — no ═══ banner blocks, no big
  lifecycle diagrams.
- `Composer` / `Renderer` classes mirror Nic's shape: __init__ parses XR
  via adapters; compose()/render() chains verb-named methods directly;
  no bool-return phase pattern.
- Methods named after what they do: resolve_inputs, schedule,
  compose_replicas, compose_endpoints, write_status, derive_conditions
  (composer); resolve_inputs, compose_llmis, compose_resource_claims,
  derive_conditions (renderer).
- conditions.set_condition now takes a bool. The old callers passed
  strings, which was a real bug: the non-empty string "False" is truthy,
  so it would have set the condition to True.
- response.warning / response.normal events on transitions:
    "Scheduled N replica(s): r0→east/h200, r1→west/h200"
    "No InferenceClusters in the fleet"
    "Cluster fleet config invalid: ..."
- libresource.update_status writes MD.status.modelReplicas + matchTrace.
- self.rsp.desired.composite.ready = fnv1.READY_FALSE when nothing
  scheduled (matches Nic's pattern of explicit unready).
- Drops manual ownerReferences from emitters — Crossplane sets them
  automatically.
- Drops md_uid / engine / source extra params from build_replica;
  emitters take just (md, placement) now.
- Pure modules (scheduling.py, scheduler.py, rendering.py) get tighter
  docstrings without ═══ banners. K8s SIG-Scheduling vocabulary +
  "fleet" terminology preserved.
- Adds naming.llmis_name / claim_name to lib/naming.py.

Algorithm + behavior unchanged. 69/69 tests still pass. ruff/pyright
clean.
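
The string-versus-bool condition bug fixed in this commit is easy to reproduce in isolation. The two helpers below are hypothetical stand-ins, not the repo's conditions.set_condition; they only show why a string status misbehaves and how a type check could guard against it:

```python
def set_condition_loose(conditions: dict, type_: str, status) -> None:
    """Hypothetical helper that trusts truthiness, as the old call sites did."""
    conditions[type_] = "True" if status else "False"


def set_condition_strict(conditions: dict, type_: str, status: bool) -> None:
    """Hypothetical helper that rejects anything other than a real bool."""
    if not isinstance(status, bool):
        raise TypeError(f"status must be bool, got {type(status).__name__}")
    conditions[type_] = "True" if status else "False"


conds: dict = {}
set_condition_loose(conds, "Scheduled", "False")  # non-empty string is truthy
loose_result = conds["Scheduled"]                 # ends up "True"

set_condition_strict(conds, "Scheduled", False)
strict_result = conds["Scheduled"]                # ends up "False"
```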
Pulls the per-scheduler dispatch system + capacity adapters out of this
MR so the federation algorithm + IR boundary are the focus of review.

Removed:
  - functions/compose-model-placement/scheduler.py (KAI/Kueue/none dispatch)
  - lib/capacity_adapter/{kai,kueue,common,__init__}.py (status pullers)
  - tests/unit/test_scheduler.py
  - tests/unit/test_capacity_adapter.py
  - ClusterView.scheduler_type (no dispatch needed)

Added (inline, in rendering.py):
  - rendering.with_kai_gang(spec, mr_name, ns) → KaiBundle
    Stamps schedulerName + pod-group label, emits the PodGroup with
    minMember = total pod count (decode + prefill if disagg).
  - rendering.gang_size(spec) → int (extracted from the inline helper)
  - KAI tests folded into tests/unit/test_rendering.py.

Renderer's main.py drops scheduler import; calls
rendering.with_kai_gang directly. POD_GROUP_KEY composed alongside
LLMIS_KEY and DRA ResourceClaim(s) on the target cluster.

design.md: replaced "KAI / Kueue integration" section with "KAI
integration (in-cluster, this MR)" + an explicit "Follow-up MR
(plugin/dispatch system)" callout listing what comes next.

49/49 tests still green. Ruff + pyright clean.
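
For orientation, the minMember arithmetic described above might look like the sketch below. It assumes one pod per node and a per-role (instances, nodes per instance) shape; `Role` and `gang_size_sketch` are hypothetical names, and the real rendering.gang_size(spec) may derive the count differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Role:
    instances: int           # independent copies of this role
    nodes_per_instance: int  # assumption: one pod per node


def gang_size_sketch(decode: Role, prefill: Optional[Role] = None) -> int:
    """Total pods that must be co-scheduled: decode plus prefill if disaggregated."""
    total = decode.instances * decode.nodes_per_instance
    if prefill is not None:
        total += prefill.instances * prefill.nodes_per_instance
    return total


single_node = gang_size_sketch(Role(instances=1, nodes_per_instance=1))
disagg = gang_size_sketch(Role(instances=3, nodes_per_instance=2),
                          Role(instances=5, nodes_per_instance=1))
```

Under these assumptions the single-node case needs a gang of 1 and the disaggregated example needs 3 × 2 + 5 × 1 = 11 pods.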
@dennis-upbound dennis-upbound changed the title WIP: Federation matcher + renderer composition function sketch (against #64 API) Federation scheduler + KServe renderer (managed-kai) May 8, 2026
negz and others added 4 commits May 8, 2026 16:55
The topology block describes the shape of each worker, but workers.count
(how many of that shape) is a sibling concern at the same level — not
a property of the topology itself. Group them together under workers:

  workers:
    count: 3
    topology:
      strategy: TensorPipeline
      tensor: 8
      pipeline: 2

This reads as "3 workers, each TensorPipeline TP=8 PP=2." The topology
describes one worker. The count says how many. nodeSelector and engine
stay alongside workers as separate concerns — what hardware each worker
needs and what engine it runs.

For unified serving, workers just contains topology (count defaults to
1). For disaggregated P/D, workers.count on each role is the P:D
ratio — the "5" and "3" in 5P3D. It's a topology parameter (fixed per
deployment), not a scaling knob.
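
One way to read the shape above in code, as a sketch only: the field names mirror the YAML, while the `gpus` and `total_gpus` helpers and the assumption that a worker uses tensor × pipeline GPUs are illustrative and not part of the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    strategy: str
    tensor: int = 1
    pipeline: int = 1

    def gpus(self) -> int:
        # Assumption: one worker spans tensor * pipeline GPUs.
        return self.tensor * self.pipeline


@dataclass(frozen=True)
class Workers:
    topology: Topology
    count: int = 1  # defaults to 1 for unified serving

    def total_gpus(self) -> int:
        return self.count * self.topology.gpus()


# "3 workers, each TensorPipeline TP=8 PP=2"
workers = Workers(Topology("TensorPipeline", tensor=8, pipeline=2), count=3)
```

Here each worker spans 16 GPUs and the three workers together need 48, which shows why count multiplies the topology rather than living inside it.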

Signed-off-by: Nic Cope <nicc@rk0n.org>
The InferenceClass becomes a tested recipe that bundles both
capabilities (for scheduling) and optionally cloud-specific
provisioning config (for cluster composition). When Modelplane
provisions a GKE cluster, the composition function reads
class.provisioning.gke to get the machineType, accelerator config,
and networking — guaranteed consistent with the capabilities the
scheduler uses for matching.

The provisioning block is optional. Classes without it are
capabilities-only, used for BYO clusters where the pool already
exists. The provisioning.provider discriminator selects the
cloud-specific sibling block (gke, eks, aks).

Modelplane ships a default catalog: cloud-specific classes for
provisioned clusters (gke-h200-8x-a3-ib, gke-l4-1x-g2) and
cloud-agnostic classes for BYO (h200-8x-ib, l4-1x). The
InferenceCluster section now shows both a GKE-provisioned cluster
(pools reference cloud-specific classes) and a BYO cluster (pools
reference capabilities-only classes).

Cluster-level config (project, region, K8s version) stays on the
InferenceCluster. Pool-level config (machineType, GPU, networking)
moves to the class. Pool sizing (maxNodes, nodeCount) stays on the
InferenceCluster pool — it's a per-cluster capacity decision, not a
property of the hardware SKU.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Pulled origin/pages and adjusted code for Nic's new shape:

  Old: spec.topology = {strategy, tensor, pipeline, instances}
  New: spec.workers   = {count, topology: {strategy, tensor, pipeline}}

Mechanical rename across the algorithm + IR + tests:
- scheduling.py: split Topology dataclass; new Workers dataclass holding
  topology + count. RoleSpec now carries `workers` (was `topology`).
  RolePlacement.instances → RolePlacement.workers.
- emitters.py: ModelReplica spec writes spec.{decode,prefill}.workers.{count,topology}
  instead of nested instances+topology.
- rendering.py: RoleView.workers replaces .topology + .instances. Drops
  the unused `image` carry-through that was inside topology.
- 49 unit tests updated to use Workers(topology=..., count=N) construction
  pattern. Algorithm semantics unchanged — workers.count plays the role
  topology.instances did.
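
The old-to-new reshaping is mechanical enough to express as a small function. The sketch below works on plain dicts and is illustrative only; `migrate_topology` is a hypothetical name and does not exist in the branch.

```python
def migrate_topology(spec: dict) -> dict:
    """Rewrite {topology: {..., instances}} as {workers: {count, topology: {...}}}."""
    out = {k: v for k, v in spec.items() if k != "topology"}
    topology = dict(spec.get("topology", {}))   # copy; leave the input untouched
    count = topology.pop("instances", 1)        # instances defaulted to 1
    out["workers"] = {"count": count, "topology": topology}
    return out


old = {
    "engine": "vllm",  # placeholder sibling field, kept as-is
    "topology": {"strategy": "TensorPipeline", "tensor": 8,
                 "pipeline": 2, "instances": 3},
}
new = migrate_topology(old)
```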

Also includes the merge of origin/pages bringing in design/api-update.md
(Nic's design doc, untouched).
@dennis-upbound dennis-upbound deleted the dennis/scheduler-1pager branch June 19, 2026 17:26
3 participants