Implement the updated API shape #75

Merged
negz merged 35 commits into main from demonstration
May 20, 2026

Conversation

negz commented May 13, 2026

Collaborator

This PR implements most of the API shape we aligned on in #64.

The exceptions are:

  • No CEL (no spec.nodeSelector)
  • No DRA support: we don't set pod requirements at all, except the number of GPUs needed
  • ModelEndpoint doesn't support specifying addresses as FQDNs, only IPs
  • No disaggregated prefill/decode (no spec.prefill)
  • We're still built on KServe only

negz added 14 commits May 18, 2026 11:57
The team decided to focus on the kubectl / YAML workflow for the initial
demo. The web UI (Go proxy + React SPA) is not needed right now and adds
build complexity (Go, Node.js, npm, Vite, container image).

This commit removes ui/ and all Nix infrastructure that supported it:
the frontend and proxy build derivations, the container image builder,
the Go and frontend CI checks, the dev-proxy / dev-frontend / load-image
apps, and Go / Node.js from the dev shell. The lint app now runs ruff on
Python composition functions instead of golangci-lint on Go.

Signed-off-by: Nic Cope <nicc@rk0n.org>
… ModelReplica

The PR #64 design replaces InferenceEnvironment with InferenceCluster and
ModelPlacement with ModelReplica. The new vocabulary lines up with how
the rest of the design is shifting: ModelDeployment fans out to
ModelReplicas (one per cluster), the way a Kubernetes Deployment fans
out to Pods. "Environment" was always overloaded - it meant both "GPU
cluster" and "organizational stage" - so dropping it in favour of
"cluster" tightens the model.

This commit applies both renames mechanically:

- apis/inferenceenvironments  -> apis/inferenceclusters (kind, plural,
  short name ic)
- apis/modelplacements        -> apis/modelreplicas (kind, plural,
  short name mr)
- functions/compose-inference-env   -> compose-inference-cluster
- functions/compose-model-placement -> compose-model-replica
- tests/test-inference-env{,-existing} -> tests/test-inference-cluster{,-existing}
- tests/test-model-placement{,-autoscaling,-multinode}
  -> tests/test-model-replica{,-autoscaling,-multinode}
- tests/test-model-deployment-incompatible-env
  -> tests/test-model-deployment-incompatible-cluster
- examples/platform/inference-environment-*.yaml
  -> inference-cluster-*.yaml

Field renames that follow from the kind renames:

- ModelReplica.spec.inferenceEnvironmentRef -> inferenceClusterRef
- ModelDeployment.spec.environmentSelector  -> clusterSelector
- ModelDeployment.spec.environments         -> clusters
- ModelDeployment.status.placements         -> replicas
- ModelDeployment printer column ENVS       -> CLUSTERS
- ModelReplica printer column ENVIRONMENT   -> CLUSTER

Label renames:

- modelplane.ai/environment -> modelplane.ai/cluster
- modelplane.ai/placement   -> modelplane.ai/replica

Condition type / reason renames track the same vocabulary (e.g.
PlacementsScheduled -> ReplicasScheduled, NoEnvironments -> NoClusters).
The composed resource keys in compose-model-deployment also move from
"placement-<name>" to "replica-<name>".

The ClusterModel/Model serving[].environmentSelector field is left
alone here - those resources are being removed entirely in the next
commit.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The catalog split between ClusterModel/Model and ModelDeployment didn't
hold up in practice. Engine args are inherently model-specific (the
model name lives in --model=...), different quantization variants
reference different weight checkpoints, and the "platform team curates
a catalog" responsibility is real ongoing engineering work that most
organizations don't have a team for. This commit folds engine and
topology config inline on ModelDeployment, following the PR #64 design.

ClusterModel and Model are removed entirely along with their
composition function (compose-model). Source / huggingFace blocks are
gone too - engines fetch their own weights via their native
--model=<repo> argument. Auth for private repos goes through standard
PodSpec env (HF_TOKEN, NGC_API_KEY).

The new ModelDeployment spec:

- spec.replicas        How many ModelReplicas to fan out to.
- spec.clusterSelector Label selector against InferenceCluster.
- spec.workers         Compute shape of one worker:
    .count               (default 1)
    .topology.strategy   Tensor | TensorPipeline
    .topology.tensor     GPUs per node
    .topology.pipeline   Nodes per worker (TensorPipeline only)
    .resources.cpu       Required, no default
    .resources.memory    Required, no default
- spec.engine          Engine config:
    .image               Container image
    .args                Engine args (opaque)
    .env                 PodSpec-shaped env vars (HF_TOKEN etc.)
    .imagePullSecrets    For NGC and similar registries.

ModelReplica mirrors this shape minus spec.replicas and plus
spec.inferenceClusterRef.

DataExpert topology, disaggregated prefill/decode (spec.prefill), and
real CEL evaluation on nodeSelector are deferred. The scheduler reads
the topology shape directly (no more VRAM math): a pool fits when its
countPerNode >= topology.tensor and its nodes >= topology.pipeline.
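
As an illustration of the fit rule above, here is a minimal sketch. The GpuPool and Topology classes and the pool_fits helper are hypothetical names for this example, not the scheduler's actual types; only the comparison itself comes from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuPool:
    """Hypothetical view of one status.capacity.gpuPools[] entry."""
    name: str
    count_per_node: int  # GPUs on each node in the pool
    nodes: int           # number of nodes in the pool


@dataclass(frozen=True)
class Topology:
    """Hypothetical view of spec.workers.topology."""
    tensor: int = 1    # GPUs per node
    pipeline: int = 1  # nodes per worker


def pool_fits(pool: GpuPool, topology: Topology) -> bool:
    # Fit rule from the commit message: enough GPUs per node for the tensor
    # axis and enough nodes for the pipeline axis. No VRAM math involved.
    return pool.count_per_node >= topology.tensor and pool.nodes >= topology.pipeline


pools = [GpuPool("a100-x8", count_per_node=8, nodes=2), GpuPool("l4-x1", count_per_node=1, nodes=4)]
wanted = Topology(tensor=4, pipeline=2)
fitting = [p.name for p in pools if pool_fits(p, wanted)]
```

With these made-up pools, only the 8-GPU pool satisfies a tensor=4, pipeline=2 request.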

Autoscaling drops out. The XRD declares the standard /scale subresource
so kubectl scale works, but the KEDA ScaledObject and Prometheus query
plumbing are removed. KEDA-via-scale-subresource opt-in lands later.

The status.model.name field is gone - the model identity now lives in
opaque engine args, and best-effort parsing would be brittle.

Naming on the remote cluster shifts from "model name sanitized to a DNS
label" to just the ModelDeployment name. Each remote cluster gets one
LLMInferenceService per deployment with this name, so the control plane
HTTPRoute can rewrite to a uniform path on every backend.

The compose-model-deployment scheduler still prefers clusters that
already have a replica for this deployment (stability). Capacity
accounting subtracts GPUs consumed by other deployments' replicas
based on each replica's own workers.topology.

13 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
InferenceClass is the bridge between hardware capabilities and
provisioning recipes. Platform teams author InferenceClass resources
describing the shape of a GPU node pool (resources block: GPUs per
node, per-GPU memory) and optionally how to provision one on a cloud
(provisioning block: machine type, accelerator, disk size). Each
InferenceCluster.nodePools[] entry references a class by name and
declares only cluster-specific counts (nodeCount, minNodeCount,
maxNodeCount, zones).

This replaces per-pool inline hardware fields and converges the GKE
and BYO (Existing) cluster shapes. The same node pool schema works for
both: classes with provisioning describe pools Modelplane creates,
classes without provisioning describe pools that already exist on a
BYO cluster.

The system node pool that hosts control-plane components (Envoy
Gateway, KEDA, etc.) is no longer in the user-facing API. The
composition function injects it automatically for GKE clusters
(e2-standard-4, 1-2 nodes). Users only declare GPU pools.

PR #64's design used DRA-shaped attributes and capacity on the class
specifically so that ModelDeployment.spec.nodeSelector CEL could
evaluate against them. With nodeSelector dropped from this branch and
pod-shape moved to workers.resources, the DRA shape adds verbosity
without a consumer. spec.resources.gpu carries the count and per-GPU
memory the scheduler and composition function actually use. The
nvidia.com/gpu device plugin name remains an internal detail of the
composition function rather than a user-facing key.

The scheduler is untouched: compose-inference-cluster still populates
status.capacity.gpuPools[] in the same shape, just sourced from the
referenced classes instead of inline pool config.

InferenceClass itself has no composed children. compose-inference-class
just marks the XR Ready.

lib/resource.py now serialises with by_alias=True so the generated
class_ alias field renders as "class" in YAML.

14 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
PR #64 splits routing apart from deployment so that fan-out (replicas
on clusters) and exposure (where requests land) can evolve
independently. ModelEndpoint is a reachable inference endpoint;
ModelService selects endpoints by label and composes the Gateway-API
HTTPRoute that exposes them.

This commit introduces both kinds, moves the Envoy Backend composition
out of ModelReplica and into ModelEndpoint, and moves the HTTPRoute
composition out of ModelDeployment and into ModelService. The pattern
mirrors Kubernetes Deployment + Service: applying a ModelDeployment
alone gets you running replicas; you author a ModelService to make
them reachable.

ModelEndpoint (namespaced, short me): carries the informational URL,
the api protocol, and the rewritePath that ModelService consumes when
composing the URLRewrite filter. compose-model-endpoint parses
spec.url, composes an Envoy Backend on the control plane, and
surfaces the Backend's name in status.routing.backendName.

ModelService (namespaced, short ms): carries spec.endpoints, each a
label selector. compose-model-service fetches the InferenceGateway and
all matching ModelEndpoints, then composes an HTTPRoute that matches
the service's namespace/name path prefix and rewrites to the first
matched endpoint's rewritePath, with all matched endpoints as
backendRefs (equal weighting; weight as a field is deferred). The
service's public address surfaces on status.address.

ModelDeployment changes: stops composing the HTTPRoute, composes one
ModelEndpoint per matched cluster (labeled
modelplane.ai/deployment: <name>, with rewritePath pointing at the
remote LLMInferenceService path), and drops status.endpoint.url. The
URL surface lives on ModelService now.

ModelReplica changes: stops composing the Envoy Backend (that moves
to ModelEndpoint) and drops both status.endpoint.url and
status.routing.backendName. The replica becomes purely about
composing the LLMInferenceService on the remote cluster.

External / SaaS endpoint support (fqdn-style Backends) is deferred.
spec.url is expected to be an http://<ip>:<port>/... shape today; the
schema doesn't enforce that yet.

16 composition tests pass.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The Pydantic code generator turns a single-value enum (enum: [OpenAI])
into a Literal with the sole value as its default. The SDK's
resource.update uses exclude_defaults=True, which silently drops the
field from the serialized resource. The CRD then rejects it because
api is required.

Nothing reads spec.api today. Drop it rather than work around the
generator/SDK interaction. We can reintroduce it later once we sort
out exclude_defaults vs exclude_unset with the SDK.
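
The interaction described above can be reproduced in isolation. This is a sketch assuming Pydantic v2; EndpointSpec is a hypothetical stand-in for the generated model, and the URL is a placeholder.

```python
from typing import Literal

from pydantic import BaseModel


class EndpointSpec(BaseModel):
    """Hypothetical stand-in for the generated ModelEndpoint spec model."""
    url: str
    # A single-value enum rendered as a Literal with its sole value as default.
    api: Literal["OpenAI"] = "OpenAI"


spec = EndpointSpec(url="http://10.0.0.1:80/", api="OpenAI")

# exclude_defaults drops every field whose value equals its default,
# even when the caller set it explicitly.
without_defaults = spec.model_dump(exclude_defaults=True)

# exclude_unset keeps any field the caller set, whatever its value.
without_unset = spec.model_dump(exclude_unset=True)
```

Here without_defaults has no api key at all, which is the shape a CRD with a required api field would reject.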

Signed-off-by: Nic Cope <nicc@rk0n.org>
KServe v0.16's LLMInferenceService requires a non-empty model.uri.
This was previously set explicitly but was lost during the API
reshape. Restore it by extracting the model name from the --model=
engine arg.

This is an interim fix — the plan is to stop using KServe altogether.

Signed-off-by: Nic Cope <nicc@rk0n.org>
KServe's LLMInferenceService handles model fetching via model.uri
and invokes vLLM with the local model path. Passing --model= as a
container arg conflicts — vLLM v0.7.3 rejects it when invoked via
`vllm serve`.

Extract the model name from --model= to populate model.uri, then
strip it from the args passed to the container.
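
A minimal sketch of the extract-and-strip step, assuming the flag always appears in the single-token --model=<name> form. The split_model_arg helper and the example model name are illustrative, not taken from the codebase.

```python
from typing import Optional

MODEL_FLAG = "--model="


def split_model_arg(args: list[str]) -> tuple[Optional[str], list[str]]:
    """Return (model name or None, args with every --model= flag removed)."""
    model = None
    rest = []
    for arg in args:
        if arg.startswith(MODEL_FLAG):
            # Assumption: if the flag is repeated, the last occurrence wins.
            model = arg[len(MODEL_FLAG):]
        else:
            rest.append(arg)
    return model, rest


model, container_args = split_model_arg(
    ["--model=Qwen/Qwen2.5-0.5B-Instruct", "--max-model-len=4096"]
)
```

The returned name would feed model.uri, and container_args is what reaches the engine container. How the name maps to a URI scheme is not covered here.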

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
ModelDeployment had an ENDPOINT column on status.endpoint.url that's
no longer written — the URL surface moved to ModelService when the
routing layer was split. Drop the column and the status.endpoint
schema block.

ModelReplica had the same dead ENDPOINT column plus status.endpoint
and status.routing schema blocks. The replica function no longer
writes either. Replace the column with STRATEGY (Tensor or
TensorPipeline) and drop the unused schema.

Add a SOURCE column to InferenceCluster so kubectl get ic shows
whether the cluster was provisioned (GKE) or BYO (Existing).

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Every XR kind showed READY, SYNCED, AGE, and COMPOSITION as the last
four columns in kubectl get, plus a duplicate READY (and sometimes AGE)
earlier in the row. Crossplane v2 appends those built-in columns to
every XR automatically, so the hand-defined ones were always
duplicates.

This commit removes the duplicate READY and AGE columns from all nine
XRDs (six public, three internal). It keeps the columns that aren't
built-in: SOURCE and GATEWAY on InferenceCluster, BACKEND and ADDRESS
on InferenceGateway, REPLICAS on ModelDeployment, URL on ModelEndpoint,
CLUSTER and STRATEGY on ModelReplica, ADDRESS on ModelService, PROJECT
and REGION on GKECluster, KSERVE and GATEWAY on KServeBackend.
InferenceClass had no non-duplicate columns, so its
additionalPrinterColumns block is removed entirely.

Signed-off-by: Nic Cope <nicc@rk0n.org>
ModelDeployment composes one ModelReplica and one ModelEndpoint per
target InferenceCluster. The MR and ME both carry a
modelplane.ai/deployment label so ModelService can select all
endpoints for a deployment. They don't carry any label identifying
which InferenceCluster they belong to, so a ModelService can't narrow
its routing to a specific subset of clusters - the selector either
matches every endpoint of the deployment or none.

This commit adds a modelplane.ai/cluster label to MR and ME carrying
the target cluster's name. A ModelService can now select on both
deployment and cluster, e.g. to drop replicas on clusters that aren't
serving traffic.
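
To illustrate the narrowing this enables, here is a sketch of equality-based label matching. It assumes ModelService selectors behave like Kubernetes matchLabels; the endpoint and cluster names are made up.

```python
DEPLOYMENT_LABEL = "modelplane.ai/deployment"
CLUSTER_LABEL = "modelplane.ai/cluster"


def matches(labels: dict[str, str], selector: dict[str, str]) -> bool:
    # Every key in the selector must be present with exactly the same value.
    return all(labels.get(key) == value for key, value in selector.items())


endpoints = [
    {"name": "qwen-demo-us-east", "labels": {DEPLOYMENT_LABEL: "qwen-demo", CLUSTER_LABEL: "us-east"}},
    {"name": "qwen-demo-eu-west", "labels": {DEPLOYMENT_LABEL: "qwen-demo", CLUSTER_LABEL: "eu-west"}},
]

# Selecting on the deployment label alone matches every endpoint.
all_of_deployment = [e["name"] for e in endpoints if matches(e["labels"], {DEPLOYMENT_LABEL: "qwen-demo"})]

# Adding the cluster label narrows routing to a single cluster.
only_us_east = [
    e["name"]
    for e in endpoints
    if matches(e["labels"], {DEPLOYMENT_LABEL: "qwen-demo", CLUSTER_LABEL: "us-east"})
]
```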

Signed-off-by: Nic Cope <nicc@rk0n.org>
The script was using `kubectl run -i --rm` to invoke curl in an
ephemeral pod. That mode attaches to the pod's stdout after creation;
if the container exits before the attach binds, the curl response is
lost. The script printed an empty body roughly half the time, which is
unsuitable for a live demo.

This commit reworks the script to create the pod with `kubectl run`,
wait for it to reach Succeeded, then read its logs. Logs survive after
container exit, so there's no race. A trap cleans up the pod on any
exit path.

The script also now reads the ModelService address from
status.address rather than hard-coding the gateway IP and route path,
and prints what it's testing before issuing the request.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz force-pushed the demonstration branch from 4d90225 to 7801b9b May 18, 2026 18:57
negz added 6 commits May 18, 2026 12:41
This was left behind while refactoring to the new API design.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
The fleet scheduler matched clusters based solely on GPU capacity,
ignoring whether the cluster was actually ready. A cluster that was
still provisioning or hadn't established its gateway could be selected,
causing the deployment function to compose ModelEndpoints with
placeholder URLs that produced invalid Envoy Backends.

This commit adds a readiness gate to the scheduler: clusters must have
a Ready=True condition and a gateway address to be schedulable. Since
every matched cluster now has a valid gateway address, the endpoint
composition no longer needs a fallback path. The redundant
gateway_address field on Candidate is removed — the address is read
from clusters_by_name when needed.
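
The readiness gate can be sketched as a small predicate over the cluster's status. The status.gateway.address location is an assumption made for this example; the source only says the cluster must have a Ready=True condition and a gateway address.

```python
def cluster_ready(cluster: dict) -> bool:
    """Schedulable only with a Ready=True condition and a gateway address."""
    status = cluster.get("status", {})
    is_ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    # Assumption: the gateway address is published at status.gateway.address.
    address = status.get("gateway", {}).get("address", "")
    return is_ready and bool(address)


clusters = [
    {
        "metadata": {"name": "ready"},
        "status": {
            "conditions": [{"type": "Ready", "status": "True"}],
            "gateway": {"address": "10.0.0.1"},
        },
    },
    {
        "metadata": {"name": "provisioning"},
        "status": {"conditions": [{"type": "Ready", "status": "False"}]},
    },
    {
        "metadata": {"name": "no-gateway"},
        "status": {"conditions": [{"type": "Ready", "status": "True"}]},
    },
]
schedulable = [c["metadata"]["name"] for c in clusters if cluster_ready(c)]
```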

The stale InferenceGateway fixtures are also removed from all
deployment tests. The function no longer requires the InferenceGateway
as of the previous commit.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The ModelDeployment and ModelReplica APIs had a flat spec.engine block
and a strategy discriminator on spec.workers.topology that diverged
from the design doc. The engine configuration was separate from the
worker template, topology required an explicit Tensor/TensorPipeline
strategy enum, and CPU/memory resources lived in a top-level
workers.resources block that was pre-DRA scaffolding.

This commit restructures both XRDs to match the design:

- spec.engine moves into spec.workers.template, a curated subset of
  PodTemplateSpec. The template has metadata (labels, annotations for
  service mesh injection etc.) and spec (containers, imagePullSecrets).
  The container named "engine" is the inference engine; additional
  containers pass through as sidecars. A CEL validation rule on the
  containers array enforces exactly one container named "engine".

- spec.workers.topology.strategy is removed. Multi-node serving is
  now derived from pipeline > 1 (default 1). The topology axes compose
  multiplicatively without a discriminator.

- spec.workers.resources is removed. It was only used to set CPU and
  memory limits on the engine container. DRA will handle device
  binding and resource requirements in a future version. Until then,
  pods are created with only nvidia.com/gpu in resource limits.
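
The rules in this list can be mirrored in plain Python as a sketch. The helper names are hypothetical, the real check on containers is a CEL rule in the XRD rather than Python, and tensor is treated as required here because the source only states a default for pipeline.

```python
def engine_container(containers: list[dict]) -> dict:
    """Return the one container named "engine", as the CEL rule requires."""
    engines = [c for c in containers if c.get("name") == "engine"]
    if len(engines) != 1:
        raise ValueError(f'expected exactly one "engine" container, found {len(engines)}')
    return engines[0]


def is_multi_node(topology: dict) -> bool:
    # Multi-node serving is derived from pipeline > 1 (default 1).
    return topology.get("pipeline", 1) > 1


def gpus_per_worker(topology: dict) -> int:
    # The axes compose multiplicatively: GPUs per node times nodes per worker.
    return topology["tensor"] * topology.get("pipeline", 1)


containers = [{"name": "engine", "image": "example/engine:latest"}, {"name": "mesh-sidecar"}]
engine = engine_container(containers)
topology = {"tensor": 8, "pipeline": 2}
```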

Signed-off-by: Nic Cope <nicc@rk0n.org>
ModelService composed a single rule-level URLRewrite filter using
the first matched endpoint's rewritePath. When a service selected
endpoints with different rewrite targets (e.g. composed replicas
rewriting to /default/qwen-demo/ alongside a manual SaaS endpoint
rewriting to /v1), all traffic was rewritten to the first path
regardless of which backend handled the request. This silently
broke routing for any endpoint whose rewritePath differed.

Gateway API's HTTPBackendRef supports per-backendRef filters,
including URLRewrite (Extended support, confirmed in Envoy Gateway's
route processing). This commit moves the URLRewrite filter from
the rule level to each individual backendRef, derived from that
endpoint's spec.rewritePath. Endpoints with different rewrite
targets now coexist correctly in the same ModelService.
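
A sketch of what per-backendRef rewriting might look like when building the HTTPRoute rule. The group, kind, filters, and urlRewrite fields follow the Gateway API and Envoy Gateway Backend conventions; the ReplacePrefixMatch modifier and the backend_refs helper are assumptions for illustration, since the source does not say which path modifier the function emits.

```python
def backend_refs(endpoints: list[dict]) -> tuple[list[dict], int]:
    """Build one backendRef per resolved endpoint; count the unresolved ones."""
    refs = []
    pending = 0
    for endpoint in endpoints:
        backend = endpoint.get("status", {}).get("routing", {}).get("backendName")
        if not backend:
            pending += 1  # Backend not composed yet; reported, not routed to.
            continue
        refs.append({
            "group": "gateway.envoyproxy.io",
            "kind": "Backend",
            "name": backend,
            # The rewrite lives on the backendRef, so each endpoint keeps its own path.
            "filters": [{
                "type": "URLRewrite",
                "urlRewrite": {
                    "path": {
                        "type": "ReplacePrefixMatch",  # assumed modifier type
                        "replacePrefixMatch": endpoint["spec"]["rewritePath"],
                    },
                },
            }],
        })
    return refs, pending


endpoints = [
    {"spec": {"rewritePath": "/default/qwen-demo/"}, "status": {"routing": {"backendName": "replica-a"}}},
    {"spec": {"rewritePath": "/v1"}, "status": {"routing": {"backendName": "saas"}}},
    {"spec": {"rewritePath": "/v1"}, "status": {}},
]
refs, pending = backend_refs(endpoints)
```

The pending count corresponds to the number the EndpointsResolved condition message now reports.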

The EndpointsResolved condition message now also reports how many
matched endpoints are still waiting for their Backend to be
composed, rather than silently excluding them from the HTTPRoute.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The user-facing docs described the previous iteration of the API:
ClusterModel/Model catalog entries, InferenceEnvironments, serving
profiles, and concurrency-based per-cluster autoscaling. None of
those resources or behaviors exist in the current implementation.
The README hero snippet used fields (modelRef, clusters) that no
XRD defines, and getting-started had a "Register a model" step
pointing at a deleted example file.

This commit rewrites the three documents to match what Modelplane
actually does:

- README's snippet uses the current workers.topology and worker
  template shape. Prose describes the InferenceCluster /
  InferenceClass split and the ModelDeployment -> ModelReplica ->
  ModelEndpoint -> ModelService flow.

- concepts.md is rewritten around the seven resources that exist
  today, with a diagram showing how ModelService routes across the
  endpoints composed per replica.

- getting-started.md drops the broken "Register a model" step,
  adds an InferenceClass step before the cluster, and uses
  ModelService for routing in the final curl and status checks.

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz changed the title from "WIP: Implement the updated API shape" to "Implement the updated API shape" May 19, 2026
negz marked this pull request as ready for review May 19, 2026 00:26
Copilot AI review requested due to automatic review settings May 19, 2026 00:26
negz added 4 commits May 19, 2026 10:23
Several small issues accumulated during the API reshape.

compose-model-replica derived the parent deployment name by stripping
the cluster suffix from its own name — a string-parsing contract across
function boundaries that would silently break if the naming scheme
changed (e.g. truncation on long names). The deployment name is already
on the modelplane.ai/deployment label that compose-model-deployment
sets on every replica. The function now reads it from there. Test XR
YAMLs gain the label to match reality.

compose-model-deployment carried a clusters_by_name dict solely so
compose_endpoints could look up gateway addresses by cluster name. The
scheduler already had the address in hand (it gates on it via
_cluster_ready) but didn't surface it on Candidate. Candidate now
carries gateway_address, eliminating clusters_by_name entirely.

Other cleanup:

- CONDITION_REASON_MODEL_STARTING was duplicated across
  compose-model-deployment and compose-model-replica. Hoisted to
  lib/conditions.py.
- compose-model-replica called _engine_container() twice (once to
  compose, once for the event message). Cached in self.engine.
- compose-inference-cluster used hasattr() to guard an Optional
  Pydantic field that is None when absent. Removed.
- compose-inference-cluster round-tripped backend secrets through
  dicts then back to typed kssv1alpha1.Secret objects. Callers now
  construct the typed objects directly.
- Hardcoded "http://" in URL construction replaced with a
  GATEWAY_SCHEME constant in lib/metadata.py.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Comment thread lib/naming.py Outdated
The Python code generator aliases `class` to `class_` with
Field(alias='class') because `class` is a Python keyword. The
function-sdk-python's resource.update() calls model_dump() without
by_alias=True, so any Pydantic model with an aliased field silently
serializes under the Python name (class_) instead of the JSON name
(class). The CRD expects `class`, so the composed resource would be
rejected.

This codebase worked around the problem by adding by_alias=True to the
three model_dump() calls in lib/resource.py. But that only covers the
local helpers — every direct resource.update(rsp.desired.resources[k],
model) call goes through the SDK path, which doesn't pass by_alias.
The field doesn't bite us today because compose-inference-cluster only
reads nodePools[].class from the input XR and never re-emits it. But
it's a landmine: the moment anyone composes an InferenceCluster or adds
another Python-keyword field to a composed type, it breaks silently.

This commit renames the field to className, following the Kubernetes
convention (storageClassName, ingressClassName, runtimeClassName).
With no aliased fields in our schemas, by_alias=True is removed from
lib/resource.py — the SDK's resource.update() works correctly without
it.
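
The aliasing behavior is easy to reproduce on its own. This sketch assumes Pydantic v2; NodePool is a hypothetical stand-in for the generated model and the class name value is made up.

```python
from pydantic import BaseModel, Field


class NodePool(BaseModel):
    """Hypothetical stand-in for a generated model with a keyword-named field."""
    # `class` is a Python keyword, so the generator emits class_ with an alias.
    class_: str = Field(alias="class")


pool = NodePool(**{"class": "gpu-a100"})

# Without by_alias the Python attribute name leaks into the output.
python_names = pool.model_dump()

# With by_alias=True the JSON name the CRD expects is used.
json_names = pool.model_dump(by_alias=True)
```

Renaming the field to className sidesteps the question entirely, because the Python name and the JSON name are then identical.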

Signed-off-by: Nic Cope <nicc@rk0n.org>
Comment thread functions/compose-inference-cluster/main.py Outdated
Comment thread functions/compose-inference-cluster/main.py Outdated
Comment thread functions/compose-model-deployment/main.py Outdated
Comment thread functions/compose-model-endpoint/main.py Outdated
negz added 6 commits May 19, 2026 20:36
The compose-inference-class function only sets Ready and an Accepted
condition — it doesn't compose any resources or write meaningful status.
The test only verified the XR spec round-tripped unchanged, which is
low value.

InferenceClass XRD/composition wiring is still covered transitively by
test-inference-cluster, which loads an InferenceClass as a required
resource fixture.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Composed resource names are built by concatenating user-supplied
components (e.g. deployment name + cluster name, or XR name + fixed
suffix). When the result exceeds the 63-character DNS label limit, the
previous code silently truncated with [:63]. Two distinct inputs that
share a long prefix would truncate to the same name, silently
colliding.

This commit introduces dns_name() in lib/naming.py. Every composed
name now carries a 5-character SHA-256 hash suffix, regardless of
length. Short names get the suffix too, so all composed names are
visually consistent and the naming scheme is uniform. When the name
would exceed 63 characters, the prefix is truncated to make room for
the hash.
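
One way such a helper could work is sketched below. This is not the code in lib/naming.py: the signature, the "-" joiner, and hashing the full untruncated name are all assumptions made for illustration.

```python
import hashlib

MAX_LABEL_LEN = 63  # DNS label limit
HASH_LEN = 5        # hex characters taken from the SHA-256 digest


def dns_name(*parts: str) -> str:
    """Join parts and append a short hash, truncating the prefix if needed."""
    full = "-".join(parts)
    # Hash the full, untruncated name so inputs sharing a long prefix still differ.
    suffix = hashlib.sha256(full.encode("utf-8")).hexdigest()[:HASH_LEN]
    # Leave room for "-" plus the hash, and avoid a doubled separator.
    prefix = full[: MAX_LABEL_LEN - HASH_LEN - 1].rstrip("-")
    return f"{prefix}-{suffix}"


short = dns_name("qwen-demo", "us-east")
long_a = dns_name("x" * 70, "a")
long_b = dns_name("x" * 70, "b")
```

Both long names share the same 57-character prefix after truncation, so the suffix is the only thing that can tell them apart.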

Every name-construction site in the codebase now goes through
dns_name(). This covers lib/naming.py, compose-inference-cluster,
compose-gke-cluster, and compose-kserve-backend.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The paragraph explaining that the function does not compose the
HTTPRoute reads like an explanation of how things changed rather than
what the function does. The first paragraph already covers the
function's purpose.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Per Google Python style, import modules rather than individual
functions.

Signed-off-by: Nic Cope <nicc@rk0n.org>
…gke-cluster

The system node pool (e2-standard-4 for Envoy Gateway, KEDA, etc.) is
a GKE provisioning detail. compose-inference-cluster was prepending it
to the GKECluster XR's nodePools before composing the XR, which
leaked the implementation detail into the intermediate API surface.

This commit moves the system pool constants and injection into
compose-gke-cluster, where they belong alongside the other GKE node
pool provisioning logic. compose-inference-cluster now passes only
GPU pools to the GKECluster XR. compose-gke-cluster injects the
system pool when creating the actual GCP node pool resources.

Signed-off-by: Nic Cope <nicc@rk0n.org>
When an InferenceCluster with source=GKE had a node pool referencing
an InferenceClass without a GKE provisioning block, the function
silently skipped the pool. A misconfigured node pool is a user-fixable
error, and silent partial provisioning is confusing — the cluster
appears ready but is missing GPU capacity.

This commit replaces the silent skip with a ClusterReady=False
condition and a warning. The function returns early, gating all
downstream composition until the user fixes the InferenceClass
reference.

Signed-off-by: Nic Cope <nicc@rk0n.org>

dennis-upbound left a comment

Collaborator

Awesome! Looks great! A few minor comments/questions

Comment thread apis/modeldeployments/definition.yaml
Comment thread docs/concepts.md
Comment thread docs/concepts.md
Comment thread functions/compose-model-replica/main.py
Comment thread apis/modeldeployments/definition.yaml
Comment thread functions/compose-model-replica/main.py Outdated
Comment thread docs/concepts.md Outdated
Comment thread apis/modelendpoints/definition.yaml Outdated
dennis-upbound pushed a commit that referenced this pull request May 20, 2026
Composes a PVC + a one-shot hydration Job per matched InferenceCluster.
v0.1 scope: Weights kind, PVC backend, HuggingFace + S3 sources,
replication = AllMatchingClusters. ContentAddressed / Custom backends,
Tokenizer / Bytes / Adapter / Engine kinds, BYO ExistingPVC, and
per-cluster selector overrides are deferred.

Out of scope here: ModelDeployment integration. The mount-injection
that attaches a cache's PVC to a model serving pod lives in
compose-model-replica and is deferred until the new ModelDeployment
shape (PR #75) stabilizes.

Adds:
- apis/modelcaches/{definition,composition}.yaml
- functions/compose-model-cache/main.py
- examples/cache/model-cache-basic.yaml

Design: #76.
dennis-upbound pushed a commit that referenced this pull request May 20, 2026
PR #75 deleted ui/ entirely. An earlier commit on this branch swept
ui/frontend/node_modules into the index via git add -A, so the rebase
faithfully re-added ~10k files (~2.5M lines, ~181M) on top of the
deletion. Drop them.
negz added 4 commits May 20, 2026 12:33
The Scaling section claimed the ModelDeployment XRD declares a
Kubernetes scale subresource. It does not — Crossplane XRDs do not
support the scale subresource until v2.3, which has not shipped yet.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The description called spec.url "informational" and said ModelService
does not read it. The URL is used to configure routing to the
endpoint.

Signed-off-by: Nic Cope <nicc@rk0n.org>
The leader template included pod metadata (labels, annotations) from
workers.template.metadata, but the multi-node worker template used the
raw pod_spec without it. If a user set template metadata for service
mesh injection or similar, leader pods got the metadata but worker
pods did not.

This was not a regression from main — the old compose-model-placement
had no pod metadata support at all. But the new code introduced
template metadata and then applied it inconsistently.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Three bugs in compose-model-replica, all related to how the
LLMInferenceService manifest is built.

First, spec.replicas was hardcoded to 1. The XRD has workers.count
(default 1) meaning "number of workers per replica", and the scheduler
correctly accounts for it when reserving GPU capacity, but the
composition function ignored it. spec.replicas is now set from
workers.count.

Second, the multi-node LLMIS shape did not match KServe's v1alpha1
API. The function emitted worker as {size, template} but KServe
expects worker to be a PodSpec directly — the LWS group size is
derived from parallelism.pipeline, not a separate field. The function
also set parallelism.tensor to tensor × pipeline (total GPUs) instead
of the actual tensor parallelism per node, and never set
parallelism.pipeline at all.

Third, pod metadata (labels, annotations) from the worker template
was placed inside the PodSpec at template.metadata. KServe's
WorkloadSpec.template is a PodSpec, which has no metadata field. The
KServe-blessed location for pod labels and annotations is at the
WorkloadSpec level (siblings of template), where KServe applies them
to both leader and worker pods.

The multi-node test now sets workers.count: 2 and asserts the correct
LLMIS shape: replicas=2, parallelism with both tensor and pipeline
axes, and worker as a bare PodSpec.
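
The corrected shape can be sketched as a dict builder. Field names (replicas, parallelism, template, worker, labels, annotations) follow the description above; the llmis_workload helper is hypothetical, and omitting parallelism.pipeline and worker for single-node workers is an assumption.

```python
def llmis_workload(workers: dict, pod_spec: dict) -> dict:
    """Build the workload portion of an LLMInferenceService spec (sketch)."""
    topology = workers["topology"]
    pipeline = topology.get("pipeline", 1)
    metadata = workers.get("template", {}).get("metadata", {})

    spec = {
        "replicas": workers.get("count", 1),  # from workers.count, not hardcoded
        # Tensor parallelism per node, not tensor x pipeline.
        "parallelism": {"tensor": topology["tensor"]},
        "template": pod_spec,  # a bare PodSpec, so no metadata inside it
    }
    if pipeline > 1:
        spec["parallelism"]["pipeline"] = pipeline
        spec["worker"] = pod_spec  # also a bare PodSpec, not {size, template}
    # Labels and annotations sit beside template, at the WorkloadSpec level.
    for key in ("labels", "annotations"):
        if metadata.get(key):
            spec[key] = dict(metadata[key])
    return spec


workers = {
    "count": 2,
    "topology": {"tensor": 8, "pipeline": 2},
    "template": {"metadata": {"labels": {"team": "ml"}}},
}
pod_spec = {"containers": [{"name": "engine", "image": "example/engine:latest"}]}
workload = llmis_workload(workers, pod_spec)
```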

Signed-off-by: Nic Cope <nicc@rk0n.org>
negz merged commit cfc0fad into main May 20, 2026
3 checks passed
negz deleted the demonstration branch June 16, 2026 16:56
2 participants