v0.1 ModelCache + multi-node LWS unblock by dennis-upbound · Pull Request #78 · modelplaneai/modelplane

dennis-upbound · 2026-05-14T17:01:48Z

Warm-cluster run: cache pre-hydrated, LWS gang Ready in 65s, engine answers 86s after gang Ready. Recording plays at 4× speed.

Adds the v0.1 ModelCache primitive (XRD + composition function + ModelDeployment.spec.caches[] wiring) and turns multi-node TensorPipeline serving from broken-by-default into a working path. See Multi-node LWS — what was broken and what we did below for the specific issues we untangled.

Base: demonstration (#75) so the diff stays scoped. GitHub will retarget this PR to main automatically when #75 merges.

What ships

ModelCache — composes a per-cluster RWX PVC + a one-shot hydration Job on every matched InferenceCluster. Pods that reference the cache via ModelDeployment.spec.caches[] get the PVC mounted at the engine's expected path; no per-replica HuggingFace pull at boot.

Aspect	v0.1 values
Artifact kinds	`Weights`, `Tokenizer`, `Bytes`
Sources	`huggingFace`, `s3`, `http`, `inline` implemented; `oci`, `configMap` discriminators reserved
Storage backends	`PVC` (Modelplane-managed), `ExistingPVC` (customer-managed)
Replication	`AllMatchingClusters`
Deployment wiring	`ModelDeployment.spec.caches: [{ name }]` → `LLMInferenceService.spec.model.uri = pvc://...`

Cloud-agnostic CSI capability plumbing. InferenceCluster.spec.storage.csiDrivers: [SharedFilesystem | ObjectStorageMount | BlockDevice] declares the cluster's semantic capability. The GKE composition branch reads the underlying VPC name from GKECluster.status.network.name and auto-composes a modelplane-rwx StorageClass on the workload cluster with parameters.network pinned. Same pattern will hook into EKS / AKS with their respective knobs.

GCP API auto-enable. compose-gke-cluster emits a ProjectService MR for file.googleapis.com whenever the user opts into SharedFilesystem.

1-pager design: design/modelcache/README.md. Replaces the earlier verbose design doc (PR #76, closed in favor of this focused page).

End-to-end demo: examples/qwen-cached-demo/ deploys Qwen 2.5 0.5B on a 2-pod LWS gang with a real chat-completion response. See TOPOLOGY.md for the composed XR / MR layout.

Tests: model-cache × 3 + model-replica-with-cache + the full existing suite — 20/20 pass.

Multi-node LWS — what was broken and what we did

Bringing the demo up needed fixes layered across the function, the LIS shape, the cache hydration command, the engine wrapper, and the cluster's StorageClass. Each row below is something an OSS user would hit on a fresh cluster; calling them out so reviewers know what each fix buys us.

Problem	Fix
No Ray bootstrap on the gang. vLLM with `--pipeline-parallel-size > 1` needs a Ray cluster spanning the gang's pods. KServe's `ServingRuntime` injects this for you; the custom-template path we use doesn't. Worker pod never joined the leader's Ray cluster, vLLM's placement group hung forever.	`compose-model-replica` injects a 5-line shell wrapper as the container `command` when `topology.strategy == TensorPipeline`. Leader runs `ray start --head` then execs the engine; worker runs `ray start --address=$LWS_LEADER_ADDRESS:6379 --block`. Bootstrap lives as a Python constant — when we swap KServe for our own `LeaderWorkerSet` composition later, the same constant moves verbatim.
KServe v0.17+ flat worker schema. The existing function generated `worker: {size, template}` (v0.16 shape); v0.17+ wants `worker` as a flat `PodSpec` with `containers[]`. `parallelism.tensor` is now per-pod GPU count and `parallelism.pipeline` is required when `worker` is present.	Switch the generated LIS shape to flat `worker:` + set `parallelism.{tensor,pipeline}` from the topology.
KServe v0.17+ stopped injecting `--model` into engine argv. `model.uri = pvc://...` mounts the PVC at `/mnt/models` but the engine container's args don't get `--model=<path>` anymore. vLLM defaulted to `facebook/opt-125m` and silently downloaded it from HF instead of using the mounted cache.	Append `--model=/mnt/models` to the engine args when `spec.caches[]` is set.
`vllm serve` vs `python -m vllm.entrypoints.openai.api_server`. `vllm serve` takes the model as a positional argument; the api_server entrypoint (the image's default `ENTRYPOINT`) takes `--model=`. The first wrapper invoked `vllm serve "$@"`, which silently exited on argparse mismatch.	Wrapper invokes the same `python -m vllm.entrypoints.openai.api_server` the default `ENTRYPOINT` uses. Same arg shape for single-node and multi-node.
Filestore CSI defaults to the GCP `default` VPC. GKE's chart-shipped `standard-rwx` StorageClass has no `network` parameter, so Filestore instances landed on `default` and were unreachable from cluster nodes — PVCs sat Pending forever with NFS mount timeouts.	`compose-inference-cluster` (GKE branch) auto-composes a workload-cluster StorageClass (`modelplane-rwx`) with `parameters.network = <our VPC>`. Network surfaced via `GKECluster.status.network.name`.
`lost+found` triggered a false-positive skip. The hydrate Job's "already populated?" check did `ls -A /mnt/model`. Filestore's ext4-style filesystem puts `lost+found` at mount root, so a fresh PVC always looked non-empty; the Job exited with `artifact already hydrated, skipping` and the cache stayed Ready with no weights on disk.	Filter `lost+found` out of the emptiness check.
`huggingface-cli` was deprecated in `huggingface-hub 1.x` and the `[cli]` extra removed. The hydrate Job's `pip install huggingface_hub[cli]` printed a warning and `huggingface-cli download` exited with a deprecation notice, no bytes written.	Switch to `hf download` (new CLI, same flags).
Filestore API isn't enabled by default on fresh GCP projects. PVCs sit `Pending` indefinitely with `SERVICE_DISABLED` in their workload-cluster events; the GKE cluster itself looks healthy.	`compose-gke-cluster` emits a `ProjectService` MR for `file.googleapis.com` whenever the Filestore CSI addon is on.
T4 / Turing doesn't support bfloat16 — Qwen 2.5 ships in bf16. vLLM exits with `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0`. Not a Modelplane-layer issue; documented as an example concern.	Demo deployment passes `--dtype=half`.
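
The Ray bootstrap row above can be sketched as follows. This is a minimal illustration, not the code in `compose-model-replica`: the constant names, the `container_command` helper, and the `role` parameter are hypothetical. Only the `ray start` invocations, the `LWS_LEADER_ADDRESS` variable, and the `api_server` entrypoint come from the table.

```python
from typing import List, Optional

# Hypothetical constants. The leader starts the Ray head and then execs the
# same api_server entrypoint the image uses by default; the worker joins the
# leader's Ray cluster and blocks.
LEADER_BOOTSTRAP = (
    "ray start --head --port=6379 && "
    'exec python -m vllm.entrypoints.openai.api_server "$@"'
)
WORKER_BOOTSTRAP = 'exec ray start --address="${LWS_LEADER_ADDRESS}:6379" --block'


def container_command(strategy: str, role: str) -> Optional[List[str]]:
    """Return the container `command` for a gang pod, or None to keep the image default."""
    if strategy != "TensorPipeline":
        return None
    script = LEADER_BOOTSTRAP if role == "leader" else WORKER_BOOTSTRAP
    # With `sh -c SCRIPT NAME ARGS...`, NAME becomes $0, so the trailing "--"
    # makes the container's `args` show up as "$@" inside the script.
    return ["/bin/sh", "-c", script, "--"]
```

Because the engine args arrive through `"$@"`, the single-node and multi-node paths can share one args list; only `command` differs.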

Net result: a fresh kubectl apply -f examples/qwen-cached-demo/ provisions a GKE cluster with Filestore CSI enabled and a VPC-pinned StorageClass, hydrates Qwen 2.5 0.5B onto a 100 GiB RWX PVC, and boots a 2-pod LWS gang on two T4 nodes. Both pods mount the same PVC, the leader's Ray head and the worker's Ray node form one cluster, vLLM's pipeline-parallel placement group is satisfied, and curl /v1/chat/completions returns a real Qwen response with HTTP 200.
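
The `lost+found` fix from the table above comes down to ignoring filesystem bookkeeping when deciding whether a PVC is already populated. The hydrate Job performs this check in shell; the sketch below expresses the same logic in Python, with `IGNORED_ENTRIES` and `is_hydrated` as hypothetical names.

```python
import os

# Entries the filesystem creates on its own; they say nothing about whether
# an artifact was staged. Hypothetical constant, for illustration only.
IGNORED_ENTRIES = {"lost+found"}


def is_hydrated(mount_path: str) -> bool:
    """True when the mount contains anything other than ignored entries."""
    return any(name not in IGNORED_ENTRIES for name in os.listdir(mount_path))
```

A fresh Filestore-backed PVC that contains only `lost+found` is then treated as empty, so the Job proceeds to download instead of skipping.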

Out of scope for v0.1

Adapter / Engine artifact kinds — composition shape needs more design before locking
Storage backends beyond PVC / ExistingPVC — the backend enum is the extension point
AllMatchingNodes replication granularity — per-node staging is a separate problem
Signal-bus emission (hydration latency, bytes staged); depends on the Fleet signal bus capture layer for cross-cluster signals (#74)

Workarounds / hacks (cleanup before / after merge)

Real composition-code workarounds shipping in this PR. The demo scripts under examples/qwen-cached-demo/ have their own pile of provider-gcp / provider-helm / provider-kubernetes workarounds — those will get gitignored and graduate into a smoke-test fixture, so they're not called out here.

Composition-code hacks

Where	What	Upstream root cause	Cleanup
`functions/compose-gke-cluster/main.py` (TODO L129)	`ZONE_RESOURCE_POOL_EXHAUSTED` isn't surfaced through the nodepool MR's `conditions`; GCE keeps the pool in `PROVISIONING` forever.	GCE/GKE not propagating instance-level capacity failures up to the pool's status.	Add an IGM-status check or gcloud query in the function (bigger lift, defer).
`functions/compose-kserve-backend/main.py` (LLMInferenceServiceConfig presets)	Composes five empty `spec: {}` `LLMInferenceServiceConfig` instances on the workload cluster because the `kserve-llmisvc-resources` chart defines the CRD but doesn't ship the controller-required defaults.	KServe chart packaging gap — chart should land these.	File upstream KServe issue. Once they ship the defaults, drop the preset compose code.
`functions/compose-kserve-backend/main.py` (router-route exclusion)	We deliberately skip `kserve-config-llm-router-route` because the chart DOES include it, and naive composition leads to Helm "exists and cannot be imported" failures. The list-overlap is fragile across KServe versions.	Same chart packaging gap.	Same as above.

Mid-session debugging interventions captured so they don't repeat

Bug	What we fixed	Test that now catches it
Workload-cluster `Object` MR for `storage-class-rwx` composed without `metadata.namespace` set; Crossplane reconcile errored with "an empty namespace may not be set when a resource name is provided".	Added `metadata.namespace=modelplane-system` to the composed Object.	`test-inference-cluster-csi-shared-filesystem` asserts on the namespace.
`storage-class-rwx` Object used `policy: DeriveFromObject` readiness, but StorageClass has no Ready condition — Object MR stayed `Ready=False` forever.	Switched to `policy: SuccessfulCreate`.	Same test.
Function composed the SC Object but didn't propagate observed Ready into `rsp.desired.resources["storage-class-rwx"].ready` — XR aggregator left the InferenceCluster `Ready=False`.	Read observed condition and set `READY_TRUE` on the response.	Same test (regression-friendly; the compositiontest framework doesn't expose `.ready` for direct assertion).
`compose-gke-cluster` didn't track `projectservice-filestore` in `mark_readiness`; XR stuck on "Unready resources: projectservice-filestore" even after the API was enabled.	Added it to the tracked-resource list.	`test-gkecluster-filestore-addon`.
Preset Object MRs pointed at `<xr-name>-cluster` (the Helm ProviderConfig) instead of `<inferencecluster>-cluster-kubeconfig` (the ClusterProviderConfig).	Strip the `-kserve` suffix from the KServeBackend XR name to derive the parent + the right CPC name.	`test-backend-presets` asserts on `providerConfigRef.name`.

What needs to happen to merge

Must

Rebase to main. PR is currently based on demonstration (#75) so the diff stays scoped. GitHub will retarget the PR to main once #75 lands.
Migrate spec.engine → spec.workers.template per Nic's merged commit on PR #64 (the engine field is now a curated subset of PodTemplateSpec). Affects apis/modeldeployments/definition.yaml, apis/modelreplicas/definition.yaml, functions/compose-model-replica/main.py.
Gitignore the examples/qwen-cached-demo/ demo scripts (or relocate under an ignored path). They'll graduate into a smoke-test fixture; not part of the merged surface area.
Review.

Should

File upstream KServe issue for the chart packaging gap (missing default LLMInferenceServiceConfig instances + kserve-llmisvc-resources Helm release blocked by chart-shipped vs externally-created preset overlap).

Nice to have

kind-cluster smoke test in CI that reconciles Crossplane against the package and asserts XR Ready aggregation. Would have caught half the bugs we hit during this PR before they hit a live demo.

Demo flow

export GCP_PROJECT=my-gcp-project
./setup.sh         # ~5-10 min on first run: GKE provision + Filestore CSI install + stack
./demo.sh          # cache hydrate + LWS gang up + curl
./cleanup-demo.sh  # iterate: deletes workload, cluster + infra stay
./cleanup.sh       # full teardown

Recordings of the warm-cluster demo and the full cold-start flow attached in a follow-up comment.

Known issue: control-plane envoy proxy gets stuck on xds during long demos

Hit this twice now during cold-start demo recordings — flagging so it's not lost.

Symptom: the control-plane envoy proxy pod (envoy-modelplane-system-modelplane-* in the envoy-gateway-system namespace) goes to 1/2 Ready and the Gateway reports PROGRAMMED=False even though the InferenceCluster + ModelService composition is fine. Curl from inside the cluster gets Connection refused on 172.18.255.200:80, not a 5xx.

Root cause (suspected): the proxy's envoy container can't reach its xds source. Container logs show repeated:

[warning][config] DeltaAggregatedResources gRPC config stream to xds_cluster closed: 14,
  upstream connect error or disconnect/reset before headers.
  reset reason: connection timeout

The long-running gRPC config stream from the envoy proxy to the envoy-gateway control plane goes stale during multi-hour sessions, and the proxy never recovers on its own.

Workaround (what unblocked both demos):

kubectl delete pod -n envoy-gateway-system envoy-modelplane-system-modelplane-<hash> --wait=false

The Deployment recreates the pod, the new proxy gets a fresh xds stream, the Gateway flips to PROGRAMMED=True within ~15s, and routing works again.

Why this matters here: this PR does not introduce it, but it bites the demo flow when the recording session is long enough, because by the time we get to the curl step the proxy has already drifted. Worth doing one of:

Filing upstream on envoy-gateway (long-running xds gRPC stream not self-healing); the symptom + fix match several open issues there.
Adding a defensive controller on Modelplane's side that watches the Gateway condition and bounces the proxy pod when PROGRAMMED flips False for >N minutes.
(Demo-only) kubectl rollout restart of the envoy proxy as the first step of demo.sh so the recording can't catch it mid-drift. Trivial but not addressing the underlying issue.

Not blocking for this PR.

Composes a PVC + a one-shot hydration Job per matched InferenceCluster.
v0.1 scope: Weights kind, PVC backend, HuggingFace + S3 sources, replication = AllMatchingClusters. ContentAddressed / Custom backends, Tokenizer / Bytes / Adapter / Engine kinds, BYO ExistingPVC, and per-cluster selector overrides are deferred.
Out of scope here: ModelDeployment integration. The mount-injection that attaches a cache's PVC to a model serving pod lives in compose-model-replica and is deferred until the new ModelDeployment shape (PR #75) stabilizes.
Adds:
- apis/modelcaches/{definition,composition}.yaml
- functions/compose-model-cache/main.py
- examples/cache/model-cache-basic.yaml
Design: #76.

Apply patterns from skills/crossplane-python-functions:
- Cast XRD int fields with int(): protobuf delivers Quantity sizes as Python float (`200.0Gi` ≠ valid Kubernetes Quantity)
- Split per-source hydration into _hf_hydration / _s3_hydration module functions so the discriminator dispatch is one line
- Separate composition from observation: compose_cluster_resources() only emits Objects; derive_cluster_phase() reads observed state; mark_ready_resources() flips ready flags AFTER resource.update()
- Add transition events on first compose and on first full readiness (one-shot, not steady-state, to keep `kubectl describe` quiet)
- Extract _wrap_remote / _observed_remote_status helpers and a HydrationSpec dataclass to replace the tuple return
New composition test: tests/test-model-cache/{main.py,xr.yaml}. Mocks a single ready InferenceCluster via extraResources and asserts the PVC + Job Objects compose with the expected manifests on the workload cluster.
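
The int() cast described above can be illustrated with a small sketch. The helper name `quantity_gi` is hypothetical; the point is only that a size arriving as a Python float has to be cast before it is formatted into a Quantity string.

```python
def quantity_gi(size) -> str:
    """Format a numeric size as a Gi quantity, casting away the float first."""
    return f"{int(size)}Gi"


# What naive formatting of the float produces, i.e. the string the commit
# message flags as a problem:
naive = f"{200.0}Gi"      # "200.0Gi"
fixed = quantity_gi(200.0)  # "200Gi"
```

Doing the cast at the boundary, where the XR field is first read, keeps every downstream manifest builder free of float handling.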

datamodel-codegen names inline array item types after the singular property name; the generated class is `Cluster`, not `ClustersItem`. Caught by `up test run tests/test-model-cache`.

Stops these from polluting `git status` and accidentally getting committed:
- `__pycache__/` and `*.pyc`: Python bytecode caches, regenerated by every test run
- `.DS_Store`: macOS Finder metadata
- `.venv-test/`: local test virtualenv (mirrors existing `.venv`)
- `opencode.json`: per-user opencode tool config; contains a local endpoint URL, no shared value

XRD now matches the full v0.1 design surface so we don't have to churn the API shape later:
- artifact.kind: + Tokenizer + Bytes (same hydration path as Weights)
- artifact.source: + http, oci, inline, configMap (in addition to huggingFace + s3)
- storage.backend: + ExistingPVC (customer-managed PVC, no Job)
- status: + resolvedDigest, + lastHydratedAt, + bytesStaged, + references
Implementations:
- Tokenizer / Bytes: route through the same builder as Weights
- ExistingPVC: compose no Objects; report Ready immediately per matched cluster; "Adopted" event on first match
- http: curl-based fetch into the PVC; optional Authorization header from a Secret
- inline: write content to a file inside the PVC via env-passed value
- lastHydratedAt: captured from the remote Job's completionTime
- oci, configMap: discriminator locked, surface ImplementationPending condition + warning until wired
Two new tests cover the new paths:
- test-model-cache-existing-pvc: no Objects composed, 1/1 ready
- test-model-cache-pending-source: oci source surfaces empty summary rather than crashing

Mirror the doc trim on PR #76: the kind / source descriptions just name the partition axis ("fetch protocol", "wiring discriminator not content partition") rather than declaring the field MECE. Substance is the same; phrasing matches the design doc.

Mirrors the design-doc edit that replaces the flat RWX-CSI list with the four-category framing (NFS / parallel FS / object-backed FUSE / replicated block). Description-only — surfaces a choice customers should actually be making rather than implying all CSIs are equivalent.

Four curated examples covering the impl's working v0.1 paths:
- model-cache-basic.yaml: HuggingFace Weights, basic case. Tidied the header comment from the original scaffold.
- model-cache-nim-mode-2b.yaml: pre-seed the NIM profile cache dir via http source. The demoable NIM Mode 2b case once the cluster has NGC creds. Notes that ORAS / oci source is locked in the XRD but impl-pending; that follow-up swaps http for oci against nvcr.io/nim/... directly.
- model-cache-existing-pvc.yaml: ExistingPVC backend; customer-managed PVC adoption with no Modelplane-composed PVC or Job.
- model-cache-private-s3.yaml: private S3 with access-key Secret; compliance / GDPR scenario.
Each example has a tight header explaining the use case, expected speedup, and what cluster/Secret prerequisites need to be in place.

Two examples were technically schema-correct but assumed the user already knew the implicit Secret-key contract:
- model-cache-basic.yaml: tokenSecretRef.name: hf-token expects a Secret with key HF_TOKEN. Added a one-line kubectl example so the user can wire it up before applying.
- model-cache-private-s3.yaml: the s3 hydration Job reads fixed keys access_key / secret_key from the referenced Secret. Made that explicit so the user doesn't accidentally use AWS_ACCESS_KEY_ID etc.
Validated all four examples against the generated Pydantic XRD model (.up/python/models/ai/modelplane/modelcache/v1alpha1.py): every required field present, every enum and pattern matches.

End-to-end ModelCache integration so a ModelDeployment that references a cache mounts the pre-staged PVC at engine boot instead of fetching weights from the source.
API:
- ModelDeployment.spec.caches: [{ name }] is a single-item list in v0.1 and references a ModelCache in the same namespace
- ModelReplica.spec.caches mirrors and inherits verbatim
Composition:
- compose-model-deployment passes caches through to each ModelReplica
- compose-model-replica sets model.uri = pvc://<cache-pvc-name> when a cache is referenced; otherwise falls back to hf://<repo> from --model= as before
- lib/naming.modelcache_pvc_name() centralizes the PVC naming so compose-model-cache (creator) and compose-model-replica (consumer) agree on the convention
Test: tests/test-model-replica-with-cache/ verifies the dispatch. Caught one bug along the way (llmis name derivation when the deploy prefix shares characters with the cluster name), fixed in the test fixture.
Demo: examples/qwen-cached-demo/, three yamls + README showing the cold-start delta on Qwen 2.5 0.5B over the existing qwen-demo flow. Deploys onto the same GKE InferenceClusters from ../qwen-demo/ with spec.caches: [{ name: qwen-2-5-0-5b }] pointing at a pre-staged cache.
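
The model.uri dispatch described above might look roughly like the sketch below. It is an illustration under stated assumptions: `resolve_model` is a hypothetical helper, and the `modelcache-<name>` convention inside `modelcache_pvc_name` is invented for the example (the real convention lives in lib/naming). The appended `--model=/mnt/models` follows the KServe v0.17+ fix from the PR description.

```python
from typing import Dict, List, Tuple

MOUNT_PATH = "/mnt/models"  # where KServe mounts a pvc:// model, per the PR description


def modelcache_pvc_name(cache_name: str) -> str:
    # Hypothetical naming convention, for illustration only.
    return f"modelcache-{cache_name}"


def resolve_model(caches: List[Dict[str, str]], engine_args: List[str]) -> Tuple[str, List[str]]:
    """Return (model_uri, engine_args) for a replica."""
    if caches:
        # Cached path: point KServe at the PVC and tell the engine where it is mounted.
        uri = f"pvc://{modelcache_pvc_name(caches[0]['name'])}"
        return uri, list(engine_args) + [f"--model={MOUNT_PATH}"]
    # Uncached path: derive hf://<repo> from the user's --model= argument.
    for arg in engine_args:
        if arg.startswith("--model="):
            return f"hf://{arg.split('=', 1)[1]}", list(engine_args)
    raise ValueError("no cache referenced and no --model= engine arg to fall back to")
```

Keeping the PVC name behind one function is what lets the cache composer and the replica composer agree without sharing state.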

Four sequenced scripts so the demo is one-command runnable:
- setup.sh: applies shared prereqs from ../qwen-demo (00-prereqs, 01-gateway, 02-class), provisions a GKE InferenceCluster via envsubst-templated infra/cluster.yaml ($GCP_PROJECT required), waits for Ready (~5-10 min).
- demo.sh: applies cache → waits for ArtifactReady → applies deployment → waits for ReplicasReady → applies service → fetches the gateway address → curls a chat-completions request. Times each phase so the cold-start delta is visible.
- cleanup-demo.sh: deletes service / deployment / cache only. Cluster + shared infra stay so demo.sh can re-run immediately.
- cleanup.sh: calls cleanup-demo.sh then deletes the InferenceCluster (deprovisions GKE). Shared infra kept (might be reused by other demos in this repo).
Every script uses kubectl apply / delete --ignore-not-found so re-running mid-state is safe.
README rewritten to lead with the script-driven flow and a phase / script / what-it-does table.
infra/cluster.yaml uses source: GKE so Modelplane provisions the GKE cluster directly: no external Secret / kubeconfig wiring required, the user just provides GCP_PROJECT and Crossplane GCP provider credentials.
shfmt-clean, schema-validated against generated Pydantic models for ModelCache / ModelDeployment / ModelService / InferenceCluster.

Two cleanups:
- Drop the made-up cache/replica timing numbers (18s/24s): they're ballpark guesses and shouldn't read as commitments. Output snippet now shows the script's shape with placeholders.
- Comparison section was instructing the reader to apply ../qwen-demo/04-deployment.yaml; that has replicas: 2 and selects by the broad cluster label, so on a single-cluster demo setup the scheduler would surface InsufficientCapacity. Simpler advice: copy 02-deployment.yaml, strip the caches block, change the name, and apply.

The original demo showed only the cached path, so readers had to imagine the uncached cold-start cost. Now the demo applies both deployments side-by-side on the same cluster, polls each for readiness, and prints both timings so the delta is obvious.
Changes:
- infra/cluster.yaml: nodeCount 1 → 2 (one GPU per parallel deployment)
- 02-deployment.yaml: replicas 2 → 1 (now paired with 02b)
- 02b-deployment-uncached.yaml: new; same Qwen model, no spec.caches, --model=Qwen/... in engine args so the engine pulls weights from HuggingFace at boot
- 03b-service-uncached.yaml: new; exposes the uncached deployment
- demo.sh: applies both deployments + services, polls each via jsonpath against status.conditions[type=ReplicasReady], emits per-deployment ready times as they tick, prints a Cached / Uncached comparison summary, runs the curl sanity test against the cached endpoint
- cleanup-demo.sh: deletes both deployments + services + the cache
- README: rewritten to lead with the side-by-side framing

The InferenceClass + GKE source path requires zones (validated by compose-inference-cluster). Without it the function pipeline returns a pydantic validation error and the cluster never composes. Caught while bringing the demo up live against crossplane-playground.

Three changes the demo flushed out:
(1) Cloud-agnostic CSI capability declaration. User-facing InferenceCluster gets a new spec.storage block with a storageClassName default and a csiDrivers list of semantic capability flags: SharedFilesystem / ObjectStorageMount / BlockDevice. The InferenceCluster composition maps these to cloud-specific CSI addons per source (GKE Filestore CSI for SharedFilesystem, GCS-FUSE for ObjectStorageMount, PD-CSI default for BlockDevice). Future EKS / AKS branches do the equivalent mapping with their own addon names. The internal GKECluster XR carries a GCP-specific spec.addons block (gcpFilestoreCsiDriver / gcsFuseCsiDriver / gcePersistentDiskCsiDriver) that compose-gke-cluster threads into the underlying container.Cluster resource's addonsConfig. The user-facing surface stays cloud-agnostic; the internal infrastructure XR keeps the GCP-specific knobs as an escape hatch for advanced cases. For BYO clusters (source: Existing) the csiDrivers field is descriptive only; Modelplane never installs drivers on customer-managed clusters. The qwen-cached-demo cluster.yaml uses the new shape:
spec.storage.csiDrivers: [SharedFilesystem]
spec.storage.storageClassName: standard-rwx
(2) GCE stockout detection. compose-gke-cluster.detect_capacity_failures() watches nodepool managed-resource conditions for GCE_STOCKOUT / "does not have enough resources" and sets an InsufficientGPUCapacity condition naming the zone and accelerator type. compose-inference-cluster propagates the condition up to the user-facing InferenceCluster so the failure is visible in `kubectl describe inferencecluster` without needing to peel through to the internal GKECluster + nodepool MRs. Caught a real one while bringing up the demo against crossplane-playground: the InferenceCluster sat in Creating for 35 min before we found the stockout in `gcloud container node-pools describe`. With this change a user sees the InsufficientGPUCapacity condition and the offending zone immediately.
(3) Defensive HACKs in setup.sh / cleanup.sh. setup.sh detects zombie Network MRs (Crossplane reports Ready=True but gcloud returns 404 because the create-succeeded annotation sticks even after the underlying resource is gone) and clears the crossplane.io/external-create-{succeeded,pending} annotations so the provider retries the create. Also catches Synced=False MRs stuck on cached provider errors. cleanup.sh force-finalizes stuck workload-cluster Helm releases and provider-kubernetes Objects after the InferenceCluster delete is issued. provider-helm / provider-kubernetes hang on these because the target cluster is being torn down and the delete API calls fail. Both HACK blocks carry TODO comments pointing at the upstream crossplane provider issues. The right fixes belong in provider-gcp (don't trust the create-succeeded annotation forever; reconcile with GCP truth on observe) and provider-helm / provider-kubernetes (treat "target cluster gone" as delete success).
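
The stockout detection described above amounts to scanning condition messages for known markers and turning a hit into a user-facing condition. The sketch below is an approximation, not the body of detect_capacity_failures(): the function name, the condition dict shape, and the `reason` value are assumptions; the two marker strings and the InsufficientGPUCapacity type come from the commit message.

```python
from typing import Dict, Iterable, Optional

STOCKOUT_MARKERS = ("GCE_STOCKOUT", "does not have enough resources")


def detect_capacity_failure(
    conditions: Iterable[Dict[str, str]], zone: str, accelerator: str
) -> Optional[Dict[str, str]]:
    """Return an InsufficientGPUCapacity condition if any message signals a stockout."""
    for cond in conditions:
        message = cond.get("message") or ""
        if any(marker in message for marker in STOCKOUT_MARKERS):
            return {
                "type": "InsufficientGPUCapacity",
                "status": "True",
                "reason": "Stockout",  # hypothetical reason string
                "message": f"zone {zone} has no capacity for {accelerator}: {message}",
            }
    return None
```

Returning None when nothing matches lets the caller leave existing conditions untouched on the healthy path.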

Mermaid flowchart of what the demo composes, from user-facing XRs down to live workload pods. ModelCache primitives (XR, PVC MR, Job MR, mounted PVC) highlighted in yellow to make the new piece visually obvious vs the shared cluster/KServe infrastructure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add ModelService, ModelEndpoint, ModelReplica, Backend, HTTPRoute and the per-replica composition path. Introduce a second highlight color (orange) for the external substrate we may later replace with Modelplane-internal primitives — KServe Release, LWS Release, LLMInferenceService MR, and the LWS gang itself. The yellow ModelCache path and the orange substrate path are now visually distinct, making clear that the cache work is independent of any future engine-substrate swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GCP Filestore Basic HDD (backs the standard-rwx storage class on GKE) takes 8–15 min to provision the first instance. The 10m wait was tight enough to time out on a cold cluster even when the hydration itself worked. 20m gives Filestore comfortable room without burying real failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Filestore CSI driver provisions Filestore Instances via the file.googleapis.com API. If the API is not enabled on the project, PVCs sit Pending indefinitely with SERVICE_DISABLED — the GKE cluster comes up fine, the CSI driver installs fine, the PVC is accepted, but every provisioning attempt fails 403. We hit this on crossplane-playground today and the symptom (PVC Pending forever) is annoying to diagnose without checking workload-cluster PVC events directly. Enabling the API is idempotent and cheap; do it as part of setup so the first PVC the demo creates can actually bind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Filestore CSI addon on GKE installs the in-cluster driver fine on its own, but provisioning calls hit file.googleapis.com. If that API isn't enabled on the project, the symptom is silent: PVCs sit Pending forever with SERVICE_DISABLED in their workload-cluster events while the cluster itself reports healthy. Hit this on crossplane-playground today; took a workload-cluster describe to diagnose. Emit a ProjectService MR for file.googleapis.com when the user opts into Filestore via storage.csiDrivers: [SharedFilesystem]. ProjectService enable takes seconds and runs in parallel with the multi-minute cluster create, so it doesn't extend critical path. Deliberately not enabling container/compute APIs — if those aren't on, the user's GCP project setup is incomplete and they should fix it explicitly. The addon case is different because the user explicitly opted into Filestore via the user-facing API. disableOnDestroy=False so tearing down one InferenceCluster doesn't yank an API the rest of the project relies on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two-part change to fix Filestore PVCs landing on the wrong VPC:
1) compose-gke-cluster surfaces the observed Network MR's name in GKECluster.status.network.name (new field; XRD updated).
2) compose-inference-cluster, in the GKE branch, reads that status field and (when storage.csiDrivers contains SharedFilesystem) composes an Object → StorageClass on the workload cluster with provisioner=filestore.csi.storage.gke.io and parameters.network=<our VPC>. Default SC name is `modelplane-rwx`; user can override via spec.storage.storageClassName.
The user-facing API stays cloud-agnostic: the cloud-specific knob (`network` for Filestore) is wired by the GKE-specific composition branch from the GKE-specific infra XR's status. Same pattern will apply when EKS / AKS branches need to wire fileSystemId / shareName.
Demo's 01-cache.yaml switched to mp-filestore-rwx so today's running cluster (which got the GKE built-in standard-rwx by accident) can still consume the manually-created custom SC. New clusters will get the auto-composed `modelplane-rwx` from the InferenceCluster XR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
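
A minimal sketch of the StorageClass manifest this change composes, assuming a hypothetical builder named `rwx_storage_class`. The provisioner, the `network` parameter, and the default name are taken from the commit message; the real function also wraps the manifest in a provider-kubernetes Object, which is omitted here.

```python
from typing import Any, Dict


def rwx_storage_class(network: str, name: str = "modelplane-rwx") -> Dict[str, Any]:
    """Build a Filestore-backed StorageClass pinned to the cluster's VPC."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "filestore.csi.storage.gke.io",
        # Without this parameter Filestore instances land on the `default` VPC.
        "parameters": {"network": network},
    }
```

The `network` argument would be read from GKECluster.status.network.name, so the builder itself stays free of any lookup logic.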

Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and replacing with a focused page that documents what shipped in this PR: shape, what gets composed, multi-node Ray bootstrap, scope boundaries. Demo proof at examples/qwen-cached-demo/. v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster dedup) are explicitly out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update the diagram and prose to match what actually runs:
- LWS gang shown as distinct leader + worker pods (not one combined node) with the Ray cluster edge between them
- Both pods mounting the cached PVC explicit
- StorageClass MR carries the parameters.network detail
- LIS MR notes the flat PodSpec + Ray-bootstrap command
- Hydration Job notes hf download (not the deprecated CLI)
- GKECluster.status.network.name → StorageClass edge shown
Adds a one-paragraph "what this proves" prose section so a reviewer can read the diagram + caption and understand the v0.1 path end-to-end without spelunking through commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OSS Modelplane doesn't name the future storage substrates beyond "additional backends may land in later versions"; concrete names (and any associated commercial framing) belong outside this repo.
Updates:
- apis/modelcaches/definition.yaml: drop the explicit "ContentAddressed and Custom backends are deferred to v0.2" sentence from the backend description; keep the extension point.
- functions/compose-model-cache/main.py: same swap in the module docstring.
- design/modelcache/README.md: rewrite the out-of-scope table and the forward-compatibility section to talk about generic extension points rather than naming v0.2 backends. Drop the content-addressed / cross-cluster-dedup framing entirely.
Also fix a Mermaid lexer error in TOPOLOGY.md: the `status.network.name` edge label contains dots that the unquoted form parses as identifier separators; quote it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep the v0.1 1-pager focused on what shipped; don't enumerate future features by name in the out-of-scope table. The XRD's extension points (backend enum, replication enum, source union) speak for themselves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LWS reports the gang Ready as soon as both pods reach 1/1, but vLLM inside the leader takes another ~60-120s to load the model from the cached PVC and finish CUDA graph capture before Uvicorn opens. The previous curl ran the moment LWS gang reported Ready and got back `upstream connect error … Connection refused` from the gateway, even though everything was wired correctly. Move the readiness wait into the curl-test pod itself: it loops `curl -f /v1/models` until 200, then sends the chat completion. Avoids the kubectl-run-rm stdout-capture pitfall and keeps the single-pod ergonomic of the original demo.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
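The wait-then-request ordering described above can be sketched in Python. This is only an illustration of the polling logic, not the demo.sh code (which loops curl inside the test pod); `wait_until_serving`, the `probe` callable, and the timeout values are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable


def wait_until_serving(
    probe: Callable[[], int],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns HTTP 200 or the timeout expires.

    `probe` stands in for `curl -f <address>/v1/models`: it returns the HTTP
    status code, or raises OSError while the engine port is still closed.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection refused: the engine has not opened its port yet
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Only once this returns True would the chat-completion request be sent, so a gang that is Ready but still loading weights no longer produces a spurious failure.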

ModelService.status.address is already a full URL with the path prefix (e.g. http://172.18.255.200/ml-team/qwen-cached-demo); the original curl line in demo.sh prepended "http://" and re-appended "/${NS}/qwen-cached-demo" — yielding http://http://172.18.255.200/ml-team/qwen-cached-demo/ml-team/qwen-cached-demo/v1/... DNS then tried to resolve hostname "http", curl failed silently (stderr suppressed for the readiness probe), and the retry loop spun forever. Use the address as-is and append the OpenAI path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
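A small Python sketch of the difference, using the example address from the commit message. The helper name `openai_url` and the `/v1/models` default are assumptions made for illustration; the real fix lives in demo.sh's curl line.

```python
from urllib.parse import urlsplit


def openai_url(address: str, path: str = "/v1/models") -> str:
    """Append an OpenAI-style path to an address that is already a full URL."""
    return address.rstrip("/") + "/" + path.lstrip("/")


address = "http://172.18.255.200/ml-team/qwen-cached-demo"

# What the original curl line effectively built: scheme and prefix doubled.
broken = "http://" + address + "/ml-team/qwen-cached-demo" + "/v1/models"

# With the doubled scheme, the URL parser sees "http" as the host name,
# which is what DNS was then asked to resolve.
broken_host = urlsplit(broken).hostname

# Using the address as-is and appending only the OpenAI path.
fixed = openai_url(address)
```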

The README still described the original side-by-side cached vs uncached demo; the current demo.sh runs a single multi-node TensorPipeline 1×2 LWS gang against the cached PVC. Rewrite to match: T4 quota, the actual phase output, the pod-name pattern, pointer to TOPOLOGY.md. demo.gif is a 4×-speed recording of a warm-cluster run captured via `asciinema rec --command "bash demo.sh"` then `agg --speed 4`. Cache pre-hydrated, LWS gang Ready in 65s, engine serving 86s later, real Qwen chat completion JSON in the final frame. GitHub autoplays the gif inline in the rendered README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add three kubectl-get checkpoints to demo.sh so the recording shows what's actually happening on the cluster rather than just script milestones:

- After cache hydration: print the ModelCache row + the Object MRs composing the PVC + Job on every matched cluster (with Synced / Ready columns).
- After deployment apply: print the user-facing workload tree (ModelDeployment, ModelService, ModelReplica, ModelEndpoint) so a viewer sees how one ModelDeployment fans out to one ModelReplica per cluster plus a ModelEndpoint per replica.
- After LWS gang Ready: print the Object MRs again; the LLMInferenceService MR (the orange-band substrate we may swap later) is now visible alongside the cache MRs.

Same scripted flow, more visible payload for the recording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The custom-columns table tells you the composed MRs exist; the JSON manifest dump shows exactly what got applied to the workload cluster, which is the artifact a viewer of the gif/cast wants to see.

- After cache hydration: pretty-print the PVC manifest. Shows accessModes: [ReadWriteMany], the modelplane-rwx storage class reference, the size.
- After LWS gang Ready: pretty-print the LLMInferenceService manifest. Shows model.uri=pvc://modelcache-..., the worker PodSpec with the Ray-bootstrap shell as container.command, parallelism.{tensor=1, pipeline=2}, and the engine args list.

Uses kubectl's jsonpath to extract spec.forProvider.manifest, then python3 -m json.tool for formatting; no extra deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

compose_project_services composes the ProjectService MR for file.googleapis.com when the user opts into the Filestore CSI addon, but mark_readiness's tracked list didn't include it. Result: even when the MR observes Ready, the function never sets `rsp.desired.resources["projectservice-filestore"].ready`, so Crossplane's XR readiness aggregator keeps the GKECluster XR at Ready=False with the message "Unready resources: projectservice-filestore" forever. Caught on the cold-start full demo recording — setup.sh hit its 20m kubectl-wait timeout sitting on this single missing mark even though every underlying GCP resource was Ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

provider-kubernetes's Object is a namespaced resource. The InferenceCluster XR is cluster-scoped, so any composed namespaced resource needs metadata.namespace set explicitly. Otherwise Crossplane's reconcile errors with

    cannot get composed resource: an empty namespace may not be set when a resource name is provided

and the XR sits Synced=False. This follows the pattern compose_kserve_backend / compose_cluster_provider_config already use (modelplane-system namespace).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ruff reformat on compose-inference-cluster/main.py (CI lint was failing on collapse of the observed_gke_network_name dict-walk).

Existing test golden updates:

- test-model-cache: hydration command now filters lost+found from the emptiness check and uses `hf download` (huggingface-cli was deprecated in huggingface-hub 1.x).
- test-model-replica-with-cache: container args carry --model=/mnt/models when spec.caches is set, so vLLM loads the cached weights instead of falling back to its hardcoded default.
- test-model-replica-multinode: KServe v0.17+ flat-worker shape (worker is a PodSpec, not {size, template}); parallelism.tensor is the per-pod count and parallelism.pipeline carries the pod count; container command carries the Ray-bootstrap shell (leader runs `ray start --head` + execs vLLM, worker runs `ray start --address=$LWS_LEADER_ADDRESS:6379 --block`).

Two new tests for code paths introduced in this PR:

- test-gkecluster-filestore-addon: asserts compose_project_services emits a ProjectService MR for file.googleapis.com when the user opts into the Filestore CSI addon. Catches regressions to the API-auto-enable path.
- test-inference-cluster-csi-shared-filesystem: asserts the compose_gke_storage_classes path emits an Object MR wrapping a workload-cluster StorageClass with metadata.namespace set (regression we hit live: without namespace, Crossplane errors with "an empty namespace may not be set when a resource name is provided"). Also asserts parameters.network is pinned to the observed VPC from GKECluster.status.network.name.

All 20 composition tests pass. ./nix.sh flake check also passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Earlier commit switched 01-cache.yaml from `standard-rwx` (GKE's built-in Filestore SC) to `modelplane-rwx` (the SC the InferenceCluster composition auto-creates with parameters.network pinned to our VPC), but cluster.yaml's spec.storage.storageClassName was missed and still said `standard-rwx`. That caused the composed Object MR to try to create a StorageClass named `standard-rwx`, which conflicts with the GKE-managed built-in (`parameters` is immutable on existing SCs), so the Object MR sat Synced=False forever and the InferenceCluster couldn't reach Ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

StorageClass doesn't carry a Ready condition of its own, so the DeriveFromObject readiness policy provider-kubernetes uses keeps the wrapping Object MR at Ready=False forever — blocking the InferenceCluster XR from going Ready even though the SC was applied successfully on the workload cluster. Switch the policy to SuccessfulCreate, which marks the MR Ready as soon as the SC apply succeeds. For a config-only resource that's the actual readiness signal we care about. Test golden updated to match. Also drop the unused gkecluster import in the new test-gkecluster-filestore-addon test that ruff flagged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
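As a rough picture of the composed resource, the Python sketch below builds the wrapping Object as a plain dict. It is not the code in compose-inference-cluster. The Object `apiVersion` is an assumption about the namespaced provider-kubernetes API group, and `storage_class_object` and its arguments are hypothetical names. The namespace, StorageClass name, `parameters.network`, and the SuccessfulCreate policy come from the commit messages.

```python
def storage_class_object(vpc_network: str, provider_config: str) -> dict:
    """Build an Object MR that wraps the workload-cluster StorageClass."""
    return {
        # Assumed API group/version for the namespaced Object kind.
        "apiVersion": "kubernetes.m.crossplane.io/v1alpha1",
        "kind": "Object",
        # The XR is cluster-scoped, so the namespace must be set explicitly.
        "metadata": {"namespace": "modelplane-system"},
        "spec": {
            # StorageClass has no Ready condition, so readiness is derived
            # from a successful apply rather than from the object itself.
            "readiness": {"policy": "SuccessfulCreate"},
            "providerConfigRef": {
                "kind": "ClusterProviderConfig",
                "name": provider_config,
            },
            "forProvider": {
                "manifest": {
                    "apiVersion": "storage.k8s.io/v1",
                    "kind": "StorageClass",
                    # Distinct from GKE's built-in standard-rwx class.
                    "metadata": {"name": "modelplane-rwx"},
                    "provisioner": "filestore.csi.storage.gke.io",
                    "parameters": {"network": vpc_network},
                }
            },
        },
    }
```

Calling `storage_class_object("demo-vpc", "qwen-cached-demo-cluster-kubeconfig")` (the VPC name here is made up) yields a manifest whose StorageClass name cannot collide with the GKE-managed `standard-rwx`.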

compose_gke_storage_classes composed the Object MR for the workload-cluster StorageClass with the correct manifest, but never set rsp.desired.resources["storage-class-rwx"].ready when the observed MR was Ready. Crossplane's auto-readiness aggregator left it at READY_UNSPECIFIED and the InferenceCluster XR sat Ready=False with "Unready resources: storage-class-rwx" even when the SC was applied successfully on the workload cluster. Read the observed condition and mirror it into the response, the same pattern the function already uses for `gke-cluster`.

Test backfill: add an observed storage-class-rwx Object with Ready=True to the inference-cluster-csi-shared-filesystem test so the propagation code path is at least exercised. The compositiontest framework asserts manifest shape but doesn't directly assert on the response-side ready flag, so this is regression-friendly more than regression-proof.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
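The mirroring pattern can be sketched with plain dicts standing in for the function's observed and desired resources. This is a simplified model, not the Crossplane function SDK: `is_ready` and `mirror_readiness` are hypothetical helpers, and the boolean `ready` key stands in for the response-side ready flag.

```python
def is_ready(observed: dict) -> bool:
    """True when the observed resource carries a Ready=True condition."""
    conditions = observed.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def mirror_readiness(
    observed: dict[str, dict],
    desired: dict[str, dict],
    tracked: list[str],
) -> None:
    """Copy observed readiness onto each tracked desired resource.

    A composed resource missing from `tracked` is never marked ready,
    so the XR would keep reporting it as an unready resource.
    """
    for name in tracked:
        if name in desired and name in observed and is_ready(observed[name]):
            desired[name]["ready"] = True
```

The same sketch covers the earlier `projectservice-filestore` case: the resource was observed Ready, but because its name was absent from the tracked list, the flag was never set.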

If any other shell mutates ~/.kube/config (e.g. a `gcloud container clusters get-credentials` to peek at the workload cluster), every subsequent kubectl call in a running demo.sh / setup.sh / cleanup.sh silently retargets to whatever the new current-context is. We hit this live during a recording: the workload-cluster context took over mid-flight and `kubectl get modelcache` blew up because the workload cluster doesn't have the ModelCache CRD. Capture `current-context` at script start (or honor an explicit MODELPLANE_CONTEXT env var) and define a `kubectl` shell function that passes `--context=$KCTX` through to every call. Subsequent context flips elsewhere on the box don't affect the running script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
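The scripts do this with a `kubectl` shell function. The Python sketch below shows the same idea under different packaging: resolve the context once, then pass it explicitly on every call. The helper names are hypothetical; `MODELPLANE_CONTEXT` and `kubectl config current-context` are taken from the commit message.

```python
import os
import subprocess
from typing import Callable, Mapping


def read_current_context() -> str:
    """Ask kubectl which context is current right now."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def resolve_context(
    env: Mapping[str, str] = os.environ,
    read_current: Callable[[], str] = read_current_context,
) -> str:
    """Prefer an explicit MODELPLANE_CONTEXT, else capture the current one."""
    return env.get("MODELPLANE_CONTEXT") or read_current()


def kubectl_argv(context: str, *args: str) -> list[str]:
    """Build a kubectl command line pinned to the captured context."""
    return ["kubectl", f"--context={context}", *args]
```

Resolving the context once at start and reusing the value is what makes later edits to ~/.kube/config irrelevant to the running script.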

The kserve-llmisvc-resources Helm chart ships the LLMInferenceServiceConfig CRD but does NOT create the six default instances the controller looks up via spec.baseRefs on every LLMInferenceService admission:

- kserve-config-llm-default
- kserve-config-llm-router-route
- kserve-config-llm-worker-tensor-parallel
- kserve-config-llm-worker-pipeline-parallel
- kserve-config-llm-decode
- kserve-config-llm-prefill

Without them, every LIS admission fails with:

    PresetsCombined False CombineBaseError failed to get LLMInferenceServiceConfig "kserve-config-llm-worker-pipeline-parallel" … not found

and the controller silently retries forever. We hit this twice during demo recordings; a memory note from a prior debugging session called it out as something compose-kserve-backend should automate. Doing that now.

Compose six Object MRs (one per preset) targeting the workload cluster with spec: {}; empty is enough to satisfy the lookup. Gated on the kserve-controller resource being observed so we don't race the CRD install. SuccessfulCreate readiness because LLMInferenceServiceConfig has no Ready condition of its own.

Also: shellcheck fix on demo / setup / cleanup scripts. The kubectl shell function was being called before its definition. Use `command kubectl` for the bootstrap kubectl-config-read so shellcheck stops flagging SC2218.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Every ==> step in setup.sh and demo.sh now has:

- A one-line explanation of what's happening behind the script in Modelplane terms (what's being composed, what the resource shape means architecturally) so a viewer of the gif understands the story without reading the demo source.
- An elapsed timer that prints "✓ <phase> in Ns" when each major milestone completes (cache hydration, gang Ready, service address, engine serving) and a total time at the end.

setup.sh tracks the InferenceCluster Ready transition time + total setup time. demo.sh already tracked per-phase times; this commit adds the prose context and a total at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a section to demo.sh that fetches workload-cluster credentials via gcloud, lists the gang pods with their node placement, then execs into each pod to show:

- the /mnt/models mount line (same NFS endpoint on both pods)
- stat on model.safetensors (same inode + size on both pods)

Together that proves the LWS leader and worker are reading from the exact same shared PVC backed by the same Filestore instance, on different nodes. Skips gracefully if gcloud / gke-gcloud-auth-plugin isn't available.

Uses bracket jsonpath for the dot-containing annotation key (`crossplane\.io/external-name`) and a single-quoted sh -c body for the in-pod commands to avoid nested-quoting hell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bug: the preset Object MRs were composed with providerConfigRef.name pointed at the helm/k8s ProviderConfig (`<xr-name>-cluster`, `qwen-cached-demo-kserve-cluster`), but Object MRs need a ClusterProviderConfig, and the one configured for the workload cluster lives at `<inferencecluster-name>-cluster-kubeconfig` (`qwen-cached-demo-cluster-kubeconfig`), composed by compose-inference-cluster. Result: every preset Object MR sat Synced=False with

    CannotConnectToProvider: cannot get provider config: ClusterProviderConfig … "qwen-cached-demo-kserve-cluster" not found

which propagated up to InferenceCluster sitting Ready=False, which blew through setup.sh's kubectl wait timeout on the cold-start recording. The KServeBackend XR is named `<inferencecluster>-kserve`, so we strip that suffix to derive the parent name + the correct CPC name.

Test: new test-backend-presets case observes the kserve-controller Helm release (the gate condition) and asserts on each of the six composed preset Object MRs: manifest shape AND providerConfigRef target. Catches the wrong-PC-name copy-paste class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
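The name derivation is small enough to show directly. The function name below is hypothetical and the real code may be structured differently; the naming conventions (`<inferencecluster>-kserve` and `<inferencecluster-name>-cluster-kubeconfig`) are the ones stated in the commit message.

```python
def cluster_provider_config_name(backend_xr_name: str, suffix: str = "-kserve") -> str:
    """Derive the workload cluster's ClusterProviderConfig name.

    The KServeBackend XR is named `<inferencecluster>-kserve`, so the
    parent InferenceCluster name is the XR name without that suffix.
    """
    if not backend_xr_name.endswith(suffix):
        raise ValueError(f"{backend_xr_name!r} does not end with {suffix!r}")
    parent = backend_xr_name[: -len(suffix)]
    return f"{parent}-cluster-kubeconfig"


# The buggy variant appended "-cluster" to the XR name instead.
wrong = "qwen-cached-demo-kserve" + "-cluster"
right = cluster_provider_config_name("qwen-cached-demo-kserve")
```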

The kserve-llmisvc-resources Helm chart v0.16.0 already creates this particular preset. If our function composes it first via Object MR (no Helm ownership labels), the Helm install errors with:

    LLMInferenceServiceConfig "kserve-config-llm-router-route" in namespace "kserve" exists and cannot be imported into the current release: invalid ownership metadata

…and the whole kserve-controller release sits Synced=False forever. Cut router-route from _LLMISVC_PRESET_NAMES and only compose the five the chart actually leaves blank. Test golden updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
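A minimal sketch of the resulting preset list and the empty-spec manifests, written as plain Python. The constant and function names are illustrative rather than the ones in compose-kserve-backend, and the `serving.kserve.io/v1alpha1` apiVersion is an assumption. The preset names, the `kserve` namespace, and the empty `spec` come from the commit messages.

```python
ALL_PRESETS = [
    "kserve-config-llm-default",
    "kserve-config-llm-router-route",
    "kserve-config-llm-worker-tensor-parallel",
    "kserve-config-llm-worker-pipeline-parallel",
    "kserve-config-llm-decode",
    "kserve-config-llm-prefill",
]

# Created by the Helm chart itself; composing it too causes an
# "invalid ownership metadata" conflict on Helm install.
CHART_OWNED = {"kserve-config-llm-router-route"}

COMPOSED_PRESETS = [name for name in ALL_PRESETS if name not in CHART_OWNED]


def preset_manifest(name: str, namespace: str = "kserve") -> dict:
    """Empty LLMInferenceServiceConfig; an empty spec satisfies the lookup."""
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",  # assumed API version
        "kind": "LLMInferenceServiceConfig",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {},
    }


manifests = [preset_manifest(name) for name in COMPOSED_PRESETS]
```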

Second recording captures the demo.sh path where the InferenceCluster is already Ready but the cache hasn't been hydrated yet. End-to-end: ModelCache apply + hydration → ModelDeployment + ModelService apply + workload-tree dump → LWS gang spin-up → composed-MR + LIS-manifest inspection → the new "prove both gang pods do IO from the same PVC" block (pod placement on two nodes + matching NFS endpoint + matching safetensors inode) → real Qwen chat-completion response. Existing demo.gif (warm cluster, cache pre-hydrated, gang Ready in 65s) stays — they show different parts of the lifecycle. README now embeds both with captions noting what's pre-hydrated in each. GIF rendered with `agg --speed 4 --idle-time-limit 1` so dead time during long phases (Filestore provision wait, LWS gang image pull, vLLM CUDA-graph capture) collapses to 1s for readability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

agg --speed 2 --idle-time-limit 1 — keeps the dead-time collapse but lets a viewer actually follow the per-step output. File size goes 363 KB → 406 KB, well under GitHub's inline-autoplay budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous renders used --speed 2 / --speed 4, which sped up the output stream globally, so phase headers and kubectl-get blocks flashed by faster than a viewer could read. The real fix is to keep playback at real-time and only collapse the *between-phase* idle gaps (which are dominated by Filestore provisioning + LWS image pull + CUDA-graph capture: minutes of dead time that don't add value to the gif).

agg --speed 1 --idle-time-limit 5: pauses longer than 5s clamp to 5s, everything else plays at the recorded rate. Each kubectl-get block now lingers long enough to actually read. File size 363 KB → 458 KB; under GitHub's inline-autoplay budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
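The effect of an idle-time limit on a recording's timeline can be illustrated with a few lines of Python. This shows only the clamping arithmetic described above, not agg's implementation; `clamp_idle` and the sample timestamps are invented for the example.

```python
def clamp_idle(timestamps: list[float], idle_limit: float) -> list[float]:
    """Rewrite event timestamps so no gap between events exceeds idle_limit.

    Gaps shorter than the limit are kept as recorded, so output that
    streams normally still plays back at its original pace.
    """
    clamped = []
    previous = 0.0
    elapsed = 0.0
    for t in timestamps:
        elapsed += min(t - previous, idle_limit)
        clamped.append(elapsed)
        previous = t
    return clamped


# Two quick events, a two-minute wait, then one more event.
recorded = [1.0, 2.0, 122.0, 123.0]
played = clamp_idle(recorded, idle_limit=5.0)  # [1.0, 2.0, 7.0, 8.0]
```

The 120-second wait shrinks to 5 seconds while the one-second gaps around it are untouched, which is why the kubectl-get blocks remain readable.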

dennis-upbound mentioned this pull request May 14, 2026

WIP: ModelCache design doc + examples #76

Closed

dennis-upbound changed the title ~~WIP: Scaffold the ModelCache primitive~~ v0.1 ModelCache + multi-node LWS unblock May 16, 2026

dennis-upbound marked this pull request as ready for review May 16, 2026 01:09

negz force-pushed the demonstration branch from 4d90225 to 7801b9b Compare May 18, 2026 18:57

Dennis Ramdass and others added 25 commits May 20, 2026 10:22

Fix Cluster status item class name

3f50182

datamodel-codegen names inline array item types after the singular property name; the generated class is `Cluster`, not `ClustersItem`. Caught by `up test run tests/test-model-cache`.

Apply ruff format to compose-model-cache and its test

f3f24ae

Add TODO for ModelDeployment integration after #75 lands

e4555c6

Update docstring: ModelDeployment integration is now wired

963a1d9

Quote elapsed-call args in demo.sh (shellcheck SC2086)

fa51b8c

Dennis Ramdass and others added 25 commits May 20, 2026 10:31

Remove ui/ accidentally re-added during rebase

c61136c

PR #75 deleted ui/ entirely. An earlier commit on this branch swept ui/frontend/node_modules into the index via git add -A, so the rebase faithfully re-added ~10k files (~2.5M lines, ~181M) on top of the deletion. Drop them.

dennis-upbound force-pushed the dennis/modelcache-impl branch from 6eb1a25 to c61136c Compare May 20, 2026 17:40

dennis-upbound marked this pull request as draft May 20, 2026 17:40

dennis-upbound closed this May 21, 2026

dennis-upbound mentioned this pull request Jun 4, 2026

Drop KServe — dispatch to native + llm-d (Dynamo-ready) #99

Merged

dennis-upbound deleted the dennis/modelcache-impl branch June 19, 2026 17:26

v0.1 ModelCache + multi-node LWS unblock #78

dennis-upbound wants to merge 59 commits into demonstration from dennis/modelcache-impl

dennis-upbound commented May 14, 2026 (edited)

What ships

Multi-node LWS — what was broken and what we did

Out of scope for v0.1

Workarounds / hacks (cleanup before / after merge)

Composition-code hacks

Mid-session debugging interventions captured so they don't repeat

What needs to happen to merge

Must

Should

Nice to have

Demo flow

Related

dennis-upbound commented May 17, 2026

Known issue: control-plane envoy proxy gets stuck on xds during long demos
