v0.1 ModelCache + multi-node LWS unblock #78

Closed
dennis-upbound wants to merge 59 commits into demonstration from dennis/modelcache-impl

dennis-upbound (Collaborator) commented May 14, 2026

Warm-cluster demo recording

Warm-cluster run: cache pre-hydrated, LWS gang Ready in 65s, engine answers 86s after gang Ready. 4× speed.


Adds the v0.1 ModelCache primitive (XRD + composition function + ModelDeployment.spec.caches[] wiring) and turns multi-node TensorPipeline serving from broken-by-default into a working path. See Multi-node LWS — what was broken and what we did below for the specific issues we untangled.

Base: demonstration (#75) so the diff stays scoped. GitHub will rebase to main automatically when #75 merges.

What ships

ModelCache — composes a per-cluster RWX PVC + a one-shot hydration Job on every matched InferenceCluster. Pods that reference the cache via ModelDeployment.spec.caches[] get the PVC mounted at the engine's expected path; no per-replica HuggingFace pull at boot.

| Aspect | v0.1 values |
| --- | --- |
| Artifact kinds | Weights, Tokenizer, Bytes |
| Sources | huggingFace, s3, http, inline implemented; oci, configMap discriminators reserved |
| Storage backends | PVC (Modelplane-managed), ExistingPVC (customer-managed) |
| Replication | AllMatchingClusters |
| Deployment wiring | ModelDeployment.spec.caches: [{ name }] → LLMInferenceService.spec.model.uri = pvc://... |

Cloud-agnostic CSI capability plumbing. InferenceCluster.spec.storage.csiDrivers: [SharedFilesystem | ObjectStorageMount | BlockDevice] declares the cluster's semantic capability. The GKE composition branch reads the underlying VPC name from GKECluster.status.network.name and auto-composes a modelplane-rwx StorageClass on the workload cluster with parameters.network pinned. Same pattern will hook into EKS / AKS with their respective knobs.

GCP API auto-enable. compose-gke-cluster emits a ProjectService MR for file.googleapis.com whenever the user opts into SharedFilesystem.

1-pager design: design/modelcache/README.md. Replaces the earlier verbose design doc (PR #76, closed in favor of this focused page).

End-to-end demo: examples/qwen-cached-demo/ deploys Qwen 2.5 0.5B on a 2-pod LWS gang with a real chat-completion response. See TOPOLOGY.md for the composed XR / MR layout.

Tests: model-cache × 3 + model-replica-with-cache + the full existing suite — 20/20 pass.

Multi-node LWS — what was broken and what we did

Bringing the demo up needed fixes layered across the function, the LIS shape, the cache hydration command, the engine wrapper, and the cluster's StorageClass. Each row below is something an OSS user would hit on a fresh cluster; calling them out so reviewers know what each fix buys us.

| Problem | Fix |
| --- | --- |
| No Ray bootstrap on the gang. vLLM with --pipeline-parallel-size > 1 needs a Ray cluster spanning the gang's pods. KServe's ServingRuntime injects this for you; the custom-template path we use doesn't. Worker pod never joined the leader's Ray cluster, vLLM's placement group hung forever. | compose-model-replica injects a 5-line shell wrapper as the container command when topology.strategy == TensorPipeline. Leader runs ray start --head then execs the engine; worker runs ray start --address=$LWS_LEADER_ADDRESS:6379 --block. Bootstrap lives as a Python constant: when we swap KServe for our own LeaderWorkerSet composition later, the same constant moves verbatim. |
| KServe v0.17+ flat worker schema. The existing function generated worker: {size, template} (v0.16 shape); v0.17+ wants worker as a flat PodSpec with containers[]. parallelism.tensor is now per-pod GPU count and parallelism.pipeline is required when worker is present. | Switch the generated LIS shape to flat worker: + set parallelism.{tensor,pipeline} from the topology. |
| KServe v0.17+ stopped injecting --model into engine argv. model.uri = pvc://... mounts the PVC at /mnt/models but the engine container's args don't get `--model=<path>` anymore. vLLM defaulted to facebook/opt-125m and silently downloaded it from HF instead of using the mounted cache. | Append --model=/mnt/models to the engine args when spec.caches[] is set. |
| vllm serve vs python -m vllm.entrypoints.openai.api_server. vllm serve takes the model as a positional argument; the api_server entrypoint (the image's default ENTRYPOINT) takes --model=. The first wrapper invoked vllm serve "$@", which silently exited on argparse mismatch. | Wrapper invokes the same python -m vllm.entrypoints.openai.api_server the default ENTRYPOINT uses. Same arg shape for single-node and multi-node. |
| Filestore CSI defaults to the GCP default VPC. GKE's chart-shipped standard-rwx StorageClass has no network parameter, so Filestore instances landed on default and were unreachable from cluster nodes; PVCs sat Pending forever with NFS mount timeouts. | compose-inference-cluster (GKE branch) auto-composes a workload-cluster StorageClass (modelplane-rwx) with `parameters.network = <our VPC>`. Network surfaced via GKECluster.status.network.name. |
| lost+found triggered a false-positive skip. The hydrate Job's "already populated?" check did ls -A /mnt/model. Filestore's ext4-style filesystem puts lost+found at mount root, so a fresh PVC always looked non-empty; the Job exited with "artifact already hydrated, skipping" and the cache stayed Ready with no weights on disk. | Filter lost+found out of the emptiness check. |
| huggingface-cli was deprecated in huggingface-hub 1.x and the [cli] extra removed. The hydrate Job's pip install huggingface_hub[cli] printed a warning and huggingface-cli download exited with a deprecation notice, no bytes written. | Switch to hf download (new CLI, same flags). |
| Filestore API isn't enabled by default on fresh GCP projects. PVCs sit Pending indefinitely with SERVICE_DISABLED in their workload-cluster events; the GKE cluster itself looks healthy. | compose-gke-cluster emits a ProjectService MR for file.googleapis.com whenever the Filestore CSI addon is on. |
| T4 / Turing doesn't support bfloat16, and Qwen 2.5 ships in bf16. vLLM exits with ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. | Not a Modelplane-layer issue; documented as an example concern. Demo deployment passes --dtype=half. |
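
The lost+found row is easy to reproduce without a cluster. The following is a minimal Python sketch of an emptiness check that ignores filesystem-created entries; the hydrate Job in this PR does its check in shell, and the names `is_hydrated` and `IGNORED_ENTRIES` are hypothetical, chosen only for illustration.

```python
import os
import tempfile

# Entries a freshly formatted volume can contain before any hydration.
# Assumed list for illustration; the PR only names lost+found.
IGNORED_ENTRIES = {"lost+found"}


def is_hydrated(mount_root: str) -> bool:
    """Return True only if the mount holds something other than ignored entries."""
    entries = set(os.listdir(mount_root)) - IGNORED_ENTRIES
    return bool(entries)


# Simulate a fresh Filestore-backed PVC: only lost+found at the mount root.
fresh_pvc = tempfile.mkdtemp()
os.mkdir(os.path.join(fresh_pvc, "lost+found"))
print(is_hydrated(fresh_pvc))  # prints False; a bare "non-empty?" check would say True
```

A set difference keeps the ignore list in one place, so another filesystem-specific entry could be added later without touching the check itself.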

Net result: a fresh kubectl apply -f examples/qwen-cached-demo/ provisions a GKE cluster with Filestore CSI enabled and a VPC-pinned StorageClass, hydrates Qwen 2.5 0.5B onto a 100 GiB RWX PVC, and boots a 2-pod LWS gang on two T4 nodes. Both pods mount the same PVC, the leader's Ray head and the worker's Ray client form one cluster, vLLM's pipeline-parallel placement group is satisfied, and curl /v1/chat/completions returns a real Qwen response with HTTP 200.
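
For reviewers who want a mental model of the Ray bootstrap before opening compose-model-replica, here is a rough Python sketch of how such a wrapper could be held as constants and turned into a container command. The constant and function names, the explicit `--port` flag, and the per-role split are assumptions for illustration; only the `ray start` invocations and the api_server entrypoint come from the description above.

```python
# Hypothetical names; the PR describes the behaviour, not these identifiers.
RAY_PORT = 6379

# Leader: start the Ray head, then replace the shell with the engine process.
LEADER_BOOTSTRAP = (
    "set -e\n"
    f"ray start --head --port={RAY_PORT}\n"
    'exec python -m vllm.entrypoints.openai.api_server "$@"\n'
)

# Worker: join the leader's Ray cluster and block so the container stays up.
WORKER_BOOTSTRAP = (
    "set -e\n"
    f'exec ray start --address="$LWS_LEADER_ADDRESS:{RAY_PORT}" --block\n'
)


def wrapped_command(role: str) -> list:
    """Build the container command for a gang member ("leader" or "worker")."""
    script = LEADER_BOOTSTRAP if role == "leader" else WORKER_BOOTSTRAP
    # The trailing "--" becomes $0, so the container's args arrive in "$@".
    return ["/bin/sh", "-c", script, "--"]


print(wrapped_command("worker")[2])
```

Because Kubernetes appends the container's args after the command, the engine flags end up in `"$@"` for the leader without any extra templating.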

Out of scope for v0.1

  • Adapter / Engine artifact kinds — composition shape needs more design before locking
  • Storage backends beyond PVC / ExistingPVC — the backend enum is the extension point
  • AllMatchingNodes replication granularity — per-node staging is a separate problem
  • Signal-bus emission (hydration latency, bytes staged); depends on #74 (Fleet signal bus: capture layer for cross-cluster signals (v0.1))

Workarounds / hacks (cleanup before / after merge)

Real composition-code workarounds shipping in this PR. The demo scripts under examples/qwen-cached-demo/ have their own pile of provider-gcp / provider-helm / provider-kubernetes workarounds — those will get gitignored and graduate into a smoke-test fixture, so they're not called out here.

Composition-code hacks

| Where | What | Upstream root cause | Cleanup |
| --- | --- | --- | --- |
| functions/compose-gke-cluster/main.py (TODO L129) | ZONE_RESOURCE_POOL_EXHAUSTED isn't surfaced through the nodepool MR's conditions; GCE keeps the pool in PROVISIONING forever. | GCE/GKE not propagating instance-level capacity failures up to the pool's status. | Add an IGM-status check or gcloud query in the function (bigger lift, defer). |
| functions/compose-kserve-backend/main.py (LLMInferenceServiceConfig presets) | Composes five empty spec: {} LLMInferenceServiceConfig instances on the workload cluster because the kserve-llmisvc-resources chart defines the CRD but doesn't ship the controller-required defaults. | KServe chart packaging gap; the chart should land these. | File upstream KServe issue. Once they ship the defaults, drop the preset compose code. |
| functions/compose-kserve-backend/main.py (router-route exclusion) | We deliberately skip kserve-config-llm-router-route because the chart DOES include it, and naive composition leads to Helm "exists and cannot be imported" failures. The list-overlap is fragile across KServe versions. | Same chart packaging gap. | Same as above. |

Mid-session debugging interventions captured so they don't repeat

| Bug | What we fixed | Test that now catches it |
| --- | --- | --- |
| Workload-cluster Object MR for storage-class-rwx composed without metadata.namespace set; Crossplane reconcile errored with "an empty namespace may not be set when a resource name is provided". | Added metadata.namespace=modelplane-system to the composed Object. | test-inference-cluster-csi-shared-filesystem asserts on the namespace. |
| storage-class-rwx Object used policy: DeriveFromObject readiness, but StorageClass has no Ready condition, so the Object MR stayed Ready=False forever. | Switched to policy: SuccessfulCreate. | Same test. |
| Function composed the SC Object but didn't propagate observed Ready into rsp.desired.resources["storage-class-rwx"].ready, so the XR aggregator left the InferenceCluster Ready=False. | Read observed condition and set READY_TRUE on the response. | Same test (regression-friendly; the compositiontest framework doesn't expose .ready for direct assertion). |
| compose-gke-cluster didn't track projectservice-filestore in mark_readiness; XR stuck on "Unready resources: projectservice-filestore" even after the API was enabled. | Added it to the tracked-resource list. | test-gkecluster-filestore-addon. |
| Preset Object MRs pointed at `<xr-name>-cluster` (the Helm ProviderConfig) instead of `<inferencecluster>-cluster-kubeconfig` (the ClusterProviderConfig). | Strip the -kserve suffix from the KServeBackend XR name to derive the parent + the right CPC name. | test-backend-presets asserts on providerConfigRef.name. |
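
Two of the rows above come down to the same step: a composed resource only counts as ready if the function copies the observed Ready condition onto its desired entry. The sketch below shows that step with plain dicts standing in for the function SDK's request and response objects; the helper names and the dict shape are assumptions, and the real function sets READY_TRUE on the response rather than a boolean.

```python
def object_is_ready(observed: dict) -> bool:
    """True if the observed resource reports a Ready=True condition."""
    conditions = observed.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def mark_ready(desired: dict, observed: dict, tracked: list) -> None:
    """Flip the ready flag on each tracked desired resource that is observed Ready."""
    for name in tracked:
        if name in desired and object_is_ready(observed.get(name, {})):
            desired[name]["ready"] = True


# Stand-in data: the StorageClass Object is Ready, the ProjectService is not yet observed.
desired = {"storage-class-rwx": {"ready": False}, "projectservice-filestore": {"ready": False}}
observed = {
    "storage-class-rwx": {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
}
mark_ready(desired, observed, ["storage-class-rwx", "projectservice-filestore"])
print(desired)
```

Keeping the tracked names in one explicit list makes an omission like the projectservice-filestore one visible in review.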

What needs to happen to merge

Must

  • Rebase to main. PR is currently based on demonstration (#75, Implement the updated API shape) so the diff stays scoped. GitHub will fast-forward once #75 lands.
  • Migrate spec.engine → spec.workers.template per Nic's merged commit on PR #64 (the engine field is now a curated subset of PodTemplateSpec). Affects apis/modeldeployments/definition.yaml, apis/modelreplicas/definition.yaml, functions/compose-model-replica/main.py.
  • Gitignore the examples/qwen-cached-demo/ demo scripts (or relocate under an ignored path). They'll graduate into a smoke-test fixture; not part of the merged surface area.
  • Review.

Should

  • File upstream KServe issue for the chart packaging gap (missing default LLMInferenceServiceConfig instances + kserve-llmisvc-resources Helm release blocked by chart-shipped vs externally-created preset overlap).

Nice to have

  • kind-cluster smoke test in CI that reconciles Crossplane against the package and asserts XR Ready aggregation. Would have caught half the bugs we hit during this PR before they hit a live demo.

Demo flow

export GCP_PROJECT=my-gcp-project
./setup.sh         # ~5-10 min on first run: GKE provision + Filestore CSI install + stack
./demo.sh          # cache hydrate + LWS gang up + curl
./cleanup-demo.sh  # iterate: deletes workload, cluster + infra stay
./cleanup.sh       # full teardown

Recordings of the warm-cluster demo and the full cold-start flow attached in a follow-up comment.

Related

#74 Fleet signal bus (status emission, future) · #75 API shape parent · PR #76 (closed; design folded into design/modelcache/README.md)

dennis-upbound changed the title from "WIP: Scaffold the ModelCache primitive" to "v0.1 ModelCache + multi-node LWS unblock" May 16, 2026
dennis-upbound marked this pull request as ready for review May 16, 2026 01:09
dennis-upbound (Collaborator, Author) commented

Known issue: control-plane envoy proxy gets stuck on xds during long demos

Hit this twice now during cold-start demo recordings — flagging so it's not lost.

Symptom: the control-plane envoy proxy pod (envoy-modelplane-system-modelplane-* in the envoy-gateway-system namespace) goes to 1/2 Ready and the Gateway reports PROGRAMMED=False even though the InferenceCluster + ModelService composition is fine. Curl from inside the cluster gets Connection refused on 172.18.255.200:80, not a 5xx.

Root cause (suspected): the proxy's envoy container can't reach its xds source. Container logs show repeated:

[warning][config] DeltaAggregatedResources gRPC config stream to xds_cluster closed: 14,
  upstream connect error or disconnect/reset before headers.
  reset reason: connection timeout

The long-running gRPC config stream from the envoy proxy to the envoy-gateway control plane goes stale during multi-hour sessions, and the proxy never recovers on its own.

Workaround (what unblocked both demos):

kubectl delete pod -n envoy-gateway-system envoy-modelplane-system-modelplane-<hash> --wait=false

The Deployment recreates the pod, the new proxy gets a fresh xds stream, the Gateway flips to PROGRAMMED=True within ~15s, and routing works again.

Why this matters here: not something this PR introduces, but it bites the demo flow when the recording session is long enough — by the time we get to the curl step the proxy has already drifted. Worth either:

  1. Filing upstream on envoy-gateway (long-running xds gRPC stream not self-healing) — the symptom + fix matches several open issues there.
  2. Adding a defensive controller on Modelplane's side that watches the Gateway condition and bounces the proxy pod when PROGRAMMED flips False for >N minutes.
  3. (Demo-only) kubectl rollout restart of the envoy proxy as the first step of demo.sh so the recording can't catch it mid-drift. Trivial but not addressing the underlying issue.

Not blocking for this PR.

negz force-pushed the demonstration branch from 4d90225 to 7801b9b May 18, 2026 18:57
Dennis Ramdass and others added 25 commits May 20, 2026 10:22
Composes a PVC + a one-shot hydration Job per matched InferenceCluster.
v0.1 scope: Weights kind, PVC backend, HuggingFace + S3 sources,
replication = AllMatchingClusters. ContentAddressed / Custom backends,
Tokenizer / Bytes / Adapter / Engine kinds, BYO ExistingPVC, and
per-cluster selector overrides are deferred.

Out of scope here: ModelDeployment integration. The mount-injection
that attaches a cache's PVC to a model serving pod lives in
compose-model-replica and is deferred until the new ModelDeployment
shape (PR #75) stabilizes.

Adds:
- apis/modelcaches/{definition,composition}.yaml
- functions/compose-model-cache/main.py
- examples/cache/model-cache-basic.yaml

Design: #76.
Apply patterns from skills/crossplane-python-functions:
- Cast XRD int fields with int() — protobuf delivers Quantity sizes
  as Python float (`200.0Gi` ≠ valid Kubernetes Quantity)
- Split per-source hydration into _hf_hydration / _s3_hydration module
  functions so the discriminator dispatch is one line
- Separate composition from observation: compose_cluster_resources()
  only emits Objects; derive_cluster_phase() reads observed state;
  mark_ready_resources() flips ready flags AFTER resource.update()
- Add transition events on first compose and on first full readiness
  (one-shot, not steady-state, to keep `kubectl describe` quiet)
- Extract _wrap_remote / _observed_remote_status helpers and a
  HydrationSpec dataclass to replace the tuple return

New composition test: tests/test-model-cache/{main.py,xr.yaml}.
Mocks a single ready InferenceCluster via extraResources and asserts
the PVC + Job Objects compose with the expected manifests on the
workload cluster.
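
The int() cast in the first bullet matters because numbers in a protobuf Struct arrive in Python as floats. A tiny sketch of the failure and the fix follows; the helper name `pvc_storage_request` and the Gi unit are assumptions for illustration, not the function's actual code.

```python
def pvc_storage_request(size_gib) -> str:
    """Render a numeric size as a Kubernetes Quantity string in Gi."""
    # int() drops the trailing ".0" that a float would carry into the string.
    return f"{int(size_gib)}Gi"


size_from_xr = 200.0  # how a Struct number value shows up in Python
print(f"{size_from_xr}Gi")                # 200.0Gi
print(pvc_storage_request(size_from_xr))  # 200Gi
```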
datamodel-codegen names inline array item types after the singular
property name; the generated class is `Cluster`, not `ClustersItem`.
Caught by `up test run tests/test-model-cache`.
Stops these from polluting `git status` and accidentally getting
committed:

- `__pycache__/` and `*.pyc` — Python bytecode caches, regenerated
  by every test run
- `.DS_Store` — macOS Finder metadata
- `.venv-test/` — local test virtualenv (mirrors existing `.venv`)
- `opencode.json` — per-user opencode tool config; contains a local
  endpoint URL, no shared value
XRD now matches the full v0.1 design surface so we don't have to churn
the API shape later:

- artifact.kind: + Tokenizer + Bytes (same hydration path as Weights)
- artifact.source: + http, oci, inline, configMap (in addition to
  huggingFace + s3)
- storage.backend: + ExistingPVC (customer-managed PVC, no Job)
- status: + resolvedDigest, + lastHydratedAt, + bytesStaged,
  + references

Implementations:
- Tokenizer / Bytes: route through the same builder as Weights
- ExistingPVC: compose no Objects; report Ready immediately per
  matched cluster; "Adopted" event on first match
- http: curl-based fetch into the PVC; optional Authorization header
  from a Secret
- inline: write content to a file inside the PVC via env-passed value
- lastHydratedAt: captured from the remote Job's completionTime
- oci, configMap: discriminator locked, surface
  ImplementationPending condition + warning until wired

Two new tests cover the new paths:
- test-model-cache-existing-pvc: no Objects composed, 1/1 ready
- test-model-cache-pending-source: oci source surfaces empty summary
  rather than crashing
Mirror the doc trim on PR #76: the kind / source descriptions just
name the partition axis ("fetch protocol", "wiring discriminator not
content partition") rather than declaring the field MECE. Substance
is the same; phrasing matches the design doc.
Mirrors the design-doc edit that replaces the flat RWX-CSI list with
the four-category framing (NFS / parallel FS / object-backed FUSE /
replicated block). Description-only — surfaces a choice customers
should actually be making rather than implying all CSIs are
equivalent.
Four curated examples covering the impl's working v0.1 paths:

- model-cache-basic.yaml — HuggingFace Weights, basic case. Tidied
  the header comment from the original scaffold.
- model-cache-nim-mode-2b.yaml — pre-seed the NIM profile cache dir
  via http source. The demoable NIM Mode 2b case once the cluster
  has NGC creds. Notes that ORAS / oci source is locked in the XRD
  but impl-pending; that follow-up swaps http for oci against
  nvcr.io/nim/... directly.
- model-cache-existing-pvc.yaml — ExistingPVC backend; customer-
  managed PVC adoption with no Modelplane-composed PVC or Job.
- model-cache-private-s3.yaml — private S3 with access-key Secret;
  compliance / GDPR scenario.

Each example has a tight header explaining the use case, expected
speedup, and what cluster/Secret prerequisites need to be in place.
Two examples were technically schema-correct but assumed the user
already knew the implicit Secret-key contract:

- model-cache-basic.yaml: tokenSecretRef.name: hf-token expects a
  Secret with key HF_TOKEN. Added a one-line kubectl example so the
  user can wire it up before applying.
- model-cache-private-s3.yaml: the s3 hydration Job reads fixed keys
  access_key / secret_key from the referenced Secret. Made that
  explicit so the user doesn't accidentally use AWS_ACCESS_KEY_ID etc.

Validated all four examples against the generated Pydantic XRD model
(.up/python/models/ai/modelplane/modelcache/v1alpha1.py) — every
required field present, every enum and pattern matches.
End-to-end ModelCache integration so a ModelDeployment that
references a cache mounts the pre-staged PVC at engine boot instead
of fetching weights from the source.

API:
- ModelDeployment.spec.caches: [{ name }] — single-item list in v0.1,
  references a ModelCache in the same namespace
- ModelReplica.spec.caches mirrors and inherits verbatim

Composition:
- compose-model-deployment passes caches through to each ModelReplica
- compose-model-replica sets model.uri = pvc://<cache-pvc-name> when
  a cache is referenced; otherwise falls back to hf://<repo> from
  --model= as before
- lib/naming.modelcache_pvc_name() centralizes the PVC naming so
  compose-model-cache (creator) and compose-model-replica (consumer)
  agree on the convention

Test: tests/test-model-replica-with-cache/ verifies the dispatch.
Caught one bug along the way (llmis name derivation when the deploy
prefix shares characters with the cluster name) — fixed in the test
fixture.
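
The dispatch described in this commit can be sketched roughly as below, combined with the `--model=/mnt/models` append from the PR description. The naming convention inside `modelcache_pvc_name` is a placeholder (the real one lives in lib/naming), and `resolve_model` is a hypothetical helper, not the function's actual signature.

```python
MODEL_MOUNT_PATH = "/mnt/models"  # mount path named in the PR description


def modelcache_pvc_name(cache_name: str) -> str:
    """Placeholder convention; creator and consumer must share one helper."""
    return f"modelcache-{cache_name}"


def resolve_model(caches: list, engine_args: list) -> tuple:
    """Return (model_uri, engine_args) for a replica."""
    if caches:
        # v0.1 allows a single cache reference, so only the first entry is read.
        uri = f"pvc://{modelcache_pvc_name(caches[0]['name'])}"
        return uri, [*engine_args, f"--model={MODEL_MOUNT_PATH}"]
    for arg in engine_args:
        if arg.startswith("--model="):
            return f"hf://{arg.split('=', 1)[1]}", list(engine_args)
    raise ValueError("no cache referenced and no --model= engine arg")


print(resolve_model([{"name": "qwen-2-5-0-5b"}], ["--dtype=half"]))
```

Routing both the creator and the consumer through one naming helper is what keeps the PVC reference from drifting if the convention changes.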

Demo: examples/qwen-cached-demo/ — three yamls + README showing the
cold-start delta on Qwen 2.5 0.5B over the existing qwen-demo flow.
Deploys onto the same GKE InferenceClusters from ../qwen-demo/ with
spec.caches: [{ name: qwen-2-5-0-5b }] pointing at a pre-staged cache.
Four sequenced scripts so the demo is one-command runnable:

- setup.sh: applies shared prereqs from ../qwen-demo (00-prereqs,
  01-gateway, 02-class), provisions a GKE InferenceCluster via
  envsubst-templated infra/cluster.yaml ($GCP_PROJECT required),
  waits for Ready (~5-10 min).
- demo.sh: applies cache → waits for ArtifactReady → applies
  deployment → waits for ReplicasReady → applies service →
  fetches the gateway address → curls a chat-completions request.
  Times each phase so the cold-start delta is visible.
- cleanup-demo.sh: deletes service / deployment / cache only.
  Cluster + shared infra stay so demo.sh can re-run immediately.
- cleanup.sh: calls cleanup-demo.sh then deletes the
  InferenceCluster (deprovisions GKE). Shared infra kept (might
  be reused by other demos in this repo).

Every script uses kubectl apply / delete --ignore-not-found so
re-running mid-state is safe. README rewritten to lead with the
script-driven flow and a phase / script / what-it-does table.

infra/cluster.yaml uses source: GKE so Modelplane provisions the
GKE cluster directly — no external Secret / kubeconfig wiring
required, the user just provides GCP_PROJECT and Crossplane GCP
provider credentials.

shfmt-clean, schema-validated against generated Pydantic models for
ModelCache / ModelDeployment / ModelService / InferenceCluster.
Two cleanups:

- Drop the made-up cache/replica timing numbers (18s/24s) — they're
  ballpark guesses and shouldn't read as commitments. Output snippet
  now shows the script's shape with placeholders.
- Comparison section was instructing the reader to apply
  ../qwen-demo/04-deployment.yaml; that has replicas: 2 and selects
  by the broad cluster label, so on a single-cluster demo setup the
  scheduler would surface InsufficientCapacity. Simpler advice:
  copy 02-deployment.yaml, strip the caches block, change the name,
  and apply.
The original demo showed only the cached path — readers had to
imagine the uncached cold-start cost. Now the demo applies both
deployments side-by-side on the same cluster, polls each for
readiness, and prints both timings so the delta is obvious.

Changes:
- infra/cluster.yaml: nodeCount 1 → 2 (one GPU per parallel
  deployment)
- 02-deployment.yaml: replicas 2 → 1 (now paired with 02b)
- 02b-deployment-uncached.yaml: new — same Qwen model, no
  spec.caches, --model=Qwen/... in engine args so the engine pulls
  weights from HuggingFace at boot
- 03b-service-uncached.yaml: new — exposes the uncached deployment
- demo.sh: applies both deployments + services, polls each via
  jsonpath against status.conditions[type=ReplicasReady], emits
  per-deployment ready times as they tick, prints a
  Cached / Uncached comparison summary, runs the curl sanity test
  against the cached endpoint
- cleanup-demo.sh: deletes both deployments + services + the cache
- README: rewritten to lead with the side-by-side framing
The InferenceClass + GKE source path requires zones (validated by
compose-inference-cluster). Without it the function pipeline returns
a pydantic validation error and the cluster never composes.

Caught while bringing the demo up live against crossplane-playground.
Three changes the demo flushed out:

(1) Cloud-agnostic CSI capability declaration.

User-facing InferenceCluster gets a new spec.storage block with a
storageClassName default and a csiDrivers list of semantic capability
flags — SharedFilesystem / ObjectStorageMount / BlockDevice. The
InferenceCluster composition maps these to cloud-specific CSI addons
per source (GKE Filestore CSI for SharedFilesystem, GCS-FUSE for
ObjectStorageMount, PD-CSI default for BlockDevice). Future EKS / AKS
branches do the equivalent mapping with their own addon names.

The internal GKECluster XR carries a GCP-specific spec.addons block
(gcpFilestoreCsiDriver / gcsFuseCsiDriver / gcePersistentDiskCsiDriver)
that compose-gke-cluster threads into the underlying container.Cluster
resource's addonsConfig. The user-facing surface stays cloud-agnostic;
the internal infrastructure XR keeps the GCP-specific knobs as an
escape hatch for advanced cases.

For BYO clusters (source: Existing) the csiDrivers field is
descriptive only — Modelplane never installs drivers on
customer-managed clusters.

The qwen-cached-demo cluster.yaml uses the new shape:
  spec.storage.csiDrivers: [SharedFilesystem]
  spec.storage.storageClassName: standard-rwx

(2) GCE stockout detection.

compose-gke-cluster.detect_capacity_failures() watches nodepool
managed-resource conditions for GCE_STOCKOUT / "does not have enough
resources" and sets an InsufficientGPUCapacity condition naming the
zone and accelerator type. compose-inference-cluster propagates the
condition up to the user-facing InferenceCluster so the failure is
visible in `kubectl describe inferencecluster` without needing to
peel through to the internal GKECluster + nodepool MRs.

Caught a real one while bringing up the demo against
crossplane-playground: the InferenceCluster sat in Creating for 35
min before we found the stockout in `gcloud container node-pools
describe`.
With this change a user sees the InsufficientGPUCapacity condition
and the offending zone immediately.
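
As an illustration of the matching logic, here is a small Python sketch that scans a nodepool MR's conditions for the two stockout markers named above. The function signature, the `reason` string, and the message format are assumptions; the real detect_capacity_failures() may read zone and accelerator from elsewhere.

```python
from typing import Optional

# Marker strings taken from the commit message above.
STOCKOUT_MARKERS = ("GCE_STOCKOUT", "does not have enough resources")


def detect_capacity_failure(nodepool: dict, zone: str, accelerator: str) -> Optional[dict]:
    """Return an InsufficientGPUCapacity condition if any condition message matches."""
    for cond in nodepool.get("status", {}).get("conditions", []):
        message = cond.get("message", "")
        if any(marker in message for marker in STOCKOUT_MARKERS):
            return {
                "type": "InsufficientGPUCapacity",
                "status": "True",
                "reason": "Stockout",  # hypothetical reason string
                "message": f"{accelerator} unavailable in {zone}: {message}",
            }
    return None


pool = {"status": {"conditions": [{"type": "Synced", "message": "GCE_STOCKOUT in zone"}]}}
print(detect_capacity_failure(pool, "us-central1-a", "nvidia-tesla-t4"))
```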

(3) Defensive HACKs in setup.sh / cleanup.sh.

setup.sh detects zombie Network MRs (Crossplane reports Ready=True
but gcloud returns 404 because the create-succeeded annotation
sticks even after the underlying resource is gone) and clears the
crossplane.io/external-create-{succeeded,pending} annotations so the
provider retries the create. Also catches Synced=False MRs stuck on
cached provider errors.

cleanup.sh force-finalizes stuck workload-cluster Helm releases and
provider-kubernetes Objects after the InferenceCluster delete is
issued. provider-helm / provider-kubernetes hang on these because
the target cluster is being torn down and the delete API calls fail.

Both HACK blocks carry TODO comments pointing at the upstream
crossplane provider issues — the right fixes belong in provider-gcp
(don't trust the create-succeeded annotation forever; reconcile with
GCP truth on observe) and provider-helm / provider-kubernetes
(treat "target cluster gone" as delete success).
Mermaid flowchart of what the demo composes, from user-facing XRs
down to live workload pods. ModelCache primitives (XR, PVC MR,
Job MR, mounted PVC) highlighted in yellow to make the new piece
visually obvious vs the shared cluster/KServe infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ModelService, ModelEndpoint, ModelReplica, Backend, HTTPRoute
and the per-replica composition path. Introduce a second highlight
color (orange) for the external substrate we may later replace with
Modelplane-internal primitives — KServe Release, LWS Release,
LLMInferenceService MR, and the LWS gang itself.

The yellow ModelCache path and the orange substrate path are now
visually distinct, making clear that the cache work is independent
of any future engine-substrate swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GCP Filestore Basic HDD (backs the standard-rwx storage class on
GKE) takes 8–15 min to provision the first instance. The 10m wait
was tight enough to time out on a cold cluster even when the
hydration itself worked. 20m gives Filestore comfortable room
without burying real failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Filestore CSI driver provisions Filestore Instances via the
file.googleapis.com API. If the API is not enabled on the project,
PVCs sit Pending indefinitely with SERVICE_DISABLED — the GKE
cluster comes up fine, the CSI driver installs fine, the PVC is
accepted, but every provisioning attempt fails 403. We hit this on
crossplane-playground today and the symptom (PVC Pending forever)
is annoying to diagnose without checking workload-cluster PVC
events directly.

Enabling the API is idempotent and cheap; do it as part of setup
so the first PVC the demo creates can actually bind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Filestore CSI addon on GKE installs the in-cluster driver fine
on its own, but provisioning calls hit file.googleapis.com. If
that API isn't enabled on the project, the symptom is silent: PVCs
sit Pending forever with SERVICE_DISABLED in their workload-cluster
events while the cluster itself reports healthy. Hit this on
crossplane-playground today; took a workload-cluster describe to
diagnose.

Emit a ProjectService MR for file.googleapis.com when the user
opts into Filestore via storage.csiDrivers: [SharedFilesystem].
ProjectService enable takes seconds and runs in parallel with the
multi-minute cluster create, so it doesn't extend critical path.

Deliberately not enabling container/compute APIs — if those aren't
on, the user's GCP project setup is incomplete and they should fix
it explicitly. The addon case is different because the user
explicitly opted into Filestore via the user-facing API.

disableOnDestroy=False so tearing down one InferenceCluster doesn't
yank an API the rest of the project relies on.
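
A rough sketch of the conditional emit, written as a plain function that returns a manifest dict. The apiVersion and the exact forProvider field names are assumptions about the GCP provider's ProjectService schema and should be checked against the installed provider; the function name is hypothetical.

```python
from typing import Optional


def filestore_project_service(project: str, csi_drivers: list) -> Optional[dict]:
    """Return a ProjectService manifest only when SharedFilesystem is requested."""
    if "SharedFilesystem" not in csi_drivers:
        return None
    return {
        # Assumed group/version; verify against the provider in use.
        "apiVersion": "cloudplatform.gcp.upbound.io/v1beta1",
        "kind": "ProjectService",
        "spec": {
            "forProvider": {
                "project": project,
                "service": "file.googleapis.com",
                # Leave the API enabled when one InferenceCluster is torn down.
                "disableOnDestroy": False,
            }
        },
    }


print(filestore_project_service("my-gcp-project", ["SharedFilesystem"]))
```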

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-part change to fix Filestore PVCs landing on the wrong VPC:

1) compose-gke-cluster surfaces the observed Network MR's name in
   GKECluster.status.network.name (new field; XRD updated).

2) compose-inference-cluster, in the GKE branch, reads that status
   field and (when storage.csiDrivers contains SharedFilesystem)
   composes an Object → StorageClass on the workload cluster with
   provisioner=filestore.csi.storage.gke.io and parameters.network=
   <our VPC>. Default SC name is `modelplane-rwx`; user can override
   via spec.storage.storageClassName.
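
The StorageClass from step 2 could be built along these lines. This is a sketch of the manifest only, under the assumption that the function wraps it in a provider-kubernetes Object separately; the helper name is hypothetical, while the provisioner, the `network` parameter, and the default name come from the text above.

```python
def rwx_storage_class(network: str, name: str = "modelplane-rwx") -> dict:
    """Build a Filestore-backed StorageClass pinned to the cluster's VPC."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "filestore.csi.storage.gke.io",
        # Without this parameter Filestore instances land on the default VPC.
        "parameters": {"network": network},
    }


# The network value would come from GKECluster.status.network.name.
print(rwx_storage_class("example-vpc"))
```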

The user-facing API stays cloud-agnostic — the cloud-specific knob
(`network` for Filestore) is wired by the GKE-specific composition
branch from the GKE-specific infra XR's status. Same pattern will
apply when EKS / AKS branches need to wire fileSystemId / shareName.

Demo's 01-cache.yaml switched to mp-filestore-rwx so today's running
cluster (which got the GKE built-in standard-rwx by accident) can
still consume the manually-created custom SC. New clusters will get
the auto-composed `modelplane-rwx` from the InferenceCluster XR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dennis Ramdass and others added 25 commits May 20, 2026 10:31
Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and
replacing with a focused page that documents what shipped in this
PR: shape, what gets composed, multi-node Ray bootstrap, scope
boundaries. Demo proof at examples/qwen-cached-demo/.

v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster
dedup) are explicitly out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the diagram and prose to match what actually runs:
- LWS gang shown as distinct leader + worker pods (not one
  combined node) with the Ray cluster edge between them
- Both pods mounting the cached PVC explicit
- StorageClass MR carries the parameters.network detail
- LIS MR notes the flat PodSpec + Ray-bootstrap command
- Hydration Job notes hf download (not the deprecated CLI)
- GKECluster.status.network.name → StorageClass edge shown

Adds a one-paragraph "what this proves" prose section so a
reviewer can read the diagram + caption and understand the
v0.1 path end-to-end without spelunking through commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OSS Modelplane doesn't name the future storage substrates beyond
"additional backends may land in later versions" — concrete
names (and any associated commercial framing) belong outside this
repo. Updates:

- apis/modelcaches/definition.yaml: drop the explicit
  "ContentAddressed and Custom backends are deferred to v0.2"
  sentence from the backend description; keep the extension point.
- functions/compose-model-cache/main.py: same swap in the
  module docstring.
- design/modelcache/README.md: rewrite the out-of-scope table and
  the forward-compatibility section to talk about generic
  extension points rather than naming v0.2 backends. Drop the
  content-addressed / cross-cluster-dedup framing entirely.

Also fix a Mermaid lexer error in TOPOLOGY.md: the
`status.network.name` edge label contains dots that the unquoted
form parses as identifier separators; quote it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the v0.1 1-pager focused on what shipped; don't enumerate
future features by name in the out-of-scope table. The XRD's
extension points (backend enum, replication enum, source union)
speak for themselves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LWS reports the gang Ready as soon as both pods reach 1/1, but
vLLM inside the leader needs another ~60-120s to load the model
from the cached PVC and finish CUDA graph capture before Uvicorn
starts listening. The previous curl ran the moment the LWS gang
reported Ready and got back `upstream connect error … Connection refused`
from the gateway, even though everything was wired correctly.

Move the readiness wait into the curl-test pod itself: it loops
`curl -f /v1/models` until 200, then sends the chat completion.
Avoids the kubectl-run-rm stdout-capture pitfall and keeps the
single-pod ergonomic of the original demo.sh.
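
The demo does this wait with a curl loop inside the pod; the same idea can be sketched in Python. This is an illustration under assumptions, not the demo's code: the function names, the 5s interval, and the attempt cap are hypothetical, and the probe is injectable so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the connection failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: engine not up yet.
        return None


def wait_until_serving(base_url, probe=http_status, interval=5.0,
                       max_attempts=60, sleep=time.sleep):
    """Poll <base_url>/v1/models until it returns 200; return the attempt count."""
    models_url = base_url.rstrip("/") + "/v1/models"
    for attempt in range(1, max_attempts + 1):
        if probe(models_url) == 200:
            return attempt
        sleep(interval)
    raise TimeoutError(f"{models_url} not serving after {max_attempts} attempts")
```

Only after `wait_until_serving` returns would the chat completion request be sent.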

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ModelService.status.address is already a full URL with the path
prefix (e.g. http://172.18.255.200/ml-team/qwen-cached-demo); the
original curl line in demo.sh prepended "http://" and re-appended
"/${NS}/qwen-cached-demo" — yielding
http://http://172.18.255.200/ml-team/qwen-cached-demo/ml-team/qwen-cached-demo/v1/...

DNS then tried to resolve hostname "http", curl failed silently
(stderr suppressed for the readiness probe), and the retry loop
spun forever. Use the address as-is and append the OpenAI path.
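
The difference between the two constructions can be shown in a few lines of Python. Both helper names are hypothetical; `doubled_url` reproduces the original mistake only for comparison.

```python
from urllib.parse import urlsplit


def endpoint_url(address, api_path):
    """Use status.address as-is and append only the OpenAI path."""
    return address.rstrip("/") + "/" + api_path.lstrip("/")


def doubled_url(address, namespace, name, api_path):
    """The original bug: scheme and path prefix added a second time."""
    return "http://" + address + f"/{namespace}/{name}" + api_path


address = "http://172.18.255.200/ml-team/qwen-cached-demo"

good = endpoint_url(address, "/v1/models")
bad = doubled_url(address, "ml-team", "qwen-cached-demo", "/v1/models")

# In the broken URL the parsed hostname is the literal string "http",
# which is what DNS was then asked to resolve.
bad_host = urlsplit(bad).hostname
```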

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README still described the original side-by-side cached vs
uncached demo; the current demo.sh runs a single multi-node
TensorPipeline 1×2 LWS gang against the cached PVC. Rewrite to
match: T4 quota, the actual phase output, the pod-name pattern,
pointer to TOPOLOGY.md.

demo.gif is a 4×-speed recording of a warm-cluster run captured
via `asciinema rec --command "bash demo.sh"` then `agg --speed 4`.
Cache pre-hydrated, LWS gang Ready in 65s, engine serving 86s
later, real Qwen chat completion JSON in the final frame. GitHub
autoplays the gif inline in the rendered README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three kubectl-get checkpoints to demo.sh so the recording
shows what's actually happening on the cluster rather than just
script milestones:

- After cache hydration: print the ModelCache row + the
  Object MRs composing the PVC + Job on every matched
  cluster (with Synced / Ready columns).
- After deployment apply: print the user-facing workload tree
  (ModelDeployment, ModelService, ModelReplica, ModelEndpoint)
  so a viewer sees how one ModelDeployment fans out to one
  ModelReplica per cluster plus a ModelEndpoint per replica.
- After LWS gang Ready: print the Object MRs again — the
  LLMInferenceService MR (the orange-band substrate we may
  swap later) is now visible alongside the cache MRs.

Same scripted flow, more visible payload for the recording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The custom-columns table tells you the composed MRs exist; the
JSON manifest dump shows exactly what got applied to the workload
cluster — the artifact a viewer of the gif/cast wants to see.

- After cache hydration: pretty-print the PVC manifest. Shows
  accessModes: [ReadWriteMany], the modelplane-rwx storage class
  reference, the size.
- After LWS gang Ready: pretty-print the LLMInferenceService
  manifest. Shows model.uri=pvc://modelcache-..., the worker
  PodSpec with the Ray-bootstrap shell as container.command,
  parallelism.{tensor=1, pipeline=2}, and the engine args list.

Uses kubectl's jsonpath to extract spec.forProvider.manifest
then python3 -m json.tool for formatting — no extra deps.
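
For reference, the same extraction can be sketched in Python when the whole Object MR has been fetched as JSON. The function name is hypothetical; the `spec.forProvider.manifest` path is the one named above.

```python
import json


def pretty_manifest(object_json):
    """Return the wrapped manifest of an Object MR as indented JSON text."""
    obj = json.loads(object_json)
    # Same field the jsonpath expression pulls out.
    manifest = obj["spec"]["forProvider"]["manifest"]
    return json.dumps(manifest, indent=4)
```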

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compose_project_services composes the ProjectService MR for
file.googleapis.com when the user opts into the Filestore CSI
addon, but mark_readiness's tracked list didn't include it.
Result: even when the MR observes Ready, the function never
sets `rsp.desired.resources["projectservice-filestore"].ready`,
so Crossplane's XR readiness aggregator keeps the GKECluster
XR at Ready=False with the message
  "Unready resources: projectservice-filestore"
forever.

Caught on the cold-start full demo recording — setup.sh hit
its 20m kubectl-wait timeout sitting on this single missing
mark even though every underlying GCP resource was Ready.
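
A simplified model of the failure, using plain dicts instead of the real function response objects. The helper names and the `network` resource are hypothetical, and `unready_resources` is only a stand-in for Crossplane's aggregation; the point is that a resource missing from the tracked list never gets its ready flag set, however healthy the observed MR is.

```python
def observed_ready(resource):
    """True if the observed resource carries a Ready=True condition."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def mark_readiness(desired, observed, tracked):
    """Mirror observed readiness into desired, but only for tracked names."""
    for name in tracked:
        if name in desired and observed_ready(observed.get(name, {})):
            desired[name]["ready"] = True


def unready_resources(desired):
    """Names an aggregator would report as still unready."""
    return sorted(name for name, res in desired.items() if not res.get("ready"))


ready = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
observed = {"network": ready, "projectservice-filestore": ready}

# Before the fix: the ProjectService MR is composed but not tracked.
before = {"network": {}, "projectservice-filestore": {}}
mark_readiness(before, observed, tracked=["network"])

# After the fix: it is in the tracked list.
after = {"network": {}, "projectservice-filestore": {}}
mark_readiness(after, observed, tracked=["network", "projectservice-filestore"])
```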

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
provider-kubernetes's Object is a namespaced resource. The
InferenceCluster XR is cluster-scoped, so any composed namespaced
resource needs metadata.namespace set explicitly — otherwise
Crossplane's reconcile errors with
  cannot get composed resource: an empty namespace may not be
  set when a resource name is provided
and the XR sits Synced=False.

This follows the pattern compose_kserve_backend and
compose_cluster_provider_config already use (modelplane-system namespace).
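
A small sketch of the guard, assuming composed resources are handled as plain dicts; `with_namespace` is a hypothetical helper, and `modelplane-system` is the namespace named above.

```python
import copy


def with_namespace(resource, namespace="modelplane-system"):
    """Return a copy of a composed namespaced resource with metadata.namespace set.

    An explicitly set namespace is left untouched.
    """
    out = copy.deepcopy(resource)
    out.setdefault("metadata", {}).setdefault("namespace", namespace)
    return out
```

Applying this to every namespaced resource composed from the cluster-scoped XR avoids the "empty namespace" reconcile error.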

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff reformat on compose-inference-cluster/main.py (CI lint was
failing on collapse of the observed_gke_network_name dict-walk).

Existing test golden updates:
- test-model-cache: hydration command now filters lost+found from
  the emptiness check and uses `hf download` (huggingface-cli was
  deprecated in huggingface-hub 1.x).
- test-model-replica-with-cache: container args carry
  --model=/mnt/models when spec.caches is set, so vLLM loads the
  cached weights instead of falling back to its hardcoded default.
- test-model-replica-multinode: KServe v0.17+ flat-worker shape
  (worker is a PodSpec, not {size, template}); parallelism.tensor
  is per-pod count + parallelism.pipeline carries the pod count;
  container command carries the Ray-bootstrap shell (leader runs
  `ray start --head` + execs vLLM, worker runs `ray start
  --address=$LWS_LEADER_ADDRESS:6379 --block`).
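
The leader/worker split in the last item can be sketched as a small command builder. This is an illustration, not the function's actual code: `bootstrap_command` and the `engine` argv in the usage note are hypothetical placeholders, while the two `ray start` invocations are the ones quoted above.

```python
import shlex

RAY_PORT = 6379  # port from the worker's --address above


def bootstrap_command(role, engine_argv=()):
    """Build the container command for one pod of the LWS gang."""
    if role == "leader":
        if not engine_argv:
            raise ValueError("leader needs an engine command to exec")
        # Start the Ray head, then replace the shell with the engine process.
        script = f"ray start --head && exec {shlex.join(engine_argv)}"
    elif role == "worker":
        # $LWS_LEADER_ADDRESS is left for the pod's shell to expand.
        script = f"ray start --address=$LWS_LEADER_ADDRESS:{RAY_PORT} --block"
    else:
        raise ValueError(f"unknown role: {role!r}")
    return ["/bin/sh", "-c", script]
```

For example, `bootstrap_command("leader", ["engine", "--model=/mnt/models"])` yields a shell command that starts the head node and then execs the placeholder engine binary.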

Two new tests for code paths introduced in this PR:
- test-gkecluster-filestore-addon: asserts compose_project_services
  emits a ProjectService MR for file.googleapis.com when the
  user opts into the Filestore CSI addon. Catches regressions to
  the API-auto-enable path.
- test-inference-cluster-csi-shared-filesystem: asserts the
  compose_gke_storage_classes path emits an Object MR wrapping a
  workload-cluster StorageClass with metadata.namespace set
  (regression we hit live: without namespace, Crossplane errors
  with "an empty namespace may not be set when a resource name is
  provided"). Also asserts parameters.network is pinned to the
  observed VPC from GKECluster.status.network.name.

All 20 composition tests pass. ./nix.sh flake check also passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier commit switched 01-cache.yaml from `standard-rwx` (GKE's
built-in Filestore SC) to `modelplane-rwx` (the SC the InferenceCluster
composition auto-creates with parameters.network pinned to our VPC),
but cluster.yaml's spec.storage.storageClassName was missed and still
said `standard-rwx`. That caused the composed Object MR to try to
create a StorageClass named `standard-rwx`, which conflicts with the
GKE-managed built-in (`parameters` is immutable on existing SCs), so
the Object MR sat Synced=False forever and the InferenceCluster
couldn't reach Ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
StorageClass doesn't carry a Ready condition of its own, so the
DeriveFromObject readiness policy provider-kubernetes uses keeps
the wrapping Object MR at Ready=False forever — blocking the
InferenceCluster XR from going Ready even though the SC was
applied successfully on the workload cluster.

Switch the policy to SuccessfulCreate, which marks the MR Ready as
soon as the SC apply succeeds. For a config-only resource that's
the actual readiness signal we care about.
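
A sketch of how the policy choice could be expressed, assuming the wrapped manifest is a dict. The helper name and the kind set are hypothetical; `SuccessfulCreate` and `DeriveFromObject` are the two policies discussed above.

```python
# Kinds that never report a Ready condition of their own.
KINDS_WITHOUT_READY_CONDITION = {"StorageClass"}


def object_readiness(manifest):
    """Return the readiness fragment for the Object MR wrapping manifest."""
    if manifest.get("kind") in KINDS_WITHOUT_READY_CONDITION:
        # Config-only resource: a successful apply is the readiness signal.
        policy = "SuccessfulCreate"
    else:
        policy = "DeriveFromObject"
    return {"readiness": {"policy": policy}}
```

The returned fragment would be merged into the Object's `spec`.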

Test golden updated to match.

Also drop the unused gkecluster import in the new
test-gkecluster-filestore-addon test that ruff flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compose_gke_storage_classes composed the Object MR for the workload-
cluster StorageClass with the correct manifest, but never set
rsp.desired.resources["storage-class-rwx"].ready when the observed
MR was Ready. Crossplane's auto-readiness aggregator left it at
READY_UNSPECIFIED and the InferenceCluster XR sat Ready=False with
"Unready resources: storage-class-rwx" even when the SC was applied
successfully on the workload cluster.

Read observed condition and mirror into the response — same pattern
the function already uses for `gke-cluster`.

Test backfill: add an observed storage-class-rwx Object with
Ready=True to the inference-cluster-csi-shared-filesystem test so
the propagation code path is at least exercised. The compositiontest
framework asserts manifest shape but doesn't directly assert on the
response-side ready flag, so this is regression-friendly more than
regression-proof.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If any other shell mutates ~/.kube/config (e.g. a `gcloud container
clusters get-credentials` to peek at the workload cluster), every
subsequent kubectl call in a running demo.sh / setup.sh / cleanup.sh
silently retargets to whatever the new current-context is. We hit
this live during a recording: the workload-cluster context took over
mid-flight and `kubectl get modelcache` blew up because the workload
cluster doesn't have the ModelCache CRD.

Capture `current-context` at script start (or honor an explicit
MODELPLANE_CONTEXT env var) and define a `kubectl` shell function
that passes `--context=$KCTX` through to every call. Subsequent
context flips elsewhere on the box don't affect the running script.
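
The scripts do this with a shell function; the same pinning can be sketched in Python. All function names here are hypothetical, and the reader for the current context is injectable so the logic can be checked without a kubeconfig.

```python
import os
import subprocess


def current_context():
    """Read the current kubectl context once, at startup."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def resolve_context(env=None, read_current=current_context):
    """Prefer an explicit MODELPLANE_CONTEXT, else capture the current context."""
    env = os.environ if env is None else env
    return env.get("MODELPLANE_CONTEXT") or read_current()


def kubectl_argv(context, *args):
    """Build a kubectl invocation pinned to the captured context."""
    return ["kubectl", f"--context={context}", *args]
```

Resolving the context once and passing it to every `kubectl_argv` call means a later change to `~/.kube/config` cannot retarget the run.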

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kserve-llmisvc-resources Helm chart ships the
LLMInferenceServiceConfig CRD but does NOT create the six default
instances the controller looks up via spec.baseRefs on every
LLMInferenceService admission:

  kserve-config-llm-default
  kserve-config-llm-router-route
  kserve-config-llm-worker-tensor-parallel
  kserve-config-llm-worker-pipeline-parallel
  kserve-config-llm-decode
  kserve-config-llm-prefill

Without them, every LIS admission fails with:
  PresetsCombined False CombineBaseError
  failed to get LLMInferenceServiceConfig
  "kserve-config-llm-worker-pipeline-parallel" …
  not found
and the controller silently retries forever. We hit this twice
during demo recordings, and a memory note from a prior debugging
session called it out as something compose-kserve-backend should
automate. Doing that now.

Compose six Object MRs (one per preset) targeting the workload
cluster with spec: {} — empty is enough to satisfy the lookup.
Gated on the kserve-controller resource being observed so we
don't race the CRD install. SuccessfulCreate readiness because
LLMInferenceServiceConfig has no Ready condition of its own.
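
A rough sketch of the gated preset composition, using dicts in place of the real composed resources. The function name, the `serving.kserve.io/v1alpha1` apiVersion, and the `kserve` namespace are assumptions made for illustration; the preset names, the empty spec, and the `kserve-controller` gate are taken from the description above.

```python
PRESET_NAMES = (
    "kserve-config-llm-default",
    "kserve-config-llm-router-route",
    "kserve-config-llm-worker-tensor-parallel",
    "kserve-config-llm-worker-pipeline-parallel",
    "kserve-config-llm-decode",
    "kserve-config-llm-prefill",
)


def preset_manifests(observed_names, names=PRESET_NAMES, namespace="kserve"):
    """Return one empty LLMInferenceServiceConfig manifest per preset name.

    Nothing is composed until the kserve-controller resource has been
    observed, so the manifests never race the CRD install.
    """
    if "kserve-controller" not in observed_names:
        return {}
    return {
        name: {
            "apiVersion": "serving.kserve.io/v1alpha1",  # assumed group/version
            "kind": "LLMInferenceServiceConfig",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {},  # empty is enough to satisfy the baseRefs lookup
        }
        for name in names
    }
```

Each returned manifest would then be wrapped in an Object MR targeting the workload cluster.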

Also: shellcheck fix on demo / setup / cleanup scripts — the
kubectl shell function was being called before its definition.
Use `command kubectl` for the bootstrap kubectl-config-read so
shellcheck stops flagging SC2218.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every ==> step in setup.sh and demo.sh now has:
- A one-line explanation of what's happening behind the script in
  Modelplane terms (what's being composed, what the resource shape
  means architecturally) so a viewer of the gif understands the
  story without reading the demo source.
- An elapsed timer that prints "✓ <phase> in Ns" when each major
  milestone completes (cache hydration, gang Ready, service address,
  engine serving) and a total time at the end.

setup.sh tracks the InferenceCluster Ready transition time + total
setup time. demo.sh already tracked per-phase times; this commit
adds the prose context and a total at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a section to demo.sh that fetches workload-cluster credentials
via gcloud, lists the gang pods with their node placement, then
execs into each pod to show:
- the /mnt/models mount line (same NFS endpoint on both pods)
- stat on model.safetensors (same inode + size on both pods)

Together that proves the LWS leader and worker are reading from the
exact same shared PVC backed by the same Filestore instance, on
different nodes. Skips gracefully if gcloud / gke-gcloud-auth-plugin
isn't available.

Uses bracket jsonpath for the dot-containing annotation key
(`crossplane\.io/external-name`) and a single-quoted sh -c body for
the in-pod commands to avoid nested-quoting hell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: the preset Object MRs were composed with providerConfigRef.name
pointed at the helm/k8s ProviderConfig (`<xr-name>-cluster`,
`qwen-cached-demo-kserve-cluster`), but Object MRs need a
ClusterProviderConfig — and the one configured for the workload
cluster lives at `<inferencecluster-name>-cluster-kubeconfig`
(`qwen-cached-demo-cluster-kubeconfig`), composed by
compose-inference-cluster.

Result: every preset Object MR sat Synced=False with
  CannotConnectToProvider: cannot get provider config:
  ClusterProviderConfig … "qwen-cached-demo-kserve-cluster" not found
which propagated up to InferenceCluster sitting Ready=False, which
blew through setup.sh's kubectl wait timeout on the cold-start
recording.

The KServeBackend XR is named `<inferencecluster>-kserve` so we
strip that suffix to derive the parent name + the correct CPC name.
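
The name derivation is small enough to show directly. The function names are hypothetical; the suffixes are the ones quoted above.

```python
KSERVE_SUFFIX = "-kserve"


def parent_cluster_name(backend_xr_name):
    """Strip the -kserve suffix to recover the InferenceCluster name."""
    if not backend_xr_name.endswith(KSERVE_SUFFIX):
        raise ValueError(f"unexpected KServeBackend name: {backend_xr_name!r}")
    return backend_xr_name[: -len(KSERVE_SUFFIX)]


def workload_provider_config(backend_xr_name):
    """Name of the ClusterProviderConfig for the workload cluster."""
    return f"{parent_cluster_name(backend_xr_name)}-cluster-kubeconfig"


def wrong_provider_config(backend_xr_name):
    """The buggy derivation, kept only for comparison."""
    return f"{backend_xr_name}-cluster"
```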

Test: new test-backend-presets case observes the kserve-controller
Helm release (the gate condition) and asserts on each of the six
composed preset Object MRs — manifest shape AND providerConfigRef
target. Catches the wrong-PC-name copy-paste class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kserve-llmisvc-resources Helm chart v0.16.0 already creates this
particular preset. If our function composes it first via Object MR
(no Helm ownership labels), the Helm install errors with:
  LLMInferenceServiceConfig "kserve-config-llm-router-route" in
  namespace "kserve" exists and cannot be imported into the current
  release: invalid ownership metadata
…and the whole kserve-controller release sits Synced=False forever.

Cut router-route from _LLMISVC_PRESET_NAMES — only compose the five
the chart actually leaves blank. Test golden updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second recording captures the demo.sh path where the InferenceCluster
is already Ready but the cache hasn't been hydrated yet. End-to-end:
ModelCache apply + hydration → ModelDeployment + ModelService apply
+ workload-tree dump → LWS gang spin-up → composed-MR + LIS-manifest
inspection → the new "prove both gang pods do IO from the same PVC"
block (pod placement on two nodes + matching NFS endpoint + matching
safetensors inode) → real Qwen chat-completion response.

Existing demo.gif (warm cluster, cache pre-hydrated, gang Ready in
65s) stays — they show different parts of the lifecycle. README now
embeds both with captions noting what's pre-hydrated in each.

GIF rendered with `agg --speed 4 --idle-time-limit 1` so dead time
during long phases (Filestore provision wait, LWS gang image pull,
vLLM CUDA-graph capture) collapses to 1s for readability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agg --speed 2 --idle-time-limit 1 — keeps the dead-time collapse
but lets a viewer actually follow the per-step output. File size
goes 363 KB → 406 KB, well under GitHub's inline-autoplay budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous renders used --speed 2 / --speed 4, which sped up the
output stream globally, so phase headers and kubectl-get blocks
flashed by faster than a viewer could read. The real fix is to keep
playback at real-time and only collapse the *between-phase* idle
gaps, which are dominated by Filestore provisioning, LWS image
pull, and CUDA-graph capture: minutes of dead time that add no
value to the gif.

agg --speed 1 --idle-time-limit 5: pauses longer than 5s clamp to
5s, everything else plays at the recorded rate. Each kubectl-get
block now lingers long enough to actually read.
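
The effect of the two flags on playback length can be modelled with a little arithmetic. This is a simplified model, not agg's implementation: the function name is hypothetical, and clamping gaps before dividing by speed is an assumption about ordering (with speed 1 the order makes no difference).

```python
def playback_seconds(timestamps, speed=1.0, idle_time_limit=None):
    """Estimate playback length for events recorded at the given timestamps.

    timestamps      : event times in seconds from the start of the recording
    speed           : global playback multiplier
    idle_time_limit : gaps longer than this are clamped to it (None = no clamp)
    """
    total = 0.0
    previous = 0.0
    for t in timestamps:
        gap = t - previous
        if idle_time_limit is not None:
            gap = min(gap, idle_time_limit)
        total += gap / speed
        previous = t
    return total


# Four events: two close together, a 60s wait, then one more a second later.
events = [0.0, 1.0, 61.0, 62.0]

realtime_clamped = playback_seconds(events, speed=1, idle_time_limit=5)
fast_clamped = playback_seconds(events, speed=4, idle_time_limit=1)
```

With real-time playback and a 5s clamp, the short gaps keep their recorded length and only the long wait shrinks.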

File size 363 KB → 458 KB; under GitHub's inline-autoplay budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #75 deleted ui/ entirely. An earlier commit on this branch swept
ui/frontend/node_modules into the index via git add -A, so the rebase
faithfully re-added ~10k files (~2.5M lines, ~181M) on top of the
deletion. Drop them.
@dennis-upbound dennis-upbound force-pushed the dennis/modelcache-impl branch from 6eb1a25 to c61136c Compare May 20, 2026 17:40
@dennis-upbound dennis-upbound marked this pull request as draft May 20, 2026 17:40
@dennis-upbound dennis-upbound deleted the dennis/modelcache-impl branch June 19, 2026 17:26