v0.1 ModelCache + multi-node LWS unblock #78
Known issue: control-plane envoy proxy gets stuck on xds during long demos
Hit this twice now during cold-start demo recordings; flagging so it's not lost.
Symptom: the control-plane envoy proxy pod (
Root cause (suspected): the proxy's envoy container can't reach its xds source. Container logs show repeated:
The long-running gRPC config stream from the envoy proxy to the envoy-gateway control plane goes stale during multi-hour sessions, and the proxy never recovers on its own.
Workaround (what unblocked both demos):
The Deployment recreates the pod, the new proxy gets a fresh xds stream, the Gateway flips to
Why this matters here: not something this PR introduces, but it bites the demo flow when the recording session is long enough: by the time we get to the curl step the proxy has already drifted. Worth either:
Not blocking for this PR.
Composes a PVC + a one-shot hydration Job per matched InferenceCluster.
v0.1 scope: Weights kind, PVC backend, HuggingFace + S3 sources, replication = AllMatchingClusters. ContentAddressed / Custom backends, Tokenizer / Bytes / Adapter / Engine kinds, BYO ExistingPVC, and per-cluster selector overrides are deferred.
Out of scope here: ModelDeployment integration. The mount-injection that attaches a cache's PVC to a model serving pod lives in compose-model-replica and is deferred until the new ModelDeployment shape (PR #75) stabilizes.
Adds:
- apis/modelcaches/{definition,composition}.yaml
- functions/compose-model-cache/main.py
- examples/cache/model-cache-basic.yaml
Design: #76.
Apply patterns from skills/crossplane-python-functions:
- Cast XRD int fields with int() — protobuf delivers Quantity sizes
as Python float (`200.0Gi` ≠ valid Kubernetes Quantity)
- Split per-source hydration into _hf_hydration / _s3_hydration module
functions so the discriminator dispatch is one line
- Separate composition from observation: compose_cluster_resources()
only emits Objects; derive_cluster_phase() reads observed state;
mark_ready_resources() flips ready flags AFTER resource.update()
- Add transition events on first compose and on first full readiness
(one-shot, not steady-state, to keep `kubectl describe` quiet)
- Extract _wrap_remote / _observed_remote_status helpers and a
HydrationSpec dataclass to replace the tuple return
New composition test: tests/test-model-cache/{main.py,xr.yaml}.
Mocks a single ready InferenceCluster via extraResources and asserts
the PVC + Job Objects compose with the expected manifests on the
workload cluster.
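The int() cast in the first bullet can be illustrated with a minimal sketch. The helper name `quantity_gi` and the fixed `Gi` unit are assumptions for illustration, not the function's actual code; the point is only that a number delivered as a float must be cast before it is formatted into a Quantity string.

```python
def quantity_gi(size) -> str:
    """Format a size as a Kubernetes Quantity in Gi.

    Hypothetical helper: numbers from the XR arrive as Python floats,
    so 200 shows up as 200.0 and naive formatting gives "200.0Gi".
    """
    return f"{int(size)}Gi"


naive = f"{200.0}Gi"        # what formatting the float directly produces
fixed = quantity_gi(200.0)  # cast first, then format
```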
datamodel-codegen names inline array item types after the singular property name; the generated class is `Cluster`, not `ClustersItem`. Caught by `up test run tests/test-model-cache`.
Stops these from polluting `git status` and accidentally getting committed:
- `__pycache__/` and `*.pyc`: Python bytecode caches, regenerated by every test run
- `.DS_Store`: macOS Finder metadata
- `.venv-test/`: local test virtualenv (mirrors existing `.venv`)
- `opencode.json`: per-user opencode tool config; contains a local endpoint URL, no shared value
XRD now matches the full v0.1 design surface so we don't have to churn the API shape later:
- artifact.kind: + Tokenizer + Bytes (same hydration path as Weights)
- artifact.source: + http, oci, inline, configMap (in addition to huggingFace + s3)
- storage.backend: + ExistingPVC (customer-managed PVC, no Job)
- status: + resolvedDigest, + lastHydratedAt, + bytesStaged, + references
Implementations:
- Tokenizer / Bytes: route through the same builder as Weights
- ExistingPVC: compose no Objects; report Ready immediately per matched cluster; "Adopted" event on first match
- http: curl-based fetch into the PVC; optional Authorization header from a Secret
- inline: write content to a file inside the PVC via env-passed value
- lastHydratedAt: captured from the remote Job's completionTime
- oci, configMap: discriminator locked, surface ImplementationPending condition + warning until wired
Two new tests cover the new paths:
- test-model-cache-existing-pvc: no Objects composed, 1/1 ready
- test-model-cache-pending-source: oci source surfaces empty summary rather than crashing
Mirror the doc trim on PR #76: the kind / source descriptions just name the partition axis ("fetch protocol", "wiring discriminator not content partition") rather than declaring the field MECE. Substance is the same; phrasing matches the design doc.
Mirrors the design-doc edit that replaces the flat RWX-CSI list with the four-category framing (NFS / parallel FS / object-backed FUSE / replicated block). Description-only — surfaces a choice customers should actually be making rather than implying all CSIs are equivalent.
Four curated examples covering the impl's working v0.1 paths:
- model-cache-basic.yaml: HuggingFace Weights, basic case. Tidied the header comment from the original scaffold.
- model-cache-nim-mode-2b.yaml: pre-seed the NIM profile cache dir via http source. The demoable NIM Mode 2b case once the cluster has NGC creds. Notes that ORAS / oci source is locked in the XRD but impl-pending; that follow-up swaps http for oci against nvcr.io/nim/... directly.
- model-cache-existing-pvc.yaml: ExistingPVC backend; customer-managed PVC adoption with no Modelplane-composed PVC or Job.
- model-cache-private-s3.yaml: private S3 with access-key Secret; compliance / GDPR scenario.
Each example has a tight header explaining the use case, expected speedup, and what cluster/Secret prerequisites need to be in place.
Two examples were technically schema-correct but assumed the user already knew the implicit Secret-key contract:
- model-cache-basic.yaml: tokenSecretRef.name: hf-token expects a Secret with key HF_TOKEN. Added a one-line kubectl example so the user can wire it up before applying.
- model-cache-private-s3.yaml: the s3 hydration Job reads fixed keys access_key / secret_key from the referenced Secret. Made that explicit so the user doesn't accidentally use AWS_ACCESS_KEY_ID etc.
Validated all four examples against the generated Pydantic XRD model (.up/python/models/ai/modelplane/modelcache/v1alpha1.py): every required field present, every enum and pattern matches.
End-to-end ModelCache integration so a ModelDeployment that
references a cache mounts the pre-staged PVC at engine boot instead
of fetching weights from the source.
API:
- ModelDeployment.spec.caches: [{ name }] — single-item list in v0.1,
references a ModelCache in the same namespace
- ModelReplica.spec.caches mirrors and inherits verbatim
Composition:
- compose-model-deployment passes caches through to each ModelReplica
- compose-model-replica sets model.uri = pvc://<cache-pvc-name> when
a cache is referenced; otherwise falls back to hf://<repo> from
--model= as before
- lib/naming.modelcache_pvc_name() centralizes the PVC naming so
compose-model-cache (creator) and compose-model-replica (consumer)
agree on the convention
Test: tests/test-model-replica-with-cache/ verifies the dispatch.
Caught one bug along the way (llmis name derivation when the deploy
prefix shares characters with the cluster name) — fixed in the test
fixture.
Demo: examples/qwen-cached-demo/ — three yamls + README showing the
cold-start delta on Qwen 2.5 0.5B over the existing qwen-demo flow.
Deploys onto the same GKE InferenceClusters from ../qwen-demo/ with
spec.caches: [{ name: qwen-2-5-0-5b }] pointing at a pre-staged cache.
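The model.uri dispatch described above could look roughly like the sketch below. The `modelcache-<name>` PVC convention, both function signatures, and the example repo string are assumptions for illustration; the real convention lives in lib/naming and is not reproduced here.

```python
from typing import Optional


def modelcache_pvc_name(cache_name: str) -> str:
    # Hypothetical naming convention, standing in for lib/naming.
    return f"modelcache-{cache_name}"


def model_uri(caches: list, engine_args: list) -> Optional[str]:
    """Pick pvc:// when a cache is referenced, else hf:// from --model=."""
    if caches:
        # v0.1 allows a single-item list, so only the first entry matters.
        return f"pvc://{modelcache_pvc_name(caches[0]['name'])}"
    for arg in engine_args:
        if arg.startswith("--model="):
            return f"hf://{arg[len('--model='):]}"
    return None
```

Keeping the PVC name behind one function is what lets the creator and the consumer agree without sharing any other state.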
Four sequenced scripts so the demo is one-command runnable:
- setup.sh: applies shared prereqs from ../qwen-demo (00-prereqs, 01-gateway, 02-class), provisions a GKE InferenceCluster via envsubst-templated infra/cluster.yaml ($GCP_PROJECT required), waits for Ready (~5-10 min).
- demo.sh: applies cache → waits for ArtifactReady → applies deployment → waits for ReplicasReady → applies service → fetches the gateway address → curls a chat-completions request. Times each phase so the cold-start delta is visible.
- cleanup-demo.sh: deletes service / deployment / cache only. Cluster + shared infra stay so demo.sh can re-run immediately.
- cleanup.sh: calls cleanup-demo.sh then deletes the InferenceCluster (deprovisions GKE). Shared infra kept (might be reused by other demos in this repo).
Every script uses kubectl apply / delete --ignore-not-found so re-running mid-state is safe.
README rewritten to lead with the script-driven flow and a phase / script / what-it-does table.
infra/cluster.yaml uses source: GKE so Modelplane provisions the GKE cluster directly: no external Secret / kubeconfig wiring required, the user just provides GCP_PROJECT and Crossplane GCP provider credentials.
shfmt-clean, schema-validated against generated Pydantic models for ModelCache / ModelDeployment / ModelService / InferenceCluster.
Two cleanups:
- Drop the made-up cache/replica timing numbers (18s/24s): they're ballpark guesses and shouldn't read as commitments. Output snippet now shows the script's shape with placeholders.
- Comparison section was instructing the reader to apply ../qwen-demo/04-deployment.yaml; that has replicas: 2 and selects by the broad cluster label, so on a single-cluster demo setup the scheduler would surface InsufficientCapacity. Simpler advice: copy 02-deployment.yaml, strip the caches block, change the name, and apply.
The original demo showed only the cached path, so readers had to imagine the uncached cold-start cost. Now the demo applies both deployments side-by-side on the same cluster, polls each for readiness, and prints both timings so the delta is obvious.
Changes:
- infra/cluster.yaml: nodeCount 1 → 2 (one GPU per parallel deployment)
- 02-deployment.yaml: replicas 2 → 1 (now paired with 02b)
- 02b-deployment-uncached.yaml: new; same Qwen model, no spec.caches, --model=Qwen/... in engine args so the engine pulls weights from HuggingFace at boot
- 03b-service-uncached.yaml: new; exposes the uncached deployment
- demo.sh: applies both deployments + services, polls each via jsonpath against status.conditions[type=ReplicasReady], emits per-deployment ready times as they tick, prints a Cached / Uncached comparison summary, runs the curl sanity test against the cached endpoint
- cleanup-demo.sh: deletes both deployments + services + the cache
- README: rewritten to lead with the side-by-side framing
The InferenceClass + GKE source path requires zones (validated by compose-inference-cluster). Without it the function pipeline returns a pydantic validation error and the cluster never composes. Caught while bringing the demo up live against crossplane-playground.
Three changes the demo flushed out:
(1) Cloud-agnostic CSI capability declaration.
User-facing InferenceCluster gets a new spec.storage block with a
storageClassName default and a csiDrivers list of semantic capability
flags — SharedFilesystem / ObjectStorageMount / BlockDevice. The
InferenceCluster composition maps these to cloud-specific CSI addons
per source (GKE Filestore CSI for SharedFilesystem, GCS-FUSE for
ObjectStorageMount, PD-CSI default for BlockDevice). Future EKS / AKS
branches do the equivalent mapping with their own addon names.
The internal GKECluster XR carries a GCP-specific spec.addons block
(gcpFilestoreCsiDriver / gcsFuseCsiDriver / gcePersistentDiskCsiDriver)
that compose-gke-cluster threads into the underlying container.Cluster
resource's addonsConfig. The user-facing surface stays cloud-agnostic;
the internal infrastructure XR keeps the GCP-specific knobs as an
escape hatch for advanced cases.
For BYO clusters (source: Existing) the csiDrivers field is
descriptive only — Modelplane never installs drivers on
customer-managed clusters.
The qwen-cached-demo cluster.yaml uses the new shape:
spec.storage.csiDrivers: [SharedFilesystem]
spec.storage.storageClassName: standard-rwx
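A minimal sketch of the capability-to-addon mapping for the GKE branch follows. The capability flags and addon field names come from the description above; the function name `gke_addons` and the `{"enabled": True}` value shape are assumptions, since the actual spec.addons schema is not shown here.

```python
# Semantic capability flag -> GCP-specific addon field on the GKECluster XR.
GKE_ADDON_FOR_CAPABILITY = {
    "SharedFilesystem": "gcpFilestoreCsiDriver",
    "ObjectStorageMount": "gcsFuseCsiDriver",
    "BlockDevice": "gcePersistentDiskCsiDriver",
}


def gke_addons(csi_drivers: list) -> dict:
    """Translate spec.storage.csiDrivers into a spec.addons-style block."""
    unknown = [d for d in csi_drivers if d not in GKE_ADDON_FOR_CAPABILITY]
    if unknown:
        raise ValueError(f"unsupported csiDrivers: {unknown}")
    # The value shape is hypothetical; only the key names are from the source.
    return {GKE_ADDON_FOR_CAPABILITY[d]: {"enabled": True} for d in csi_drivers}
```

A future EKS or AKS branch would swap only the mapping table, which is what keeps the user-facing flags cloud-agnostic.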
(2) GCE stockout detection.
compose-gke-cluster.detect_capacity_failures() watches nodepool
managed-resource conditions for GCE_STOCKOUT / "does not have enough
resources" and sets an InsufficientGPUCapacity condition naming the
zone and accelerator type. compose-inference-cluster propagates the
condition up to the user-facing InferenceCluster so the failure is
visible in `kubectl describe inferencecluster` without needing to
peel through to the internal GKECluster + nodepool MRs.
Caught a real one while bringing up the demo against
crossplane-playground: the InferenceCluster sat in Creating for 35
min before we found the stockout in `gcloud container node-pools describe`.
With this change a user sees the InsufficientGPUCapacity condition
and the offending zone immediately.
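The stockout check can be sketched as a scan over nodepool conditions. This is an illustration only: the input shape (plain dicts with `zone`, `accelerator`, and `conditions` keys), the `reason` value, and the message wording are assumptions, and the real detect_capacity_failures() reads managed-resource conditions from the function request.

```python
STOCKOUT_MARKERS = ("GCE_STOCKOUT", "does not have enough resources")


def detect_capacity_failures(nodepools: list) -> list:
    """Return one InsufficientGPUCapacity condition per stocked-out nodepool."""
    failures = []
    for pool in nodepools:
        for cond in pool.get("conditions", []):
            message = cond.get("message", "")
            if any(marker in message for marker in STOCKOUT_MARKERS):
                failures.append({
                    "type": "InsufficientGPUCapacity",
                    "status": "True",
                    "reason": "Stockout",  # hypothetical reason string
                    "message": (
                        f"{pool['accelerator']} unavailable in zone {pool['zone']}"
                    ),
                })
                break  # one condition per nodepool is enough
    return failures
```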
(3) Defensive HACKs in setup.sh / cleanup.sh.
setup.sh detects zombie Network MRs (Crossplane reports Ready=True
but gcloud returns 404 because the create-succeeded annotation
sticks even after the underlying resource is gone) and clears the
crossplane.io/external-create-{succeeded,pending} annotations so the
provider retries the create. Also catches Synced=False MRs stuck on
cached provider errors.
cleanup.sh force-finalizes stuck workload-cluster Helm releases and
provider-kubernetes Objects after the InferenceCluster delete is
issued. provider-helm / provider-kubernetes hang on these because
the target cluster is being torn down and the delete API calls fail.
Both HACK blocks carry TODO comments pointing at the upstream
crossplane provider issues — the right fixes belong in provider-gcp
(don't trust the create-succeeded annotation forever; reconcile with
GCP truth on observe) and provider-helm / provider-kubernetes
(treat "target cluster gone" as delete success).
Mermaid flowchart of what the demo composes, from user-facing XRs down to live workload pods. ModelCache primitives (XR, PVC MR, Job MR, mounted PVC) highlighted in yellow to make the new piece visually obvious vs the shared cluster/KServe infrastructure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ModelService, ModelEndpoint, ModelReplica, Backend, HTTPRoute and the per-replica composition path. Introduce a second highlight color (orange) for the external substrate we may later replace with Modelplane-internal primitives — KServe Release, LWS Release, LLMInferenceService MR, and the LWS gang itself. The yellow ModelCache path and the orange substrate path are now visually distinct, making clear that the cache work is independent of any future engine-substrate swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GCP Filestore Basic HDD (backs the standard-rwx storage class on GKE) takes 8–15 min to provision the first instance. The 10m wait was tight enough to time out on a cold cluster even when the hydration itself worked. 20m gives Filestore comfortable room without burying real failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Filestore CSI driver provisions Filestore Instances via the file.googleapis.com API. If the API is not enabled on the project, PVCs sit Pending indefinitely with SERVICE_DISABLED — the GKE cluster comes up fine, the CSI driver installs fine, the PVC is accepted, but every provisioning attempt fails 403. We hit this on crossplane-playground today and the symptom (PVC Pending forever) is annoying to diagnose without checking workload-cluster PVC events directly. Enabling the API is idempotent and cheap; do it as part of setup so the first PVC the demo creates can actually bind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Filestore CSI addon on GKE installs the in-cluster driver fine on its own, but provisioning calls hit file.googleapis.com. If that API isn't enabled on the project, the symptom is silent: PVCs sit Pending forever with SERVICE_DISABLED in their workload-cluster events while the cluster itself reports healthy. Hit this on crossplane-playground today; took a workload-cluster describe to diagnose. Emit a ProjectService MR for file.googleapis.com when the user opts into Filestore via storage.csiDrivers: [SharedFilesystem]. ProjectService enable takes seconds and runs in parallel with the multi-minute cluster create, so it doesn't extend critical path. Deliberately not enabling container/compute APIs — if those aren't on, the user's GCP project setup is incomplete and they should fix it explicitly. The addon case is different because the user explicitly opted into Filestore via the user-facing API. disableOnDestroy=False so tearing down one InferenceCluster doesn't yank an API the rest of the project relies on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-part change to fix Filestore PVCs landing on the wrong VPC:
1) compose-gke-cluster surfaces the observed Network MR's name in GKECluster.status.network.name (new field; XRD updated).
2) compose-inference-cluster, in the GKE branch, reads that status field and (when storage.csiDrivers contains SharedFilesystem) composes an Object → StorageClass on the workload cluster with provisioner=filestore.csi.storage.gke.io and parameters.network=<our VPC>. Default SC name is `modelplane-rwx`; user can override via spec.storage.storageClassName.
The user-facing API stays cloud-agnostic: the cloud-specific knob (`network` for Filestore) is wired by the GKE-specific composition branch from the GKE-specific infra XR's status. Same pattern will apply when EKS / AKS branches need to wire fileSystemId / shareName.
Demo's 01-cache.yaml switched to mp-filestore-rwx so today's running cluster (which got the GKE built-in standard-rwx by accident) can still consume the manually-created custom SC. New clusters will get the auto-composed `modelplane-rwx` from the InferenceCluster XR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
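As a rough illustration, the StorageClass manifest that gets wrapped in the Object MR might be built as below. The provisioner, the `network` parameter, and the default name are from the commit message; the function name and the example VPC name are hypothetical, and any further StorageClass fields the real composition sets are omitted.

```python
def filestore_storage_class(network: str, name: str = "modelplane-rwx") -> dict:
    """Build a workload-cluster StorageClass pinned to the given VPC."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "filestore.csi.storage.gke.io",
        # Pinning the network keeps Filestore instances off the default VPC.
        "parameters": {"network": network},
    }


# "demo-vpc" stands in for the value read from GKECluster.status.network.name.
manifest = filestore_storage_class("demo-vpc")
```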
Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and replacing with a focused page that documents what shipped in this PR: shape, what gets composed, multi-node Ray bootstrap, scope boundaries. Demo proof at examples/qwen-cached-demo/. v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster dedup) are explicitly out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the diagram and prose to match what actually runs:
- LWS gang shown as distinct leader + worker pods (not one combined node) with the Ray cluster edge between them
- Both pods mounting the cached PVC explicit
- StorageClass MR carries the parameters.network detail
- LIS MR notes the flat PodSpec + Ray-bootstrap command
- Hydration Job notes hf download (not the deprecated CLI)
- GKECluster.status.network.name → StorageClass edge shown
Adds a one-paragraph "what this proves" prose section so a reviewer can read the diagram + caption and understand the v0.1 path end-to-end without spelunking through commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OSS Modelplane doesn't name the future storage substrates beyond "additional backends may land in later versions"; concrete names (and any associated commercial framing) belong outside this repo.
Updates:
- apis/modelcaches/definition.yaml: drop the explicit "ContentAddressed and Custom backends are deferred to v0.2" sentence from the backend description; keep the extension point.
- functions/compose-model-cache/main.py: same swap in the module docstring.
- design/modelcache/README.md: rewrite the out-of-scope table and the forward-compatibility section to talk about generic extension points rather than naming v0.2 backends. Drop the content-addressed / cross-cluster-dedup framing entirely.
Also fix a Mermaid lexer error in TOPOLOGY.md: the `status.network.name` edge label contains dots that the unquoted form parses as identifier separators; quote it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the v0.1 1-pager focused on what shipped; don't enumerate future features by name in the out-of-scope table. The XRD's extension points (backend enum, replication enum, source union) speak for themselves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LWS reports the gang Ready as soon as both pods reach 1/1, but vLLM inside the leader takes another ~60-120s to load the model from the cached PVC and finish CUDA graph capture before Uvicorn opens. The previous curl ran the moment LWS gang reported Ready and got back `upstream connect error … Connection refused` from the gateway, even though everything was wired correctly. Move the readiness wait into the curl-test pod itself: it loops `curl -f /v1/models` until 200, then sends the chat completion. Avoids the kubectl-run-rm stdout-capture pitfall and keeps the single-pod ergonomic of the original demo.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
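The wait-then-request pattern can be sketched in Python. The actual demo does this with a curl loop inside the test pod; the function names, attempt count, and delay below are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 when the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but not with a success status
    except OSError:
        return 0  # connection refused, DNS failure, timeout


def wait_until_serving(base_url, probe=http_status, attempts=60,
                       delay=5.0, sleep=time.sleep) -> bool:
    """Poll <base_url>/v1/models until it returns 200 or attempts run out."""
    for _ in range(attempts):
        if probe(f"{base_url}/v1/models") == 200:
            return True
        sleep(delay)
    return False
```

Bounding the loop with `attempts` means a miswired endpoint fails the run instead of spinning indefinitely.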
ModelService.status.address is already a full URL with the path prefix (e.g. http://172.18.255.200/ml-team/qwen-cached-demo); the original curl line in demo.sh prepended "http://" and re-appended "/${NS}/qwen-cached-demo" — yielding http://http://172.18.255.200/ml-team/qwen-cached-demo/ml-team/qwen-cached-demo/v1/... DNS then tried to resolve hostname "http", curl failed silently (stderr suppressed for the readiness probe), and the retry loop spun forever. Use the address as-is and append the OpenAI path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
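The bug and the fix can be reproduced in a few lines. Both function names are hypothetical, and `/v1/chat/completions` is assumed as the OpenAI path being appended, since the original message elides it.

```python
ADDRESS = "http://172.18.255.200/ml-team/qwen-cached-demo"


def chat_url_buggy(address: str, namespace: str) -> str:
    # Treats the address as a bare host: prepends the scheme and re-appends the prefix.
    return f"http://{address}/{namespace}/qwen-cached-demo/v1/chat/completions"


def chat_url(address: str) -> str:
    # The address is already a full URL with its path prefix; only add the API path.
    return f"{address.rstrip('/')}/v1/chat/completions"


broken = chat_url_buggy(ADDRESS, "ml-team")
fixed = chat_url(ADDRESS)
```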
The README still described the original side-by-side cached vs uncached demo; the current demo.sh runs a single multi-node LWS gang (tensor 1 × pipeline 2) against the cached PVC. Rewrite to match: T4 quota, the actual phase output, the pod-name pattern, pointer to TOPOLOGY.md. demo.gif is a 4×-speed recording of a warm-cluster run captured via `asciinema rec --command "bash demo.sh"` then `agg --speed 4`. Cache pre-hydrated, LWS gang Ready in 65s, engine serving 86s later, real Qwen chat completion JSON in the final frame. GitHub autoplays the gif inline in the rendered README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three kubectl-get checkpoints to demo.sh so the recording shows what's actually happening on the cluster rather than just script milestones:
- After cache hydration: print the ModelCache row + the Object MRs composing the PVC + Job on every matched cluster (with Synced / Ready columns).
- After deployment apply: print the user-facing workload tree (ModelDeployment, ModelService, ModelReplica, ModelEndpoint) so a viewer sees how one ModelDeployment fans out to one ModelReplica per cluster plus a ModelEndpoint per replica.
- After LWS gang Ready: print the Object MRs again. The LLMInferenceService MR (the orange-band substrate we may swap later) is now visible alongside the cache MRs.
Same scripted flow, more visible payload for the recording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The custom-columns table tells you the composed MRs exist; the
JSON manifest dump shows exactly what got applied to the workload
cluster — the artifact a viewer of the gif/cast wants to see.
- After cache hydration: pretty-print the PVC manifest. Shows
accessModes: [ReadWriteMany], the modelplane-rwx storage class
reference, the size.
- After LWS gang Ready: pretty-print the LLMInferenceService
manifest. Shows model.uri=pvc://modelcache-..., the worker
PodSpec with the Ray-bootstrap shell as container.command,
parallelism.{tensor=1, pipeline=2}, and the engine args list.
Uses kubectl's jsonpath to extract spec.forProvider.manifest
then python3 -m json.tool for formatting — no extra deps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compose_project_services composes the ProjectService MR for file.googleapis.com when the user opts into the Filestore CSI addon, but mark_readiness's tracked list didn't include it. Result: even when the MR observes Ready, the function never sets `rsp.desired.resources["projectservice-filestore"].ready`, so Crossplane's XR readiness aggregator keeps the GKECluster XR at Ready=False with the message "Unready resources: projectservice-filestore" forever. Caught on the cold-start full demo recording — setup.sh hit its 20m kubectl-wait timeout sitting on this single missing mark even though every underlying GCP resource was Ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
provider-kubernetes's Object is a namespaced resource. The InferenceCluster XR is cluster-scoped, so any composed namespaced resource needs metadata.namespace set explicitly; otherwise Crossplane's reconcile errors with "cannot get composed resource: an empty namespace may not be set when a resource name is provided" and the XR sits Synced=False. Following the pattern compose_kserve_backend / compose_cluster_provider_config already use (modelplane-system namespace). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff reformat on compose-inference-cluster/main.py (CI lint was
failing on collapse of the observed_gke_network_name dict-walk).
Existing test golden updates:
- test-model-cache: hydration command now filters lost+found from
the emptiness check and uses `hf download` (huggingface-cli was
deprecated in huggingface-hub 1.x).
- test-model-replica-with-cache: container args carry
--model=/mnt/models when spec.caches is set, so vLLM loads the
cached weights instead of falling back to its hardcoded default.
- test-model-replica-multinode: KServe v0.17+ flat-worker shape
(worker is a PodSpec, not {size, template}); parallelism.tensor
is per-pod count + parallelism.pipeline carries the pod count;
container command carries the Ray-bootstrap shell (leader runs
`ray start --head` + execs vLLM, worker runs `ray start
--address=$LWS_LEADER_ADDRESS:6379 --block`).
Two new tests for code paths introduced in this PR:
- test-gkecluster-filestore-addon: asserts compose_project_services
emits a ProjectService MR for file.googleapis.com when the
user opts into the Filestore CSI addon. Catches regressions to
the API-auto-enable path.
- test-inference-cluster-csi-shared-filesystem: asserts the
compose_gke_storage_classes path emits an Object MR wrapping a
workload-cluster StorageClass with metadata.namespace set
(regression we hit live: without namespace, Crossplane errors
with "an empty namespace may not be set when a resource name is
provided"). Also asserts parameters.network is pinned to the
observed VPC from GKECluster.status.network.name.
All 20 composition tests pass. ./nix.sh flake check also passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier commit switched 01-cache.yaml from `standard-rwx` (GKE's built-in Filestore SC) to `modelplane-rwx` (the SC the InferenceCluster composition auto-creates with parameters.network pinned to our VPC), but cluster.yaml's spec.storage.storageClassName was missed and still said `standard-rwx`. That caused the composed Object MR to try to create a StorageClass named `standard-rwx`, which conflicts with the GKE-managed built-in (`parameters` is immutable on existing SCs), so the Object MR sat Synced=False forever and the InferenceCluster couldn't reach Ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
StorageClass doesn't carry a Ready condition of its own, so the DeriveFromObject readiness policy provider-kubernetes uses keeps the wrapping Object MR at Ready=False forever — blocking the InferenceCluster XR from going Ready even though the SC was applied successfully on the workload cluster. Switch the policy to SuccessfulCreate, which marks the MR Ready as soon as the SC apply succeeds. For a config-only resource that's the actual readiness signal we care about. Test golden updated to match. Also drop the unused gkecluster import in the new test-gkecluster-filestore-addon test that ruff flagged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compose_gke_storage_classes composed the Object MR for the workload-cluster StorageClass with the correct manifest, but never set rsp.desired.resources["storage-class-rwx"].ready when the observed MR was Ready. Crossplane's auto-readiness aggregator left it at READY_UNSPECIFIED and the InferenceCluster XR sat Ready=False with "Unready resources: storage-class-rwx" even when the SC was applied successfully on the workload cluster. Read observed condition and mirror into the response, the same pattern the function already uses for `gke-cluster`. Test backfill: add an observed storage-class-rwx Object with Ready=True to the inference-cluster-csi-shared-filesystem test so the propagation code path is at least exercised. The compositiontest framework asserts manifest shape but doesn't directly assert on the response-side ready flag, so this is regression-friendly more than regression-proof. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
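The mirroring step can be sketched with plain dicts standing in for the function request and response. The integer constants are stand-ins for the SDK's Ready enum, and the helper names are hypothetical; the real code operates on the protobuf response.

```python
READY_UNSPECIFIED, READY_TRUE = 0, 1  # stand-ins for the SDK's Ready enum values


def observed_ready(resource: dict) -> bool:
    """True when the observed resource carries a Ready=True condition."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def mirror_readiness(desired: dict, observed: dict, names: list) -> None:
    """Copy observed readiness onto the matching desired resources in place."""
    for name in names:
        if name in desired and observed_ready(observed.get(name, {})):
            desired[name]["ready"] = READY_TRUE
```

Any composed resource left out of `names` stays unspecified, which is exactly how a single missing entry can hold the whole XR at Ready=False.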
If any other shell mutates ~/.kube/config (e.g. a `gcloud container clusters get-credentials` to peek at the workload cluster), every subsequent kubectl call in a running demo.sh / setup.sh / cleanup.sh silently retargets to whatever the new current-context is. We hit this live during a recording: the workload-cluster context took over mid-flight and `kubectl get modelcache` blew up because the workload cluster doesn't have the ModelCache CRD. Capture `current-context` at script start (or honor an explicit MODELPLANE_CONTEXT env var) and define a `kubectl` shell function that passes `--context=$KCTX` through to every call. Subsequent context flips elsewhere on the box don't affect the running script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kserve-llmisvc-resources Helm chart ships the
LLMInferenceServiceConfig CRD but does NOT create the six default
instances the controller looks up via spec.baseRefs on every
LLMInferenceService admission:
kserve-config-llm-default
kserve-config-llm-router-route
kserve-config-llm-worker-tensor-parallel
kserve-config-llm-worker-pipeline-parallel
kserve-config-llm-decode
kserve-config-llm-prefill
Without them, every LIS admission fails with:
PresetsCombined False CombineBaseError
failed to get LLMInferenceServiceConfig
"kserve-config-llm-worker-pipeline-parallel" …
not found
and the controller silently retries forever. We hit this twice
during demo recordings; memory note from a prior debugging session
called it out as something compose-kserve-backend should automate.
Doing that now.
Compose six Object MRs (one per preset) targeting the workload
cluster with spec: {} — empty is enough to satisfy the lookup.
Gated on the kserve-controller resource being observed so we
don't race the CRD install. SuccessfulCreate readiness because
LLMInferenceServiceConfig has no Ready condition of its own.
Also: shellcheck fix on demo / setup / cleanup scripts — the
kubectl shell function was being called before its definition.
Use `command kubectl` for the bootstrap kubectl-config-read so
shellcheck stops flagging SC2218.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every ==> step in setup.sh and demo.sh now has:
- A one-line explanation of what's happening behind the script in Modelplane terms (what's being composed, what the resource shape means architecturally) so a viewer of the gif understands the story without reading the demo source.
- An elapsed timer that prints "✓ <phase> in Ns" when each major milestone completes (cache hydration, gang Ready, service address, engine serving) and a total time at the end.
setup.sh tracks the InferenceCluster Ready transition time + total setup time. demo.sh already tracked per-phase times; this commit adds the prose context and a total at the end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a section to demo.sh that fetches workload-cluster credentials via gcloud, lists the gang pods with their node placement, then execs into each pod to show:
- the /mnt/models mount line (same NFS endpoint on both pods)
- stat on model.safetensors (same inode + size on both pods)
Together that proves the LWS leader and worker are reading from the exact same shared PVC backed by the same Filestore instance, on different nodes.
Skips gracefully if gcloud / gke-gcloud-auth-plugin isn't available. Uses bracket jsonpath for the dot-containing annotation key (`crossplane\.io/external-name`) and a single-quoted sh -c body for the in-pod commands to avoid nested-quoting hell.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: the preset Object MRs were composed with providerConfigRef.name pointed at the helm/k8s ProviderConfig (`<xr-name>-cluster`, `qwen-cached-demo-kserve-cluster`), but Object MRs need a ClusterProviderConfig, and the one configured for the workload cluster lives at `<inferencecluster-name>-cluster-kubeconfig` (`qwen-cached-demo-cluster-kubeconfig`), composed by compose-inference-cluster.

Result: every preset Object MR sat Synced=False with

    CannotConnectToProvider: cannot get provider config: ClusterProviderConfig … "qwen-cached-demo-kserve-cluster" not found

which propagated up to InferenceCluster sitting Ready=False, which blew through setup.sh's kubectl wait timeout on the cold-start recording.

The KServeBackend XR is named `<inferencecluster>-kserve` so we strip that suffix to derive the parent name + the correct CPC name.

Test: new test-backend-presets case observes the kserve-controller Helm release (the gate condition) and asserts on each of the six composed preset Object MRs: manifest shape AND providerConfigRef target. Catches the wrong-PC-name copy-paste class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
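The suffix-stripping described in this commit can be sketched as below. This is a minimal illustration, not the code in `functions/compose-kserve-backend/main.py`: the function name `cluster_provider_config_name` and the constant `KSERVE_SUFFIX` are hypothetical, and only the two naming patterns (`<inferencecluster>-kserve` and `<inferencecluster-name>-cluster-kubeconfig`) come from the commit message.

```python
# Hypothetical helper: names are illustrative, not taken from the repository.
KSERVE_SUFFIX = "-kserve"


def cluster_provider_config_name(backend_xr_name: str) -> str:
    """Derive the ClusterProviderConfig name from a KServeBackend XR name.

    The XR is named `<inferencecluster>-kserve`; the ClusterProviderConfig
    composed by compose-inference-cluster is named
    `<inferencecluster>-cluster-kubeconfig`.
    """
    if not backend_xr_name.endswith(KSERVE_SUFFIX):
        # Fail loudly instead of silently pointing at a config that does not exist.
        raise ValueError(f"expected a name ending in {KSERVE_SUFFIX!r}: {backend_xr_name!r}")
    parent = backend_xr_name[: -len(KSERVE_SUFFIX)]
    return f"{parent}-cluster-kubeconfig"


if __name__ == "__main__":
    print(cluster_provider_config_name("qwen-cached-demo-kserve"))
```

Raising on an unexpected name is a design choice of this sketch; the actual function may handle that case differently.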
The kserve-llmisvc-resources Helm chart v0.16.0 already creates this particular preset. If our function composes it first via Object MR (no Helm ownership labels), the Helm install errors with:

    LLMInferenceServiceConfig "kserve-config-llm-router-route" in namespace "kserve" exists and cannot be imported into the current release: invalid ownership metadata

…and the whole kserve-controller release sits Synced=False forever.

Cut router-route from _LLMISVC_PRESET_NAMES; only compose the five the chart actually leaves blank. Test golden updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second recording captures the demo.sh path where the InferenceCluster is already Ready but the cache hasn't been hydrated yet. End-to-end: ModelCache apply + hydration → ModelDeployment + ModelService apply + workload-tree dump → LWS gang spin-up → composed-MR + LIS-manifest inspection → the new "prove both gang pods do IO from the same PVC" block (pod placement on two nodes + matching NFS endpoint + matching safetensors inode) → real Qwen chat-completion response. Existing demo.gif (warm cluster, cache pre-hydrated, gang Ready in 65s) stays — they show different parts of the lifecycle. README now embeds both with captions noting what's pre-hydrated in each. GIF rendered with `agg --speed 4 --idle-time-limit 1` so dead time during long phases (Filestore provision wait, LWS gang image pull, vLLM CUDA-graph capture) collapses to 1s for readability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agg --speed 2 --idle-time-limit 1 — keeps the dead-time collapse but lets a viewer actually follow the per-step output. File size goes 363 KB → 406 KB, well under GitHub's inline-autoplay budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous renders used --speed 2 / --speed 4, which sped up the output stream globally — meant phase headers and kubectl-get blocks flashed by faster than a viewer could read. The real fix is to keep playback at real-time and only collapse the *between-phase* idle gaps (which are dominated by Filestore provisioning + LWS image pull + CUDA-graph capture — minutes of dead time that don't add value to the gif). agg --speed 1 --idle-time-limit 5: pauses longer than 5s clamp to 5s, everything else plays at the recorded rate. Each kubectl-get block now lingers long enough to actually read. File size 363 KB → 458 KB; under GitHub's inline-autoplay budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
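To make the "pauses longer than 5s clamp to 5s" behaviour concrete, here is a small sketch of the idea applied to a list of event timestamps. It is not agg's implementation; `clamp_idle` and its parameters are hypothetical names used only for illustration.

```python
from typing import List


def clamp_idle(timestamps: List[float], idle_limit: float = 5.0) -> List[float]:
    """Return new timestamps where any gap longer than idle_limit is shortened to idle_limit.

    Gaps at or below the limit are kept as recorded, so normal output
    still plays back at its original pace.
    """
    clamped: List[float] = []
    prev_in = 0.0   # previous timestamp in the original recording
    prev_out = 0.0  # previous timestamp in the clamped timeline
    for t in timestamps:
        gap = min(t - prev_in, idle_limit)
        prev_out += gap
        clamped.append(prev_out)
        prev_in = t
    return clamped


if __name__ == "__main__":
    # A 60s wait between the second and third events collapses to 5s.
    print(clamp_idle([1.0, 2.0, 62.0, 63.0]))
```

Only the long wait shrinks; the one-second gaps around it are untouched, which is why each kubectl-get block stays readable.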
PR #75 deleted ui/ entirely. An earlier commit on this branch swept ui/frontend/node_modules into the index via git add -A, so the rebase faithfully re-added ~10k files (~2.5M lines, ~181M) on top of the deletion. Drop them.
6eb1a25 to
c61136c
Warm-cluster run: cache pre-hydrated, LWS gang Ready in 65s, engine answers 86s after gang Ready. 4× speed.
Adds the v0.1 `ModelCache` primitive (XRD + composition function + `ModelDeployment.spec.caches[]` wiring) and turns multi-node `TensorPipeline` serving from broken-by-default into a working path. See the Multi-node LWS section below for the specific issues we untangled.

Base: `demonstration` (#75) so the diff stays scoped. GitHub will rebase to `main` automatically when #75 merges.

What ships

`ModelCache`: composes a per-cluster RWX PVC + a one-shot hydration Job on every matched `InferenceCluster`. Pods that reference the cache via `ModelDeployment.spec.caches[]` get the PVC mounted at the engine's expected path; no per-replica HuggingFace pull at boot.

- Kinds: `Weights`, `Tokenizer`, `Bytes`
- Sources: `huggingFace`, `s3`, `http`, `inline` implemented; `oci`, `configMap` discriminators reserved
- Backends: `PVC` (Modelplane-managed), `ExistingPVC` (customer-managed)
- Replication: `AllMatchingClusters`
- Wiring: `ModelDeployment.spec.caches: [{ name }]` → `LLMInferenceService.spec.model.uri = pvc://...`

Cloud-agnostic CSI capability plumbing. `InferenceCluster.spec.storage.csiDrivers: [SharedFilesystem | ObjectStorageMount | BlockDevice]` declares the cluster's semantic capability. The GKE composition branch reads the underlying VPC name from `GKECluster.status.network.name` and auto-composes a `modelplane-rwx` StorageClass on the workload cluster with `parameters.network` pinned. Same pattern will hook into EKS / AKS with their respective knobs.

GCP API auto-enable. `compose-gke-cluster` emits a `ProjectService` MR for `file.googleapis.com` whenever the user opts into `SharedFilesystem`.

1-pager design: `design/modelcache/README.md`. Replaces the earlier verbose design doc (PR #76, closed in favor of this focused page).

End-to-end demo: `examples/qwen-cached-demo/` deploys Qwen 2.5 0.5B on a 2-pod LWS gang with a real chat-completion response. See `TOPOLOGY.md` for the composed XR / MR layout.

Tests: model-cache × 3 + model-replica-with-cache + the full existing suite; 20/20 pass.
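As a rough illustration of the VPC-pinned StorageClass mentioned under "What ships", the manifest could be assembled along these lines. This is a sketch under assumptions: the helper name `rwx_storage_class` is hypothetical, the provisioner `filestore.csi.storage.gke.io` is assumed to be the GKE Filestore CSI driver in use, skipping composition while the network name is still empty is an assumed gating choice, and any other parameters the real composition sets are omitted.

```python
from typing import Optional


def rwx_storage_class(network: str, name: str = "modelplane-rwx") -> Optional[dict]:
    """Build a StorageClass manifest with parameters.network pinned to the cluster's VPC.

    Returns None when the network name is not known yet (for example before
    GKECluster.status.network.name is populated), so the caller can skip
    composing the resource on that pass. That gating is an assumption of
    this sketch.
    """
    if not network:
        return None
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Assumed provisioner: the GKE Filestore CSI driver.
        "provisioner": "filestore.csi.storage.gke.io",
        # Pinning the network keeps Filestore instances off the `default` VPC.
        "parameters": {"network": network},
    }


if __name__ == "__main__":
    # "my-cluster-vpc" is a placeholder VPC name.
    print(rwx_storage_class("my-cluster-vpc"))
```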
Multi-node LWS — what was broken and what we did
Bringing the demo up needed fixes layered across the function, the LIS shape, the cache hydration command, the engine wrapper, and the cluster's StorageClass. Each row below is something an OSS user would hit on a fresh cluster; calling them out so reviewers know what each fix buys us.
- `--pipeline-parallel-size > 1` needs a Ray cluster spanning the gang's pods. KServe's `ServingRuntime` injects this for you; the custom-template path we use doesn't. Worker pod never joined the leader's Ray cluster, vLLM's placement group hung forever.
  Fix: `compose-model-replica` injects a 5-line shell wrapper as the container `command` when `topology.strategy == TensorPipeline`. Leader runs `ray start --head` then execs the engine; worker runs `ray start --address=$LWS_LEADER_ADDRESS:6379 --block`. Bootstrap lives as a Python constant, so when we swap KServe for our own `LeaderWorkerSet` composition later, the same constant moves verbatim.
- … `worker: {size, template}` (v0.16 shape); v0.17+ wants `worker` as a flat `PodSpec` with `containers[]`. `parallelism.tensor` is now per-pod GPU count and `parallelism.pipeline` is required when `worker` is present.
  Fix: … `worker:` + set `parallelism.{tensor,pipeline}` from the topology.
- … `--model` into engine argv. `model.uri = pvc://...` mounts the PVC at `/mnt/models` but the engine container's args don't get `--model=<path>` anymore. vLLM defaulted to `facebook/opt-125m` and silently downloaded it from HF instead of using the mounted cache.
  Fix: … `--model=/mnt/models` to the engine args when `spec.caches[]` is set.
- `vllm serve` vs `python -m vllm.entrypoints.openai.api_server`. `vllm serve` takes the model as a positional argument; the api_server entrypoint (the image's default `ENTRYPOINT`) takes `--model=`. The first wrapper invoked `vllm serve "$@"`, which silently exited on argparse mismatch.
  Fix: … `python -m vllm.entrypoints.openai.api_server` the default `ENTRYPOINT` uses. Same arg shape for single-node and multi-node.
- … `default` VPC. GKE's chart-shipped `standard-rwx` StorageClass has no `network` parameter, so Filestore instances landed on `default` and were unreachable from cluster nodes; PVCs sat Pending forever with NFS mount timeouts.
  Fix: `compose-inference-cluster` (GKE branch) auto-composes a workload-cluster StorageClass (`modelplane-rwx`) with `parameters.network = <our VPC>`. Network surfaced via `GKECluster.status.network.name`.
- … `lost+found` triggered a false-positive skip. The hydrate Job's "already populated?" check did `ls -A /mnt/model`. Filestore's ext4-style filesystem puts `lost+found` at mount root, so a fresh PVC always looked non-empty; the Job exited with `artifact already hydrated, skipping` and the cache stayed Ready with no weights on disk.
  Fix: … `lost+found` out of the emptiness check.
- `huggingface-cli` was deprecated in `huggingface-hub 1.x` and the `[cli]` extra removed. The hydrate Job's `pip install huggingface_hub[cli]` printed a warning and `huggingface-cli download` exited with a deprecation notice, no bytes written.
  Fix: … `hf download` (new CLI, same flags).
- … `Pending` indefinitely with `SERVICE_DISABLED` in their workload-cluster events; the GKE cluster itself looks healthy.
  Fix: `compose-gke-cluster` emits a `ProjectService` MR for `file.googleapis.com` whenever the Filestore CSI addon is on.
- `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0`. Not a Modelplane-layer issue; documented as an example concern.
  Fix: `--dtype=half`.

Net result: a fresh `kubectl apply -f examples/qwen-cached-demo/` provisions a GKE cluster with Filestore CSI enabled + a VPC-pinned StorageClass, hydrates Qwen 2.5 0.5B onto a 100 GiB RWX PVC, boots a 2-pod LWS gang on two T4 nodes, both pods mount the same PVC, the leader's Ray head and the worker's Ray client form one cluster, vLLM's pipeline-parallel placement group is satisfied, and `curl /v1/chat/completions` returns a real Qwen response over HTTP 200.

Out of scope for v0.1

- `Adapter` / `Engine` artifact kinds: composition shape needs more design before locking
- … `PVC` / `ExistingPVC`: the `backend` enum is the extension point
- `AllMatchingNodes` replication granularity: per-node staging is a separate problem

Workarounds / hacks (cleanup before / after merge)
Real composition-code workarounds shipping in this PR. The demo scripts under `examples/qwen-cached-demo/` have their own pile of provider-gcp / provider-helm / provider-kubernetes workarounds; those will get gitignored and graduate into a smoke-test fixture, so they're not called out here.

Composition-code hacks

- `functions/compose-gke-cluster/main.py` (TODO L129): … `ZONE_RESOURCE_POOL_EXHAUSTED` isn't surfaced through the nodepool MR's `conditions`; GCE keeps the pool in `PROVISIONING` forever.
- `functions/compose-kserve-backend/main.py` (LLMInferenceServiceConfig presets): … `spec: {}` `LLMInferenceServiceConfig` instances on the workload cluster because the `kserve-llmisvc-resources` chart defines the CRD but doesn't ship the controller-required defaults.
- `functions/compose-kserve-backend/main.py` (router-route exclusion): … `kserve-config-llm-router-route` because the chart DOES include it, and naive composition leads to Helm "exists and cannot be imported" failures. The list-overlap is fragile across KServe versions.

Mid-session debugging interventions captured so they don't repeat

- `Object` MR for `storage-class-rwx` composed without `metadata.namespace` set; Crossplane reconcile errored with "an empty namespace may not be set when a resource name is provided". … `metadata.namespace=modelplane-system` to the composed Object. `test-inference-cluster-csi-shared-filesystem` asserts on the namespace.
- `storage-class-rwx` Object used `policy: DeriveFromObject` readiness, but StorageClass has no Ready condition; Object MR stayed `Ready=False` forever. … `policy: SuccessfulCreate`.
- … `rsp.desired.resources["storage-class-rwx"].ready`; XR aggregator left the InferenceCluster `Ready=False`. … `READY_TRUE` on the response. … `.ready` for direct assertion).
- `compose-gke-cluster` didn't track `projectservice-filestore` in `mark_readiness`; XR stuck on "Unready resources: projectservice-filestore" even after the API was enabled. `test-gkecluster-filestore-addon`.
- … `<xr-name>-cluster` (the Helm ProviderConfig) instead of `<inferencecluster>-cluster-kubeconfig` (the ClusterProviderConfig). … `-kserve` suffix from the KServeBackend XR name to derive the parent + the right CPC name. `test-backend-presets` asserts on `providerConfigRef.name`.

What needs to happen to merge
Must

- … `main`. PR is currently based on `demonstration` (#75, "Implement the updated API shape") so the diff stays scoped. GitHub will fast-forward once #75 lands.
- … `spec.engine` → `spec.workers.template` per Nic's merged commit on PR #64 (the `engine` field is now a curated subset of `PodTemplateSpec`). Affects `apis/modeldeployments/definition.yaml`, `apis/modelreplicas/definition.yaml`, `functions/compose-model-replica/main.py`.
- … `examples/qwen-cached-demo/` demo scripts (or relocate under an ignored path). They'll graduate into a smoke-test fixture; not part of the merged surface area.

Should

- … `LLMInferenceServiceConfig` instances + `kserve-llmisvc-resources` Helm release blocked by chart-shipped vs externally-created preset overlap).

Nice to have
Demo flow
Recordings of the warm-cluster demo and the full cold-start flow attached in a follow-up comment.
Related
#74 Fleet signal bus (status emission, future) · #75 API shape parent · PR #76 (closed; design folded into `design/modelcache/README.md`)