WIP: ModelCache design doc + examples#76

Closed
dennis-upbound wants to merge 28 commits into main from dennis/modelcache-design
dennis-upbound commented May 13, 2026

Collaborator

Design doc + 11 example yamls for the ModelCache primitive. Pure design — no code, no CRDs. Supersedes the sketch in #66.

Why v0.1 PVC-backed (TL;DR)

Multi-node LWS serving needs the same weight bytes mounted by every pod in the gang. The minimum primitive for that is an RWX PVC hydrated once by a side Job (not an init container — KServe's storage-initializer OOMs at 4/8/16 GiB on Kimi K2 / Llama 405B) and mounted read-only by every pod that references it. v0.1 ships exactly that. Single-node scale-up benefits as a side effect (no per-replica HF pull on scale events). The harder content-addressed wins — cross-deployment dedup, lazy load, cross-cluster sharing — stay in v0.2 behind the same user-facing API.
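
A minimal sketch of the v0.1 mechanism described above, assuming plain Kubernetes manifests built as Python dicts. The function names, the `<name>-weights` / `<name>-hydrate` naming scheme, the `model-cache` volume name, and every argument value are hypothetical. This is not the composition function from PR #78. It only illustrates the three pieces: an RWX PVC, a side Job with its own memory limit, and a read-only mount on every pod in the gang.

```python
import copy


def compose_cache(name, size, storage_class, image, job_memory):
    """Build the RWX PVC and the one-shot hydration Job for one cache."""
    claim = f"{name}-weights"  # hypothetical naming scheme
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim},
        "spec": {
            "accessModes": ["ReadWriteMany"],  # every pod in the gang mounts it
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{name}-hydrate"},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": "hydrate",
                        "image": image,
                        # The Job's limit is independent of the serving pod's limits.
                        "resources": {"limits": {"memory": job_memory}},
                        "volumeMounts": [{"name": "model-cache", "mountPath": "/cache"}],
                    }],
                    "volumes": [{
                        "name": "model-cache",
                        "persistentVolumeClaim": {"claimName": claim},
                    }],
                },
            },
        },
    }
    return {"pvc": pvc, "job": job}


def mount_read_only(pod_spec, name, mount_path):
    """Return a copy of pod_spec with the cache PVC mounted read-only in every container."""
    spec = copy.deepcopy(pod_spec)
    spec.setdefault("volumes", []).append({
        "name": "model-cache",
        "persistentVolumeClaim": {"claimName": f"{name}-weights", "readOnly": True},
    })
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": "model-cache", "mountPath": mount_path, "readOnly": True}
        )
    return spec
```

Applying `mount_read_only` to each pod template in the gang gives every pod the same bytes without a per-pod pull.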

Read

Alignment asks

  1. Timeline shift: v0.2 → v0.1. Driven by multi-node serving (#61, "Shared storage for multi-node inference") and DRA landing in v0.1 (#56, "Align hardware capabilities design with Kubernetes Dynamic Resource Allocation"). Flagged at the top of the doc; the rationale is in "Why v0.1 PVC-backed" above.
  2. Commercial substrate (per Bassam): ContentAddressed = Upbound hosted weight delivery; Custom (webhook) = OSS extension point. Customers run fully OSS via Custom; commercial product is fleet-scale dedup + Modal-class cold-start hosted by Upbound. Reflected in the Storage-backends table + v0.2 section.
  3. Five v0.1 open questions in the doc's Open questions section. Each has a documented lean; please push back if any feel wrong.

Companion

PR #78 (dennis/modelcache-impl) lands the v0.1 OSS scaffolding (XRD, composition function, ModelDeployment.spec.caches wiring, four standalone examples, an end-to-end Qwen-on-GKE cold-start demo with idempotent setup / demo / cleanup scripts) against this design.

Related

#66 (implementation tracker, body refactored) · #61 (closed; mechanism absorbed) · #56 (DRA, v0.1) · #72 KVOffloadTier · #71 ModelService routing affinity · PR #64 · PR #75 · PR #78

Dennis Ramdass and others added 19 commits May 13, 2026 09:28
1-pager covering v0.1 PVC backend (multi-node RWX + Weights/Tokenizer/Bytes),
v0.2 content-addressed backend with lazy loading (adds LoraAdapter + Engine
kinds), and v0.3 substrate unification option across ModelCache,
KVOffloadTier, and HotPrefixPool. Eight example yamls cover single-cluster
basic, multi-node TensorPipeline gang, multi-cluster replication, separate
tokenizer, private S3 source, plus three v0.2 previews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s example

- Hyperlink all #NN issue/PR references throughout the doc
- New "Design principle: pluggable backends" section explicitly framing
  the pattern shared with #72 KVOffloadTier and #73 HotPrefixPool
- Strengthen v0.3 substrate-unification tie-in with concrete artifact
  lifecycle split (immutable static / mutable runtime / immutable
  precomputed runtime)
- Add four Mermaid diagrams: v0.1 PVC flow, multi-node LWS gang shared
  PVC, v0.2 content-addressed lazy hydration, v0.3 substrate unification
- Add LoraAdapter baseRef field to the spec shape
- Add example 09-bytes-opaque.yaml covering the Bytes kind escape hatch
- Cold-start time estimates in every example header (rough order-of-
  magnitude based on ~50 MB/s HF pull, ~1 GB/s intra-region S3)
- Fix eu-west vs eu-west-1 region naming inconsistency in example 05
- Coherency: normalize replication: AllMatchingClusters across all
  examples; add caches:[{name}] field to ModelDeployment references

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
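
The order-of-magnitude cold-start estimates mentioned in this commit reduce to size divided by sustained throughput. The sketch below uses the two rates quoted above (~50 MB/s HF pull, ~1 GB/s intra-region S3). The 140 GB artifact size is a hypothetical input chosen for illustration, not a figure from the examples.

```python
def transfer_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Seconds to move size_gb at a sustained rate, taking 1 GB = 1000 MB."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000.0 / throughput_mb_s


HF_PULL_MB_S = 50.0          # ~50 MB/s, as quoted in the commit message
S3_INTRA_REGION_MB_S = 1000.0  # ~1 GB/s, as quoted in the commit message

size_gb = 140.0  # hypothetical artifact size
for label, rate in (("HF pull", HF_PULL_MB_S), ("intra-region S3", S3_INTRA_REGION_MB_S)):
    minutes = transfer_seconds(size_gb, rate) / 60.0
    print(f"{label}: {minutes:.1f} min")
# HF pull: 46.7 min
# intra-region S3: 2.3 min
```

These figures ignore request overhead, parallel shards, and disk write speed, so they are only useful as rough bounds.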
design.md rewrite folds in the scope-boundary framing from #66 (what
ModelCache covers vs what stays on the engine block — weights, engines,
tokenizers, chat templates, configs in scope; container images, env vars,
shmSize out of scope), the full source taxonomy (huggingFace / s3 / http /
inline / configMap in v0.1; gcs / azure / oci / pvc-clone in v0.2), the
KServe storage-initializer OOM motivation for Job-based prefetch, fail-fast
scheduling on missing RWX storage class, and the rationale for
AllMatchingClusters as the v0.1 replication default vs AllMatchingNodes.

Adds explicit decisions: mount path is intrinsic to the cache (no
per-reference override), one artifact per cache, storage class cluster
default with per-cache override.

Coherency: every ModelCache example now has explicit replication and
clusterSelector. mount path lives on the cache spec, not in the
ModelDeployment caches[] reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dennis confirmed DRA is on the v0.1 plan (just not yet in Nic's spike).
Updates:
- Add #56 DRA alignment to Related, marked as also-v0.1
- v0.1 mechanism: note clusterSelector can accept CEL over InferenceCluster
  pool attributes once #56 lands (alongside the matchLabels baseline)
- Alternatives considered: clarify nodeSelector.cel is v0.2 not because DRA
  is deferred but because it only fits AllMatchingNodes mode (per-node SSD
  via content-addressed backend), which is itself v0.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
design.md:
- Cut vibe statements ("This is fleet-level territory", "ModelCache is
  the first instance; the pattern generalizes")
- Drop redundant restatements in mount-path, scope-boundary, v0.1, and
  v0.3 sections
- Replace "Why now:" subsection with a single line in v0.2
- Cut "Better for positioning if Modelplane is 'the AI content CAS
  company'" marketing line from ContentCache alternative
- Tighten "Same pattern as #72 and #73 — the family's unifying
  architectural principle" to just "Same pattern as #72 and #73"
- Drop "Also doesn't generalize to compiled engines or non-weight
  artifacts" duplicate from engine-native alternative

Diagrams:
- Drop time-estimate labels from arrows (already in prose)
- Remove off-topic "InfiniBand / NIXL KV transfer" arrow from multi-node
  diagram (serving detail, not ModelCache)
- Drop "one set of bytes per artifact" subtitle from v0.2 content store
  node and tighten arrow labels
- Drop "composes" labels on dotted ModelCache -> Job/PVC edges; arrow
  style alone conveys it

Example headers — trim redundant "without ModelCache... With ModelCache..."
restatements in speedup paragraphs across 01, 02, 03, 05, 06, 07, 08.
Same numbers, half the words.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.1 ModelCache section now covers:
- Invalidation: source-version-as-cache-identity (immutable cache pattern),
  manual re-fetch annotation for source-side fixes
- Status: Ready/Populated/Failed per cluster, status fields, emission
  into #74 signal bus

v0.3 substrate unification expanded with three subsections:
- How the three primitives relate as a staged cold-start pipeline
  (weights-loading, first-request prefill, runtime KV pressure)
- Unified invalidation: master key is (modelDigest, tokenizerDigest);
  per-primitive eviction policies on top
- Unified observability table mapping signal types to each primitive,
  composite "cache effectiveness" view on ModelService.status

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add Mermaid diagram for the staged cold-start pipeline (3-phase boxes:
  cluster boot / first request / runtime) — visualizes how ModelCache,
  HotPrefixPool, and KVOffloadTier compose into a coherent cold-start
  story. Previously this was prose-only.
- Drop redundant decision #8 ("Substrate unification deferred to v0.3"),
  already implicit from v0.3 section header.
- Tighten Problem section closer from two disjoint sentences to one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t resolution

Folds in the prior-art research findings on customer-provided infrastructure
and delegated GC.

Shape:
- New "Storage backends" subsection (table) listing PVC / ExistingPVC v0.1
  and ContentAddressed / Custom v0.2
- New "BYO scenarios" subsection covering five axes: BYO source, BYO storage
  class, BYO pre-populated PVC, BYO P2P fan-out (Spegel/Dragonfly), BYO
  cluster. Explicit on what Modelplane retains regardless of BYO
  (artifact identity, scheduler gating, refcounting, observability,
  invalidation policy).
- Promote `oci` source from v0.2 to v0.1 — OCI + KitOps + Harbor is the
  converging air-gap pattern
- Add shim sources (`mlflow:` / `kubeflow-modelregistry:` / `wandb:`) to v0.2

v0.1 Invalidation:
- Rename to "Invalidation and GC"
- Add tag → digest resolution at hydration time, with status.resolvedDigest
- Clarify refcount is visible but Modelplane doesn't auto-GC: operator
  retires explicitly, PVCs reclaim per K8s reclaimPolicy

v0.2 GC:
- Brief subsection: delegate to object-store lifecycle policies + touch on
  access. Explicit refcounting only if a future case forces it.
- Update market-signal closer to include Baseten BDN, Run:ai, KitOps,
  Dragonfly+OCI as converging signals.

Examples:
- Add 10-byo-existing-pvc.yaml showing the ExistingPVC backend (customer
  manages population, Modelplane mounts and orchestrates)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
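
One way to read "tag → digest resolution at hydration time" together with source-version-as-cache-identity is sketched below. The `resolve` callable stands in for whatever lookup a real source would perform and is injected so the example stays offline. The `identity` field, the `repo` / `revision` keys, and the digest strings are hypothetical; only `resolvedDigest` is named in the commit message.

```python
import hashlib
import json
from typing import Callable


def cache_identity(source: dict, resolved_digest: str) -> str:
    """Derive a stable identity from the source spec plus the digest it resolved to."""
    payload = json.dumps({"source": source, "digest": resolved_digest}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def hydrate_status(source: dict, resolve: Callable[[dict], str]) -> dict:
    """Resolve a mutable tag once, at hydration time, and record the result."""
    digest = resolve(source)
    return {
        "resolvedDigest": digest,
        "identity": cache_identity(source, digest),  # hypothetical status field
    }


source = {"huggingFace": {"repo": "org/model", "revision": "main"}}  # hypothetical shape
before = hydrate_status(source, lambda s: "sha256:aaa")
after = hydrate_status(source, lambda s: "sha256:bbb")

# The same tag pointing at new bytes yields a different identity,
# so the existing cache is never mutated in place.
print(before["identity"] != after["identity"])  # True
```

Under this reading, a moved tag shows up as a new cache identity rather than as silent drift inside an existing PVC.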
design.md:
- Shape: clarify artifact.kind enum spans v0.1 vs v0.2; remove stray
  baseRef from Weights example (baseRef only applies to LoraAdapter,
  examples 07/08 show it properly)
- Sources: tighten v0.2 shim sources sentence (no fake URI-scheme colons)
- BYO: note ExistingPVC ignores replication
- v0.1 mechanism: backtick ReadWriteMany; backtick InferenceCluster
- Invalidation: fix "status annotation" → metadata annotation; sources
  identity covers HF revision / S3 version path / OCI digest
- Status: add resolvedDigest and references to listed fields; drop
  sourceETag (subsumed by resolvedDigest)
- v0.2 closer: tighten "PVC ships v0.1 fast" wording; use backticked
  backend names (PVC / ContentAddressed)
- v0.3 ModelCache eviction: align with v0.2 (TTL+touch by default, not
  refcount-by-default)
- Key decision 4: add ExistingPVC to backend list; backtick all four
- Open questions: backtick artifact-kind/source/backend names
  consistently; add `oci` to question 2
- Roadmap: use API names (LoraAdapter / Engine) in v0.2 issue title
- Examples list: backtick the cross-ref cache name

examples/10-byo-existing-pvc.yaml:
- Note that replication is ignored for ExistingPVC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fold in Bassam's "three packaging patterns" framing from PR #64 review
comment 4414021192 (engine-fetches-weights / engine-image-bakes-weights /
runtime-and-artifacts-separate). Grounds ModelCache as the Pattern 3
primitive that also accelerates Pattern 1; clarifies why Pattern 2 (NIM)
doesn't need it.

Add Locality routing subsection in v0.3 substrate unification connecting
the three primitives to #71 ModelService routing affinity. Cold-start
pipeline covers what new replicas need; locality routing covers where
existing requests go. ModelCache feeds both — status.clusters[] is the
eligibility signal for fleet routing.

v0.1 mechanism: tighten the scheduling-gating bullet to explicitly call
out status.clusters[] as the eligibility signal the fleet matcher reads,
not just an implicit scheduler hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
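
The eligibility signal described above can be pictured as a filter over `status.clusters[]`. This is a sketch only: the per-cluster `name` and `phase` keys are assumed field names, and treating `Ready` as the single schedulable phase is an assumption about how the Ready / Populated / Failed states are used.

```python
def eligible_clusters(cache_status: dict) -> list:
    """Names of clusters where this cache reports Ready."""
    return sorted(
        c["name"]
        for c in cache_status.get("clusters", [])
        if c.get("phase") == "Ready"
    )


def schedulable(candidates: list, cache_statuses: list) -> list:
    """Keep only candidate clusters where every referenced cache is Ready."""
    ready_sets = [set(eligible_clusters(s)) for s in cache_statuses]
    return [c for c in candidates if all(c in ready for ready in ready_sets)]


weights = {"clusters": [{"name": "gke-a", "phase": "Ready"},
                        {"name": "gke-b", "phase": "Failed"}]}
tokenizer = {"clusters": [{"name": "gke-a", "phase": "Ready"},
                          {"name": "gke-b", "phase": "Ready"}]}

print(schedulable(["gke-a", "gke-b"], [weights, tokenizer]))  # ['gke-a']
```

A deployment that references no caches passes every candidate through unchanged, which matches the idea that gating applies only to referenced caches.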
Status header now explicitly notes ModelCache advances from v0.2 (per
Bassam's PR #64 review framing) to v0.1, driven by multi-node serving
requirements (#61 closure) and DRA landing in v0.1 (#56). Flagged for
team alignment since this is a deliberate timeline shift from the
earlier framing.

Roadmap #66 line: tighten the awkward kind/source bundling
("Weights/Tokenizer/Bytes/inline/configMap" mixed kinds with sources)
into separate categories with backtick-normalized backend / kind names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc:
- Line 38: backtick PVC and ContentAddressed in design-principle paragraph
  (was unbackticked when describing the backend, inconsistent with other
  references in the doc)
- Line 167: "content-addressed backend" → "ContentAddressed backend" with
  backticks (named backend, not concept)
- Line 366: normalize slash spacing in roadmap line (Weights / Tokenizer
  / Bytes with spaces, matching PVC / ExistingPVC style)
- Line 383: backtick ContentAddressed in examples list entry for 06

Examples:
- example 09 (bytes-opaque): second ModelCache was missing clusterSelector
  and replication. Added both for consistency with the first cache and
  with every other example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es, example 11

Three packaging patterns: split Pattern 2 (NIM) into three sub-modes —
2a (weights baked in image, no ModelCache needed), 2b (NIM image fetches
into /opt/nim/.cache, ModelCache pre-seeds), 2c (air-gap via
ExistingPVC). Previous text glossed over 2b and 2c as if they didn't
exist.

Sources table: clarify http source covers NIM/NGC URLs (pre-seeding
/opt/nim/.cache), not just generic HTTP.

v0.2 Engine kind: extend to NIM profiles. engine.runtime: NIM +
profileId makes the (GPU SM, count, precision, TP, PP, target) tuple
explicit so Modelplane validates against cluster hardware before staging
— avoids the silent wrong-profile failure (e.g. H100 profile on B200).

Scope boundary: clarify that "container images out of scope" applies to
Mode 2a baked-weight images; Modes 2b/2c stage the NIM cache dir via
ModelCache.

New example 11-nim-cache.yaml: complete Mode 2b reference. NIM image,
NGC creds Secret reused as both imagePullSecret and NGC_API_KEY env,
Bytes kind staging /opt/nim/.cache from NGC, Hopper-only clusterSelector
matching the profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
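
The pre-staging validation described for `engine.runtime: NIM` + `profileId` could look roughly like the sketch below. The dataclass and field names are hypothetical, and the two checks shown (matching SM architecture, enough GPUs) are an assumed subset of what a real validator would cover. The `sm_90` and `sm_100` strings stand for Hopper and Blackwell respectively and are used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NimProfile:
    gpu_sm: str      # SM architecture the profile was built for
    gpu_count: int
    precision: str
    tp: int          # tensor-parallel degree
    pp: int          # pipeline-parallel degree
    target: str      # optimization target


@dataclass(frozen=True)
class ClusterHardware:
    gpu_sm: str
    gpus_per_node: int
    nodes: int


def validate_profile(profile: NimProfile, hw: ClusterHardware) -> list:
    """Return human-readable mismatches; an empty list means staging may proceed."""
    errors = []
    if profile.gpu_sm != hw.gpu_sm:
        errors.append(f"profile built for {profile.gpu_sm}, cluster has {hw.gpu_sm}")
    available = hw.gpus_per_node * hw.nodes
    if profile.gpu_count > available:
        errors.append(f"profile needs {profile.gpu_count} GPUs, cluster has {available}")
    return errors


h100_profile = NimProfile("sm_90", 8, "fp8", 8, 1, "throughput")
print(validate_profile(h100_profile, ClusterHardware("sm_100", 8, 1)))
# ['profile built for sm_90, cluster has sm_100']
```

Failing here, before any bytes are staged, is what turns the silent wrong-profile case into an explicit error.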
…log source

Three additions:

1. After Mode 2a/b/c list: note that NVIDIA's NIM Operator (NIMService /
   NIMCache CRDs) composes with ModelCache rather than competing —
   ModelCache stages at the K8s storage layer; NIM Operator consumes
   the mounted path. Customers already on NIM Operator can keep using it.

2. v0.2 LoraAdapter kind: baseRef can point at either a Weights cache or
   a NIM profile (one-liner) — supports customer fine-tunes layered on
   NIM bases.

3. v0.2 Engine kind: expand NIM coverage to include status.nimProfile
   (so deployments can verify compatibility without dereferencing the
   image) and a new nimCatalog shim source ({model, profile} resolves
   to the profile-specific NGC URL + cache layout, survives NGC URL
   schema changes).

None of these are blocking for v0.1; example 11 still covers v0.1 NIM
Mode 2b in full.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion

Header:
- Split the overstuffed Status line into Status + Timeline shift fields.
  Previous version crammed three sentences into one bold-prefix line;
  Timeline shift is a separate distinct concern worth its own field.

Diagrams:
- v0.1 PVC backend diagram: flip JOB -->|pull once| HF to HF -->|fetch|
  JOB. Previous direction read as request-flow while the adjacent
  JOB -->|write| PVC arrow was data-flow — inconsistent in one diagram.
- Multi-node serving diagram: same fix, HF -->|fetch| JOB.
- Now both diagrams are consistent data-flow left-to-right: source ->
  fetch -> Job -> write -> PVC -> mount -> Pod.

Language:
- Shape section: --model=hf://repo example was contrived (no engine
  actually uses an hf:// scheme). Replaced with realistic
  --model=meta-llama/Llama-3.3-70B-Instruct to match engine arg shape.
- v0.1 lead sentence: "Targets dense models..." was a fragment.
  Rewrite as "Use cases: ..." for cleaner reading.
- v0.1 mechanism bullet: split the two-sentence "scheduling gated +
  status.clusters[] is the eligibility signal" bullet into two scannable
  bullets.
- v0.2 storage backend: tighten "Cross-tenant dedup for public artifacts
  (opt-in for non-public)" cryptic parenthetical into a clearer sentence.
- AWS EFS/FSx -> AWS EFS / FSx for slash-spacing consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per "we have time to think for v0.2 etc unless it affects v0.1 API shape":

- Drop "v0.3 substrate unification — file as roadmap marker now or wait
  until v0.2 ships?" — pure project-management timing question, doesn't
  affect v0.1 API.
- Reword question 4 from "Migration from PVC to ContentAddressed" to
  "storage.backend mutability" — same content but explicit that the
  decision affects v0.1 API (whether the field can be flipped on an
  existing CR).
- Add lead sentence noting these are v0.1-scoped questions.
- Bold the question titles for scannability.

The remaining five questions all affect v0.1 API shape directly:
artifact kinds, sources, eviction, backend mutability, cross-namespace
refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One sentence in the Locality routing subsection: the same per-cluster
ready state, hydration latency, and hit rates the cache family emits
into #74 also feed a future intent-based serving layer (e.g. ttft.p99
SLA fields on ModelService). ModelCache is the supply-side input; the
SLA primitive is a separate design.

Connects ModelCache to the intent-based scheduling conversation without
expanding scope. Intent primitive itself stays out of this doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c); add Coverage note

Adapter is the generalized v0.2 kind for auxiliary weights bound to a
base. adapterType discriminator covers lora / controlnet / ipadapter /
textualInversion / t2iAdapter. Bakes in diffusion / vision LoRA /
NIM-base fine-tune cases without committing to per-subtype primitives.

Changes across design.md:
- Shape YAML kind comment: v0.2 list now Adapter | Engine
- Artifact kind discriminator paragraph: validation requirements
  include adapterType
- New "Coverage" note: ModelCache is format- and modality-agnostic.
  Same primitive serves LLM weights (safetensors / GGUF / ONNX),
  embedding models, multimodal VLMs, ASR/TTS, voice libraries,
  compiled engines, and arbitrary byte trees via Bytes. Format
  awareness lives in the engine, not the cache.
- Out-of-scope-for-v0.1 bullet renamed
- v0.2 New artifact kinds bullet expanded to describe Adapter
- v0.3 substrate diagram + bullet: "LoRAs" -> "adapters"
- Open question 1: "LoraAdapter" -> "Adapter (with adapterType: lora)"
- Roadmap v0.2 issue title: "Adapter and Engine kinds"

Example 07-v0.2-lora-adapter.yaml:
- Header retitled to "Adapter kind (LoRA subtype)"
- kind: LoraAdapter -> kind: Adapter + adapterType: lora
- Filename kept (file demonstrates the LoRA subtype specifically)

Example 09-bytes-opaque.yaml:
- Header comment LoraAdapter -> Adapter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the slash-with-spaces convention used elsewhere in the doc
(AWS EFS / FSx, LMCache / Mooncake / NIXL, FUSE / S3 CSI).

Final coherence pass result: no stale LoraAdapter refs anywhere, only
deliberate LoraCache reference (in Alternatives considered, rejected
pattern), clean heading hierarchy across 16 sections, all 11 examples
internally consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dennis Ramdass added 2 commits May 14, 2026 11:32
Adds explicit discriminator-axis notes to the research-driven
categorizations so it's clear what each taxonomy partitions on, and
fixes one real ME issue:

- Three packaging patterns: state the axis (fetch responsibility);
  call out that hybrid factorings are linear combinations, not a
  fourth pattern. Includes a short inline definition of MECE on
  first use.
- NIM modes 2a/2b/2c: state the axis (where weights live × who put
  them there).
- Artifact kinds: clarify that Weights/Tokenizer/Bytes are validation
  and wiring discriminators, not strict content partitions — the same
  bytes can fit multiple kinds.
- Sources: state the v0.1 axis (fetch protocol); restructure v0.2
  into two layers — direct fetch sources (gcs/azure/pvc-clone) stay
  under spec.artifact.source; registry resolvers (mlflow / kubeflow /
  wandb / nimCatalog) move under spec.artifact.resolvedVia so the
  API stays MECE across abstraction levels.
- Storage backends: state the axis (who owns the substrate).
- Replication: state that v0.1+v0.2 modes are MECE on granularity but
  not selectivity; flag KOfN / SingleCluster / weighted as future
  work rather than implying the current set is exhaustive.
Previous pass slathered "MECE" across every taxonomy in the doc and
claimed it was the analysis method during research. That overclaimed
— most of the categorizations are post-hoc synthesis (vendor surveys,
comparative scans, deep dives) rather than MECE-driven analysis up
front. Only the NIM 2a/2b/2c sub-modes were genuinely MECE in method.

This pass keeps the substance — naming the discriminator axis for
each taxonomy so the categories don't drift across axes — but drops
the jargon. Three packaging patterns, sources, backends, replication
modes: each now states its partition axis plainly. The v0.2 source
restructure (resolvedVia for registry resolvers) stays as a real API
change; the rationale is just "different abstraction levels" rather
than "stay MECE."
dennis-upbound pushed a commit that referenced this pull request May 14, 2026
Mirror the doc trim on PR #76: the kind / source descriptions just
name the partition axis ("fetch protocol", "wiring discriminator not
content partition") rather than declaring the field MECE. Substance
is the same; phrasing matches the design doc.
Dennis Ramdass and others added 7 commits May 14, 2026 11:47
Three substantive precision wins from properly partitioning the
research:

- Pattern 1 framing: engines natively address {local path, HF repo}
  only; HTTP / cloud / custom URIs are plugin territory. Tighter
  scope for "engine fetches" than the original "via --model=<repo>".
- Job-vs-init-container reframed around execution location: init
  container shares pod limits (broken at scale), external Job has
  own limits (v0.1), DaemonSet pre-stages per-node (Run:ai), CSI/FUSE
  streams from object (v0.2). KServe init OOM is a category problem,
  not a tuning problem. The same axis covers prior art and our own
  v0.1 → v0.2 → v0.3 trajectory.
- RWX CSI flat vendor list replaced with four categories that have
  materially different cold-start perf/cost: NFS, parallel FS,
  object-backed FUSE, replicated block. Surfaces a choice customers
  should actually be making rather than implying all CSIs are
  equivalent.

NIM modes were already MECE in research; no changes there.
Three substantive changes plus two tables that earn their keep:

1. Scope-boundary section replaces the bulleted in/out lists with a
   four-row layer table (image / weights / aux / compiled engine).
   Names where Truss / Bento / Cog / KitOps / NIM all sit (Layer 1,
   out of scope) and where ModelCache plugs in (Layers 2-4).
   Modelplane stays packaging-format-agnostic.

2. Storage-backends section adds a provider column and reframes the
   v0.2 backends per Bassam's note: `Custom` is the OSS extension
   point (webhook contract for any third-party CAS / streaming /
   BDN-style substrate); `ContentAddressed` is commercial, hosted by
   Upbound as managed weight delivery. Same API shape; substrate
   provider differs. Design-principle paragraph and v0.2 storage-
   backend intro updated to match.

3. `kind: Engine` (v0.2) extended to explicitly name TRT-LLM
   `.engine` blobs, vLLM CUDA-graph caches, NIM Mode 2b profile
   cache dirs, and KitOps ModelKit (via source.oci). Makes Layer 4
   coverage concrete instead of TRT-LLM-only.

`kind: Adapter` already covers multi-LoRA / fine-tune-output series
(no edit needed). The Tokenizer/Bytes/Weights overlap note from the
prior MECE pass still holds.
Three v0.2 preview examples (06 content-addressed, 07 lora-adapter,
08 compiled-engine) all use `backend: ContentAddressed` without
acknowledging the OSS/commercial split. Per the design's storage-
backends section: `ContentAddressed` is hosted commercially by
Upbound; `Custom` is the OSS hook for BYO substrates. Each example
now points at the alternative so readers see both paths.

Also #8: note that the Engine kind covers NIM Mode 2b profile cache
dirs and KitOps ModelKit bundles, not just TRT-LLM blobs.
Three v0.2 content-addressed examples (06/07/08) move into
examples/content-addressed/, renumbered 01/02/03 within the subdir.
A README in the subdir explains the OSS-vs-commercial provider split
once instead of repeating it as a comment block on each example.
Stale "v0.2 PREVIEW" headers and "Provider note:" blocks dropped —
the directory placement tells the story.

Top-level renumbers to close the gap: 09→06, 10→07, 11→08. Examples
section of the design doc rewritten to reflect the split (v0.1 OSS
list + content-addressed subdir pointer).
Customer's ModelCache RWX needs map to cloud-specific CSI driver
installs via InferenceCluster.spec.storage.csiDrivers — a list of
semantic capabilities (SharedFilesystem / ObjectStorageMount /
BlockDevice). The composition function maps these to cloud-native
CSI addons per source: GKE Filestore CSI for SharedFilesystem,
GCS-FUSE for ObjectStorageMount, etc. EKS / AKS branches use their
own mapping. BYO (source: Existing) clusters get the field as a
descriptive declaration only — Modelplane never installs CSI
drivers on customer-managed clusters.

Adds a new entry to the Key Decisions list and extends the v0.1
mechanism description with the capability flag.
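
A rough sketch of the capability-to-driver mapping this commit describes. Only the two GKE mappings named above are filled in. The `"GKE"` source value and the function name are assumptions, and unmapped capabilities raise rather than guess, since the EKS / AKS tables and the `BlockDevice` mapping are not spelled out here.

```python
# Mappings named in the commit message; anything else is deliberately left out.
GKE_ADDONS = {
    "SharedFilesystem": "GKE Filestore CSI",
    "ObjectStorageMount": "GCS-FUSE",
}


def drivers_to_install(cluster_source: str, capabilities: list) -> list:
    """Map requested semantic capabilities to CSI addons for one cluster."""
    if cluster_source == "Existing":
        # BYO clusters: the field is descriptive only, nothing is installed.
        return []
    if cluster_source != "GKE":  # "GKE" as a source value is an assumption
        raise NotImplementedError(f"no mapping sketched for source {cluster_source!r}")
    installs = []
    for capability in capabilities:
        if capability not in GKE_ADDONS:
            raise ValueError(f"no addon mapping given for {capability!r}")
        installs.append(GKE_ADDONS[capability])
    return installs


print(drivers_to_install("GKE", ["SharedFilesystem"]))        # ['GKE Filestore CSI']
print(drivers_to_install("Existing", ["SharedFilesystem"]))   # []
```

Only the capabilities a cluster lists are mapped, so nothing is installed that was not requested.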
The point is that customers don't get drivers they didn't ask for;
the MECE-on-capability wording was the kind of jargon overuse from
earlier in the doc work.
Lead the v0.1 section with what it actually unblocks (LWS gang
needs the same weight bytes on every pod) rather than diving
straight into use cases. Makes explicit that single-node scale-up
benefits are a side effect and that the harder content-addressed
wins stay in v0.2 behind the same user-facing API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dennis-upbound

Collaborator Author

Closing this in favor of a focused 1-pager on PR #78 (design/modelcache/README.md) that documents what v0.1 actually ships — the verbose design doc here was speculative across v0.1/v0.2/v0.3 and got stale as we built. The branch stays for history but doesn't merge.

Key v0.2+ ideas from this doc (commercial content-addressed substrate, OCI overlap, Dragonfly + Xet layering) are being captured separately as commercial research, not in the OSS repo.

dennis-upbound pushed a commit that referenced this pull request May 16, 2026
Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and
replacing with a focused page that documents what shipped in this
PR: shape, what gets composed, multi-node Ray bootstrap, scope
boundaries. Demo proof at examples/qwen-cached-demo/.

v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster
dedup) are explicitly out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dennis-upbound dennis-upbound deleted the dennis/modelcache-design branch May 16, 2026 00:22
dennis-upbound pushed a commit that referenced this pull request May 20, 2026
Composes a PVC + a one-shot hydration Job per matched InferenceCluster.
v0.1 scope: Weights kind, PVC backend, HuggingFace + S3 sources,
replication = AllMatchingClusters. ContentAddressed / Custom backends,
Tokenizer / Bytes / Adapter / Engine kinds, BYO ExistingPVC, and
per-cluster selector overrides are deferred.

Out of scope here: ModelDeployment integration. The mount-injection
that attaches a cache's PVC to a model serving pod lives in
compose-model-replica and is deferred until the new ModelDeployment
shape (PR #75) stabilizes.

Adds:
- apis/modelcaches/{definition,composition}.yaml
- functions/compose-model-cache/main.py
- examples/cache/model-cache-basic.yaml

Design: #76.
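
As an illustration of "a PVC + a one-shot hydration Job per matched InferenceCluster", the sketch below fans a cache out over clusters selected by `matchLabels`. It is not the code in `functions/compose-model-cache/main.py`. The resource naming scheme and the shape of the cluster records are hypothetical.

```python
def matches(labels: dict, match_labels: dict) -> bool:
    """matchLabels semantics: every requested key must be present with an equal value."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def composed_resources(cache_name: str, match_labels: dict, clusters: list) -> list:
    """One PVC name and one hydration Job name per matched cluster (AllMatchingClusters)."""
    composed = []
    for cluster in clusters:
        if matches(cluster.get("labels", {}), match_labels):
            composed.append({
                "cluster": cluster["name"],
                "pvc": f"{cache_name}-{cluster['name']}-pvc",      # hypothetical name
                "job": f"{cache_name}-{cluster['name']}-hydrate",  # hypothetical name
            })
    return composed


clusters = [
    {"name": "gke-a", "labels": {"region": "us", "gpu": "h100"}},
    {"name": "gke-b", "labels": {"region": "eu", "gpu": "h100"}},
]
for item in composed_resources("qwen", {"region": "us"}, clusters):
    print(item["cluster"], item["pvc"], item["job"])
# gke-a qwen-gke-a-pvc qwen-gke-a-hydrate
```

An empty selector matches every cluster, which mirrors how Kubernetes label selectors behave.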
dennis-upbound pushed a commit that referenced this pull request May 20, 2026
Mirror the doc trim on PR #76: the kind / source descriptions just
name the partition axis ("fetch protocol", "wiring discriminator not
content partition") rather than declaring the field MECE. Substance
is the same; phrasing matches the design doc.
dennis-upbound pushed a commit that referenced this pull request May 20, 2026
Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and
replacing with a focused page that documents what shipped in this
PR: shape, what gets composed, multi-node Ray bootstrap, scope
boundaries. Demo proof at examples/qwen-cached-demo/.

v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster
dedup) are explicitly out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>