WIP: ModelCache design doc + examples #76
Closed
dennis-upbound wants to merge 28 commits into
1-pager covering v0.1 PVC backend (multi-node RWX + Weights/Tokenizer/Bytes), v0.2 content-addressed backend with lazy loading (adds LoraAdapter + Engine kinds), and v0.3 substrate unification option across ModelCache, KVOffloadTier, and HotPrefixPool. Eight example yamls cover single-cluster basic, multi-node TensorPipeline gang, multi-cluster replication, separate tokenizer, private S3 source, plus three v0.2 previews. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s example
- Hyperlink all #NN issue/PR references throughout the doc
- New "Design principle: pluggable backends" section explicitly framing the pattern shared with #72 KVOffloadTier and #73 HotPrefixPool
- Strengthen v0.3 substrate-unification tie-in with concrete artifact lifecycle split (immutable static / mutable runtime / immutable precomputed runtime)
- Add four Mermaid diagrams: v0.1 PVC flow, multi-node LWS gang shared PVC, v0.2 content-addressed lazy hydration, v0.3 substrate unification
- Add LoraAdapter baseRef field to the spec shape
- Add example 09-bytes-opaque.yaml covering the Bytes kind escape hatch
- Cold-start time estimates in every example header (rough order-of-magnitude based on ~50 MB/s HF pull, ~1 GB/s intra-region S3)
- Fix eu-west vs eu-west-1 region naming inconsistency in example 05
- Coherency: normalize replication: AllMatchingClusters across all examples; add caches:[{name}] field to ModelDeployment references
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
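The rough throughputs quoted in this commit (~50 MB/s for an HF pull, ~1 GB/s for intra-region S3) can be turned into a quick back-of-envelope check. The sketch below is illustrative only: the 140 GB artifact size and the function name are assumptions, not figures from the design doc, and decimal units are used throughout.

```python
# Back-of-envelope hydration time from the rough throughputs in the commit message.
HF_PULL_MB_PER_S = 50            # ~50 MB/s Hugging Face pull (from the commit message)
S3_INTRA_REGION_MB_PER_S = 1000  # ~1 GB/s intra-region S3 (from the commit message)


def hydration_seconds(size_gb: float, throughput_mb_per_s: float) -> float:
    """Seconds needed to move size_gb at the given throughput (1 GB = 1000 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000 / throughput_mb_per_s


size_gb = 140  # hypothetical artifact size, chosen only for illustration
print(f"HF pull: {hydration_seconds(size_gb, HF_PULL_MB_PER_S) / 60:.1f} min")
print(f"S3 pull: {hydration_seconds(size_gb, S3_INTRA_REGION_MB_PER_S) / 60:.1f} min")
```

These are order-of-magnitude numbers, in the same spirit as the example headers; real pulls depend on network, storage class, and parallelism.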
design.md rewrite folds in the scope-boundary framing from #66 (what ModelCache covers vs what stays on the engine block — weights, engines, tokenizers, chat templates, configs in scope; container images, env vars, shmSize out of scope), the full source taxonomy (huggingFace / s3 / http / inline / configMap in v0.1; gcs / azure / oci / pvc-clone in v0.2), the KServe storage-initializer OOM motivation for Job-based prefetch, fail-fast scheduling on missing RWX storage class, and the rationale for AllMatchingClusters as the v0.1 replication default vs AllMatchingNodes. Adds explicit decisions: mount path is intrinsic to the cache (no per-reference override), one artifact per cache, storage class cluster default with per-cache override. Coherency: every ModelCache example now has explicit replication and clusterSelector. mount path lives on the cache spec, not in the ModelDeployment caches[] reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dennis confirmed DRA is on the v0.1 plan (just not yet in Nic's spike). Updates: - Add #56 DRA alignment to Related, marked as also-v0.1 - v0.1 mechanism: note clusterSelector can accept CEL over InferenceCluster pool attributes once #56 lands (alongside the matchLabels baseline) - Alternatives considered: clarify nodeSelector.cel is v0.2 not because DRA is deferred but because it only fits AllMatchingNodes mode (per-node SSD via content-addressed backend), which is itself v0.2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
design.md:
- Cut vibe statements ("This is fleet-level territory", "ModelCache is
the first instance; the pattern generalizes")
- Drop redundant restatements in mount-path, scope-boundary, v0.1, and
v0.3 sections
- Replace "Why now:" subsection with a single line in v0.2
- Cut "Better for positioning if Modelplane is 'the AI content CAS
company'" marketing line from ContentCache alternative
- Tighten "Same pattern as #72 and #73 — the family's unifying
architectural principle" to just "Same pattern as #72 and #73"
- Drop "Also doesn't generalize to compiled engines or non-weight
artifacts" duplicate from engine-native alternative
Diagrams:
- Drop time-estimate labels from arrows (already in prose)
- Remove off-topic "InfiniBand / NIXL KV transfer" arrow from multi-node
diagram (serving detail, not ModelCache)
- Drop "one set of bytes per artifact" subtitle from v0.2 content store
node and tighten arrow labels
- Drop "composes" labels on dotted ModelCache -> Job/PVC edges; arrow
style alone conveys it
Example headers — trim redundant "without ModelCache... With ModelCache..."
restatements in speedup paragraphs across 01, 02, 03, 05, 06, 07, 08.
Same numbers, half the words.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.1 ModelCache section now covers: - Invalidation: source-version-as-cache-identity (immutable cache pattern), manual re-fetch annotation for source-side fixes - Status: Ready/Populated/Failed per cluster, status fields, emission into #74 signal bus v0.3 substrate unification expanded with three subsections: - How the three primitives relate as a staged cold-start pipeline (weights-loading, first-request prefill, runtime KV pressure) - Unified invalidation: master key is (modelDigest, tokenizerDigest); per-primitive eviction policies on top - Unified observability table mapping signal types to each primitive, composite "cache effectiveness" view on ModelService.status Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
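One way to picture source-version-as-cache-identity: the cache's identity is derived from the source plus its pinned version, so a new revision yields a new identity instead of mutating an existing cache. This is a minimal sketch under assumptions; the field names (`huggingFace`, `repo`, `revision`), the revision values, and the hashing scheme are hypothetical and not taken from the design doc.

```python
import hashlib
import json


def cache_identity(source: dict) -> str:
    """Derive a stable identity from a source description and its pinned version.

    The source is serialized canonically (sorted keys) so that field order
    does not change the result.
    """
    canonical = json.dumps(source, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical source shapes; only the revision differs between the two.
rev_a = {"huggingFace": {"repo": "meta-llama/Llama-3.3-70B-Instruct", "revision": "rev-a"}}
rev_b = {"huggingFace": {"repo": "meta-llama/Llama-3.3-70B-Instruct", "revision": "rev-b"}}

# A new revision is a new cache identity, not an in-place update.
print(cache_identity(rev_a) != cache_identity(rev_b))
```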
- Add Mermaid diagram for the staged cold-start pipeline (3-phase boxes: cluster boot / first request / runtime) — visualizes how ModelCache, HotPrefixPool, and KVOffloadTier compose into a coherent cold-start story. Previously this was prose-only. - Drop redundant decision #8 ("Substrate unification deferred to v0.3"), already implicit from v0.3 section header. - Tighten Problem section closer from two disjoint sentences to one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t resolution
Folds in the prior-art research findings on customer-provided infrastructure and delegated GC.
Shape:
- New "Storage backends" subsection (table) listing PVC / ExistingPVC v0.1 and ContentAddressed / Custom v0.2
- New "BYO scenarios" subsection covering five axes: BYO source, BYO storage class, BYO pre-populated PVC, BYO P2P fan-out (Spegel/Dragonfly), BYO cluster. Explicit on what Modelplane retains regardless of BYO (artifact identity, scheduler gating, refcounting, observability, invalidation policy).
- Promote `oci` source from v0.2 to v0.1: OCI + KitOps + Harbor is the converging air-gap pattern
- Add shim sources (`mlflow:` / `kubeflow-modelregistry:` / `wandb:`) to v0.2
v0.1 Invalidation:
- Rename to "Invalidation and GC"
- Add tag → digest resolution at hydration time, with status.resolvedDigest
- Clarify refcount is visible but Modelplane doesn't auto-GC: operator retires explicitly, PVCs reclaim per K8s reclaimPolicy
v0.2 GC:
- Brief subsection: delegate to object-store lifecycle policies + touch on access. Explicit refcounting only if a future case forces it.
- Update market-signal closer to include Baseten BDN, Run:ai, KitOps, Dragonfly+OCI as converging signals.
Examples:
- Add 10-byo-existing-pvc.yaml showing the ExistingPVC backend (customer manages population, Modelplane mounts and orchestrates)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
design.md: - Shape: clarify artifact.kind enum spans v0.1 vs v0.2; remove stray baseRef from Weights example (baseRef only applies to LoraAdapter, examples 07/08 show it properly) - Sources: tighten v0.2 shim sources sentence (no fake URI-scheme colons) - BYO: note ExistingPVC ignores replication - v0.1 mechanism: backtick ReadWriteMany; backtick InferenceCluster - Invalidation: fix "status annotation" → metadata annotation; sources identity covers HF revision / S3 version path / OCI digest - Status: add resolvedDigest and references to listed fields; drop sourceETag (subsumed by resolvedDigest) - v0.2 closer: tighten "PVC ships v0.1 fast" wording; use backticked backend names (PVC / ContentAddressed) - v0.3 ModelCache eviction: align with v0.2 (TTL+touch by default, not refcount-by-default) - Key decision 4: add ExistingPVC to backend list; backtick all four - Open questions: backtick artifact-kind/source/backend names consistently; add `oci` to question 2 - Roadmap: use API names (LoraAdapter / Engine) in v0.2 issue title - Examples list: backtick the cross-ref cache name examples/10-byo-existing-pvc.yaml: - Note that replication is ignored for ExistingPVC Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fold in Bassam's "three packaging patterns" framing from PR #64 review comment 4414021192 (engine-fetches-weights / engine-image-bakes-weights / runtime-and-artifacts-separate). Grounds ModelCache as the Pattern 3 primitive that also accelerates Pattern 1; clarifies why Pattern 2 (NIM) doesn't need it. Add Locality routing subsection in v0.3 substrate unification connecting the three primitives to #71 ModelService routing affinity. Cold-start pipeline covers what new replicas need; locality routing covers where existing requests go. ModelCache feeds both — status.clusters[] is the eligibility signal for fleet routing. v0.1 mechanism: tighten the scheduling-gating bullet to explicitly call out status.clusters[] as the eligibility signal the fleet matcher reads, not just an implicit scheduler hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
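The eligibility signal described here can be sketched as a simple filter over `status.clusters[]`. The per-cluster states Ready / Populated / Failed come from the earlier status commit; the entry field names (`name`, `phase`) and the rule that only `Ready` clusters are eligible are assumptions made for illustration.

```python
from typing import List


def eligible_clusters(cache_status: dict) -> List[str]:
    """Return the names of clusters a fleet matcher could schedule onto.

    Assumes each status.clusters[] entry looks like {"name": ..., "phase": ...}
    and that only phase == "Ready" counts as eligible.
    """
    return sorted(
        entry["name"]
        for entry in cache_status.get("clusters", [])
        if entry.get("phase") == "Ready"
    )


# Hypothetical status block for one ModelCache.
status = {
    "clusters": [
        {"name": "us-east", "phase": "Ready"},
        {"name": "eu-west-1", "phase": "Failed"},
        {"name": "us-west", "phase": "Ready"},
    ]
}
print(eligible_clusters(status))
```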
Status header now explicitly notes ModelCache advances from v0.2 (per Bassam's PR #64 review framing) to v0.1, driven by multi-node serving requirements (#61 closure) and DRA landing in v0.1 (#56). Flagged for team alignment since this is a deliberate timeline shift from the earlier framing. Roadmap #66 line: tighten the awkward kind/source bundling ("Weights/Tokenizer/Bytes/inline/configMap" mixed kinds with sources) into separate categories with backtick-normalized backend / kind names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc: - Line 38: backtick PVC and ContentAddressed in design-principle paragraph (was unbacked when describing the backend, inconsistent with other references in the doc) - Line 167: "content-addressed backend" → "ContentAddressed backend" with backticks (named backend, not concept) - Line 366: normalize slash spacing in roadmap line (Weights / Tokenizer / Bytes with spaces, matching PVC / ExistingPVC style) - Line 383: backtick ContentAddressed in examples list entry for 06 Examples: - example 09 (bytes-opaque): second ModelCache was missing clusterSelector and replication. Added both for consistency with the first cache and with every other example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es, example 11 Three packaging patterns: split Pattern 2 (NIM) into three sub-modes — 2a (weights baked in image, no ModelCache needed), 2b (NIM image fetches into /opt/nim/.cache, ModelCache pre-seeds), 2c (air-gap via ExistingPVC). Previous text glossed over 2b and 2c as if they didn't exist. Sources table: clarify http source covers NIM/NGC URLs (pre-seeding /opt/nim/.cache), not just generic HTTP. v0.2 Engine kind: extend to NIM profiles. engine.runtime: NIM + profileId makes the (GPU SM, count, precision, TP, PP, target) tuple explicit so Modelplane validates against cluster hardware before staging — avoids the silent wrong-profile failure (e.g. H100 profile on B200). Scope boundary: clarify that "container images out of scope" applies to Mode 2a baked-weight images; Modes 2b/2c stage the NIM cache dir via ModelCache. New example 11-nim-cache.yaml: complete Mode 2b reference. NIM image, NGC creds Secret reused as both imagePullSecret and NGC_API_KEY env, Bytes kind staging /opt/nim/.cache from NGC, Hopper-only clusterSelector matching the profile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
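The pre-staging check against cluster hardware might look roughly like the following. The tuple fields mirror the (GPU SM, count, precision, TP, PP, target) list above, but the class names, the `sm_90` / `sm_100` strings used for H100 / B200, and the matching rules (exact SM match, enough GPUs) are assumptions for illustration, not the design's validation logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NimProfile:
    """Hypothetical explicit form of a NIM profile tuple."""
    gpu_sm: str
    gpu_count: int
    precision: str
    tp: int
    pp: int
    target: str


@dataclass(frozen=True)
class ClusterHardware:
    """Hypothetical summary of what a cluster offers."""
    gpu_sm: str
    gpu_count: int


def check_profile(profile: NimProfile, cluster: ClusterHardware) -> List[str]:
    """Return a list of problems; an empty list means the profile may be staged."""
    problems = []
    if profile.gpu_sm != cluster.gpu_sm:
        problems.append(f"profile built for {profile.gpu_sm}, cluster has {cluster.gpu_sm}")
    if profile.gpu_count > cluster.gpu_count:
        problems.append(f"profile needs {profile.gpu_count} GPUs, cluster has {cluster.gpu_count}")
    return problems


h100_profile = NimProfile("sm_90", 8, "fp8", tp=8, pp=1, target="throughput")
print(check_profile(h100_profile, ClusterHardware("sm_100", 8)))  # H100 profile on B200
print(check_profile(h100_profile, ClusterHardware("sm_90", 8)))
```

Failing before staging is the point: the mismatch surfaces as a validation message rather than as a silent wrong-profile failure at serve time.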
…log source
Three additions:
1. After Mode 2a/b/c list: note that NVIDIA's NIM Operator (NIMService /
NIMCache CRDs) composes with ModelCache rather than competing —
ModelCache stages at the K8s storage layer; NIM Operator consumes
the mounted path. Customers already on NIM Operator can keep using it.
2. v0.2 LoraAdapter kind: baseRef can point at either a Weights cache or
a NIM profile (one-liner) — supports customer fine-tunes layered on
NIM bases.
3. v0.2 Engine kind: expand NIM coverage to include status.nimProfile
(so deployments can verify compatibility without dereferencing the
image) and a new nimCatalog shim source ({model, profile} resolves
to the profile-specific NGC URL + cache layout, survives NGC URL
schema changes).
None of these are blocking for v0.1; example 11 still covers v0.1 NIM
Mode 2b in full.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion
Header:
- Split the overstuffed Status line into Status + Timeline shift fields. Previous version crammed three sentences into one bold-prefix line; Timeline shift is a separate distinct concern worth its own field.
Diagrams:
- v0.1 PVC backend diagram: flip JOB -->|pull once| HF to HF -->|fetch| JOB. Previous direction read as request-flow while the adjacent JOB -->|write| PVC arrow was data-flow, which was inconsistent within one diagram.
- Multi-node serving diagram: same fix, HF -->|fetch| JOB.
- Now both diagrams are consistent data-flow left-to-right: source -> fetch -> Job -> write -> PVC -> mount -> Pod.
Language:
- Shape section: --model=hf://repo example was contrived (no engine actually uses an hf:// scheme). Replaced with realistic --model=meta-llama/Llama-3.3-70B-Instruct to match engine arg shape.
- v0.1 lead sentence: "Targets dense models..." was a fragment. Rewrite as "Use cases: ..." for cleaner reading.
- v0.1 mechanism bullet: split the two-sentence "scheduling gated + status.clusters[] is the eligibility signal" bullet into two scannable bullets.
- v0.2 storage backend: tighten "Cross-tenant dedup for public artifacts (opt-in for non-public)" cryptic parenthetical into a clearer sentence.
- AWS EFS/FSx -> AWS EFS / FSx for slash-spacing consistency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per "we have time to think for v0.2 etc unless it affects v0.1 API shape": - Drop "v0.3 substrate unification — file as roadmap marker now or wait until v0.2 ships?" — pure project-management timing question, doesn't affect v0.1 API. - Reword question 4 from "Migration from PVC to ContentAddressed" to "storage.backend mutability" — same content but explicit that the decision affects v0.1 API (whether the field can be flipped on an existing CR). - Add lead sentence noting these are v0.1-scoped questions. - Bold the question titles for scannability. The remaining five questions all affect v0.1 API shape directly: artifact kinds, sources, eviction, backend mutability, cross-namespace refs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One sentence in the Locality routing subsection: the same per-cluster ready state, hydration latency, and hit rates the cache family emits into #74 also feed a future intent-based serving layer (e.g. ttft.p99 SLA fields on ModelService). ModelCache is the supply-side input; the SLA primitive is a separate design. Connects ModelCache to the intent-based scheduling conversation without expanding scope. Intent primitive itself stays out of this doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c); add Coverage note
Adapter is the generalized v0.2 kind for auxiliary weights bound to a base. adapterType discriminator covers lora / controlnet / ipadapter / textualInversion / t2iAdapter. Bakes in diffusion / vision LoRA / NIM-base fine-tune cases without committing to per-subtype primitives.
Changes across design.md:
- Shape YAML kind comment: v0.2 list now Adapter | Engine
- Artifact kind discriminator paragraph: validation requirements include adapterType
- New "Coverage" note: ModelCache is format- and modality-agnostic. Same primitive serves LLM weights (safetensors / GGUF / ONNX), embedding models, multimodal VLMs, ASR/TTS, voice libraries, compiled engines, and arbitrary byte trees via Bytes. Format awareness lives in the engine, not the cache.
- Out-of-scope-for-v0.1 bullet renamed
- v0.2 New artifact kinds bullet expanded to describe Adapter
- v0.3 substrate diagram + bullet: "LoRAs" -> "adapters"
- Open question 1: "LoraAdapter" -> "Adapter (with adapterType: lora)"
- Roadmap v0.2 issue title: "Adapter and Engine kinds"
Example 07-v0.2-lora-adapter.yaml:
- Header retitled to "Adapter kind (LoRA subtype)"
- kind: LoraAdapter -> kind: Adapter + adapterType: lora
- Filename kept (file demonstrates the LoRA subtype specifically)
Example 09-bytes-opaque.yaml:
- Header comment LoraAdapter -> Adapter
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
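The kind / adapterType discriminator could be validated along these lines. The kind names and the adapterType values are taken from the commit messages; the dictionary shape of the artifact, the function name, and the rule that `Adapter` requires a `baseRef` while other kinds reject it are assumptions for illustration.

```python
from typing import List

V01_KINDS = {"Weights", "Tokenizer", "Bytes"}
V02_KINDS = {"Adapter", "Engine"}
ADAPTER_TYPES = {"lora", "controlnet", "ipadapter", "textualInversion", "t2iAdapter"}


def validate_artifact(artifact: dict) -> List[str]:
    """Return validation errors for a hypothetical spec.artifact block."""
    kind = artifact.get("kind")
    if kind not in V01_KINDS | V02_KINDS:
        return [f"unknown kind: {kind!r}"]
    errors = []
    if kind == "Adapter":
        if artifact.get("adapterType") not in ADAPTER_TYPES:
            errors.append("Adapter requires a known adapterType")
        if not artifact.get("baseRef"):
            errors.append("Adapter requires baseRef")  # assumed rule
    elif "baseRef" in artifact:
        errors.append(f"baseRef does not apply to kind {kind}")
    return errors


print(validate_artifact({"kind": "Adapter", "adapterType": "lora", "baseRef": "base-weights"}))
print(validate_artifact({"kind": "Weights", "baseRef": "base-weights"}))
```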
Match the slash-with-spaces convention used elsewhere in the doc (AWS EFS / FSx, LMCache / Mooncake / NIXL, FUSE / S3 CSI). Final coherence pass result: no stale LoraAdapter refs anywhere, only deliberate LoraCache reference (in Alternatives considered, rejected pattern), clean heading hierarchy across 16 sections, all 11 examples internally consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
added 2 commits
May 14, 2026 11:32
Adds explicit discriminator-axis notes to the research-driven categorizations so it's clear what each taxonomy partitions on, and fixes one real ME issue: - Three packaging patterns: state the axis (fetch responsibility); call out that hybrid factorings are linear combinations, not a fourth pattern. Includes a short inline definition of MECE on first use. - NIM modes 2a/2b/2c: state the axis (where weights live × who put them there). - Artifact kinds: clarify that Weights/Tokenizer/Bytes are validation and wiring discriminators, not strict content partitions — the same bytes can fit multiple kinds. - Sources: state the v0.1 axis (fetch protocol); restructure v0.2 into two layers — direct fetch sources (gcs/azure/pvc-clone) stay under spec.artifact.source; registry resolvers (mlflow / kubeflow / wandb / nimCatalog) move under spec.artifact.resolvedVia so the API stays MECE across abstraction levels. - Storage backends: state the axis (who owns the substrate). - Replication: state that v0.1+v0.2 modes are MECE on granularity but not selectivity; flag KOfN / SingleCluster / weighted as future work rather than implying the current set is exhaustive.
Previous pass slathered "MECE" across every taxonomy in the doc and claimed it was the analysis method during research. That overclaimed — most of the categorizations are post-hoc synthesis (vendor surveys, comparative scans, deep dives) rather than MECE-driven analysis up front. Only the NIM 2a/2b/2c sub-modes were genuinely MECE in method. This pass keeps the substance — naming the discriminator axis for each taxonomy so the categories don't drift across axes — but drops the jargon. Three packaging patterns, sources, backends, replication modes: each now states its partition axis plainly. The v0.2 source restructure (resolvedVia for registry resolvers) stays as a real API change; the rationale is just "different abstraction levels" rather than "stay MECE."
dennis-upbound
pushed a commit
that referenced
this pull request
May 14, 2026
Mirror the doc trim on PR #76: the kind / source descriptions just name the partition axis ("fetch protocol", "wiring discriminator not content partition") rather than declaring the field MECE. Substance is the same; phrasing matches the design doc.
Three substantive precision wins from properly partitioning the
research:
- Pattern 1 framing: engines natively address {local path, HF repo}
only; HTTP / cloud / custom URIs are plugin territory. Tighter
scope for "engine fetches" than the original "via --model=<repo>".
- Job-vs-init-container reframed around execution location: init
container shares pod limits (broken at scale), external Job has
own limits (v0.1), DaemonSet pre-stages per-node (Run:ai), CSI/FUSE
streams from object (v0.2). KServe init OOM is a category problem,
not a tuning problem. The same axis covers prior art and our own
v0.1 → v0.2 → v0.3 trajectory.
- RWX CSI flat vendor list replaced with four categories that have
materially different cold-start perf/cost: NFS, parallel FS,
object-backed FUSE, replicated block. Surfaces a choice customers
should actually be making rather than implying all CSIs are
equivalent.
NIM modes were already MECE in research; no changes there.
Three substantive changes plus two tables that earn their keep: 1. Scope-boundary section replaces the bulleted in/out lists with a four-row layer table (image / weights / aux / compiled engine). Names where Truss / Bento / Cog / KitOps / NIM all sit (Layer 1, out of scope) and where ModelCache plugs in (Layers 2-4). Modelplane stays packaging-format-agnostic. 2. Storage-backends section adds a provider column and reframes the v0.2 backends per Bassam's note: `Custom` is the OSS extension point (webhook contract for any third-party CAS / streaming / BDN-style substrate); `ContentAddressed` is commercial, hosted by Upbound as managed weight delivery. Same API shape; substrate provider differs. Design-principle paragraph and v0.2 storage- backend intro updated to match. 3. `kind: Engine` (v0.2) extended to explicitly name TRT-LLM `.engine` blobs, vLLM CUDA-graph caches, NIM Mode 2b profile cache dirs, and KitOps ModelKit (via source.oci). Makes Layer 4 coverage concrete instead of TRT-LLM-only. `kind: Adapter` already covers multi-LoRA / fine-tune-output series (no edit needed). The Tokenizer/Bytes/Weights overlap note from the prior MECE pass still holds.
Three v0.2 preview examples (06 content-addressed, 07 lora-adapter, 08 compiled-engine) all use `backend: ContentAddressed` without acknowledging the OSS/commercial split. Per the design's storage- backends section: `ContentAddressed` is hosted commercially by Upbound; `Custom` is the OSS hook for BYO substrates. Each example now points at the alternative so readers see both paths. Also #8: note that the Engine kind covers NIM Mode 2b profile cache dirs and KitOps ModelKit bundles, not just TRT-LLM blobs.
Three v0.2 content-addressed examples (06/07/08) move into examples/content-addressed/, renumbered 01/02/03 within the subdir. A README in the subdir explains the OSS-vs-commercial provider split once instead of repeating it as a comment block on each example. Stale "v0.2 PREVIEW" headers and "Provider note:" blocks dropped — the directory placement tells the story. Top-level renumbers to close the gap: 09→06, 10→07, 11→08. Examples section of the design doc rewritten to reflect the split (v0.1 OSS list + content-addressed subdir pointer).
Customer's ModelCache RWX needs map to cloud-specific CSI driver installs via InferenceCluster.spec.storage.csiDrivers — a list of semantic capabilities (SharedFilesystem / ObjectStorageMount / BlockDevice). The composition function maps these to cloud-native CSI addons per source: GKE Filestore CSI for SharedFilesystem, GCS-FUSE for ObjectStorageMount, etc. EKS / AKS branches use their own mapping. BYO (source: Existing) clusters get the field as a descriptive declaration only — Modelplane never installs CSI drivers on customer-managed clusters. Adds a new entry to the Key Decisions list and extends the v0.1 mechanism description with the capability flag.
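The capability-to-addon mapping can be sketched as a lookup keyed by cluster source. Only the two GKE mappings named above are filled in; the source key `"GKE"`, the function name, and the choice to raise on anything unmapped are assumptions. The EKS / AKS tables and the `BlockDevice` mapping are deliberately left out because this text does not specify them.

```python
from typing import Dict, List

# Only the mappings stated in the commit message; everything else is unspecified.
ADDONS_BY_SOURCE: Dict[str, Dict[str, str]] = {
    "GKE": {
        "SharedFilesystem": "GKE Filestore CSI",
        "ObjectStorageMount": "GCS-FUSE",
    },
}


def csi_addons_to_install(cluster_source: str, capabilities: List[str]) -> List[str]:
    """Map declared semantic capabilities to CSI addons for one cluster.

    BYO clusters (source "Existing") are descriptive only, so nothing is installed.
    """
    if cluster_source == "Existing":
        return []
    table = ADDONS_BY_SOURCE.get(cluster_source)
    if table is None:
        raise NotImplementedError(f"no mapping sketched for source {cluster_source!r}")
    addons = []
    for capability in capabilities:
        if capability not in table:
            raise NotImplementedError(f"no mapping sketched for {capability!r}")
        addons.append(table[capability])
    return addons


print(csi_addons_to_install("GKE", ["SharedFilesystem", "ObjectStorageMount"]))
print(csi_addons_to_install("Existing", ["SharedFilesystem"]))
```

Because the input is the list of capabilities a customer declared, the output never contains a driver nobody asked for.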
The point is that customers don't get drivers they didn't ask for; the MECE-on-capability wording was the kind of jargon overuse from earlier in the doc work.
Lead the v0.1 section with what it actually unblocks (LWS gang needs the same weight bytes on every pod) rather than diving straight into use cases. Makes explicit that single-node scale-up benefits are a side effect and that the harder content-addressed wins stay in v0.2 behind the same user-facing API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
Closing this in favor of a focused 1-pager on PR #78. Key v0.2+ ideas from this doc (commercial content-addressed substrate, OCI overlap, Dragonfly + Xet layering) are being captured separately as commercial research, not in the OSS repo.
dennis-upbound
pushed a commit
that referenced
this pull request
May 16, 2026
Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and replacing with a focused page that documents what shipped in this PR: shape, what gets composed, multi-node Ray bootstrap, scope boundaries. Demo proof at examples/qwen-cached-demo/. v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster dedup) are explicitly out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dennis-upbound
pushed a commit
that referenced
this pull request
May 20, 2026
Composes a PVC + a one-shot hydration Job per matched InferenceCluster. v0.1 scope: Weights kind, PVC backend, HuggingFace + S3 sources, replication = AllMatchingClusters. ContentAddressed / Custom backends, Tokenizer / Bytes / Adapter / Engine kinds, BYO ExistingPVC, and per-cluster selector overrides are deferred. Out of scope here: ModelDeployment integration. The mount-injection that attaches a cache's PVC to a model serving pod lives in compose-model-replica and is deferred until the new ModelDeployment shape (PR #75) stabilizes. Adds: - apis/modelcaches/{definition,composition}.yaml - functions/compose-model-cache/main.py - examples/cache/model-cache-basic.yaml Design: #76.
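The "PVC + one-shot hydration Job per matched InferenceCluster" shape can be pictured with plain dictionaries. This is a sketch under assumptions and not the contents of `functions/compose-model-cache/main.py`: the input shapes, resource naming, hydrator image, and container arguments are all hypothetical. Only the Kubernetes manifest fields (`accessModes`, `persistentVolumeClaim.claimName`, `restartPolicy`, and so on) are standard.

```python
from typing import Dict, List


def selector_matches(labels: Dict[str, str], match_labels: Dict[str, str]) -> bool:
    """True when every matchLabels pair is present on the cluster."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def compose_cache(cache: dict, clusters: List[dict]) -> List[dict]:
    """Return one RWX PVC and one hydration Job per matched cluster."""
    resources: List[dict] = []
    for cluster in clusters:
        if not selector_matches(cluster["labels"], cache["clusterSelector"]):
            continue
        name = f"{cache['name']}-{cluster['name']}"
        pvc = {
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "metadata": {"name": name},
            "spec": {
                "accessModes": ["ReadWriteMany"],
                "resources": {"requests": {"storage": cache["size"]}},
            },
        }
        job = {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": f"{name}-hydrate"},
            "spec": {
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [{
                            "name": "hydrate",
                            "image": "example.invalid/hydrator:dev",  # hypothetical image
                            "args": ["--source", cache["source"], "--dest", "/cache"],
                            "volumeMounts": [{"name": "cache", "mountPath": "/cache"}],
                        }],
                        "volumes": [{
                            "name": "cache",
                            "persistentVolumeClaim": {"claimName": name},
                        }],
                    }
                }
            },
        }
        resources += [pvc, job]
    return resources


# Hypothetical inputs: one cache, two clusters, only one of which matches.
demo_cache = {"name": "qwen", "size": "200Gi", "source": "hf-repo-placeholder",
              "clusterSelector": {"tier": "gpu"}}
demo_clusters = [{"name": "gke-a", "labels": {"tier": "gpu"}},
                 {"name": "gke-b", "labels": {"tier": "cpu"}}]
print([(r["kind"], r["metadata"]["name"]) for r in compose_cache(demo_cache, demo_clusters)])
```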
dennis-upbound
pushed a commit
that referenced
this pull request
May 20, 2026
Mirror the doc trim on PR #76: the kind / source descriptions just name the partition axis ("fetch protocol", "wiring discriminator not content partition") rather than declaring the field MECE. Substance is the same; phrasing matches the design doc.
dennis-upbound
pushed a commit
that referenced
this pull request
May 20, 2026
Closing #76 (the speculative v0.1/v0.2/v0.3 design doc) and replacing with a focused page that documents what shipped in this PR: shape, what gets composed, multi-node Ray bootstrap, scope boundaries. Demo proof at examples/qwen-cached-demo/. v0.2+ ideas (content-addressed substrate, lazy load, cross-cluster dedup) are explicitly out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design doc + 11 example yamls for the `ModelCache` primitive. Pure design: no code, no CRDs. Supersedes the sketch in #66.
Why v0.1 PVC-backed (TL;DR)
Multi-node LWS serving needs the same weight bytes mounted by every pod in the gang. The minimum primitive for that is an RWX PVC hydrated once by a side Job (not an init container: KServe's storage-initializer OOMs at 4/8/16 GiB on Kimi K2 / Llama 405B) and mounted read-only by every pod that references it. v0.1 ships exactly that. Single-node scale-up benefits as a side effect (no per-replica HF pull on scale events). The harder content-addressed wins (cross-deployment dedup, lazy load, cross-cluster sharing) stay in v0.2 behind the same user-facing API.
Read
`content-addressed/` (commercial substrate vs OSS `Custom` hook explained in `content-addressed/README.md`)
Alignment asks
`ContentAddressed` = Upbound hosted weight delivery; `Custom` (webhook) = OSS extension point. Customers run fully OSS via `Custom`; commercial product is fleet-scale dedup + Modal-class cold-start hosted by Upbound. Reflected in the Storage-backends table + v0.2 section.
Companion
PR #78 (`dennis/modelcache-impl`) lands the v0.1 OSS scaffolding (XRD, composition function, `ModelDeployment.spec.caches` wiring, four standalone examples, an end-to-end Qwen-on-GKE cold-start demo with idempotent setup / demo / cleanup scripts) against this design.
Related
#66 (implementation tracker, body refactored) · #61 (closed; mechanism absorbed) · #56 (DRA, v0.1) · #72 KVOffloadTier · #71 ModelService routing affinity · PR #64 · PR #75 · PR #78