Drop KServe — dispatch to native + llm-d (Dynamo-ready) by dennis-upbound · Pull Request #99 · modelplaneai/modelplane

dennis-upbound · 2026-06-02T19:11:43Z

Fixes #65.

Removes the KServe dependency and composes serving directly.

Status: rebased onto latest main; native single-pod path validated end-to-end on a real GKE cluster (see below). Reworking the llm-d multi-pod path to be gateway-agnostic / Traefik-compatible — see Routing approach below.

What changed

compose-model-replica is now a dispatcher (backends/{native,llmd,dynamo}.py):
- single self-contained pod → native Kubernetes Deployment + Service + HTTPRoute
- multi-pod (pipeline > 1) → llm-d LeaderWorkerSet + Service + HTTPRoute (see Routing approach)
- Dynamo stub (dormant in v0.1)
- No user-facing backend field; KServe LLMInferenceService emission deleted.
KServeBackend XR → ServingStack: installs the serving substrate (LeaderWorkerSet, Gateway API, cert-manager, Prometheus, Envoy Gateway); drops KServe + KEDA. Creates the modelplane-system namespace on the workload cluster.
In-cluster HTTPRoute strips the /<ns>/<deployment>/ prefix (URLRewrite) so the engine sees /v1/....
compose-model-deployment endpoint path reconciled with the new HTTPRoute convention.
Weight loading: no ModelCache → engine fetches directly; documented in getting-started.

Routing approach — GAIE vs Traefik (how we're doing it)

v0.1 does no inference-aware (KV-/load-aware) endpoint picking, so the GAIE InferencePool v1 + EPP machinery the llm-d path originally emitted isn't needed yet.

Decision: drop InferencePool/EPP and route HTTPRoute → Service (plain Gateway API — Service + EndpointSlice, no Envoy Backend CRD), exactly like the native path. The HTTPRoute attaches to the workload cluster's inference gateway (Envoy inference-gateway, installed by ServingStack); the Service selects the LWS leader pods (only leaders serve the OpenAI API for vLLM). llm-d now routes identically to native on any Gateway API impl, so the GAIE v1 CRD swap and the inference-extension CRD install are gone from this PR.
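
A minimal sketch of a Service that selects only the LWS leader pods, as the decision above describes. The function name `leader_service` and the port 8000 are assumptions for illustration. The selector uses the labels the LeaderWorkerSet controller sets on pods, and the selector this PR actually emits may differ.

```python
def leader_service(name: str, namespace: str, port: int = 8000) -> dict:
    """Build a Service manifest that selects only LWS leader pods.

    Hypothetical helper: the name, signature, and default port are
    assumptions, not the PR's code.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {
                # Applied by the LeaderWorkerSet controller to pods in the set.
                "leaderworkerset.sigs.k8s.io/name": name,
                # Worker index 0 is the leader of each group.
                "leaderworkerset.sigs.k8s.io/worker-index": "0",
            },
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }
```

Because the backend is a plain Service, an HTTPRoute can reference it on any Gateway API implementation without extra CRDs.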

Two distinct layers, kept straight (an earlier version of this section conflated them):

Workload gateway (Envoy inference-gateway, one per cluster): routes to a replica's serving pods. The dropped InferencePool/EPP did KV-/load-aware pod picking here. Reintroducing it is a workload-gateway concern — it needs a GAIE-conformant workload gateway (Envoy Gateway's InferencePool v1 support is unconfirmed; alternatively switch the workload gateway to Istio/agentgateway). Traefik is not involved at this layer.
Control-plane gateway (Traefik modelplane): routes across replicas/clusters via weighted backendRefs. #8 (inference-aware routing on the control plane) and #90 (weighted splits) live here — Traefik does weights natively but is not GAIE-conformant, so any pod-level picking at this layer would need an in-path picker rather than a GAIE gateway extension.

Multi-node bootstrap (Ray)

The LWS gang needs a Ray cluster spanning its pods (vLLM --pipeline-parallel-size > 1). Approach (matches the upstream LWS/vLLM/KServe convention; reuses the proven bootstrap from the closed PR #78):

Separate leaderTemplate/workerTemplate with role-specific commands (no if WORKER_INDEX==0 branch): leader runs ray start --head then execs the engine; worker runs ray start --address=$LWS_LEADER_ADDRESS:6379 --block.
LWS_* env (LWS_WORKER_INDEX/LWS_LEADER_ADDRESS/LWS_GROUP_SIZE) is the documented public interface.
User command override = escape hatch: an optional command on the curated Container (added in this PR, schemas regenerated) bypasses injection and runs verbatim on both templates — covers non-vLLM engines (e.g. SGLang's torch.dist) against the LWS_* contract. No init containers / no sidecar in v0.1.
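
The role-specific commands above can be sketched as follows. This assumes each command is wrapped in `/bin/sh -c` so the shell expands `$LWS_LEADER_ADDRESS` at runtime; the helper names and the engine argv are hypothetical.

```python
import shlex

RAY_PORT = 6379  # port used in the worker's --address, per the description above


def leader_command(engine_argv: list[str]) -> list[str]:
    """Leader: start the Ray head, then exec the engine in the same process."""
    script = f"ray start --head && exec {shlex.join(engine_argv)}"
    return ["/bin/sh", "-c", script]


def worker_command() -> list[str]:
    """Worker: join the leader's Ray cluster and block until it exits."""
    script = f"ray start --address=$LWS_LEADER_ADDRESS:{RAY_PORT} --block"
    return ["/bin/sh", "-c", script]
```

Keeping two templates means neither command has to branch on `LWS_WORKER_INDEX`.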

Live validation (real GKE)

Provisioned a fresh GKE cluster end-to-end and served Qwen2.5-0.5B single-pod via the native path — pod 1/1 on an L4, real OpenAI completions. Surfaced + fixed: URLRewrite prefix-strip, and the modelplane-system namespace creation. Re #102: this PR doesn't change RoutingReady-vs-ModelReady gating, so #102 stands separately.

Revised follow-ups

llm-d path → Service routing (this PR): drop InferencePool/EPP + the inference-extension CRD install; emit HTTPRoute → Service selecting LWS leaders.
Multi-node Ray bootstrap (this PR): separate leader/worker templates + LWS_* contract + override escape hatch.
Live-validate multi-node (pipeline: 2) on a 2-GPU-node cluster.
Inference-aware routing → #8 (separate); weighted splitting → #90 (Traefik-native, no GAIE needed).

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Introduces function/backends/base.py with the Backend Protocol, backend identifiers (native/llmd/dynamo), topology helpers (nodes_per_worker, needs_cross_pod_coordination), and the select_backend dispatcher. Tests verify single-pod -> NATIVE and multi-node -> LLMD routing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
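
A rough sketch of the dispatch rule this commit describes. The real `select_backend` takes the replica XR; here a hypothetical `Topology` dataclass stands in for it, and the `BackendID` enum name is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class BackendID(str, Enum):
    NATIVE = "native"
    LLMD = "llmd"
    DYNAMO = "dynamo"  # stub, never selected in this sketch


@dataclass(frozen=True)
class Topology:
    """Hypothetical stand-in for the parallelism fields on the replica XR."""
    pipeline: int = 1  # pipeline-parallel stages; more than 1 spans pods


def needs_cross_pod_coordination(topology: Topology) -> bool:
    """True when the replica cannot fit in one self-contained pod."""
    return topology.pipeline > 1


def select_backend(topology: Topology) -> BackendID:
    """Single pod goes to native; multi-pod goes to llm-d."""
    if needs_cross_pod_coordination(topology):
        return BackendID.LLMD
    return BackendID.NATIVE
```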

Implements NativeBackend.build(), which composes a Deployment, Service, and HTTPRoute as provider-kubernetes Objects for single-pod (pipeline=1) replicas. Engine args (including --model=) are passed through unmodified — no KServe hf:// rewrite needed. Adds TestNativeBackend with realistic fixtures (metadata namespace + InferenceCluster status.providerConfigRef populated). Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
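
For illustration, a sketch of wrapping a native Deployment manifest in a provider-kubernetes Object with engine args passed through untouched. The helper names are hypothetical, and the Object `apiVersion` default is an assumption; the version this repo targets may differ.

```python
def native_deployment(name: str, namespace: str, image: str, args: list[str]) -> dict:
    """Single-pod Deployment; engine args are copied through unmodified."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": "engine", "image": image, "args": list(args)}
                    ]
                },
            },
        },
    }


def as_object(
    manifest: dict,
    provider_config: str,
    api_version: str = "kubernetes.crossplane.io/v1alpha2",  # assumption
) -> dict:
    """Wrap a manifest so provider-kubernetes applies it on the remote cluster."""
    return {
        "apiVersion": api_version,
        "kind": "Object",
        "spec": {
            "forProvider": {"manifest": manifest},
            "providerConfigRef": {"name": provider_config},
        },
    }
```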

…convention Replaces the stale KServe-era path /{_NAMESPACE_REMOTE}/{child_name}/ with /{namespace}/{deployment-name}/, matching the PathPrefix that compose-model-replica's native and llm-d backends emit on the remote cluster. Removes the now-unused _NAMESPACE_REMOTE constant and stale LLMInferenceService comment; updates the compose_endpoints docstring. Tests updated to expect the new path format and confirmed passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

The control-plane ModelService rewrites to /<ns>/<deployment>/ (identity), so the remote gateway receives /<ns>/<deployment>/v1/...; the engine serves /v1/... and would 404. Add a URLRewrite ReplacePrefixMatch filter on the native and llm-d HTTPRoutes to strip the prefix to /. Found in code review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
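
A sketch of what such a route could look like as a Python dict, using the standard Gateway API `URLRewrite` filter with `ReplacePrefixMatch`. The helper names, the port, and the route's namespace placement are assumptions. `strip_prefix` only approximates the gateway's rewrite so the effect can be checked locally.

```python
def replica_route(namespace: str, name: str, port: int = 8000) -> dict:
    """HTTPRoute that matches /<namespace>/<name>/ and strips it before the engine."""
    prefix = f"/{namespace}/{name}/"
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},  # placement assumed
        "spec": {
            "parentRefs": [
                {"name": "inference-gateway", "namespace": "modelplane-system"}
            ],
            "rules": [
                {
                    "matches": [{"path": {"type": "PathPrefix", "value": prefix}}],
                    "filters": [
                        {
                            "type": "URLRewrite",
                            "urlRewrite": {
                                "path": {
                                    "type": "ReplacePrefixMatch",
                                    "replacePrefixMatch": "/",
                                }
                            },
                        }
                    ],
                    "backendRefs": [{"name": name, "port": port}],
                }
            ],
        },
    }


def strip_prefix(path: str, prefix: str) -> str:
    """Approximate ReplacePrefixMatch with replacement "/" (local check only)."""
    base = prefix.rstrip("/")
    if path == base or path.startswith(base + "/"):
        return "/" + path[len(base):].lstrip("/")
    return path  # no match: the route would not apply
```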

- compose-inference-cluster: kssv1alpha1 -> ssv1alpha1 in main's new EKS backend-secrets method (missed by the KServeBackend->ServingStack rename)
- compose-serving-stack: drop now-unused json import (adopted main's YAML inference-extension CRD loader)
- regenerate uv.lock for the compose-serving-stack workspace member rename
- add the code-review notes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

…cluster The Gateway (and the model-serving HTTPRoutes that target it) live in modelplane-system on the workload cluster, but nothing provisioned that namespace — the old KServe path relied on the kserve chart's own namespace. Compose a Namespace object so the Gateway can be created. Found provisioning a real GKE cluster (Gateway object failed: namespaces "modelplane-system" not found). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

The docs/superpowers/ specs/plans/notes are local planning artifacts from the superpowers workflow, not repo content. Ignore the directory and untrack the four files currently committed (kept on disk). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The control-plane gateway is now Traefik (d0df0e7), which is not GAIE-conformant: it can't consume an InferencePool backendRef or call an ext-proc endpoint-picker. So the llm-d multi-pod path drops the InferencePool/EPP machinery and routes HTTPRoute -> Service, exactly like the native path and Nic's Traefik pattern. The Service selects the LWS *leader* pods (only the leader serves the OpenAI API for vLLM). This works on any Gateway API impl (Traefik and Envoy) and removes the GAIE v1 CRD swap and the workload-gateway-conformance question from scope. Inference-aware endpoint picking moves to #8 as a Traefik-compatible in-path picker.

Also lands the multi-node Ray bootstrap: separate LWS leader/worker templates with role-specific commands (no LWS_WORKER_INDEX branch). The leader runs `ray start --head` then execs the engine; the worker runs `ray start --address=$LWS_LEADER_ADDRESS:6379 --block`. The LWS_* env is the documented public contract. A user-command override (non-vLLM escape hatch) is TODO'd pending a `command` field on the curated Container.

compose-serving-stack no longer installs the inference-extension CRDs (nothing emits an InferencePool); drops the unused yaml + mark_readiness keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds an optional `command` field to the curated Container in the ModelDeployment and ModelReplica XRDs (regenerated the Python schemas via `crossplane project build`). When the engine container sets a `command`, the llm-d backend injects neither the vLLM/Ray bootstrap nor vLLM-specific parallelism flags — the command runs verbatim on both the LWS leader and worker templates and owns cross-node coordination against the LWS_* contract (LWS_WORKER_INDEX, LWS_LEADER_ADDRESS, LWS_GROUP_SIZE). This is the escape hatch for symmetric non-vLLM engines such as SGLang (--nnodes/--node-rank/--dist-init-addr). vLLM/Ray stays the turnkey default when no command is set. The native single-pod backend also passes a user command through as the container entrypoint override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
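
The override behavior might be sketched like this for the leader template. The `engine` dict shape, the `leader_container` helper, and the `default_argv` parameter are assumptions for illustration; only the branch (a set `command` disables all injection) comes from the commit message.

```python
import shlex


def leader_container(engine: dict, default_argv: list[str], pipeline: int) -> dict:
    """Render the LWS leader container from a curated engine spec.

    engine: {"image": str, "args": [...], optional "command": [...]}
    default_argv: hypothetical turnkey entrypoint, e.g. ["vllm", "serve"]
    """
    container = {"name": "engine", "image": engine["image"]}
    if engine.get("command"):
        # Escape hatch: run the user's command verbatim and inject nothing.
        container["command"] = list(engine["command"])
        if engine.get("args"):
            container["args"] = list(engine["args"])
        return container
    # Turnkey default: Ray head bootstrap plus the vLLM parallelism flag.
    argv = [
        *default_argv,
        *engine.get("args", []),
        f"--pipeline-parallel-size={pipeline}",
    ]
    script = f"ray start --head && exec {shlex.join(argv)}"
    container["command"] = ["/bin/sh", "-c", script]
    return container
```

With a command set, the engine itself has to read `LWS_WORKER_INDEX`, `LWS_LEADER_ADDRESS`, and `LWS_GROUP_SIZE` to coordinate across nodes.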

negz · 2026-06-05T05:45:53Z

The original llm-d path emitted a GAIE InferencePool v1 + EPP and pointed an HTTPRoute backendRef at the pool. That mechanism requires a GAIE-conformant, ext-proc-capable data plane (Envoy/Istio/agentgateway). But the control-plane gateway is now Traefik (Nic, d0df0e7 — Traefik is the only Gateway API impl that does per-backendRef URLRewrite for weighted cross-endpoint splits), and Traefik is not GAIE-conformant — it can't consume an InferencePool backendRef or call an ext-proc EPP.

I'm just starting my review - but does this matter? llm-d is running on the InferenceCluster right? We only need to run Traefik on the InferenceGateway - i.e. on the modelplane control plane. It sits in front of whatever the InferenceCluster's stack uses for routing. I would've thought the InferenceCluster (and thus llm-d etc) would be blissfully unaware of Traefik (or whatever) sitting in front of it.

negz

Thanks @dennis-upbound! I like the backend dispatch pattern this introduces. I have a few questions though - and one concern WRT potential collisions if we allow multiple replicas per IC.

negz · 2026-06-05T05:49:05Z

                                  type: string
                                  minLength: 1
                                  description: Container image.
+                                command:


Did you end up needing this? I'm curious what for.

negz · 2026-06-05T05:50:40Z

    referenceable: true
    additionalPrinterColumns:
-    - name: KSERVE
+    - name: GAIE


Do we use GAIE? 🤔 The PR description said it was dropped in favor of Traefik.

negz · 2026-06-05T05:51:38Z

 # Modelplane provisions the full GKE cluster (VPC, subnet, system pool,
 # GPU pools, service account, IAM bindings) and installs the inference
-# stack (cert-manager, Envoy Gateway, Prometheus, KEDA, LeaderWorkerSet).
+# stack (cert-manager, Envoy Gateway, Prometheus, LeaderWorkerSet, Gateway API Inference Extension).


Nit: The rest of this is wrapped at 80 chars.

(Here and in a few other comments.)

negz · 2026-06-05T05:57:28Z

        """
-        llmis = resource.child_name(self.xr.metadata.name)
-        rewrite_path = f"/{_NAMESPACE_REMOTE}/{llmis}/"
+        deployment_name = self.xr.metadata.name


Nit: Why break self.xr.metadata.name out into a var (used once AFAICT) but not self.xr.metadata.namespace? My preference would be to avoid the single-use var in both cases.

negz · 2026-06-05T06:09:05Z

+    ) -> dict[str, k8sobjv1alpha1.Object]:
+        engine = base.engine_container(replica)
+        pc = cluster.status.providerConfigRef.name
+        name = resource.child_name(deployment_name)


Could this cause a collision if multiple replicas land on the same IC? They can't with the scheduler today, but we do intend to allow it in v0.1.

negz · 2026-06-05T06:33:48Z

+                        # Strip the /<ns>/<deployment>/ routing prefix so the engine
+                        # (which serves /v1/...) sees the path it expects.


Does this mean at the IG layer we rewrite to /ns/deployment but then we just strip it off here? 🤔

negz · 2026-06-05T06:34:35Z

+            container["env"] = [e.model_dump(exclude_none=True) for e in engine.env]
+
+        pod_spec = {
+            "containers": [container],


Same question WRT ModelDeployment having multiple containers.

negz · 2026-06-05T06:35:00Z

+                        # Strip the /<ns>/<deployment>/ routing prefix so the engine
+                        # (which serves /v1/...) sees the path it expects.


Same question WRT the /ns/deployment rewrite potentially being a no-op.

negz · 2026-06-05T06:35:35Z

+        self.engine = base.engine_container(self.xr)
+        backend = _BACKENDS[base.select_backend(self.xr)]()
+        deployment_name = self._deployment_name()
+        for key, composed in backend.build(self.xr, self.ic, deployment_name).items():
+            resource.update(self.rsp.desired.resources[key], composed)


Nice, I like this clean pattern.

negz · 2026-06-05T06:39:04Z

+    )
+
+
+class TestDispatch(unittest.TestCase):


WDYT about making this test match the existing conventions (Case table, etc.)? I could see the argument that it doesn't make sense for backends, but this file feels like it introduces a second test style.
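
As a generic illustration of a table-driven test, here is a sketch using `unittest` with `subTest`. The `Case` fields and the `select_backend` stand-in are hypothetical, and the repo's actual Case-table convention may look different.

```python
import unittest
from dataclasses import dataclass


def select_backend(pipeline: int) -> str:
    """Hypothetical stand-in for the dispatcher under test."""
    return "llmd" if pipeline > 1 else "native"


@dataclass(frozen=True)
class Case:
    reason: str    # why this case exists
    pipeline: int  # input
    want: str      # expected backend


class TestSelectBackend(unittest.TestCase):
    def test_select_backend(self) -> None:
        cases = [
            Case(reason="single pod goes to native", pipeline=1, want="native"),
            Case(reason="multi-pod goes to llm-d", pipeline=2, want="llmd"),
        ]
        for case in cases:
            # subTest reports each failing row separately, by its reason.
            with self.subTest(case.reason):
                self.assertEqual(select_backend(case.pipeline), case.want)
```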

…tale GAIE Resolves Nic's review feedback on PR #99.

Collisions (the main concern): backends named workload resources after the *deployment* (shared by all replicas), so two replicas of one deployment on the same InferenceCluster would collide. Name the Deployment/LWS/Service/HTTPRoute after the replica (unique per placement) instead, and make the routing path per-replica (/<ns>/<replica>/). compose-model-deployment now emits a per-replica rewritePath/url to match. Dropped the now-redundant deployment_name arg from the Backend protocol and all backends.

Stale GAIE surface: the ServingStack XRD still advertised a GAIE printer column, a spec.versions.gatewayApiInferenceExtension field, and an "Inference Extension" description after the InferencePool path was dropped. Removed all three and regenerated schemas.

Accuracy: corrected the llm-d docstring (the HTTPRoute attaches to the workload Envoy gateway, not control-plane Traefik; GAIE/inference-aware routing is the #8 / control-plane concern) and documented why the per-replica prefix is rewritten at the control plane then stripped here (not a no-op: it addresses the replica across multiple deployments on one IC gateway).

Tests: refactored test_backends.py to the Case-table convention; updated test_fn.py and compose-model-deployment tests for replica-scoped names/paths.

Nits: wrapped the gke example at 80 chars; inlined a single-use var. Sidecar support (non-engine containers) is TODO'd as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dennis-upbound · 2026-06-05T14:22:26Z

Thanks for the thorough review @negz! Addressed in 9362db7:

Collisions (your main concern) — backends named workload resources after the deployment (shared by all replicas). They're now named after the replica (unique per placement: child_name(deployment, cluster)), and the routing path is per-replica (/<ns>/<replica>/). compose-model-deployment emits a matching per-replica rewritePath/url, so N replicas of one deployment can co-exist on one IC without colliding. Also dropped the now-redundant deployment_name arg from the Backend protocol + all backends.
servingstacks GAIE references — removed the GAIE printer column, the spec.versions.gatewayApiInferenceExtension field, and the "Inference Extension" wording, and regenerated schemas. Nothing read the field after the InferencePool path was dropped.
llm-d docstring / PR "Traefik" accuracy: you're right, I'd conflated layers. Reworded: the model HTTPRoute attaches to the workload Envoy gateway; GAIE/inference-aware routing (and Traefik's non-conformance) is the control-plane concern at the layer above, tracked in #8 (Explore inference-aware routing on the control plane gateway). PR description updated too.
/ns/deployment rewrite then strip — no-op? Not a no-op: the control plane rewrites the public /<ns>/<service>/ to the replica's /<ns>/<replica>/, which must survive to the workload gateway so it can address the right replica among multiple deployments on that IC; the backend then strips it so the engine sees /v1/. Added a clarifying comment in both backends.
The command field — it's the non-vLLM multi-node escape hatch: when set on the engine container it bypasses the built-in vLLM/Ray bootstrap and runs verbatim on both LWS templates, so engines like SGLang (--nnodes/--node-rank/--dist-init-addr, symmetric across pods) work against the LWS_* contract.
test_backends.py style — refactored to the Case-table convention (selection + Dynamo stay focused, since they assert dispatch/raises, not manifests).
Nits — gke example wrapped at 80; inlined the single-use var.
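
To make the per-replica naming and the two-hop path handling concrete, here is a small sketch. `replica_name` is a hypothetical stand-in for `child_name(deployment, cluster)`, whose real naming scheme is not shown in this thread; the two path helpers only model what the control-plane and workload gateways do.

```python
def replica_name(deployment: str, cluster: str) -> str:
    """Hypothetical stand-in for child_name(deployment, cluster)."""
    return f"{deployment}-{cluster}"


def control_plane_rewrite(path: str, ns: str, service: str, replica: str) -> str:
    """Hop 1: rewrite the public /<ns>/<service>/ prefix to /<ns>/<replica>/."""
    public = f"/{ns}/{service}/"
    if not path.startswith(public):
        raise ValueError(f"{path!r} does not start with {public!r}")
    return f"/{ns}/{replica}/" + path[len(public):]


def workload_strip(path: str, ns: str, replica: str) -> str:
    """Hop 2: strip /<ns>/<replica>/ so the engine sees /v1/..."""
    prefix = f"/{ns}/{replica}/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} does not start with {prefix!r}")
    return "/" + path[len(prefix):]
```

The replica prefix has to survive the first hop because it is the only thing that tells the workload gateway which replica's route should match.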

Deferred (TODO'd in code): sidecars (non-engine containers in the MD template) — out of v0.1 scope and non-trivial for the LWS gang; happy to file an issue. The endpoint-name-derived-namespaces idea I left as a thought for later.

dennis-upbound · 2026-06-05T14:42:31Z

Correction to my note above on the GAIE/Traefik layering — I conflated two layers, want to set it straight:

The dropped InferencePool/EPP were composed on the workload cluster and consumed by the workload Envoy gateway (inference-gateway), doing KV-/load-aware pod picking there. So whether to keep them is a workload-gateway question (Envoy's InferencePool v1 support is unconfirmed; or switch the workload gateway to Istio/agentgateway) — Traefik (the control-plane modelplane gateway) never sees them. Saying "Traefik isn't GAIE-conformant, so drop the InferencePool" (and my softer "it's the #8/control-plane concern") was wrong.

Cleaner split:

Workload (Envoy inference-gateway): pod-level KV-/load-aware picking — the dropped InferencePool's job. Dropped for v0.1 simply because we don't need picking yet; reintroducing it is the workload-gateway-conformance question, no Traefik involvement.
Control plane (Traefik modelplane): routing across replicas. This is #8 (Explore inference-aware routing on the control plane gateway); weighted splits are #90 (Support weighted traffic splitting across endpoint selectors), which Traefik handles natively.

The code is unchanged (Service routing on the workload gateway is correct either way); I've fixed the wording in the llmd.py docstring and the PR description.

…fik) The dropped InferencePool/EPP were a workload-gateway (Envoy `inference-gateway`) concern — pod-level KV-/load-aware picking — not a Traefik one. Reintroducing them depends on the workload gateway's GAIE-conformance, independent of the control-plane Traefik gateway. Issue #8 (inference-aware routing across replicas on the control plane) is a separate layer. Docstring only; no logic change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

negz · 2026-06-05T21:15:37Z

out of v0.1 scope and non-trivial for the LWS gang; happy to file an issue

Please do. If we can't actually support multiple containers we shouldn't allow the MD to specify multiple containers. I think we do at the moment.

dennis-upbound · 2026-06-05T21:39:43Z

Two things:

On the Traefik/routing question (your "does this matter? llm-d would be blissfully unaware of Traefik") — exactly right, and that's the correction I posted above. The InferenceCluster's stack (llm-d/Envoy inference-gateway) is unaware of Traefik; Traefik (modelplane) only fronts it on the control plane. My original wording wrongly tied the workload InferencePool drop to Traefik — the drop is purely a workload-gateway decision (v0.1 needs no KV-aware picking; Envoy InferencePool v1 conformance unconfirmed). Fixed in the docstring + PR description.

On sidecars — agreed, and good catch that the description even claimed they "pass through" (they were silently dropped). Filed #108. For this PR I've constrained containers to maxItems: 1 (single engine container) in both the ModelDeployment and ModelReplica XRDs, so we no longer advertise multi-container support we don't implement. Relaxing it is the work in #108 (needs design for which containers land on the LWS leader vs workers).

The XRD allowed multiple containers (CEL required exactly one named `engine` but didn't cap the total) and the description claimed extras "pass through as sidecars" — but the backends render only `engine` and silently dropped the rest. Add `maxItems: 1` to the containers array in the ModelDeployment and ModelReplica XRDs so we don't advertise multi-container support we don't implement. Real sidecar / multi-container support is tracked in #108. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
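
The combined effect of the existing CEL rule and the new `maxItems: 1` can be mirrored in plain Python for illustration. This `validate_containers` helper is hypothetical and is not how the XRD enforces the constraint.

```python
def validate_containers(containers: list[dict]) -> list[str]:
    """Return validation errors for a containers array (empty list if valid)."""
    errors = []
    # Mirrors maxItems: 1 on the containers array.
    if len(containers) > 1:
        errors.append("containers: at most 1 item allowed")
    # Mirrors the CEL rule: exactly one container named "engine".
    engines = sum(1 for c in containers if c.get("name") == "engine")
    if engines != 1:
        errors.append("containers: exactly one container must be named 'engine'")
    return errors
```

Before the cap, a spec with `engine` plus a sidecar passed validation and the sidecar was silently dropped; with the cap it is rejected up front.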

Dennis Ramdass and others added 17 commits June 3, 2026 13:06

Add design spec: drop KServe, dispatch to native + llm-d

841b14f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add implementation plan for dropping KServe; note v0.1 topology in spec

0a4f74c

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

docs: pin llm-d v0.7 / GAIE surface for KServe removal

ba45da3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

Revise spec/plan per llm-d v0.7 spike: render Objects, keep Envoy, GA…

4b4877f

…IE v1.5 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

feat(replica): add Dynamo backend stub (dormant in v0.1)

b53649a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

feat(replica): add llm-d multi-pod backend (renders Objects)

d621666

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

refactor(replica): dispatch to backends, drop KServe LLMInferenceService

f5c385e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

refactor(api): rename KServeBackend XRD/function to ServingStack (beh…

67abf62

…avior preserved) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

feat(serving-stack): install GAIE/llm-d substrate, drop KServe and KEDA

92f9021

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

docs: replace KServe with native + llm-d dispatch; document direct we…

ea8f0a4

…ight fetch Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

docs: fix stale ModelEndpoint rewritePath description; drop orphan KS…

9222b6e

…erve link Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

chore(serving-stack): reframe Prometheus as observability; note CRD-r…

6962442

…efresh readiness coupling Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>

dennis-upbound force-pushed the dennis/drop-kserve-spec branch from 2e5dda2 to 2b39078 Compare June 3, 2026 21:44

dennis-upbound marked this pull request as ready for review June 4, 2026 14:24

dennis-upbound changed the title ~~WIP: Drop KServe — dispatch to native + llm-d (Dynamo-ready)~~ Drop KServe — dispatch to native + llm-d (Dynamo-ready) Jun 4, 2026

Dennis Ramdass and others added 4 commits June 4, 2026 10:39

negz reviewed Jun 5, 2026

dennis-upbound mentioned this pull request Jun 5, 2026

Support sidecar / multi-container pods in ModelDeployment #108

Open

negz approved these changes Jun 6, 2026

negz merged commit 0521837 into main Jun 6, 2026
2 checks passed

This was referenced Jun 6, 2026

Support multiple containers (sidecars) in a serving pod #110

Closed

Add a Dynamo backend #111

Open

Support WVA scaling signal for KServe backends #30

Closed

negz deleted the dennis/drop-kserve-spec branch June 16, 2026 16:57
