Drop KServe — dispatch to native + llm-d (Dynamo-ready)#99
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…IE v1.5 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Introduces function/backends/base.py with the Backend Protocol, backend identifiers (native/llmd/dynamo), topology helpers (nodes_per_worker, needs_cross_pod_coordination), and the select_backend dispatcher. Tests verify single-pod -> NATIVE and multi-node -> LLMD routing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Implements NativeBackend.build(), which composes a Deployment, Service, and HTTPRoute as provider-kubernetes Objects for single-pod (pipeline=1) replicas. Engine args (including --model=) are passed through unmodified — no KServe hf:// rewrite needed. Adds TestNativeBackend with realistic fixtures (metadata namespace + InferenceCluster status.providerConfigRef populated). Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…convention
Replaces the stale KServe-era path /{_NAMESPACE_REMOTE}/{child_name}/
with /{namespace}/{deployment-name}/, matching the PathPrefix that
compose-model-replica's native and llm-d backends emit on the remote
cluster. Removes the now-unused _NAMESPACE_REMOTE constant and stale
LLMInferenceService comment; updates the compose_endpoints docstring.
Tests updated to expect the new path format and confirmed passing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…avior preserved) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…ight fetch Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…erve link Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…efresh readiness coupling Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
The control-plane ModelService rewrites to /<ns>/<deployment>/ (identity), so the remote gateway receives /<ns>/<deployment>/v1/...; the engine serves /v1/... and would 404. Add a URLRewrite ReplacePrefixMatch filter on the native and llm-d HTTPRoutes to strip the prefix to /. Found in code review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
- compose-inference-cluster: kssv1alpha1 -> ssv1alpha1 in main's new EKS backend-secrets method (missed by the KServeBackend->ServingStack rename) - compose-serving-stack: drop now-unused json import (adopted main's YAML inference-extension CRD loader) - regenerate uv.lock for the compose-serving-stack workspace member rename - add the code-review notes Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
2e5dda2 to
2b39078
Compare
…cluster The Gateway (and the model-serving HTTPRoutes that target it) live in modelplane-system on the workload cluster, but nothing provisioned that namespace — the old KServe path relied on the kserve chart's own namespace. Compose a Namespace object so the Gateway can be created. Found provisioning a real GKE cluster (Gateway object failed: namespaces "modelplane-system" not found). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
The docs/superpowers/ specs/plans/notes are local planning artifacts from the superpowers workflow, not repo content. Ignore the directory and untrack the four files currently committed (kept on disk). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The control-plane gateway is now Traefik (d0df0e7), which is not GAIE-conformant: it can't consume an InferencePool backendRef or call an ext-proc endpoint-picker. So the llm-d multi-pod path drops the InferencePool/EPP machinery and routes HTTPRoute -> Service, exactly like the native path and Nic's Traefik pattern. The Service selects the LWS *leader* pods (only the leader serves the OpenAI API for vLLM). This works on any Gateway API impl (Traefik and Envoy) and removes the GAIE v1 CRD swap and the workload-gateway-conformance question from scope. Inference-aware endpoint picking moves to #8 as a Traefik-compatible in-path picker. Also lands the multi-node Ray bootstrap: separate LWS leader/worker templates with role-specific commands (no LWS_WORKER_INDEX branch) — leader runs `ray start --head` then execs the engine; worker runs `ray start --address=$LWS_LEADER_ADDRESS:6379 --block`. The LWS_* env is the documented public contract. A user-command override (non-vLLM escape hatch) is TODO'd pending a `command` field on the curated Container. compose-serving-stack no longer installs the inference-extension CRDs (nothing emits an InferencePool); drops the unused yaml + mark_readiness keys. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds an optional `command` field to the curated Container in the ModelDeployment and ModelReplica XRDs (regenerated the Python schemas via `crossplane project build`). When the engine container sets a `command`, the llm-d backend injects neither the vLLM/Ray bootstrap nor vLLM-specific parallelism flags — the command runs verbatim on both the LWS leader and worker templates and owns cross-node coordination against the LWS_* contract (LWS_WORKER_INDEX, LWS_LEADER_ADDRESS, LWS_GROUP_SIZE). This is the escape hatch for symmetric non-vLLM engines such as SGLang (--nnodes/--node-rank/--dist-init-addr). vLLM/Ray stays the turnkey default when no command is set. The native single-pod backend also passes a user command through as the container entrypoint override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I'm just starting my review - but does this matter? llm-d is running on the InferenceCluster right? We only need to run Traefik on the InferenceGateway - i.e. on the modelplane control plane. It sits in front of whatever the InferenceCluster's stack uses for routing. I would've thought the InferenceCluster (and thus llm-d etc) would be blissfully unaware of Traefik (or whatever) sitting in front of it. |
negz
left a comment
There was a problem hiding this comment.
Thanks @dennis-upbound! I like the backend dispatch pattern this introduces. I have a few questions though - and one concern WRT potential collisions if we allow multiple replicas per IC.
| type: string | ||
| minLength: 1 | ||
| description: Container image. | ||
| command: |
There was a problem hiding this comment.
Did you end up needing this? I'm curious what for.
| referenceable: true | ||
| additionalPrinterColumns: | ||
| - name: KSERVE | ||
| - name: GAIE |
There was a problem hiding this comment.
Do we use GAIA? 🤔 The PR description said it was dropped in favor of Traefik.
| # Modelplane provisions the full GKE cluster (VPC, subnet, system pool, | ||
| # GPU pools, service account, IAM bindings) and installs the inference | ||
| # stack (cert-manager, Envoy Gateway, Prometheus, KEDA, LeaderWorkerSet). | ||
| # stack (cert-manager, Envoy Gateway, Prometheus, LeaderWorkerSet, Gateway API Inference Extension). |
There was a problem hiding this comment.
Nit: The rest of this is wrapped at 80 chars.
(Here and in a few other comments.)
| """ | ||
| llmis = resource.child_name(self.xr.metadata.name) | ||
| rewrite_path = f"/{_NAMESPACE_REMOTE}/{llmis}/" | ||
| deployment_name = self.xr.metadata.name |
There was a problem hiding this comment.
Nit: Why break self.xr.metadata.name out into a var (used once AFAICT) but not self.xr.metadata.namespace? My preference would be avoid the single use var in both cases.
| ) -> dict[str, k8sobjv1alpha1.Object]: | ||
| engine = base.engine_container(replica) | ||
| pc = cluster.status.providerConfigRef.name | ||
| name = resource.child_name(deployment_name) |
There was a problem hiding this comment.
Could this cause a collision if multiple replicas land on the same IC? They can't with the scheduler today, but we do intend to allow it in v0.1.
| # Strip the /<ns>/<deployment>/ routing prefix so the engine | ||
| # (which serves /v1/...) sees the path it expects. |
There was a problem hiding this comment.
Does this mean at the IG layer we rewrite to /ns/deployment but then we just strip it off here? 🤔
| container["env"] = [e.model_dump(exclude_none=True) for e in engine.env] | ||
|
|
||
| pod_spec = { | ||
| "containers": [container], |
There was a problem hiding this comment.
Same question WRT ModelDeployment having multiple containers.
| # Strip the /<ns>/<deployment>/ routing prefix so the engine | ||
| # (which serves /v1/...) sees the path it expects. |
There was a problem hiding this comment.
Same question WRT the /ns/deployment rewrite potentially being a no-op.
| self.engine = base.engine_container(self.xr) | ||
| backend = _BACKENDS[base.select_backend(self.xr)]() | ||
| deployment_name = self._deployment_name() | ||
| for key, composed in backend.build(self.xr, self.ic, deployment_name).items(): | ||
| resource.update(self.rsp.desired.resources[key], composed) |
There was a problem hiding this comment.
Nice, I like this clean pattern.
| ) | ||
|
|
||
|
|
||
| class TestDispatch(unittest.TestCase): |
There was a problem hiding this comment.
WDYT about making this test match the existing conventions Case table, etc. I could see the argument it doesn't make sense for backends, but this file feels like it introduces a second test style.
…tale GAIE Resolves Nic's review feedback on PR #99. Collisions (the main concern): backends named workload resources after the *deployment* (shared by all replicas), so two replicas of one deployment on the same InferenceCluster would collide. Name the Deployment/LWS/Service/HTTPRoute after the replica (unique per placement) instead, and make the routing path per-replica (/<ns>/<replica>/) — compose-model-deployment now emits a per-replica rewritePath/url to match. Dropped the now-redundant deployment_name arg from the Backend protocol and all backends. Stale GAIE surface: the ServingStack XRD still advertised a GAIE printer column, a spec.versions.gatewayApiInferenceExtension field, and an "Inference Extension" description after the InferencePool path was dropped — removed all three and regenerated schemas. Accuracy: corrected the llm-d docstring (the HTTPRoute attaches to the workload Envoy gateway, not control-plane Traefik; GAIE/inference-aware routing is the #8 / control-plane concern) and documented why the per-replica prefix is rewritten at the control plane then stripped here (not a no-op — it addresses the replica across multiple deployments on one IC gateway). Tests: refactored test_backends.py to the Case-table convention; updated test_fn.py and compose-model-deployment tests for replica-scoped names/paths. Nits: wrapped the gke example at 80 chars; inlined a single-use var. Sidecar support (non-engine containers) is TODO'd as a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review @negz! Addressed in 9362db7:
Deferred (TODO'd in code): sidecars (non-engine containers in the MD template) — out of v0.1 scope and non-trivial for the LWS gang; happy to file an issue. The endpoint-name-derived-namespaces idea I left as a thought for later. |
|
Correction to my note above on the GAIE/Traefik layering — I conflated two layers, want to set it straight: The dropped Cleaner split:
The code is unchanged (Service routing on the workload gateway is correct either way); I've fixed the wording in the |
…fik) The dropped InferencePool/EPP were a workload-gateway (Envoy `inference-gateway`) concern — pod-level KV-/load-aware picking — not a Traefik one. Reintroducing them depends on the workload gateway's GAIE-conformance, independent of the control-plane Traefik gateway. Issue #8 (inference-aware routing across replicas on the control plane) is a separate layer. Docstring only; no logic change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Please do. If we can't actually support multiple containers we shouldn't allow the MD to specify multiple containers. I think we do at the moment. |
|
Two things: On the Traefik/routing question (your "does this matter? llm-d would be blissfully unaware of Traefik") — exactly right, and that's the correction I posted above. The InferenceCluster's stack (llm-d/Envoy On sidecars — agreed, and good catch that the description even claimed they "pass through" (they were silently dropped). Filed #108. For this PR I've constrained |
The XRD allowed multiple containers (CEL required exactly one named `engine` but didn't cap the total) and the description claimed extras "pass through as sidecars" — but the backends render only `engine` and silently dropped the rest. Add `maxItems: 1` to the containers array in the ModelDeployment and ModelReplica XRDs so we don't advertise multi-container support we don't implement. Real sidecar / multi-container support is tracked in #108. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #65.
Removes the KServe dependency and composes serving directly.
Status: rebased onto latest
main; native single-pod path validated end-to-end on a real GKE cluster (see below). Reworking the llm-d multi-pod path to be gateway-agnostic / Traefik-compatible — see Routing approach below.What changed
compose-model-replicais now a dispatcher (backends/{native,llmd,dynamo}.py):Deployment+Service+HTTPRoutepipeline > 1) → llm-dLeaderWorkerSet+Service+HTTPRoute(see Routing approach)LLMInferenceServiceemission deleted.KServeBackendXR →ServingStack: installs the serving substrate (LeaderWorkerSet, Gateway API, cert-manager, Prometheus, Envoy Gateway); drops KServe + KEDA. Creates themodelplane-systemnamespace on the workload cluster./<ns>/<deployment>/prefix (URLRewrite) so the engine sees/v1/....compose-model-deploymentendpoint path reconciled with the new HTTPRoute convention.ModelCache→ engine fetches directly; documented in getting-started.Routing approach — GAIE vs Traefik (how we're doing it)
v0.1 does no inference-aware (KV-/load-aware) endpoint picking, so the GAIE
InferencePoolv1 + EPP machinery the llm-d path originally emitted isn't needed yet.Decision: drop
InferencePool/EPP and routeHTTPRoute → Service(plain Gateway API —Service+EndpointSlice, no EnvoyBackendCRD), exactly like the native path. The HTTPRoute attaches to the workload cluster's inference gateway (Envoyinference-gateway, installed byServingStack); the Service selects the LWS leader pods (only leaders serve the OpenAI API for vLLM). llm-d now routes identically to native on any Gateway API impl, so the GAIE v1 CRD swap and the inference-extension CRD install are gone from this PR.Two distinct layers, kept straight (an earlier version of this section conflated them):
inference-gateway, one per cluster): routes to a replica's serving pods. The droppedInferencePool/EPP did KV-/load-aware pod picking here. Reintroducing it is a workload-gateway concern — it needs a GAIE-conformant workload gateway (Envoy Gateway'sInferencePoolv1 support is unconfirmed; alternatively switch the workload gateway to Istio/agentgateway). Traefik is not involved at this layer.modelplane): routes across replicas/clusters via weightedbackendRefs. #8 (inference-aware routing on the control plane) and #90 (weighted splits) live here — Traefik does weights natively but is not GAIE-conformant, so any pod-level picking at this layer would need an in-path picker rather than a GAIE gateway extension.Multi-node bootstrap (Ray)
The LWS gang needs a Ray cluster spanning its pods (vLLM
--pipeline-parallel-size > 1). Approach (matches the upstream LWS/vLLM/KServe convention; reuses the proven bootstrap from the closed PR #78):leaderTemplate/workerTemplatewith role-specificcommands (noif WORKER_INDEX==0branch): leader runsray start --headthen execs the engine; worker runsray start --address=$LWS_LEADER_ADDRESS:6379 --block.LWS_*env (LWS_WORKER_INDEX/LWS_LEADER_ADDRESS/LWS_GROUP_SIZE) is the documented public interface.commandoverride = escape hatch: an optionalcommandon the curated Container (added in this PR, schemas regenerated) bypasses injection and runs verbatim on both templates — covers non-vLLM engines (e.g. SGLang's torch.dist) against theLWS_*contract. No init containers / no sidecar in v0.1.Live validation (real GKE)
Provisioned a fresh GKE cluster end-to-end and served Qwen2.5-0.5B single-pod via the native path — pod
1/1on an L4, real OpenAI completions. Surfaced + fixed: URLRewrite prefix-strip, and themodelplane-systemnamespace creation. Re #102: this PR doesn't changeRoutingReady-vs-ModelReadygating, so #102 stands separately.Revised follow-ups
InferencePool/EPP + the inference-extension CRD install; emitHTTPRoute → Serviceselecting LWS leaders.LWS_*contract + override escape hatch.pipeline: 2) on a 2-GPU-node cluster.🤖 Generated with Claude Code