Drop KServe — dispatch to native + llm-d (Dynamo-ready) #99

Merged
negz merged 24 commits into main from dennis/drop-kserve-spec on Jun 6, 2026

Conversation

@dennis-upbound

@dennis-upbound dennis-upbound commented Jun 2, 2026

Collaborator

Fixes #65.

Removes the KServe dependency and composes serving directly.

Status: rebased onto latest main; native single-pod path validated end-to-end on a real GKE cluster (see below). Reworking the llm-d multi-pod path to be gateway-agnostic / Traefik-compatible — see Routing approach below.

What changed

  • compose-model-replica is now a dispatcher (backends/{native,llmd,dynamo}.py):
    • single self-contained pod → native Kubernetes Deployment + Service + HTTPRoute
    • multi-pod (pipeline > 1) → llm-d LeaderWorkerSet + Service + HTTPRoute (see Routing approach)
    • Dynamo stub (dormant in v0.1)
    • No user-facing backend field; KServe LLMInferenceService emission deleted.
  • KServeBackend XR → ServingStack: installs the serving substrate (LeaderWorkerSet, Gateway API, cert-manager, Prometheus, Envoy Gateway); drops KServe + KEDA. Creates the modelplane-system namespace on the workload cluster.
  • In-cluster HTTPRoute strips the /<ns>/<deployment>/ prefix (URLRewrite) so the engine sees /v1/....
  • compose-model-deployment endpoint path reconciled with the new HTTPRoute convention.
  • Weight loading: no ModelCache → engine fetches directly; documented in getting-started.
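
The dispatch rule in the list above can be sketched roughly as follows. The names `select_backend` and `needs_cross_pod_coordination` and the backend identifiers appear in the commit notes for `backends/base.py`; the `Replica` dataclass and its `pipeline` field are hypothetical stand-ins for the real XR, so treat this as an illustration of the rule rather than the code in this PR.

```python
from dataclasses import dataclass
from enum import Enum


class BackendID(str, Enum):
    NATIVE = "native"
    LLMD = "llmd"
    DYNAMO = "dynamo"  # dormant in v0.1; never selected in this sketch


@dataclass
class Replica:
    """Hypothetical stand-in for the ModelReplica XR's parallelism settings."""

    pipeline: int = 1


def needs_cross_pod_coordination(replica: Replica) -> bool:
    # A pipeline degree above 1 means the replica spans several pods.
    return replica.pipeline > 1


def select_backend(replica: Replica) -> BackendID:
    # Multi-pod replicas go to llm-d (LeaderWorkerSet); everything else
    # is a plain Deployment + Service + HTTPRoute.
    if needs_cross_pod_coordination(replica):
        return BackendID.LLMD
    return BackendID.NATIVE
```

Because the choice is derived from topology alone, there is no user-facing backend field to validate or migrate.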

Routing approach — GAIE vs Traefik (how we're doing it)

v0.1 does no inference-aware (KV-/load-aware) endpoint picking, so the GAIE InferencePool v1 + EPP machinery the llm-d path originally emitted isn't needed yet.

Decision: drop InferencePool/EPP and route HTTPRoute → Service (plain Gateway API — Service + EndpointSlice, no Envoy Backend CRD), exactly like the native path. The HTTPRoute attaches to the workload cluster's inference gateway (Envoy inference-gateway, installed by ServingStack); the Service selects the LWS leader pods (only leaders serve the OpenAI API for vLLM). llm-d now routes identically to native on any Gateway API impl, so the GAIE v1 CRD swap and the inference-extension CRD install are gone from this PR.
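
A minimal sketch of the HTTPRoute → Service shape described above, written as plain Python dicts. The gateway name and namespace come from this description; the `app`/`role` pod labels, the port 8000, and the function names are assumptions for illustration and may not match what the llm-d backend actually emits.

```python
def leader_service(name: str, namespace: str, port: int = 8000) -> dict:
    """Service that selects only the LWS leader pods of one replica."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical labels: assumes the leader template labels its
            # pods with app=<name> and role=leader.
            "selector": {"app": name, "role": "leader"},
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }


def route_to_service(name: str, namespace: str, prefix: str, port: int = 8000) -> dict:
    """HTTPRoute whose backendRef is a plain Service, not an InferencePool."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [
                {"name": "inference-gateway", "namespace": "modelplane-system"}
            ],
            "rules": [
                {
                    "matches": [{"path": {"type": "PathPrefix", "value": prefix}}],
                    # backendRefs default to kind Service in the core group.
                    "backendRefs": [{"name": name, "port": port}],
                }
            ],
        },
    }
```

Since the backendRef is an ordinary Service, any Gateway API implementation that supports HTTPRoute should be able to serve it without the inference-extension CRDs.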

Two distinct layers, kept straight (an earlier version of this section conflated them):

  • Workload gateway (Envoy inference-gateway, one per cluster): routes to a replica's serving pods. The dropped InferencePool/EPP did KV-/load-aware pod picking here. Reintroducing it is a workload-gateway concern — it needs a GAIE-conformant workload gateway (Envoy Gateway's InferencePool v1 support is unconfirmed; alternatively switch the workload gateway to Istio/agentgateway). Traefik is not involved at this layer.
  • Control-plane gateway (Traefik modelplane): routes across replicas/clusters via weighted backendRefs. #8 (inference-aware routing on the control plane) and #90 (weighted splits) live here — Traefik does weights natively but is not GAIE-conformant, so any pod-level picking at this layer would need an in-path picker rather than a GAIE gateway extension.

Multi-node bootstrap (Ray)

The LWS gang needs a Ray cluster spanning its pods (vLLM --pipeline-parallel-size > 1). Approach (matches the upstream LWS/vLLM/KServe convention; reuses the proven bootstrap from the closed PR #78):

  • Separate leaderTemplate/workerTemplate with role-specific commands (no if WORKER_INDEX==0 branch): leader runs ray start --head then execs the engine; worker runs ray start --address=$LWS_LEADER_ADDRESS:6379 --block.
  • LWS_* env (LWS_WORKER_INDEX/LWS_LEADER_ADDRESS/LWS_GROUP_SIZE) is the documented public interface.
  • User command override = escape hatch: an optional command on the curated Container (added in this PR, schemas regenerated) bypasses injection and runs verbatim on both templates — covers non-vLLM engines (e.g. SGLang's torch.dist) against the LWS_* contract. No init containers / no sidecar in v0.1.

Live validation (real GKE)

Provisioned a fresh GKE cluster end-to-end and served Qwen2.5-0.5B single-pod via the native path — pod 1/1 on an L4, real OpenAI completions. Surfaced + fixed: URLRewrite prefix-strip, and the modelplane-system namespace creation. Re #102: this PR doesn't change RoutingReady-vs-ModelReady gating, so #102 stands separately.

Revised follow-ups

  1. llm-d path → Service routing (this PR): drop InferencePool/EPP + the inference-extension CRD install; emit HTTPRoute → Service selecting LWS leaders.
  2. Multi-node Ray bootstrap (this PR): separate leader/worker templates + LWS_* contract + override escape hatch.
  3. Live-validate multi-node (pipeline: 2) on a 2-GPU-node cluster.
  4. Inference-aware routing → #8 (separate); weighted splitting → #90 (Traefik-native, no GAIE needed).

🤖 Generated with Claude Code

Dennis Ramdass and others added 17 commits June 3, 2026 13:06
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…IE v1.5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Introduces function/backends/base.py with the Backend Protocol, backend
identifiers (native/llmd/dynamo), topology helpers (nodes_per_worker,
needs_cross_pod_coordination), and the select_backend dispatcher.
Tests verify single-pod -> NATIVE and multi-node -> LLMD routing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Implements NativeBackend.build(), which composes a Deployment, Service, and
HTTPRoute as provider-kubernetes Objects for single-pod (pipeline=1) replicas.
Engine args (including --model=) are passed through unmodified — no KServe
hf:// rewrite needed. Adds TestNativeBackend with realistic fixtures (metadata
namespace + InferenceCluster status.providerConfigRef populated).

Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…convention

Replaces the stale KServe-era path /{_NAMESPACE_REMOTE}/{child_name}/
with /{namespace}/{deployment-name}/, matching the PathPrefix that
compose-model-replica's native and llm-d backends emit on the remote
cluster. Removes the now-unused _NAMESPACE_REMOTE constant and stale
LLMInferenceService comment; updates the compose_endpoints docstring.
Tests updated to expect the new path format and confirmed passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…avior preserved)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…ight fetch

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…erve link

Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
…efresh readiness coupling

Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
The control-plane ModelService rewrites to /<ns>/<deployment>/ (identity),
so the remote gateway receives /<ns>/<deployment>/v1/...; the engine serves
/v1/... and would 404. Add a URLRewrite ReplacePrefixMatch filter on the
native and llm-d HTTPRoutes to strip the prefix to /. Found in code review.
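
A sketch of the rule this commit describes, as a Python dict in Gateway API HTTPRoute form. The filter fields (`URLRewrite`, `ReplacePrefixMatch`) are the standard Gateway API ones; the helper names and the example prefix are hypothetical, and `apply_replace_prefix_match` is only a string-level approximation of what a conformant gateway does.

```python
def strip_prefix_rule(prefix: str, service: str, port: int) -> dict:
    """HTTPRoute rule that matches a routing prefix and rewrites it to /."""
    return {
        "matches": [{"path": {"type": "PathPrefix", "value": prefix}}],
        "filters": [
            {
                "type": "URLRewrite",
                "urlRewrite": {
                    "path": {
                        "type": "ReplacePrefixMatch",
                        "replacePrefixMatch": "/",
                    }
                },
            }
        ],
        "backendRefs": [{"name": service, "port": port}],
    }


def apply_replace_prefix_match(path: str, prefix: str, replacement: str) -> str:
    """Approximate the rewrite on a request path (illustration only)."""
    if not path.startswith(prefix):
        return path  # the rule would not have matched
    rest = path[len(prefix):].lstrip("/")
    return replacement.rstrip("/") + "/" + rest
```

With a prefix of `/team-a/qwen/`, a request for `/team-a/qwen/v1/completions` reaches the engine as `/v1/completions`.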

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
- compose-inference-cluster: kssv1alpha1 -> ssv1alpha1 in main's new EKS
  backend-secrets method (missed by the KServeBackend->ServingStack rename)
- compose-serving-stack: drop now-unused json import (adopted main's YAML
  inference-extension CRD loader)
- regenerate uv.lock for the compose-serving-stack workspace member rename
- add the code-review notes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
@dennis-upbound dennis-upbound force-pushed the dennis/drop-kserve-spec branch from 2e5dda2 to 2b39078 Compare June 3, 2026 21:44
@dennis-upbound dennis-upbound marked this pull request as ready for review June 4, 2026 14:24
@dennis-upbound dennis-upbound changed the title WIP: Drop KServe — dispatch to native + llm-d (Dynamo-ready) Drop KServe — dispatch to native + llm-d (Dynamo-ready) Jun 4, 2026
Dennis Ramdass and others added 4 commits June 4, 2026 10:39
…cluster

The Gateway (and the model-serving HTTPRoutes that target it) live in
modelplane-system on the workload cluster, but nothing provisioned that
namespace — the old KServe path relied on the kserve chart's own namespace.
Compose a Namespace object so the Gateway can be created. Found provisioning
a real GKE cluster (Gateway object failed: namespaces "modelplane-system"
not found).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dennis Ramdass <dramdass@Denniss-MacBook-Pro-2.local>
The docs/superpowers/ specs/plans/notes are local planning artifacts from
the superpowers workflow, not repo content. Ignore the directory and untrack
the four files currently committed (kept on disk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The control-plane gateway is now Traefik (d0df0e7), which is not
GAIE-conformant: it can't consume an InferencePool backendRef or call an
ext-proc endpoint-picker. So the llm-d multi-pod path drops the
InferencePool/EPP machinery and routes HTTPRoute -> Service, exactly like the
native path and Nic's Traefik pattern. The Service selects the LWS *leader*
pods (only the leader serves the OpenAI API for vLLM). This works on any
Gateway API impl (Traefik and Envoy) and removes the GAIE v1 CRD swap and the
workload-gateway-conformance question from scope. Inference-aware endpoint
picking moves to #8 as a Traefik-compatible in-path picker.

Also lands the multi-node Ray bootstrap: separate LWS leader/worker templates
with role-specific commands (no LWS_WORKER_INDEX branch) — leader runs
`ray start --head` then execs the engine; worker runs
`ray start --address=$LWS_LEADER_ADDRESS:6379 --block`. The LWS_* env is the
documented public contract. A user-command override (non-vLLM escape hatch)
is TODO'd pending a `command` field on the curated Container.

compose-serving-stack no longer installs the inference-extension CRDs (nothing
emits an InferencePool); drops the unused yaml + mark_readiness keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds an optional `command` field to the curated Container in the ModelDeployment
and ModelReplica XRDs (regenerated the Python schemas via `crossplane project
build`). When the engine container sets a `command`, the llm-d backend injects
neither the vLLM/Ray bootstrap nor vLLM-specific parallelism flags — the command
runs verbatim on both the LWS leader and worker templates and owns cross-node
coordination against the LWS_* contract (LWS_WORKER_INDEX, LWS_LEADER_ADDRESS,
LWS_GROUP_SIZE). This is the escape hatch for symmetric non-vLLM engines such as
SGLang (--nnodes/--node-rank/--dist-init-addr). vLLM/Ray stays the turnkey
default when no command is set.

The native single-pod backend also passes a user command through as the
container entrypoint override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@negz

negz commented Jun 5, 2026

Collaborator

> The original llm-d path emitted a GAIE InferencePool v1 + EPP and pointed an HTTPRoute backendRef at the pool. That mechanism requires a GAIE-conformant, ext-proc-capable data plane (Envoy/Istio/agentgateway). But the control-plane gateway is now Traefik (Nic, d0df0e7 — Traefik is the only Gateway API impl that does per-backendRef URLRewrite for weighted cross-endpoint splits), and Traefik is not GAIE-conformant — it can't consume an InferencePool backendRef or call an ext-proc EPP.

I'm just starting my review - but does this matter? llm-d is running on the InferenceCluster right? We only need to run Traefik on the InferenceGateway - i.e. on the modelplane control plane. It sits in front of whatever the InferenceCluster's stack uses for routing. I would've thought the InferenceCluster (and thus llm-d etc) would be blissfully unaware of Traefik (or whatever) sitting in front of it.

@negz negz left a comment

Collaborator

Thanks @dennis-upbound! I like the backend dispatch pattern this introduces. I have a few questions though - and one concern WRT potential collisions if we allow multiple replicas per IC.

type: string
minLength: 1
description: Container image.
command:

Did you end up needing this? I'm curious what for.

Comment thread apis/servingstacks/definition.yaml Outdated
referenceable: true
additionalPrinterColumns:
- - name: KSERVE
+ - name: GAIE

Do we use GAIE? 🤔 The PR description said it was dropped in favor of Traefik.

# Modelplane provisions the full GKE cluster (VPC, subnet, system pool,
# GPU pools, service account, IAM bindings) and installs the inference
- # stack (cert-manager, Envoy Gateway, Prometheus, KEDA, LeaderWorkerSet).
+ # stack (cert-manager, Envoy Gateway, Prometheus, LeaderWorkerSet, Gateway API Inference Extension).

Nit: The rest of this is wrapped at 80 chars.

(Here and in a few other comments.)

"""
- llmis = resource.child_name(self.xr.metadata.name)
- rewrite_path = f"/{_NAMESPACE_REMOTE}/{llmis}/"
+ deployment_name = self.xr.metadata.name

Nit: Why break self.xr.metadata.name out into a var (used once AFAICT) but not self.xr.metadata.namespace? My preference would be to avoid the single-use var in both cases.

) -> dict[str, k8sobjv1alpha1.Object]:
engine = base.engine_container(replica)
pc = cluster.status.providerConfigRef.name
name = resource.child_name(deployment_name)

Could this cause a collision if multiple replicas land on the same IC? They can't with the scheduler today, but we do intend to allow it in v0.1.

Comment on lines +210 to +211
# Strip the /<ns>/<deployment>/ routing prefix so the engine
# (which serves /v1/...) sees the path it expects.

Does this mean at the IG layer we rewrite to /ns/deployment but then we just strip it off here? 🤔

container["env"] = [e.model_dump(exclude_none=True) for e in engine.env]

pod_spec = {
"containers": [container],

Same question WRT ModelDeployment having multiple containers.

Comment on lines +111 to +112
# Strip the /<ns>/<deployment>/ routing prefix so the engine
# (which serves /v1/...) sees the path it expects.

Same question WRT the /ns/deployment rewrite potentially being a no-op.

Comment on lines +149 to +153
self.engine = base.engine_container(self.xr)
backend = _BACKENDS[base.select_backend(self.xr)]()
deployment_name = self._deployment_name()
for key, composed in backend.build(self.xr, self.ic, deployment_name).items():
resource.update(self.rsp.desired.resources[key], composed)

Nice, I like this clean pattern.

)


class TestDispatch(unittest.TestCase):

WDYT about making this test match the existing conventions (Case table, etc.)? I could see the argument that it doesn't make sense for backends, but this file feels like it introduces a second test style.

…tale GAIE

Resolves Nic's review feedback on PR #99.

Collisions (the main concern): backends named workload resources after the
*deployment* (shared by all replicas), so two replicas of one deployment on the
same InferenceCluster would collide. Name the Deployment/LWS/Service/HTTPRoute
after the replica (unique per placement) instead, and make the routing path
per-replica (/<ns>/<replica>/) — compose-model-deployment now emits a per-replica
rewritePath/url to match. Dropped the now-redundant deployment_name arg from the
Backend protocol and all backends.
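
To illustrate the collision fix described above: naming by deployment gives every replica of that deployment the same resource name, while naming by replica keeps names and route prefixes distinct. The replica names and helper functions below are hypothetical; the real names come from `resource.child_name`, whose implementation is not shown in this PR page.

```python
from collections import Counter


def route_prefix(namespace: str, replica: str) -> str:
    """Per-replica routing path of the form /<ns>/<replica>/."""
    return f"/{namespace}/{replica}/"


def collisions(names: list[str]) -> list[str]:
    """Return the names that more than one replica would claim."""
    return [name for name, count in Counter(names).items() if count > 1]


# Two hypothetical replicas of one deployment on the same InferenceCluster.
deployment = "qwen"
replicas = ["qwen-a", "qwen-b"]

# Old scheme: every replica's resources are named after the deployment.
old_names = [deployment for _ in replicas]
# New scheme: resources are named after the replica itself.
new_names = list(replicas)

print(collisions(old_names))  # the shared deployment name
print(collisions(new_names))  # empty: no two replicas share a name
print([route_prefix("team-a", r) for r in new_names])
```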

Stale GAIE surface: the ServingStack XRD still advertised a GAIE printer column,
a spec.versions.gatewayApiInferenceExtension field, and an "Inference Extension"
description after the InferencePool path was dropped — removed all three and
regenerated schemas.

Accuracy: corrected the llm-d docstring (the HTTPRoute attaches to the workload
Envoy gateway, not control-plane Traefik; GAIE/inference-aware routing is the
#8 / control-plane concern) and documented why the per-replica prefix is
rewritten at the control plane then stripped here (not a no-op — it addresses
the replica across multiple deployments on one IC gateway).

Tests: refactored test_backends.py to the Case-table convention; updated
test_fn.py and compose-model-deployment tests for replica-scoped names/paths.

Nits: wrapped the gke example at 80 chars; inlined a single-use var. Sidecar
support (non-engine containers) is TODO'd as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dennis-upbound

Collaborator Author

Thanks for the thorough review @negz! Addressed in 9362db7:

  • Collisions (your main concern) — backends named workload resources after the deployment (shared by all replicas). They're now named after the replica (unique per placement: child_name(deployment, cluster)), and the routing path is per-replica (/<ns>/<replica>/). compose-model-deployment emits a matching per-replica rewritePath/url, so N replicas of one deployment can co-exist on one IC without colliding. Also dropped the now-redundant deployment_name arg from the Backend protocol + all backends.
  • servingstacks GAIE references — removed the GAIE printer column, the spec.versions.gatewayApiInferenceExtension field, and the "Inference Extension" wording, and regenerated schemas. Nothing read the field after the InferencePool path was dropped.
  • llm-d docstring / PR "Traefik" accuracy — you're right, I'd conflated layers. Reworded: the model HTTPRoute attaches to the workload Envoy gateway; GAIE/inference-aware routing (and Traefik's non-conformance) is the control-plane / #8 (Explore inference-aware routing on the control plane gateway) concern at the layer above. PR description updated too.
  • /ns/deployment rewrite then strip — no-op? Not a no-op: the control plane rewrites the public /<ns>/<service>/ to the replica's /<ns>/<replica>/, which must survive to the workload gateway so it can address the right replica among multiple deployments on that IC; the backend then strips it so the engine sees /v1/. Added a clarifying comment in both backends.
  • The command field — it's the non-vLLM multi-node escape hatch: when set on the engine container it bypasses the built-in vLLM/Ray bootstrap and runs verbatim on both LWS templates, so engines like SGLang (--nnodes/--node-rank/--dist-init-addr, symmetric across pods) work against the LWS_* contract.
  • test_backends.py style — refactored to the Case-table convention (selection + Dynamo stay focused, since they assert dispatch/raises, not manifests).
  • Nits — gke example wrapped at 80; inlined the single-use var.

Deferred (TODO'd in code): sidecars (non-engine containers in the MD template) — out of v0.1 scope and non-trivial for the LWS gang; happy to file an issue. The endpoint-name-derived-namespaces idea I left as a thought for later.

@dennis-upbound

Collaborator Author

Correction to my note above on the GAIE/Traefik layering — I conflated two layers, want to set it straight:

The dropped InferencePool/EPP were composed on the workload cluster and consumed by the workload Envoy gateway (inference-gateway), doing KV-/load-aware pod picking there. So whether to keep them is a workload-gateway question (Envoy's InferencePool v1 support is unconfirmed; or switch the workload gateway to Istio/agentgateway) — Traefik (the control-plane modelplane gateway) never sees them. Saying "Traefik isn't GAIE-conformant, so drop the InferencePool" (and my softer "it's the #8/control-plane concern") was wrong.

Cleaner split:

The code is unchanged (Service routing on the workload gateway is correct either way); I've fixed the wording in the llmd.py docstring and the PR description.

…fik)

The dropped InferencePool/EPP were a workload-gateway (Envoy `inference-gateway`)
concern — pod-level KV-/load-aware picking — not a Traefik one. Reintroducing
them depends on the workload gateway's GAIE-conformance, independent of the
control-plane Traefik gateway. Issue #8 (inference-aware routing across replicas
on the control plane) is a separate layer. Docstring only; no logic change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@negz

negz commented Jun 5, 2026

Collaborator

> out of v0.1 scope and non-trivial for the LWS gang; happy to file an issue

Please do. If we can't actually support multiple containers we shouldn't allow the MD to specify multiple containers. I think we do at the moment.

@dennis-upbound

Collaborator Author

Two things:

On the Traefik/routing question (your "does this matter? llm-d would be blissfully unaware of Traefik") — exactly right, and that's the correction I posted above. The InferenceCluster's stack (llm-d/Envoy inference-gateway) is unaware of Traefik; Traefik (modelplane) only fronts it on the control plane. My original wording wrongly tied the workload InferencePool drop to Traefik — the drop is purely a workload-gateway decision (v0.1 needs no KV-aware picking; Envoy InferencePool v1 conformance unconfirmed). Fixed in the docstring + PR description.

On sidecars — agreed, and good catch that the description even claimed they "pass through" (they were silently dropped). Filed #108. For this PR I've constrained containers to maxItems: 1 (single engine container) in both the ModelDeployment and ModelReplica XRDs, so we no longer advertise multi-container support we don't implement. Relaxing it is the work in #108 (needs design for which containers land on the LWS leader vs workers).

The XRD allowed multiple containers (CEL required exactly one named `engine`
but didn't cap the total) and the description claimed extras "pass through as
sidecars" — but the backends render only `engine` and silently dropped the
rest. Add `maxItems: 1` to the containers array in the ModelDeployment and
ModelReplica XRDs so we don't advertise multi-container support we don't
implement. Real sidecar / multi-container support is tracked in #108.
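
The combined effect of the existing CEL rule and the new `maxItems: 1` can be mimicked in a few lines of Python. This is only a sketch of the two constraints as the commit message states them; it does not reproduce the XRD's actual CEL expression or schema, and the function name is hypothetical.

```python
def validate_containers(containers: list[dict]) -> list[str]:
    """Return validation errors for a ModelDeployment-style containers list."""
    errors = []
    engines = [c for c in containers if c.get("name") == "engine"]
    # Existing rule: exactly one container must be named "engine".
    if len(engines) != 1:
        errors.append("exactly one container must be named 'engine'")
    # New rule: the array is capped at one item, so sidecars are rejected
    # up front instead of being silently dropped by the backends.
    if len(containers) > 1:
        errors.append("containers: at most 1 item allowed (maxItems: 1)")
    return errors
```

With both rules in place, the only list that passes is a single container named `engine`.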

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@negz negz merged commit 0521837 into main Jun 6, 2026
2 checks passed
@negz negz deleted the dennis/drop-kserve-spec branch June 16, 2026 16:57
Development

Successfully merging this pull request may close these issues.

Drop the KServe dependency - compose to llm-d and Dynamo for multi-pod serving

2 participants