Integrate Zenoh#2362
Conversation
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
b66e833 to
f8d2d42
Compare
Greptile SummaryThis PR integrates Zenoh as a first-class transport backend alongside LCM, selectable via
Confidence Score: 4/5Safe to merge for Zenoh-transport users; the DDS transport path has a lock that was accidentally removed. The new Zenoh stack is well-designed and thoroughly tested. One concern stands out: DDSTransport.broadcast() and DDSTransport.subscribe() had their _start_lock guard removed in this PR to match the new ZenohTransport pattern, but DDS and Zenoh don't share the same session-lifetime invariant — dds.stop() can tear down the connection while a concurrent broadcast is mid-publish, which was exactly what the original lock prevented. All other changes are clean. dimos/core/transport.py — the DDSTransport broadcast/subscribe lock removal deserves a second look before merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as dimos CLI / humancli / dtop
participant TF as transport_factory
participant GC as GlobalConfig (DIMOS_TRANSPORT)
participant ZT as ZenohTransport / pZenohTransport
participant ZSP as ZenohSessionPool
participant ZPS as ZenohPubSubBase
participant LCMT as LCMTransport / pLCMTransport
CLI->>GC: apply_transport_arg(argv)
CLI->>TF: make_transport(name, msg_type)
TF->>GC: read transport (zenoh or lcm)
alt "transport == zenoh"
TF->>ZT: ZenohTransport(topic, type)
ZT->>ZPS: "Zenoh(**kwargs)"
ZPS->>ZSP: acquire(ZenohConfig)
ZSP-->>ZPS: shared zenoh.Session
else "transport == lcm"
TF->>LCMT: LCMTransport(topic, type)
end
CLI->>ZT: broadcast(msg)
ZT->>ZPS: publish(topic, lcm_encoded_bytes)
ZPS->>ZPS: _get_publisher(key_expr) cached
ZPS-->>CLI: published
CLI->>ZT: subscribe(callback)
ZT->>ZPS: subscribe(topic, wrapper)
ZPS->>ZSP: session.declare_subscriber(key_expr)
ZSP-->>CLI: unsubscribe fn
Reviews (2): Last reviewed commit: "Merge branch 'main' into paul/feat-integ..." | Re-trigger Greptile |
| def stop(self) -> None: | ||
| """Stop the Zenoh service — does NOT close the shared session.""" | ||
| super().stop() |
There was a problem hiding this comment.
No production path to close stale sessions
stop() deliberately skips closing the shared session, but this means entries in _sessions accumulate for the lifetime of the process with no cleanup hook. In a long-running robot process that restarts the bridge or changes the Zenoh config mid-run, each distinct ZenohConfig.session_key opens a new session that is never closed. Test fixtures work around this by calling session.close() + _sessions.clear() manually, but there is no equivalent atexit/finalizer for production paths. Consider registering an atexit handler or exposing a close_all_sessions() utility so callers have a clean teardown path without needing to reach into module internals.
| undeclared = False | ||
|
|
||
| def unsubscribe() -> None: | ||
| nonlocal undeclared | ||
| if undeclared: | ||
| return | ||
| undeclared = True | ||
| with self._subscriber_lock: | ||
| if sub not in self._subscribers: | ||
| return # Already removed by stop() — stop() owns the undeclare | ||
| self._subscribers.remove(sub) | ||
| sub.undeclare() # type: ignore[no-untyped-call] |
There was a problem hiding this comment.
Non-atomic idempotency guard on
unsubscribe()
The undeclared bool is read and written without a lock, so two threads can both pass if undeclared: before either sets it to True. The behavior is ultimately safe because the subsequent with self._subscriber_lock: check prevents double-undeclare() (the second caller sees sub already removed and returns), but the undeclared flag never achieves its stated purpose of avoiding the lock contention — both callers still acquire the lock. Replacing the flag with a threading.Lock acquired-once pattern (or threading.Event) would make the idempotency guarantee explicit and lock-free for subsequent callers.
| if "pubsubs" in fields_set and pubsubs is not None: | ||
| is_legacy_default = len(pubsubs) == 1 and isinstance(pubsubs[0], LCM) | ||
| if not is_legacy_default: | ||
| return pubsubs | ||
| return _default_pubsubs(getattr(config, "g", config)) |
There was a problem hiding this comment.
[LCM()] explicit override silently falls through to transport default
_resolve_pubsubs treats pubsubs=[LCM()] as the legacy implicit default even when the caller explicitly set the field ("pubsubs" in fields_set). A developer who wants to force LCM-only visualization on a zenoh transport — e.g., to suppress the second Zenoh subscription for debugging — cannot express that intent: their value is indistinguishable from the old default and will be silently replaced by [Zenoh(), LCM()]. The companion test TestBridgePubsubResolution only covers explicit non-LCM overrides, leaving this case untested. Consider a sentinel (e.g., None) to mean "use transport default" rather than overloading the [LCM()] value.
0afc3a9 to
4472fc9
Compare
| def make_transport( | ||
| name: str, msg_type: type | None = None, *, g: GlobalConfig = global_config | ||
| ) -> PubSubTransport[Any]: | ||
| """Construct the active-backend pub/sub transport for a logical channel. |
There was a problem hiding this comment.
Few things, Transport isn't neccessarily a PubSubTransport - we can use TCP and IP address as a setting here etc (in theory, haven't implemented)
in case of PubSubTransport, a string doesn't define a full topic, there is a reason why Topic for LCM and Zenoh is a different object. Zenoh offers QoS settings etc per channel. maybe specific router config etc.
So I'm thinking when global switch zenoh or LCM is used, for lcm that can literally be just LCM(topic_string) but zenoh probably wants reliable delivery for RPC specifically or specific per topic configuration (Image can be unreliable, but not agent messages)
I'm not sure right now what to suggest here - if we can normalize transport requirements across topics "this is reliable", "this is unreliable" or if we need per transport global blueprint config overlay that this global config switch just applies? global overlay seems better to me
| return Topic(topic=topic) | ||
|
|
||
|
|
||
| class ZenohRPC(PubSubRPCMixin[Topic, Any], PickleZenoh): |
There was a problem hiding this comment.
We can do this initially, but Zenoh actually supports RPC on their protocol level, so we dont need to piggyback to pubsub here.
4472fc9 to
2eaac55
Compare
eab7ba7 to
ee8cb1f
Compare
| self._started = False | ||
|
|
||
| def broadcast(self, _, msg) -> None: # type: ignore[no-untyped-def] | ||
| with self._start_lock: | ||
| if not self._started: | ||
| self.start() | ||
| self.dds.publish(self.topic, msg) | ||
| if not self._started: | ||
| self.start() | ||
| self.dds.publish(self.topic, msg) | ||
|
|
||
| def subscribe( | ||
| self, callback: Callable[[T], None], selfstream: Stream[T] | None = None | ||
| ) -> Callable[[], None]: | ||
| with self._start_lock: | ||
| if not self._started: | ||
| self.start() | ||
| return self.dds.subscribe(self.topic, lambda msg, topic: callback(msg)) | ||
| if not self._started: | ||
| self.start() |
There was a problem hiding this comment.
DDSTransport loses atomic start-and-publish guarantee
The PR removed _start_lock from broadcast() and subscribe() in DDSTransport. The original code held the lock around both the start check and the dds.publish() / dds.subscribe() call, serializing those operations against stop(). Without the lock, a concurrent stop() can call dds.stop() while broadcast() is mid-publish — whether that crashes or silently corrupts depends on the underlying DDS implementation. The new ZenohTransport intentionally skips this lock because ZenohSessionPool never closes the session, so a post-stop publish safely re-declares the publisher on the live session. DDS doesn't have the same invariant: dds.stop() can genuinely tear down the connection. The pre-existing _start_lock field is still allocated in __init__, so restoring the lock in broadcast and subscribe is a one-line change per method.
Problem
Closes DIM-XXX
Solution
How to Test
Contributor License Agreement