feat(transport): WebRTC DataChannel transport (cloudflare SFU) #2048
spomichter wants to merge 15 commits into
e56fd1f to 212450b
…e SFU
Implements a new pubsub transport backed by WebRTC DataChannels over Cloudflare's Realtime SFU.
Two new classes in dimos/protocol/pubsub/impl/webrtcpubsub.py:
- CloudflareSession: manages the WebRTC PeerConnection lifecycle. Opens two CF sessions (publisher + subscriber) so a single process can do loopback pubsub. Runs aiortc on a dedicated background asyncio thread with its own ThreadPoolExecutor (so we don't leak asyncio_N worker threads). Uses negotiated=True placeholder DCs with id=100 during transport establishment to avoid stream-id collisions with CF-assigned ids.
- WebRTCPubSub: bytes-on-the-wire pubsub facade matching the LCMPubSubBase / BytesSharedMemory interface (string topics, bytes payloads). Lazily creates pub/sub DataChannel pairs on first publish/subscribe per topic.
Also adds:
- WebRTCTransport in dimos/core/transport.py (mirrors LCMTransport pattern, no encoding - bytes only).
- WebRTC benchmark testcase in dimos/protocol/pubsub/benchmark/testdata.py, gated on aiortc + CF_TELEOP_APP_ID / CF_TELEOP_APP_SECRET env vars.
- Integration test in dimos/protocol/pubsub/impl/test_webrtcpubsub.py covering basic pub/sub, latency, and throughput (all live tests skip without CF credentials).
- aiortc + httpx as new 'webrtc' optional extra in pyproject.toml.
Live benchmark (us-east-2 -> CF edge):
- 64-256B: ~10K msgs/s, 0% loss
- 1KiB: ~7K msgs/s, 0% loss
- >= 64KiB: dropped (above SCTP message size)
- Median single-RTT: ~2.5 ms
…sport
- Add BrokerProvider: DataChannelProvider that works through the hosted teleop broker (dimensional-teleop) instead of directly with CF credentials. Handles session registration, heartbeat loop, and DataChannel creation when an operator joins via the broker's bridge-datachannel API.
- Extend WebRTCTransport with optional msg_type parameter for typed LCM encode/decode with fingerprint-based filtering. Multiple transports can share a single multiplexed DataChannel and each receives only its type.
- Add hosted teleop blueprints (dimos/teleop/hosted/) demonstrating the module-free architecture: make_teleop_hosted_go2() uses pure transport (zero modules), make_teleop_hosted_go2_scaled() adds a thin TeleopScalerModule for speed scaling only.
- Add unit tests for typed mode, fingerprint filtering, multiplexed dispatch, and BrokerProvider credential validation.
- Rebase on main and regenerate uv.lock (resolve conflict)
- Add _LoopbackProvider (in-process, no network) to benchmark testdata
- Enables local WebRTC transport benchmarking without CF credentials
- All 12 message sizes pass locally (2.78s total)
212450b to 2c263eb
❌ 1 Tests Failed:
Greptile Summary
This PR introduces a WebRTC DataChannel transport (
Confidence Score: 5/5
Safe to merge; all the blocking issues from the prior review round have been addressed in this revision.
The previous round surfaced a cluster of defects: double-constructing provider on pickle, silent publish no-op, lifecycle races in start/stop, zombie loop threads, and channel-name collisions. The current code resolves each one: _rebuild_webrtc_transport prevents double construction, BrokerProvider.publish() raises NotImplementedError, AsyncProviderBase sets _started=False at the top of stop() and tears down the loop thread on _connect() failure, and _dc_name appends a SHA1 suffix to prevent collisions. The only new finding is a narrow first-call race in WebRTCTransport.start() that can orphan subscribe_all callbacks, an edge case since subscribe_all is rarely used and the underlying provider is a singleton.
dimos/core/transport.py (WebRTCTransport.start() lazy-init guard) and dimos/protocol/pubsub/impl/webrtc/webrtcpubsub.py (subscribe_all N×M delivery, noted in prior review).
Important Files Changed
Sequence Diagram
sequenceDiagram
participant MT as Module Thread
participant WRT as WebRTCTransport
participant WPS as WebRTCPubSub
participant PC as ProviderConfig
participant P as Provider (singleton)
participant CF as Cloudflare Realtime
MT->>WRT: subscribe(callback)
WRT->>WRT: start() [lazy]
WRT->>PC: config.provider()
PC-->>WRT: Provider singleton (per-process)
WRT->>WPS: WebRTCPubSub(provider)
WRT->>WPS: start()
WPS->>P: start()
P->>CF: POST /sessions/new (pub + sub)
P->>CF: WebRTC ICE/DTLS handshake
CF-->>P: connected
WRT->>WPS: subscribe(topic, _typed_cb)
WPS->>P: subscribe(topic, _wrapped)
P->>CF: POST /datachannels/new (sub)
CF-->>P: DataChannel open
MT->>WRT: broadcast(msg)
WRT->>WPS: publish(topic, lcm_encoded)
WPS->>P: publish(topic, data)
P->>CF: DataChannel.send(data)
CF-->>P: message event (loopback/broker)
P->>WPS: _wrapped(data, topic)
WPS->>WRT: _typed_cb(data, topic)
WRT->>WRT: fingerprint check → lcm_decode
WRT->>MT: callback(typed_msg)
Reviews (13): Last reviewed commit: "docs(webrtc): mark CloudflareProvider as..."
The previous lock regen dropped the `exclude-newer-span` marker, leaving only the frozen `exclude-newer` timestamp. uv then treats every resolve as "cooldown was newly added" and forces a re-resolve against today minus 7 days — which currently excludes md-babel-py 1.2.0 (published 2026-05-15) and breaks `uv sync --extra all` / `uv lock`. Re-adding the span line tells uv the lock was generated with P7D semantics, so the existing pinned versions are honored.
- Remove __init__.py files (project policy: no init files)
- Remove section markers from test_webrtcpubsub.py
- Regenerate all_blueprints.py (adds TeleopScalerModule)
- Fix WebRTCTransport.__reduce__ to preserve msg_type across pickle
- Fix CloudflareProvider.publish() race: snapshot loop ref before use
- Fix CloudflareProvider.subscribe() race: check sub_channels inside lock
- Add comment clarifying TwistStamped→Twist type safety in blueprint
def __reduce__(self):  # type: ignore[no-untyped-def]
    # Provider cannot be pickled (holds sockets/threads); on unpickle
    # a new provider is created from env vars. Preserve msg_type so
    # typed fingerprint filtering survives multiprocessing.
    return (WebRTCTransport, (self.topic,), {"msg_type": self._msg_type})

def __setstate__(self, state: dict) -> None:  # type: ignore[no-untyped-def]
    msg_type = state.get("msg_type")
    self.__init__(self.topic, msg_type=msg_type)  # type: ignore[misc]
__reduce__ double-construction breaks pickling when env vars are absent
Python's pickle protocol for __reduce__ returning a 3-tuple (callable, args, state) first calls callable(*args) — i.e. WebRTCTransport(self.topic) with no provider — before calling __setstate__. That intermediate constructor call immediately tries to build a CloudflareProvider from env vars. If CF_TELEOP_APP_ID / CF_TELEOP_APP_SECRET are not set in the worker process (e.g. the transport was originally built with an explicit BrokerProvider), the reconstruction raises RuntimeError("CF_TELEOP_APP_ID and CF_TELEOP_APP_SECRET required") before __setstate__ is ever reached. When the env vars ARE set, two CloudflareProvider instances are constructed: the first (from the (callable, args) step) is immediately orphaned when __setstate__ creates a second one. The fix is to delegate reconstruction to a module-level helper so __init__ is called exactly once.
There was a problem hiding this comment.
This is correct. This line calls __init__ a second time, so it constructs two CloudflareProvider instances.
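For reference, the double construction is easy to reproduce outside the PR. The sketch below is an illustration only: `Transport`, `ThreeTuple`, `Helper`, and `_rebuild` are hypothetical stand-ins, not the real WebRTCTransport or the helper that landed in the PR. It counts constructor calls during unpickling for the 3-tuple `__reduce__` form and for a module-level helper.

```python
import pickle

INIT_CALLS = []  # records every constructor call, for demonstration only


class Transport:
    """Hypothetical stand-in for a transport whose __init__ is expensive."""

    def __init__(self, topic, msg_type=None):
        INIT_CALLS.append((topic, msg_type))
        self.topic = topic
        self._msg_type = msg_type


class ThreeTuple(Transport):
    def __reduce__(self):
        # pickle first calls ThreeTuple(topic), then __setstate__(state).
        return (ThreeTuple, (self.topic,), {"msg_type": self._msg_type})

    def __setstate__(self, state):
        # Second constructor call on an already-constructed object.
        self.__init__(self.topic, msg_type=state.get("msg_type"))


def _rebuild(cls, topic, msg_type):
    # Module-level helper: the constructor runs exactly once on unpickle.
    return cls(topic, msg_type=msg_type)


class Helper(Transport):
    def __reduce__(self):
        return (_rebuild, (type(self), self.topic, self._msg_type))


def init_calls_on_unpickle(obj):
    """Return (number of __init__ calls during loads, restored msg_type)."""
    data = pickle.dumps(obj)
    INIT_CALLS.clear()
    clone = pickle.loads(data)
    return len(INIT_CALLS), clone._msg_type
```

With the 3-tuple form the counter reaches two per unpickle; with the helper it stays at one, and `msg_type` survives in both cases.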
- Add return type annotations to CloudflareProvider event handlers
- Fix type: ignore codes to match actual mypy errors (attr-defined)
- Add type annotation to __setstate__ dict parameter
- Add type: ignore[arg-type] for WebRTCPubSub→PubSub duck typing in benchmark
- Remove TwistStamped subclass comment from blueprint
46028dc to 3baba45
ch = self._pub_channels.get(topic)
if ch is None:
    ch = self._run_sync(self._ensure_pub(topic))
publish() checks _pub_channels and lazily creates the pub DataChannel with no lock, so two concurrent first-publish calls to the same new topic both see ch = None, dispatch two _ensure_pub coroutines to the event loop, and both pass _ensure_pub's own if topic in self._pub_channels guard before either coroutine stores the channel (they interleave during the HTTP await). The result is two CF REST calls for the same channel name, a race to overwrite _pub_channels[topic], and one thread holding a stale/orphaned channel reference that routes messages to the wrong (or duplicate) CF DataChannel. subscribe() already applies a partial guard by reading topic not in self._sub_channels inside self._lock — the same pattern should be used here.
- ch = self._pub_channels.get(topic)
- if ch is None:
-     ch = self._run_sync(self._ensure_pub(topic))
+ with self._lock:
+     ch = self._pub_channels.get(topic)
+     needs_pub = ch is None
+ if needs_pub:
+     ch = self._run_sync(self._ensure_pub(topic))
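Note that the suggestion still runs the creation step outside the lock, so two threads can both observe a missing channel. A minimal sketch of check-and-create under a single lock follows; `LazyChannels` and its `create` argument are hypothetical stand-ins for the provider and `_ensure_pub`, and the sleep only simulates the HTTP round trip.

```python
import threading
import time


class LazyChannels:
    """Hypothetical per-topic channel cache; create() stands in for _ensure_pub."""

    def __init__(self, create):
        self._create = create
        self._lock = threading.Lock()
        self._channels = {}

    def get(self, topic):
        # The check and the creation happen under one lock, so concurrent
        # first calls for the same topic cannot both create a channel.
        with self._lock:
            ch = self._channels.get(topic)
            if ch is None:
                ch = self._create(topic)
                self._channels[topic] = ch
            return ch


def demo(n_threads=8):
    """Hit one new topic from several threads; return (creations, distinct channels)."""
    created = []

    def slow_create(topic):
        time.sleep(0.01)  # simulate the network round trip
        created.append(topic)
        return object()

    cache = LazyChannels(slow_create)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(cache.get("cmd_vel")))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(created), len({id(r) for r in results})
```

The trade-off is that one lock serializes creation across all topics, and holding a thread lock while blocking on the event loop can deadlock if loop-side callbacks take the same lock. Serializing creation on the loop itself with an asyncio lock is the usual alternative.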
The throughput benchmark blasts 5000 msgs/sec which overflows WebRTC DataChannel SCTP buffers. WebRTC is a low-rate control transport (50-80Hz teleop), not a bulk pipe.
WebRTC transport is still tested by:
- test_webrtc_transport.py (unit tests, mock provider, <1s)
- test_webrtcpubsub.py (CF integration, skipped without creds)
test_webrtcpubsub.py talks to live Cloudflare SFU (~68s) and requires CF_TELEOP_APP_ID + CF_TELEOP_APP_SECRET. Tag with pytestmark = tool so they're excluded by CI's default -m 'not (tool or self_hosted ...)' filter. Run locally with: pytest -m tool
self._external_pub_id = publisher_session_id
self._ordered = ordered
self._max_retransmits = max_retransmits

DataChannel name collision via _sanitize_topic
_sanitize_topic replaces every non-alphanumeric/underscore/dash character with _. This means two distinct topic names that differ only in those characters — e.g., "cmd/vel" and "cmd_vel" — both produce "pub_cmd_vel" as the CF DataChannel name. When _ensure_pub is then called for the second topic, it issues a second POST /datachannels/new with the same dataChannelName in the same CF session. CF either rejects this with an error (the assert fires) or creates a duplicate channel, silently mixing or dropping messages between the two topics. The collision is invisible at the Python dict level because _pub_channels is keyed by the original topic string, not the sanitized name — so neither guard in _ensure_pub catches it.
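One way to keep the sanitized name readable while avoiding the collision is to append a short digest of the original topic. This is a sketch under assumptions: `sanitize` approximates the behaviour described above, and `dc_name`, its prefix argument, and the 8-character digest length are illustrative choices rather than the PR's actual helper.

```python
import hashlib
import re


def sanitize(topic: str) -> str:
    # Lossy: every character outside [A-Za-z0-9_-] becomes "_".
    return re.sub(r"[^A-Za-z0-9_-]", "_", topic)


def dc_name(prefix: str, topic: str) -> str:
    # A digest of the *original* topic keeps distinct topics distinct
    # even when their sanitized forms are identical.
    digest = hashlib.sha1(topic.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{sanitize(topic)}_{digest}"
```

Because the digest is a pure function of the topic, both peers derive the same channel name without coordination.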
Delete test_import and test_sanitize_dc_name (no useful coverage). Move pytestmark from module-level to individual test functions.
@greptile review
    await self._http.delete(
        f"{self._broker_url}/api/v1/sessions/{self._session_id}",
        headers=self._headers,
    )
except Exception:
    pass  # Best-effort cleanup

if self._pc:
    await self._pc.close()
    self._pc = None
if self._http:
    await self._http.aclose()

Zombie event-loop thread on start() failure
If _run_sync(self._connect()) raises (broker unreachable, ICE timeout, network error), self._started stays False but self._thread is still alive — blocked forever on _stop_ev.wait(). A subsequent start() call sees _started == False, creates a fresh thread, and overwrites self._loop/self._stop_ev, orphaning the old thread. Every failed start leaks one daemon thread and its event loop. The same pattern exists in CloudflareProvider.start() (~line 693–702).
def start(self) -> None:
    with self._lock:
        if self._started:
            return
        self._thread = threading.Thread(target=self._run_loop, daemon=True, name="cf-webrtc")
        self._thread.start()
        if not self._ready.wait(timeout=5.0):
            raise RuntimeError("CF event loop failed to start")
        self._run_sync(self._connect())
        self._started = True

State corruption on start() retry after a failed _connect()
If _run_sync(self._connect()) raises (e.g., network error, CF 5xx, ICE timeout), self._started stays False but self._thread, self._loop, and self._ready remain in a partially-initialized state. The next start() call then:
- Creates a new Thread2 and sets self._thread.
- self._ready.wait(5.0) returns immediately because the event was already set by Thread1; it is never cleared on a failed start.
- _run_sync(self._connect()) fires before Thread2 has had a chance to run, so self._loop might still point to loop1 (Thread1's loop, which is still alive). If _connect() succeeds this time, self._started = True, but when Thread2 eventually overwrites self._loop = loop2, every subsequent publish()/subscribe() routes coroutines to loop2 while all channel state was built on loop1. This is silent state corruption.
The fix is to reset self._ready, self._loop, and self._thread in the exception path of start(), or (better) to call stop() in the exception handler so the cleanup path is unified.
The same pattern exists in BrokerProvider.start() at line 166 for identical reasons.
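A minimal sketch of a start() that cannot leave a half-initialized loop behind, assuming a simplified provider: `LoopThread` and its `connect` argument are hypothetical and stand in for the provider base and `_connect()`. Each attempt gets a fresh loop and ready-event, and a failed attempt stops and joins its own thread before re-raising.

```python
import asyncio
import threading


class LoopThread:
    """Hypothetical provider base that owns one event-loop thread."""

    def __init__(self, connect):
        self._connect = connect  # async callable; stands in for _connect()
        self._lock = threading.Lock()
        self._started = False
        self._loop = None
        self._thread = None

    @staticmethod
    def _run(loop, ready):
        asyncio.set_event_loop(loop)
        loop.call_soon(ready.set)  # signal readiness from inside the loop
        loop.run_forever()
        loop.close()

    def start(self):
        with self._lock:
            if self._started:
                return
            # Fresh loop and event per attempt: nothing leaks from a failed start.
            loop, ready = asyncio.new_event_loop(), threading.Event()
            thread = threading.Thread(target=self._run, args=(loop, ready), daemon=True)
            thread.start()
            try:
                if not ready.wait(timeout=5.0):
                    raise RuntimeError("event loop failed to start")
                fut = asyncio.run_coroutine_threadsafe(self._connect(), loop)
                fut.result(timeout=10.0)
            except BaseException:
                # Tear down this attempt's thread before surfacing the error.
                loop.call_soon_threadsafe(loop.stop)
                thread.join(timeout=5.0)
                raise
            # Publish state only after the connection succeeded.
            self._loop, self._thread, self._started = loop, thread, True

    def stop(self):
        with self._lock:
            if not self._started:
                return
            self._started = False
            self._loop.call_soon_threadsafe(self._loop.stop)
            self._thread.join(timeout=5.0)
            self._loop = self._thread = None
```

Keeping the loop and thread in locals until the connect step succeeds means a retry never observes state from an earlier attempt.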
def broadcast(self, _, msg) -> None:  # type: ignore[no-untyped-def]
    if not self._started:
        self.start()
    if self._msg_type is not None and hasattr(msg, "lcm_encode"):
This suggests that you shouldn't use msg_type: type[T] | None, but a bound type var. If that's done, then there's no need for hasattr(msg, "lcm_encode") because we'll know at static analysis time that a particular type cannot be used.
from dimos.msgs.protocol import DimosMsg
M = TypeVar("M", bound=DimosMsg)
And then change __init__ to take msg_type: type[M] | None.
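A self-contained sketch of the idea, assuming a structural protocol in place of the real DimosMsg: `LcmEncodable`, `TypedTransport`, and `Ping` are hypothetical names used only for illustration.

```python
from typing import Generic, Optional, Protocol, Type, TypeVar


class LcmEncodable(Protocol):
    """Hypothetical stand-in for DimosMsg: anything that can LCM-encode itself."""

    def lcm_encode(self) -> bytes: ...


M = TypeVar("M", bound=LcmEncodable)


class TypedTransport(Generic[M]):
    def __init__(self, topic: str, msg_type: Optional[Type[M]] = None) -> None:
        self.topic = topic
        self._msg_type = msg_type

    def encode(self, msg: M) -> bytes:
        # No hasattr() check needed: the bound guarantees lcm_encode exists,
        # and a type checker rejects messages that lack it.
        return msg.lcm_encode()


class Ping:
    """Toy message type that satisfies the protocol structurally."""

    def __init__(self, seq: int) -> None:
        self.seq = seq

    def lcm_encode(self) -> bytes:
        return self.seq.to_bytes(4, "big")
```

Passing a type without `lcm_encode` as `msg_type` then fails static analysis instead of silently falling through a runtime `hasattr` branch.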
def _typed_cb(data: bytes, _topic: str) -> None:
    if len(data) >= 8 and data[:8] == fp:
        callback(msg_type.lcm_decode(data))  # type: ignore[attr-defined]
- callback(msg_type.lcm_decode(data))  # type: ignore[attr-defined]
+ callback(msg_type.lcm_decode(data))
Not needed once you do the DimosMsg comment from above.
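For context, the 8-byte prefix check in `_typed_cb` is what lets several typed transports share one multiplexed channel. A minimal sketch of that demux, with made-up fingerprint values (real LCM fingerprints are hashes of the type definition) and a hypothetical `make_filter` helper:

```python
import struct
from typing import Callable, List

Handler = Callable[[bytes], None]


def make_filter(fingerprint: bytes, on_match: Handler) -> Handler:
    """Return a callback that forwards only payloads starting with fingerprint."""
    if len(fingerprint) != 8:
        raise ValueError("fingerprint must be 8 bytes")

    def _cb(data: bytes) -> None:
        if len(data) >= 8 and data[:8] == fingerprint:
            on_match(data)

    return _cb


# Hypothetical fingerprints, chosen only so the two types differ.
FP_TWIST = struct.pack(">Q", 0x1111111111111111)
FP_POSE = struct.pack(">Q", 0x2222222222222222)

twists: List[bytes] = []
poses: List[bytes] = []
subscribers = [make_filter(FP_TWIST, twists.append), make_filter(FP_POSE, poses.append)]

# Every subscriber sees every payload on the shared channel; each keeps its own type.
for payload in (FP_TWIST + b"a", FP_POSE + b"b", b"short", FP_TWIST + b"c"):
    for cb in subscribers:
        cb(payload)
```

Payloads shorter than the prefix, or carrying another type's fingerprint, are dropped before any decode is attempted.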
def __setstate__(self, state: dict[str, Any]) -> None:
    msg_type = state.get("msg_type")
    self.__init__(self.topic, msg_type=msg_type)  # type: ignore[misc]
Calling __init__ is quite odd, especially since by this point it has already been called. I guess the reason you have to do this is because __reduce__ only allows restoring through the args and not kwargs.
The better way is to use a factory function. No need for __setstate__ at all:
@classmethod
def _from_pickle(cls, topic: str, msg_type: type[M] | None) -> "WebRTCTransport[M]":
    return cls(topic, msg_type=msg_type)

def __reduce__(
    self,
) -> tuple[Callable[[str, type[M] | None], "WebRTCTransport[M]"], tuple[str, type[M] | None]]:
    return (self.__class__._from_pickle, (self.topic, self._msg_type))

        return self.dds.subscribe(self.topic, lambda msg, topic: callback(msg))


class WebRTCTransport(PubSubTransport[T]):
How is this intended to be used?
What concerns me is that by having one Transport per field, we're wasting a lot of resources. It might make sense for LCM which is more lightweight, but CloudflareProvider is quite a heavy thing, intended to be used to transport multiple channels.
Is this just meant to be used for 1 or 2 In/Out fields? Because if it's used for whole modules, then it would be quite wasteful.
(I mentioned this before. The root issue is that we don't usually maintain relations between objects in memory. If we had a root TransportProvider object, that object could maintain active transports and share resources which are common (so different transports could share the same CloudflareProvider) and it could also gracefully shut things down when they're no longer needed.)
There was a problem hiding this comment.
IDK right now and can think more on this, but basically individual pubsub implementations can maintain their own global registry if that is needed and more efficient.
We previously had an LCM implementation which shared a single cpp LCM instance. Implementations can own this, so we don't have to worry about it on the user end.
@property
def is_connected(self) -> bool:
    return self._started
Needs with self._lock because it's also updated with the lock.
def _on_msg(payload: Any) -> None:
    if isinstance(payload, str):
        payload = payload.encode()
    for cb in list(cbs.get(topic, ())):
cbs is modified under a lock so you need to use a lock to read the topics here too.
You're doing list(...) here to avoid looping over a list that could be altered from a different thread. Although this works, it relies on list() being C code that runs while holding the GIL. In free-threaded builds of Python, where the GIL is removed, this could fail. So both the .get and the list duplication need to be done under a lock.
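A sketch of that pattern, assuming a simplified registry: `CallbackRegistry` is a hypothetical class, not the provider's real data structure. The lookup and the copy both happen under the lock, and the callbacks run outside it.

```python
import threading
from collections import defaultdict
from typing import Callable, DefaultDict, List, Union

Callback = Callable[[bytes, str], None]


class CallbackRegistry:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cbs: DefaultDict[str, List[Callback]] = defaultdict(list)

    def add(self, topic: str, cb: Callback) -> Callable[[], None]:
        with self._lock:
            self._cbs[topic].append(cb)

        def remove() -> None:
            with self._lock:
                if cb in self._cbs[topic]:
                    self._cbs[topic].remove(cb)

        return remove

    def dispatch(self, topic: str, payload: Union[str, bytes]) -> int:
        if isinstance(payload, str):
            payload = payload.encode()
        # Both the lookup and the copy happen while holding the lock.
        with self._lock:
            snapshot = list(self._cbs.get(topic, ()))
        # Callbacks run outside the lock so they may subscribe or unsubscribe.
        for cb in snapshot:
            cb(payload, topic)
        return len(snapshot)
```

Running callbacks outside the lock avoids deadlock when a callback re-enters the registry, at the cost that a callback removed mid-dispatch may still fire once.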
# ─── Public API (DataChannelProvider) ────────────────────────────

def publish(self, topic: str, data: bytes) -> None:
    if not self._started:
Need thread-safe access to self._started.
logger = setup_logger()


class DataChannelProvider(ABC):
Since all the methods are abstract, DataChannelProvider could work better as a Protocol.
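A rough sketch of what the Protocol version could look like. The method set shown here (start, stop, publish, subscribe) is an assumption based on the calls visible in this review, and `InMemoryProvider` is a hypothetical implementation that satisfies it without inheriting from anything.

```python
from typing import Callable, Dict, List, Protocol, runtime_checkable

Callback = Callable[[bytes, str], None]


@runtime_checkable
class Provider(Protocol):
    """Structural interface: any class with these methods qualifies."""

    def start(self) -> None: ...
    def stop(self) -> None: ...
    def publish(self, topic: str, data: bytes) -> None: ...
    def subscribe(self, topic: str, callback: Callback) -> Callable[[], None]: ...


class InMemoryProvider:  # note: no base class
    def __init__(self) -> None:
        self._subs: Dict[str, List[Callback]] = {}

    def start(self) -> None:
        pass

    def stop(self) -> None:
        self._subs.clear()

    def publish(self, topic: str, data: bytes) -> None:
        for cb in list(self._subs.get(topic, ())):
            cb(data, topic)

    def subscribe(self, topic: str, callback: Callback) -> Callable[[], None]:
        self._subs.setdefault(topic, []).append(callback)
        return lambda: self._subs[topic].remove(callback)
```

`runtime_checkable` only verifies that the methods exist, not their signatures, so static checking remains the main benefit.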
skip_unless_cf = pytest.mark.skipif(
    not (WEBRTC_AVAILABLE and CF_CREDS_PRESENT),
    reason="Requires aiortc + CF_TELEOP_APP_ID/CF_TELEOP_APP_SECRET",
)
I don't think tests should check imports.
Checking imports makes sense for users so that we support smaller install times.
But for developers, tests should fail, not "silently pass" if you're missing dependencies.
There's the potential that a person is writing a feature, breaks these tests, but he doesn't know because he doesn't have the necessary packages installed. The same thing can happen in CI, where tests "pass" only because we forgot to install the dependencies. (If I'm not mistaken, this is already the case since the webrtc group has not been included in all.)
There was a problem hiding this comment.
Kinda, but we'll also need to help downstream packagers with running tests etc. Many of these packagers will not be able to use API keys and may run the tests with no network access. So, having a simple way to skip tests that have such dependencies makes sense.
I think it makes sense for a default pytest run to pass without additional setup needed. Then we need a way to ensure our team is running the extra tests.
    "cyclonedds>=0.10.5",
]

webrtc = [
Needs to be included in all, below.
import threading
from typing import Any

from dimos.protocol.pubsub.impl.webrtcpubsub import DataChannelProvider
why not put this DataChannelProvider in webrtc_providers/spec.py
@pytest.mark.tool
@skip_unless_cf
@pytest.mark.timeout(60)
def test_basic_pub_sub(pubsub: WebRTCPubSub) -> None:
main tests to pass for pubsub are standardized grid tests in pubsub/spec.py - these check that you actually behave like other pubsubs
and you have a benchmark which is good for comparison, at pubsub/benchmark
python -m pytest -svk "not bytes and not udp" -m tool dimos/protocol/pubsub/benchmark/test_benchmark.py
@pytest.mark.tool
@skip_unless_cf
@pytest.mark.timeout(60)
def test_latency(pubsub: WebRTCPubSub) -> None:
definitely one for the benchmark, no need to DIY this
logger = setup_logger()
why not name this Provider and move it to webrtc_providers/spec.py
even better, host your stuff in your dir (because webrtc is complex)
protocol/pubsub/impl/webrtc
protocol/pubsub/impl/webrtc/webrtcpubsub.py
protocol/pubsub/impl/webrtc/test_webrtcpubsub.py
protocol/pubsub/impl/webrtc/providers
protocol/pubsub/impl/webrtc/providers/spec.py
protocol/pubsub/impl/webrtc/providers/cloudflare
There was a problem hiding this comment.
done moved into this format
# Start heartbeat loop
assert self._loop is not None
self._loop.create_task(self._heartbeat_loop())
We should really hold onto this task and on shutdown do:
t.cancel()
with suppress(asyncio.CancelledError):
    await t
Or use aiojobs (in which case the exception log inside the loop shouldn't be needed anymore).
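Spelled out as a runnable sketch (the `Heartbeat` class and its counter are hypothetical; the counter stands in for the broker heartbeat request):

```python
import asyncio
from contextlib import suppress
from typing import Optional


class Heartbeat:
    def __init__(self, interval_s: float) -> None:
        self._interval_s = interval_s
        self._task: Optional[asyncio.Task] = None
        self.beats = 0

    async def _loop(self) -> None:
        while True:
            self.beats += 1  # stand-in for the real heartbeat request
            await asyncio.sleep(self._interval_s)

    def start(self) -> None:
        # Keep a strong reference: the event loop only holds tasks weakly.
        self._task = asyncio.get_running_loop().create_task(self._loop())

    async def stop(self) -> None:
        task, self._task = self._task, None
        if task is None:
            return
        task.cancel()
        with suppress(asyncio.CancelledError):
            await task  # wait until the cancellation has actually landed


async def demo():
    hb = Heartbeat(0.01)
    hb.start()
    await asyncio.sleep(0.05)
    task = hb._task
    await hb.stop()
    return hb.beats, task.cancelled()
```

Awaiting the cancelled task in stop() guarantees the loop body is no longer running by the time the HTTP client or PeerConnection is closed.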
@pytest.fixture
def pubsub() -> Generator[WebRTCPubSub, None, None]:
Really need to put this in CLAUDE.md...
- def pubsub() -> Generator[WebRTCPubSub, None, None]:
+ def pubsub() -> Iterator[WebRTCPubSub]:
- WebRTCPubSub now extends AllPubSub[str, bytes] from the pubsub spec, making it a first-class DimOS pubsub (same as LCM, SHM, Redis)
- DataChannelProvider changed from ABC to Protocol (per review feedback)
- Implements subscribe_all via fan-out on per-topic subscriptions
- Added WebRTCPubSub to the standardized grid tests in test_spec.py using MockProvider (no network, runs in CI)
- Enables encoder mixin composition (LCMEncoderMixin, PickleEncoderMixin)
- Gains free sugar methods: sub(), aiter(), queue()
…ew/webrtc-transport
# Conflicts:
#   dimos/robot/all_blueprints.py
#   pyproject.toml
#   uv.lock
…le provider configs
- Move webrtcpubsub + providers into protocol/pubsub/impl/webrtc/ with providers/spec.py (Provider protocol, ProviderConfig, AsyncProviderBase)
- ProviderConfig: picklable, hashable factory resolving to a per-process singleton provider. Transports survive pickling into module workers and share one PeerConnection per process
- WebRTCTransport rebuilt on DimosMsg-bound typevar; CloudflareTransport subclass binds BrokerConfig for blueprint use
- Fingerprint filter now derives from the wire format (TwistStamped inherits Twist's fingerprint but encodes as LCM TwistStamped)
- BrokerProvider: operator rejoin via SCTP id tracking, heartbeat task held and cancelled on disconnect, X-Robot-API-Key auth, id=0 throwaway channel, publish() raises (broker is receive-only for now)
- CloudflareProvider: locking discipline, asyncio channel-creation lock, collision-safe DC names
- Benchmark: WebRTC case in the standard harness, env-overridable knobs (DIMOS_BENCH_DURATION_S / _MAX_MESSAGES / _RECEIVE_TIMEOUT_S)
- teleop-hosted-go2-transport: transport-only go2 blueprint (3 lines)
- Delete dimos/teleop/hosted (duplicate scaler), add webrtc extra to all
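The "picklable, hashable factory resolving to a per-process singleton" idea from the commit above can be sketched as follows. This is an illustration under assumptions: `Config`, its fields, `FakeProvider`, and the module-level registry are hypothetical and are not the real ProviderConfig.

```python
import pickle
import threading
from dataclasses import dataclass
from typing import Dict

# Module-level registry: each process that imports this module gets its own.
_PROVIDERS: Dict["Config", "FakeProvider"] = {}
_PROVIDERS_LOCK = threading.Lock()


class FakeProvider:
    """Stand-in for a heavy provider that holds sockets and threads."""

    def __init__(self, config: "Config") -> None:
        self.config = config


@dataclass(frozen=True)  # frozen + eq gives value-based __hash__ and __eq__
class Config:
    broker_url: str
    robot_id: str

    def provider(self) -> FakeProvider:
        # Equal configs resolve to the same provider within one process.
        with _PROVIDERS_LOCK:
            p = _PROVIDERS.get(self)
            if p is None:
                p = _PROVIDERS[self] = FakeProvider(self)
            return p


def roundtrip(cfg: Config) -> Config:
    # Only the small config crosses the pickle boundary, never the provider.
    return pickle.loads(pickle.dumps(cfg))
```

Because only the config is pickled, a transport can be shipped to a worker process and lazily resolve its provider there, and every transport holding an equal config shares that one instance.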
def subscribe(self, topic: str, callback: Callable[[bytes, str], None]) -> Callable[[], None]:
    if not self._started:
        self.start()

    def _wrapped(data: bytes, t: str) -> None:
        callback(data, t)
        for all_cb in list(self._all_callbacks):
            try:
                all_cb(data, t)
            except Exception:
                logger.exception("subscribe_all callback error")

    return self._provider.subscribe(topic, _wrapped)
subscribe_all fires N times per message when N topic subscriptions exist
Each subscribe() call wraps the callback in _wrapped, which calls every entry in _all_callbacks on delivery. With N subscriptions on the same topic (e.g., two typed transports both subscribing to "cmd_unreliable" on the same WebRTCPubSub instance, or the test_multiple_subscribers spec test), each inbound message triggers N _wrapped closures, and each one fires all _all_callbacks — so a single message causes every subscribe_all callback to execute N times. The AllPubSub contract requires each subscribe_all callback to receive each message exactly once.
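One possible shape for exactly-once delivery is to register a single dispatcher per topic with the provider and fan out from there. The sketch below is an assumption-laden illustration: `FanOut` and the `provider_subscribe` callable are hypothetical, unsubscribe is omitted, and subscribe_all still only sees topics that have at least one per-topic subscription.

```python
import threading
from typing import Callable, Dict, List

Callback = Callable[[bytes, str], None]


class FanOut:
    """Hypothetical pubsub facade: one provider subscription per topic."""

    def __init__(self, provider_subscribe: Callable[[str, Callback], object]) -> None:
        self._provider_subscribe = provider_subscribe
        self._lock = threading.Lock()
        self._topic_cbs: Dict[str, List[Callback]] = {}
        self._all_cbs: List[Callback] = []

    def subscribe(self, topic: str, cb: Callback) -> None:
        with self._lock:
            first = topic not in self._topic_cbs
            self._topic_cbs.setdefault(topic, []).append(cb)
        if first:
            # Only the first subscriber for a topic touches the provider.
            self._provider_subscribe(topic, self._dispatch)

    def subscribe_all(self, cb: Callback) -> None:
        with self._lock:
            self._all_cbs.append(cb)

    def _dispatch(self, data: bytes, topic: str) -> None:
        with self._lock:
            targets = list(self._topic_cbs.get(topic, ())) + list(self._all_cbs)
        # One pass per message: each callback fires once, however many
        # per-topic subscribers exist.
        for cb in targets:
            cb(data, topic)
```

With N subscribers on a topic the provider still delivers each message to one dispatcher, so each subscribe_all callback runs once per message.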
        return CloudflareProvider(self)


class CloudflareProvider(AsyncProviderBase):
To clarify: this is for TEST AND BENCHMARK only. In production, this Cloudflare provider sits on EC2 in the teleop server, and broker.py communicates with that server.
Problem
DimOS has LCM, SHM, DDS, and ROS transports for local/LAN communication but no transport for internet-scale real-time data (NAT traversal, global edge routing). The hosted teleop system currently requires a dedicated `HostedTeleopModule` to bridge WebRTC DataChannels to DimOS streams.
Solution
A WebRTC DataChannel transport backed by Cloudflare Realtime, following the same pubsub abstraction as `LCMTransport`/`SHMTransport`.
Architecture (`dimos/protocol/pubsub/impl/webrtc/`)
- `providers/spec.py`: `Provider` protocol (Cloudflare, broker, LiveKit, …), `AsyncProviderBase` (shared loop-thread lifecycle), and `ProviderConfig`, a picklable, hashable factory that resolves to a per-process singleton provider. Transports survive pickling into module worker processes, and all transports in a process share one PeerConnection.
- `providers/broker.py`: broker-mediated CF Realtime (hosted teleop). Heartbeat-driven channel lifecycle, including operator rejoin (SCTP id changes); receive-only until the broker bridges robot→operator channels.
- `providers/cloudflare.py`: direct CF access (pub+sub session loopback pair); used by integration tests and the benchmark.
- `webrtcpubsub.py`: `WebRTCPubSub(AllPubSub[str, bytes])`, passes the standard grid tests in `pubsub/test_spec.py`.
- `WebRTCTransport[M: DimosMsg]` (`core/transport.py`): typed LCM encode/decode plus wire-fingerprint demux on a multiplexed channel; the `CloudflareTransport` subclass binds `BrokerConfig` for blueprints:
Fingerprints are derived from the wire format, not `_get_packed_fingerprint()`. `TwistStamped` inherits `Twist`'s fingerprint but encodes as LCM `TwistStamped`, so class fingerprints would drop every real message (regression-tested).
Benchmark (standard harness, sustained 1s windows, this machine → CF edge)
CF paces DataChannel forwarding at ~1k msg/s per channel, plenty for teleop command planes (50–100 Hz). Same-host loopback RTT through the CF edge: median 15.6 ms. Benchmark knobs are env-overridable (`DIMOS_BENCH_DURATION_S`, `DIMOS_BENCH_MAX_MESSAGES`, `DIMOS_BENCH_RECEIVE_TIMEOUT_S`) for networked runs.
Breaking changes
None: new optional transport (`pip install dimos[webrtc]`, included in `all`). `dimos/teleop/hosted/` (duplicate twist scaler) is deleted; the hosted blueprints from #2411 are untouched.
How to test
Fast (CI, no credentials, <2s):
Covers typed/raw modes, fingerprint demux, pickle round-trip, per-process provider sharing, broker credential validation.
Cloudflare integration (tool marker, ~25s):
Blueprint e2e (tool marker, ~17s): deploys two modules through a real `ModuleCoordinator`, transport pickled into the worker, TwistStamped over live CF:
Follow-ups (next phases)
Video track + state channels (clock sync, telemetry) move into the broker provider, then the hosted blueprints switch to transport-only and `HostedTeleopModule` slims to engagement logic.
Contributor License Agreement