telemetry: submit BGP session status onchain per user#3487
Merged
Conversation
31debd6 to
971082d
Compare
snormore
reviewed
Apr 8, 2026
4d1f467 to
c6fd942
Compare
…r instruction Adds the BGPStatus enum (Unknown/Up/Down) with JSON serialization and the SetUserBGPStatus instruction (code 106) to the Go serviceability SDK.
Introduces the bgpstatus package, which reads BGP neighbor state from Arista devices and submits SetUserBGPStatus instructions onchain. Wires it into the telemetry binary with four new flags: --bgp-status-enable, --bgp-status-interval, --bgp-status-refresh-interval, and --bgp-status-down-grace-period.
Adds an end-to-end test that starts a devnet with the BGP status submitter enabled and verifies the onchain state reflects the actual BGP session status. Also extends DeviceTelemetrySpec with the four BGP status flags and updates user list golden files to include the new bgp_status column.
- In the bgpstatus submitter, when FindLocalTunnel returns ErrLocalTunnelNotFound but the last known onchain status is Up, fall through with observedUp=false so the Down transition is submitted. Previously the submitter always continued on tunnel-not-found, so a clean-disconnect scenario (tunnel interface removed) never triggered a Down submission. - Change the e2e disconnect step to kill doublezerod ungracefully (SIGKILL) instead of running "doublezero disconnect": a clean disconnect deletes the user account onchain before the submitter can record Down, whereas killing the daemon drops the BGP session while the user remains activated. - Add volume cleanup to "make e2e-test-cleanup" so persistent ledger volumes from test-keep runs don't carry stale state into subsequent runs.
Introduce a CachingFetcher in controlplane/telemetry/internal/serviceability/ that wraps any ProgramDataProvider and deduplicates RPC calls within a 5s TTL window using sync.RWMutex + singleflight, mirroring the pattern in client/doublezerod/internal/onchain/fetcher.go. Wire the BGP status submitter through the cached client so multiple telemetry components can share a single GetProgramData result per window instead of each issuing independent RPCs.
Add observability for the BGP status pipeline:
CachingFetcher (controlplane/telemetry/internal/serviceability):
- doublezero_telemetry_programdata_fetch_duration_seconds: RPC latency
- doublezero_telemetry_programdata_fetch_total{result}: fetch outcomes
- doublezero_telemetry_programdata_stale_cache_age_seconds: staleness on error
BGP status submitter (controlplane/telemetry/internal/bgpstatus):
- doublezero_bgpstatus_submissions_total{bgp_status,result}: onchain submissions by status and outcome
- doublezero_bgpstatus_submission_duration_seconds: onchain transaction latency
Create a single CachingFetcher instance in main() and pass it to all three consumers — ledger peer discovery, the telemetry collector, and the BGP status submitter — so they share one GetProgramData result per TTL window instead of each issuing independent RPCs. Change Config.ServiceabilityProgramClient from *serviceability.Client to the ServiceabilityProgramClient interface so the cached wrapper can be injected there too.
43ab70b to
cfc6729
Compare
snormore
reviewed
Apr 8, 2026
1. Seed lastOnchainStatus from onchain state on restart. userStateFor now accepts an initialStatus arg. The tick loop seeds it from user.BgpStatus so a restarted submitter correctly transitions Up→Down when the tunnel has disappeared, rather than skipping the user because Unknown != Up. 2. Detach singleflight RPC from the first caller's context. Use context.WithoutCancel inside the singleflight callback so a cancelled caller does not propagate failure to all other waiters. 3. Prune userState on each tick. Delete entries for users no longer activated on this device to prevent unbounded memory growth as users are deactivated or moved.
A user deactivated while a submission was in-flight would retain a stale pending flag after the userState entry was pruned, permanently blocking resubmission if the user was later reactivated.
snormore
reviewed
Apr 9, 2026
snormore
approved these changes
Apr 9, 2026
nikw9944
pushed a commit
that referenced
this pull request
Apr 11, 2026
Resolves: #3465 ## Summary of Changes - Adds a BGP status submitter to the telemetry daemon that periodically checks BGP session state in the BGP VRF namespace and submits \`Up\`/\`Down\` status onchain for each activated user on the device - Adds \`BGPStatus\` type and \`SetUserBGPStatus\` executor instruction to the Go serviceability SDK - Fixes a gap where a disappeared tunnel interface (e.g., after an ungraceful daemon kill) would silently skip the \`Down\` submission — now correctly transitions from \`Up\` to \`Down\` when the tunnel is gone but the user remains activated onchain - Introduces a \`CachingFetcher\` in the telemetry daemon that deduplicates \`GetProgramData\` RPC calls across the BGP status submitter, ledger peer discovery, and collector within a 5s TTL window using \`singleflight\`, mirroring the pattern in the client daemon - Adds Prometheus metrics for both the caching layer and onchain submission outcomes - Adds an e2e test (\`TestE2E_UserBGPStatus\`) verifying the full \`Up\` → \`Down\` lifecycle ## Diff Breakdown | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|-------| | Core logic | 4 | +526 / -0 | +526 | | Scaffolding | 5 | +161 / -7 | +154 | | Tests | 4 | +770 / -0 | +770 | | Docs | 1 | +4 / -0 | +4 | Mostly new code: a self-contained submitter package, a shared caching layer, and their tests. <details> <summary>Key files (click to expand)</summary> - [\`controlplane/telemetry/internal/bgpstatus/bgpstatus.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7e435c1fdcd7c7b5767709008629f8022c298656aaa548d1636b162057d91c0f) — \`Submitter\` struct, \`Config\`, and pure helpers (\`buildEstablishedIPSet\`, \`computeEffectiveStatus\`, \`shouldSubmit\`) - [\`controlplane/telemetry/internal/bgpstatus/submitter.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-4f6db3ff3dcb17af124697b7d176a806c8359739f33546f5de230105500cd9c8) — tick/worker loop: collects BGP socket state, resolves tunnel IPs per user, enqueues status updates with retry; includes tunnel-not-found → Down fix - [\`e2e/user_bgp_status_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-8b6e16747b259945bc71d9d26b961b1eb8792c690d776448fc8ad26c8209d56a) — e2e test validating \`Up\` detection after BGP session establishes and \`Down\` detection after \`pkill -9 doublezerod\` - [\`controlplane/telemetry/internal/bgpstatus/submitter_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-6d59f1028aac0b9b2609f4bbdb7c155dbf6bef15fce256034100a276905f0572) — unit tests for tick logic, grace period, periodic refresh, in-flight deduplication, and tunnel-not-found → Down transition - [\`controlplane/telemetry/internal/serviceability/cache.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-50e9dfc9d61518cf113b7399651a828cdab7f5ff5abfefd443a772139845a315) — \`CachingFetcher\`: \`sync.RWMutex\` fast-path + \`singleflight\` slow-path + stale-on-error fallback for \`GetProgramData\` - [\`controlplane/telemetry/cmd/telemetry/main.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-ee2c048617227e6110d061752fb7e24fc85ff54c7024cea70c3016e976ee9228) — wires \`CachingFetcher\` and BGP status submitter; flags: \`--bgp-status-enable\`, \`--bgp-status-interval\`, \`--bgp-status-refresh-interval\`, \`--bgp-status-down-grace-period\` - [\`smartcontract/sdk/go/serviceability/executor.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7be6c1e59c995ffb57df9bcf3926738f52220f83f5b4e4747fc0dd2d993d1bc8) — adds \`SetUserBGPStatus\` executor (instruction opcode 106) - [\`smartcontract/sdk/go/serviceability/state.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1) — adds \`BGPStatus\` type with \`Unknown/Up/Down\` constants and JSON serialization </details> ## Testing Verification - Unit tests cover all tick-loop branches: session established, session down with and without grace period, tunnel not found with last status Up/Down, periodic refresh, in-flight dedup, and submission retry - \`TestE2E_UserBGPStatus\` verified end-to-end: connects a client, waits for BGP session to reach \`Established\` in vrf1, confirms \`BGPStatusUp\` appears onchain, then kills doublezerod ungracefully and confirms \`BGPStatusDown\` appears onchain within 60s
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resolves: #3465
Summary of Changes
Diff Breakdown
Mostly new code: a self-contained submitter package, a shared caching layer, and their tests.
Key files (click to expand)
Testing Verification