Skip to content

telemetry: submit BGP session status onchain per user#3487

Merged
juan-malbeclabs merged 10 commits into
mainfrom
jo/3465-4
Apr 9, 2026
Merged

telemetry: submit BGP session status onchain per user#3487
juan-malbeclabs merged 10 commits into
mainfrom
jo/3465-4

Conversation

@juan-malbeclabs
Copy link
Copy Markdown
Contributor

@juan-malbeclabs juan-malbeclabs commented Apr 8, 2026

Resolves: #3465

Summary of Changes

  • Adds a BGP status submitter to the telemetry daemon that periodically checks BGP session state in the BGP VRF namespace and submits `Up`/`Down` status onchain for each activated user on the device
  • Adds `BGPStatus` type and `SetUserBGPStatus` executor instruction to the Go serviceability SDK
  • Fixes a gap where a disappeared tunnel interface (e.g., after an ungraceful daemon kill) would silently skip the `Down` submission — now correctly transitions from `Up` to `Down` when the tunnel is gone but the user remains activated onchain
  • Introduces a `CachingFetcher` in the telemetry daemon that deduplicates `GetProgramData` RPC calls across the BGP status submitter, ledger peer discovery, and collector within a 5s TTL window using `singleflight`, mirroring the pattern in the client daemon
  • Adds Prometheus metrics for both the caching layer and onchain submission outcomes
  • Adds an e2e test (`TestE2E_UserBGPStatus`) verifying the full `Up` → `Down` lifecycle

Diff Breakdown

Category Files Lines (+/-) Net
Core logic 4 +526 / -0 +526
Scaffolding 5 +161 / -7 +154
Tests 4 +770 / -0 +770
Docs 1 +4 / -0 +4

Mostly new code: a self-contained submitter package, a shared caching layer, and their tests.

Key files (click to expand)

Testing Verification

  • Unit tests cover all tick-loop branches: session established, session down with and without grace period, tunnel not found with last status Up/Down, periodic refresh, in-flight dedup, and submission retry
  • `TestE2E_UserBGPStatus` verified end-to-end: connects a client, waits for BGP session to reach `Established` in vrf1, confirms `BGPStatusUp` appears onchain, then kills doublezerod ungracefully and confirms `BGPStatusDown` appears onchain within 60s

Base automatically changed from jo/3465-3 to main April 8, 2026 16:42
@juan-malbeclabs juan-malbeclabs marked this pull request as draft April 8, 2026 18:12
@juan-malbeclabs juan-malbeclabs marked this pull request as ready for review April 8, 2026 21:01
Comment thread controlplane/telemetry/internal/bgpstatus/submitter.go
@juan-malbeclabs juan-malbeclabs force-pushed the jo/3465-4 branch 2 times, most recently from 4d1f467 to c6fd942 Compare April 8, 2026 22:04
…r instruction

Adds the BGPStatus enum (Unknown/Up/Down) with JSON serialization and
the SetUserBGPStatus instruction (code 106) to the Go serviceability SDK.
Introduces the bgpstatus package, which reads BGP neighbor state from
Arista devices and submits SetUserBGPStatus instructions onchain.
Wires it into the telemetry binary with four new flags:
--bgp-status-enable, --bgp-status-interval,
--bgp-status-refresh-interval, and --bgp-status-down-grace-period.
Adds an end-to-end test that starts a devnet with the BGP status
submitter enabled and verifies the onchain state reflects the actual
BGP session status. Also extends DeviceTelemetrySpec with the four
BGP status flags and updates user list golden files to include the
new bgp_status column.
- In the bgpstatus submitter, when FindLocalTunnel returns
  ErrLocalTunnelNotFound but the last known onchain status is Up,
  fall through with observedUp=false so the Down transition is submitted.
  Previously the submitter always continued on tunnel-not-found, so a
  clean-disconnect scenario (tunnel interface removed) never triggered
  a Down submission.
- Change the e2e disconnect step to kill doublezerod ungracefully
  (SIGKILL) instead of running "doublezero disconnect": a clean
  disconnect deletes the user account onchain before the submitter can
  record Down, whereas killing the daemon drops the BGP session while
  the user remains activated.
- Add volume cleanup to "make e2e-test-cleanup" so persistent ledger
  volumes from test-keep runs don't carry stale state into subsequent
  runs.
Introduce a CachingFetcher in controlplane/telemetry/internal/serviceability/
that wraps any ProgramDataProvider and deduplicates RPC calls within a 5s
TTL window using sync.RWMutex + singleflight, mirroring the pattern in
client/doublezerod/internal/onchain/fetcher.go.

Wire the BGP status submitter through the cached client so multiple
telemetry components can share a single GetProgramData result per window
instead of each issuing independent RPCs.
Add observability for the BGP status pipeline:

CachingFetcher (controlplane/telemetry/internal/serviceability):
- doublezero_telemetry_programdata_fetch_duration_seconds: RPC latency
- doublezero_telemetry_programdata_fetch_total{result}: fetch outcomes
- doublezero_telemetry_programdata_stale_cache_age_seconds: staleness on error

BGP status submitter (controlplane/telemetry/internal/bgpstatus):
- doublezero_bgpstatus_submissions_total{bgp_status,result}: onchain submissions by status and outcome
- doublezero_bgpstatus_submission_duration_seconds: onchain transaction latency
Create a single CachingFetcher instance in main() and pass it to all
three consumers — ledger peer discovery, the telemetry collector, and
the BGP status submitter — so they share one GetProgramData result per
TTL window instead of each issuing independent RPCs.

Change Config.ServiceabilityProgramClient from *serviceability.Client
to the ServiceabilityProgramClient interface so the cached wrapper can
be injected there too.
Comment thread controlplane/telemetry/internal/bgpstatus/submitter.go
Comment thread controlplane/telemetry/internal/serviceability/cache.go
Comment thread controlplane/telemetry/internal/bgpstatus/bgpstatus.go
1. Seed lastOnchainStatus from onchain state on restart.
   userStateFor now accepts an initialStatus arg. The tick loop seeds it
   from user.BgpStatus so a restarted submitter correctly transitions
   Up→Down when the tunnel has disappeared, rather than skipping the user
   because Unknown != Up.

2. Detach singleflight RPC from the first caller's context.
   Use context.WithoutCancel inside the singleflight callback so a
   cancelled caller does not propagate failure to all other waiters.

3. Prune userState on each tick.
   Delete entries for users no longer activated on this device to prevent
   unbounded memory growth as users are deactivated or moved.
A user deactivated while a submission was in-flight would retain a
stale pending flag after the userState entry was pruned, permanently
blocking resubmission if the user was later reactivated.
Comment thread controlplane/telemetry/internal/serviceability/cache.go
@juan-malbeclabs juan-malbeclabs merged commit 478110c into main Apr 9, 2026
33 checks passed
@juan-malbeclabs juan-malbeclabs deleted the jo/3465-4 branch April 9, 2026 14:47
nikw9944 pushed a commit that referenced this pull request Apr 11, 2026
Resolves: #3465

## Summary of Changes
- Adds a BGP status submitter to the telemetry daemon that periodically
checks BGP session state in the BGP VRF namespace and submits
\`Up\`/\`Down\` status onchain for each activated user on the device
- Adds \`BGPStatus\` type and \`SetUserBGPStatus\` executor instruction
to the Go serviceability SDK
- Fixes a gap where a disappeared tunnel interface (e.g., after an
ungraceful daemon kill) would silently skip the \`Down\` submission —
now correctly transitions from \`Up\` to \`Down\` when the tunnel is
gone but the user remains activated onchain
- Introduces a \`CachingFetcher\` in the telemetry daemon that
deduplicates \`GetProgramData\` RPC calls across the BGP status
submitter, ledger peer discovery, and collector within a 5s TTL window
using \`singleflight\`, mirroring the pattern in the client daemon
- Adds Prometheus metrics for both the caching layer and onchain
submission outcomes
- Adds an e2e test (\`TestE2E_UserBGPStatus\`) verifying the full \`Up\`
→ \`Down\` lifecycle

## Diff Breakdown
| Category     | Files | Lines (+/-) | Net   |
|--------------|-------|-------------|-------|
| Core logic   |     4 | +526 / -0   |  +526 |
| Scaffolding  |     5 | +161 / -7   |  +154 |
| Tests        |     4 | +770 / -0   |  +770 |
| Docs         |     1 | +4   / -0   |    +4 |

Mostly new code: a self-contained submitter package, a shared caching
layer, and their tests.

<details>
<summary>Key files (click to expand)</summary>

-
[\`controlplane/telemetry/internal/bgpstatus/bgpstatus.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7e435c1fdcd7c7b5767709008629f8022c298656aaa548d1636b162057d91c0f)
— \`Submitter\` struct, \`Config\`, and pure helpers
(\`buildEstablishedIPSet\`, \`computeEffectiveStatus\`,
\`shouldSubmit\`)
-
[\`controlplane/telemetry/internal/bgpstatus/submitter.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-4f6db3ff3dcb17af124697b7d176a806c8359739f33546f5de230105500cd9c8)
— tick/worker loop: collects BGP socket state, resolves tunnel IPs per
user, enqueues status updates with retry; includes tunnel-not-found →
Down fix
-
[\`e2e/user_bgp_status_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-8b6e16747b259945bc71d9d26b961b1eb8792c690d776448fc8ad26c8209d56a)
— e2e test validating \`Up\` detection after BGP session establishes and
\`Down\` detection after \`pkill -9 doublezerod\`
-
[\`controlplane/telemetry/internal/bgpstatus/submitter_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-6d59f1028aac0b9b2609f4bbdb7c155dbf6bef15fce256034100a276905f0572)
— unit tests for tick logic, grace period, periodic refresh, in-flight
deduplication, and tunnel-not-found → Down transition
-
[\`controlplane/telemetry/internal/serviceability/cache.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-50e9dfc9d61518cf113b7399651a828cdab7f5ff5abfefd443a772139845a315)
— \`CachingFetcher\`: \`sync.RWMutex\` fast-path + \`singleflight\`
slow-path + stale-on-error fallback for \`GetProgramData\`
-
[\`controlplane/telemetry/cmd/telemetry/main.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-ee2c048617227e6110d061752fb7e24fc85ff54c7024cea70c3016e976ee9228)
— wires \`CachingFetcher\` and BGP status submitter; flags:
\`--bgp-status-enable\`, \`--bgp-status-interval\`,
\`--bgp-status-refresh-interval\`, \`--bgp-status-down-grace-period\`
-
[\`smartcontract/sdk/go/serviceability/executor.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7be6c1e59c995ffb57df9bcf3926738f52220f83f5b4e4747fc0dd2d993d1bc8)
— adds \`SetUserBGPStatus\` executor (instruction opcode 106)
-
[\`smartcontract/sdk/go/serviceability/state.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1)
— adds \`BGPStatus\` type with \`Unknown/Up/Down\` constants and JSON
serialization

</details>

## Testing Verification
- Unit tests cover all tick-loop branches: session established, session
down with and without grace period, tunnel not found with last status
Up/Down, periodic refresh, in-flight dedup, and submission retry
- \`TestE2E_UserBGPStatus\` verified end-to-end: connects a client,
waits for BGP session to reach \`Established\` in vrf1, confirms
\`BGPStatusUp\` appears onchain, then kills doublezerod ungracefully and
confirms \`BGPStatusDown\` appears onchain within 60s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PR 2 - Account changes & SDK updates

2 participants