fix: prevent lease assignment to non-ready exporters #426

Merged
mangelajo merged 4 commits into main from fix/e2e-wait-for-exporter-available
Apr 8, 2026
@ambient-code ambient-code Bot commented Apr 8, 2026

Contributor

Summary

  • Root cause fix (controller): Add filterOutNotReadyExporters to the lease controller's assignment pipeline so exporters still running hooks (AfterLeaseHook, BeforeLeaseHook, etc.) are not assigned new leases
  • Defense-in-depth (e2e): Wait for exporterStatus=Available in e2e test helpers before starting new tests, preventing races even if a controller bug is reintroduced

Fixes #425

Root Cause

The lease controller's reconcileStatusExporterRef only checked Online + Registered conditions before assigning a lease to an exporter. These conditions are set at initial registration and never cleared between leases. The ExporterStatusValue field (which tracks the exporter's actual operational state — Available, AfterLeaseHook, BeforeLeaseHook, LeaseReady, etc.) was completely ignored during lease assignment.

This meant an exporter still running cleanup from a previous lease (e.g., AfterLeaseHook status) could be assigned a new lease. The client would then try to Dial() the exporter, but the exporter's serve() loop was still blocked on after_lease_hook_done.wait() and couldn't process the new lease. After the 30s dial timeout, the client would fail with "Connection to exporter lost".
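The gap described above can be illustrated with a small toy model. This is a sketch only: the `Exporter` struct and the `eligibleByConditions` function are hypothetical simplifications, not the controller's real types or code. It shows how a check that looks only at the Online and Registered conditions still accepts an exporter whose status is AfterLeaseHook.

```go
package main

import "fmt"

// Exporter is a hypothetical, simplified stand-in for the real Exporter
// resource. The real type lives in the controller's API package.
type Exporter struct {
	Name       string
	Conditions map[string]bool // e.g. "Online", "Registered"
	Status     string          // e.g. "Available", "AfterLeaseHook"
}

// eligibleByConditions models the pre-fix behaviour: only the conditions are
// consulted and the operational status is ignored.
func eligibleByConditions(e Exporter) bool {
	return e.Conditions["Online"] && e.Conditions["Registered"]
}

func main() {
	busy := Exporter{
		Name:       "exporter-a",
		Conditions: map[string]bool{"Online": true, "Registered": true},
		Status:     "AfterLeaseHook", // still cleaning up the previous lease
	}
	// The condition-only check accepts the exporter although it is busy.
	fmt.Printf("%s eligible=%v status=%s\n", busy.Name, eligibleByConditions(busy), busy.Status)
}
```

Because the conditions never change between leases, the only field that distinguishes a busy exporter from an idle one in this model is `Status`.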

Controller Fix

New filterOutNotReadyExporters function in lease_controller.go filters the exporter list after the offline check and before the leased check. Only exporters with Available status (or unset status, for backwards compatibility with older exporters) are eligible for lease assignment. When all online exporters are busy, the lease gets Pending status with reason NotReady and requeues after 1 second.
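A minimal sketch of the filtering rule described above, assuming a simplified `Exporter` struct with a plain string status. The function name comes from this PR, but the signature, the struct, and the `statusAvailable` constant are illustrative assumptions and may not match lease_controller.go.

```go
package main

import "fmt"

// Exporter is a hypothetical, simplified stand-in for the real resource.
type Exporter struct {
	Name   string
	Status string // empty when an older exporter does not report a status
}

// statusAvailable is an assumed constant for the Available status value.
const statusAvailable = "Available"

// filterOutNotReadyExporters keeps exporters that report Available, plus
// exporters with an unset status (backwards compatibility).
func filterOutNotReadyExporters(in []Exporter) []Exporter {
	ready := make([]Exporter, 0, len(in))
	for _, e := range in {
		if e.Status == "" || e.Status == statusAvailable {
			ready = append(ready, e)
		}
	}
	return ready
}

func main() {
	candidates := []Exporter{
		{Name: "exporter-a", Status: "Available"},
		{Name: "exporter-b", Status: "AfterLeaseHook"},
		{Name: "exporter-c"}, // older exporter, status unset
	}
	for _, e := range filterOutNotReadyExporters(candidates) {
		fmt.Println(e.Name)
	}
}
```

In this sketch an empty result would correspond to the case where the lease is set to Pending with reason NotReady and requeued.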

E2E Fix (defense-in-depth)

The wait_for_exporter / wait_for_hooks_exporter helpers now poll .status.exporterStatus for Available in addition to checking k8s conditions, ensuring tests don't start a new lease while an exporter is still cleaning up.
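The real helpers are Bats shell functions. The polling idea itself can be sketched in Go as follows, under the assumption that a caller supplies a function returning the current value of .status.exporterStatus. `statusGetter` and `pollUntilAvailable` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// statusGetter is a hypothetical callback that returns the current value of
// .status.exporterStatus for one exporter.
type statusGetter func() (string, error)

// pollUntilAvailable calls get up to attempts times, sleeping interval
// between tries, and returns nil as soon as the status is "Available".
func pollUntilAvailable(get statusGetter, attempts int, interval time.Duration) error {
	for n := 0; n < attempts; n++ {
		status, err := get()
		if err == nil && status == "Available" {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("exporterStatus did not reach Available after %d attempts", attempts)
}

func main() {
	// Simulated status sequence: the exporter finishes its hook on the third poll.
	seq := []string{"AfterLeaseHook", "AfterLeaseHook", "Available"}
	i := 0
	get := func() (string, error) {
		s := seq[i]
		if i < len(seq)-1 {
			i++
		}
		return s, nil
	}
	if err := pollUntilAvailable(get, 10, 10*time.Millisecond); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("exporter is Available")
}
```

Bounding the number of attempts keeps a stuck exporter from hanging the test run indefinitely.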

Changes

  • controller/internal/controller/lease_controller.go: Add filterOutNotReadyExporters, integrate into assignment pipeline
  • controller/internal/controller/lease_controller_test.go: Add setExporterNotReady helper and two test cases (all-not-ready, partial-not-ready)
  • e2e/tests.bats: Refactor wait_for_exporter to poll exporterStatus=Available
  • e2e/tests-hooks.bats: Same exporterStatus=Available polling

Test plan

  • Controller unit tests pass (new tests for NotReady filtering)
  • E2e tests on ubuntu-24.04 — tests 47-48 should no longer flake
  • E2e hooks tests — no regressions

🤖 Generated with Claude Code

The wait_for_exporter helpers only checked Online+Registered k8s
conditions, which are set at initial registration and never cleared
between leases. This meant tests could start creating new leases
while the exporter was still cleaning up from a previous one, causing
intermittent "Connection to exporter lost" failures.

Add polling for exporterStatus=Available after the condition checks
to ensure the exporter's serve() loop has fully processed any
previous lease-end before the next test runs.

Fixes #425

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@netlify

netlify Bot commented Apr 8, 2026


Deploy Preview for jumpstarter-docs ready!

Name Link
🔨 Latest commit 507a7ad
🔍 Latest deploy log https://app.netlify.com/projects/jumpstarter-docs/deploys/69d63bd60ff9280008078a0f
😎 Deploy Preview https://deploy-preview-426--jumpstarter-docs.netlify.app

@coderabbitai

coderabbitai Bot commented Apr 8, 2026

Contributor

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0c2ed726-4027-4028-817b-802d285a5187

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@mangelajo

Member

@ambient-code This will probably just mask the issue. We need to understand why the exporter does not become ready, or why it is leased while still not ready.

The lease controller only checked Online+Registered conditions before
assigning a lease, ignoring the ExporterStatusValue field. This meant
an exporter still running afterLease/beforeLease hooks from a previous
lease could be assigned a new lease, causing "Connection to exporter
lost" errors when the client tried to dial.

Add filterOutNotReadyExporters to the lease assignment pipeline so only
exporters with Available (or unset, for backwards compat) status are
eligible. This is the root cause fix for issue #425.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code ambient-code Bot changed the title from "fix(e2e): wait for exporter Available status between lease operations" to "fix: prevent lease assignment to non-ready exporters" Apr 8, 2026
@ambient-code

ambient-code Bot commented Apr 8, 2026

Contributor Author

You're right — the e2e fix was masking the real issue. I've now pushed the root cause fix in the lease controller.

The problem: reconcileStatusExporterRef in lease_controller.go only checked Online + Registered conditions before assigning leases. These conditions are set once at registration and never cleared between leases. The ExporterStatusValue field (which accurately tracks Available / AfterLeaseHook / BeforeLeaseHook / etc.) was completely ignored during lease assignment.

So an exporter still running its afterLease hook cleanup could be assigned a new lease. The client would Dial() but the exporter's serve() loop was blocked on after_lease_hook_done.wait(), couldn't process the new lease, and the client timed out with "Connection to exporter lost".

The fix: Added filterOutNotReadyExporters to the lease assignment pipeline (between filterOutOfflineExporters and filterOutLeasedExporters). Only exporters with Available status (or unset, for backwards compat) are eligible. Non-ready exporters get a Pending/NotReady status with a 1s requeue.

The e2e test changes are kept as defense-in-depth.

Comment thread controller/internal/controller/lease_controller.go Outdated
Address review suggestion for consistency with the existing naming
convention (onlineApprovedExporters).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mangelajo mangelajo enabled auto-merge (squash) April 8, 2026 10:33
if len(readyApprovedExporters) == 0 {
	lease.SetStatusPending(
		"NotReady",
		"There are %d online exporters, but none are ready (still cleaning up previous lease)",

Member

The problem with this patch is that the new message "none are ready" will show even if all exporters are leased. We should check this only once we have a list of available exporters to lease (our current availableExporters).

Contributor Author

Good catch — moved filterOutNotReadyExporters to run after filterOutLeasedExporters, so the "NotReady" status only fires when there are genuinely non-ready, unleased exporters. Previously it ran on all online exporters, which would incorrectly report "none are ready" when they were all simply leased.
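The effect of the reordering can be sketched as a small pipeline. This is an illustrative model, not the controller code: the `Exporter` struct, the `keep` helper, `pendingReason`, and the reason strings "NoneOnline" and "AllLeased" are placeholders. Only "NotReady" is taken from this PR.

```go
package main

import "fmt"

// Exporter is a hypothetical, simplified stand-in for the real resource.
type Exporter struct {
	Name   string
	Online bool
	Leased bool
	Status string
}

// keep returns the exporters for which pred is true.
func keep(in []Exporter, pred func(Exporter) bool) []Exporter {
	out := make([]Exporter, 0, len(in))
	for _, e := range in {
		if pred(e) {
			out = append(out, e)
		}
	}
	return out
}

// pendingReason applies the filters in the order offline, leased, not-ready
// and reports which stage emptied the candidate list ("" means assignable).
func pendingReason(all []Exporter) string {
	online := keep(all, func(e Exporter) bool { return e.Online })
	if len(online) == 0 {
		return "NoneOnline" // placeholder reason
	}
	unleased := keep(online, func(e Exporter) bool { return !e.Leased })
	if len(unleased) == 0 {
		return "AllLeased" // placeholder reason
	}
	ready := keep(unleased, func(e Exporter) bool {
		return e.Status == "" || e.Status == "Available"
	})
	if len(ready) == 0 {
		return "NotReady" // reason name from this PR
	}
	return ""
}

func main() {
	allLeased := []Exporter{{Name: "a", Online: true, Leased: true, Status: "LeaseReady"}}
	cleaningUp := []Exporter{{Name: "b", Online: true, Status: "AfterLeaseHook"}}
	fmt.Println(pendingReason(allLeased))
	fmt.Println(pendingReason(cleaningUp))
}
```

With the not-ready filter placed last, a fully leased pool is reported by the leased stage, and "NotReady" is reserved for exporters that are unleased but still cleaning up.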

@mangelajo mangelajo left a comment

Member

see comments.

@mangelajo

Member

@ambient-code please handle comments

Move filterOutNotReadyExporters to run after filterOutLeasedExporters
so the "none are ready" message only appears for genuinely non-ready,
unleased exporters — not when all exporters are simply already leased.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mangelajo mangelajo disabled auto-merge April 8, 2026 11:33
@mangelajo mangelajo requested a review from kirkbrauer April 8, 2026 11:33
@mangelajo

Member

I am retriggering E2E a few times to see if stability is back.

@mangelajo

Member

2 in a row worked, triggering another...

@mangelajo

Member

ok no reds ❌ 🟢 🟢 🟢 🟢

@mangelajo

Member

@kirkbrauer based on the issue conversation I am squash-merging it, and hope for the best 🟢

@mangelajo mangelajo merged commit c562cf2 into main Apr 8, 2026
36 checks passed
raballew added a commit to raballew/jumpstarter that referenced this pull request Apr 16, 2026
… states

Verify that filterOutNotReadyExporters correctly excludes exporters in
HOOK_FAILED and OFFLINE states from lease assignment, serving as a
regression test for the server-side safety net added in PR jumpstarter-dev#426.

Ref: jumpstarter-dev#245
Generated-By: Forge/20260416_202053_681470_8c18858d_i245

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raballew added a commit to raballew/jumpstarter that referenced this pull request Apr 17, 2026
raballew added a commit to raballew/jumpstarter that referenced this pull request Apr 17, 2026
raballew added a commit to raballew/jumpstarter that referenced this pull request Apr 28, 2026
@raballew raballew deleted the fix/e2e-wait-for-exporter-available branch June 5, 2026 11:37
Development

Successfully merging this pull request may close these issues.

e2e: jmp shell fails with "Connection to exporter lost" after ~20s waiting for ready connection

1 participant