fix: prevent lease assignment to non-ready exporters#426
Conversation
The wait_for_exporter helpers only checked Online+Registered k8s conditions, which are set at initial registration and never cleared between leases. This meant tests could start creating new leases while the exporter was still cleaning up from a previous one, causing intermittent "Connection to exporter lost" failures. Add polling for exporterStatus=Available after the condition checks to ensure the exporter's serve() loop has fully processed any previous lease-end before the next test runs. Fixes #425 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
✅ Deploy Preview for jumpstarter-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Important Review skippedBot user detected. To trigger a single review, invoke the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@ambient-code This will probably just mask the issue. We need to understand why the exporter does not go ready, or why is it leased when still not ready. |
The lease controller only checked Online+Registered conditions before assigning a lease, ignoring the ExporterStatusValue field. This meant an exporter still running afterLease/beforeLease hooks from a previous lease could be assigned a new lease, causing "Connection to exporter lost" errors when the client tried to dial. Add filterOutNotReadyExporters to the lease assignment pipeline so only exporters with Available (or unset, for backwards compat) status are eligible. This is the root cause fix for issue #425. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
You're right — the e2e fix was masking the real issue. I've now pushed the root cause fix in the lease controller. The problem: So an exporter still running its The fix: Added The e2e test changes are kept as defense-in-depth. |
Address review suggestion for consistency with the existing naming convention (onlineApprovedExporters). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if len(readyApprovedExporters) == 0 { | ||
| lease.SetStatusPending( | ||
| "NotReady", | ||
| "There are %d online exporters, but none are ready (still cleaning up previous lease)", |
There was a problem hiding this comment.
The problem with this patch is that the new message "none are ready" will show even if all are leased. We should check this only once we have a list of available exporters to lease. (our current availableExporters)
There was a problem hiding this comment.
Good catch — moved filterOutNotReadyExporters to run after filterOutLeasedExporters, so the "NotReady" status only fires when there are genuinely non-ready, unleased exporters. Previously it ran on all online exporters, which would incorrectly report "none are ready" when they were all simply leased.
|
@ambient-code please handle comments |
Move filterOutNotReadyExporters to run after filterOutLeasedExporters so the "none are ready" message only appears for genuinely non-ready, unleased exporters — not when all exporters are simply already leased. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
I am retriggering E2E a few times to see if stability is back. |
|
2 in a row worked, triggering another... |
|
ok no reds ❌ 🟢 🟢 🟢 🟢 |
|
@kirkbrauer based on the issue conversation I am squash-merging it, and hope for the best 🟢 |
… states Verify that filterOutNotReadyExporters correctly excludes exporters in HOOK_FAILED and OFFLINE states from lease assignment, serving as a regression test for the server-side safety net added in PR jumpstarter-dev#426. Ref: jumpstarter-dev#245 Generated-By: Forge/20260416_202053_681470_8c18858d_i245 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… states Verify that filterOutNotReadyExporters correctly excludes exporters in HOOK_FAILED and OFFLINE states from lease assignment, serving as a regression test for the server-side safety net added in PR jumpstarter-dev#426. Ref: jumpstarter-dev#245 Generated-By: Forge/20260416_202053_681470_8c18858d_i245 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… states Verify that filterOutNotReadyExporters correctly excludes exporters in HOOK_FAILED and OFFLINE states from lease assignment, serving as a regression test for the server-side safety net added in PR jumpstarter-dev#426. Ref: jumpstarter-dev#245 Generated-By: Forge/20260416_202053_681470_8c18858d_i245 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… states Verify that filterOutNotReadyExporters correctly excludes exporters in HOOK_FAILED and OFFLINE states from lease assignment, serving as a regression test for the server-side safety net added in PR jumpstarter-dev#426. Ref: jumpstarter-dev#245 Generated-By: Forge/20260416_202053_681470_8c18858d_i245 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
filterOutNotReadyExportersto the lease controller's assignment pipeline so exporters still running hooks (AfterLeaseHook, BeforeLeaseHook, etc.) are not assigned new leasesexporterStatus=Availablein e2e test helpers before starting new tests, preventing races even if a controller bug is reintroducedFixes #425
Root Cause
The lease controller's
reconcileStatusExporterRefonly checkedOnline+Registeredconditions before assigning a lease to an exporter. These conditions are set at initial registration and never cleared between leases. TheExporterStatusValuefield (which tracks the exporter's actual operational state —Available,AfterLeaseHook,BeforeLeaseHook,LeaseReady, etc.) was completely ignored during lease assignment.This meant an exporter still running cleanup from a previous lease (e.g.,
AfterLeaseHookstatus) could be assigned a new lease. The client would then try toDial()the exporter, but the exporter'sserve()loop was still blocked onafter_lease_hook_done.wait()and couldn't process the new lease. After the 30s dial timeout, the client would fail with "Connection to exporter lost".Controller Fix
New
filterOutNotReadyExportersfunction inlease_controller.gofilters the exporter list after the offline check and before the leased check. Only exporters withAvailablestatus (or unset status, for backwards compatibility with older exporters) are eligible for lease assignment. When all online exporters are busy, the lease getsPendingstatus with reasonNotReadyand requeues after 1 second.E2E Fix (defense-in-depth)
The
wait_for_exporter/wait_for_hooks_exporterhelpers now poll.status.exporterStatusforAvailablein addition to checking k8s conditions, ensuring tests don't start a new lease while an exporter is still cleaning up.Changes
controller/internal/controller/lease_controller.go: AddfilterOutNotReadyExporters, integrate into assignment pipelinecontroller/internal/controller/lease_controller_test.go: AddsetExporterNotReadyhelper and two test cases (all-not-ready, partial-not-ready)e2e/tests.bats: Refactorwait_for_exporterto pollexporterStatus=Availablee2e/tests-hooks.bats: SameexporterStatus=AvailablepollingTest plan
🤖 Generated with Claude Code