shell: keep shell usable after token refreshes#199
Conversation
Currently, when token expires, the shell becomes unusable, and even if we recover by refreshing, the shell will remain unusable until it is recreated. Instead, we do not cancel the monitor when the token expires and give the user a chance to refresh their token (if it wasn't automatically refreshed). fixes jumpstarter-dev#194 Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com> Assisted-by: claude-opus-4.5
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughAdds lease-aware token management and proactive expiry monitoring to the CLI shell, introduces Lease.channel refresh and lease-monitoring helpers, and adds comprehensive tests plus an AnyIO pytest backend fixture. Changes
Sequence Diagram(s)sequenceDiagram
participant Shell as Shell (monitor)
participant OIDC as OIDC Client
participant Disk as Token File
participant Lease as Lease (gRPC)
participant Config as Config/Persist
Shell->>OIDC: attempt refresh using refresh_token
alt refresh succeeds
OIDC-->>Shell: new access_token (+maybe refresh_token)
Shell->>Config: update tokens in memory
Shell->>Config: persist tokens to disk
Shell->>Lease: refresh_channel(with updated credentials)
Lease-->>Shell: channel refreshed
else refresh fails
Shell->>Disk: read token file
alt disk token valid and newer
Disk-->>Shell: token data
Shell->>Config: update tokens in memory
Shell->>Lease: refresh_channel(with disk credentials)
else
Shell-->>Shell: emit warning / prompt manual login
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 68-76: refresh_channel currently replaces self.channel without
closing the previous gRPC channel, leaking connections; before assigning a new
channel in refresh_channel, if self.channel is not None attempt to call
self.channel.close() (safely catching and logging any exception) and then
replace self.channel, re-create controller via
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and svc via
ClientService(channel=channel, namespace=self.namespace); ensure the close call
is performed even if controller/svc swap fails to avoid leaks.
🧹 Nitpick comments (3)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)
336-344: Threshold notification window can be missed entirely.The window
threshold - check_interval <= remain < threshold(defaults to [4m30s, 5m0s)) relies on a poll landing inside a narrow 30-second band. If the system is under load or the sleep overshoots,remaincould jump from ≥ 5m to < 4m30s between iterations, silently skipping the notification.A dedicated flag would guarantee the callback fires exactly once:
Proposed fix
async def _monitor(): check_interval = 30 last_known_end_time = None + notified_threshold = False while True: ... - # Log once when entering the threshold window - if threshold - timedelta(seconds=check_interval) <= remain < threshold: + if remain < threshold and not notified_threshold: + notified_threshold = True logger.info( ... ) self._notify_lease_ending(remain)python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
216-217: Silentexcept Exception: returnhides monitor failures.If an unexpected error occurs (e.g., an attribute mismatch after a config change, or a bug in recovery logic), the token monitor silently exits and the user gets no auto-refresh or expiry warnings for the rest of the session — with no indication of why.
At minimum, log the exception before bailing:
Proposed fix
- except Exception: - return + except Exception: + logger.debug("Token expiry monitor exiting due to unexpected error", exc_info=True) + return
56-62: Extract hardcodedclient_idto a shared constant.
"jumpstarter-cli"appears hardcoded in at least three locations (login.py:328,shell.py:60,auth.py:124), andlogin.py:328already has a TODO comment noting this should be stored in config. Centralizing this value prevents silent divergence if it changes and reduces duplication.
|
need to add some tests |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 476-511: The test test_estimates_expiry_from_last_known_end_time
currently returns an already-expired end time on the first get(), so the monitor
immediately calls _notify_lease_ending and never exercises the cached
last_known_end_time path; change the fake get_then_fail so its first return has
an effective_end_time slightly in the future (so lease._get_lease_end_time
caches it and the monitor sleeps), then have the second get() raise the "server
unavailable" exception after the cached time would have passed so that
lease.monitor_async relies on the cached last_known_end_time and calls
_notify_lease_ending(timedelta(0)); adjust call_count logic in get_then_fail and
any timing (past_end -> future_end) to ensure the loop enters the sleep branch
before the simulated server failure.
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py (1)
252-377: Good test suite for_monitor_token_expiry; one observation on exit strategy.The tests use
Exception("done")fromget_token_remaining_secondsto break out of the monitor loop, relying on the implementation's bareexceptcatching everything. This couples tests to the currentexcept Exception: returnexit behavior. If the implementation ever narrows the exception handling, these tests would hang instead of failing cleanly. Consider using a sentinel pattern (e.g., a counter-based side effect that setscancel_called=Trueafter the first iteration — as done intest_sleeps_30s_when_above_threshold) for more robust loop termination.Not a blocker — the current tests work correctly with the current implementation.
2f7f8c9 to
1c25bc5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 484-506: The test's timing is fragile because future_end is only
50ms ahead of datetime.now(), so increase the margin and/or make the monitor's
sleep deterministic: set future_end to now + timedelta(seconds=5) (replace the
50ms offset) and patch jumpstarter.client.lease.sleep in the test (same approach
used in test_continues_on_get_failure_without_end_time) so the monitor loop uses
the patched sleep; keep get_then_fail and the async with lease.monitor_async()
logic unchanged so the test still exercises the cached end-time error path
reliably on CI.
🧹 Nitpick comments (3)
python/packages/jumpstarter/jumpstarter/client/lease_test.py (2)
176-176: Inconsistent async test markers:asynciovsanyio.Line 176 uses
@pytest.mark.asynciowhile lines 449 and 476 use@pytest.mark.anyio. Sincemonitor_asyncrelies on anyio primitives (create_task_group), theanyiomarker is correct there. However, mixing markers in the same test file can lead to confusing test-runner configuration issues. Consider switching Line 176 to@pytest.mark.anyiofor consistency.Suggested fix
- `@pytest.mark.asyncio` + `@pytest.mark.anyio` async def test_integration_with_async_context(self):Also applies to: 449-449, 476-476
339-363: LGTM —refresh_channeltest correctly avoids closing the old channel.The test verifies that
refresh_channelreplaceschannel,controller, andsvcwithout callingclose()on the old channel, which aligns with the design intent from the retrieved learnings.One minor gap:
mock_svc_cls.assert_called_once()doesn't verify it was called with the expectedchannelandnamespacekeyword arguments.Optional: tighten the assertion
- mock_svc_cls.assert_called_once() + mock_svc_cls.assert_called_once_with(channel=new_channel, namespace="default")python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py (1)
327-344: Clevercancel_calledproperty trick, but fragile if the implementation checks it more than twice.The counter-based property returns
Falseon the first access andTrueon the second. If the production code adds an extracancel_calledcheck (e.g., in a guard or log statement), this test would silently change behavior. A safer alternative would be to useside_effecton a list:💡 Alternative approach
- call_count = 0 - - def check_cancelled(): - nonlocal call_count - call_count += 1 - return call_count > 1 - - config = _make_config() - cancel_scope = Mock() - type(cancel_scope).cancel_called = property(lambda self: check_cancelled()) + config = _make_config() + cancel_scope = Mock() + type(cancel_scope).cancel_called = PropertyMock(side_effect=[False, True])This makes the exact number of accesses explicit and will raise
StopIterationif there's an unexpected extra access, surfacing breakage rather than hiding it.
| future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=50) | ||
| call_count = 0 | ||
|
|
||
| async def get_then_fail(): | ||
| nonlocal call_count | ||
| call_count += 1 | ||
| if call_count == 1: | ||
| return Mock( | ||
| effective_begin_time=future_end - timedelta(hours=1), | ||
| effective_duration=timedelta(hours=1), | ||
| effective_end_time=None, | ||
| duration=timedelta(hours=1), | ||
| ) | ||
| raise Exception("server unavailable") | ||
|
|
||
| lease.get = get_then_fail | ||
|
|
||
| with caplog.at_level(logging.WARNING): | ||
| async with lease.monitor_async(): | ||
| # Keep the body alive long enough for the monitor to loop | ||
| # through the first get(), sleep, second get() (fails), and | ||
| # error handler using the cached end time. | ||
| await asyncio.sleep(0.2) |
There was a problem hiding this comment.
Timing-sensitive test — 50 ms margin is fragile in CI.
future_end is only 50 ms ahead of datetime.now(). On a loaded CI runner the first get() + the monitor's first iteration could easily take longer than 50 ms, causing the lease to appear expired on the first check rather than the second, which would skip the cached-end-time error path the test intends to exercise.
Consider either:
- Patching
jumpstarter.client.lease.sleep(as done intest_continues_on_get_failure_without_end_time) and using a largerfuture_endoffset (e.g. 5 s) so wall-clock jitter is irrelevant, or - Patching
datetime.nowto control time deterministically.
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py` around lines
484 - 506, The test's timing is fragile because future_end is only 50ms ahead of
datetime.now(), so increase the margin and/or make the monitor's sleep
deterministic: set future_end to now + timedelta(seconds=5) (replace the 50ms
offset) and patch jumpstarter.client.lease.sleep in the test (same approach used
in test_continues_on_get_failure_without_end_time) so the monitor loop uses the
patched sleep; keep get_then_fail and the async with lease.monitor_async() logic
unchanged so the test still exercises the cached end-time error path reliably on
CI.
There was a problem hiding this comment.
if future_end = now + 5s and the monitor's sleep is patched, the monitor spins rapidly through the error handler, but remain stays ~5s because no real time passes
There was a problem hiding this comment.
@bennyz You're absolutely right! That's a key insight—patching sleep alone doesn't advance datetime.now(), so the lease never actually expires. To make this test both deterministic and meaningful, you need to control time progression.
Here are a couple of approaches:
Option 1: Use freezegun with tick=True
from freezegun import freeze_time
`@pytest.mark.anyio`
`@freeze_time`("2025-06-01 10:00:00", tick=True, auto_tick_seconds=0.1)
async def test_estimates_expiry_from_last_known_end_time(self, caplog):
"""When get() fails after we've seen an end time, use cached value."""
lease = self._make_lease_for_monitor()
callback = Mock()
lease.lease_ending_callback = callback
# End time 0.2s in the future
future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=200)
call_count = 0
async def get_then_fail():
nonlocal call_count
call_count += 1
if call_count == 1:
return Mock(...)
raise Exception("server unavailable")
lease.get = get_then_fail
with patch("jumpstarter.client.lease.sleep", new_callable=AsyncMock):
async with lease.monitor_async():
await asyncio.sleep(0.3) # freezegun advances time
callback.assert_called()
_, remain_arg = callback.call_args[0]
assert remain_arg == timedelta(0)Option 2: Mock datetime.now to advance time between calls
`@pytest.mark.anyio`
async def test_estimates_expiry_from_last_known_end_time(self, caplog):
lease = self._make_lease_for_monitor()
callback = Mock()
lease.lease_ending_callback = callback
start_time = datetime(2025, 6, 1, 10, 0, 0, tzinfo=timezone.utc)
future_end = start_time + timedelta(milliseconds=100)
# Control time: first call returns start_time, subsequent calls return after expiry
time_sequence = [start_time, start_time, future_end + timedelta(milliseconds=50)]
with patch("jumpstarter.client.lease.datetime") as mock_dt:
mock_dt.now.side_effect = lambda tz=None: time_sequence.pop(0) if time_sequence else future_end + timedelta(seconds=1)
mock_dt.timezone = timezone # preserve timezone
# ... rest of testOption 3: Test the caching behavior without requiring expiry
If verifying cached end-time usage is the primary goal, you could assert on remain being close to the expected value rather than exactly 0:
# Verify cached end time was used (remain should be close to expected)
assert -timedelta(seconds=1) <= remain_arg <= timedelta(milliseconds=100)Which approach would you prefer? The freezegun option is cleanest if you're okay with adding that dependency.
🧠 Learnings used
Learnt from: bennyz
Repo: jumpstarter-dev/jumpstarter PR: 199
File: python/packages/jumpstarter/jumpstarter/client/lease.py:68-76
Timestamp: 2026-02-06T11:33:20.121Z
Learning: In `python/packages/jumpstarter/jumpstarter/client/lease.py`, the `refresh_channel()` method should NOT call `close()` on the old channel before replacing it. Token refresh happens asynchronously while RPCs may be in-flight, and calling `close()` would immediately terminate active RPCs. Instead, rely on Python's reference counting to clean up the old channel after in-flight operations complete.
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.
There was a problem hiding this comment.
go through all of this for the off chance we miss the 200ms window?
There was a problem hiding this comment.
@bennyz You're right—adding freezegun or complex datetime mocking for this edge case is probably overkill. The test is primarily about verifying that the monitor uses the cached end time when get() fails, not about precise timing guarantees.
Simple middle ground:
Keep the test as-is but bump the margin from 50ms to 200ms:
future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=200)This gives you 4× more breathing room on slow CI runners while keeping the test simple. If it does prove flaky in practice (which is unlikely), you can revisit with retries or accept the occasional spurious failure.
The core behavior—error recovery using cached state—will still be exercised reliably. No need to overcomplicate it. 👍
🐰
🧠 Learnings used
Learnt from: bennyz
Repo: jumpstarter-dev/jumpstarter PR: 199
File: python/packages/jumpstarter/jumpstarter/client/lease.py:68-76
Timestamp: 2026-02-06T11:33:20.121Z
Learning: In `python/packages/jumpstarter/jumpstarter/client/lease.py`, the `refresh_channel()` method should NOT call `close()` on the old channel before replacing it. Token refresh happens asynchronously while RPCs may be in-flight, and calling `close()` would immediately terminate active RPCs. Instead, rely on Python's reference counting to clean up the old channel after in-flight operations complete.
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 290-296: In _get_lease_end_time replace the incorrect use of
lease.effective_duration (which is elapsed time) with the requested/configured
duration field (lease.duration) when computing the end time: keep the existing
early-return when effective_end_time is present, and if not, validate
lease.effective_begin_time and lease.duration exist and return
lease.effective_begin_time + lease.duration; update the function
(_get_lease_end_time) to check lease.duration instead of
lease.effective_duration to mirror the correct pattern used in grpc.py (line
~107).
6c252e7 to
018adaa
Compare
…ctive_duration Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
018adaa to
ffdcf37
Compare
|
@bennyz how does this behave with internal tokens? |
|
I guess since those are also OIDC, and they expire in a long time, it's simply not triggered? |
yeah, it might take a long time |
Currently, when token expires, the shell becomes unusable, and even if we recover by refreshing, the shell will remain unusable until it is recreated.
Instead, we do not cancel the monitor when the token expires and give the user a chance to refresh their token (if it wasn't automatically refreshed).
fixes #194
Summary by CodeRabbit
New Features
Improvements
Tests