
shell: keep shell usable after token refreshes#199

Merged
mangelajo merged 3 commits into
jumpstarter-dev:mainfrom
bennyz:jmp-shell-refresh
Feb 10, 2026
Conversation

@bennyz
Member

@bennyz bennyz commented Feb 6, 2026

Currently, when the token expires, the shell becomes unusable, and even if we recover by refreshing the token, the shell remains unusable until it is recreated.

Instead, this change does not cancel the monitor when the token expires, and gives the user a chance to refresh their token (if it wasn't refreshed automatically).

fixes #194

Summary by CodeRabbit

  • New Features

    • Proactive token monitoring with automatic refresh and disk-reload recovery; seamless reconnection when credentials change.
  • Improvements

    • Clearer, actionable expiry warnings and login guidance.
    • More resilient lease/connection monitoring that tolerates transient outages and better estimates remaining time.
    • Preserves existing credentials on refresh failures to avoid disruption.
  • Tests

    • Extensive unit tests for token refresh/reload, recovery paths, channel updates, and monitoring logic.

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
Assisted-by: claude-opus-4.5
@coderabbitai
Contributor

coderabbitai Bot commented Feb 6, 2026

Warning

Rate limit exceeded

@bennyz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 0 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Adds lease-aware token management and proactive expiry monitoring to the CLI shell, introduces Lease.channel refresh and lease-monitoring helpers, and adds comprehensive tests plus an AnyIO pytest backend fixture.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CLI shell token management: `python/packages/jumpstarter-cli/jumpstarter_cli/shell.py` | Added token refresh/reload helpers, lease-aware channel update, expanded `_monitor_token_expiry` signature/logic, new imports/logger, and updated monitor startup call site. |
| Lease lifecycle & channel refresh: `python/packages/jumpstarter/jumpstarter/client/lease.py` | Added `refresh_channel`, `_notify_lease_ending`, `_get_lease_end_time`; refactored `monitor_async` to cache end time, handle `get()` failures, notify via callback, and improve sleep/expiry handling. |
| Tests, CLI shell: `python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py` | New async tests covering channel updates, refresh/reload flows, recovery paths, warning messages, and `_monitor_token_expiry` control flows. |
| Tests, Lease: `python/packages/jumpstarter/jumpstarter/client/lease_test.py` | Added tests for `refresh_channel`, lease-ending callback behavior, end-time derivation, and monitor resilience to transient errors. |
| Test config: `python/packages/jumpstarter-cli/conftest.py` | Added `anyio_backend` pytest fixture returning `"asyncio"` to set the AnyIO backend for tests. |

Sequence Diagram(s)

sequenceDiagram
    participant Shell as Shell (monitor)
    participant OIDC as OIDC Client
    participant Disk as Token File
    participant Lease as Lease (gRPC)
    participant Config as Config/Persist

    Shell->>OIDC: attempt refresh using refresh_token
    alt refresh succeeds
        OIDC-->>Shell: new access_token (+maybe refresh_token)
        Shell->>Config: update tokens in memory
        Shell->>Config: persist tokens to disk
        Shell->>Lease: refresh_channel(with updated credentials)
        Lease-->>Shell: channel refreshed
    else refresh fails
        Shell->>Disk: read token file
        alt disk token valid and newer
            Disk-->>Shell: token data
            Shell->>Config: update tokens in memory
            Shell->>Lease: refresh_channel(with disk credentials)
        else
            Shell-->>Shell: emit warning / prompt manual login
        end
    end
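
The flow in the diagram above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `shell.py`: `Token`, `recover_token`, and the `refresh`, `load_from_disk`, and `apply` callables are hypothetical names, and "newer" is assumed here to mean "expires later than the token currently held".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a stored credential."""
    value: str
    expires_at: float  # seconds since the epoch


def recover_token(
    current: Token,
    now: float,
    refresh: Callable[[], Token],
    load_from_disk: Callable[[], Optional[Token]],
    apply: Callable[[Token], None],
) -> str:
    """Try a refresh first, then fall back to the token on disk.

    Returns "refreshed", "reloaded", or "warn" so the caller can decide
    whether to prompt for a manual login.
    """
    try:
        new = refresh()
    except Exception:
        new = None  # refresh failed: the current token is left untouched
    if new is not None:
        apply(new)  # e.g. update the config and refresh the lease channel
        return "refreshed"

    disk = load_from_disk()
    # Assumption: accept a disk token only if it is still valid and
    # expires later than the one we already hold.
    if disk is not None and disk.expires_at > now and disk.expires_at > current.expires_at:
        apply(disk)
        return "reloaded"
    return "warn"
```

The string return value is only a convenience for the sketch; the point is the ordering (refresh, then disk reload, then warning) and that a failed refresh never discards the existing credentials.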

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

backport release-0.7

Suggested reviewers

  • mangelajo
  • NickCao

Poem

"I am a rabbit in a burrow bright,
Tokens hop fresh in the silver light.
Channels spring to a newer tune,
Tests applaud beneath the moon.
Hooray — no stale sessions in sight!" 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.65% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and concisely describes the main change: preventing the shell from becoming unusable after token refreshes, which aligns with the primary objective of fixing issue #194. |
| Linked Issues check | ✅ Passed | The PR implementation addresses all objectives from issue #194: prevents shell cancellation on token expiry, enables token refresh/reload with channel updates, adds auto-refresh capabilities, clarifies messaging, and includes comprehensive tests. |
| Out of Scope Changes check | ✅ Passed | All changes align with PR objectives: token management (shell.py, lease.py), test infrastructure (conftest.py, shell_test.py, lease_test.py) support the core goal of keeping the shell usable after token refreshes. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 68-76: refresh_channel currently replaces self.channel without
closing the previous gRPC channel, leaking connections; before assigning a new
channel in refresh_channel, if self.channel is not None attempt to call
self.channel.close() (safely catching and logging any exception) and then
replace self.channel, re-create controller via
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and svc via
ClientService(channel=channel, namespace=self.namespace); ensure the close call
is performed even if controller/svc swap fails to avoid leaks.
🧹 Nitpick comments (3)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)

336-344: Threshold notification window can be missed entirely.

The window threshold - check_interval <= remain < threshold (defaults to [4m30s, 5m0s)) relies on a poll landing inside a narrow 30-second band. If the system is under load or the sleep overshoots, remain could jump from ≥ 5m to < 4m30s between iterations, silently skipping the notification.

A dedicated flag would guarantee the callback fires exactly once:

Proposed fix
         async def _monitor():
             check_interval = 30
             last_known_end_time = None
+            notified_threshold = False
             while True:
                 ...
-                # Log once when entering the threshold window
-                if threshold - timedelta(seconds=check_interval) <= remain < threshold:
+                if remain < threshold and not notified_threshold:
+                    notified_threshold = True
                     logger.info(
                         ...
                     )
                     self._notify_lease_ending(remain)
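
To make the difference concrete, here is a small self-contained sketch comparing the two checks on a sequence of polled `remain` values. The helper names `window_notifications` and `flag_notifications` are hypothetical and exist only for this illustration; they are not part of `lease.py`.

```python
from datetime import timedelta


def window_notifications(remains, threshold, check_interval):
    """Count notifications fired by the narrow-window check."""
    return sum(1 for r in remains if threshold - check_interval <= r < threshold)


def flag_notifications(remains, threshold):
    """Count notifications fired by the one-shot flag check."""
    notified = False
    count = 0
    for r in remains:
        if r < threshold and not notified:
            notified = True
            count += 1
    return count


threshold = timedelta(minutes=5)
interval = timedelta(seconds=30)

# A poll at 5m10s followed by a delayed poll at 4m20s jumps over the
# [4m30s, 5m0s) window entirely.
late_polls = [
    timedelta(minutes=5, seconds=10),
    timedelta(minutes=4, seconds=20),
    timedelta(minutes=3, seconds=50),
]

window_count = window_notifications(late_polls, threshold, interval)
flag_count = flag_notifications(late_polls, threshold)
```

With these delayed polls the window check never fires, while the flag check fires exactly once at the first poll below the threshold.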
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

216-217: Silent except Exception: return hides monitor failures.

If an unexpected error occurs (e.g., an attribute mismatch after a config change, or a bug in recovery logic), the token monitor silently exits and the user gets no auto-refresh or expiry warnings for the rest of the session — with no indication of why.

At minimum, log the exception before bailing:

Proposed fix
-        except Exception:
-            return
+        except Exception:
+            logger.debug("Token expiry monitor exiting due to unexpected error", exc_info=True)
+            return

56-62: Extract hardcoded client_id to a shared constant.

"jumpstarter-cli" appears hardcoded in at least three locations (login.py:328, shell.py:60, auth.py:124), and login.py:328 already has a TODO comment noting this should be stored in config. Centralizing this value prevents silent divergence if it changes and reduces duplication.

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py
@bennyz
Member Author

bennyz commented Feb 6, 2026

need to add some tests

@bennyz bennyz requested a review from mangelajo February 6, 2026 11:43
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 476-511: The test test_estimates_expiry_from_last_known_end_time
currently returns an already-expired end time on the first get(), so the monitor
immediately calls _notify_lease_ending and never exercises the cached
last_known_end_time path; change the fake get_then_fail so its first return has
an effective_end_time slightly in the future (so lease._get_lease_end_time
caches it and the monitor sleeps), then have the second get() raise the "server
unavailable" exception after the cached time would have passed so that
lease.monitor_async relies on the cached last_known_end_time and calls
_notify_lease_ending(timedelta(0)); adjust call_count logic in get_then_fail and
any timing (past_end -> future_end) to ensure the loop enters the sleep branch
before the simulated server failure.
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py (1)

252-377: Good test suite for _monitor_token_expiry; one observation on exit strategy.

The tests use Exception("done") from get_token_remaining_seconds to break out of the monitor loop, relying on the implementation's bare except catching everything. This couples tests to the current except Exception: return exit behavior. If the implementation ever narrows the exception handling, these tests would hang instead of failing cleanly. Consider using a sentinel pattern (e.g., a counter-based side effect that sets cancel_called=True after the first iteration — as done in test_sleeps_30s_when_above_threshold) for more robust loop termination.

Not a blocker — the current tests work correctly with the current implementation.

Comment thread python/packages/jumpstarter/jumpstarter/client/lease_test.py
Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
@bennyz bennyz force-pushed the jmp-shell-refresh branch from 2f7f8c9 to 1c25bc5 Compare February 8, 2026 08:35
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 484-506: The test's timing is fragile because future_end is only
50ms ahead of datetime.now(), so increase the margin and/or make the monitor's
sleep deterministic: set future_end to now + timedelta(seconds=5) (replace the
50ms offset) and patch jumpstarter.client.lease.sleep in the test (same approach
used in test_continues_on_get_failure_without_end_time) so the monitor loop uses
the patched sleep; keep get_then_fail and the async with lease.monitor_async()
logic unchanged so the test still exercises the cached end-time error path
reliably on CI.
🧹 Nitpick comments (3)
python/packages/jumpstarter/jumpstarter/client/lease_test.py (2)

176-176: Inconsistent async test markers: asyncio vs anyio.

Line 176 uses @pytest.mark.asyncio while lines 449 and 476 use @pytest.mark.anyio. Since monitor_async relies on anyio primitives (create_task_group), the anyio marker is correct there. However, mixing markers in the same test file can lead to confusing test-runner configuration issues. Consider switching Line 176 to @pytest.mark.anyio for consistency.

Suggested fix
-    @pytest.mark.asyncio
+    @pytest.mark.anyio
     async def test_integration_with_async_context(self):

Also applies to: 449-449, 476-476


339-363: LGTM — refresh_channel test correctly avoids closing the old channel.

The test verifies that refresh_channel replaces channel, controller, and svc without calling close() on the old channel, which aligns with the design intent from the retrieved learnings.

One minor gap: mock_svc_cls.assert_called_once() doesn't verify it was called with the expected channel and namespace keyword arguments.

Optional: tighten the assertion
-        mock_svc_cls.assert_called_once()
+        mock_svc_cls.assert_called_once_with(channel=new_channel, namespace="default")
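
A quick standalone illustration of why the tighter assertion matters, using only `unittest.mock`. The `svc_cls` mock and the channel strings are placeholders for this sketch, not names from the test file.

```python
from unittest.mock import Mock

svc_cls = Mock()
# Simulate the code under test constructing the service with the wrong channel.
svc_cls(channel="wrong-channel", namespace="default")

# Passes: only the number of calls is checked, not the arguments.
svc_cls.assert_called_once()

try:
    svc_cls.assert_called_once_with(channel="new-channel", namespace="default")
    mismatch_detected = False
except AssertionError:
    # The argument mismatch is caught only by the stricter assertion.
    mismatch_detected = True
```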
python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py (1)

327-344: Clever cancel_called property trick, but fragile if the implementation checks it more than twice.

The counter-based property returns False on the first access and True on the second. If the production code adds an extra cancel_called check (e.g., in a guard or log statement), this test would silently change behavior. A safer alternative would be to use side_effect on a list:

💡 Alternative approach
-        call_count = 0
-
-        def check_cancelled():
-            nonlocal call_count
-            call_count += 1
-            return call_count > 1
-
-        config = _make_config()
-        cancel_scope = Mock()
-        type(cancel_scope).cancel_called = property(lambda self: check_cancelled())
+        config = _make_config()
+        cancel_scope = Mock()
+        type(cancel_scope).cancel_called = PropertyMock(side_effect=[False, True])

This makes the exact number of accesses explicit and will raise StopIteration if there's an unexpected extra access, surfacing breakage rather than hiding it.
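
The behaviour described above can be checked in isolation. This sketch uses only `unittest.mock`; the variable names are illustrative.

```python
from unittest.mock import Mock, PropertyMock

cancel_scope = Mock()
# Each read of cancel_called consumes the next value from the list.
type(cancel_scope).cancel_called = PropertyMock(side_effect=[False, True])

first = cancel_scope.cancel_called   # first loop check: keep running
second = cancel_scope.cancel_called  # second loop check: stop

try:
    cancel_scope.cancel_called
    extra_access_failed = False
except StopIteration:
    # The list is exhausted, so an unexpected third read fails loudly.
    extra_access_failed = True
```

Because every `Mock` instance gets its own class, assigning the `PropertyMock` on `type(cancel_scope)` does not leak into other mocks in the same test module.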

Comment on lines +484 to +506
future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=50)
call_count = 0

async def get_then_fail():
    nonlocal call_count
    call_count += 1
    if call_count == 1:
        return Mock(
            effective_begin_time=future_end - timedelta(hours=1),
            effective_duration=timedelta(hours=1),
            effective_end_time=None,
            duration=timedelta(hours=1),
        )
    raise Exception("server unavailable")

lease.get = get_then_fail

with caplog.at_level(logging.WARNING):
    async with lease.monitor_async():
        # Keep the body alive long enough for the monitor to loop
        # through the first get(), sleep, second get() (fails), and
        # error handler using the cached end time.
        await asyncio.sleep(0.2)
Contributor

@coderabbitai coderabbitai Bot Feb 8, 2026

⚠️ Potential issue | 🟡 Minor

Timing-sensitive test — 50 ms margin is fragile in CI.

future_end is only 50 ms ahead of datetime.now(). On a loaded CI runner the first get() + the monitor's first iteration could easily take longer than 50 ms, causing the lease to appear expired on the first check rather than the second, which would skip the cached-end-time error path the test intends to exercise.

Consider either:

  1. Patching jumpstarter.client.lease.sleep (as done in test_continues_on_get_failure_without_end_time) and using a larger future_end offset (e.g. 5 s) so wall-clock jitter is irrelevant, or
  2. Patching datetime.now to control time deterministically.

Member Author

if future_end = now + 5s and the monitor's sleep is patched, the monitor spins rapidly through the error handler, but remain stays ~5s because no real time passes
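
The point can be illustrated with a toy polling loop. The sketch below is an assumption-laden illustration, not the monitor in `lease.py`: `FakeClock` and `wait_for_expiry` are hypothetical names, and the loop takes its clock as a parameter so that sleeping and "now" stay in sync.

```python
import asyncio
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Hypothetical test clock: sleeping advances time instead of waiting."""

    def __init__(self, start):
        self.current = start

    def now(self):
        return self.current

    async def sleep(self, seconds):
        self.current += timedelta(seconds=seconds)
        await asyncio.sleep(0)  # yield to the event loop without a real delay


async def wait_for_expiry(end_time, clock, check_interval=30, max_checks=100):
    """Poll until the clock passes end_time; return the number of sleeps."""
    sleeps = 0
    while clock.now() < end_time and sleeps < max_checks:
        await clock.sleep(check_interval)
        sleeps += 1
    return sleeps


start = datetime(2026, 2, 8, 12, 0, tzinfo=timezone.utc)
clock = FakeClock(start)
sleeps = asyncio.run(wait_for_expiry(start + timedelta(seconds=5), clock))
```

If `sleep` is replaced by a no-op while the clock stands still, the same loop stops only when it reaches `max_checks`, which is the rapid spinning described in the comment above.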

Contributor

@bennyz You're absolutely right! That's a key insight—patching sleep alone doesn't advance datetime.now(), so the lease never actually expires. To make this test both deterministic and meaningful, you need to control time progression.

Here are a couple of approaches:

Option 1: Use freezegun with tick=True

from freezegun import freeze_time

@pytest.mark.anyio
@freeze_time("2025-06-01 10:00:00", tick=True, auto_tick_seconds=0.1)
async def test_estimates_expiry_from_last_known_end_time(self, caplog):
    """When get() fails after we've seen an end time, use cached value."""
    lease = self._make_lease_for_monitor()
    callback = Mock()
    lease.lease_ending_callback = callback

    # End time 0.2s in the future
    future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=200)
    call_count = 0

    async def get_then_fail():
        nonlocal call_count
        call_count += 1
        if call_count == 1:
            return Mock(...)
        raise Exception("server unavailable")

    lease.get = get_then_fail

    with patch("jumpstarter.client.lease.sleep", new_callable=AsyncMock):
        async with lease.monitor_async():
            await asyncio.sleep(0.3)  # freezegun advances time

    callback.assert_called()
    _, remain_arg = callback.call_args[0]
    assert remain_arg == timedelta(0)

Option 2: Mock datetime.now to advance time between calls

@pytest.mark.anyio
async def test_estimates_expiry_from_last_known_end_time(self, caplog):
    lease = self._make_lease_for_monitor()
    callback = Mock()
    lease.lease_ending_callback = callback

    start_time = datetime(2025, 6, 1, 10, 0, 0, tzinfo=timezone.utc)
    future_end = start_time + timedelta(milliseconds=100)
    
    # Control time: first call returns start_time, subsequent calls return after expiry
    time_sequence = [start_time, start_time, future_end + timedelta(milliseconds=50)]
    
    with patch("jumpstarter.client.lease.datetime") as mock_dt:
        mock_dt.now.side_effect = lambda tz=None: time_sequence.pop(0) if time_sequence else future_end + timedelta(seconds=1)
        mock_dt.timezone = timezone  # preserve timezone
        
        # ... rest of test

Option 3: Test the caching behavior without requiring expiry
If verifying cached end-time usage is the primary goal, you could assert on remain being close to the expected value rather than exactly 0:

# Verify cached end time was used (remain should be close to expected)
assert -timedelta(seconds=1) <= remain_arg <= timedelta(milliseconds=100)

Which approach would you prefer? The freezegun option is cleanest if you're okay with adding that dependency.


🧠 Learnings used
Learnt from: bennyz
Repo: jumpstarter-dev/jumpstarter PR: 199
File: python/packages/jumpstarter/jumpstarter/client/lease.py:68-76
Timestamp: 2026-02-06T11:33:20.121Z
Learning: In `python/packages/jumpstarter/jumpstarter/client/lease.py`, the `refresh_channel()` method should NOT call `close()` on the old channel before replacing it. Token refresh happens asynchronously while RPCs may be in-flight, and calling `close()` would immediately terminate active RPCs. Instead, rely on Python's reference counting to clean up the old channel after in-flight operations complete.

Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.

Member Author

go through all of this for the off chance we miss the 200ms window?

Contributor

@bennyz You're right—adding freezegun or complex datetime mocking for this edge case is probably overkill. The test is primarily about verifying that the monitor uses the cached end time when get() fails, not about precise timing guarantees.

Simple middle ground:
Keep the test as-is but bump the margin from 50ms to 200ms:

future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=200)

This gives you 4× more breathing room on slow CI runners while keeping the test simple. If it does prove flaky in practice (which is unlikely), you can revisit with retries or accept the occasional spurious failure.

The core behavior—error recovery using cached state—will still be exercised reliably. No need to overcomplicate it. 👍

🐰


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 290-296: In _get_lease_end_time replace the incorrect use of
lease.effective_duration (which is elapsed time) with the requested/configured
duration field (lease.duration) when computing the end time: keep the existing
early-return when effective_end_time is present, and if not, validate
lease.effective_begin_time and lease.duration exist and return
lease.effective_begin_time + lease.duration; update the function
(_get_lease_end_time) to check lease.duration instead of
lease.effective_duration to mirror the correct pattern used in grpc.py (line
~107).
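
The suggested computation can be sketched as below. `LeaseInfo` and `lease_end_time` are hypothetical stand-ins that carry only the fields named in this comment; the actual lease object and `_get_lease_end_time` in `lease.py` may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class LeaseInfo:
    """Hypothetical stand-in carrying only the fields discussed above."""
    effective_begin_time: Optional[datetime]
    duration: Optional[timedelta]            # requested/configured duration
    effective_duration: Optional[timedelta]  # elapsed time so far
    effective_end_time: Optional[datetime] = None


def lease_end_time(lease: LeaseInfo) -> Optional[datetime]:
    """Return the known end time, else begin time plus requested duration."""
    if lease.effective_end_time is not None:
        return lease.effective_end_time
    if lease.effective_begin_time is None or lease.duration is None:
        return None
    # Deliberately ignores effective_duration, which is only the elapsed time.
    return lease.effective_begin_time + lease.duration


begin = datetime(2026, 2, 8, 10, 0, tzinfo=timezone.utc)
active = LeaseInfo(begin, timedelta(hours=1), timedelta(minutes=10))
expected_end = lease_end_time(active)
```

For a lease that started at 10:00 with a requested duration of one hour and ten minutes elapsed, this yields 11:00, whereas adding `effective_duration` would have produced 10:10.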

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
@bennyz bennyz force-pushed the jmp-shell-refresh branch from 6c252e7 to 018adaa Compare February 8, 2026 14:13
…ctive_duration

Signed-off-by: Benny Zlotnik <bzlotnik@redhat.com>
@bennyz bennyz force-pushed the jmp-shell-refresh branch from 018adaa to ffdcf37 Compare February 8, 2026 14:14
@mangelajo
Member

@bennyz how does this behave with internal tokens?

@mangelajo
Member

I guess since those are also OIDC, and they expire in a long time, it's simply not triggered?

@bennyz
Member Author

bennyz commented Feb 9, 2026

> I guess since those are also OIDC, and they expire in a long time, it's simply not triggered?

yeah, it might take a long time

@mangelajo mangelajo merged commit 3b8238e into jumpstarter-dev:main Feb 10, 2026
25 of 28 checks passed
@bennyz bennyz deleted the jmp-shell-refresh branch February 10, 2026 10:02
Development

Successfully merging this pull request may close these issues.

Fix lease expiration inside jumpstarter shell

2 participants