fix: Gateway channel supervision: unlimited retries with error classification and operator controls by praisonai-triage-agent[bot] · Pull Request #1918 · MervinPraison/PraisonAI

praisonai-triage-agent · 2026-06-12T22:57:05Z

Fixes #1913

Summary

Implements channel supervision in the gateway, fixing the critical issue where channels die permanently after network outages.

Changes Made

ChannelSupervisor class with error classification and unlimited retries
Gateway integration replacing bounded retry logic
REST API endpoints for channel control operations
CLI commands for pause/resume/reconnect operations
Enhanced health endpoint with supervision state

Problem Solved

Before: Channels died permanently after 5 retries, requiring a manual restart
After: Channels automatically self-heal with proper error classification

Architecture

Follows AGENTS.md guidelines - implemented in wrapper layer using existing resilience primitives

Generated with Claude Code

Summary by CodeRabbit

New Features
- Added three new gateway CLI commands—pause, resume, and reconnect—for managing channel operations directly from the command line.
- Implemented channel supervision with intelligent error handling, automatic retry strategies, and detailed recovery tracking to improve reliability.
- New API endpoints enable monitoring of channel supervision status and control of individual channel states.

…ling
- Add ChannelSupervisor class for unlimited retries with error classification
- Integrate supervisor with WebSocketGateway to replace bounded retry logic
- Add operator controls for pause/resume/reconnect via CLI and REST API
- Update health endpoint to include supervision state and error details
- Use existing resilience primitives from bots/_resilience.py
- Ensures channels self-heal after network outages automatically

Fixes #1913

Co-authored-by: MervinPraison <MervinPraison@users.noreply.github.com>

MervinPraison · 2026-06-12T22:57:13Z

@coderabbitai review

MervinPraison · 2026-06-12T22:57:14Z

/review

qodo-code-review · 2026-06-12T22:57:17Z

Qodo reviews are paused for this user.

coderabbitai · 2026-06-12T22:57:19Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-06-12T22:57:27Z

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

This PR adds resilient channel supervision to the gateway. A new ChannelSupervisor replaces the bounded 5-retry loop with infinite retries for recoverable errors and introduces error classification (fatal vs. recoverable). The supervisor is integrated into the gateway server with HTTP endpoints and public control methods, and CLI commands are added for operator pause/resume/reconnect.
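The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `ChannelSupervisor`: `FatalChannelError`, `backoff_delay`, and `supervise` are hypothetical names, and the backoff parameters (1 s base, 60 s cap) are assumptions rather than values taken from `BackoffPolicy`.

```python
import asyncio


class FatalChannelError(Exception):
    """Hypothetical marker for errors that should never be retried."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff: base * 2**attempt, never above cap."""
    return min(cap, base * (2 ** attempt))


async def supervise(start_fn, sleep=asyncio.sleep) -> str:
    """Run start_fn until it exits cleanly or fails fatally.

    Recoverable errors are retried without an attempt limit; only the
    delay between attempts is bounded.
    """
    attempt = 0
    while True:
        try:
            await start_fn()
            return "stopped"  # clean exit ends supervision
        except FatalChannelError:
            return "failed"  # fatal errors are not retried
        except Exception:
            # Assumed recoverable: wait, then try again.
            await sleep(backoff_delay(attempt))
            attempt += 1
```

The `sleep` parameter exists only so the delay sequence can be observed without real waiting; a production loop would also need an abort signal so that operators can interrupt the backoff.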

Changes

Gateway Channel Supervision Lifecycle

Layer / File(s)	Summary
ChannelSupervisor model and core logic `src/praisonai/praisonai/gateway/supervisor.py`	`ChannelState` enum and `ChannelStatus` dataclass track per-channel supervision state, error details, and retry timing. `ChannelSupervisor` maintains status/monitor/abort registries, provides operator methods (pause/resume/reconnect) to update state and signal interrupts, and implements an async `run` loop that retries recoverable errors indefinitely with capped exponential backoff, classifies conflict and non-recoverable errors as fatal, and supports manual pause by waiting on abort signals until resumed.
Gateway server integration and HTTP endpoints `src/praisonai/praisonai/gateway/server.py`	Imports and initializes `ChannelSupervisor` in the gateway constructor. Replaces the bounded retry logic in `_run_bot_safe` with `supervisor.run(...)`. Adds public control methods (`pause_channel`, `resume_channel`, `reconnect_channel`, `get_channel_supervision_status`) and HTTP handlers that invoke them. Registers new routes (`POST /api/channels/{name}/pause`, `/resume`, `/reconnect`) and extends `/health` to include per-channel supervision state from the supervisor.
CLI pause/resume/reconnect subcommands `src/praisonai/praisonai/cli/commands/gateway.py`	Adds three new gateway CLI subcommands that accept a channel name and convert the WS gateway URL to HTTP, then POST to the corresponding REST endpoint. Each command reports success or failure based on the JSON response and handles exceptions.

Sequence Diagram

sequenceDiagram
  participant Operator
  participant CLICommand
  participant HTTPEndpoint
  participant WebSocketGateway
  participant ChannelSupervisor
  participant ChannelBot
  
  Operator->>CLICommand: gateway pause mybot
  CLICommand->>HTTPEndpoint: POST /api/channels/mybot/pause
  HTTPEndpoint->>WebSocketGateway: pause_channel("mybot")
  WebSocketGateway->>ChannelSupervisor: pause("mybot")
  ChannelSupervisor->>ChannelSupervisor: set state PAUSED, signal abort
  ChannelSupervisor-->>WebSocketGateway: True
  WebSocketGateway-->>HTTPEndpoint: {success: true}
  HTTPEndpoint-->>CLICommand: {success: true}
  CLICommand-->>Operator: Paused channel 'mybot'
  
  Note over ChannelBot: Bot awaits on abort signal
  
  Operator->>CLICommand: gateway resume mybot
  CLICommand->>HTTPEndpoint: POST /api/channels/mybot/resume
  HTTPEndpoint->>WebSocketGateway: resume_channel("mybot")
  WebSocketGateway->>ChannelSupervisor: resume("mybot")
  ChannelSupervisor->>ChannelSupervisor: clear manual_pause, state STOPPED
  ChannelSupervisor-->>ChannelBot: clear abort signal
  ChannelSupervisor-->>WebSocketGateway: True
  WebSocketGateway-->>HTTPEndpoint: {success: true}
  HTTPEndpoint-->>CLICommand: {success: true}
  CLICommand-->>Operator: Resumed channel 'mybot'
  
  Note over ChannelBot: Supervision loop restarts bot

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 A channel that crashed and burned now springs back to life,
With infinite retries and wisdom to classify strife,
Pause, resume, reconnect—operators take the wheel,
While recoverable errors heal themselves for real. 🚀✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the primary change: implementing gateway channel supervision with unlimited retries, error classification, and operator controls. It directly corresponds to the main objective of issue `#1913`.
Linked Issues check	✅ Passed	All key requirements from issue `#1913` are met: ChannelSupervisor replaces bounded retries with unlimited retries on recoverable errors [supervisor.py]; error classification distinguishes fatal vs. recoverable errors [supervisor.py]; per-channel state surfaces in health data [server.py]; operator controls (pause/resume/reconnect) added via REST API and CLI [server.py, gateway.py]; gateway reuses existing resilience primitives [supervisor.py uses BackoffPolicy, ConnectionMonitor].
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#1913`: three new files/modules (supervisor.py, three CLI commands, REST endpoints) implement the core supervision feature. No extraneous refactoring, unrelated bug fixes, or optional enhancements detected.
Docstring Coverage	✅ Passed	Docstring coverage is 96.88% which is sufficient. The required threshold is 80.00%.

MervinPraison · 2026-06-12T22:57:39Z

@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings.

Review areas:

Bloat check: Are changes minimal and focused? Any unnecessary code or scope creep?
Security: Any hardcoded secrets, unsafe eval/exec, missing input validation?
Performance: Any module-level heavy imports? Hot-path regressions?
Tests: Are tests included? Do they cover the changes adequately?
Backward compat: Any public API changes without deprecation?
Code quality: DRY violations, naming conventions, error handling?
Address reviewer feedback: If Qodo, Coderabbit, or Gemini flagged valid issues, include them in your review
Suggest specific improvements with code examples where possible

greptile-apps · 2026-06-12T23:01:23Z

Greptile Summary

This PR replaces the gateway's bounded 5-retry loop with a ChannelSupervisor that supports unlimited retries with exponential backoff, error classification (recoverable vs. fatal), and manual operator controls (pause/resume/reconnect) exposed via new REST endpoints and CLI commands. The resume() signal bug and missing total_recoveries sync from the prior review round are addressed.

supervisor.py: New ChannelSupervisor with a supervision while True loop, BackoffPolicy/ConnectionMonitor integration, and pause/resume/reconnect synchronous control methods.
server.py: Wires ChannelSupervisor into _run_bot_safe, adds three POST /api/channels/{name}/* endpoints, and enriches the /health response with supervision state per channel.
gateway.py (CLI): Three new typer sub-commands (pause, resume, reconnect) that convert the WebSocket base URL to HTTP and call the new REST endpoints.

Confidence Score: 3/5

Safe to merge only after fixing reconnect on fatally-failed channels; the operation returns success but does nothing, leaving the bot dead with a misleadingly clean state.

The core supervision loop and the previously-flagged resume/recovery-count bugs are now correct. However, reconnect() after a fatal error silently does nothing: the supervision coroutine has already returned, the abort signal has no receiver, the bot stays dead, and the state is flipped from FAILED to STOPPED — masking the original failure.

supervisor.py — specifically the reconnect() method and its interaction with the terminated supervision loop after fatal errors.
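One possible way to make reconnect meaningful after a fatal exit is to remember how to start the supervision loop and spawn a new one when the old coroutine has already finished. The sketch below is a simplified stand-in under that assumption; `MiniSupervisor`, `start`, and the factory registry are hypothetical and do not reflect the PR's actual class layout.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class MiniSupervisor:
    """Hypothetical stand-in, not the PR's ChannelSupervisor."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}
        self._factories: Dict[str, Callable[[], Awaitable[None]]] = {}

    def start(self, name: str, loop_factory: Callable[[], Awaitable[None]]) -> None:
        """Remember how to build the supervision loop, then launch it."""
        self._factories[name] = loop_factory
        self._tasks[name] = asyncio.create_task(loop_factory())

    def reconnect(self, name: str) -> bool:
        """Return True only if a supervision loop will act on the request."""
        if name not in self._factories:
            return False  # unknown channel: report failure, not success
        task = self._tasks.get(name)
        if task is None or task.done():
            # The old loop has exited (for example after a fatal error),
            # so an abort signal would have no receiver. Start a new loop.
            self._tasks[name] = asyncio.create_task(self._factories[name]())
        # A live loop would be signalled here in a fuller implementation.
        return True
```

With this shape, a reconnect on a dead channel either restarts it or returns False, so the REST and CLI layers no longer report success for an operation that did nothing.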

Important Files Changed

Filename	Overview
src/praisonai/praisonai/gateway/supervisor.py	New ChannelSupervisor class with supervision loop, error classification, and operator controls. resume() and total_recoveries bugs from prior review are addressed; reconnect() on a FAILED channel is a silent no-op that falsely reports success; state stays RUNNING during backoff windows; _tasks dict remains unpopulated.
src/praisonai/praisonai/gateway/server.py	Integrates ChannelSupervisor replacing bounded retry logic; adds REST endpoints for pause/resume/reconnect; health endpoint now exposes supervision state. Control endpoints delegate to supervisor methods which inherit the reconnect-on-FAILED no-op issue.
src/praisonai/praisonai/cli/commands/gateway.py	Adds pause, resume, reconnect CLI commands with WebSocket-to-HTTP URL conversion. Logic is straightforward and correct; reconnect will show false success on a permanently-failed channel due to upstream supervisor issue.

Sequence Diagram

sequenceDiagram
    participant GW as WebSocketGateway
    participant SUP as ChannelSupervisor
    participant BOT as Bot (start_fn)
    participant OPS as Operator (CLI/REST)

    GW->>SUP: run(name, bot, start_fn)
    loop Supervision loop
        SUP->>SUP: check manual_pause
        SUP->>BOT: await start_fn(name, bot)
        alt Clean exit
            BOT-->>SUP: returns normally
            SUP->>SUP: "record_success(), state=STOPPED"
            Note over SUP: loop exits
        else Recoverable error
            BOT-->>SUP: raises Exception
            SUP->>SUP: "record_error(), state=RUNNING (misleading during sleep)"
            SUP->>SUP: sleep_with_abort(delay, abort_signal)
            Note over SUP: retries indefinitely
        else Fatal / Conflict error
            BOT-->>SUP: raises Exception
            SUP->>SUP: "state=FAILED, loop exits"
            Note over SUP: coroutine done
        end
    end

    OPS->>SUP: reconnect(name) after FAILED
    SUP->>SUP: "state=STOPPED, abort_signal.set()"
    Note over SUP: no-op: loop already exited, bot never restarts

_{Reviews (2): Last reviewed commit: "fix: resolve critical gateway channel su..." | Re-trigger Greptile}

greptile-apps · 2026-06-12T23:01:28Z

+    last_error: Optional[str] = None
+    last_error_time: Optional[float] = None
+    next_retry_at: Optional[float] = None
+    total_recoveries: int = 0
+    manual_pause: bool = False


total_recoveries in ChannelStatus is never incremented

ConnectionMonitor.record_success() tracks recoveries in its own total_recoveries field, but ChannelStatus.total_recoveries (what the health endpoint and get_channel_supervision_status() actually return) is never updated. Any operator tool or dashboard consuming the health endpoint will always see "total_recoveries": 0, masking actual recovery events.
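A minimal sketch of the sync this comment asks for: copy the monitor's counter into the exported status right after a success is recorded. `Monitor` and `Status` below are hypothetical stand-ins; in particular, the rule that a success counts as a recovery only after at least one error is an assumption about `ConnectionMonitor`, not something confirmed by the diff.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical stand-in for ConnectionMonitor."""
    attempt: int = 0
    total_recoveries: int = 0

    def record_error(self) -> None:
        self.attempt += 1

    def record_success(self) -> None:
        # Assumption: a success after one or more errors is a recovery.
        if self.attempt > 0:
            self.total_recoveries += 1
        self.attempt = 0


@dataclass
class Status:
    """Hypothetical stand-in for ChannelStatus."""
    total_recoveries: int = 0


def on_success(monitor: Monitor, status: Status) -> None:
    """Record the success, then mirror the counter into the exported status."""
    monitor.record_success()
    status.total_recoveries = monitor.total_recoveries
```

Keeping a single source of truth (the monitor) and copying from it avoids two counters drifting apart.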

greptile-apps · 2026-06-12T23:01:30Z

+    def pause(self, name: str) -> bool:
+        """Manually pause a channel.
+
+        Args:
+            name: Channel name
+
+        Returns:
+            True if channel was running and paused, False otherwise
+        """
+        if name not in self._channels:
+            return False
+
+        status = self._channels[name]
+        if status.state == ChannelState.RUNNING:
+            status.state = ChannelState.PAUSED
+            status.manual_pause = True
+
+            # Signal abort to stop current operations
+            if name in self._abort_signals:
+                self._abort_signals[name].set()
+
+            logger.info(f"Channel '{name}' manually paused")
+            return True
+
+        return False


pause() cannot stop an actively running bot

pause() sets manual_pause = True and fires abort_signal — but abort_signal is only checked inside sleep_with_abort during the backoff window. When a bot is actively running (i.e. await start_fn(name, bot) on line 208 has not yet returned), there is no cooperative yield back to the supervision loop. The state is set to PAUSED and the signal is set, but the bot keeps polling/receiving messages until it crashes or exits on its own. A user calling praisonai gateway pause telegram will see a success response while the Telegram bot continues sending traffic.

greptile-apps · 2026-06-12T23:01:31Z

+    def cleanup(self, name: str) -> None:
+        """Clean up supervision state for a channel."""
+        self._channels.pop(name, None)
+        self._monitors.pop(name, None)
+        if name in self._abort_signals:
+            self._abort_signals[name].set()
+            self._abort_signals.pop(name, None)
+        if name in self._tasks:
+            task = self._tasks.pop(name)
+            if not task.done():
+                task.cancel()


_tasks dict is populated nowhere

cleanup() looks up the channel in self._tasks to cancel an outstanding asyncio.Task, but nothing in the class ever adds entries to _tasks. Bot coroutines are await-ed directly inside run() rather than wrapped with asyncio.create_task(). The dead reference means that when cleanup() is called (e.g., on shutdown), tasks are not cancelled through this code path.

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/praisonai/praisonai/gateway/server.py (1)
2054-2073: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Clean up supervisor state when channels stop or reload.

stop_channels() cancels the bot tasks and clears _channel_bots, but it never calls ChannelSupervisor.cleanup(). Because run() reuses any existing per-channel entry on the next start, stale manual_pause, error, and monitor state can survive a reload and keep a restarted channel parked or misreported.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/praisonai/praisonai/gateway/server.py` around lines 2054 - 2073,
stop_channels currently cancels tasks and clears _channel_bots but never invokes
ChannelSupervisor.cleanup, leaving per-channel supervisor state (e.g.,
manual_pause, error, monitor) stale across restarts; update stop_channels to
iterate existing supervisors (from _channel_bots or wherever supervisors are
stored), call each supervisor's cleanup() (or ChannelSupervisor.cleanup) before
clearing _channel_bots/_routing_rules, and ensure run() will create fresh
supervisor entries when restarted so stale state can't persist.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/praisonai/praisonai/cli/commands/gateway.py`:
- Around line 217-224: The response handlers that parse response.json()
currently only print failures and return which leaves the process exit code as
0; update each handler (the pause/resume/reconnect channel response blocks shown
where result = response.json() and prints success/failure — also at the other
two similar blocks referenced) to: when result.get("success") is false, print
the server-provided message using result.get("message", result.get("error",
"Unknown error")) and then terminate with a failing exit code (use sys.exit(1)
or raise SystemExit(1)); also ensure the except Exception as e blocks print the
error and similarly exit with a non-zero code.
- Around line 209-214: The current ws->http conversion simply string-replaces
the prefix and appends "/api/..." which corrupts URLs that already include a
path or query (e.g., ws://host:8765/ws or tokenized URLs); fix by parsing `url`
with urllib.parse.urlparse, switch scheme from "ws"->"http" and "wss"->"https",
rebuild a base REST URL using urlunparse (preserving existing path and query),
ensure the path ends with "/" and then join the API suffix (e.g.,
"api/channels/{name}/pause") using urllib.parse.urljoin or by concatenating onto
the normalized path so the final `rest_url` passed to `requests.post` is correct
for `rest_url` (the variable), replacing the current naive replace-and-append
logic.

In `@src/praisonai/praisonai/gateway/server.py`:
- Around line 726-751: The three new async handlers pause_channel_handler,
resume_channel_handler, and reconnect_channel_handler lack the existing
authentication gate; call the existing _check_auth(request) at the start of each
handler (await it if _check_auth is async) and if it fails return the same
unauthorized JSONResponse/behavior used by other /api handlers so
unauthenticated callers cannot pause/resume/reconnect channels; keep the rest of
each handler unchanged and reference the same channel methods
(self.pause_channel, self.resume_channel, self.reconnect_channel).

In `@src/praisonai/praisonai/gateway/supervisor.py`:
- Around line 211-212: ConnectionMonitor.total_recoveries is incremented via
monitor.record_success() but ChannelStatus.total_recoveries is never updated, so
health endpoints read zero; update the exported status by copying the monitor's
counter into the ChannelStatus instance (e.g., assign status.total_recoveries =
monitor.total_recoveries) when you build/return the status in
get_channel_supervision_status() (or immediately after monitor.record_success()
where status/state are updated) so /health and get_channel_supervision_status()
reflect real recovery counts.
- Around line 121-128: The resume/reconnect logic is clearing the per-channel
abort Event instead of setting it, so the supervision loop (which awaits
abort_signal.wait() in _supervise loop) never wakes; also reconnect leaves
status.manual_pause true so a reconnected channel immediately re-enters PAUSED. Fix
resume() (method name) and reconnect() to call .set() on
self._abort_signals[name] to wake the supervisor and ensure status.manual_pause
is set to False when restarting; update handling around ChannelState transitions
in the block that touches self._channels[name], status.state, and
status.manual_pause so a paused channel is unpaused and its abort event is
signaled to allow immediate restart by the supervisor.
- Around line 226-242: The code later unconditionally sets status.state =
ChannelState.STOPPED, which overwrites terminal FAILED states set for
conflict/fatal errors; modify that assignment in supervisor.py so it does not
overwrite an existing terminal failure: check status.state and only set
ChannelState.STOPPED if status.state is not ChannelState.FAILED (and not any
other terminal state you treat as final), e.g., wrap the STOPPED assignment in a
guard like "if status.state != ChannelState.FAILED:" (or check a set of terminal
states) so the FAILED last_error/last_error_time/next_retry_at are preserved for
operators; references: status, ChannelState, name, logger.
- Around line 203-208: The code awaits start_fn(name, bot) directly so
pause()/reconnect() can't interrupt a long-lived bot start; change to run
start_fn in a background asyncio.Task (e.g., create_task(start_fn(name, bot)))
and race that task against the abort/stop signal from the monitor (use
asyncio.wait or wait_for on monitor.abort_event/monitor.stop_future), storing
the task on the monitor (monitor.task) so other code can cancel it; if the abort
wins, cancel the task, await it to propagate cancellation (handle
asyncio.CancelledError), update status.state and cleanup, and if the start task
finishes first propagate success or errors as before. Ensure you reference
start_fn, monitor (monitor.abort_event / monitor.task), pause()/reconnect(), and
status.state/ChannelState.RUNNING when locating and updating the logic.

📥 Commits

Reviewing files that changed from the base of the PR and between 6f2df7d and 19b8f29.

📒 Files selected for processing (3)

src/praisonai/praisonai/cli/commands/gateway.py
src/praisonai/praisonai/gateway/server.py
src/praisonai/praisonai/gateway/supervisor.py

coderabbitai · 2026-06-12T23:03:38Z

+        # Convert WebSocket URL to HTTP for REST API
+        rest_url = url.replace("ws://", "http://").replace("wss://", "https://")
+        if not rest_url.endswith("/"):
+            rest_url += "/"
+
+        response = requests.post(f"{rest_url}api/channels/{name}/pause", timeout=10)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Parse --url before appending /api/...; the current join breaks real WS URLs and tokenized URLs.

Lines 210, 242, and 274 only work for a bare ws://host:port string. ws://host:8765/ws becomes http://host:8765/ws/api/..., and ?token=... ends up before the appended path, so these commands cannot target the actual WebSocket URL or an authenticated gateway reliably.

Also applies to: 241-246, 273-278
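A parsed conversion could look like the sketch below. It assumes the REST API is served from the gateway root, so any WebSocket path such as `/ws` is dropped, and it assumes the query string (for example a token) should be carried over; both are assumptions, and `rest_endpoint` is a hypothetical helper name.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def rest_endpoint(ws_url: str, name: str, action: str) -> str:
    """Build the REST URL for a channel action from a WebSocket URL."""
    parts = urlsplit(ws_url)
    # Map ws -> http and wss -> https; leave other schemes untouched.
    scheme = {"ws": "http", "wss": "https"}.get(parts.scheme, parts.scheme)
    # Assumption: the API lives at the root, so the WS path is discarded.
    path = f"/api/channels/{quote(name, safe='')}/{action}"
    # Keep the query string after the path, where it belongs.
    return urlunsplit((scheme, parts.netloc, path, parts.query, ""))
```

For example, `rest_endpoint("ws://host:8765/ws", "mybot", "pause")` yields `http://host:8765/api/channels/mybot/pause`. Quoting the channel name keeps a name containing `/` from changing the route.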

🧰 Tools

🪛 ast-grep (0.43.0)

[warning] 209-209: Do not make http calls without encryption
Context: "http://"
Note: [CWE-319].

(requests-http)

coderabbitai · 2026-06-12T23:03:38Z

+        result = response.json()
+        if result.get("success"):
+            print(f"✅ Channel '{name}' paused successfully")
+        else:
+            print(f"❌ Failed to pause channel '{name}': {result.get('error', 'Unknown error')}")
+
+    except Exception as e:
+        print(f"❌ Error pausing channel '{name}': {str(e)}")


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Return a failing exit code when the action is rejected.

When success is false, these handlers just print and return, so the shell still gets exit code 0. They also discard the server's message field and fall back to error, which turns a normal rejection into Unknown error.

Also applies to: 249-256, 281-288
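The behaviour requested here can be sketched with a small shared helper. `report` is a hypothetical name, and plain `SystemExit` is used to keep the example dependency-free; the actual CLI may prefer its framework's own exit mechanism.

```python
def report(action_past: str, name: str, result: dict) -> None:
    """Print the outcome of a channel action and exit non-zero on failure."""
    if result.get("success"):
        print(f"Channel '{name}' {action_past} successfully")
        return
    # Prefer the server's "message" field, then "error", then a fallback.
    reason = result.get("message", result.get("error", "Unknown error"))
    print(f"Failed: channel '{name}' was not {action_past}: {reason}")
    raise SystemExit(1)
```

A shell script calling `gateway pause` can then rely on the exit status instead of parsing the printed text.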

🧰 Tools

🪛 Ruff (0.15.15)

[warning] 223-223: Do not catch blind exception: Exception

(BLE001)

[warning] 224-224: Use explicit conversion flag

Replace with conversion flag

(RUF010)

coderabbitai · 2026-06-12T23:03:38Z

+        async def pause_channel_handler(request) -> JSONResponse:
+            """POST /api/channels/{name}/pause — pause a channel."""
+            channel_name = request.path_params["name"]
+            success = self.pause_channel(channel_name)
+            return JSONResponse({
+                "success": success,
+                "message": f"Channel '{channel_name}' {'paused' if success else 'could not be paused'}"
+            })
+
+        async def resume_channel_handler(request) -> JSONResponse:
+            """POST /api/channels/{name}/resume — resume a paused channel."""
+            channel_name = request.path_params["name"]
+            success = self.resume_channel(channel_name)
+            return JSONResponse({
+                "success": success,
+                "message": f"Channel '{channel_name}' {'resumed' if success else 'could not be resumed'}"
+            })
+
+        async def reconnect_channel_handler(request) -> JSONResponse:
+            """POST /api/channels/{name}/reconnect — reconnect a channel."""
+            channel_name = request.path_params["name"]
+            success = self.reconnect_channel(channel_name)
+            return JSONResponse({
+                "success": success,
+                "message": f"Channel '{channel_name}' {'reconnected' if success else 'could not be reconnected'}"
+            })


🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Protect the new channel-control endpoints with the existing auth check.

Unlike the other /api/* handlers in this file, Lines 726-751 never call _check_auth(request). Any unauthenticated caller that can reach the gateway can pause, resume, or reconnect channels.

Suggested fix

 async def pause_channel_handler(request) -> JSONResponse:
     """POST /api/channels/{name}/pause — pause a channel."""
+    auth_err = _check_auth(request)
+    if auth_err:
+        return auth_err
     channel_name = request.path_params["name"]
     success = self.pause_channel(channel_name)
     return JSONResponse({
@@
 async def resume_channel_handler(request) -> JSONResponse:
     """POST /api/channels/{name}/resume — resume a paused channel."""
+    auth_err = _check_auth(request)
+    if auth_err:
+        return auth_err
     channel_name = request.path_params["name"]
     success = self.resume_channel(channel_name)
     return JSONResponse({
@@
 async def reconnect_channel_handler(request) -> JSONResponse:
     """POST /api/channels/{name}/reconnect — reconnect a channel."""
+    auth_err = _check_auth(request)
+    if auth_err:
+        return auth_err
     channel_name = request.path_params["name"]
     success = self.reconnect_channel(channel_name)
     return JSONResponse({

coderabbitai · 2026-06-12T23:03:38Z

+        status = self._channels[name]
+        if status.state == ChannelState.PAUSED and status.manual_pause:
+            status.state = ChannelState.STOPPED  # Will be restarted by supervision
+            status.manual_pause = False
+
+            # Clear abort signal to allow restart
+            if name in self._abort_signals:
+                self._abort_signals[name].clear()


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Resume/reconnect never wake a fully paused channel.

Line 195 parks the supervision loop on await abort_signal.wait(), but resume() on Lines 126-128 clears that event instead of setting it, so an already-paused channel stays blocked forever. reconnect() on Lines 153-161 has the same wake-up problem and also leaves manual_pause set, so reconnecting a paused channel just falls back into the paused branch.

Suggested fix

 if status.state == ChannelState.PAUSED and status.manual_pause:
     status.state = ChannelState.STOPPED  # Will be restarted by supervision
     status.manual_pause = False

-    # Clear abort signal to allow restart
+    # Wake the paused supervision loop so it can re-enter startup
     if name in self._abort_signals:
-        self._abort_signals[name].clear()
+        self._abort_signals[name].set()
@@
 status = self._channels[name]
+status.manual_pause = False
 status.state = ChannelState.STOPPED
 status.last_error = None
 status.last_error_time = None
 status.next_retry_at = None

Also applies to: 153-161, 193-196

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/praisonai/praisonai/gateway/supervisor.py` around lines 121 - 128, The resume/reconnect logic is clearing the per-channel abort Event instead of setting it, so the supervision loop (which awaits abort_signal.wait() in _supervise loop) never wakes; also reconnect leaves status.manual_pause true so a resumed channel immediately re-enters PAUSED. Fix resume() (method name) and reconnect() to call .set() on self._abort_signals[name] to wake the supervisor and ensure status.manual_pause is set to False when restarting; update handling around ChannelState transitions in the block that touches self._channels[name], status.state, and status.manual_pause so a paused channel is unpaused and its abort event is signaled to allow immediate restart by the supervisor.
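The wake-up problem described above can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `parked_loop` and `resume_with` are hypothetical names, and the only assumption carried over from the review is that the paused loop is blocked on `asyncio.Event.wait()`.

```python
import asyncio


async def parked_loop(wake: asyncio.Event, log: list) -> None:
    """Stand-in for a supervision loop parked while its channel is paused."""
    log.append("parked")
    await wake.wait()  # blocks until some caller invokes wake.set()
    log.append("woken")


async def resume_with(method_name: str) -> list:
    """Park a loop, then try to resume it by calling set() or clear()."""
    wake = asyncio.Event()
    log: list = []
    task = asyncio.create_task(parked_loop(wake, log))
    await asyncio.sleep(0)  # give the loop a chance to reach wake.wait()

    getattr(wake, method_name)()  # the call a resume() would make

    try:
        await asyncio.wait_for(task, timeout=0.05)
    except asyncio.TimeoutError:
        log.append("still blocked")
    return log


def demo() -> dict:
    return {
        "clear": asyncio.run(resume_with("clear")),
        "set": asyncio.run(resume_with("set")),
    }
```

Calling `clear()` on an event that is already unset changes nothing, so the waiter never returns; only `set()` releases it.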

coderabbitai · 2026-06-12T23:03:38Z

+                status.state = ChannelState.RUNNING
+                logger.info(f"Starting channel '{name}'..." + 
+                           (f" (attempt {monitor.attempt + 1})" if monitor.attempt > 0 else ""))
+
+                # Start the bot
+                await start_fn(name, bot)


🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Pause/reconnect still can't interrupt a healthy running bot.

Line 208 awaits start_fn() inline, and the abort signal is only observed before startup or during backoff sleep. In the current gateway integration, start_fn is a long-lived await bot.start() / await _start_telegram_bot_polling(), so pause() and reconnect() will not actually stop a healthy channel until it exits on its own.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/praisonai/praisonai/gateway/supervisor.py` around lines 203 - 208, The code awaits start_fn(name, bot) directly so pause()/reconnect() can't interrupt a long-lived bot start; change to run start_fn in a background asyncio.Task (e.g., create_task(start_fn(name, bot))) and race that task against the abort/stop signal from the monitor (use asyncio.wait or wait_for on monitor.abort_event/monitor.stop_future), storing the task on the monitor (monitor.task) so other code can cancel it; if the abort wins, cancel the task, await it to propagate cancellation (handle asyncio.CancelledError), update status.state and cleanup, and if the start task finishes first propagate success or errors as before. Ensure you reference start_fn, monitor (monitor.abort_event / monitor.task), pause()/reconnect(), and status.state/ChannelState.RUNNING when locating and updating the logic.
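One way to read the suggestion above is as a race between the bot task and the abort event. The sketch below illustrates that pattern under stated assumptions: `run_until_aborted` and `fake_bot` are hypothetical names, and the real `start_fn(name, bot)` signature is reduced to a zero-argument coroutine function for brevity.

```python
import asyncio


async def run_until_aborted(start_fn, abort: asyncio.Event) -> str:
    """Run start_fn() as a task and stop it early if abort is set."""
    bot_task = asyncio.create_task(start_fn())
    abort_task = asyncio.create_task(abort.wait())

    done, _pending = await asyncio.wait(
        {bot_task, abort_task}, return_when=asyncio.FIRST_COMPLETED
    )

    if bot_task in done:
        # The bot exited on its own: drop the watcher and surface any error.
        abort_task.cancel()
        await asyncio.gather(abort_task, return_exceptions=True)
        bot_task.result()  # re-raises an exception raised by start_fn()
        return "exited"

    # The abort signal won: cancel the long-lived bot and let it unwind.
    bot_task.cancel()
    await asyncio.gather(bot_task, return_exceptions=True)
    return "aborted"


async def fake_bot() -> None:
    """Hypothetical long-lived bot, standing in for bot.start()."""
    await asyncio.sleep(3600)


async def demo() -> str:
    abort = asyncio.Event()
    supervisor = asyncio.create_task(run_until_aborted(fake_bot, abort))
    await asyncio.sleep(0.01)  # the bot is now "running"
    abort.set()                # what pause()/reconnect() would do
    return await supervisor
```

With this shape, the abort signal is observed while the bot is running, not only before startup or during backoff.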

coderabbitai · 2026-06-12T23:03:39Z

+                monitor.record_success()
+                status.state = ChannelState.STOPPED


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

total_recoveries never reaches the exported status object.

Line 211 increments ConnectionMonitor.total_recoveries, but ChannelStatus.total_recoveries is never synchronized from it. /health and get_channel_supervision_status() read the status object, so they will report 0 forever even after successful recoveries.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/praisonai/praisonai/gateway/supervisor.py` around lines 211 - 212, ConnectionMonitor.total_recoveries is incremented via monitor.record_success() but ChannelStatus.total_recoveries is never updated, so health endpoints read zero; update the exported status by copying the monitor's counter into the ChannelStatus instance (e.g., assign status.total_recoveries = monitor.total_recoveries) when you build/return the status in get_channel_supervision_status() (or immediately after monitor.record_success() where status/state are updated) so /health and get_channel_supervision_status() reflect real recovery counts.
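The synchronization gap can be shown with two small stand-in classes. This is a sketch only: `Monitor` and `Status` are hypothetical reductions of `ConnectionMonitor` and `ChannelStatus`, and it assumes, as the review states, that `record_success()` is what increments the counter.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical stand-in for ConnectionMonitor."""
    total_recoveries: int = 0

    def record_success(self) -> None:
        self.total_recoveries += 1


@dataclass
class Status:
    """Hypothetical stand-in for the exported ChannelStatus."""
    total_recoveries: int = 0


def on_success(monitor: Monitor, status: Status) -> None:
    """Record a recovery and mirror the counter onto the status object."""
    monitor.record_success()
    # Without this copy, readers of the status object would keep seeing 0.
    status.total_recoveries = monitor.total_recoveries
```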

coderabbitai · 2026-06-12T23:03:39Z

+                if is_conflict:
+                    # Conflict errors are fatal - another bot instance using same token
+                    status.state = ChannelState.FAILED
+                    status.last_error = f"Conflict error (fatal): {str(e)}"
+                    status.last_error_time = time.time()
+                    status.next_retry_at = None
+                    logger.error(f"Channel '{name}' failed with conflict error: {e}")
+                    break
+
+                elif not is_recoverable:
+                    # Non-recoverable error - treat as fatal
+                    status.state = ChannelState.FAILED  
+                    status.last_error = f"Fatal error: {str(e)}"
+                    status.last_error_time = time.time()
+                    status.next_retry_at = None
+                    logger.error(f"Channel '{name}' failed with fatal error: {e}")
+                    break


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't overwrite fatal channel state with STOPPED.

Lines 226-242 correctly classify conflict and fatal errors as FAILED, but Line 261 unconditionally rewrites the terminal state to STOPPED. That erases the parked-failure state this PR is supposed to surface to operators.

Also applies to: 260-262

🧰 Tools

🪛 Ruff (0.15.15)

[warning] 229-229: Use explicit conversion flag

Replace with conversion flag

(RUF010)

[warning] 238-238: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/praisonai/praisonai/gateway/supervisor.py` around lines 226 - 242, The code later unconditionally sets status.state = ChannelState.STOPPED, which overwrites terminal FAILED states set for conflict/fatal errors; modify that assignment in supervisor.py so it does not overwrite an existing terminal failure: check status.state and only set ChannelState.STOPPED if status.state is not ChannelState.FAILED (and not any other terminal state you treat as final), e.g., wrap the STOPPED assignment in a guard like "if status.state != ChannelState.FAILED:" (or check a set of terminal states) so the FAILED last_error/last_error_time/next_retry_at are preserved for operators; references: status, ChannelState, name, logger.
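The guard described above amounts to a small state rule. The sketch below is illustrative: the enum lists only the states named in this thread, the string values are assumed, and `state_after_cleanup` is a hypothetical helper rather than a function in `supervisor.py`.

```python
from enum import Enum


class ChannelState(Enum):
    """Subset of the states named in the review; values are assumed."""
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    FAILED = "failed"


# States that cleanup should never overwrite.
TERMINAL_STATES = {ChannelState.FAILED}


def state_after_cleanup(current: ChannelState) -> ChannelState:
    """Return the state that supervision cleanup should leave behind."""
    if current in TERMINAL_STATES:
        return current  # keep FAILED visible to operators
    return ChannelState.STOPPED
```

Keeping the terminal states in a set makes it easy to add further final states later without touching the cleanup path.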

MervinPraison · 2026-06-12T23:07:30Z

@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding.

Phase 1: Review per AGENTS.md

1. Protocol-driven: check heavy implementations vs core SDK
2. Backward compatible: ensure zero feature regressions
3. Performance: no hot-path regressions

Phase 2: FIX Valid Issues
4. For any VALID bugs or architectural flaws found by Gemini, CodeRabbit, Qodo, Copilot, or any other reviewer: implement the fix
5. Push all code fixes directly to THIS branch (do NOT create a new PR)
6. Comment a summary of exact files modified and what you skipped

Phase 3: Final Verdict
7. If all issues are resolved, approve the PR / close the Issue
8. If blocking issues remain, request changes / leave clear action items

praisonai-triage-agent · 2026-06-12T23:07:52Z

- Fix resume() method logic inversion that permanently froze paused channels
- Add missing authentication to channel control REST endpoints
- Sync total_recoveries metric from monitor to status object
- Fix CLI URL parsing to handle paths/queries correctly with urlparse
- Add proper exit codes to CLI commands on failure
- Preserve terminal FAILED states in supervision cleanup
- Clear manual_pause flag in reconnect() method
- Add supervisor state cleanup in stop_channels()

Fixes critical issues identified by Greptile, CodeRabbit, and Copilot reviewers.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>

greptile-apps · 2026-06-12T23:15:48Z

+    def reconnect(self, name: str) -> bool:
+        """Force reconnect of a channel.
+
+        Args:
+            name: Channel name
+
+        Returns:
+            True if channel exists, False otherwise
+        """
+        if name not in self._channels:
+            return False
+
+        # Reset monitor state and force restart
+        if name in self._monitors:
+            self._monitors[name].attempt = 0
+            self._monitors[name].last_error = None
+            self._monitors[name].last_error_time = None
+
+        status = self._channels[name]
+        status.state = ChannelState.STOPPED
+        status.manual_pause = False  # Clear manual pause flag
+        status.last_error = None
+        status.last_error_time = None
+        status.next_retry_at = None
+
+        # Signal abort to stop current operations
+        if name in self._abort_signals:
+            self._abort_signals[name].set()
+
+        logger.info(f"Channel '{name}' manually reconnected")
+        return True


reconnect() on a FAILED channel is a silent no-op that falsely reports success

After a conflict or non-recoverable error, the supervision while True loop exits via break and run() returns. At that point, self._abort_signals[name] still exists in the dict but nothing is await-ing it. Calling reconnect() updates state fields and calls self._abort_signals[name].set(), but since the coroutine driving the loop has already returned, the signal has no receiver. The state is flipped from FAILED to STOPPED (hiding the failure), yet the bot never actually restarts. The CLI returns ✅ Channel '...' reconnected successfully while the channel stays dead.

A true reconnect on a FAILED channel would need to re-schedule the run() coroutine as a new task. As written, reconnect() only works when the loop is currently sleeping in sleep_with_abort (i.e., during a transient-error backoff), not after the loop has terminated.
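A reconnect that also covers the terminated-loop case might look like the sketch below. It is an assumption-laden illustration, not the PR's `ChannelSupervisor`: `MiniSupervisor`, its `_tasks` dictionary, and the string return values are all hypothetical. The idea is to check whether the supervising task has finished and, if so, schedule a new one instead of signalling an event that nothing is waiting on.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict


class MiniSupervisor:
    """Hypothetical, reduced supervisor; not the PR's ChannelSupervisor."""

    def __init__(self, run_fn: Callable[[str], Coroutine[Any, Any, None]]) -> None:
        self._run_fn = run_fn
        self._tasks: Dict[str, asyncio.Task] = {}
        self._abort_signals: Dict[str, asyncio.Event] = {}

    def start(self, name: str) -> None:
        """Create a fresh abort event and schedule the supervision loop."""
        self._abort_signals[name] = asyncio.Event()
        self._tasks[name] = asyncio.create_task(self._run_fn(name))

    def reconnect(self, name: str) -> str:
        task = self._tasks.get(name)
        if task is None:
            return "unknown"
        if task.done():
            # The loop has exited (for example after a fatal error), so a
            # signal would have no receiver: schedule a new loop instead.
            self.start(name)
            return "rescheduled"
        # The loop is still alive: signalling it is enough.
        self._abort_signals[name].set()
        return "signalled"


async def demo() -> list:
    runs = []

    async def run_once(name: str) -> None:
        runs.append(name)  # returns at once, like a loop that hit `break`

    sup = MiniSupervisor(run_once)
    sup.start("telegram")
    await asyncio.sleep(0)  # let the first run finish
    outcome = sup.reconnect("telegram")
    await asyncio.sleep(0)  # let the rescheduled run finish
    return [outcome, len(runs)]
```

Returning a distinct outcome for each branch would also let a CLI report what actually happened rather than a blanket success message.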

praisonai-triage-agent Bot mentioned this pull request Jun 12, 2026

Gateway channel supervision: bots die permanently after 5 retries, with no error classification, reconnect watcher, or operator pause/resume #1913

Closed

Copilot AI requested a review from MervinPraison June 12, 2026 22:57

greptile-apps Bot reviewed Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

greptile-apps Bot reviewed Jun 12, 2026

MervinPraison merged commit 34d60ae into main Jun 13, 2026
16 checks passed

qodo-code-review Bot commented Jun 12, 2026

Qodo reviews are paused for this user.

greptile-apps Bot commented Jun 12, 2026

Greptile Summary

Confidence Score: 3/5

praisonai-triage-agent Bot commented Jun 12, 2026

Architecture Review & Bug Fixes ✅

✅ APPROVED - Ready to Merge
