fix(supervisor): Gracefully handle connection errors when interacting with supervisor#298
Conversation
| self._get_rpc_client().supervisor.shutdown() | ||
| except xmlrpc.client.Fault as e: | ||
| raise SupervisorError(f"Failed to stop supervisor: {e.faultString}") | ||
| except (SupervisorConnectionError, socket.error, ConnectionRefusedError): | ||
| # Supervisor is already down, nothing to do | ||
| pass | ||
|
|
||
| def start_process(self, name: str) -> None: | ||
| if self._is_program_running(name): |
There was a problem hiding this comment.
Bug: The start_process method lacks error handling for connection failures, creating a race condition where a thread can crash if the supervisor becomes unavailable after the initial status check.
Severity: HIGH | Confidence: High
🔍 Detailed Analysis
A race condition exists in the start_process method. The function first checks if a process is running and then attempts to start it. However, the supervisor daemon could become unavailable between these two operations. Because the underlying XML-RPC client connects lazily, a socket.error or ConnectionRefusedError can be raised during the start attempt. The current implementation only handles xmlrpc.client.Fault, leaving these connection errors unhandled. This is especially problematic as start_process is called from concurrent threads in up.py, which could lead to thread crashes.
💡 Suggested Fix
Update the try...except block in the start_process method to also catch SupervisorConnectionError, socket.error, and ConnectionRefusedError, similar to the error handling already implemented in the stop_process method. This will prevent thread crashes if the supervisor daemon is unreachable.
🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: devservices/utils/supervisor.py#L300-L308
Potential issue: A race condition exists in the `start_process` method. The function
first checks if a process is running and then attempts to start it. However, the
supervisor daemon could become unavailable between these two operations. Because the
underlying XML-RPC client connects lazily, a `socket.error` or `ConnectionRefusedError`
can be raised during the start attempt. The current implementation only handles
`xmlrpc.client.Fault`, leaving these connection errors unhandled. This is especially
problematic as `start_process` is called from concurrent threads in `up.py`, which could
lead to thread crashes.
Did we get this right? 👍 / 👎 to inform future reviews.
Reference ID: 7674921
Fixes DEVSERVICES-7Y. The issue was that:
_is_program_runningfails to catch socket connection errors when supervisor is unavailable, crashing concurrent threads._is_program_runningto catchSupervisorConnectionError,socket.error, andConnectionRefusedError, assuming the process is not running if the supervisor is unreachable.shutdownto silently pass if connection errors occur, indicating the supervisor is already down.stop_processto silently pass if connection errors occur, indicating the supervisor is unavailable and the process is stopped.This fix was generated by Seer in Sentry, triggered by hubert.deng@sentry.io. 👁️ Run ID: 7674042
Not quite right? Click here to continue debugging with Seer.