Skip to content

fix: add retry with backoff to _update_submission and stop swallowing…#2416

Open
dconstancy wants to merge 2 commits into
codalab:developfrom
dconstancy:develop
Open

fix: add retry with backoff to _update_submission and stop swallowing…#2416
dconstancy wants to merge 2 commits into
codalab:developfrom
dconstancy:develop

Conversation

@dconstancy

@dconstancy dconstancy commented Jun 16, 2026

Copy link
Copy Markdown

Title: fix: add retry with backoff to _update_submission() and stop swallowing terminal exceptions


Description

_update_submission() performs a single PATCH request with no retry. On any transient network error (502, timeout, connection reset), the submission result is permanently lost. The caller _update_status() silently swallows all exceptions, so the compute worker moves on while Django never receives the final status. The submission stays stuck in Scoring or Running forever with no user-facing error.

This PR adds exponential-backoff retry to _update_submission() and re-raises exceptions for terminal statuses so submissions no longer hang silently.

Issues this PR resolves

Fixes #2415

Changes

compute_worker/compute_worker.py:

  • _update_submission(): add retry loop with exponential backoff (5 attempts, 2^n seconds + jitter). Catches both HTTP error responses and network-level RequestException. Only raises after all retries are exhausted.
  • _update_status(): re-raise exceptions for terminal statuses (Finished, Failed) so Celery marks the task as failed and the submission transitions to Failed in the UI instead of hanging silently. Intermediate statuses (Preparing, Running, Scoring) remain silently caught — losing an intermediate status update is not critical.
  • HTTP session adapter: increase total retries from 3 to 5, add status_forcelist=[502, 503, 504] and allowed_methods=["PATCH", "GET", "PUT", "POST"] to automatically retry on server errors at the transport level.

A checklist for hand testing

  • Trigger a submission and observe successful status updates in compute worker logs
  • Simulate a 502 response (e.g. restart Django mid-submission) and confirm retry succeeds
  • Confirm submissions reach Finished or Failed state and no longer hang in Scoring/Running

Backward compatibility

  • No API changes, no migration needed
  • Retry is transparent to Django — same PATCH payload, just sent multiple times if needed
  • PATCH to /api/submissions/{id}/ is idempotent (sets status to a fixed value), so retries are safe
  • Intermediate status exceptions remain silently caught (no behavior change for non-terminal updates)
  • Only terminal status failures (Finished, Failed) now propagate — this is strictly better than the previous silent loss

Checklist

  • Code review by me
  • Hand tested by me
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCi tests are passing
  • Ready to merge

@IdirLISN

IdirLISN commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Hello, i reviewed the code, i think the changes are good but some details seems strange like:

        url = f"{self.submissions_api_url}/submissions/{self.submission_id}/"
        data["secret"] = self.secret

        logger.info(f"Updating submission @ {url} with data = {data}")

This seems to print some secret in the logs.

Secondly it seems you use a custom retry but urllib retry is alreay imported, also you add inside of the allowed methods, the method post: allowed_methods=["PATCH", "GET", "PUT", "POST"]
This shouldn't be there and the request should be hard implemented. This reduce the risk of post abuse.

@dconstancy

Copy link
Copy Markdown
Author

Hello, thanks for the feedback! I've updated the PR accordingly:

  • The secret was already present in the logged data — removed it from the log line in _update_submission.
  • Dropped the custom retry loop — no point in reinventing it when urllib3.Retry already does the job at the session level.
  • Removed POST from allowed_methods since we don't retry POST calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_update_submission() has no retry and _update_status() silently swallows exceptions, causing permanent submission loss

2 participants