fix(scoring): re-enqueue scoring after commit to avoid stuck SCORING …#2420

Open
AybH26 wants to merge 1 commit into codalab:develop from AybH26:fix/scoring-stuck-on-broker-error

Conversation

@AybH26 AybH26 commented Jun 16, 2026

Contributor

@ mention of reviewers

@ @

A brief description of the purpose of the changes contained in this PR

Submissions could get stuck in SCORING indefinitely because the Celery task that runs the next phase was enqueued inside the outer Django transaction, before the row was committed. The compute_worker could dequeue and execute the task before the new status (and updated FKs like queue / celery_task_id) were visible in the database, then bail out or operate on stale data.

This PR moves the send_task call into a transaction.on_commit() callback so the message hits RabbitMQ only after the surrounding transaction is committed. Behaviour is otherwise unchanged.

Issues this PR resolves

Closes #2419

Symptoms reported:

  • Submissions stuck in SCORING with no worker activity.
  • Sporadic Submission.DoesNotExist / stale-read errors in compute_worker logs right after a status transition.
  • celery_task_id occasionally NULL on rows that did get picked up.

Root cause: app.send_task(...) was called from inside the outer @transaction.atomic scope of _run_submission, so the broker received the task before PostgreSQL committed the writes. Under load (or with a fast worker / slow commit), the worker won the race.
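
The race described above can be reproduced in miniature with two database connections. The sketch below uses SQLite from the Python standard library rather than PostgreSQL and Celery, and its table and function names are made up for illustration. It only shows that a second connection (standing in for the worker) still reads the old status until the writer commits.

```python
import os
import sqlite3
import tempfile


def demo_read_before_commit():
    """Return the status a second connection sees before and after the commit."""
    path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")

    # "Web" connection: creates the row and later updates it in a transaction.
    writer = sqlite3.connect(path)
    writer.execute("CREATE TABLE submission (id INTEGER PRIMARY KEY, status TEXT)")
    writer.execute("INSERT INTO submission (id, status) VALUES (1, 'Running')")
    writer.commit()

    # "Worker" connection: an independent reader of the same database.
    worker = sqlite3.connect(path)
    query = "SELECT status FROM submission WHERE id = 1"

    # The update opens a transaction on the writer that is not yet committed.
    writer.execute("UPDATE submission SET status = 'Scoring' WHERE id = 1")

    # A task published at this point would be processed against stale data.
    seen_before = worker.execute(query).fetchall()[0][0]

    writer.commit()
    seen_after = worker.execute(query).fetchall()[0][0]

    writer.close()
    worker.close()
    return seen_before, seen_after


if __name__ == "__main__":
    print(demo_read_before_commit())
```

Publishing the task only after the commit corresponds to the second read, where the worker sees the new status.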

Fix: wrap the enqueue + celery_task_id write in a _enqueue_after_commit() closure and register it via transaction.on_commit(...). The closure runs only when the outer transaction commits successfully, and is silently dropped on rollback (no orphaned messages on the broker).
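
The commit and rollback behaviour relied on here can be illustrated with a toy model. This is not Django's implementation and not the code in this PR: ToyTransaction, run_submission and the list-based broker are hypothetical stand-ins, written only to show that a deferred callback fires after a successful exit and is discarded when the block raises.

```python
from typing import Callable, List


class ToyTransaction:
    """Stand-in for an atomic block: queues callbacks until commit."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[], None]] = []

    def on_commit(self, callback: Callable[[], None]) -> None:
        # Nothing runs yet; the callback is only remembered.
        self._callbacks.append(callback)

    def __enter__(self) -> "ToyTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        callbacks, self._callbacks = self._callbacks, []
        if exc_type is None:
            # Commit path: writes are durable, so publishing is now safe.
            for callback in callbacks:
                callback()
        # Rollback path: callbacks are dropped. False lets the error propagate.
        return False


def run_submission(broker: List[str], submission_id: int, fail: bool = False) -> None:
    """Hypothetical enqueue flow; `broker` is a plain list acting as the queue."""
    with ToyTransaction() as tx:
        # Status / queue writes would happen here in the real task.
        tx.on_commit(lambda: broker.append(f"score:{submission_id}"))
        if fail:
            raise RuntimeError("simulated failure before commit")


if __name__ == "__main__":
    queue: List[str] = []
    run_submission(queue, 1)
    try:
        run_submission(queue, 2, fail=True)
    except RuntimeError:
        pass
    print(queue)
```

In the example run, only the first submission produces a message; the second one raises before commit, so its callback never executes.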

A checklist for hand testing

  • Create a fresh submission on a competition with a non-default queue → it reaches Finished (no stuck SCORING).
  • Create a fresh submission on a competition with the default queue → same.
  • Re-run an existing submission via the UI → it reaches Finished.
  • Submit, then immediately roll back the surrounding request (e.g. force an exception in a signal) → confirm no orphan message hits compute-worker (RabbitMQ management UI shows no dangling delivery).
  • Inspect submission.celery_task_id after enqueue → not NULL.
  • Restart compute_worker mid-submission lifecycle → submission still completes (does not regress M6 idempotency).
  • Cancel a SUBMITTED submission before its commit completes → celery_app.control.revoke(...) still works because the celery_task_id is set inside the same on_commit callback.

Any relevant files for testing

  • Modified: src/apps/competitions/tasks.py (around _run_submission: the _enqueue_after_commit closure + transaction.on_commit(...)).
  • Imports: from django.db import transaction (already present).
  • No model / migration changes required.

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCi tests are passing
  • Ready to merge

…rows

When the compute worker PATCHes a submission to status=SCORING, the API serializer used to call run_submission() synchronously inside the same DB transaction. If the broker (RabbitMQ) was unreachable at that exact moment, the status row would commit but the scoring task would never be published, leaving the submission stuck in SCORING forever (no recovery: the 24h cleanup only rescues RUNNING rows).

Move the enqueue into transaction.on_commit so the task is only published after the SCORING status is durably committed, and explicitly mark the submission as Failed (with a clear status_details) if the publish still fails, so the row never stays in a non-terminal limbo state. Wrap update() in @transaction.atomic to make the commit boundary explicit.
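
The "mark as Failed if the publish still fails" step could look roughly like the sketch below. It assumes a simplified Submission dataclass and a publish callable that returns a task id. Both are hypothetical, and the real code would persist the model and likely catch narrower broker exceptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Submission:
    """Simplified stand-in for the model; field names follow the PR text."""
    id: int
    status: str = "Scoring"
    status_details: Optional[str] = None
    celery_task_id: Optional[str] = None


def enqueue_after_commit(submission: Submission, publish: Callable[[int], str]) -> None:
    """Callback meant to run once the SCORING status has been committed."""
    try:
        # Happy path: remember the task id so the task can be revoked later.
        submission.celery_task_id = publish(submission.id)
    except Exception as exc:
        # Publish failed: move the row to a terminal state instead of
        # leaving it in SCORING with no task behind it.
        submission.status = "Failed"
        submission.status_details = f"Could not enqueue scoring task: {exc}"


def unreachable_broker(submission_id: int) -> str:
    """Hypothetical publisher that simulates a broker outage."""
    raise ConnectionError("broker unreachable")


if __name__ == "__main__":
    sub = Submission(id=7)
    enqueue_after_commit(sub, unreachable_broker)
    print(sub.status, "|", sub.status_details)
```
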
@Didayolo Didayolo requested a review from ObadaS June 19, 2026 11:28

Development

Successfully merging this pull request may close these issues.

Submissions stuck in "Scoring" when broker error occurs during compute_worker PATCH (non-transactional re-enqueue)

2 participants