Describe the bug
Some jobs are paused at dispatch time (user over quota), but Pulsar tries to stage in the files anyway. Pulsar then gets a 403 response from Galaxy, and the job state is set to error.
Here is the handler log for a recent kraken job (IDs changed):
Jul 09 16:34:50 galaxy-handlers galaxyctl[152497]: galaxy.jobs.handler DEBUG 2025-07-09 16:34:50,101 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 12345678
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.mapper DEBUG 2025-07-09 16:44:33,229 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] (12345678) Mapped job to destination id: pulsar-qld-high-mem2
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.handler DEBUG 2025-07-09 16:44:33,232 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] (12345678) Dispatching to pulsar-qld-high-mem2_runner runner
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.objectstore DEBUG 2025-07-09 16:44:33,304 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] Using preferred backend 'data30' for creation of Job 12345678
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs DEBUG 2025-07-09 16:44:33,306 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] (12345678) Working directory for job is: /mnt/scratch/job_working_directory/012/287/12345678
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs INFO 2025-07-09 16:44:33,310 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] (12345678) User (99999) is over quota: job paused
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs DEBUG 2025-07-09 16:44:33,310 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] Pausing Job '12345678', Execution of this dataset's job is paused because you were over your disk quota at the time it was ready to run
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.runners DEBUG 2025-07-09 16:44:33,325 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] Job [12345678] queued (92.527 ms)
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.handler INFO 2025-07-09 16:44:33,331 [pN:handler_3,p:152497,tN:JobHandlerQueue.monitor_thread] (12345678) Job dispatched
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs DEBUG 2025-07-09 16:44:33,408 [pN:handler_3,p:152497,tN:PulsarJobRunner.work_thread-2] Job wrapper for Job [12345678] prepared (59.916 ms)
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.command_factory INFO 2025-07-09 16:44:33,416 [pN:handler_3,p:152497,tN:PulsarJobRunner.work_thread-2] Built script [/mnt/scratch/job_working_directory/012/287/12345678/tool_script.sh] for tool command [export LC_ALL=C && kraken --version > /mnt/pulsar/files/staging/12345678/outputs/COMMAND_VERSION 2>&1;
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: if [ -d '/cvmfs/data.galaxyproject.org/managed/kraken_database/viruses/Viruses' ]; then export KRAKEN_DEFAULT_DB='/cvmfs/data.galaxyproject.org/managed/kraken_database/viruses/Viruses'; else export KRAKEN_DEFAULT_DB='/cvmfs/data.galaxyproject.org/managed/kraken_database/viruses'; fi && kraken --threads ${GALAXY_SLOTS:-1} --db "$KRAKEN_DEFAULT_DB" --fastq-input '/mnt/pulsar/files/staging/12345678/inputs/dataset_X.dat' '/mnt/pulsar/files/staging/12345678/inputs/dataset_Y.dat' --paired > '/mnt/pulsar/files/staging/12345678/outputs/dataset_Z.dat']
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.runners.pulsar DEBUG 2025-07-09 16:44:33,451 [pN:handler_3,p:152497,tN:PulsarJobRunner.work_thread-2] Registering tool_script for Pulsar transfer [/mnt/scratch/job_working_directory/012/287/12345678/tool_script.sh]
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: pulsar.client.client INFO 2025-07-09 16:44:33,464 [pN:handler_3,p:152497,tN:PulsarJobRunner.work_thread-2] Job published to setup message queue: 12345678
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs.runners.pulsar INFO 2025-07-09 16:44:33,465 [pN:handler_3,p:152497,tN:PulsarJobRunner.work_thread-2] Pulsar job submitted with job_id 12345678
Jul 09 16:44:33 galaxy-handlers galaxyctl[152497]: galaxy.jobs DEBUG 2025-07-09 16:44:33,465 [pN:handler_3,p:152497,tN:PulsarJobRunner.work_thread-2] (12345678) Persisting job destination (destination id: pulsar-qld-high-mem2)
There is nothing in the handler log for this job ID after this point. The job's state is now 'error', with an update time of 17:21. The Pulsar logs show that Pulsar tried to stage in tool_script.sh 20 times (per our retry settings) and then published a Pulsar state change with status 'failed' for the job (PulsarMQJobRunner).
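Note the timestamps above: the job is paused at 16:44:33,310 ("User (99999) is over quota: job paused"), yet the same job is prepared and submitted to Pulsar at 16:44:33,465. That suggests the runner does not re-check the job's state between grabbing it and dispatching it. A minimal sketch of the kind of guard that would close this window (function and state names are hypothetical, not Galaxy's actual API):

```python
# Hypothetical sketch: re-check a job's current state immediately before
# staging/submission, so a job paused after being grabbed is not dispatched.
NON_DISPATCHABLE_STATES = {"paused", "deleted", "error"}


def should_dispatch(current_job_state: str) -> bool:
    """Return False for jobs that were paused (e.g. over quota) or otherwise
    stopped after the handler grabbed them but before submission."""
    return current_job_state not in NON_DISPATCHABLE_STATES


# In this report's scenario the job is 'paused' at submit time:
assert should_dispatch("paused") is False
assert should_dispatch("queued") is True
```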
Galaxy Version and/or server at which you observed the bug
Galaxy Version: release_25.0
Commit: bd965fc7334699b6c91cff9a0677081a6d11bdc5
Expected behavior
A paused job should not be dispatched.
Pulsar should not be able to change the state of a paused job.
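Even with a pre-dispatch check, a setup message may already be in flight when the job is paused, so the second expectation matters too: a remote 'failed' status from Pulsar should not override a state Galaxy has since set to paused. A hedged sketch of that idea, again with hypothetical names rather than Galaxy's actual API:

```python
# Hypothetical sketch: when a remote status update arrives from Pulsar,
# preserve a locally-paused job's state instead of letting the remote
# 'failed' (e.g. from staging retries exhausting) overwrite it.
LOCALLY_FINAL_STATES = {"paused", "deleted"}


def apply_remote_status(current_state: str, remote_status: str) -> str:
    """Return the state the job should end up in after a Pulsar update."""
    if current_state in LOCALLY_FINAL_STATES:
        # Galaxy paused/deleted the job; ignore the remote failure.
        return current_state
    return remote_status


# The failure mode in this report: paused job flipped to error by Pulsar.
assert apply_remote_status("paused", "failed") == "paused"
assert apply_remote_status("running", "failed") == "failed"
```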