
More atomic poller operations#843

Merged
mperham merged 2 commits into sidekiq:master from weheartit:master on Apr 12, 2013

Conversation


@bdurand bdurand commented Apr 11, 2013

I recently had an issue where my application lost a very large number of jobs. The problem was with future scheduled jobs being popped from the scheduled queue but never pushed onto one of the worker queues. Here are the steps that happened:

  1. We performed some redis maintenance which took the server down for a few minutes.
  2. Due to a bug in scheduled.rb (since fixed; see issue #309, "ERROR: scheduling poller thread died!"), the poller stopped querying the scheduled queue.
  3. When redis came back online, sidekiq started processing all workers executed with perform_async. Any workers called with perform_in were put on the scheduled queue but never popped off due to the stopped poller.
  4. Because sidekiq was processing some jobs just fine, monitoring did not pick up that there was a problem.
  5. After almost 20 hours sidekiq got restarted. This started the poller again.
  6. At this point there were over 2 million jobs in the scheduled queue.
  7. The poller on one of the sidekiq processes would have tried to pop all 2 million jobs at once off the scheduled queue. It then would have tried to insert them one at a time onto the appropriate worker queues. Something aborted the process somewhere between the pop and the last push. It could have been an out of memory error, or it could have been monitoring killing what looked like a runaway process due to the ballooning memory.

The end result was that most of the jobs were irretrievably lost. This had also happened two weeks earlier during some other redis maintenance.

This pull request changes the logic in Poller to pop messages one at a time from the retry and schedule queues and immediately push them to the appropriate worker queue. The new logic is:

  1. Get the next item in the queue if its score (time to execute) is <= now
  2. If an item exists, then try to remove the item from the queue
  3. If item successfully removed, then add it to the appropriate worker queue
  4. If item not successfully removed, then try again (it was likely removed by another process)
  5. Repeat until no items have a score that matches <= now
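The loop above can be sketched roughly as follows. `SortedSetStub` is a hypothetical in-memory stand-in for the Redis sorted set so the logic can be shown without a live server; the real poller issues the equivalent `zrangebyscore`/`zrem` commands against Redis, and `poll` is an illustrative name, not the actual method in this PR:

```ruby
# Hypothetical in-memory stand-in for a Redis sorted set.
class SortedSetStub
  def initialize(entries = {})
    @entries = entries # message => score (time to execute)
  end

  # Like ZRANGEBYSCORE key -inf max LIMIT 0 1: return up to one
  # member whose score is <= max.
  def zrangebyscore(max)
    @entries.select { |_, score| score <= max }.keys.first(1)
  end

  # Like ZREM: remove a member, returning 1 if it was present, else 0.
  def zrem(member)
    @entries.delete(member) ? 1 : 0
  end
end

# Pop due messages one at a time; yield each message that this
# process successfully removed, so it can be pushed onto its queue.
def poll(schedule, now)
  while (message = schedule.zrangebyscore(now).first)
    # Only the caller whose ZREM succeeds enqueues the job, so each
    # message is pushed at most once even with concurrent pollers;
    # a zrem result of 0 means another process won the race, so just retry.
    yield message if schedule.zrem(message) == 1
  end
end
```

With this shape, a crash between the `zrem` and the push can lose at most the single message in flight, rather than an entire popped batch.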

… something goes wrong during a large pop operation.
@coveralls

Coverage increased (+0.42%) when pulling 0c76c3b on weheartit:master into 387a646 on mperham:master.


mperham (Collaborator) commented Apr 11, 2013

@bdurand Ugh, sorry to hear about that.

What do you think about using a small batch size of, say, 20 rather than one at a time?

bdurand (Author) commented Apr 11, 2013

I couldn't figure out a way to do small batches since the zremrangebyscore command doesn't take a limit. I think popping them one at a time would do the most to reduce race conditions. The best way to solve the issue would be an atomic pop-and-push Lua script, but that would tie sidekiq to Redis 2.6 (maybe a feature for 3.0).
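For reference, the atomic Lua approach mentioned here might look roughly like this hypothetical sketch (not what this PR implements; it requires Redis 2.6+ for `EVAL`, and the key names are made up for illustration):

```ruby
# Hypothetical Lua script: atomically pop one due message from the
# schedule sorted set (KEYS[1]) and push it onto a queue list (KEYS[2]).
# Because the script runs atomically inside Redis, no message can be
# removed without also being enqueued.
ATOMIC_POP_PUSH = <<-LUA
  local msgs = redis.call('zrangebyscore', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, 1)
  if msgs[1] then
    redis.call('zrem', KEYS[1], msgs[1])
    redis.call('lpush', KEYS[2], msgs[1])
  end
  return msgs[1]
LUA

# Usage against a real connection (redis-rb), assuming Redis 2.6+:
#   conn.eval(ATOMIC_POP_PUSH, keys: ['schedule', 'queue:default'],
#             argv: [Time.now.to_f.to_s])
```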

mperham (Collaborator) commented on the diff:

I think the if is redundant, yes? The while should stop if message is falsy.

bdurand (Author) commented on the diff:

Yes, the if predates the while. I'll clean it up.

mperham (Collaborator) commented Apr 11, 2013

Yeah, you're right about the Redis commands. Maybe it's by design but this loop feels dirty and I guess we can't make it any cleaner.

Would you add a comment with your five-item list so the scheduler loop is documented in the code? Tests look fine.

@coveralls

Coverage increased (+0.45%) when pulling 8918096 on weheartit:master into 387a646 on mperham:master.


mperham added a commit that referenced this pull request Apr 12, 2013
More atomic poller operations
@mperham mperham merged commit 5a2ea7a into sidekiq:master Apr 12, 2013
mperham (Collaborator) commented Apr 12, 2013

Thank you for the hard work in finding and fixing this issue!

