rabbitmq: Make sure rabbitmq is running on cluster HA #1396
    # check that at least we have confirmed that the app is up for 5 times in a row as when
    # clustering we restart rabbit several times
    success = 0
    while success < 5
I am struggling to understand why 5 is a good value here. I don't mean to say that it's necessarily a bad value, just trying to see whether there is more meaning behind it 😄.
Not really, but 5 checks at 2 seconds each means the app has been up for 10 seconds in total, and that should be enough time to be sure that rabbitmq is really up and not in the start-stop-reset-start phase of creating a cluster.
Totally random, but I think it's a good value between making the check too long and making it so short that it could lead to failures again :)
Hope that makes sense?
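For reference, the "N successes in a row" idea can be sketched in plain Ruby roughly as below. This is only an illustration, not the code in this PR: `wait_until_stable`, its keyword arguments and the `max_checks` safety cap are hypothetical names, and the block stands in for whatever command actually checks rabbitmq.

```ruby
# Hypothetical helper: keep polling until the block has returned true
# `required` times in a row. Any failure resets the counter to zero.
# `max_checks` is an assumed safety cap so the loop cannot spin forever.
def wait_until_stable(required: 5, delay: 2, max_checks: 100)
  success = 0
  checks = 0
  while success < required
    raise "not stable after #{max_checks} checks" if checks >= max_checks
    checks += 1
    if yield
      success += 1
    else
      success = 0 # a single failed check restarts the count
    end
    sleep(delay)
  end
  checks
end

# Usage with canned results instead of a real rabbitmq check.
results = [true, false, true, true, true]
checks = wait_until_stable(required: 3, delay: 0) { results.shift }
puts "stable after #{checks} checks"
```

With the defaults (5 checks, 2 seconds apart) the happy path takes about 10 seconds, which is the cost being discussed here.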
Ok, that makes sense. And don't give me this "5 was totally random" story! I don't buy it 😉
I am concerned about what happens in CI though, especially when it is oversubscribed and timings just go out the window. A very quick Google search turned up this post, and I am now wondering whether we could use that to notify the ruby_block in the recipe, so that we don't have to guess how long to wait until pacemaker is finally done restarting things. What do you think?
Well, I think that in the case of rabbit that cannot solve the issue: in my tests, pacemaker would return a simple "$service is running on $host" even while the service is restarting. The restart is part of the agent script, which I don't think pacemaker has any visibility into, so it may not be possible :(
Although, reading the page you linked, it does look like a nice way of signalling that a resource is stable. It would need more changes than this PR, but we should definitely explore it. Maybe my last comment is not correct, as it was not using this but the generic crm resource show $RESOURCE.
I'm thinking of a follow-up PR after some investigation into that. Sounds good? Thoughts @nicolasbock?
    Chef::Log.debug("#{ms_name} still not answering")
    success = 0
    end
    sleep(2)
Not a big fan of this. Can't we look at some crm output to see if the resource is stable? Right now, this means we wait 10s for no great reason every time we run chef...
34621d4 to 708d813
@vuntz @nicolasbock updated with a simpler check: first for a master, then for the local rabbit. The combination of both should make sure that we only continue when things are calm and the restarts are over.
@Itxaka the CI is failing and that looks related to the patch
This doesn't really work either: as soon as pacemaker has chosen a master, the output of
time chef continued the run: at that same time, rabbitmq was still restarting:
parsing the output of
8fdb2c2 to 363afb0
vuntz left a comment
Let me approve, even though I have some minor nitpicks.
      end
      # Check that we dont have any pending resource operations
      cmd = "crm resource operations #{ms_name} 2> /dev/null "
      cmd << "|grep -q \"pending\""

      end
      rescue Timeout::Error
    - message = "The #{ms_name} pacemaker resource is not started. Please manually check for an error."
    + message = "The #{ms_name} pacemaker resource is not started or doesnt have a master yet."
IIRC this also didn't catch all scenarios, as there was a brief period of time where all the checks can pass and we are still restarting ¬_¬
363afb0 to 17a6010
      # Check that the service has a master
      cmd = "crm resource show #{ms_name} 2> /dev/null "
    - cmd << "| grep -q \"is running on\""
    + cmd << "| grep -q \"is running on\" | grep -q \"Master\""
I think that if the first grep uses -q, it won't pass anything to the next grep, so the pipeline will always fail. I think the correct line is:
cmd << "| grep \"is running on\" | grep -q \"Master\""
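A quick way to see the difference, as a small sketch: the sample line below is an assumed stand-in for the `crm resource show` output (not captured from a real cluster), and `pipeline_ok?` is a hypothetical helper. It needs a POSIX shell with `printf` and `grep`.

```ruby
# Assumed sample line; the real crm output may be formatted differently.
SAMPLE = "resource ms-rabbitmq is running on: node1 Master"

# Feeds SAMPLE into the given shell pipeline. Kernel#system returns true
# when the last command of the pipeline exits with status 0.
def pipeline_ok?(pipeline)
  system("printf '%s\\n' '#{SAMPLE}' | #{pipeline}")
end

# grep -q prints nothing, so the second grep reads empty input and fails.
broken = pipeline_ok?("grep -q 'is running on' | grep -q 'Master'")
# Without -q the matching line is passed on to the second grep.
fixed = pipeline_ok?("grep 'is running on' | grep -q 'Master'")

puts "with -q on the first grep: #{broken}"
puts "without -q on the first grep: #{fixed}"
```

Only the last grep in the pipeline should use -q, since its exit status is the one the check relies on.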
17a6010 to a000382
a000382 to 369fe08
      # Check that the service has a master
      cmd = "crm resource show #{ms_name} 2> /dev/null "
    - cmd << "| grep -q \"is running on\""
    + cmd << "| grep \"is running on\" | grep -q \"Master\""
I would swap the check for Master and the check for locally running. You need rabbitmq running locally on all nodes before you can get a Master.
ba7f458 to 5dc32c7
5dc32c7 to 74fc830
74fc830 to 45c60eb
@vuntz @AbelNavarro re-review? The main changes are: the checks are reordered as Abel suggested, the rabbit check is only triggered when the transaction is triggered, and there is a sync at the end of the recipe so that all the checks have passed on all the nodes.
45c60eb to 15ebe61
    end
    end # block
    action :nothing
    end # ruby_block
Style/CommentedKeyword: Do not place comments on the same line as the end keyword.
    Chef::Log.fatal(message)
    raise message
    end
    end # block
Style/CommentedKeyword: Do not place comments on the same line as the end keyword.
    end

    # wait for service to have a master, and to be active
    ruby_block "wait for #{ms_name} to be started" do
Metrics/BlockLength: Block has too many lines. [42/40]
4a3de28 to d7f7f76
btw this is conflicting with #1637... I assume one needs to be rebased after the first is merged...
@jsuchome I will rebase this afterwards I guess, no problems there
You can rebase now
d7f7f76 to 4bde235
    # wait for service to have a master, and to be active
    ruby_block "wait for #{ms_name} to be started" do
    block do
Metrics/BlockLength: Block has too many lines. [42/40]
    end

    # wait for service to have a master, and to be active
    ruby_block "wait for #{ms_name} to be started" do
Metrics/BlockLength: Block has too many lines. [45/40]
I am using this PR as a prerequisite for another one, and these violations don't affect execution.
Rebased! @jsuchome the check for upgrade was moved to the top, as the first check now checks whether the service is running locally, which we don't want to do in the upgrade case.
4bde235 to 13e4f2b
Needs a rebase again.
As the resource agent for rabbitmq with cluster HA restarts the rabbitmq service several times, the current check can fail to validate the rabbitmq status: it could run at one of the moments when rabbit happens to be up while creating/joining the cluster. If the check then passes and the chef execution continues, the next steps could fail, as they depend on a running rabbitmq while the rabbitmq server may still be restarting. Instead, expand the checks to first look for a rabbit master for the resource, and expand the check for a locally running rabbit to make sure we are checking the local copy. Also add an extra check after the crm checks to make sure there are no pending operations for the resource, so we can try to avoid continuing while a promotion is going on.
As the other checks are not enough, because pacemaker keeps restarting rabbitmq, we need a more robust way of checking that rabbit has reached a stable state. So check that rabbit is up 5 times in a row, with a delay of 2 seconds between checks, to make sure pacemaker has left it alone. Also, only trigger that check for rabbit if the pacemaker_transaction is updated; otherwise there is no need to do so.
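The overall shape described in these commit messages (several checks run in order, each polled until it passes, all under one timeout) could look roughly like the sketch below. It is a plain-Ruby illustration under stated assumptions, not the recipe code: `wait_for_checks` and its parameters are hypothetical, and the lambdas are stand-ins for the real crm and rabbitmq commands.

```ruby
require "timeout"

# Hypothetical helper: run each named check in order, polling until it
# passes, and fail if the whole sequence takes longer than `limit` seconds.
def wait_for_checks(checks, limit:, delay: 2)
  current = nil
  Timeout.timeout(limit) do
    checks.each do |name, check|
      current = name
      sleep(delay) until check.call
    end
  end
  true
rescue Timeout::Error
  raise "timed out after #{limit}s waiting for: #{current}"
end

# Stand-in checks that only pass after a given number of attempts.
attempts = Hash.new(0)
flaky = ->(key, needed) { -> { (attempts[key] += 1) >= needed } }

checks = {
  "rabbitmq running locally" => flaky.call(:local, 2),
  "resource has a master" => flaky.call(:master, 3),
  # For the real command this passes when grep finds no "pending" line.
  "no pending operations" => -> { true },
}

wait_for_checks(checks, limit: 5, delay: 0)
puts "all checks passed"
```

Keeping the local check first matches the ordering suggested in the review, and naming each check makes the timeout message say which one never passed.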
13e4f2b to 8b56894
Was this tested during the upgrade?
Only the backport may affect the upgrade, won't it? In any case, testing it.
Nope. This part is executed on the first upgraded node during the chef-client run.
Me too. Actually, not only does it not seem to break anything, it's possible it's fixing one of our other rabbitmq-related problems...
Solving stuff and not even knowing it 👯‍♂️
I was too optimistic, it does not seem to solve the problem I was watching. But it also does not seem to break anything...