Skip to content

rabbitmq: Make sure rabbitmq is running on cluster HA#1396

Merged
Itxaka merged 2 commits into
crowbar:masterfrom
Itxaka:increase_rabbitmq_cluster_check_ready
Jul 3, 2018
Merged

rabbitmq: Make sure rabbitmq is running on cluster HA#1396
Itxaka merged 2 commits into
crowbar:masterfrom
Itxaka:increase_rabbitmq_cluster_check_ready

Conversation

@Itxaka

@Itxaka Itxaka commented Oct 24, 2017

Copy link
Copy Markdown
Member

As the resource agent for rabbitmq with cluster HA restart the rabbitmq
service several times, the current check can fail to validate rabbitmq
status, as it could do the check just on one of those times that rabbit
is up while creating/joining the cluster. Then if the check passed and
continued the chef execution, the next steps could fail as they are
dependant on having a running rabbitmq, while the rabbitmq server may
still be restarting.

Instead expand the checks to first look for a rabbit master for the
resource and expand the check for a local runing rabbit to make sure
we are checking for the local copy.

# check that at least we have confirmed that the app is up for 5 times in a row as when
# clustering we restart rabbit several times
success = 0
while success < 5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am struggling to understand why 5 is a good value here. I don't mean to say that it's necessarily a bad value, just trying to see whether there is more meaning behind it 😄.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really but 5 * 2 seconds per check makes a total of 10 seconds where the app has been running, and that should be time enough to make sure that indeed rabbitmq is already up and not on the start-stop-reset-start phase when creating a cluster.

totally random but I think its just a good value between making the check too long and making it too short that it could lead to failures again :)

hope that makes sense?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. And don't give me this "5 was totally random" story! I don't buy it 😉

I am concerned about what happens in CI though, especially when it is oversubscribed and timings just go to out the window. A very quick Google search turned up this post and I am wondering now whether it would be possible to use that to notify the ruby_block in the recipe so that we don't have to guess how long to wait until pacemaker is finally done restarting things. What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I think that in the case of rabbit the issue cannot be solved by that as on my tests, pacemaker would return a simple "$service is running on $host" even if the service is restarting. and the service restarting is part of the agent script, which I dont think pacemaker has any visibility of, so it may not be possible :(

@Itxaka Itxaka Oct 25, 2017

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although reading the page you linked, it seems like a nice way of notifying that a resource is stable, although it would need more changes than this, we should definitely explore that. Maybe my last post is not correct as its not using this but the generic crm resource show $RESOURCE

Im thinking follow up PR after some investigation into that, sounds good. Thoughts @nicolasbock ?

Chef::Log.debug("#{ms_name} still not answering")
success = 0
end
sleep(2)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of this. Can't we look at some crm output to see if the resource is stable? Right now, this means we wait 10s for no great reason every time we run chef...

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 34621d4 to 708d813 Compare December 13, 2017 11:48
@Itxaka

Itxaka commented Dec 13, 2017

Copy link
Copy Markdown
Member Author

@vuntz @nicolasbock updated with a more simple check for master, then local rabbit. The combination of both should make sure that we only continue when things are calm and after restarts.

vuntz
vuntz previously approved these changes Dec 13, 2017
@vuntz

vuntz commented Dec 15, 2017

Copy link
Copy Markdown
Member

@Itxaka the CI is failing and that looks related to the patch

@Itxaka

Itxaka commented Dec 15, 2017

Copy link
Copy Markdown
Member Author

This doesnt really work either, as soon as pacemaker has chosen a master the output of crm resource show rabbitmq returns a "valid" master but rabbit its still restarting and promoting :(

@Itxaka

Itxaka commented Dec 15, 2017

Copy link
Copy Markdown
Member Author

time chef continued the run:

[2017-12-15T14:58:08+00:00] INFO: ruby_block[wait for ms-rabbitmq to be started] called

at that same time, rabbitmq was still restarting:

lrmd:   2017/12/15_14:58:25 INFO: rabbitmq[27984]: promote: Updating cluster master attribute

@Itxaka

Itxaka commented Dec 15, 2017

Copy link
Copy Markdown
Member Author

parsing the output of rabbitmqctl cluster_status is also absurd, because for a brief moment you can get the proper output before the services start restarting

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch 2 times, most recently from 8fdb2c2 to 363afb0 Compare December 15, 2017 15:37
vuntz
vuntz previously approved these changes Feb 8, 2018

@vuntz vuntz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me approve, even though I have some minor nitpicks.

end
# Check that we dont have any pending resource operations
cmd = "crm resource operations #{ms_name} 2> /dev/null "
cmd << "|grep -q \"pending\""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/|grep/| grep/

end
rescue Timeout::Error
message = "The #{ms_name} pacemaker resource is not started. Please manually check for an error."
message = "The #{ms_name} pacemaker resource is not started or doesnt have a master yet."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/doesnt/does not/ (or doesn't)

@Itxaka

Itxaka commented Feb 8, 2018

Copy link
Copy Markdown
Member Author

IIRC this also didnt catch all scenarios as there was a brief period of time where allt he checks can pass and we are still restarting ¬_¬

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 363afb0 to 17a6010 Compare March 22, 2018 12:22
# Check that the service has a master
cmd = "crm resource show #{ms_name} 2> /dev/null "
cmd << "| grep -q \"is running on\""
cmd << "| grep -q \"is running on\" | grep -q \"Master\""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if the first grep uses -q won't pass anything to next grep and it will always fail. I think the correct line is:
cmd << "| grep "is running on" | grep -q "Master""

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 17a6010 to a000382 Compare March 23, 2018 13:38
@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from a000382 to 369fe08 Compare March 23, 2018 15:13
# Check that the service has a master
cmd = "crm resource show #{ms_name} 2> /dev/null "
cmd << "| grep -q \"is running on\""
cmd << "| grep \"is running on\" | grep -q \"Master\""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would swap the check for Master and the check for locally running. You require rabbitmq in all nodes locally before you can get a Master.

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from ba7f458 to 5dc32c7 Compare April 24, 2018 12:14
@Itxaka Itxaka dismissed ilausuch’s stale review April 26, 2018 10:06

Ivan is out for some time :)

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 5dc32c7 to 74fc830 Compare April 26, 2018 10:13
@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 74fc830 to 45c60eb Compare April 26, 2018 10:16
@crowbar crowbar deleted a comment from houndci-bot Apr 26, 2018
@Itxaka

Itxaka commented Apr 26, 2018

Copy link
Copy Markdown
Member Author

@vuntz @AbelNavarro re-review?

Main changes are, ordering of the checks as Abel suggested, only trigger the check for rabbit when the transaction is triggered, and sync at the end of the recipe so all the checks have passed on all the nodes.

AbelNavarro
AbelNavarro previously approved these changes Apr 26, 2018
@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 45c60eb to 15ebe61 Compare April 26, 2018 12:07
end
end # block
action :nothing
end # ruby_block

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/CommentedKeyword: Do not place comments on the same line as the end keyword.

Chef::Log.fatal(message)
raise message
end
end # block

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/CommentedKeyword: Do not place comments on the same line as the end keyword.

end

# wait for service to have a master, and to be active
ruby_block "wait for #{ms_name} to be started" do

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/BlockLength: Block has too many lines. [42/40]

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch 3 times, most recently from 4a3de28 to d7f7f76 Compare April 26, 2018 14:40
@jsuchome

Copy link
Copy Markdown
Member

btw this is conflicting with #1637 .. I assume one needs to be rebased after first is merged...

@Itxaka

Itxaka commented Apr 27, 2018

Copy link
Copy Markdown
Member Author

@jsuchome I will rebase this afterwards I guess, no problems there

@jsuchome

Copy link
Copy Markdown
Member

You can rebase now

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from d7f7f76 to 4bde235 Compare April 27, 2018 11:33

# wait for service to have a master, and to be active
ruby_block "wait for #{ms_name} to be started" do
block do

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/BlockLength: Block has too many lines. [42/40]

end

# wait for service to have a master, and to be active
ruby_block "wait for #{ms_name} to be started" do

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/BlockLength: Block has too many lines. [45/40]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using this PR as requisite of other and these violations doesn't affect to the execution

@Itxaka

Itxaka commented Apr 27, 2018

Copy link
Copy Markdown
Member Author

rebased! @jsuchome the check for upgrade was moved to the top, as the first check now checks if the service is running locally, which in the upgrade case we dont want to do so

@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 4bde235 to 13e4f2b Compare April 27, 2018 12:27
@rsalevsky

Copy link
Copy Markdown
Member

Needs a rebase again.

Itxaka added 2 commits June 21, 2018 12:37
As the resource agent for rabbitmq with cluster HA restart the rabbitmq
service several times, the current check can fail to validate rabbitmq
status, as it could do the check just on one of those times that rabbit
is up while creating/joining the cluster. Then if the check passed and
continued the chef execution, the next steps could fail as they are
dependant on having a running rabbitmq, while the rabbitmq server may
still be restarting.

Instead expand the checks to first look for a rabbit master for the
resource and expand the check for a local runing rabbit to make sure
we are checking for the local copy. Also add an extra check after
the crm checks to make sure there are no pending operations for the
resource so we can try to avoid continuing if there is a promotion
going on.
As the other checks are not enough, as pacemaker keeps restarting
rabbitmq, we need a more robust way of checking that rabbit has entered
an stable situation.

So check that rabbit is up 5 times in a row with a delay of 2 seconds
between checks to make sure pacemaker has left it alone.

Also, only trigger that check for rabbit if the pacemaker_transaction is
updated, otherwise there is no need to do so
@Itxaka Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 13e4f2b to 8b56894 Compare June 21, 2018 10:41
@jsuchome

Copy link
Copy Markdown
Member

Was this tested during the upgrade?

@Itxaka

Itxaka commented Jun 28, 2018

Copy link
Copy Markdown
Member Author

Was this tested during the upgrade?

only the backport may affect the upgrade wont it?

In any case, testing it.

@jsuchome

Copy link
Copy Markdown
Member

only the backport may affect the upgrade wont it?

Nope. This part is executed on the first upgraded node during the chef-client run.

In any case, testing it.

Mee too. Actually not only it does not seem to break anything, it's possible it's fixing one of our other rabbitmq-related problems...

@Itxaka

Itxaka commented Jun 28, 2018

Copy link
Copy Markdown
Member Author

Mee too. Actually not only it does not seem to break anything, it's possible it's fixing one of our other rabbitmq-related problems...

Solving stuff and not even knowing it 👯‍♂️

@jsuchome

Copy link
Copy Markdown
Member

I was too optimistic, it does not seem to solve the problem I was watching. But it also does not seem to break anything...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

8 participants