rabbitmq: Make sure rabbitmq is running on cluster HA by Itxaka · Pull Request #1396 · crowbar/crowbar-openstack

Itxaka · 2017-10-24T14:40:46Z

As the resource agent for rabbitmq with cluster HA restart the rabbitmq
service several times, the current check can fail to validate rabbitmq
status, as it could do the check just on one of those times that rabbit
is up while creating/joining the cluster. Then if the check passed and
continued the chef execution, the next steps could fail as they are
dependant on having a running rabbitmq, while the rabbitmq server may
still be restarting.

Instead expand the checks to first look for a rabbit master for the
resource and expand the check for a local runing rabbit to make sure
we are checking for the local copy.

nicolasbock · 2017-10-24T16:49:36Z

+        # check that at least we have confirmed that the app is up for 5 times in a row as when
+        # clustering we restart rabbit several times
+        success = 0
+        while success < 5


I am struggling to understand why 5 is a good value here. I don't mean to say that it's necessarily a bad value, just trying to see whether there is more meaning behind it 😄.

Not really but 5 * 2 seconds per check makes a total of 10 seconds where the app has been running, and that should be time enough to make sure that indeed rabbitmq is already up and not on the start-stop-reset-start phase when creating a cluster.

totally random but I think its just a good value between making the check too long and making it too short that it could lead to failures again :)

hope that makes sense?

Ok, that makes sense. And don't give me this "5 was totally random" story! I don't buy it 😉

I am concerned about what happens in CI though, especially when it is oversubscribed and timings just go to out the window. A very quick Google search turned up this post and I am wondering now whether it would be possible to use that to notify the ruby_block in the recipe so that we don't have to guess how long to wait until pacemaker is finally done restarting things. What do you think?

well I think that in the case of rabbit the issue cannot be solved by that as on my tests, pacemaker would return a simple "$service is running on $host" even if the service is restarting. and the service restarting is part of the agent script, which I dont think pacemaker has any visibility of, so it may not be possible :(

Although reading the page you linked, it seems like a nice way of notifying that a resource is stable, although it would need more changes than this, we should definitely explore that. Maybe my last post is not correct as its not using this but the generic crm resource show $RESOURCE

Im thinking follow up PR after some investigation into that, sounds good. Thoughts @nicolasbock ?

vuntz · 2017-10-26T16:45:34Z

+            Chef::Log.debug("#{ms_name} still not answering")
+            success = 0
+          end
          sleep(2)


Not a big fan of this. Can't we look at some crm output to see if the resource is stable? Right now, this means we wait 10s for no great reason every time we run chef...

Itxaka · 2017-12-13T11:50:12Z

@vuntz @nicolasbock updated with a more simple check for master, then local rabbit. The combination of both should make sure that we only continue when things are calm and after restarts.

vuntz · 2017-12-15T08:40:54Z

@Itxaka the CI is failing and that looks related to the patch

Itxaka · 2017-12-15T14:59:46Z

This doesnt really work either, as soon as pacemaker has chosen a master the output of crm resource show rabbitmq returns a "valid" master but rabbit its still restarting and promoting :(

Itxaka · 2017-12-15T15:04:00Z

time chef continued the run:

[2017-12-15T14:58:08+00:00] INFO: ruby_block[wait for ms-rabbitmq to be started] called

at that same time, rabbitmq was still restarting:

lrmd:   2017/12/15_14:58:25 INFO: rabbitmq[27984]: promote: Updating cluster master attribute

Itxaka · 2017-12-15T15:15:06Z

parsing the output of rabbitmqctl cluster_status is also absurd, because for a brief moment you can get the proper output before the services start restarting

vuntz

Let me approve, even though I have some minor nitpicks.

vuntz · 2018-02-08T08:55:17Z

+        end
+        # Check that we dont have any pending resource operations
+        cmd = "crm resource operations #{ms_name} 2> /dev/null "
+        cmd << "|grep -q \"pending\""


s/|grep/| grep/

vuntz · 2018-02-08T08:55:36Z

      end
    rescue Timeout::Error
-      message = "The #{ms_name} pacemaker resource is not started. Please manually check for an error."
+      message = "The #{ms_name} pacemaker resource is not started or doesnt have a master yet."


s/doesnt/does not/ (or doesn't)

Itxaka · 2018-02-08T10:57:48Z

IIRC this also didnt catch all scenarios as there was a brief period of time where allt he checks can pass and we are still restarting ¬_¬

ilausuch · 2018-03-23T11:37:49Z

+        # Check that the service has a master
        cmd = "crm resource show #{ms_name} 2> /dev/null "
-        cmd << "| grep -q \"is running on\""
+        cmd << "| grep -q \"is running on\" | grep -q \"Master\""


I think that if the first grep uses -q won't pass anything to next grep and it will always fail. I think the correct line is:
cmd << "| grep "is running on" | grep -q "Master""

AbelNavarro · 2018-04-19T13:30:59Z

+        # Check that the service has a master
        cmd = "crm resource show #{ms_name} 2> /dev/null "
-        cmd << "| grep -q \"is running on\""
+        cmd << "| grep \"is running on\" | grep -q \"Master\""


I would swap the check for Master and the check for locally running. You require rabbitmq in all nodes locally before you can get a Master.

Ivan is out for some time :)

Itxaka · 2018-04-26T10:19:11Z

@vuntz @AbelNavarro re-review?

Main changes are, ordering of the checks as Abel suggested, only trigger the check for rabbit when the transaction is triggered, and sync at the end of the recipe so all the checks have passed on all the nodes.

houndci-bot · 2018-04-26T12:08:03Z

+    end
+  end # block
+  action :nothing
+end # ruby_block


Style/CommentedKeyword: Do not place comments on the same line as the end keyword.

houndci-bot · 2018-04-26T12:08:03Z

+      Chef::Log.fatal(message)
+      raise message
+    end
+  end # block


Style/CommentedKeyword: Do not place comments on the same line as the end keyword.

houndci-bot · 2018-04-26T12:08:03Z

 end

+# wait for service to have a master, and to be active
+ruby_block "wait for #{ms_name} to be started" do


Metrics/BlockLength: Block has too many lines. [42/40]

jsuchome · 2018-04-27T06:17:58Z

btw this is conflicting with #1637 .. I assume one needs to be rebased after first is merged...

Itxaka · 2018-04-27T08:02:20Z

@jsuchome I will rebase this afterwards I guess, no problems there

jsuchome · 2018-04-27T08:49:56Z

You can rebase now

houndci-bot · 2018-04-27T11:33:46Z


+# wait for service to have a master, and to be active
+ruby_block "wait for #{ms_name} to be started" do
+  block do


Metrics/BlockLength: Block has too many lines. [42/40]

houndci-bot · 2018-04-27T11:33:46Z

 end

+# wait for service to have a master, and to be active
+ruby_block "wait for #{ms_name} to be started" do


Metrics/BlockLength: Block has too many lines. [45/40]

I am using this PR as requisite of other and these violations doesn't affect to the execution

Itxaka · 2018-04-27T11:35:41Z

rebased! @jsuchome the check for upgrade was moved to the top, as the first check now checks if the service is running locally, which in the upgrade case we dont want to do so

rsalevsky · 2018-05-04T13:11:03Z

Needs a rebase again.

As the resource agent for rabbitmq with cluster HA restart the rabbitmq service several times, the current check can fail to validate rabbitmq status, as it could do the check just on one of those times that rabbit is up while creating/joining the cluster. Then if the check passed and continued the chef execution, the next steps could fail as they are dependant on having a running rabbitmq, while the rabbitmq server may still be restarting. Instead expand the checks to first look for a rabbit master for the resource and expand the check for a local runing rabbit to make sure we are checking for the local copy. Also add an extra check after the crm checks to make sure there are no pending operations for the resource so we can try to avoid continuing if there is a promotion going on.

As the other checks are not enough, as pacemaker keeps restarting rabbitmq, we need a more robust way of checking that rabbit has entered an stable situation. So check that rabbit is up 5 times in a row with a delay of 2 seconds between checks to make sure pacemaker has left it alone. Also, only trigger that check for rabbit if the pacemaker_transaction is updated, otherwise there is no need to do so

jsuchome · 2018-06-26T15:22:02Z

Was this tested during the upgrade?

Itxaka · 2018-06-28T06:55:03Z

Was this tested during the upgrade?

only the backport may affect the upgrade wont it?

In any case, testing it.

jsuchome · 2018-06-28T07:05:00Z

only the backport may affect the upgrade wont it?

Nope. This part is executed on the first upgraded node during the chef-client run.

In any case, testing it.

Mee too. Actually not only it does not seem to break anything, it's possible it's fixing one of our other rabbitmq-related problems...

Itxaka · 2018-06-28T07:12:59Z

Mee too. Actually not only it does not seem to break anything, it's possible it's fixing one of our other rabbitmq-related problems...

Solving stuff and not even knowing it 👯‍♂️

jsuchome · 2018-06-28T07:17:27Z

I was too optimistic, it does not seem to solve the problem I was watching. But it also does not seem to break anything...

Itxaka added the needs backport to SOC7 label Oct 24, 2017

nicolasbock reviewed Oct 24, 2017

View reviewed changes

Itxaka requested review from AbelNavarro, scottwulf and vuntz October 24, 2017 16:51

vuntz reviewed Oct 26, 2017

View reviewed changes

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 34621d4 to 708d813 Compare December 13, 2017 11:48

vuntz previously approved these changes Dec 13, 2017

View reviewed changes

Itxaka dismissed vuntz’s stale review via 8fdb2c2 December 15, 2017 15:22

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch 2 times, most recently from 8fdb2c2 to 363afb0 Compare December 15, 2017 15:37

vuntz previously approved these changes Feb 8, 2018

View reviewed changes

Itxaka added the do not merge yet label Feb 8, 2018

Itxaka dismissed vuntz’s stale review via 17a6010 March 22, 2018 12:22

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 363afb0 to 17a6010 Compare March 22, 2018 12:22

ilausuch previously requested changes Mar 23, 2018

View reviewed changes

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 17a6010 to a000382 Compare March 23, 2018 13:38

Itxaka removed the do not merge yet label Mar 23, 2018

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from a000382 to 369fe08 Compare March 23, 2018 15:13

AbelNavarro reviewed Apr 19, 2018

View reviewed changes

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from ba7f458 to 5dc32c7 Compare April 24, 2018 12:14

ilausuch mentioned this pull request Apr 24, 2018

rabbitmq: active the haproxy balancing #1571

Closed

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 5dc32c7 to 74fc830 Compare April 26, 2018 10:13

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 74fc830 to 45c60eb Compare April 26, 2018 10:16

crowbar deleted a comment from houndci-bot Apr 26, 2018

AbelNavarro previously approved these changes Apr 26, 2018

View reviewed changes

Itxaka dismissed AbelNavarro’s stale review via 15ebe61 April 26, 2018 12:07

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 45c60eb to 15ebe61 Compare April 26, 2018 12:07

houndci-bot reviewed Apr 26, 2018

View reviewed changes

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch 3 times, most recently from 4a3de28 to d7f7f76 Compare April 26, 2018 14:40

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from d7f7f76 to 4bde235 Compare April 27, 2018 11:33

houndci-bot reviewed Apr 27, 2018

View reviewed changes

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 4bde235 to 13e4f2b Compare April 27, 2018 12:27

Itxaka added 2 commits June 21, 2018 12:37

Itxaka force-pushed the increase_rabbitmq_cluster_check_ready branch from 13e4f2b to 8b56894 Compare June 21, 2018 10:41

ilausuch approved these changes Jun 26, 2018

View reviewed changes

AbelNavarro approved these changes Jun 26, 2018

View reviewed changes

Itxaka merged commit b2b26cb into crowbar:master Jul 3, 2018

Itxaka mentioned this pull request Jul 3, 2018

[4.0] rabbitmq: Make sure rabbitmq is running on cluster HA #1697

Merged

Conversation

Itxaka commented Oct 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Itxaka Oct 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Itxaka commented Dec 13, 2017

Uh oh!

vuntz commented Dec 15, 2017

Uh oh!

Itxaka commented Dec 15, 2017

Uh oh!

Itxaka commented Dec 15, 2017

Uh oh!

Itxaka commented Dec 15, 2017

Uh oh!

vuntz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Itxaka commented Feb 8, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Itxaka commented Apr 26, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jsuchome commented Apr 27, 2018

Uh oh!

Itxaka commented Apr 27, 2018

Uh oh!

jsuchome commented Apr 27, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Itxaka commented Apr 27, 2018

Uh oh!

rsalevsky commented May 4, 2018

Uh oh!

jsuchome commented Jun 26, 2018

Uh oh!

Itxaka commented Jun 28, 2018

Uh oh!

jsuchome commented Jun 28, 2018

Uh oh!

Itxaka commented Jun 28, 2018

Uh oh!

jsuchome commented Jun 28, 2018

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Itxaka commented Oct 24, 2017 •

edited

Loading

Itxaka Oct 25, 2017 •

edited

Loading