keystone: avoid race condition during admin password change by stefannica · Pull Request #1668 · crowbar/crowbar-openstack

stefannica · 2018-05-08T11:06:27Z

Workaround for: https://bugzilla.suse.com/show_bug.cgi?id=1091829

Calling the keystone_register 'wakeup' action immediately after
the admin password has been updated can sometimes result in timeout
errors on non-founder nodes, while the founder node is stuck doing
retry iterations with an expired token due to bsc#1091829.

As a workaround for bsc#1091829, a small time delay is inserted after
the admin password is changed, to give keystone time to expire all the
admin tokens already issued before creating new ones.

stefannica · 2018-05-08T11:08:58Z

Bug 1091829 - Keystone barclamp fails when changing admin password in HA setup

nicolasbock · 2018-05-08T12:50:39Z

+# admin tokens already issued before creating new ones.
+ruby_block "wait for keystone to expire current admin tokens" do
+  block do
+    sleep(5)


Given that we don't actually know how long it will take for keystone to finally straighten out its keys, I think it would be safer to use a retry loop here instead of a wait with a fixed time. In case we are on super slow hardware or something else is disturbing keystone the above timeout might just be too short.

@nicolasbock a retry loop is not required, because the next keystone_register wakeup action already includes something like that: https://github.com/crowbar/crowbar-openstack/blob/master/chef/cookbooks/keystone/providers/register.rb#L26

So you could also wait less than 5 seconds?

So you could also wait less than 5 seconds?

@nicolasbock yes, I used a 2 second delay in my manual tests. The idea is to wait a short while before making the first _build_connection call, otherwise we risk entering a loop of 50 retry actions that will all fail because they use an expired token. A better way to fix this would require me to change the register.rb code, which I would like to avoid, to reduce the impact this would have on other parts of keystone functionality.

Thanks I see. Basically I am trying to understand how "safe" a 5 second delay would be. Will we potentially run into a situation where it's too short?

@nicolasbock I went ahead and modified the keystone_register resource to fix this without introducing additional waiting periods. Hopefully, this addresses all your concerns.

cmurphy · 2018-05-08T14:38:01Z

Is the problem reproducible using the v3 API?

https://developer.openstack.org/api-ref/identity/v3/index.html#password-authentication-with-scoped-authorization
https://developer.openstack.org/api-ref/identity/v3/index.html#change-password-for-user

stefannica · 2018-05-08T17:28:43Z

Is the problem reproducible using the v3 API?

https://developer.openstack.org/api-ref/identity/v3/index.html#password-authentication-with-scoped-authorization
https://developer.openstack.org/api-ref/identity/v3/index.html#change-password-for-user

@cmurphy yes, I retested it using the v3 API with the same results

By the way, the SOC8 keystone_register provider code doesn't use /v3/users/{user_id}/password tp change the user password, it uses PATCH over /v3/users/{user_id}, which is what I used in my tests.

stefannica · 2018-05-08T17:52:47Z

By the way, the SOC8 keystone_register provider code doesn't use /v3/users/{user_id}/password tp change the user password, it uses PATCH over /v3/users/{user_id}, which is what I used in my tests.

@cmurphy I even tried with /v3/users/{user_id}/password with the same result

houndci-bot · 2018-05-09T11:03:26Z

-    sleep 1 if error
+    if error
+      sleep 1
+      if count < 50 and new_resource.reissue_token_on_error


Style/AndOr: Use && instead of and. (https://github.com/bbatsov/ruby-style-guide#no-and-or-or)

houndci-bot · 2018-05-09T11:03:26Z

    count = count + 1
    _, error = _get_service_id(http, headers, "fred")
-    sleep 1 if error
+    if error


Style/Next: Use next to skip iteration. (https://github.com/bbatsov/ruby-style-guide#no-nested-conditionals)

nicolasbock

Thanks! This looks much nice!

cmurphy

As I mentioned on the LP bug I think adding a sleep is the right call but I think I spotted a couple of implementation issues here.

cmurphy · 2018-05-09T14:17:40Z

    count = count + 1
    _, error = _get_service_id(http, headers, "fred")
-    sleep 1 if error
+    next unless error


Using next will cause this to return to the iterator and start the loop again, causing this to iterate 50 times even when there is no error. Should this be break instead?

The only change that I'm trying to introduce here is to regenerate the token (i.e. run _build_connection again) if _get_service_id fails. Also, I don't want to do use this altered behavior all the time, but only when so enabled via this new reissue_token_on_error parameter.

The next unless stuff comes from a rubocop complaint saying that I should use next/break instead of doing:

if error sleep 1 if count < 50 && new_resource.reissue_token_on_error http, headers = _build_connection(new_resource) end end

ah I was thinking about this but I think that the while error part will drop the iteration

The next unless stuff comes from a rubocop complaint saying that I should use next/break instead of doing:

Ugh I really hate the ruby style guide sometimes. If you had written it that way I don't think I would have misread it.

Ugh I really hate the ruby style guide sometimes. If you had written it that way I don't think I would have misread it.

Tell me about it :)

cmurphy · 2018-05-09T14:22:28Z

-    sleep 1 if error
+    next unless error
+    sleep 1
+    next unless count < 50 && new_resource.reissue_token_on_error


This is also going to go back to the iterator. I think it would make more sense to use break or even better to wrap the _build_connection in a conditional.

Also the iterator is already checking the count, does it really need to be checked again here? The important part is really the new resource attribute, right?

This is the same next unless weirdness at play here (see my previous comment).

I recheck the count here just to avoid regenerating one last token when the loop ends in failure - a token that will not be used anywhere. Sure, this might be overdoing it a bit. I mean... if we already generate 50 tokens, one more token isn't going to make a difference.

I changed the conditions to be more compact.

houndci-bot · 2018-05-09T14:48:46Z

  count = 0
  error = true
-  while error and count < 50 do
+  while true


Style/InfiniteLoop: Use Kernel#loop for infinite loops. (https://github.com/bbatsov/ruby-style-guide#infinite-loop)
Lint/LiteralAsCondition: Literal true appeared as a condition.

Indeed there is a loop, as there is no do while in Ruby (not at least something that is recommended)

…829) Issuing a new keystone token immediately after updating the admin user password may sometimes return an invalid token. In the context of crowbar, this issue can be triggered when calling the keystone_register 'wakeup' action immediately after the admin password has been updated. When triggered, it results in timeout errors on non-founder nodes, while the founder node is stuck doing retry iterations with an expired token. As a workaround for bsc#1091829, the 'wakeup' action is updated with an optional 'reissue_token_on_error' argument, which, when set, will re-issue a token *before* checking the keystone API again, instead of reusing the same token for subsequent attempts.

Itxaka · 2018-05-14T08:15:04Z

+update_admin_password = node[:keystone][:bootstrap] &&
+  (!ha_enabled || CrowbarPacemakerHelper.is_cluster_founder?(node)) &&
+  old_password && !old_password.empty? &&
+  old_password != node[:keystone][:admin][:password]


Just a note here, this has been moved from the execution phase to the compilation phase, so it will resolve at the start of the chef run, instead of just before executing the resource.

Does not seem to have any issues right now, but it might do in the future.

stefannica added the needs backport to SOC7 label May 8, 2018

stefannica assigned cmurphy and guangyee and unassigned cmurphy and guangyee May 8, 2018

stefannica requested review from cmurphy and guangyee May 8, 2018 11:07

stefannica mentioned this pull request May 8, 2018

keystone: re-enable update password test SUSE-Cloud/automation#2243

Merged

6 tasks

nicolasbock suggested changes May 8, 2018

View reviewed changes

stefannica force-pushed the bsc#1091829 branch from d3eccdf to 832f1c0 Compare May 9, 2018 11:03

houndci-bot reviewed May 9, 2018

View reviewed changes

stefannica force-pushed the bsc#1091829 branch from 832f1c0 to 2b9ccb5 Compare May 9, 2018 11:09

nicolasbock previously approved these changes May 9, 2018

View reviewed changes

aplanas previously approved these changes May 9, 2018

View reviewed changes

cmurphy requested changes May 9, 2018

View reviewed changes

stefannica dismissed stale reviews from aplanas and nicolasbock via a4c62ec May 9, 2018 14:48

stefannica force-pushed the bsc#1091829 branch from 2b9ccb5 to a4c62ec Compare May 9, 2018 14:48

houndci-bot reviewed May 9, 2018

View reviewed changes

stefannica force-pushed the bsc#1091829 branch from a4c62ec to 82c8091 Compare May 9, 2018 14:51

stefannica force-pushed the bsc#1091829 branch from 82c8091 to 3d664ed Compare May 9, 2018 14:57

cmurphy approved these changes May 9, 2018

View reviewed changes

aplanas approved these changes May 9, 2018

View reviewed changes

stefannica merged commit a291f6f into crowbar:master May 11, 2018

stefannica deleted the bsc#1091829 branch May 11, 2018 19:28

stefannica mentioned this pull request May 11, 2018

[4.0] keystone: avoid race condition during admin password change #1670

Merged

Itxaka reviewed May 14, 2018

View reviewed changes

Conversation

stefannica commented May 8, 2018

Uh oh!

stefannica commented May 8, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmurphy commented May 8, 2018

Uh oh!

stefannica commented May 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stefannica commented May 8, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nicolasbock left a comment

Choose a reason for hiding this comment

Uh oh!

cmurphy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aplanas May 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

7 participants

stefannica commented May 8, 2018 •

edited

Loading

aplanas May 9, 2018 •

edited

Loading