keystone: avoid race condition during admin password change#1668
Conversation
| # admin tokens already issued before creating new ones. | ||
| ruby_block "wait for keystone to expire current admin tokens" do | ||
| block do | ||
| sleep(5) |
There was a problem hiding this comment.
Given that we don't actually know how long it will take for keystone to finally straighten out its keys, I think it would be safer to use a retry loop here instead of a wait with a fixed time. In case we are on super slow hardware or something else is disturbing keystone the above timeout might just be too short.
There was a problem hiding this comment.
@nicolasbock a retry loop is not required, because the next keystone_register wakeup action already includes something like that: https://github.com/crowbar/crowbar-openstack/blob/master/chef/cookbooks/keystone/providers/register.rb#L26
There was a problem hiding this comment.
So you could also wait less than 5 seconds?
There was a problem hiding this comment.
So you could also wait less than 5 seconds?
@nicolasbock yes, I used a 2 second delay in my manual tests. The idea is to wait a short while before making the first _build_connection call, otherwise we risk entering a loop of 50 retry actions that will all fail because they use an expired token. A better way to fix this would require me to change the register.rb code, which I would like to avoid, to reduce the impact this would have on other parts of keystone functionality.
There was a problem hiding this comment.
Thanks I see. Basically I am trying to understand how "safe" a 5 second delay would be. Will we potentially run into a situation where it's too short?
There was a problem hiding this comment.
@nicolasbock I went ahead and modified the keystone_register resource to fix this without introducing additional waiting periods. Hopefully, this addresses all your concerns.
@cmurphy yes, I retested it using the v3 API with the same results By the way, the SOC8 keystone_register provider code doesn't use |
@cmurphy I even tried with |
| sleep 1 if error | ||
| if error | ||
| sleep 1 | ||
| if count < 50 and new_resource.reissue_token_on_error |
There was a problem hiding this comment.
Style/AndOr: Use && instead of and. (https://github.com/bbatsov/ruby-style-guide#no-and-or-or)
| count = count + 1 | ||
| _, error = _get_service_id(http, headers, "fred") | ||
| sleep 1 if error | ||
| if error |
There was a problem hiding this comment.
Style/Next: Use next to skip iteration. (https://github.com/bbatsov/ruby-style-guide#no-nested-conditionals)
nicolasbock
left a comment
There was a problem hiding this comment.
Thanks! This looks much nice!
cmurphy
left a comment
There was a problem hiding this comment.
As I mentioned on the LP bug I think adding a sleep is the right call but I think I spotted a couple of implementation issues here.
| count = count + 1 | ||
| _, error = _get_service_id(http, headers, "fred") | ||
| sleep 1 if error | ||
| next unless error |
There was a problem hiding this comment.
Using next will cause this to return to the iterator and start the loop again, causing this to iterate 50 times even when there is no error. Should this be break instead?
There was a problem hiding this comment.
The only change that I'm trying to introduce here is to regenerate the token (i.e. run _build_connection again) if _get_service_id fails. Also, I don't want to do use this altered behavior all the time, but only when so enabled via this new reissue_token_on_error parameter.
The next unless stuff comes from a rubocop complaint saying that I should use next/break instead of doing:
if error
sleep 1
if count < 50 && new_resource.reissue_token_on_error
http, headers = _build_connection(new_resource)
end
end
There was a problem hiding this comment.
ah I was thinking about this but I think that the while error part will drop the iteration
There was a problem hiding this comment.
The next unless stuff comes from a rubocop complaint saying that I should use next/break instead of doing:
Ugh I really hate the ruby style guide sometimes. If you had written it that way I don't think I would have misread it.
There was a problem hiding this comment.
Ugh I really hate the ruby style guide sometimes. If you had written it that way I don't think I would have misread it.
Tell me about it :)
| sleep 1 if error | ||
| next unless error | ||
| sleep 1 | ||
| next unless count < 50 && new_resource.reissue_token_on_error |
There was a problem hiding this comment.
This is also going to go back to the iterator. I think it would make more sense to use break or even better to wrap the _build_connection in a conditional.
Also the iterator is already checking the count, does it really need to be checked again here? The important part is really the new resource attribute, right?
There was a problem hiding this comment.
This is the same next unless weirdness at play here (see my previous comment).
I recheck the count here just to avoid regenerating one last token when the loop ends in failure - a token that will not be used anywhere. Sure, this might be overdoing it a bit. I mean... if we already generate 50 tokens, one more token isn't going to make a difference.
There was a problem hiding this comment.
I changed the conditions to be more compact.
| count = 0 | ||
| error = true | ||
| while error and count < 50 do | ||
| while true |
There was a problem hiding this comment.
Style/InfiniteLoop: Use Kernel#loop for infinite loops. (https://github.com/bbatsov/ruby-style-guide#infinite-loop)
Lint/LiteralAsCondition: Literal true appeared as a condition.
There was a problem hiding this comment.
Indeed there is a loop, as there is no do while in Ruby (not at least something that is recommended)
…829) Issuing a new keystone token immediately after updating the admin user password may sometimes return an invalid token. In the context of crowbar, this issue can be triggered when calling the keystone_register 'wakeup' action immediately after the admin password has been updated. When triggered, it results in timeout errors on non-founder nodes, while the founder node is stuck doing retry iterations with an expired token. As a workaround for bsc#1091829, the 'wakeup' action is updated with an optional 'reissue_token_on_error' argument, which, when set, will re-issue a token *before* checking the keystone API again, instead of reusing the same token for subsequent attempts.
| update_admin_password = node[:keystone][:bootstrap] && | ||
| (!ha_enabled || CrowbarPacemakerHelper.is_cluster_founder?(node)) && | ||
| old_password && !old_password.empty? && | ||
| old_password != node[:keystone][:admin][:password] |
There was a problem hiding this comment.
Just a note here, this has been moved from the execution phase to the compilation phase, so it will resolve at the start of the chef run, instead of just before executing the resource.
Does not seem to have any issues right now, but it might do in the future.
Workaround for: https://bugzilla.suse.com/show_bug.cgi?id=1091829
Calling the keystone_register 'wakeup' action immediately after
the admin password has been updated can sometimes result in timeout
errors on non-founder nodes, while the founder node is stuck doing
retry iterations with an expired token due to bsc#1091829.
As a workaround for bsc#1091829, a small time delay is inserted after
the admin password is changed, to give keystone time to expire all the
admin tokens already issued before creating new ones.