Skip to content

keystone: avoid race condition during admin password change#1668

Merged
stefannica merged 1 commit into
crowbar:masterfrom
stefannica:bsc#1091829
May 11, 2018
Merged

keystone: avoid race condition during admin password change#1668
stefannica merged 1 commit into
crowbar:masterfrom
stefannica:bsc#1091829

Conversation

@stefannica

Copy link
Copy Markdown
Contributor

Workaround for: https://bugzilla.suse.com/show_bug.cgi?id=1091829

Calling the keystone_register 'wakeup' action immediately after
the admin password has been updated can sometimes result in timeout
errors on non-founder nodes, while the founder node is stuck doing
retry iterations with an expired token due to bsc#1091829.

As a workaround for bsc#1091829, a small time delay is inserted after
the admin password is changed, to give keystone time to expire all the
admin tokens already issued before creating new ones.

@stefannica

Copy link
Copy Markdown
Contributor Author

# admin tokens already issued before creating new ones.
ruby_block "wait for keystone to expire current admin tokens" do
block do
sleep(5)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we don't actually know how long it will take for keystone to finally straighten out its keys, I think it would be safer to use a retry loop here instead of a wait with a fixed time. In case we are on super slow hardware or something else is disturbing keystone the above timeout might just be too short.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolasbock a retry loop is not required, because the next keystone_register wakeup action already includes something like that: https://github.com/crowbar/crowbar-openstack/blob/master/chef/cookbooks/keystone/providers/register.rb#L26

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you could also wait less than 5 seconds?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you could also wait less than 5 seconds?

@nicolasbock yes, I used a 2 second delay in my manual tests. The idea is to wait a short while before making the first _build_connection call, otherwise we risk entering a loop of 50 retry actions that will all fail because they use an expired token. A better way to fix this would require me to change the register.rb code, which I would like to avoid, to reduce the impact this would have on other parts of keystone functionality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I see. Basically I am trying to understand how "safe" a 5 second delay would be. Will we potentially run into a situation where it's too short?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolasbock I went ahead and modified the keystone_register resource to fix this without introducing additional waiting periods. Hopefully, this addresses all your concerns.

@cmurphy

cmurphy commented May 8, 2018

Copy link
Copy Markdown
Contributor

@stefannica

stefannica commented May 8, 2018

Copy link
Copy Markdown
Contributor Author

Is the problem reproducible using the v3 API?

https://developer.openstack.org/api-ref/identity/v3/index.html#password-authentication-with-scoped-authorization
https://developer.openstack.org/api-ref/identity/v3/index.html#change-password-for-user

@cmurphy yes, I retested it using the v3 API with the same results

By the way, the SOC8 keystone_register provider code doesn't use /v3/users/{user_id}/password tp change the user password, it uses PATCH over /v3/users/{user_id}, which is what I used in my tests.

@stefannica

Copy link
Copy Markdown
Contributor Author

By the way, the SOC8 keystone_register provider code doesn't use /v3/users/{user_id}/password tp change the user password, it uses PATCH over /v3/users/{user_id}, which is what I used in my tests.

@cmurphy I even tried with /v3/users/{user_id}/password with the same result

sleep 1 if error
if error
sleep 1
if count < 50 and new_resource.reissue_token_on_error

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/AndOr: Use && instead of and. (https://github.com/bbatsov/ruby-style-guide#no-and-or-or)

count = count + 1
_, error = _get_service_id(http, headers, "fred")
sleep 1 if error
if error

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicolasbock
nicolasbock previously approved these changes May 9, 2018

@nicolasbock nicolasbock left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks much nice!

aplanas
aplanas previously approved these changes May 9, 2018

@cmurphy cmurphy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned on the LP bug I think adding a sleep is the right call but I think I spotted a couple of implementation issues here.

count = count + 1
_, error = _get_service_id(http, headers, "fred")
sleep 1 if error
next unless error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using next will cause this to return to the iterator and start the loop again, causing this to iterate 50 times even when there is no error. Should this be break instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change that I'm trying to introduce here is to regenerate the token (i.e. run _build_connection again) if _get_service_id fails. Also, I don't want to do use this altered behavior all the time, but only when so enabled via this new reissue_token_on_error parameter.

The next unless stuff comes from a rubocop complaint saying that I should use next/break instead of doing:

if error
  sleep 1
  if count < 50 && new_resource.reissue_token_on_error
    http, headers = _build_connection(new_resource)
  end
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I was thinking about this but I think that the while error part will drop the iteration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next unless stuff comes from a rubocop complaint saying that I should use next/break instead of doing:

Ugh I really hate the ruby style guide sometimes. If you had written it that way I don't think I would have misread it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh I really hate the ruby style guide sometimes. If you had written it that way I don't think I would have misread it.

Tell me about it :)

sleep 1 if error
next unless error
sleep 1
next unless count < 50 && new_resource.reissue_token_on_error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also going to go back to the iterator. I think it would make more sense to use break or even better to wrap the _build_connection in a conditional.

Also the iterator is already checking the count, does it really need to be checked again here? The important part is really the new resource attribute, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same next unless weirdness at play here (see my previous comment).

I recheck the count here just to avoid regenerating one last token when the loop ends in failure - a token that will not be used anywhere. Sure, this might be overdoing it a bit. I mean... if we already generate 50 tokens, one more token isn't going to make a difference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the conditions to be more compact.

@stefannica stefannica dismissed stale reviews from aplanas and nicolasbock via a4c62ec May 9, 2018 14:48
count = 0
error = true
while error and count < 50 do
while true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/InfiniteLoop: Use Kernel#loop for infinite loops. (https://github.com/bbatsov/ruby-style-guide#infinite-loop)
Lint/LiteralAsCondition: Literal true appeared as a condition.

@aplanas aplanas May 9, 2018

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed there is a loop, as there is no do while in Ruby (not at least something that is recommended)

…829)

Issuing a new keystone token immediately after updating the
admin user password may sometimes return an invalid token.

In the context of crowbar, this issue can be triggered when calling
the keystone_register 'wakeup' action immediately after the admin
password has been updated. When triggered, it results in timeout errors
on non-founder nodes, while the founder node is stuck doing retry
iterations with an expired token.

As a workaround for bsc#1091829, the 'wakeup' action is updated with
an optional 'reissue_token_on_error' argument, which, when set, will
re-issue a token *before* checking the keystone API again, instead of
reusing the same token for subsequent attempts.
@stefannica stefannica merged commit a291f6f into crowbar:master May 11, 2018
@stefannica stefannica deleted the bsc#1091829 branch May 11, 2018 19:28
update_admin_password = node[:keystone][:bootstrap] &&
(!ha_enabled || CrowbarPacemakerHelper.is_cluster_founder?(node)) &&
old_password && !old_password.empty? &&
old_password != node[:keystone][:admin][:password]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note here, this has been moved from the execution phase to the compilation phase, so it will resolve at the start of the chef run, instead of just before executing the resource.

Does not seem to have any issues right now, but it might do in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

7 participants