Skip to content

Enable highstate failure retcode#10504

Closed
techdragon wants to merge 2 commits into
saltstack:developfrom
techdragon:enable-highstate-failure-retcode
Closed

Enable highstate failure retcode#10504
techdragon wants to merge 2 commits into
saltstack:developfrom
techdragon:enable-highstate-failure-retcode

Conversation

@techdragon
Copy link
Copy Markdown
Contributor

  • new config variable to enable/disable the new functionality
  • added new function to determine if the highstate return data indicates
    an error or failure has occurred.
  • added new logic to the end of the salt cli call that if enabled using
    the config variable, determines if salt should exit with a non zero
    exit code and then will exit if any failures or errors are found in
    the highstate return data.

Should fix issue #7013

- new config variable to enable/disable the new functionality
- added new function to determine if the highstate return data indicates
  an error or failure has occurred.
- added new logic to the end of the salt cli call that if enabled using
  the config variable, determines if salt should exit with a non zero
  exit code and then will exit if any failures or errors are found in
  the highstate return data.
@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

Can the bool value be true as default, as anyway the retcode was inconsistent before ?

@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

Really love to see that pretty soon reviewed and merged !
/cc @regilero @Gagaro

@ghost
Copy link
Copy Markdown

ghost commented Feb 17, 2014

Test FAILed.
Refer to this link for build results: http://jenkins.saltstack.com/job/salt-pr-build/1540/

@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

Im checking before enabling that as default if there are some performances troubles with big highstates, see the gist: https://gist.github.com/kiorky/9052549

Test is running ATM, please ping me if i forget to post the result ^.^

@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

SAMPLES == An highstate of N minions which 1000 states for each minion

[PROCESSING THE 10000 SAMPLES of 1000 states] finished in 4.5949280262s
[PROCESSING THE 1000 SAMPLES of 1000 states] finished in 0.462728977203s
[PROCESSING THE 100 SAMPLES of 1000 states] finished in 0.0464670658112s
[PROCESSING THE 10 SAMPLES of 1000 states] finished in 0.00506496429443s
[PROCESSING THE 1 SAMPLES of 1000 states] finished in 0.000871896743774s

[PROCESSING THE 10000 SAMPLES of 3000 states] finished in 14.2559530735s
[PROCESSING THE 1000 SAMPLES of 3000 states] finished in 1.43430519104s
[PROCESSING THE 100 SAMPLES of 3000 states] finished in 0.144124984741s
[PROCESSING THE 10 SAMPLES of 3000 states] finished in 0.015741109848s
[PROCESSING THE 1 SAMPLES of 3000 states] finished in 0.00277495384216s

[PROCESSING THE 30000 SAMPLES of 1500 states] finished in 59.8476610184s
[PROCESSING THE 3000 SAMPLES of 1500 states] finished in 2.21659708023s
[PROCESSING THE 300 SAMPLES of 1500 states] finished in 0.223330974579s
[PROCESSING THE 30 SAMPLES of 1500 states] finished in 0.0301289558411s
[PROCESSING THE 3 SAMPLES of 1500 states] finished in 0.00558114051819s

@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

For our deployments, we have ~1000 states per minion, that why i choosed 1000 in the first place, 3000 seem irrealistic.

@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

So for me the overhead of having the highstate results seems acceptable even with large highstates which even here seems a little irrealistic even for big deployments.
/cc @techdragon

@thatch45
Copy link
Copy Markdown
Contributor

This check should already be getting performed on the minion end and setting the retcode field in the return data, so we should not have to do it again. We just need to re-visit retcode-passthrough for the salt command. I think that this is not the best approach here. Is there anything wrong with _set_retcode in salt/modules/state.py?

@kiorky
Copy link
Copy Markdown
Contributor

kiorky commented Feb 17, 2014

/cc @thatch45

cat /srv/salt/toto.sls;salt '*' state.sls toto;echo "retcode: $?"
/bin/false:
  cmd.run: []
devhost2:
----------
          ID: /bin/false
    Function: cmd.run
      Result: False
     Comment: Command "/bin/false" run
     Changes:
              ----------
              pid:
                  5351
              retcode:
                  1
              stderr:

              stdout:


Summary
------------
Succeeded: 0
Failed:    1
------------
Total:     1
retcode: 0

@techdragon
Copy link
Copy Markdown
Contributor Author

@thatch45 I didn't even know that function was there, but after taking some time to investigate it, yes something is wrong with that function. Its broken... to quite a surprising extent.

Each minion when requested to process a state, be it state.highstate, state.sls , state.single, state.top, state.template_str, state.template, or state.high has a fairly predictable code path with regards to the return code.

After executing the states, each minion while building the return dictionary object ret, will call _set_retcode(ret).

_set_retcode(ret) then does one of two things.

First it does a quick check to see if ret is a list, and if it is a list, it sets the return code key in the __context__ dict to 1 and then returns.

if isinstance(ret, list):
    __context__['retcode'] = 1
    return

If ret is not a list, it will then be passed to another function.

if not salt.utils.check_state_result(ret):
    __context__['retcode'] = 2

So far so good. The first check works as expected. Our second check however...

if not isinstance(running, dict):
    return False
if not running:
    return False

All good so far, just simple sanity checks check.
Now for the problem. The minion has passed a dict that looks like this to check_state_result(ret).

running = {
    'file_|-deliberate-pass_|-/Users/guest/deployall.jpg_|-exists': {
        'comment': 'Path /Users/guest/deployall.jpg exists', 
        '__run_num__': 0, 
        'changes': {}, 
        'name': '/Users/guest/deployall.jpg', 
        'result': True
        }, 
    'test_|-test-passes_|-foobar_|-succeed_without_changes': {
        'comment': '', 
        '__run_num__': 3, 
        'changes': {}, 
        'name': 'foobar', 
        'result': True
        }, 
    'layman_|-deliberate-fail_|-sunrise_|-present': {
        'comment': 'State layman.present found in sls test is unavailable\n', 
        '__run_num__': 2, 
        'changes': {}, 
        'result': False, 
        'name': 'sunrise'
        }, 
    'process_|-deliberate-error_|-apache2_|-absent': {
        'comment': 'An exception occurred in this state: Traceback (most recent call last):\n  File "/Users/user/dev/github.com/user/salt/salt/state.py", line 1371, in call\n    **cdata[\'kwargs\'])\n  File "/Users/user/dev/github.com/user/salt/salt/states/process.py", line 41, in absent\n    status = __salt__[\'ps.pkill\'](name, full=True)\n  File "/Users/user/dev/github.com/user/salt/salt/modules/ps.py", line 184, in pkill\n    name_match = pattern in \' \'.join(proc.cmdline) if full \\\n  File "/Users/user/.virtualenvs/salt-dev/lib/python2.7/site-packages/psutil/__init__.py", line 402, in cmdline\n    return self._platform_impl.get_process_cmdline()\n  File "/Users/user/.virtualenvs/salt-dev/lib/python2.7/site-packages/psutil/_psosx.py", line 172, in wrapper\n    raise AccessDenied(self.pid, self._process_name)\nAccessDenied: (pid=1)\n', '__run_num__': 1, 'changes': {}, 'result': False, 'name': 'apache2'
        }
    }

So whats wrong with that? Nothing. Its the function's checking logic that is the broken part, its written assuming that we have a different dictionary structure...

for host in running:
    if not isinstance(running[host], dict):
        return False
    if host.find('_|-') >= 3:
        # This is a single ret, no host associated
        rets = running[host]
    else:
        rets = running[host].values()

    if isinstance(rets, dict) and 'result' in rets:
        if rets['result'] is False:
            return False
        return True

    for ret in rets:
        if not isinstance(ret, dict):
            return False
        if 'result' not in ret:
            return False
        if ret['result'] is False:
            return False
return True

Running contains only the state results for each individual minion, it represents 1 host, so itterating over 'hosts' is wrong. In addition, the if isinstance conditional always exits true since it is always a dict, and since our dictionary is unsorted, it returns True or False based on what the result of the first state in the ditionary is. If your 1st state is true and then you have 100 broken ones, check_state_result(ret) returns True and _set_retcode(ret) passes back without setting a return code assuming all is good. If we skip over this conditional we have more problems. for ret in rets: with our dictionary structure is going to loop over the keys 'comment', 'changes', 'name', 'result', etc, that make up the results from each state call. the first conditional will always fail unless it happens to get 'changes' as the first item out of the dict, the second conditional will always fail unless your result happens to be the string 'result', and the third conditional will never evaluate to true and wont be executed making it pointless. Assuming we've run the gauntlet (which we never do with the current code) we are supposed to return True which is as expected.

After all this, we have one last problem with the current code path, nothing from __context__ is passed back in the job information dictionary which prevents the client/caller from ever knowing what the value produced by _set_retcode is.

Some good news however, I'll be updating the pull request to address all this and use the existing 'intended' results checking pathway.

@thatch45
Copy link
Copy Markdown
Contributor

Thanks for the catch, I think that this iteration was added when we originally did these checks and were looking at the return data on the master not on the minion, so we should be safe in making these changes. Thanks for the detective work here! It is greatly appreciated!

@techdragon
Copy link
Copy Markdown
Contributor Author

At the heart of this problem is a need for a complete set of tests for the logic used here. This pull request can fix one facet of a larger problem. I think we need to better document the larger use case set and then setup a tracking issue so that in addition to my pull request here fixing one aspect, we can ensure that all the other aspects of the return code story are working.

@thatch45
Copy link
Copy Markdown
Contributor

Sounds like a good approach

@thatch45
Copy link
Copy Markdown
Contributor

Let me know when this is ready for review again

@thatch45
Copy link
Copy Markdown
Contributor

I am going to close this out since we have not heard for a while, please resubmit any updates along these lines

@thatch45 thatch45 closed this Feb 26, 2014
@thedrow
Copy link
Copy Markdown
Contributor

thedrow commented Mar 22, 2014

@techdragon Was this PR superseded by #11337?

@techdragon
Copy link
Copy Markdown
Contributor Author

@thedrow yes this was superseded by #11337

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants