Enable highstate failure retcode by techdragon · Pull Request #10504 · saltstack/salt

techdragon · 2014-02-17T13:28:20Z

new config variable to enable/disable the new functionality
added new function to determine if the highstate return data indicates
an error or failure has occurred.
added new logic to the end of the salt cli call that if enabled using
the config variable, determines if salt should exit with a non zero
exit code and then will exit if any failures or errors are found in
the highstate return data.

Should fix issue #7013

- new config variable to enable/disable the new functionality - added new function to determine if the highstate return data indicates an error or failure has occurred. - added new logic to the end of the salt cli call that if enabled using the config variable, determines if salt should exit with a non zero exit code and then will exit if any failures or errors are found in the highstate return data.

…-failure-retcode

kiorky · 2014-02-17T14:06:25Z

Can the bool value be true as default, as anyway the retcode was inconsistent before ?

kiorky · 2014-02-17T14:06:58Z

Really love to see that pretty soon reviewed and merged !
/cc @regilero @Gagaro

ghost · 2014-02-17T14:12:54Z

Test FAILed.
Refer to this link for build results: http://jenkins.saltstack.com/job/salt-pr-build/1540/

kiorky · 2014-02-17T17:33:00Z

Im checking before enabling that as default if there are some performances troubles with big highstates, see the gist: https://gist.github.com/kiorky/9052549

Test is running ATM, please ping me if i forget to post the result ^.^

kiorky · 2014-02-17T18:51:02Z

SAMPLES == An highstate of N minions which 1000 states for each minion

[PROCESSING THE 10000 SAMPLES of 1000 states] finished in 4.5949280262s
[PROCESSING THE 1000 SAMPLES of 1000 states] finished in 0.462728977203s
[PROCESSING THE 100 SAMPLES of 1000 states] finished in 0.0464670658112s
[PROCESSING THE 10 SAMPLES of 1000 states] finished in 0.00506496429443s
[PROCESSING THE 1 SAMPLES of 1000 states] finished in 0.000871896743774s

[PROCESSING THE 10000 SAMPLES of 3000 states] finished in 14.2559530735s
[PROCESSING THE 1000 SAMPLES of 3000 states] finished in 1.43430519104s
[PROCESSING THE 100 SAMPLES of 3000 states] finished in 0.144124984741s
[PROCESSING THE 10 SAMPLES of 3000 states] finished in 0.015741109848s
[PROCESSING THE 1 SAMPLES of 3000 states] finished in 0.00277495384216s

[PROCESSING THE 30000 SAMPLES of 1500 states] finished in 59.8476610184s
[PROCESSING THE 3000 SAMPLES of 1500 states] finished in 2.21659708023s
[PROCESSING THE 300 SAMPLES of 1500 states] finished in 0.223330974579s
[PROCESSING THE 30 SAMPLES of 1500 states] finished in 0.0301289558411s
[PROCESSING THE 3 SAMPLES of 1500 states] finished in 0.00558114051819s

kiorky · 2014-02-17T18:56:09Z

For our deployments, we have ~1000 states per minion, that why i choosed 1000 in the first place, 3000 seem irrealistic.

kiorky · 2014-02-17T19:15:28Z

So for me the overhead of having the highstate results seems acceptable even with large highstates which even here seems a little irrealistic even for big deployments.
/cc @techdragon

thatch45 · 2014-02-17T22:01:59Z

This check should already be getting performed on the minion end and setting the retcode field in the return data, so we should not have to do it again. We just need to re-visit retcode-passthrough for the salt command. I think that this is not the best approach here. Is there anything wrong with _set_retcode in salt/modules/state.py?

kiorky · 2014-02-17T22:35:44Z

/cc @thatch45

cat /srv/salt/toto.sls;salt '*' state.sls toto;echo "retcode: $?"
/bin/false:
  cmd.run: []
devhost2:
----------
          ID: /bin/false
    Function: cmd.run
      Result: False
     Comment: Command "/bin/false" run
     Changes:
              ----------
              pid:
                  5351
              retcode:
                  1
              stderr:

              stdout:


Summary
------------
Succeeded: 0
Failed:    1
------------
Total:     1
retcode: 0

techdragon · 2014-02-18T09:33:59Z

@thatch45 I didn't even know that function was there, but after taking some time to investigate it, yes something is wrong with that function. Its broken... to quite a surprising extent.

Each minion when requested to process a state, be it state.highstate, state.sls , state.single, state.top, state.template_str, state.template, or state.high has a fairly predictable code path with regards to the return code.

After executing the states, each minion while building the return dictionary object ret, will call _set_retcode(ret).

_set_retcode(ret) then does one of two things.

First it does a quick check to see if ret is a list, and if it is a list, it sets the return code key in the __context__ dict to 1 and then returns.

if isinstance(ret, list):
    __context__['retcode'] = 1
    return

If ret is not a list, it will then be passed to another function.

if not salt.utils.check_state_result(ret):
    __context__['retcode'] = 2

So far so good. The first check works as expected. Our second check however...

if not isinstance(running, dict):
    return False
if not running:
    return False

All good so far, just simple sanity checks check.
Now for the problem. The minion has passed a dict that looks like this to check_state_result(ret).

running = {
    'file_|-deliberate-pass_|-/Users/guest/deployall.jpg_|-exists': {
        'comment': 'Path /Users/guest/deployall.jpg exists', 
        '__run_num__': 0, 
        'changes': {}, 
        'name': '/Users/guest/deployall.jpg', 
        'result': True
        }, 
    'test_|-test-passes_|-foobar_|-succeed_without_changes': {
        'comment': '', 
        '__run_num__': 3, 
        'changes': {}, 
        'name': 'foobar', 
        'result': True
        }, 
    'layman_|-deliberate-fail_|-sunrise_|-present': {
        'comment': 'State layman.present found in sls test is unavailable\n', 
        '__run_num__': 2, 
        'changes': {}, 
        'result': False, 
        'name': 'sunrise'
        }, 
    'process_|-deliberate-error_|-apache2_|-absent': {
        'comment': 'An exception occurred in this state: Traceback (most recent call last):\n  File "/Users/user/dev/github.com/user/salt/salt/state.py", line 1371, in call\n    **cdata[\'kwargs\'])\n  File "/Users/user/dev/github.com/user/salt/salt/states/process.py", line 41, in absent\n    status = __salt__[\'ps.pkill\'](name, full=True)\n  File "/Users/user/dev/github.com/user/salt/salt/modules/ps.py", line 184, in pkill\n    name_match = pattern in \' \'.join(proc.cmdline) if full \\\n  File "/Users/user/.virtualenvs/salt-dev/lib/python2.7/site-packages/psutil/__init__.py", line 402, in cmdline\n    return self._platform_impl.get_process_cmdline()\n  File "/Users/user/.virtualenvs/salt-dev/lib/python2.7/site-packages/psutil/_psosx.py", line 172, in wrapper\n    raise AccessDenied(self.pid, self._process_name)\nAccessDenied: (pid=1)\n', '__run_num__': 1, 'changes': {}, 'result': False, 'name': 'apache2'
        }
    }

So whats wrong with that? Nothing. Its the function's checking logic that is the broken part, its written assuming that we have a different dictionary structure...

for host in running:
    if not isinstance(running[host], dict):
        return False
    if host.find('_|-') >= 3:
        # This is a single ret, no host associated
        rets = running[host]
    else:
        rets = running[host].values()

    if isinstance(rets, dict) and 'result' in rets:
        if rets['result'] is False:
            return False
        return True

    for ret in rets:
        if not isinstance(ret, dict):
            return False
        if 'result' not in ret:
            return False
        if ret['result'] is False:
            return False
return True

Running contains only the state results for each individual minion, it represents 1 host, so itterating over 'hosts' is wrong. In addition, the if isinstance conditional always exits true since it is always a dict, and since our dictionary is unsorted, it returns True or False based on what the result of the first state in the ditionary is. If your 1st state is true and then you have 100 broken ones, check_state_result(ret) returns True and _set_retcode(ret) passes back without setting a return code assuming all is good. If we skip over this conditional we have more problems. for ret in rets: with our dictionary structure is going to loop over the keys 'comment', 'changes', 'name', 'result', etc, that make up the results from each state call. the first conditional will always fail unless it happens to get 'changes' as the first item out of the dict, the second conditional will always fail unless your result happens to be the string 'result', and the third conditional will never evaluate to true and wont be executed making it pointless. Assuming we've run the gauntlet (which we never do with the current code) we are supposed to return True which is as expected.

After all this, we have one last problem with the current code path, nothing from __context__ is passed back in the job information dictionary which prevents the client/caller from ever knowing what the value produced by _set_retcode is.

Some good news however, I'll be updating the pull request to address all this and use the existing 'intended' results checking pathway.

thatch45 · 2014-02-18T17:27:53Z

Thanks for the catch, I think that this iteration was added when we originally did these checks and were looking at the return data on the master not on the minion, so we should be safe in making these changes. Thanks for the detective work here! It is greatly appreciated!

techdragon · 2014-02-19T03:46:30Z

At the heart of this problem is a need for a complete set of tests for the logic used here. This pull request can fix one facet of a larger problem. I think we need to better document the larger use case set and then setup a tracking issue so that in addition to my pull request here fixing one aspect, we can ensure that all the other aspects of the return code story are working.

thatch45 · 2014-02-19T21:42:18Z

Sounds like a good approach

thatch45 · 2014-02-19T21:42:34Z

Let me know when this is ready for review again

thatch45 · 2014-02-26T16:47:24Z

I am going to close this out since we have not heard for a while, please resubmit any updates along these lines

thedrow · 2014-03-22T08:01:43Z

@techdragon Was this PR superseded by #11337?

techdragon · 2014-03-23T04:41:24Z

@thedrow yes this was superseded by #11337

techdragon added 2 commits February 17, 2014 21:10

Merge remote-tracking branch 'upstream/develop' into enable-highstate…

38d7e0a

…-failure-retcode

kiorky mentioned this pull request Feb 17, 2014

salt-call state.highstate should optionally use exit codes to report failures #4176

Closed

thatch45 closed this Feb 26, 2014

Conversation

techdragon commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

ghost commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

thatch45 commented Feb 17, 2014

Uh oh!

kiorky commented Feb 17, 2014

Uh oh!

techdragon commented Feb 18, 2014

Uh oh!

thatch45 commented Feb 18, 2014

Uh oh!

techdragon commented Feb 19, 2014

Uh oh!

thatch45 commented Feb 19, 2014

Uh oh!

thatch45 commented Feb 19, 2014

Uh oh!

thatch45 commented Feb 26, 2014

Uh oh!

thedrow commented Mar 22, 2014

Uh oh!

techdragon commented Mar 23, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants