Enable highstate failure retcode #10504
Conversation
- New config variable to enable/disable the new functionality.
- Added a new function to determine whether the highstate return data indicates that an error or failure has occurred.
- Added new logic to the end of the salt CLI call that, if enabled via the config variable, determines whether salt should exit with a non-zero exit code, and exits with one if any failures or errors are found in the highstate return data.
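A minimal sketch of the behavior described above, assuming the CLI holds the return data as a dict keyed by minion ID. The option name `highstate_failure_retcode` and both function names are hypothetical and are not taken from the patch itself:

```python
def highstate_has_failure(ret):
    """Return True if any minion's highstate data shows an error or failure."""
    for data in ret.values():
        # Assumption: a non-dict return (e.g. a list of error strings) or an
        # empty return means the run itself failed.
        if not isinstance(data, dict) or not data:
            return True
        for state in data.values():
            # Only an explicit False counts as a failure; None (test mode) does not.
            if not isinstance(state, dict) or state.get('result') is False:
                return True
    return False


def cli_exit_code(ret, opts):
    """Pick the process exit code; the check only runs when the option is enabled."""
    if opts.get('highstate_failure_retcode', False) and highstate_has_failure(ret):
        return 1
    return 0
```

With the option left at its default of False the exit code stays 0, so existing behavior would be unchanged unless the check is switched on.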
|
Can the bool value default to true, since the retcode was inconsistent before anyway? |
|
Test FAILed. |
|
Before enabling that as the default, I'm checking whether there are performance problems with big highstates; see the gist: https://gist.github.com/kiorky/9052549 The test is running at the moment, please ping me if I forget to post the result ^.^ |
|
SAMPLES == An highstate of N minions which 1000 states for each minion
[PROCESSING THE 10000 SAMPLES of 1000 states] finished in 4.5949280262s
[PROCESSING THE 10000 SAMPLES of 3000 states] finished in 14.2559530735s
[PROCESSING THE 30000 SAMPLES of 1500 states] finished in 59.8476610184s |
|
For our deployments, we have ~1000 states per minion, which is why I chose 1000 in the first place; 3000 seems unrealistic. |
|
So for me, the overhead of processing the highstate results seems acceptable even with large highstates, and the sizes tested here seem a little unrealistic even for big deployments. |
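For readers who want to repeat a measurement of this kind, here is a rough sketch. It is not the code from the gist: the synthetic data shape, the function names, and the check itself are assumptions made for illustration.

```python
import time


def make_highstate(n_states):
    """Build synthetic return data for one minion with n_states passing states."""
    return {
        'test_|-state-{0}_|-state-{0}_|-succeed_without_changes'.format(i): {
            'comment': '',
            '__run_num__': i,
            'changes': {},
            'name': 'state-{0}'.format(i),
            'result': True,
        }
        for i in range(n_states)
    }


def has_failure(states):
    """Return True if any state in one minion's return data failed."""
    return any(s.get('result') is False for s in states.values())


def time_check(n_samples, n_states):
    """Run the check n_samples times; return (elapsed seconds, failures found)."""
    sample = make_highstate(n_states)
    start = time.perf_counter()
    failures = sum(has_failure(sample) for _ in range(n_samples))
    return time.perf_counter() - start, failures
```

Calling `time_check(10000, 1000)` mirrors the first sample size reported above; the absolute timings will depend on the machine and on what the real check does.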
|
This check should already be getting performed on the minion end and setting the retcode field in the return data, so we should not have to do it again. We just need to re-visit retcode-passthrough for the salt command. I think that this is not the best approach here. Is there anything wrong with _set_retcode in salt/modules/state.py? |
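A sketch of what passing the minion-side retcode through on the master could look like. The return shape (a dict keyed by minion ID, each entry carrying a `retcode` field) and the function name are assumptions for illustration, not the actual salt CLI code:

```python
def passthrough_retcode(returns):
    """Return the highest retcode reported by any minion, or 0 if none reported one."""
    codes = [
        data.get('retcode', 0)
        for data in returns.values()
        if isinstance(data, dict)  # skip entries that carry no structured return
    ]
    return max(codes, default=0)
```

Taking the maximum is one possible design choice; it keeps the exit code non-zero as soon as a single minion reports a failure.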
|
/cc @thatch45 |
|
@thatch45 I didn't even know that function was there, but after taking some time to investigate it, yes, something is wrong with that function. It's broken... to quite a surprising extent.
Each minion, when requested to process a state, be it [...] After executing the states, each minion, while building the return dictionary object [...]
First it does a quick check to see if ret is a list:
if isinstance(ret, list):
    __context__['retcode'] = 1
    return
If it is not, the second check runs:
if not salt.utils.check_state_result(ret):
    __context__['retcode'] = 2
So far so good. The first check works as expected. Our second check, however...
if not isinstance(running, dict):
    return False
if not running:
    return False
All good so far, just simple sanity checks.
running = {
    'file_|-deliberate-pass_|-/Users/guest/deployall.jpg_|-exists': {
        'comment': 'Path /Users/guest/deployall.jpg exists',
        '__run_num__': 0,
        'changes': {},
        'name': '/Users/guest/deployall.jpg',
        'result': True
    },
    'test_|-test-passes_|-foobar_|-succeed_without_changes': {
        'comment': '',
        '__run_num__': 3,
        'changes': {},
        'name': 'foobar',
        'result': True
    },
    'layman_|-deliberate-fail_|-sunrise_|-present': {
        'comment': 'State layman.present found in sls test is unavailable\n',
        '__run_num__': 2,
        'changes': {},
        'result': False,
        'name': 'sunrise'
    },
    'process_|-deliberate-error_|-apache2_|-absent': {
        'comment': 'An exception occurred in this state: Traceback (most recent call last):\n File "/Users/user/dev/github.com/user/salt/salt/state.py", line 1371, in call\n **cdata[\'kwargs\'])\n File "/Users/user/dev/github.com/user/salt/salt/states/process.py", line 41, in absent\n status = __salt__[\'ps.pkill\'](name, full=True)\n File "/Users/user/dev/github.com/user/salt/salt/modules/ps.py", line 184, in pkill\n name_match = pattern in \' \'.join(proc.cmdline) if full \\\n File "/Users/user/.virtualenvs/salt-dev/lib/python2.7/site-packages/psutil/__init__.py", line 402, in cmdline\n return self._platform_impl.get_process_cmdline()\n File "/Users/user/.virtualenvs/salt-dev/lib/python2.7/site-packages/psutil/_psosx.py", line 172, in wrapper\n raise AccessDenied(self.pid, self._process_name)\nAccessDenied: (pid=1)\n',
        '__run_num__': 1,
        'changes': {},
        'result': False,
        'name': 'apache2'
    }
}
So what's wrong with that? Nothing. It's the function's checking logic that is broken; it's written assuming that we have a different dictionary structure...
for host in running:
    if not isinstance(running[host], dict):
        return False
    if host.find('_|-') >= 3:
        # This is a single ret, no host associated
        rets = running[host]
    else:
        rets = running[host].values()
    if isinstance(rets, dict) and 'result' in rets:
        if rets['result'] is False:
            return False
        return True
    for ret in rets:
        if not isinstance(ret, dict):
            return False
        if 'result' not in ret:
            return False
        if ret['result'] is False:
            return False
return True
running contains only the state results for an individual minion; it represents one host, so iterating over 'hosts' is wrong. In addition, the [...] After all this, we have one last problem with the current code path: nothing from [...]
Some good news, however: I'll be updating the pull request to address all this and use the existing 'intended' results-checking pathway. |
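One consequence of the loop quoted in this comment can be reproduced with a small sketch. The first function repeats the quoted logic unchanged; `check_minion_states` is a hypothetical replacement, not the fix that was eventually merged. The example assumes Python 3.7+, where dicts keep insertion order (on Python 2.7 the iteration order was arbitrary, so the outcome depended on which key came first).

```python
def check_state_result_quoted(running):
    """The check as quoted in this thread, kept unchanged for demonstration."""
    if not isinstance(running, dict):
        return False
    if not running:
        return False
    for host in running:
        if not isinstance(running[host], dict):
            return False
        if host.find('_|-') >= 3:
            # This is a single ret, no host associated
            rets = running[host]
        else:
            rets = running[host].values()
        if isinstance(rets, dict) and 'result' in rets:
            if rets['result'] is False:
                return False
            return True  # returns after inspecting only the first state
        for ret in rets:
            if not isinstance(ret, dict):
                return False
            if 'result' not in ret:
                return False
            if ret['result'] is False:
                return False
    return True


def check_minion_states(running):
    """Hypothetical per-minion check: every state must be a dict whose result is not False."""
    if not isinstance(running, dict) or not running:
        return False
    for ret in running.values():
        if not isinstance(ret, dict) or 'result' not in ret:
            return False
        if ret['result'] is False:
            return False
    return True


# Trimmed-down version of the return data shown above: one pass, one failure.
running = {
    'file_|-deliberate-pass_|-/Users/guest/deployall.jpg_|-exists': {'result': True},
    'layman_|-deliberate-fail_|-sunrise_|-present': {'result': False},
}
```

Here `check_state_result_quoted(running)` reports success because it returns as soon as the first state passes, while `check_minion_states(running)` inspects every state and reports the failure.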
|
Thanks for the catch, I think that this iteration was added when we originally did these checks and were looking at the return data on the master not on the minion, so we should be safe in making these changes. Thanks for the detective work here! It is greatly appreciated! |
|
At the heart of this problem is the need for a complete set of tests for the logic used here. This pull request can fix only one facet of a larger problem. I think we need to better document the larger set of use cases and then set up a tracking issue, so that in addition to my pull request fixing one aspect, we can ensure that all the other aspects of the return code story are working. |
|
Sounds like a good approach |
|
Let me know when this is ready for review again |
|
I am going to close this out since we have not heard anything for a while; please resubmit any updates along these lines. |
|
@techdragon Was this PR superseded by #11337? |
Should fix issue #7013