The log streaming system suffers a 10 second delay after a skipped
command task because:
* when processing the results for the next task, we ensure that the
log streams for all of the previous tasks have stopped
* when stopping a log stream, we wait for the remote command process
to finish before we close the stream
* if a log file has not appeared yet, we can't determine if the
stream has finished
* so we wait 10 seconds for the log to appear before we proceed and
terminate the stream regardless.
The underlying issue is that the code that terminates the stream does
not know whether to expect a log file (the normal case) or not (the
case for a skipped task).
To correct that, we make a new Streamer class which bundles all of
the information about a particular log streamer and which is
individually addressable. This way we can stop an individual
streamer immediately if it's corresponding task is skipped, or stop
all of the streamers with a 10 second grace period in the normal
case.
Since the behavioral difference is a 10 second delay (but otherwise
no change in job output), we can't test the behavioral outcome of
this change, but we can exercise the code by ensuring that there are
skipped command tasks in the remote stream tests.
Change-Id: Id6ee13e5a82aa8fa3a2a0dd293cf99ed5b84347a
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.
This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.
The new behavior is configurable and off by default to retain backward
compatibility.
Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
I noticed in some of our testing a construct like
debug:
msg: '{{ ansible_version }}'
was actually erroring out; you'll see in the console output if you're
looking
Ansible output: b'TASK [Print ansible version msg={{ ansible_version }}] *************************'
Ansible output: b'[WARNING]: Failure using method (v2_runner_on_ok) in callback plugin'
Ansible output: b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
Ansible output: b"0x7f502760b490>): 'dict' object has no attribute 'startswith'"
and the job-output.txt will be empty for this task (this is detected
by by I9f569a411729f8a067de17d99ef6b9d74fc21543).
This is because the msg value here comes in as a dict, and in several
places we assume it is a string. This changes places we inspect the
msg variable to use the standard Ansible way to make a text string
(to_text function) and ensures in the logging function it converts the
input to a string.
We test for this with updated tasks in the remote_zuul_stream tests.
It is slightly refactored to do partial matches so we can use the
version strings, which is where we saw the issue.
Change-Id: I6e6ed8dba2ba1fc74e7fc8361e8439ea6139279e
If the command didn't exist:
- Popen would throw an exception
- 't' would not be updated (t is None)
- return code would not be written to the console
- zuul_stream would wait unecessary for 10 seconds
As rc is defined in normal case or in both exceptions, it can be written
in each case to the console.
Change-Id: I77a4e1bdc6cd163143eacda06555b62c9195ee38
If a role is applied to a host more than once (via either play
roles or include_roles, but not via an include_role loop), it will
have the same task UUID from ansible which means Zuul's command
plugin will write the streaming output to the same filename, and
the log streaming will request the same file. That means the file
might look this after the second invocation:
2022-05-19 17:06:23.673625 | one
2022-05-19 17:06:23.673781 | [Zuul] Task exit code: 0
2022-05-19 17:06:29.226463 | two
2022-05-19 17:06:29.226605 | [Zuul] Task exit code: 0
But since we stop reading the log after "Task exit code", the user
would see "one" twice, and never see "two".
Here are some potential fixes for this that don't work:
* Accessing the task vars from zuul_stream to store any additional
information: the callback plugins are not given the task vars.
* Setting the log id on the task args in zuul_stream instead of
command: the same Task object is used for each host and therefore
the command module might see the task object after it has been
further modified (in other words, nothing host-specific can be
set on the task object).
* Setting an even more unique uuid than Task._uuid on the Task
object in zuul_stream and using that in the command module instead
of Task._uuid: in some rare cases, the actual task Python object
may be different between the callback and command plugin, yet still
have the same _uuid; therefore the new attribute would be missing.
Instead, a global variable is used in order to transfer data between
zuul_stream and command. This variable holds a counter for each
task+host combination. Most of the time it will be 1, but if we run
the same task on the same host again, it will increment. Since Ansible
will not run more than one task on a host simultaneously, so there is
no race between the counter being incremented in zuul_stream and used
in command.
Because Ansible is re-invoked for each playbook, the memory usage is
not a concern.
There may be a fork between zuul_stream and command, but that's fine
as long as we treat it as read-only in the command plugin. It will
have the data for our current task+host from the most recent zuul_stream
callback invocation.
This change also includes a somewhat unrelated change to the test
infrastructure. Because we were not setting the log stream port on
the executor in tests, we were actually relying on the "real" OpenDev
Zuul starting zuul_console on the test nodes rather than the
zuul_console we set up for each specific Ansible version from the tests.
This corrects that and uses the correct zuul_console port, so that if we
make any changes to zuul_console in the future, we will test the
changed version, not the one from the Zuul which actually runs the
tox-remote job.
Change-Id: Ia656db5f3dade52c8dbd0505b24049fe0fff67a5
This adds support for Ansible 5. As mentioned in the reno, only
the major version is specified; that corresponds to major.minor in
Ansible core, so is approximately equivalent to our current regime.
The command module is updated to be based on the current code in
ansible core 2.12.4 (corresponding to community 5.6.0). The previous
version is un-symlinked and copied to the 2.8 and 2.8 directories
for easy deletion as they age out.
The new command module has corrected a code path we used to test
that the zuul_stream module handles python exceptions in modules,
so instead we now take advantage of the ability to load
playbook-adjacent modules to add a test fixture module that always
raises an exception. The zuul stream functional test validation is
adjusted to match the new values.
Similarly, in test_command in the remote tests, we relied on that
behavior, but there is already a test for module exceptions in
test_module_exception, so that check is simply removed.
Among our Ansible version tests, we occasionally had tests which
exercised 2.8 but not 2.9 because it is the default and is otherwise
tested. This change adds explicit tests for 2.9 even if they are
redundant in order to make future Ansible version updates easier and
more mechanical (we don't need to remember to add 2.9 later when
we change the default).
This is our first version of Ansible where the value of
job.ansible-version could be interpreted as an integer, so the
configloader is updated to handle that possibility transparently,
as it already does for floating point values.
Change-Id: I694b979077d7944b4b365dbd8c72aba3f9807329
Zuul was designed to block local code execution in untrusted
environments to not only rely on bwrap to contain a job. This got
broken since the creation of a command plugin that injects the
zuul_job_id which is required for log streaming. However this plugin
doesn't do a check if the task is a localhost task. Further it is
required in trusted and untrusted environments due to log
streaming. Thus we need to fork this plugin and restrict the variant
that is used in untrusted environments.
We do this by moving actiongeneral/command.py back to action/*. We
further introduce a new catecory actiontrusted which gets the
unrestricted version of this plugin.
Change-Id: If81cc46bcae466f4c071badf09a8a88469ae6779
Story: 2007935
Task: 40391
This change ensures the zuul-tox-remote job is testing the unique zuul_console
service instead of the one setup by the job.
Change-Id: If2e2324235a6608985168d606d95adeade36afe0
Currently the default ansible version is selected by the version of
zuul itself. However we want to make this configurable per deployment
(zuul.conf), tenant and job.
Change-Id: Iccbb124ac7f7a8260c730fbc109ccfc1dec09f8b
In the zuul_stream callback we check for 'MODULE FAILURE' and then log
stdout, exception or stderr whichever we find first. However in some
cases stdout exists but is empty. In this case we don't log the error
reason and the user needs to look into the json log which contains the
correct information. This can be easily fixed by checking if stdout is
empty.
Change-Id: I8019d40e2e99310eaeab03f652368dac66e1ee17
Currently we do live streaming of command and shell tasks in
loops. However this is severely broken as we only get the output of
the first task that ran. This is hard to solve as a per task streaming
in loops is not possible because of missing start events. Further
streaming several tasks at once is difficult too because we don't know
how many tasks we need to stream upfront.
So the intermediate solution until the whole log streaming is reworked
is to just not live stream tasks in loops.
While we're at it also fix a broken log statement and remove duplicate
exit code prints of action tasks.
Change-Id: Ic1358d5b9f939549ffdd5d770fe4eafc047be6f1
This is an attempt to make the existing streaming log test more robust
based on observed behavior in the wild. Basically we need to make sure
that if nothing is logged for longer than our streaming connection
timeout that we still get all the data.
Change-Id: I0a60c92d8a75dac195f6b85de379b612a2148e37
We have observed that if a shell task runs on multiple hosts after
an include_role, the task will report a missing zuul_log_id on the
second host. This adds a test for that.
Change-Id: I59787285674947c24b8919f1c4d5ced9439da9f8
Story: 2002528
Task: 22067
A further edge case which could potentially break in the future is
using blocks with rescue and always tasks. Add them to the test to
make sure they won't break.
Change-Id: I2a45ba303825dff8ed9f64a68a6b71c0e5afa08c
When using the free play strategy the command and shell tasks fail
logging due to an undefined variable. Fix that and add tests for it.
Change-Id: Ib869b12c54201b59f3de852b502ff62183db7266
With the update to Ansible 2.5 command and shell tasks used in
handlers are broken and fail with [1]. The reason for this is that the
callback v2_playbook_on_task_start is not called anymore for
handlers. Instead the callback v2_playbook_on_handler_task_start is
called for them. This leads to a missing zuul_log_id in the handler
task and trying to log to /tmp/console-None.log. In case this file was
already created by a handler using sudo it may not be accessible which
leads to the exception.
This can be fixed by also defining v2_playbook_on_handler_task_start
in zuul_stream.
Also add a validation of zuul_log_id in the command module. This
should make it easier to spot such errors next time.
[1] Trace
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 185, in follow
with Console(log_uuid) as console:
File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 162, in __enter__
self.logfile = open(self.logfile_name, 'ab', buffering=0)
IOError: [Errno 13] Permission denied: '/tmp/console-None.log'
Change-Id: Ib9dd7fe09e4e7734f7a9ada876e6ce450ebc5038
Story: 2002528
Task: 22067
We suspect there are more fixes required, and the additional check
these changes added may cause more harm in the interim.
This reverts commit 3512a5608c.
This reverts commit d94a0d6f06.
Change-Id: I6e874cd68a1cf6f974e36ab1870573bddf4e647b
When using the free play strategy the command and shell tasks fail
logging due to an undefined variable. Fix that and add tests for it.
Change-Id: I830d64d6089f869dc5a0e3d4fe346e6480884793
With the update to Ansible 2.5 command and shell tasks used in
handlers are broken and fail with [1]. The reason for this is that the
callback v2_playbook_on_task_start is not called anymore for
handlers. Instead the callback v2_playbook_on_handler_task_start is
called for them. This leads to a missing zuul_log_id in the handler
task and trying to log to /tmp/console-None.log. In case this file was
already created by a handler using sudo it may not be accessible which
leads to the exception.
This can be fixed by also defining v2_playbook_on_handler_task_start
in zuul_stream.
Also add a validation of zuul_log_id in the command module. This
should make it easier to spot such errors next time.
[1] Trace
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 185, in follow
with Console(log_uuid) as console:
File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 162, in __enter__
self.logfile = open(self.logfile_name, 'ab', buffering=0)
IOError: [Errno 13] Permission denied: '/tmp/console-None.log'
Change-Id: I1978aa8faa488fec87406e1481d455d49731f867
Story: 2002528
Task: 22067
This tests zuul-stream using the 'remote' tests since many aspects
of zuul-stream require a remote node to be exercised.
Change-Id: I08edee6f9388faf13615c6dc310df7a94d0419b2