Commit Graph

23 Commits

Author SHA1 Message Date
James E. Blair b4e49fd8a1 Fix 10 second delay after skipped command task
The log streaming system suffers a 10 second delay after a skipped
command task because:

* when processing the results for the next task, we ensure that the
  log streams for all of the previous tasks have stopped
* when stopping a log stream, we wait for the remote command process
  to finish before we close the stream
* if a log file has not appeared yet, we can't determine if the
  stream has finished
* so we wait 10 seconds for the log to appear before we proceed and
  terminate the stream regardless.

The underlying issue is that the code that terminates the stream does
not know whether to expect a log file (the normal case) or not (the
case for a skipped task).

To correct that, we make a new Streamer class which bundles all of
the information about a particular log streamer and which is
individually addressable.  This way we can stop an individual
streamer immediately if it's corresponding task is skipped, or stop
all of the streamers with a 10 second grace period in the normal
case.

Since the behavioral difference is a 10 second delay (but otherwise
no change in job output), we can't test the behavioral outcome of
this change, but we can exercise the code by ensuring that there are
skipped command tasks in the remote stream tests.

Change-Id: Id6ee13e5a82aa8fa3a2a0dd293cf99ed5b84347a
2024-02-02 09:43:37 -08:00
Simon Westphahl 3b011296e6 Keep task stdout/stderr separate in result object
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.

This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.

The new behavior is configurable and off by default to retain backward
compatibility.

Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
2023-08-17 16:22:41 -07:00
Ian Wienand 0e9f474241 zuul_stream: handle non-string msg value
I noticed in some of our testing a construct like

 debug:
   msg: '{{ ansible_version }}'

was actually erroring out; you'll see in the console output if you're
looking

 Ansible output: b'TASK [Print ansible version msg={{ ansible_version }}] *************************'
 Ansible output: b'[WARNING]: Failure using method (v2_runner_on_ok) in callback plugin'
 Ansible output: b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
 Ansible output: b"0x7f502760b490>): 'dict' object has no attribute 'startswith'"

and the job-output.txt will be empty for this task (this is detected
by by I9f569a411729f8a067de17d99ef6b9d74fc21543).

This is because the msg value here comes in as a dict, and in several
places we assume it is a string.  This changes places we inspect the
msg variable to use the standard Ansible way to make a text string
(to_text function) and ensures in the logging function it converts the
input to a string.

We test for this with updated tasks in the remote_zuul_stream tests.
It is slightly refactored to do partial matches so we can use the
version strings, which is where we saw the issue.

Change-Id: I6e6ed8dba2ba1fc74e7fc8361e8439ea6139279e
2022-08-24 13:32:37 +10:00
Guillaume Chauvel 6a53da691f
Fix zuul_run_command ret when command is not found
If the command didn't exist:
    - Popen would throw an exception
    - 't' would not be updated (t is None)
    - return code would not be written to the console
    - zuul_stream would wait unecessary for 10 seconds

As rc is defined in normal case or in both exceptions, it can be written
in each case to the console.

Change-Id: I77a4e1bdc6cd163143eacda06555b62c9195ee38
2022-06-17 21:26:08 +02:00
James E. Blair 096e47dc43 Fix console log streaming with duplicated roles
If a role is applied to a host more than once (via either play
roles or include_roles, but not via an include_role loop), it will
have the same task UUID from ansible which means Zuul's command
plugin will write the streaming output to the same filename, and
the log streaming will request the same file.  That means the file
might look this after the second invocation:

2022-05-19 17:06:23.673625 | one
2022-05-19 17:06:23.673781 | [Zuul] Task exit code: 0
2022-05-19 17:06:29.226463 | two
2022-05-19 17:06:29.226605 | [Zuul] Task exit code: 0

But since we stop reading the log after "Task exit code", the user
would see "one" twice, and never see "two".

Here are some potential fixes for this that don't work:

* Accessing the task vars from zuul_stream to store any additional
  information: the callback plugins are not given the task vars.
* Setting the log id on the task args in zuul_stream instead of
  command: the same Task object is used for each host and therefore
  the command module might see the task object after it has been
  further modified (in other words, nothing host-specific can be
  set on the task object).
* Setting an even more unique uuid than Task._uuid on the Task
  object in zuul_stream and using that in the command module instead
  of Task._uuid: in some rare cases, the actual task Python object
  may be different between the callback and command plugin, yet still
  have the same _uuid; therefore the new attribute would be missing.

Instead, a global variable is used in order to transfer data between
zuul_stream and command.  This variable holds a counter for each
task+host combination.  Most of the time it will be 1, but if we run
the same task on the same host again, it will increment.  Since Ansible
will not run more than one task on a host simultaneously, so there is
no race between the counter being incremented in zuul_stream and used
in command.

Because Ansible is re-invoked for each playbook, the memory usage is
not a concern.

There may be a fork between zuul_stream and command, but that's fine
as long as we treat it as read-only in the command plugin.  It will
have the data for our current task+host from the most recent zuul_stream
callback invocation.

This change also includes a somewhat unrelated change to the test
infrastructure.  Because we were not setting the log stream port on
the executor in tests, we were actually relying on the "real" OpenDev
Zuul starting zuul_console on the test nodes rather than the
zuul_console we set up for each specific Ansible version from the tests.
This corrects that and uses the correct zuul_console port, so that if we
make any changes to zuul_console in the future, we will test the
changed version, not the one from the Zuul which actually runs the
tox-remote job.

Change-Id: Ia656db5f3dade52c8dbd0505b24049fe0fff67a5
2022-05-21 08:42:03 -07:00
James E. Blair ebf5c96d57 Add support for Ansible 5
This adds support for Ansible 5.  As mentioned in the reno, only
the major version is specified; that corresponds to major.minor in
Ansible core, so is approximately equivalent to our current regime.

The command module is updated to be based on the current code in
ansible core 2.12.4 (corresponding to community 5.6.0).  The previous
version is un-symlinked and copied to the 2.8 and 2.8 directories
for easy deletion as they age out.

The new command module has corrected a code path we used to test
that the zuul_stream module handles python exceptions in modules,
so instead we now take advantage of the ability to load
playbook-adjacent modules to add a test fixture module that always
raises an exception.  The zuul stream functional test validation is
adjusted to match the new values.

Similarly, in test_command in the remote tests, we relied on that
behavior, but there is already a test for module exceptions in
test_module_exception, so that check is simply removed.

Among our Ansible version tests, we occasionally had tests which
exercised 2.8 but not 2.9 because it is the default and is otherwise
tested.  This change adds explicit tests for 2.9 even if they are
redundant in order to make future Ansible version updates easier and
more mechanical (we don't need to remember to add 2.9 later when
we change the default).

This is our first version of Ansible where the value of
job.ansible-version could be interpreted as an integer, so the
configloader is updated to handle that possibility transparently,
as it already does for floating point values.

Change-Id: I694b979077d7944b4b365dbd8c72aba3f9807329
2022-04-14 13:33:53 -07:00
Tobias Henkel bf4e9893d0
Block localhost shell tasks in untrusted playbooks
Zuul was designed to block local code execution in untrusted
environments to not only rely on bwrap to contain a job. This got
broken since the creation of a command plugin that injects the
zuul_job_id which is required for log streaming. However this plugin
doesn't do a check if the task is a localhost task. Further it is
required in trusted and untrusted environments due to log
streaming. Thus we need to fork this plugin and restrict the variant
that is used in untrusted environments.

We do this by moving actiongeneral/command.py back to action/*. We
further introduce a new catecory actiontrusted which gets the
unrestricted version of this plugin.

Change-Id: If81cc46bcae466f4c071badf09a8a88469ae6779
Story: 2007935
Task: 40391
2020-07-21 19:18:10 +02:00
Tristan Cacqueray 053e6388cd zuul-tox-remote: use unique zuul_console service
This change ensures the zuul-tox-remote job is testing the unique zuul_console
service instead of the one setup by the job.

Change-Id: If2e2324235a6608985168d606d95adeade36afe0
2019-05-17 07:56:58 +00:00
Tobias Henkel 5c2b61e638
Make ansible version configurable
Currently the default ansible version is selected by the version of
zuul itself. However we want to make this configurable per deployment
(zuul.conf), tenant and job.

Change-Id: Iccbb124ac7f7a8260c730fbc109ccfc1dec09f8b
2019-03-15 09:09:16 +01:00
Tobias Henkel 6f754dd878
Log exception on module failure with empty stdout
In the zuul_stream callback we check for 'MODULE FAILURE' and then log
stdout, exception or stderr whichever we find first. However in some
cases stdout exists but is empty. In this case we don't log the error
reason and the user needs to look into the json log which contains the
correct information. This can be easily fixed by checking if stdout is
empty.

Change-Id: I8019d40e2e99310eaeab03f652368dac66e1ee17
2019-03-06 21:51:33 +01:00
Tobias Henkel 0ae7a157e2
Don't do live streaming in loops
Currently we do live streaming of command and shell tasks in
loops. However this is severely broken as we only get the output of
the first task that ran. This is hard to solve as a per task streaming
in loops is not possible because of missing start events. Further
streaming several tasks at once is difficult too because we don't know
how many tasks we need to stream upfront.

So the intermediate solution until the whole log streaming is reworked
is to just not live stream tasks in loops.

While we're at it also fix a broken log statement and remove duplicate
exit code prints of action tasks.

Change-Id: Ic1358d5b9f939549ffdd5d770fe4eafc047be6f1
2018-11-05 22:21:38 +01:00
Zuul a5c6c31043 Merge "Fix race in tox-remote test_command" 2018-06-21 20:20:18 +00:00
Tobias Henkel d2f67f497d Fix race in tox-remote test_command
In test_command we create the file /tmp/console-None.log. We now have
two hosts in that test that are mapped to the same host. This could
lead to failures [1] during creation of that file. Fix that by adding
run_once.

[1] Test failure
2018-06-20 14:12:00.349980 | TASK [Create unwritable /tmp/console-None.log]
2018-06-20 14:12:00.727614 | controller | changed
2018-06-20 14:12:00.734316 | compute1 | ERROR
2018-06-20 14:12:00.734612 | compute1 | {
2018-06-20 14:12:00.734734 | compute1 |   "gid": 1000,
2018-06-20 14:12:00.734828 | compute1 |   "group": "zuul",
2018-06-20 14:12:00.734910 | compute1 |   "mode": "0444",
2018-06-20 14:12:00.734987 | compute1 |   "msg": "Error, could not touch target: [Errno 13] Permission denied: \'/tmp/console-None.log\'",
2018-06-20 14:12:00.735063 | compute1 |   "owner": "zuul",
2018-06-20 14:12:00.735137 | compute1 |   "path": "/tmp/console-None.log",
2018-06-20 14:12:00.735211 | compute1 |   "size": 0,
2018-06-20 14:12:00.735284 | compute1 |   "state": "file",
2018-06-20 14:12:00.735357 | compute1 |   "uid": 1000
2018-06-20 14:12:00.735430 | compute1 | }

Change-Id: Ib1d48fea57447131d7018e6a2772f289a668299a
2018-06-21 19:31:57 +00:00
Clark Boylan 7451fdd1e8
Make streaming test run longer than connection timeout
This is an attempt to make the existing streaming log test more robust
based on observed behavior in the wild. Basically we need to make sure
that if nothing is logged for longer than our streaming connection
timeout that we still get all the data.

Change-Id: I0a60c92d8a75dac195f6b85de379b612a2148e37
2018-06-20 15:19:31 +02:00
Tobias Henkel 1ef438c450 Remove failed_when when creating /tmp/console-None.log
This should make the streaming test more robust once zuul is updated
with the fix for this.

Change-Id: I0568a7abc59d3823e7e87678550e6a61575aad4b
2018-06-15 04:53:46 +00:00
James E. Blair 6526ce8282 Add test for shell after include_role
We have observed that if a shell task runs on multiple hosts after
an include_role, the task will report a missing zuul_log_id on the
second host.  This adds a test for that.

Change-Id: I59787285674947c24b8919f1c4d5ced9439da9f8
Story: 2002528
Task: 22067
2018-06-13 15:38:36 -07:00
Tobias Henkel 28dc87d4ee Add blocks to the zuul stream test
A further edge case which could potentially break in the future is
using blocks with rescue and always tasks. Add them to the test to
make sure they won't break.

Change-Id: I2a45ba303825dff8ed9f64a68a6b71c0e5afa08c
2018-06-12 08:58:30 -07:00
Tobias Henkel 1310aa592b Fix command tasks with free play strategy
When using the free play strategy the command and shell tasks fail
logging due to an undefined variable. Fix that and add tests for it.

Change-Id: Ib869b12c54201b59f3de852b502ff62183db7266
2018-06-12 08:54:08 -07:00
Tobias Henkel bfcdff9118 Fix broken command tasks in handlers
With the update to Ansible 2.5 command and shell tasks used in
handlers are broken and fail with [1]. The reason for this is that the
callback v2_playbook_on_task_start is not called anymore for
handlers. Instead the callback v2_playbook_on_handler_task_start is
called for them. This leads to a missing zuul_log_id in the handler
task and trying to log to /tmp/console-None.log. In case this file was
already created by a handler using sudo it may not be accessible which
leads to the exception.

This can be fixed by also defining v2_playbook_on_handler_task_start
in zuul_stream.

Also add a validation of zuul_log_id in the command module. This
should make it easier to spot such errors next time.

[1] Trace
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 185, in follow
    with Console(log_uuid) as console:
  File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 162, in __enter__
    self.logfile = open(self.logfile_name, 'ab', buffering=0)
IOError: [Errno 13] Permission denied: '/tmp/console-None.log'

Change-Id: Ib9dd7fe09e4e7734f7a9ada876e6ce450ebc5038
Story: 2002528
Task: 22067
2018-06-12 08:53:59 -07:00
James E. Blair 896df11638 Revert callback fixes
We suspect there are more fixes required, and the additional check
these changes added may cause more harm in the interim.

This reverts commit 3512a5608c.
This reverts commit d94a0d6f06.

Change-Id: I6e874cd68a1cf6f974e36ab1870573bddf4e647b
2018-06-12 08:53:19 -07:00
Tobias Henkel 3512a5608c
Fix command tasks with free play strategy
When using the free play strategy the command and shell tasks fail
logging due to an undefined variable. Fix that and add tests for it.

Change-Id: I830d64d6089f869dc5a0e3d4fe346e6480884793
2018-06-12 12:30:52 +02:00
Tobias Henkel d94a0d6f06
Fix broken command tasks in handlers
With the update to Ansible 2.5 command and shell tasks used in
handlers are broken and fail with [1]. The reason for this is that the
callback v2_playbook_on_task_start is not called anymore for
handlers. Instead the callback v2_playbook_on_handler_task_start is
called for them. This leads to a missing zuul_log_id in the handler
task and trying to log to /tmp/console-None.log. In case this file was
already created by a handler using sudo it may not be accessible which
leads to the exception.

This can be fixed by also defining v2_playbook_on_handler_task_start
in zuul_stream.

Also add a validation of zuul_log_id in the command module. This
should make it easier to spot such errors next time.

[1] Trace
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 185, in follow
    with Console(log_uuid) as console:
  File "/tmp/ansible_X_j_M1/ansible_module_command.py", line 162, in __enter__
    self.logfile = open(self.logfile_name, 'ab', buffering=0)
IOError: [Errno 13] Permission denied: '/tmp/console-None.log'

Change-Id: I1978aa8faa488fec87406e1481d455d49731f867
Story: 2002528
Task: 22067
2018-06-12 12:30:49 +02:00
James E. Blair a51a9e7b8b Add zuul-stream remote tests
This tests zuul-stream using the 'remote' tests since many aspects
of zuul-stream require a remote node to be exercised.

Change-Id: I08edee6f9388faf13615c6dc310df7a94d0419b2
2018-03-20 16:23:41 -07:00