Ansible 6 is EOL and Ansible 9 is available. Remove 6 and add 9.
This is usually done in two changes, but this time it's in one
since we can just rotate the 6 around to make it a 9.
command.py has been updated for ansible 9.
Change-Id: I537667f66ba321d057b6637aa4885e48c8b96f04
Some remote commands can run for some time without producing output.
Some network devices/firewalls/pnats/etc may decide to drop the
connection since it looks idle. Try to avoid that by sending TCP
keepalives, and if that doesn't work, that should at least cause
us to be notified that the connection is dead. In that case, the
log streamer will receive a timeout error on the socket recv call.
We will log it and stop streaming.
Change-Id: Ib8e2aae0de07ff2dcc9de8e244f09e0cf9796e20
The log streaming system suffers a 10 second delay after a skipped
command task because:
* when processing the results for the next task, we ensure that the
log streams for all of the previous tasks have stopped
* when stopping a log stream, we wait for the remote command process
to finish before we close the stream
* if a log file has not appeared yet, we can't determine if the
stream has finished
* so we wait 10 seconds for the log to appear before we proceed and
terminate the stream regardless.
The underlying issue is that the code that terminates the stream does
not know whether to expect a log file (the normal case) or not (the
case for a skipped task).
To correct that, we make a new Streamer class which bundles all of
the information about a particular log streamer and which is
individually addressable. This way we can stop an individual
streamer immediately if it's corresponding task is skipped, or stop
all of the streamers with a 10 second grace period in the normal
case.
Since the behavioral difference is a 10 second delay (but otherwise
no change in job output), we can't test the behavioral outcome of
this change, but we can exercise the code by ensuring that there are
skipped command tasks in the remote stream tests.
Change-Id: Id6ee13e5a82aa8fa3a2a0dd293cf99ed5b84347a
The imp module has been deprecated for quite some time. The importlib
module has been present since python 3.6 and replaces the imp module.
This is important because python3.12 removes the imp module entirely and
any code using imp needs to be rewritten to be python3.12 compatible.
The rewrite here comes from the suggestions in an upstream discussion
about this transition [0].
[0] https://discuss.python.org/t/how-do-i-migrate-from-imp/27885
Change-Id: Iaa742c71bbb208225726c2b8fa5af111f48667c9
We currently iterate over every job/host/etc variable in the freeze
playbook. The reason is because if any value in any variable is
Undefined according to jinja, the Ansible combine filter throws
an error. What we want to do in Zuul is merge any variable we can,
but if any is undefined, we skip it. Thus, the process of combining
the variables one at a time in a task and ignoring errors.
This process can be slow, especially if we have start with a large
amount of data in one of the early variables. The combine filter
needs to reprocess the large data repeatedly for each additional
variable.
To improve the process, we create a new action plugin, "zuul_freeze"
which takes a list of variables we want to freeze, then templates
them one at a time and stores the result in a cacheable fact. This
is the essence of what we were trying to accomplish with the combine
filter.
Change-Id: Ie41f404762daa1b1a5ae47f6ec1aa1954ad36a39
This allows users to trigger the new early failure detection by
matching regexes in the streaming job output.
For example, if a unit test job outputs something sufficiently
unique on failure, one could write a regex that matches that and
triggers the early failure detection before the playbook completes.
For hour-long unit test jobs, this could save a considerable amount
of time.
Note that this adds the google-re2 library to the Ansible venvs. It
has manylinux wheels available, so is easy to install with
zuul-manage-ansible. In Zuul itself, we use the fb-re2 library which
requires compilation and is therefore more difficult to use with
zuul-manage-ansible. Presumably using fb-re2 to validate the syntax
and then later actually using google-re2 to run the regexes is
sufficient. We may want to switch Zuul to use google-re2 later for
consistency.
Change-Id: Ifc9454767385de4c96e6da6d6f41bcb936aa24cd
Ansible tasks that fail within block tasks call the failure
callback, which means that they triggered Zuul early failure
detection even if they were later rescued. To avoid this,
ignore block tasks for purposes of early failure detection.
Also make early failure detection "sticky". This helps
uncover errors like this (the "remote" tests uncover this
particular failure if the result is made sticky), and also
ensures consistent behavior in dependent pipelines.
Change-Id: I505667678e7384386819b5389036e4fb4f108afd
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.
This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.
The new behavior is configurable and off by default to retain backward
compatibility.
Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
We can have the Ansible callback plugin tell the executor to tell
the scheduler that a task has failed and therefore the job will
fail. This will allow the scheduler to begin a gate reset before
the failing job has finished and potentially save much developer
and CPU time.
We take some extra precautions to try to avoid sending a pre-fail
notification where we think we might end up retrying the job
(either due to a failure in a pre-run playbook, or an unreachable
host). If that does happen then a gate pipeline might end up
flapping between two different NNFI configurations (ie, it may
perform unecessary gate resets behind the change with the retrying
job), but should still not produce an incorrect result. Presumably
the detections here should catch that case sufficiently early, but
due to the nature of these errors, we may need to observe it in
production to be sure.
Change-Id: Ic40b8826f2d54e45fb6c4d2478761a89ef4549e4
When a user sets zuul_console_disabled, we don't need to try to
connect to the streaming daemon. In fact, they may have set it
because they know it won't be running. Check for this and avoid
the connection step in that case and therefore avoid the extraneous
"Waiting on logger" messages and extra 30 second delay at the end
of each task.
Change-Id: I86af231f1ca1c5b54b21daae29387a8798190a58
I've managed to get better at grepping for this and this finds some of
the stragglers. They are all file opens without closes fixed by using a
with open() context manager.
Change-Id: I7b8c8516a86558e2027cb0f01aefe2dd1849069c
Upon further discussion we recently found another case of leaking
console streaming files; if the zuul_console is not running on port
19885, or can not be reached, the streaming spool files will still be
leaked.
The prior work in I823156dc2bcae91bd6d9770bd1520aa55ad875b4 has the
receiving side indicate to the zuul_console daemon that it should
remove the spool file.
If this doesn't happen, either because the daemon was never there, or
it is firewalled off, the streaming spool files are left behind.
This modifies the command action plugin to look for a variable
"zuul_console_disable" which will indicate to the library running the
shell/command task not to write out the spool file at all, as it will
not be consumed.
It is expected this would be set at a host level in inventory for
nodes that you know can not or will not have access to zuul_console
daemon.
We already have a mechanism to disable this for commands running in a
loop; we expand this with a new string type. The advantage of this is
it leaves the library/command.py side basically untouched.
Documentation is updated, and we cover this with a new test.
Change-Id: I0273993c3ece4363098e4bf30bfc4308bb69a8b4
This is a follow-on to Ia78ad9e3ec51bc47bf68c9ff38c0fcd16ba2e728 to
use a different loopback address for the local connection to the
Python 2.7 container. This way, we don't have to override the
existing localhost/127.0.0.1 matches that avoid the executor trying to
talk to a zuul_console daemon. These bits are removed.
The comment around the port settings is updated while we're here.
Change-Id: I33b2198baba13ea348052e998b1a5a362c165479
Change Ief366c092e05fb88351782f6d9cd280bfae96237 intoduced a bug in
the streaming daemons because it was using Python 3.6 features. The
streaming console needs to work on all Ansible managed nodes, which
includes back to Python 2.7 nodes (while Ansible supports that).
This introduces a regression test by building about the smallest
Python 2.7 container that can be managed by Ansbile. We start this
container and modify the test inventory to include it, then run the
stream tests against it.
The existing testing runs against the "new" console but also tests
against the console OpenDev's Zuul starts to ensure
backwards-compatability. Since this container wasn't started by Zuul
it doesn't have this, so that testing is skipped for this node.
It might be good to abstract all testing of the console daemons into
separate containers for each Ansible supported managed-node Python
version -- it's a bit more work than I want to take on right now.
This should ensure the lower-bound though and prevent regressions for
older platforms.
Change-Id: Ia78ad9e3ec51bc47bf68c9ff38c0fcd16ba2e728
With the defaulh "linear" strategy (and likely others), Ansible will
send the on_task_start callback, and then fork a worker process to
execute that task. Since we spawn a thread in the on_task_start
callback, we can end up emitting a log message in this method while
Ansible is forking. If a forked process inherits a Python file object
(i.e., stdout) that is locked by a thread that doesn't exist in the
fork (i.e., this one), it can deadlock when trying to flush the file
object. To minimize the chances of that happening, we should avoid
using _display outside the main thread.
The Python logging module is supposed to use internal locks which are
automatically aqcuired and released across a fork. Assuming this is
(still) true and functioning correctly, we should be okay to issue
our Python logging module calls at any time. If there is a fault
in this system, however, it could have a potential to cause a similar
problem.
If we can convince the Ansible maintainers to lock _display across
forks, we may be able to revert this change in the future.
Change-Id: Ifc6b835c151539e6209284728ccad467bef8be6f
I noticed in some of our testing a construct like
debug:
msg: '{{ ansible_version }}'
was actually erroring out; you'll see in the console output if you're
looking
Ansible output: b'TASK [Print ansible version msg={{ ansible_version }}] *************************'
Ansible output: b'[WARNING]: Failure using method (v2_runner_on_ok) in callback plugin'
Ansible output: b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
Ansible output: b"0x7f502760b490>): 'dict' object has no attribute 'startswith'"
and the job-output.txt will be empty for this task (this is detected
by by I9f569a411729f8a067de17d99ef6b9d74fc21543).
This is because the msg value here comes in as a dict, and in several
places we assume it is a string. This changes places we inspect the
msg variable to use the standard Ansible way to make a text string
(to_text function) and ensures in the logging function it converts the
input to a string.
We test for this with updated tasks in the remote_zuul_stream tests.
It is slightly refactored to do partial matches so we can use the
version strings, which is where we saw the issue.
Change-Id: I6e6ed8dba2ba1fc74e7fc8361e8439ea6139279e
Currently the task in the test playbook
- hosts: compute1
tasks:
- name: Command Not Found
command: command-not-found
failed_when: false
is failing in the zuul_stream callback with an exception trying to
fill out the "delta" value in the message here. The result dict
(taken from the new output) shows us why:
2022-08-24 07:19:27.079961 | TASK [Command Not Found]
2022-08-24 07:19:28.578380 | compute1 | ok: ERROR (ignored)
2022-08-24 07:19:28.578622 | compute1 | {
2022-08-24 07:19:28.578672 | compute1 | "failed_when_result": false,
2022-08-24 07:19:28.578700 | compute1 | "msg": "[Errno 2] No such file or directory: b'command-not-found'",
2022-08-24 07:19:28.578726 | compute1 | "rc": 2
2022-08-24 07:19:28.578750 | compute1 | }
i.e. it has no start/stop/delta in the result (it did run and fail, so
you'd think it might ... but this is what Ansible gives us).
This checks for this path; as mentioned the output will now look like
above in this case.
This was found by the prior change
I9f569a411729f8a067de17d99ef6b9d74fc21543. This fixes the current
warning, so we invert the test to prevent further regressions.
Change-Id: I106b2bbe626ed5af8ca739d354ba41eca2f08f77
In Ief366c092e05fb88351782f6d9cd280bfae96237 I missed that this runs
in the context of the remote node; meaning that it must support all
the Python versions that might run there. f-strings are not 3.5
compatible.
I'm thinking about how to lint this better (a syntax check run?)
Change-Id: Ia4133b061800791196cd631f2e6836cb77347664
When using protocol version 1, send a finalise message when streaming
is complete so that the zuul_console daemon can delete the temporary
file.
We test this by inspecting the Ansible console output, which logs a
message with the UUID of the streaming job. We dump the temporary
files on the remote side and make sure a console file for that job
isn't present.
Change-Id: I823156dc2bcae91bd6d9770bd1520aa55ad875b4
A refresher on how this works, to the best of my knowledge
1 Firstly, Zuul's Ansible has a library task "zuul_console:" which is
run against the remote node; this forks a console daemon, listening
on a default port.
2 We have a action plugin that runs for each task, and if that task
is a command/shell task, assigns it a unique id
3 We then override with library/command.py (which backs command/shell
tasks) with a version that forks and runs the process on the target
node as usual, but also saves the stdout/stderr output to a
temporary file named per the unique uuid from the step above.
4 At the same time we have the callback plugin zuul_stream.py, which
Ansible is calling as it moves through starting, running and
finishing the tasks. This looks at the task, and if it has a UUID
[2], sends a request to the zuul_console [1], which opens the
temporary file [3] and starts streaming it back.
5 We loop reading this until the connection is closed by [1],
eventually outputting each line.
In this way, the console log is effectively streamed and saved into
our job output.
We have established that we expect the console [1] is updated
asynchronously to the command/streaming [3,4] in situation such as
static nodes. This poses a problem if we ever want to update either
part -- for example we can not change the file-name that the
command.py file logs to, because an old zuul_console: will not know to
open the new file. You could imagine other fantasy things you might
like to do; e.g. negotiate compression etc. that would have similar
issues.
To provide the flexibility for these types of changes, implement a
simple protocol where the zuul_stream and zuul_console sides exchange
their respective version numbers before sending the log files. This
way they can both decide what operations are compatible both ways.
Luckily the extant protocol, which is really just sending a plain
uuid, can be adapted to this. When an old zuul_console server gets
the protocol request it will just look like an invalid log file, which
zuul_stream can handle and thus assume the remote end doesn't know
about protocols.
This bumps the testing timeout; it seems that the extra calls make for
random failures. The failures are random and not in the same place,
I've run this separately in 850719 several times and not seen any
problems with the higher timeout. This test is already has a settle
timeout slightly higher so I think it must have just been on the edge.
Change-Id: Ief366c092e05fb88351782f6d9cd280bfae96237
In zuul_stream.py:v2_playbook_on_task_start() it checks for
"task.loop" and exits if the task is part of a loop.
However the library/command.py override still writes out the console
log despite it never being read. To avoid leaving this file around,
mark a sentinel uuid in the action plugin if the command is part of a
loop. In that case, for simplicity we just write to /dev/null -- that
way no other assumptions in the library/command.py have to change; it
just doesn't leave a file on disk.
This is currently difficult to test as the infrastructure zuul_console
leaves /tmp/console-* files and we do not know what comes from that,
or testing. After this and the related change
I823156dc2bcae91bd6d9770bd1520aa55ad875b4 are deployed to the
infrastructure executors, we can make a simple and complete test for
the future by just ensuring no /tmp/console-* files are left behind
afer testing. I have tested this locally and do not see files from
loops, which I was before this change.
Change-Id: I4f4660c3c0b0f170561c14940cc159dc43eadc79
The Zuul Ansible callback stream plugin assumed that the ansible loop
var was always called 'item' in the result_dict. You can override this
value (and it is often necessary to do so to avoid collisions) to
something less generic. In those cases we would get errors like:
b'[WARNING]: Failure using method (v2_runner_item_on_ok) in callback plugin'
b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
b"0x7fbecc97c910>): 'item'"
And stream output would not include the info typically logged.
Address this by checking if ansible_loop_var is in the results_dict and
using that value for the loop var name instead. We still fall back to
'item' as I'm not sure that ansible_loop_var is always present.
Change-Id: I408e6d4af632f8097d63c04cbcb611d843086f6c
If the command didn't exist:
- Popen would throw an exception
- 't' would not be updated (t is None)
- return code would not be written to the console
- zuul_stream would wait unecessary for 10 seconds
As rc is defined in normal case or in both exceptions, it can be written
in each case to the console.
Change-Id: I77a4e1bdc6cd163143eacda06555b62c9195ee38
If a role is applied to a host more than once (via either play
roles or include_roles, but not via an include_role loop), it will
have the same task UUID from ansible which means Zuul's command
plugin will write the streaming output to the same filename, and
the log streaming will request the same file. That means the file
might look this after the second invocation:
2022-05-19 17:06:23.673625 | one
2022-05-19 17:06:23.673781 | [Zuul] Task exit code: 0
2022-05-19 17:06:29.226463 | two
2022-05-19 17:06:29.226605 | [Zuul] Task exit code: 0
But since we stop reading the log after "Task exit code", the user
would see "one" twice, and never see "two".
Here are some potential fixes for this that don't work:
* Accessing the task vars from zuul_stream to store any additional
information: the callback plugins are not given the task vars.
* Setting the log id on the task args in zuul_stream instead of
command: the same Task object is used for each host and therefore
the command module might see the task object after it has been
further modified (in other words, nothing host-specific can be
set on the task object).
* Setting an even more unique uuid than Task._uuid on the Task
object in zuul_stream and using that in the command module instead
of Task._uuid: in some rare cases, the actual task Python object
may be different between the callback and command plugin, yet still
have the same _uuid; therefore the new attribute would be missing.
Instead, a global variable is used in order to transfer data between
zuul_stream and command. This variable holds a counter for each
task+host combination. Most of the time it will be 1, but if we run
the same task on the same host again, it will increment. Since Ansible
will not run more than one task on a host simultaneously, so there is
no race between the counter being incremented in zuul_stream and used
in command.
Because Ansible is re-invoked for each playbook, the memory usage is
not a concern.
There may be a fork between zuul_stream and command, but that's fine
as long as we treat it as read-only in the command plugin. It will
have the data for our current task+host from the most recent zuul_stream
callback invocation.
This change also includes a somewhat unrelated change to the test
infrastructure. Because we were not setting the log stream port on
the executor in tests, we were actually relying on the "real" OpenDev
Zuul starting zuul_console on the test nodes rather than the
zuul_console we set up for each specific Ansible version from the tests.
This corrects that and uses the correct zuul_console port, so that if we
make any changes to zuul_console in the future, we will test the
changed version, not the one from the Zuul which actually runs the
tox-remote job.
Change-Id: Ia656db5f3dade52c8dbd0505b24049fe0fff67a5
The current no_log handling here was added in the following steps
Ic1becaf2f3ab345da22fa62314f1296d76777fec was the original zuul_json
and added the check of the result object
if result._result.get('_ansible_no_log', False):
This happened in July, 2017. However, soon after in October 2017,
Ansible merged [1] which changed things so that the callbacks
(v2_runner_on_ok) get "clean copy" of the TaskResult object; e.g.
if isinstance(arg, TaskResult):
new_args.append(arg.clean_copy())
It can be seen at [2] that this is where results are censored. This
change made it into Ansible 2.5. Ergo this callback will never see
uncensored results here.
The second part of the no_log processing here was added with
I9e8d08f75207b362ca23457c44cc2f38ff43ac23 in March 2018. This was to
work around an issue with loops, where the uncensored values would
"leak" into the results under certain circumstances when using loops,
e.g.
- name: A templated no_log loop
a_task:
a_arg: 'foo'
no_log: '{{ item }}'
loop:
- True
- False
Ansible merged a few changes related to fixing this. [3] was merged
in June 2018 for this exact issue of unreachable hosts. [4] was
merged in August 2018, which makes it so that if any loop item is
"no_log", the whole task is. Both of these changes made it into
Ansible 2.7.
Ansible has merged test-cases for these, but we have also merged
test-cases, so we have double coverage against regression in the
future.
Ergo, I believe we can revert both of these checks. This also means
we don't have to worry about special-casing string results as done in
I02bcd307bcfad8d99dd0db13d979ce7ba3d5e0e4.
[1] 01b6c7c9c6
[2] f7c2b1986c/lib/ansible/executor/task_result.py (L13)
[3] 336b3762b2
[4] bda074d34e
Change-Id: I00ef08869f3a8f08a1affa5e15e3386a1891f11e
We have noted a problem with tasks such as "yum:" or "package:" that
output is not being logged in the job json file.
After investigation, I have found that this is failing with an
exception in the results parsing at
for index, item_result in enumerate(
clean_result.get('results', [])):
if not item_result.get('_ansible_no_log', False):
continue
When we are *not* processing a task loop, "results" is a simple list
of strings about the package results; thus when it walks each result
the "item_result.get('_ansible_no_log', False)" step fails because a
string has no ".get()". This causes the plugin to exit and the
information in the resulting .json to be incomplete, leading to the
tasks being missing from the UI. I believe this was a regression
introduced with I9e8d08f75207b362ca23457c44cc2f38ff43ac23.
When we *are* processing a task loop, "results" is a list where each
entry is a loop-item result. In this case, we are always walking a
list of dictionaries, so the existing ".get()" call works.
Change-Id: I02bcd307bcfad8d99dd0db13d979ce7ba3d5e0e4
This adds support for Ansible 5. As mentioned in the reno, only
the major version is specified; that corresponds to major.minor in
Ansible core, so is approximately equivalent to our current regime.
The command module is updated to be based on the current code in
ansible core 2.12.4 (corresponding to community 5.6.0). The previous
version is un-symlinked and copied to the 2.8 and 2.8 directories
for easy deletion as they age out.
The new command module has corrected a code path we used to test
that the zuul_stream module handles python exceptions in modules,
so instead we now take advantage of the ability to load
playbook-adjacent modules to add a test fixture module that always
raises an exception. The zuul stream functional test validation is
adjusted to match the new values.
Similarly, in test_command in the remote tests, we relied on that
behavior, but there is already a test for module exceptions in
test_module_exception, so that check is simply removed.
Among our Ansible version tests, we occasionally had tests which
exercised 2.8 but not 2.9 because it is the default and is otherwise
tested. This change adds explicit tests for 2.9 even if they are
redundant in order to make future Ansible version updates easier and
more mechanical (we don't need to remember to add 2.9 later when
we change the default).
This is our first version of Ansible where the value of
job.ansible-version could be interpreted as an integer, so the
configloader is updated to handle that possibility transparently,
as it already does for floating point values.
Change-Id: I694b979077d7944b4b365dbd8c72aba3f9807329
We no longer need these directories since there is no deffirence
between which Ansible plugins we load for the trusted vs untrusted
context.
Change-Id: Ibd460d89ebd75a0b58ce715284916e1e1628b518