Zuul packs refs directly rather than relying on git to do so, because
this greatly speeds up repo resetting. Typically there are two pieces
of information for each packed ref (sha and refname). Git
annotated and signed tags are special because they have the sha of the
tag object proper, the tag refname, and finally the sha of the object
the tag refers to.
Update Zuul's ref packing to handle this extra piece of information for
git tags.
Co-Authored-By: James E. Blair <jim@acmegating.com>
Change-Id: I828ab924a918e3ded2cd64deadf8ad0b4726eb1e
When the dependency graph exceeds the configured size we will raise an
exception. Currently we don't handle those exceptions and let them
bubble up to the pipeline processing loop in the scheduler.
When this happens during trigger event processing, it only aborts the
current pipeline handling run, and the next scheduler run will continue
processing the pipeline as usual.
However, in cases where the item is already enqueued, this exception can
block the pipeline processor and lead to a hanging pipeline:
ERROR zuul.Scheduler: Exception in pipeline processing:
Traceback (most recent call last):
File "/opt/zuul/lib/python3.11/site-packages/zuul/scheduler.py", line 2370, in _process_pipeline
while not self._stopped and pipeline.manager.processQueue():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 1800, in processQueue
item_changed, nnfi = self._processOneItem(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 1624, in _processOneItem
self.getDependencyGraph(item.changes[0], dependency_graph, item.event,
File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 822, in getDependencyGraph
self.getDependencyGraph(needed_change, dependency_graph,
File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 822, in getDependencyGraph
self.getDependencyGraph(needed_change, dependency_graph,
File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 822, in getDependencyGraph
self.getDependencyGraph(needed_change, dependency_graph,
[Previous line repeated 8 more times]
File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 813, in getDependencyGraph
raise Exception("Dependency graph is too large")
Exception: Dependency graph is too large
To fix this, we'll handle the exception and remove the affected item.
We'll also handle the exception during enqueue and ignore the trigger
event in this case.
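A minimal sketch of the fix's shape (names, structure, and the limit are
assumptions, not the actual Zuul code): catch the oversized-graph
exception per item and drop that item instead of letting it escape the
processing loop:

```python
class DependencyGraphTooLarge(Exception):
    pass


def build_graph(change, deps, graph, limit):
    # Recursively expand dependencies, raising once the graph exceeds
    # the configured size (as getDependencyGraph does).
    if len(graph) > limit:
        raise DependencyGraphTooLarge("Dependency graph is too large")
    graph.add(change)
    for dep in deps.get(change, ()):
        if dep not in graph:
            build_graph(dep, deps, graph, limit)


def process_queue(items, deps, limit=10):
    # Instead of letting the exception bubble up and hang the pipeline
    # processor, remove the affected item and keep processing the rest.
    kept = []
    for item in items:
        graph = set()
        try:
            build_graph(item, deps, graph, limit)
        except DependencyGraphTooLarge:
            continue  # dequeue the affected item
        kept.append(item)
    return kept
```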
Change-Id: I210c5fa4c568f2bf03eedc18b3e9c9a022628dc3
This error class is used in some places, but not all. Correct that to
improve the structured data attached to config errors.
Change-Id: Ice4fbee679ff8e7ab05042452bbd4f45ca8f1122
This error exception class went unused, likely due to complications
from circular imports.
To resolve this, move all of the configuration error exceptions
into the exceptions.py file so they can be imported in both
model.py and configloader.py.
Change-Id: I19b0f078f4d215a2e14c2c7ed893ab225d1e1084
When an item updates config we will schedule a merge for the proposed
change and its dependencies.
The merger will return a list of config files for each merged change.
The scheduler upon receiving the merge result will combine the collected
config files for a project-branch from all involved changes.
This led to the problem that the old content of renamed config files
was still used when building the dynamic layout.
Since the config we receive from the merger is always exhaustive, we
just need to keep the latest config files.
Another (or additional) fix would be to only return the latest config
files for a project-branch from the mergers. However, in case of
circular dependencies it could make sense in the future to get the
updated config per change to report errors more precisely.
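A sketch of the intended combining behavior (illustrative names, not the
Zuul API): since each merger result is exhaustive, the latest result for
a project-branch simply replaces any earlier one, so renamed files drop
out instead of lingering with stale content:

```python
def combine_config_files(merge_results):
    """Combine per-change merger results.

    merge_results is an ordered list of mappings
    (project, branch) -> {path: content}, one per merged change.
    """
    combined = {}
    for result in merge_results:
        for key, files in result.items():
            combined[key] = files  # replace, don't merge
    return combined
```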
Change-Id: Iebf49a9efe193788199197bf7846e336d96edf19
This will allow users to write post-run playbooks that skip
running certain tasks on unreachable hosts.
Change-Id: I04106ad0222bcd8073ed6655a8e4ed77f43881f8
Ansible 6 is EOL and Ansible 9 is available. Remove 6 and add 9.
This is usually done in two changes, but this time it's in one
since we can just rotate the 6 around to make it a 9.
command.py has been updated for ansible 9.
Change-Id: I537667f66ba321d057b6637aa4885e48c8b96f04
This change completes the circular dependency refactor.
The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.
In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.
In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.
Previously changes were enqueued recursively and then
bundles were made out of the resulting items. Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.
Some tests exercise situations where Zuul is processing
events for old patchsets of changes. The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.
This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems to delete all pipeline
states during the upgrade. A later change will implement
a helper command for this.
All backwards compatibility handling for the last several
model_api versions which was added to prepare for this
upgrade has been removed. In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).
Job deduplication is simplified and now only needs to
consider jobs within the same buildset.
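The indexing change can be sketched as follows (an illustrative model,
not the actual Zuul data structures):

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FrozenJob:
    name: str
    change: str
    uuid: str = field(default_factory=lambda: uuid.uuid4().hex)


def index_jobs(frozen_jobs):
    # A name alone is ambiguous: with circular dependencies, one queue
    # item can run "tox" once per change. Either the uuid or the
    # (name, change) tuple uniquely identifies a job in a buildset.
    by_uuid = {fj.uuid: fj for fj in frozen_jobs}
    by_name_change = {(fj.name, fj.change): fj for fj in frozen_jobs}
    return by_uuid, by_name_change
```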
The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid. This has been corrected, and
in doing so, has necessitated some changes around quiet dequeuing
when we re-enqueue a change.
The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.
Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
With job parents that supply data we might end up updating the (secret)
parent data and artifacts of a job multiple times, in addition to
storing duplicate data, as most of this information is part of the
parent's build result.
Instead we will collect the parent data and artifacts before scheduling
a build request and send it as part of the request parameters.
If those parameters are part of the build request the executor will use
them, otherwise it falls back on using the data from the job for
backward compatibility.
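The executor-side fallback might look roughly like this (names and
shapes are assumptions for illustration):

```python
def resolve_parent_data(request_params, job):
    """Pick parent data/artifacts for a build.

    Prefer the data collected by the scheduler and sent in the build
    request parameters; fall back to the data stored on the job for
    requests from older components.
    """
    if "parent_data" in request_params:
        return (request_params["parent_data"],
                request_params.get("artifacts", []))
    return job.get("parent_data"), job.get("artifacts", [])
```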
This change affects the behavior of job deduplication in that input data
from parent jobs is no longer considered when deciding if a job can be
deduplicated or not.
Change-Id: Ic4a85a57983d38f033cf63947a3b276c1ecc70dc
This refactors the config error handling based on nested
context managers that add increasing amounts of information
about error locations. In other words, when we start processing
a Job object, we will:
  with ParseContext.errorContext(info about job):
      do some stuff
      with ParseContext.errorContext(info about job attr):
          do some stuff regarding the job attribute
      with ParseContext.errorContext(next job attr):
          do stuff with a different attr
We store a stack of error contexts on the parse context, and at any
point we can access the accumulator for the most recent one with
ParseContext.accumulator in order to add a warning or error. If we
have an attribute line number, we'll use it, otherwise we'll just
use the object-level information.
We also collapse the exception handlers into a single context
manager which catches exceptions and adds them to the accumulator.
This lets us decide when to catch an exception and skip to the next
phase of processing separately from where we narrow our focus to
a new object or attribute. These two actions often happen together,
but not always.
This should serve to simplify the configloader code and make it
easier to have consistent error handling within.
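A minimal sketch of the idea (not Zuul's actual API): a stack of error
contexts plus a separate catching context manager, so narrowing focus
and catching exceptions can happen at different levels:

```python
import contextlib


class ParseContext:
    def __init__(self):
        self._stack = []
        self.errors = []

    @contextlib.contextmanager
    def errorContext(self, info):
        # Narrow focus: push a description of where we are.
        self._stack.append(info)
        try:
            yield
        finally:
            self._stack.pop()

    @property
    def accumulator(self):
        # The most recent context; errors/warnings attach here.
        return self._stack[-1] if self._stack else None

    @contextlib.contextmanager
    def catchErrors(self):
        # Record the exception against the current context and
        # continue with the next phase of processing.
        try:
            yield
        except Exception as e:
            self.errors.append((self.accumulator, str(e)))
```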
Change-Id: I180f9b271acd4b62039003fa8b38900f9863bad8
Currently we schedule a merge/repo-state for every item that is added to
a pipeline. For changes and tags we need the initial merge in order to
build a dynamic layout or to determine if a given job variant on a
branch should match for a tag.
For other change-types (branches/refs) we don't need the initial
merge/repo-state before we can freeze the job graph. The overhead of
those operations can become quite substantial for projects with a lot of
branches that also have a periodic pipeline config, but only want to
execute jobs for a small subset of those branches.
With this change, branch/ref changes that don't execute any jobs will
be removed without triggering any merge/repo state requests.
In addition we will reduce the number of merge requests for branch/ref
changes as the initial merge is skipped in all cases.
Change-Id: I157ed52dba8f4e197b35798217b23ec7f035b2d9
The always-dynamic-branches option specifies a regex such that
branches that match it are ignored for Zuul configuration purposes,
unless a change is proposed, at which point the zuul.yaml config
is read from the branch in the same way as if a change was made
to the file.
Because creating and deleting dynamic branches do not cause
reconfigurations, the list of project branches stored on a tenant
may not be updated after a dynamic branch is created. This list
is used to decide from what branches to try to load config files.
Together, all of this means that if you create an always-dynamic-branch
and propose a change to it shortly afterwards, Zuul is likely to
ignore the change since it won't know to load configuration from
its branch.
To correct this, we extend the list of branches from which Zuul
knows to read configuration with the branch of the item under test
and any items ahead of it in the queue (but only if these branches
match the dynamic config regex so that we don't include an excluded
branch).
Also add a log entry to indicate when we are loading dynamic
configuration from a file.
Change-Id: Ibd15ce4a154311cdb523c5603f4ad17f761d1078
To protect Zuul servers from accidental DoS attacks in case someone,
say, uploads a 1k change tree to gerrit, add an option to limit the
dependency processing in the Gerrit driver and in Zuul itself (since
those are the two places where we recursively process deps).
Change-Id: I568bd80bbc75284a8e63c2e414c5ac940fc1429a
If a configuration error existed for a project on one branch
and a change was proposed to update the config on another branch,
that would activate a code path in the manager which attempts to
determine whether errors are relevant. An error (or warning) is
relevant if it is not in a parent change, and is on the same
project+branch as the current patch. This is pretty generous.
This means that if a patch touches Zuul configuration with a
warning, all warnings on that branch must be updated. This was
not the intended behavior.
To correct that, we no longer consider warnings in any of the
places where we check that a queue item is failing due to
configuration errors.
An existing test is updated to include sufficient setup to trigger
the case where a new valid configuration is added to a project
with existing errors and warnings.
A new test case is added to show that we can add new deprecations
as well, and that they are reported to users as warnings.
Change-Id: Id901a540fce7be6fedae668390418aca06a950af
Prior to this change we checked whether there were any errors in the
config (which includes warnings by default) and returned a build error
if there were. Now we only return an error when proper errors (not
just warnings) are present.
This allows users to push config updates that don't fix all warnings
immediately. Without this any project with warnings present would need
to fix all warnings before newly proposed configs can take effect. This
is particularly problematic for speculative testing, but in general it
seems like warnings shouldn't be fatal.
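A sketch of the check (the field name is an assumption, not the real
data structure):

```python
def has_blocking_errors(config_errors):
    # Fail a buildset only on real errors; deprecation warnings should
    # not prevent newly proposed configs from taking effect.
    return any(e["severity"] == "error" for e in config_errors)
```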
Change-Id: I31b094fb366328696708b019354b843c4b94ffc0
This allows users to set a maximum value for the active window
in the event they have a project that has long stretches of
passing tests but they still don't want to commit too many resources
in case of a failure.
We should all be so lucky.
Change-Id: I52b5f3a9e7262b88fb16afc4520b35854e8df184
This adds a configuration warning (viewable in the web UI) for any
regular expressions found in in-repo configuration that can not
be compiled and executed with RE2.
Change-Id: I092b47e9b43e9548cafdcb65d5d21712fc6cc3af
The re2 library does not support negative lookahead expressions.
Expressions such as "(?!stable/)", "(?!master)", and "(?!refs/)" are
very useful branch specifiers with likely many instances in the wild.
We need to provide a migration path for these.
This updates the configuration options which currently accepts Python
regular expressions to additionally accept a nested dictionary which
allows specifying that the regex should be negated. In the future,
other options (global, multiline, etc) could be added.
A very few options are currently already compiled with re2. These are
left alone for now, but once the transition to re2 is complete, they
can be upgraded to use this syntax as well.
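The extended syntax could be handled roughly like this (a sketch using
the stdlib re module for illustration; Zuul compiles with RE2):

```python
import re


def compile_matcher(conf):
    """Build a matcher from either syntax.

    conf is a plain pattern string, or a dict such as
    {"regex": "^stable/", "negate": True} replacing negative
    lookaheads like "(?!stable/)" that RE2 cannot compile.
    """
    if isinstance(conf, dict):
        pattern, negate = conf["regex"], conf.get("negate", False)
    else:
        pattern, negate = conf, False
    compiled = re.compile(pattern)

    def matches(value):
        found = bool(compiled.match(value))
        return not found if negate else found

    return matches
```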
Change-Id: I509c9821993e1886cef1708ddee6d62d1a160bb0
This allows users to trigger the new early failure detection by
matching regexes in the streaming job output.
For example, if a unit test job outputs something sufficiently
unique on failure, one could write a regex that matches that and
triggers the early failure detection before the playbook completes.
For hour-long unit test jobs, this could save a considerable amount
of time.
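The detection could be sketched like this (illustrative, not the real
executor API): compile the user-supplied patterns once and report the
first matching line of streamed output:

```python
import re


def scan_for_failure(stream_lines, failure_patterns):
    # Returns (line number, line) for the first match, so a gate reset
    # can start before the playbook completes; None if nothing matched.
    compiled = [re.compile(p) for p in failure_patterns]
    for lineno, line in enumerate(stream_lines, 1):
        for pattern in compiled:
            if pattern.search(line):
                return lineno, line
    return None
```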
Note that this adds the google-re2 library to the Ansible venvs. It
has manylinux wheels available, so is easy to install with
zuul-manage-ansible. In Zuul itself, we use the fb-re2 library which
requires compilation and is therefore more difficult to use with
zuul-manage-ansible. Presumably using fb-re2 to validate the syntax
and then later actually using google-re2 to run the regexes is
sufficient. We may want to switch Zuul to use google-re2 later for
consistency.
Change-Id: Ifc9454767385de4c96e6da6d6f41bcb936aa24cd
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.
This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.
The new behavior is configurable and off by default to retain backward
compatibility.
Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
In case of a retry there might be no logs available to help the user
understand the reason for a failure. To improve this we can record the
details of the failure as part of the build result.
Change-Id: Ib9fdbdec5d783a347d1b6e5ce6510d50acfe1286
Warnings are also added to the list of loading errors. For the tenant
validation we need to distinguish between errors and deprecation
warnings and only fail the validation when there are errors.
Alternatively we could also introduce a flag for the tenant validation
to treat deprecation warnings as errors.
Change-Id: I9c8957520c37157a295627848d30e52a36c8da0a
Co-Authored-By: James E. Blair <jim@acmegating.com>
This change adds a variable to post and cleanup playbooks in order to
determine if a job will be retried due to a failure in one of the
earlier playbooks.
This variable might be useful for only performing certain actions (e.g.
interacting with a remote system) when the job result is final and there
won't be any further attempts.
Change-Id: If7f4488d4a59b1544795401bdc243978fea9ca86
We can have the Ansible callback plugin tell the executor to tell
the scheduler that a task has failed and therefore the job will
fail. This will allow the scheduler to begin a gate reset before
the failing job has finished and potentially save much developer
and CPU time.
We take some extra precautions to try to avoid sending a pre-fail
notification where we think we might end up retrying the job
(either due to a failure in a pre-run playbook, or an unreachable
host). If that does happen then a gate pipeline might end up
flapping between two different NNFI configurations (ie, it may
perform unnecessary gate resets behind the change with the retrying
job), but should still not produce an incorrect result. Presumably
the detections here should catch that case sufficiently early, but
due to the nature of these errors, we may need to observe it in
production to be sure.
Change-Id: Ic40b8826f2d54e45fb6c4d2478761a89ef4549e4
A truly empty zuul.yaml file does not cause an error, but one with
only comments, which is semantically equivalent, does. Handle that
case.
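A sketch of the guard, assuming PyYAML: yaml.safe_load returns None for
a comments-only document, which we normalize like a truly empty file:

```python
import yaml


def load_config(text):
    # A file containing only comments parses to None, not a dict/list;
    # treat it the same as an empty file instead of raising an error.
    data = yaml.safe_load(text)
    return data if data is not None else []
```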
Change-Id: If1fde821f24fa1ee175e006016753c6a1b42f837
Options such as always-dynamic-branches or exclude-unprotected-branches
can cause Zuul to toggle between using or not using implied branch matchers
as projects move between having one or more than one branch.
Let operators avoid that by specifying the intended behavior if necessary.
Change-Id: Ib8efa224fc396220ae85896845be4a908ac1008d
In case of a bundle, Zuul should load extra-config-paths not only from
items ahead, but from all items in that bundle. Otherwise it might
throw an "invalid config" error, because the required items in
extra-config-paths are not found.
Change-Id: I5c14bcb14b7f5c627fd9bd49f887dcd55803c6a1
PyYAML is efficient with YAML anchors and will only construct one
Python object for a YAML mapping that is used multiple times via
anchors.
We copy job variables (a mapping) straight from the YAML to an
attribute of the Job object, then we freeze the Job object which
converts all of its dicts to mappingproxy objects. This mutates
the contents of the vars dict, and if that is used on another
job via an anchor, we will end up with mappingproxy objects in
what we think is a "raw yaml" configuration dict, and we will not
be able to serialize it in case of errors.
To address this, perform a deep copy of every configuration yaml
blob before parsing it into an object.
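The aliasing and the fix can be demonstrated directly with PyYAML:

```python
import copy

import yaml

# PyYAML constructs ONE Python dict for an anchored mapping, so two
# jobs referencing the same anchor share that object.
doc = """
common: &common_vars
  foo: bar
job1:
  vars: *common_vars
job2:
  vars: *common_vars
"""
data = yaml.safe_load(doc)
assert data["job1"]["vars"] is data["job2"]["vars"]

# Freezing job1 in place (e.g. converting its dicts to mappingproxy
# objects) would therefore corrupt job2's "raw yaml" too. Deep-copying
# the blob before parsing it into an object breaks the sharing:
job1_conf = copy.deepcopy(data["job1"])
assert job1_conf["vars"] is not data["job2"]["vars"]
assert job1_conf["vars"] == {"foo": "bar"}
```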
Change-Id: Idaa6ff78b1ac5a108fb9f43700cf8e66192c43ce
Inventory testing was opening yaml files to parse them and not
explicitly closing them when done. Fix this through the use of with
open() context managers.
Change-Id: I41a8ee607fcf13e86dd800cefb00d7e120265ed4
We've investigated an issue where a job was stuck on the executor
because it wasn't aborted properly. The job was cancelled by the
scheduler, but the cleanup playbook on the executor ran into a timeout.
This caused another abort via the WatchDog.
The problem is that the abort function doesn't do anything if the
cleanup playbook is running [1]. Most probably this covers the case
that we don't want to abort the cleanup playbook after a normal job
cancellation.
However, this doesn't differentiate whether the abort was caused by the run
of the cleanup playbook itself, resulting in a build that's hanging
indefinitely on the executor.
To fix this, we now differentiate whether the abort was caused by a
stop() call [2] or by a timeout. In case of a timeout, we kill
the running process.
Add a test case to validate the changed behaviour. Without the fix, the
test case runs indefinitely because the cleanup playbook won't be
aborted even after the test times out (during the test cleanup).
[1]: 4d555ca675/zuul/executor/server.py (L2688)
[2]: 4d555ca675/zuul/executor/server.py (L1064)
Change-Id: I979f55b52da3b7a237ac826dfa8f3007e8679932
We may be able to speed up pipeline refreshes in cases where there
are large numbers of items or jobs/builds by parallelizing ZK reads.
Quick refresher: the ZK protocol is async, and kazoo uses a queue to
send operations to a single thread which manages IO. We typically
call synchronous kazoo client methods which wait for the async result
before returning. Since this is all thread-safe, we can attempt to
fill the kazoo pipe by having multiple threads call the synchronous
kazoo methods. If kazoo is waiting on IO for an earlier call, it
will be able to start a later request simultaneously.
Quick aside: it would be difficult for us to use the async methods
directly since our overall code structure is still ordered and
effectively single threaded (we need to load a QueueItem before we
can load the BuildSet and the Builds, etc).
Thus it makes the most sense for us to retain our ordering by using
a ThreadPoolExecutor to run some operations in parallel.
This change parallelizes loading QueueItems within a ChangeQueue,
and also Builds/Jobs within a BuildSet. These are the points in
a pipeline refresh tree which potentially have the largest number
of children and could benefit the most from the change, especially
if the ZK server has some measurable latency.
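The pattern can be sketched with a ThreadPoolExecutor over the
synchronous read call (illustrative names; zk_get stands in for a kazoo
client method such as get):

```python
from concurrent.futures import ThreadPoolExecutor


def read_children(zk_get, paths, max_workers=8):
    # Each worker thread blocks on its own synchronous call, which
    # keeps the single kazoo IO thread's request pipe full, while the
    # caller still receives results in the order of the input paths.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zk_get, paths))
```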
Change-Id: I0871cc05a2d13e4ddc4ac284bd67e5e3003200ad
This adds the ability to specify that the Zuul executor should
acquire a semaphore before running an individual playbook. This
is useful for long running jobs which need exclusive access to
a resources for only a small amount of time.
Change-Id: I90f5e0f570ef6c4b0986b0143318a78ddc27bbde
We currently only detect some errors with job parents when freezing
the job graph. This is due to the vagaries of job variants, where
it is possible for a variant on one branch to be okay while one on
another branch is an error. If the erroneous job doesn't match,
then there is no harm.
However, in the typical case where there is only one variant or
multiple variants are identical, it is possible for us to detect
during config loading a situation where we know the job graph
generation will later fail. This change adds that analysis and
raises errors early.
This can save users quite a bit of time, and since variants are
typically added one at a time, may even prevent users from getting
into ambiguous situations which could only be detected when freezing
the job graph.
Change-Id: Ie8b9ee7758c94788ee7bc05947ddd97d9fa8e075
When specifying job.override-checkout, we apply job variants
that match the specified branch. The mechanism we use to do that
is to create a synthetic Ref object to pass to the branch matcher
instead of the real branch of the Change (since the real branch
is likely different -- that's why override-checkout was specified).
However, branch matching behavior has gotten slightly more
sophisticated and Ref objects don't match in the same way that
Change objects do.
In particular, implied branch matchers match Ref subclasses that
have a branch attribute iff the match is exact.
This means that if a user constructed two branches:
* testbranch
* testbranch2
and used override-checkout to favor a job definition from testbranch2,
the implied branch matcher for the variant in testbranch would match
since the matcher behaved as if it were matching a Ref not a Change
or Branch.
To correct this, we update the simulated change object used in the
override-checkout variant matching routine to be a Branch (which
unsurprisingly has a branch attribute) instead of a Ref.
The test test_implied_branch_matcher_similar_override_checkout is added
to cover this test case. Additionally, the test
test_implied_branch_matcher_similar is added for good measure (it tests
implied branch matchers in the same way but without specifying
override-checkout), though its behavior is already correct.
A release note is included since this may have an effect on job behavior.
Change-Id: I1104eaf02f752e8a73e9b04939f03a4888763b27