This change completes the circular dependency refactor.
The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.
In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.
In independent pipelines, non-live items are still used for
linear depnedencies, but multi-change items are used for
circular dependencies.
Previously changes were enqueued recursively and then
bundles were made out of the resulting items. Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueing the first change in a cycle.
Some tests exercise situations where Zuul is processing
events for old patchsets of changes. The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.
This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems delete all pipeline
states during the upgrade. A later change will implement
a helper command for this.
All backwards compatability handling for the last several
model_api versions which were added to prepare for this
upgrade have been removed. In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).
Job deduplication is simplified and now only needs to
consider jobs within the same buildset.
The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueing were not valid. This has been corrected, and
in doing so, has necessitated some changes around quiet dequeing
when we re-enqueue a change.
The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.
Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
This is part of the circular dependency refactor.
It updates BuildRequests to include the job uuid (similar to the
previous change with NodeRequests). We also use the job uuid when
sending BuildResultEvents.
Change-Id: If7a55138d0cb5a358c62e8cd97ee322087b09a6b
This is part of the circular dependency refactor.
This updates the buildset object in memory (and zk) to store builds
indexed by frozen job uuid rather than job name. This also updates
everal related fields and also temporary dictionaries to do the same.
This will allow us, in the future, to have more than one job/build
in a buildset with the same name (for different changes/refs).
Change-Id: I70865ec8d70fb9105633f0d03ba7c7e3e6cd147d
We intend to handle the case where two queue items in check share
a deduplicated build and one of them completes and is dequeued while
the other continues. To handle this, we avoid removing any queue
items (and therefore their associated builds) from ZK if their builds
show up in any currently-enqueued queue item. However, we don't
actually have a mechanism to load a build from ZK in that situation
if it isn't in a queue item that is currently enqueued.
Adding such a mechanism is complex and risky, whereas the circular
dependency refactoring effort currently in progress will address this
issue in a comprehensive manner.
This change addresses the issue by detecting the situation at the
point where we would try to launch the build (since we failed to
restore it from ZK) and instead of raising an exception as we currently
do, we tell the scheduler to retry the build. This results in the
buildset not actually taking advantage of the potential deduplication,
but it does at least provide a working build result to the user in
the form of a brand new build.
Change-Id: I86c159c82b858e67433bdaa1e479471b21ea8b86
If a build is to be deduplicated and has not started yet and has
a pending node request, we store a dictionary describing the target
deduplicated build in the node_requests dictionary on the buildset.
There were a few places where we directly accessed that dictionary
and assumed the results would be the node request id. Notably, this
could cause an error in pipeline processing (as well os potentially
some other edge cases such as reconfiguring).
Most of the time we can just ignore deduplicated node requests since
the "real" buildset will take care of them. This change enriches
the API to help with that. In other places, we add a check for the
type.
To test this, we enable relative_priority in the config file which
is used in the deduplication tests, and we also add an assertion
which runs at the end of every test that ensures there were no
pipeline exceptions during the test (almost all the existing dedup
tests fail this assertion before this change).
Change-Id: Ia0c3f000426011b59542d8e56b43767fccc89a22
This adds the ability to specify that the Zuul executor should
acquire a semaphore before running an individual playbook. This
is useful for long running jobs which need exclusive access to
a resources for only a small amount of time.
Change-Id: I90f5e0f570ef6c4b0986b0143318a78ddc27bbde
This implements the job freezing API directly in zuul-web, so there is
no need to call the scheduler via RPC.
Change-Id: Ibc7737a51fe5428bacdcb4763b3e6155cea29036
This removes the local builds list (executor.builds) from the executor
client.
To make this work, we have to adapt a few tests which are still using
this list. To provide a similar functionality for those tests, the
ZuulTestCase class now provides a method getCurrentBuilds() that looks
up the builds from the active pipelines/queues.
Change-Id: If74bc7164d1156cdb7ba06f12868c60bf1617ea8
The executor client still holds a list of local builds objects which is
used in various places. One use case is to look up necessary
information of the original build when a build result event is handled.
Using such a local list won't work with multiple schedulers in place. As
a first step we will avoid using this list for handling build result
events and instead provide all necessary information to the build result
itself and look up the remaining information from the pipeline directly.
This change also improves the log output when processing build result
events in the scheduler.
Change-Id: I9c4e573de2ce63259ec6cfb7d69c2f5be48f33ef
It's possible for the following sequence to occur (prefixed by
thread ids):
2> process job request cache update
1> finish job
1> set job request state to complete
1> unlock job request
1> delete job request
1> delete job request lock
2> get cached list of running jobs for lostRequests, start examining job
2> check if the job is unlocked (this will re-create the lock dir and return true)
2> attempt to set job request state to complete (this will raise JobRequestNotFound)
2> bail
This leaves a lock node laying around. We have a cleanup process that
will eventually remove it in production, but it's existence can cause
the clean-state checks at the end of unit tests to fail.
To correct this:
a) Try to avoid re-creating the lock (though this is not possible in all cases)
b) If we encounter JobRequestNotFound error in the cleanup, attempt to
delete the job nonetheless (so that we re-delete the lock dir)
The remove method is also made entirely idemptotent to support this.
Change-Id: I49ad5c38a3c6cbaf0962e805b6c228e36b97a3d2
To prepare for queue items moving into ZooKeeper, stop storing the
NodeRequest and NodeSet objects on them. Instead, reference requests
by ID and consult ZK when necessary, and store only the info about
nodesets that the scheduler needs. The result are simple dicts
than can easily be serialized.
The deleteNodeRequest method is updated to accept IDs instead of
NodeRequest objects to minimize the number of times we need to
use a full NodeRequest object.
Change-Id: I3587a42eb5a151f41369385e482b7f36b1c41bf6
To make things simpler for schedulers to handle node provisioned
events for node requests which they may not have in their local
pipeline state, we need to make the pipeline storage of node requests
simpler. That starts by removing the nodeset object as an attribute
of the NodeRequest object. This means that the scheduler can work
with a node request object without relying on having the associated
nodeset. It also simplifies the ZooKeeper code that deserializes
NodeRequests (as it doesn't have to create fake NodeSet objects too).
And finally, it simplifies what must be stored in the pipeline
and queue item structures, which will also come in handy later.
Two tests designed to verify that the request->nodeset magic
deserialization worked have been removed since they are no longer
applicable.
Change-Id: I70ae083765d5cd9a4fd1afc2442bf22d6c52ba0b
Those are not submitted to zookeeper but still need to be removed from
the builds list to prevent a memory leak.
Change-Id: I60d2e5e957aa4b7d877dd5e8ba78977eb3e7d7e6
The Merger and executor APIs have a lot in common, but they behave
slightly differently. A merger needs to sometimes return results.
An executor needs to have separate queues for zones and be able to
pause or cancel jobs.
This refactors them both into a common class which can handle job
state changes (like pause/cancel) and return results if requested.
The MergerApi can subclass this fairly trivially.
The ExecutorApi adds an intermediate layer which uses a
DefaultKeyDict to maintain a distinct queue for every zone and then
transparently dispatches method calls to the queue object for
that zone.
The ZK paths for both are significantly altered in this change.
Change-Id: I3adedcc4ea293e43070ba6ef0fe29e7889a0b502
We don't need to clean up leaked build parameters every time we
clean up lost build requests. The latter happens every 60 seconds
because it's performance critical, but the param cleanup can happen
much less frequently.
Move it into the scheduled tasks that the scheduler runs, but rather
than continuing to add new apsched events, combine this with the
config cache cleanup. We can also add the similar cleanup task which
will be needed by the merger api. These can all run infrequently
(every 60 minutes).
The only downside to this is that we will have one scheduler running
all of these combined tasks, but they are relatively low impact so
it's probably not important to load-balance them.
Change-Id: I76ce9df242ebdaf9f1b7113530775bd702ccaaa0
Now that build request parameters are sharded outside of a transaction,
we might write them but fail to write the actual build request. To
address that, add a periodic checke for leaked parameters which is run
after the search for lost build requests.
Change-Id: I0787d94513572f90ae66616a6619d488c9592a5d
The serialized nodeset is now supplied as a build parameter,
which makes the synthetic hosts and groups parameters which are
derived from it redundant.
Update the executor to rely entirely on the deserialized nodeset.
We also rename the method which creates the parameters since they
are not used for gearman any more.
A subsequent change can remove the hosts and nodes parameters.
Change-Id: Ied7f78c332485e5c66b5721c1007c25660d4238e
Currently, the nodes are locked in the scheduler/pipeline manager before
the actual build is created in the executor client. When the nodes are
locked, the corresponding NodeRequest is also deleted.
With this change, the executor will lock the nodes directly before
starting the build and unlock them when the build is completed.
To keep the order of events intact, the nodepool.acceptNodes() method is
split up into two:
1. nodepool.acceptNodeRequest() does most of the old acceptNodes()
method except for locking the nodes and deleting the node
request. It is called on the scheduler side when the
NodesProvisionedEvent is handled (which is also where
acceptNodes() was previously called).
2. nodepool.acceptNodes() is now called on the executor side when
the job is started. It locks the nodes and deletes the node
request in ZooKeeper.
Finally, it's also necessary to move the autohold processing to the
executor, as this requires a lock on the node. To allow processing of
autoholds, the executor now also determines the build attempts and sets
the RETRY_LIMIT result if necessary.
Change-Id: I7392ce47e84dcfb8079c16e34e0ed2062ebf4136
The scheduler is a better home for the build request cleanup because
the cleanup process should run across all zones, and the executor
api object in the executors are tied to specific zones, so they
have an incomplete cache.
It also means that if all the executors go offline, the schedulers
can still cancel and reset the builds, so it doesn't look like
dead builds are still running.
Since we now have two cleanup actions, instead of making a thread
for each, we use APScheduler to run the cleanup functions on an
interval. We use ZK locks to make sure that they aren't running
on more than one scheduler.
Change-Id: I0245b71f31aae9616d8e65d27c63b25b2c27f815
This is the second part of I5de26afdf6774944b35472e2054b93d12fe21793.
It uses the executor api.
Three tests are disabled until the next change.
Change-Id: Ie08fa9dfb4bb3adb9a02e0a2e8b11309e1ec27cd
If geard loses track of a job, this code is supposed to detect it,
but we've moved classes, so we can't use dunderscores anymore,
we need to change it to a single underscore.
This also corrects the mirror situation where the scheduler is
disconnected from the gearman server; that would eventually fall
through to the lost build handler, but since we get an unambiguous
disconnect event, let's restore the handling of that.
A test case for each of these is added.
Change-Id: I41a926c488717f8bc4cd8a173beaa9458e7cb6e2
This sets the build's start and end time on the executor server rather
than the scheduler. This is necessary to calculate the in-use duration
of locked nodes later on when the nodes are locked on the executor.
Since the scheduler must still set the start and end time on the local
build object, we use the BuildStarted and BuildCompleted to provide the
timestamps to the scheduler. Once the builds are stored in ZooKeeper,
the executor can directly set the timestamps on the build.
In case of noop builds the timestamps will be set by the executor client
as those builds don't show up on any executor server.
Change-Id: I08ac9618b3c12280a8720657f97d0577803e1a59
Use an empty dictionary for the data of the BuildStartedEvent for noop
jobs instead of passing None. Otherwise the dictionary access will
fail when trying to set the worker info.
2021-05-03 13:20:50,287 zuul.Scheduler ERROR Exception in run handler:
Traceback (most recent call last):
File "/home/westphahl/src/opendev/zuul/zuul/zuul/scheduler.py", line 1186, in run
self.process_result_queue()
File "/home/westphahl/src/opendev/zuul/zuul/zuul/scheduler.py", line 1441, in process_result_queue
self._process_result_event(event)
File "/home/westphahl/src/opendev/zuul/zuul/zuul/scheduler.py", line 1467, in _process_result_event
self._doBuildStartedEvent(event)
File "/home/westphahl/src/opendev/zuul/zuul/zuul/scheduler.py", line 1486, in _doBuildStartedEvent
self.executor.setWorkerInfo(build, event.data)
File "/home/westphahl/src/opendev/zuul/zuul/zuul/executor/client.py", line 302, in setWorkerInfo
build.worker.updateFromData(data)
File "/home/westphahl/src/opendev/zuul/zuul/zuul/model.py", line 2020, in updateFromData
self.name = data.get('worker_name', self.name)
AttributeError: 'NoneType' object has no attribute 'get'
Change-Id: I09370e9b69b42cbc2d42f32b3c375c49961a8390
This puts the build result events (BuildStarted, BuildStatus,
BuildPaused and BuildCompleted) into ZooKeeper.
This change moves most of the result event logic into the scheduler
class. Previously, the executor client did a lot before forwarding the
events to the scheduler. Now, the scheduler picks up the events from the
result event queues and does everything by its own before notifying the
pipeline manager.
To make this work, noop builds now get an entry in the local self.builds
list in the executor client. Like before, they are started and completed
immediately (by putting the corresponding events in the result event
queue) without acquiring a compute node.
To properly deal with the pending build results from tenants/pipelines
that were removed during a reconfiguration, we have to put some cleanup
logic into the Scheduler. This is necessary as the result events of
those builds cannot be processed anymore as the tenant/pipeline is no
longer part of the current abide.
The builds itself are still executed via Gearman, but the results aren't
looked up from the Gearman job anymore, but from the result queues in
ZooKeeper.
Change-Id: I067c38cec8ac14f3a3bb6f607ede943c88ce493d
With the introduction of type checking, many of the Optional attributes
must be checked in various places before accessing. Making the buildset
mandatory on the build removes some of those otherwise necessary
NoneType checks and will avoid more in future changes.
Change-Id: I70409551f0f786ad6ea9c4f2cf0ab0f9eb2d1555
Move the param generation for an execution job into a library for reuse.
The param preparation takes care of determining projects and connections
for dependant roles which saves a zuul-runner from needing to understand
canonical names for which it would need to query a scheduler.
Implement a basic API to freeze and grab these job params. These could
then be passed to another zuul-executor or other runner.
Change-Id: I681f2a3384c9a65ae0acc3fce966e8ec47005b64
Co-Authored-By: Tristan Cacqueray <tdecacqu@redhat.com>
We currently set the paused flag on the build object directly when we
receive the notification from the executor and afterwards enqueue a
BuildPaused event to let the pipeline process the pause. As soon a the
build is marked as paused the pipeline processor will request
resources for the child jobs. However during processing of the
BuildPaused event it can be that the paused job requested to skip some
jobs. In this case we have a race between requesting the resources and
setting the child job results to SKIPPED. The result is that if the
resources are requested earlier they will be leaked because they will
never be released by the scheduler as no builds will be enqueued for
them.
This can be solved by deferring to set the paused flag to the event
queue which eliminates the race.
This is similar to the deferred result setting in
I0631b97e6ae2bf7d883c6c6a189e9d41d99ece5d.
Change-Id: Ie0fa8a2613fca65884d54b59273769f4984552cd
If a build must be retried, the previous build information get lost.
To be informed that a build was retried, the retry builds are now part
of the mqtt message.
Change-Id: I8c93376f844c3d1c55c89a250384a7f835763677
Depends-On: https://review.opendev.org/704983
With the event id in the inventory we can use it in the emit-job-header
role. This can aid in debugging, e.g. when a job is still running.
Change-Id: Id2d38bb3d121fa9cf959b9765ac8025d42001c02
Currently the algorithm for canceling is as follows:
1. check url of build, if exists abort and return
2. remove from gearman queue and return if successful
3. sleep a second
4. check url of build again
The url part of the build is used to determine if it has been started
on the executor. However the url is actually set just before the
ansible playbooks are started after all repo fetching and merging has
happened. This most of the time takes more than the one second wait
time in step three. This leaves a large window open where builds
cannot be canceled at all and will run completely.
This can be fixed by reporting the worker_name earlier just after the
job has been accepted and use the worker name instead of the url to
indicate if the job is running.
Change-Id: I37925c7a6207e8d92d47097c9d66200343b3d090
Record the number of attempts zuul has made to run this job in the job
inventory. This will help expose reliability information in job logs and
in job log indexers. We want to try and expose job reliability as much
as possible and this is one way to do that.
Change-Id: I2f7c31ce510f59bc569c5db89ce6626d9e3ef436
When debugging cancel issues the message from cancelJobInQueue easily
gets overlooked without the annotation.
Change-Id: Ic9198c0cdcee2d2cd7771dcc17448328a71dadd6
The logs around build completion and cancelations should be annotated
with the event that triggered the action.
Change-Id: Ifd039d0314a32647a0ec9ebe9bff064e503e0452
This change adds a cleanup-run job attribute to specify a cleanup playbook to
be executed after the job execution.
Change-Id: Iebd93b2d6d13e1d4ccf2878e79e563cc9da2cce9
In some cases like resource constrained environments it is beneficial
to report on changes in a fail fast manner to immediately report if
one job fails. This can be useful especially if a project has many
expensive long-running jobs. This introduces a fail-fast flag in the
project pipeline that let's the project choose the trade off between
full information and quick feedback.
Change-Id: Ie4a5ac8e025362dbaacd3ae82f2e8369f7447a62
Currently the default ansible version is selected by the version of
zuul itself. However we want to make this configurable per deployment
(zuul.conf), tenant and job.
Change-Id: Iccbb124ac7f7a8260c730fbc109ccfc1dec09f8b
Adds support for expressing artifact dependencies between jobs
which may run in different projects.
Change-Id: If8cce8750d296d607841800e4bbf688a24c40e08
Having the change message available via the Zuul vars simplifies cases
where a job e.g. needs to update a GitHub/Jira/... ticket.
Those ticket numbers are usually referenced in the commit/PR message.
This avoids having to deal with secrets etc. to get this information
'out-of-band'.
Change-Id: Ib88db7f724dadfb8a4f86e76692f3e1c2c63a258