Commit Graph

831 Commits

Author SHA1 Message Date
James E. Blair 1350ce8ad6 Use NodesetNotFoundError class
This exception class went unused, likely due to complications
from circular imports.

To resolve this, move all of the configuration error exceptions
into the exceptions.py file so they can be imported in both
model.py and configloader.py.

Change-Id: I19b0f078f4d215a2e14c2c7ed893ab225d1e1084
2024-03-18 15:03:58 -07:00
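
A minimal sketch of the restructuring described above (the module
layout and base class here are illustrative): the configuration error
classes live in exceptions.py so that model.py and configloader.py can
both import them without importing each other.

    # exceptions.py -- single home for configuration error classes.
    class ConfigurationError(Exception):
        """Base class for configuration errors."""


    class NodesetNotFoundError(ConfigurationError):
        def __init__(self, name):
            # Carry the missing nodeset name in the message.
            super().__init__(f"Nodeset {name!r} not found")


    # model.py and configloader.py can now both do:
    #     from zuul.exceptions import NodesetNotFoundError
    # without importing each other, which is what previously left this
    # class unused.
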
Zuul 617bbb229c Merge "Fix validate-tenants isolation" 2024-02-28 02:46:55 +00:00
James E. Blair fa274fcf25 Remove most model_api backwards-compat handling
Since we require deleting the state (except config cache) for the
circular dependency upgrade, we can now remove any zookeeper
backwards compatibility handling not in the config cache.  This
change does so.  It leaves upgrade handling for two attributes
in the branch cache (even though they are old) because it is
plausible that even systems that have been upgrading regularly
may have non-upgraded data in the branch cache for infrequently
used branches, and the cost for retaining the upgrade handling
is small.

Change-Id: I881578159b06c8c7b38a9fa0980aa0626dddad31
2024-02-09 07:39:55 -08:00
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueueing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions which were added to prepare for this
upgrade have been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueueing were not valid.  This has been corrected, and
in doing so, has necessitated some changes around quiet dequeueing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
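
A hedged sketch of the new indexing model: buildset data structures
are keyed by the frozen job's uuid so that two jobs with the same name
(for different changes in a cycle) can coexist.  Class and attribute
names here are illustrative, not Zuul's actual model.

    import uuid
    from dataclasses import dataclass, field


    @dataclass
    class FrozenJob:
        name: str
        change: str     # identifier of the change/ref this job runs for
        uuid: str = field(default_factory=lambda: uuid.uuid4().hex)


    @dataclass
    class BuildSet:
        # Indexed by frozen job uuid, not job name, so jobs with the
        # same name for different changes in a cycle can coexist.
        jobs: dict = field(default_factory=dict)

        def addJob(self, job):
            self.jobs[job.uuid] = job

        def getJob(self, job_uuid=None, name=None, change=None):
            # Either the uuid or the (job name, change) tuple identifies
            # a job within a buildset.
            if job_uuid is not None:
                return self.jobs[job_uuid]
            return next(j for j in self.jobs.values()
                        if j.name == name and j.change == change)
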
James E. Blair fb7d24b245 Fix validate-tenants isolation
The validate-tenants scheduler subcommand is supposed to perform
complete tenant validation, and in doing so, it interacts with zk.
It is supposed to isolate itself from the production data, but
it appears to accidentally use the same unparsed config cache
as the production system.  This is mostly okay, but if the loading
paths are different, it could lead to writing cache errors into
the production file cache.

The error is caused because the ConfigLoader creates an internal
reference to the unparsed config cache and therefore ignores the
temporary/isolated unparsed config cache created by the scheduler.

To correct this, we will always pass the unparsed config cache
into the configloader.

Change-Id: I40bdbef4b767e19e99f58cbb3aa690bcb840fcd7
2024-01-31 14:58:45 -08:00
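
A rough sketch of the isolation fix, using an illustrative ConfigLoader
interface rather than Zuul's actual signatures: the caller always
passes the unparsed config cache in, so validate-tenants can supply a
temporary, isolated cache and never touches the production one.

    class ConfigLoader:
        def __init__(self):
            # No internal reference to a shared unparsed config cache is
            # kept here any more.
            pass

        def loadTenantConfig(self, tenant_name, unparsed_config_cache):
            # Always use the cache the caller passed in; for the
            # validate-tenants subcommand this is a temporary, isolated
            # cache, so nothing is written back to the production cache.
            return unparsed_config_cache.get(tenant_name, {})


    loader = ConfigLoader()
    production_cache = {"example": {"zuul.yaml": "..."}}
    isolated_cache = {}        # what validate-tenants would supply
    loader.loadTenantConfig("example", production_cache)
    loader.loadTenantConfig("example", isolated_cache)
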
James E. Blair 7eb3fbeb97
Use a DummyFrozenJob in build cleanup
There is a place where we try to clean up builds that we may no
longer have FrozenJobs for; but the cleanup methods expect FrozenJob
objects essentially just to get the job name.

To make it more clear that's what's happening, use a dedicated object
to pass that information.

Change-Id: Ieb85bcba6d9a184e28120953adf882e4720621b2
2024-01-15 07:39:51 +01:00
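
A minimal sketch of the dedicated placeholder: a DummyFrozenJob
carrying only the job name is the idea; the surrounding helper here is
purely illustrative.

    from dataclasses import dataclass


    @dataclass
    class DummyFrozenJob:
        # Stand-in for a FrozenJob we no longer have; the cleanup
        # methods essentially only need the job name, and a distinct
        # type makes that intent explicit.
        name: str


    def cleanupOrphanedBuild(build_name, cleanup_method):
        # cleanup_method expects something FrozenJob-shaped.
        cleanup_method(DummyFrozenJob(name=build_name))


    cleanupOrphanedBuild("tox-py311",
                         lambda job: print("cleaning up", job.name))
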
James E. Blair 663931cba3 Include job_uuid in BuildRequests/Events
This is part of the circular dependency refactor.

It updates BuildRequests to include the job uuid (similar to the
previous change with NodeRequests).  We also use the job uuid when
sending BuildResultEvents.

Change-Id: If7a55138d0cb5a358c62e8cd97ee322087b09a6b
2024-01-08 08:54:56 -08:00
James E. Blair 7262ef7f6f Include job_uuid in NodeRequests
This is part of the circular dependency refactor.  It updates the
NodeRequest object to include the job_uuid in addition to the job_name
(which is temporarily kept for backwards compatibility).  When node
requests are completed, we now look up the job by uuid if supplied.

Change-Id: I57d4ab6c241b03f76f80346b5567600e1692947a
2023-12-20 10:44:04 -08:00
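
A hedged sketch of the backward-compatible lookup; the attribute and
dictionary names are assumptions for illustration.

    from types import SimpleNamespace


    def lookupJobForNodeRequest(buildset, request):
        # Prefer the uuid if the request carries one (new-style
        # request); fall back to the job name for requests written
        # before the upgrade, which only have job_name.
        job_uuid = getattr(request, "job_uuid", None)
        if job_uuid is not None:
            return buildset.jobs_by_uuid[job_uuid]
        return buildset.jobs_by_name[request.job_name]


    buildset = SimpleNamespace(jobs_by_uuid={"abc123": "frozen-job"},
                               jobs_by_name={"tox-py311": "frozen-job"})
    new_req = SimpleNamespace(job_uuid="abc123", job_name="tox-py311")
    old_req = SimpleNamespace(job_name="tox-py311")
    assert lookupJobForNodeRequest(buildset, new_req) == "frozen-job"
    assert lookupJobForNodeRequest(buildset, old_req) == "frozen-job"
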
James E. Blair 9201f9ee28 Store builds on buildset by uuid
This is part of the circular dependency refactor.

This updates the buildset object in memory (and zk) to store builds
indexed by frozen job uuid rather than job name.  This also updates
several related fields and also temporary dictionaries to do the same.

This will allow us, in the future, to have more than one job/build
in a buildset with the same name (for different changes/refs).

Change-Id: I70865ec8d70fb9105633f0d03ba7c7e3e6cd147d
2023-12-12 11:58:21 -08:00
James E. Blair 1a226acbd0 Emit stats for more build results
We currently typically only emit build stats for success and failure,
but do not emit stats for NODE_FAILURE, CANCELED, SKIPPED, etc.

To round out the feature so that all build results are emitted, this
includes a small refactor to report build stats at the same time we
report build results to the database (in almost all cases).  One
exception to that is when we receive a non-current build result --
that generally happens after a build is canceled, so we don't need
to emit the stats again.

Change-Id: I3bdf4e2643e151e2eae9f204f62cdc106df876b4
2023-09-26 11:32:03 -07:00
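
A rough sketch of the refactor, with stand-ins for the statsd client
and the SQL reporter: results and stats are reported from one place,
and stats are skipped for non-current results.

    import collections

    stats = collections.Counter()     # stand-in for a statsd client
    database = []                     # stand-in for the SQL reporter


    def reportBuildEnd(build_name, result, current=True):
        # Always record the result in the database.
        database.append((build_name, result))
        if not current:
            # A non-current result generally arrives after the build
            # was canceled; its stats were already emitted, so skip.
            return
        # Emit a stat for every result, not just SUCCESS/FAILURE --
        # NODE_FAILURE, CANCELED, SKIPPED, etc. are counted too.
        stats[f"builds.{result.lower()}"] += 1


    reportBuildEnd("tox-py311", "NODE_FAILURE")
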
Zuul e660979a8c Merge "Use tenant-level layout locks" 2023-08-25 14:44:38 +00:00
James E. Blair eb803984a0 Use tenant-level layout locks
The current "layout_lock" in the scheduler is really an "abide" lock.
We lock it every time we change something in the abide (including
tenant layouts).  The name is inherited from pre-multi-tenant Zuul.

This can cause some less-than-optimal behavior when we need to wait to
acquire the "layout_lock" for a tenant reconfiguration event in one
thread while another thread holds the same lock because it is
reloading the configuration for a different tenant.  Ideally we should
be able to have finer-grained tenant-level locking instead, allowing
for less time waiting to reconfigure.

The following sections describe the layout lock use prior to this
commit and how this commit adjusts the code to make it safe for
finer-grained locking.

1) Tenant iteration

The layout lock is used in some places (notably some cleanup methods)
to avoid having the tenant list change during the method.  However,
the configloader already performs an atomic replacement of the tenant
list making it safe for iteration.  This change adds a lock around
updates to the tenant list to prevent corruption if two threads update
it at the same time.

The semaphore cleanup method indirectly references the abide and
layout for use in global and local semaphores.  This is just for path
construction, and the semaphores exist apart from the abide and layout
configurations and so should not be affected by either changing while
the cleanup method is running.

The node request cleanup method could end up running with outdated
layout objects, including pipelines; however, it should not be a
problem if these orphaned objects end up refreshing data from ZK right
before they are removed.

In these cases, we can simply remove the layout lock.

2) Protecting the unparsed project branch cache

The config cache cleanup method uses the unparsed project branch cache
(that is, the in-memory cache of the contents of zuul config files) to
determine what the active projects are.

Within the configloader, the cache is updated and then also used while
loading tenant configuration.  The layout lock would have made sure
all of these actions were mutually exclusive.  In order to remove the
layout lock here, we need to make sure the Abide's
unparsed_project_branch_cache is safe for concurrent updates.

The unparsed_project_branch_cache attribute is a dictionary that
contains references to UnparsedBranchCache objects.  Previously, the
configloader would delete each UnparsedBranchCache object from the
dictionary, reinitialize it, then incrementally add to it.

This current process has a benign flaw.  The branch cache is cleared,
and then loaded with data based on the tenant project config (TPC)
currently being processed.  Because the cache is loaded based on data
from the TPC, it is really only valid for one tenant at a time despite
our intention that it be valid for the entire abide.  However, since
we do check whether it is valid for a given TPC, and then clear and
reload it if it is not, there is no error in data, merely an
incomplete utilization of the cache.

In order to make the cache safe for use by different tenants at the
same time, we address this problem (and in doing so make the cache
*effective* for different tenants, even at different times).
The cache is updated to store the ltime for each entry in the cache,
and also to store null entries (with ltimes) for files and paths that
have been checked but are not present in the project-cache.  This
means that at any given time we can determine whether the cache is
valid for a given TPC, and support multiple TPCs (i.e., multiple
tenants).

It's okay for the cache to be updated simultaneously by two tenants
since we don't allow the cache contents to go backwards in ltime.  The
cache will either have the data with at least the ltime required, or
if not, that particular tenant load will spawn cat jobs and update it.

3) Protecting Tenant Project Configs (TPCs)

The loadTPC method on the ConfigLoader would similarly clear the TPCs
for a tenant, then add them back.  This could be problematic for any
other thread which might be referencing or iterating over TPCs.  To
correct this, we take a similar approach of atomic replacement.

Because there are two flavors of TPCs (config and untrusted) and they
are stored in two separate dictionaries, in order to atomically update
a complete tenant at once, the storage hierarchy is restructured as
"tenant -> {config/untrusted} -> project" rather than
"{config/untrusted} -> tenant -> project".  A new class named
TenantTPCRegistry holds both flavors of TPCs for a given tenant, and
it is this object that is atomically replaced.

Now that these issues are dealt with, we can implement a tenant-level
thread lock that is used simply to ensure that two threads don't
update the configuration for the same tenant at the same time.

The scheduler's unparsed abide is updated in two places: upon full
reconfiguration, or when another scheduler has performed a full
reconfiguration and updated the copy in ZK.  To prevent these two
methods from performing the same update simultaneously, we add an
"unparsed_abide_lock" and mutually exclude them.

Change-Id: Ifba261b206db85611c16bab6157f8d1f4349535d
2023-08-24 17:32:25 -07:00
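
A minimal sketch of the finer-grained locking, assuming a plain
threading.Lock per tenant plus a separate lock guarding atomic
replacement of the tenant list; this is illustrative rather than
Zuul's exact implementation.

    import threading
    from collections import defaultdict


    class Scheduler:
        def __init__(self):
            # One lock per tenant, so reconfiguring tenant A never
            # waits on a thread that is reloading tenant B.
            self.layout_locks = defaultdict(threading.Lock)
            # Guards atomic replacement of the tenant list itself.
            self.tenant_list_lock = threading.Lock()
            self.tenants = {}

        def reconfigureTenant(self, name, build_layout):
            with self.layout_locks[name]:
                new_tenant = build_layout(name)
                with self.tenant_list_lock:
                    # Atomic replacement: iterating threads see either
                    # the old or the new mapping, never a partial one.
                    self.tenants = {**self.tenants, name: new_tenant}
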
James E. Blair 57a9c13197 Use the GitHub default branch as the default branch
This supplies a per-project default value for Zuul's default-branch
based on what the default branch is set to in GitHub.  This means
that if users omit the default-branch setting on a Zuul project
stanza, Zuul will automatically use the correct value.

If the value in GitHub is changed, an event is emitted which allows
us to automatically reconfigure the tenant.

This could be expanded to other drivers that support an indication
of which branch is default.

Change-Id: I660376ecb3f382785d3bf96459384cfafef200c9
2023-08-23 11:07:08 -07:00
Zuul e62dad266d Merge "Log item related messages with event id" 2023-08-15 09:46:46 +00:00
James E. Blair 76f791e4d3 Fix linting errors
A new pycodestyle release errors on ",\".  We only use that to support
Python <3.10, and since Zuul is now targeting only 3.11+, these
instances are updated to use implicit continuation.

An instance of "==" is changed to "is".

A function definition which overrides an assignment is separated
so that the assignment always occurs regardless of whether it
ends up pointing to the function def.

Finally, though not required, since we're editing the code anyway
for nits, some typing info is removed.

Change-Id: I6bb096b87582ab1450bed02541483fc6f1d6c44a
2023-08-02 10:28:22 -07:00
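
An illustrative before/after of the style change (not taken from the
actual diff): the ",\" explicit continuation is replaced by implicit
continuation inside parentheses, and an identity comparison uses "is".

    # Before: a trailing comma followed by a backslash continuation,
    # which the new pycodestyle flags:
    #     result = some_function(first_argument, \
    #                            second_argument)

    def some_function(a, b):
        return (a, b)


    # After: implicit continuation inside the parentheses.
    result = some_function("first_argument",
                           "second_argument")

    # And identity checks use "is" rather than "==":
    value = None
    if value is None:
        print("unset")
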
Zuul 6c0ffe565f Merge "Report early failure from Ansible task failures" 2023-07-29 18:28:08 +00:00
Simon Westphahl 39e5eb9e2c Don't fail tenant validation for deprecations
Warnings are also added to the list of loading errors. For the tenant
validation we need to distinguish between errors and deprecation
warnings and only fail the validation when there are errors.

Alternatively we could also introduce a flag for the tenant validation
to treat deprecation warnings as errors.

Change-Id: I9c8957520c37157a295627848d30e52a36c8da0a
Co-Authored-By: James E. Blair <jim@acmegating.com>
2023-07-28 07:33:36 -07:00
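
A hedged sketch of the distinction being drawn, with an assumed
severity field: loading accumulates errors and deprecation warnings in
one list, and validation fails only when at least one entry is a real
error.

    from dataclasses import dataclass


    @dataclass
    class LoadingError:
        severity: str      # "error" or "warning"
        message: str


    def validateTenant(loading_errors):
        # Warnings are reported but do not fail validation; only real
        # errors cause a failure.
        for e in loading_errors:
            print(f"{e.severity.upper()}: {e.message}")
        return not any(e.severity == "error" for e in loading_errors)


    ok = validateTenant([
        LoadingError("warning",
                     "queue definition in pipeline is deprecated"),
    ])
    assert ok
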
Zuul c7adffc244 Merge "Deduplicate circular dep jobs in check" 2023-07-11 15:45:31 +00:00
James E. Blair 1170b91bd8 Report early failure from Ansible task failures
We can have the Ansible callback plugin tell the executor to tell
the scheduler that a task has failed and therefore the job will
fail.  This will allow the scheduler to begin a gate reset before
the failing job has finished and potentially save much developer
and CPU time.

We take some extra precautions to try to avoid sending a pre-fail
notification where we think we might end up retrying the job
(either due to a failure in a pre-run playbook, or an unreachable
host).  If that does happen then a gate pipeline might end up
flapping between two different NNFI configurations (ie, it may
perform unnecessary gate resets behind the change with the retrying
job), but should still not produce an incorrect result.  Presumably
the detections here should catch that case sufficiently early, but
due to the nature of these errors, we may need to observe it in
production to be sure.

Change-Id: Ic40b8826f2d54e45fb6c4d2478761a89ef4549e4
2023-06-29 13:40:34 -07:00
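
A rough sketch of the precaution described above (field names are
assumptions): the pre-fail notification is only sent for run-phase
task failures on reachable hosts, since pre-run failures and
unreachable hosts may lead to a retry.

    def should_report_early_failure(phase, task_result):
        # Skip the pre-fail notification when a retry is plausible:
        # pre-run playbook failures and unreachable hosts normally lead
        # to a retry, and an early gate reset for a retried job would
        # just cause flapping.
        if phase == "pre":
            return False
        if task_result.get("unreachable"):
            return False
        return task_result.get("failed", False)


    assert should_report_early_failure("run", {"failed": True})
    assert not should_report_early_failure("pre", {"failed": True})
    assert not should_report_early_failure("run", {"unreachable": True})
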
Simon Westphahl 7992ff08c6
Log item related messages with event id
A few messages in the scheduler so far were not using the annotated
logger. This means that those messages could not be easily correlated
via the event id. Improve this by using the annotated logger when
possible.

Change-Id: I4378e6db7e7fb0af2c29e1a226343b25de2ce508
2023-06-22 11:20:02 +02:00
James E. Blair b913a67889 Deduplicate circular dep jobs in check
If a cycle of dependencies in a dependent pipeline meet certain
conditions and share some jobs, we deduplicate those jobs (i.e.,
we run only one build of the job and share it between multiple
items simultaneously).

This makes the same happen for independent pipelines.  This is
accomplished by searching for identical bundles within the pipeline
and deduplicating the job across bundles.  The deduplication is
no different than dependent pipelines -- we simply store a reference
to the actual build in ZK and use the same Build object in Python.

In dependent pipelines, bundle queue items are removed
simultaneously (they all report at the same time).  However, in
independent pipelines, they may report at different times.  If
a queue item is dequeued while one of its builds is referenced by
a different queue item, then the second queue item will begin to
have errors loading information from ZK.

To deal with this, we utilize a property of how we store items in
ZK: the QueueItem itself is stored underneath the pipeline object
with no reference to its place in the queue.  Separately there are
queue lists which reference the QueueItem zobjects by path.

We will continue to remove the QueueItem from the Queue, but we will
leave it in place in ZK if it is possible that it may be referenced
by other queue items.  Then we will rely more heavily on a garbage
collection system to remove the QueueItems once they, and their builds,
are no longer referenced by any item in the pipeline.

The existing garbage collection happens after loading the pipeline
state from ZK, but this change moves it to the end of pipeline
processing so that we are more likely to write out empty pipelines
at the end of the process once all items are removed.

Change-Id: I7fc897088d720f483e320431fefaa06425a33c86
2023-04-20 16:10:22 -07:00
Simon Westphahl 711e1e5c98
Add missing event id to management events
The change management events via Zuul web and the command socket did not
have an event ID assigned. This made it harder to debug issues where we
need to find the logs related to a certain action.

Change-Id: I05ccbc13c7f906f91e13fb66e4a01a51fc822676
2023-04-14 08:27:29 +02:00
Zuul bb648c8b7d Merge "Match events to pipelines based on topic deps" 2023-02-27 21:49:42 +00:00
James E. Blair 9229e1a774 Match events to pipelines based on topic deps
We distribute tenant events to pipelines based on whether the event
matches the pipeline (ie, patchset-created for a check pipeline) or
if the event is related to a change already in the pipeline.  The
latter condition means that pipelines notice quickly when dependencies
are changed and they can take appropriate action (such as ejecting
changes which no longer have correct dependencies).

For git and commit dependencies, an update to a cycle to add a new
change requires an update to at least one existing change (for example,
adding a new change to a cycle usually requires at least two Depends-On
footers: the new change, as well as one of the changes already in the
cycle).  This means that updates to add new changes to cycles quickly
come to the attention of pipelines.

However, it is possible to add a new change to a topic dependency cycle
without updating any existing changes.  Merely uploading a new change
in the same topic adds it to the cycle.  Since that new change does
not appear in any existing pipelines, pipelines won't notice the update
until their next natural processing cycle, at which point they will
refresh dependencies of any changes they contain, and they will notice
the new dependency and eject the cycle.

To align the behavior of topic dependencies with git and commit
dependencies, this change causes the scheduler to refresh the
dependencies of the change it is handling during tenant trigger event
processing, so that it can then compare that change's dependencies
to changes already in pipelines to determine if this event is
potentially relevant.

This moves some work from pipeline processing (which is highly parallel)
to tenant processing (which is only somewhat parallel).  This could
slow tenant event processing somewhat.  However, the work is
persisted in the change cache, and so it will not need to be repeated
during pipeline processing.

This is necessary because the tenant trigger event processor operates
only with the pipeline change list data; it does not perform a full
pipeline refresh, so it does not have access to the current queue items
and their changes in order to compare the event change's topic with
currently enqueued topics.

There are some alternative ways we could implement this if the additional
cost is an issue:

1) At the beginning of tenant trigger event processing, using the change
   list, restore each of the queue's change items from the change cache
   and compare topics.  For large queues, this could end up generating
   quite a bit of ZK traffic.
2) Add the change topic to the change reference data structure so that
   it is stored in the change list.  This is an abuse of this structure
   which otherwise exists only to store the minimum amount of information
   about a change in order to uniquely identify it.
3) Implement a PipelineTopicList similar to a PipelineChangeList for storing
   pipeline topics and accessing them without a full refresh.

Another alternative would be to accept the delayed event handling of topic
dependencies and elect not to "fix" this behavior.

Change-Id: Ia9d691fa45d4a71a1bc78cc7a4bdec206cc025c8
2023-02-17 10:12:14 -08:00
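
A rough sketch of the relevance check, with illustrative names: the
event change's dependencies are refreshed during tenant trigger event
processing and compared against a pipeline's stored change keys.

    def event_is_relevant(event_change, refresh_dependencies,
                          pipeline_change_keys):
        # Refresh the event change's dependencies up front; the result
        # is persisted in the change cache so pipeline processing does
        # not repeat the work.
        deps = refresh_dependencies(event_change)
        # The event is relevant if the change or any of its
        # dependencies is already in the pipeline's change list.
        candidates = {event_change} | set(deps)
        return bool(candidates & set(pipeline_change_keys))


    print(event_is_relevant("I123", lambda c: ["Iabc"], ["Iabc", "Idef"]))
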
Simon Westphahl b23f76e677
Update reconfig event ltime on (smart) reconfig
Make sure to update the last reconfig event ltime in the layout state
during a (smart) reconfig. This is important in order to unblock
pipelines when a tenant reconfig event is lost.

So far the last reconfig event ltime was passed as -1, but wasn't set in
the layout state since the ltime is not allowed to go backward.

Change-Id: Iab04a962abbfbe901c22e4d5f1d484e3f53b0d33
2023-02-17 08:39:07 +01:00
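
A small sketch of the monotonic-update rule mentioned above
(simplified names): the stored ltime only moves forward, which is why
passing -1 never updated it and the reconfig must supply the real
event ltime.

    class LayoutState:
        def __init__(self):
            self.last_reconfig_event_ltime = -1

        def updateLastReconfigEventLtime(self, ltime):
            # The ltime is never allowed to go backward, so -1 is a
            # no-op; a (smart) reconfig therefore has to pass the real
            # current ltime to unblock pipelines after a lost event.
            self.last_reconfig_event_ltime = max(
                self.last_reconfig_event_ltime, ltime)


    state = LayoutState()
    state.updateLastReconfigEventLtime(-1)       # no effect
    state.updateLastReconfigEventLtime(1234)     # moves forward
    assert state.last_reconfig_event_ltime == 1234
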
Zuul 8c673e0320 Merge "Expand the scope of zkprofile" 2023-02-15 08:50:40 +00:00
Zuul 54ec0b2df7 Merge "Add more log messages for run handler steps" 2023-02-15 08:50:37 +00:00
Zuul af96e5786f Merge "Add scheduler run handler metric" 2023-02-15 08:43:19 +00:00
James E. Blair 3a829196b9 Expand the scope of zkprofile
Currently the zkprofile command only profiles the pipeline refresh
operations.  This expands the scope to include all pipeline operations
(refreshing and then any subsequent changes).  In particular, this
will help identify places where we write too much data.

Change-Id: Ida7bcb311b5037f6ce15a166ddeae6db44ed6709
2023-02-14 16:32:51 -08:00
James E. Blair 98dcd51d90 Fix race condition in pipeline change list init
Simon Westphahl describes the race condition:

> [The race condition] can occur after a reconfiguration while
> some schedulers are updating their local layout and some
> already start processing pipelines in the same tenant.
>
> In this case the pipeline manager's `_postConfig()` method that
> calls `PipelineChangeList.create(...)` races with the pipeline
> processor updating the change keys.
>
> This leads to two change lists being written as separate
> shards, that can't be correctly loaded, as all shards combined
> are expected to form a single JSON object.
>
> The sequence of events seems to be the following:
> 1. S1: pipeline processor needs to update the change keys
> 2. S1 the shard writer will delete the `change_list` key with the old
>    shards
> 3. S2: configloader calls the `_postConfig()` method
> 4. S2: `PipelineChangeList.create()` notices that the `change_list` node
>    doesn't exist in Zookeeper:
>    https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L921
> 6. S2: the shard writer creates the first shard `0000000000`
> 7. S1: the shard writer creates the second shard `0000000001`
>
> The race condition seems to be introduced with
> Ib1e467b5adb907f93bab0de61da84d2efc22e2a7

That change updated the pipeline manager _postConfig method so
that it no longer acquires the pipeline lock when initializing the
pipeline state and change lists.  This greatly reduces potential
delays during reconfiguration, but has, perhaps predictably, led
to the race condition above.

In the commit message for that change, I noted that we might be
able to avoid even more work if we accept some caveats related to
error reporting.  Avoiding that work means avoiding performing any
writes during _postConfig, which addresses the root cause of the
race condition (steps 3-6 above; Ed. note: there is no step 5).

From the commit message:

> We still need to attach new state and change list objects to
> the pipeline in _postConfig (since our pipeline object is new).
> We also should make sure that the objects exist in ZK before we
> leave that method, so that if a new pipeline is created, other
> schedulers will be able to load the (potentially still empty)
> objects from ZK.  As an alternative, we could avoid even this
> work in _postConfig, but then we would have to handle missing
> objects on refresh, and it would not be possible to tell if the
> object was missing due to it being new or due to an error.  To
> avoid masking errors, we keep the current expectation that we
> will create these objects in ZK on the initial reconfiguration.

The current change does exactly that.  We no longer perform any
ZK write operations on the state and change list objects in
_postConfig.  Instead, inside of the refresh methods, we detect
the cases where they should be newly created and do so at that
time.  This happens with the pipeline lock, so is safe against
any simultaneous operation from other components.

There will be "ERROR" level log messages indicating that reading
the state from ZK has failed when these objects are first
initialized.  To indicate that this is probably okay, they will
now be immediately followed by "WARNING" level messages explaining
that.

Strictly speaking, this particular race should only occur for the
change list object, not the pipeline state, since the race
condition above requires a sharded object and of the two, only
the change list is sharded.  However, to keep the lifecycle of
these two objects matched (and to simplify _postConfig) the same
treatment is applied to both.

Note that change I7fa99cd83a857216321f8d946fd42abd9ec427a3 merged
after Ib1e467b and changed the behavior slightly, introducing the
old_state and old_list arguments.  Curiously, the old_list
argument is effectively unused, so it is removed entirely in this
change.  Old_state still has a purpose and is retained.

Change-Id: I519348e7d5d74e675808e990920480fb6e1fb981
2023-02-10 15:03:08 -08:00
Simon Westphahl ec5d50d177
Add more log messages for run handler steps
This change adds distinct log messages for certain run handler steps.
So far some of those phases could only be inferred from other related
log messages.

Change-Id: Icceca78456b20754cbebb5747da3a538586cd5f5
2023-02-06 08:06:22 +01:00
Simon Westphahl 95ecb41c51
Add scheduler run handler metric
In order to better understand the scheduler run handler performance this
change adds a new `zuul.scheduler.run_handler` metric to measure the
duration of one run handler loop.

Change-Id: I77e862cf99d6a8445e71d7daab410d5853487dc3
2023-02-06 08:05:41 +01:00
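
A rough sketch of the new metric, with a stand-in statsd client: one
run handler iteration is timed and emitted as
zuul.scheduler.run_handler.

    import time


    class FakeStatsd:
        # Stand-in for the real statsd client.
        def timing(self, key, ms):
            print(f"{key}: {ms:.1f}ms")


    statsd = FakeStatsd()


    def run_handler_once(process_tenants):
        start = time.monotonic()
        try:
            process_tenants()
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            statsd.timing("zuul.scheduler.run_handler", elapsed_ms)


    run_handler_once(lambda: time.sleep(0.01))
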
Simon Westphahl 9048706d93
Cleanup deleted pipelines and event queues
When a pipeline is removed during a reconfiguration Zuul will cancel
active builds and node requests. However, since we no longer refresh the
pipeline state during a reconfig we can run into errors when Zuul tries
to cancel builds and node requests based on a possibly outdated pipeline
state.

2023-01-17 10:41:32,223 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2007, in run
    self.process_tenant_management_queue(tenant)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2452, in process_tenant_management_queue
    self._process_tenant_management_queue(tenant)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2462, in _process_tenant_management_queue
    self._doTenantReconfigureEvent(event)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1533, in _doTenantReconfigureEvent
    self._reconfigureTenant(ctx, min_ltimes,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1699, in _reconfigureTenant
    self._reconfigureDeletePipeline(old_pipeline)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1804, in _reconfigureDeletePipeline
    self.cancelJob(build.build_set, build.job,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2930, in cancelJob
    build.updateAttributes(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 193, in updateAttributes
    self._save(context, serial)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 392, in _save
    zstat = self._retry(context, self._retryableSave,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 314, in _retry
    return kazoo_retry(func, *args, **kw)
  File "/opt/zuul/lib/python3.10/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 371, in _retryableSave
    zstat = context.client.set(path, compressed_data,
  File "/opt/zuul/lib/python3.10/site-packages/kazoo/client.py", line 1359, in set
    return self.set_async(path, value, version).get()
  File "/opt/zuul/lib/python3.10/site-packages/kazoo/handlers/utils.py", line 86, in get
    raise self._exception
kazoo.exceptions.BadVersionError

To fix this we need to refresh the pipeline state prior to canceling
those active builds and node requests.

We will also take care of removing the pipeline state and the event
queues from Zookeeper if possible. Errors will be ignored as the
periodic cleanup task takes care of removing leaked pipelines.

Change-Id: I2986419636d8c6557d68d65fb6aff589aa4a680e
2023-01-24 10:25:04 +01:00
Simon Westphahl a2b114e1a3
Periodically cleanup leaked pipeline state
So far we did not cleanup the pipeline state and event queues of deleted
pipelines. To fix that we'll remove the data of pipelines that are no
longer part of the tenant layout in a periodic cleanup task as part of
the general cleanup.

To support the use-case when a pipeline is added back we also need to
initialize the event queues during a tenant reconfiguration. The local
event queue registry usually takes care of creating the queues when the
even queue is first accessed. However, the old event queue object could
still be cached in the registry when we remove and re-add a pipeline.

For the same use-case we also need to remove the pipeline from the list
of watches in the event watcher. Otherwise we won't re-create the
children watch when the pipeline is added.

Change-Id: I02127fe462cc390c81330e717be55780bc2535eb
2023-01-24 10:25:04 +01:00
Simon Westphahl 19668e0bc7
Require latest layout for processing mgmt events
A scheduler should only be allowed to process the management event
queue for a tenant when the local layout is up-to-date. This is already
the case for the tenant trigger event queue.

Otherwise the scheduler could forward management events to pipelines that
no longer exist.

Change-Id: I3c485b9e54a65e7c1c09dc51b1f2b4c83c45de19
2023-01-24 10:25:04 +01:00
Zuul cb40ddc7db Merge "Store pause and resume events on the build and report them" 2023-01-02 20:57:37 +00:00
Felix Edel f9786ac2a8 Store pause and resume events on the build and report them
When a build is paused or resumed, we now store this information on the
build together with the event time. Instead of additional attributes for
each timestamp, we add an "event" list attribute to the build which can
also be used for other events in the future.

The events are stored in the SQL database and added to the MQTT payload
so the information can be used by the zuul-web UI (e.g. in the "build
times" gantt chart) or provided to external services.

Change-Id: I789b4f69faf96e3b8fd090a2e389df3bb9efd602
2023-01-02 10:07:26 +01:00
Zuul 35c68169bd Merge "Fix deduplication exceptions in pipeline processing" 2022-12-20 01:48:36 +00:00
James E. Blair fe04739c78 Reuse queue items after reconfiguration
When we reconfigure, we create new Pipeline objects, empty the values
in the PipelineState and then reload all the objects from ZK.  We then
re-enqueue all the QueueItems to adjust and correct the object
pointers between them (item_ahead and items_behind).  We can avoid
reloading all the objects from ZK if we keep queue items from the
previous layout and rely on the re-enqueue method correctly resetting
any relevant object pointers.

We already defer this re-enqueue work to the next pipeline processing
after a reconfiguration (so the reconfiguration itself doesn't take
very long, but now the first pipeline run after a reconfiguration must
perform a complete refresh).  With this change, that first refresh
is no longer a complete refresh but a normal refresh, so we will get
the benefits of previous reductions in refresh times.

The main risk of this change is that it could introduce a memory leak.
During development, additional debugging was performed to verify that
after a re-enqueue, there are no obsolete layout or pipeline objects
reachable from the pipeline state object.

On schedulers where a re-enqueue does not take place (these schedulers
would simply see the layout update and re-create their PipelineState
python objects and refresh them after another scheduler has already
performed the re-enqueue), we need to ensure that we update any
internal references to Pipeline objects (which then lead to Layout
objects and can cause memory leaks).

To address that, we update the pipeline references in the ChangeQueue
instances underneath a given PipelineState when that state is being
reset after a reconfiguration.

This change also removes the pipeline reference from the QueueItem,
replacing it with a property that uses the pipeline reference
on the ChangeQueue instead.  This removes one extra place where
an incorrect reference could cause a memory leak.

Change-Id: I7fa99cd83a857216321f8d946fd42abd9ec427a3
2022-12-13 13:19:48 -08:00
Zuul e0bd0b6ab0 Merge "Avoid acquiring pipeline locks in manager postConfig" 2022-12-13 18:07:16 +00:00
James E. Blair e25795a120 Avoid acquiring pipeline locks in manager postConfig
After creating a new tenant layout, we call _postConfig on every
pipeline manager.  That creates the shared change queues and then
also attaches new PipelineState and PipelineChangeList objects
to the new Pipeline objects on the new layout.

If the routine detects that it is the first scheduler to deal with
this pipeline under the new layout UUID, it also resets the
pipeline state in ZK (resets flags and moves all queues to the
old_queues attribute so they will be re-enqueued).  It also drops
a PostConfigEvent into the pipeline queues to trigger a run of
the pipeline after the re-enqueues.

The work in the above paragraph means that we must hold the
pipeline lock for each pipeline in turn during the reconfiguration.
Most pipelines should not be processed at this point since we do
hold the tenant write lock, however, some cleanup routines can
be operating, and so we would end up waiting for them to complete
before completing the reconfiguration.  This could end up adding
minutes to a reconfiguration.  Incidentally, these cleanup routines
are written with the expectation that they may be running during
a reconfiguration and handle missing data from refreshes.

We can avoid this by moving the "reset" work into the PipelineState
deserialization method, where we can determine at the moment we
refresh the object whether we need to "reset" it and do so.  We
can tell that a reset needs to happen if the layout uuid of the
state object does not match the running layout of the tenant.

We still need to attach new state and change list objects to the
pipeline in _postConfig (since our pipeline object is new).  We
also should make sure that the objects exist in ZK before we leave
that method, so that if a new pipeline is created, other schedulers
will be able to load the (potentially still empty) objects from ZK.
As an alternative, we could avoid even this work in _postConfig, but
then we would have to handle missing objects on refresh, and it
would not be possible to tell if the object was missing due to it
being new or due to an error.  To avoid masking errors, we keep
the current expectation that we will create these objects in ZK
on the initial reconfiguration.

Finally, we do still want to run the pipeline processing after
a reconfiguration (at least for now -- we may be approaching a
point where that will no longer be necessary).  So we move the
emission of the PostConfigEvent into the scheduler in the cases
where we know it has just updated the tenant layout.

Change-Id: Ib1e467b5adb907f93bab0de61da84d2efc22e2a7
2022-12-12 13:24:22 -08:00
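
A simplified sketch of moving the reset into deserialization
(attribute names are illustrative): when the state's layout uuid
differs from the tenant's running layout uuid, the refresh itself
performs the reset while the pipeline lock is held.

    class PipelineState:
        def __init__(self):
            self.layout_uuid = None
            self.queues = []
            self.old_queues = []

        def refresh(self, zk_data, tenant_layout_uuid):
            # Load whatever is currently in ZK.
            self.__dict__.update(zk_data)
            if self.layout_uuid != tenant_layout_uuid:
                # First refresh under a new layout: reset here, under
                # the pipeline lock, instead of in _postConfig.
                self.old_queues = self.old_queues + self.queues
                self.queues = []
                self.layout_uuid = tenant_layout_uuid


    state = PipelineState()
    state.refresh({"layout_uuid": "old", "queues": ["q1"],
                   "old_queues": []},
                  tenant_layout_uuid="new")
    assert state.queues == [] and state.old_queues == ["q1"]
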
Simon Westphahl 71598ee51b
Allow clean scheduler shutdown when priming fails
If config priming failed we need to unblock the main thread by setting
the primed event so we can join the main thread during shutdown.

So far this was only done after trying to join the main thread, so we
need to set the primed event earlier.

Change-Id: I5aef9a215cfd53baa94e525f8c303592a6c7e4b8
2022-12-01 07:32:55 +01:00
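
A minimal sketch, under assumed names, of why the primed event must be
set even on failure: the main thread blocks on it, and shutdown needs
to join that thread.

    import threading


    class MiniScheduler:
        def __init__(self):
            self.primed = threading.Event()

        def primeConfig(self, load_config):
            try:
                load_config()
            except Exception:
                print("config priming failed")
            finally:
                # Set the event even on failure so the main thread,
                # which waits on it, can still be joined at shutdown.
                self.primed.set()

        def run(self):
            self.primed.wait()     # main loop waits for priming


    def broken_loader():
        raise RuntimeError("boom")


    sched = MiniScheduler()
    main = threading.Thread(target=sched.run)
    main.start()
    sched.primeConfig(broken_loader)
    main.join()                    # clean shutdown despite the failure
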
James E. Blair 279d7fb5cd
Fix deduplication exceptions in pipeline processing
If a build is to be deduplicated and has not started yet and has
a pending node request, we store a dictionary describing the target
deduplicated build in the node_requests dictionary on the buildset.

There were a few places where we directly accessed that dictionary
and assumed the results would be the node request id.  Notably, this
could cause an error in pipeline processing (as well as potentially
some other edge cases such as reconfiguring).

Most of the time we can just ignore deduplicated node requests since
the "real" buildset will take care of them.  This change enriches
the API to help with that.  In other places, we add a check for the
type.

To test this, we enable relative_priority in the config file which
is used in the deduplication tests, and we also add an assertion
which runs at the end of every test that ensures there were no
pipeline exceptions during the test (almost all the existing dedup
tests fail this assertion before this change).

Change-Id: Ia0c3f000426011b59542d8e56b43767fccc89a22
2022-11-21 09:22:25 +01:00
Zuul 57451d04b9 Merge "Add extra logging around tenant reconfiguration events" 2022-11-10 16:29:15 +00:00
Zuul ed013d82cc Merge "Parallelize some pipeline refresh ops" 2022-11-10 15:01:09 +00:00
Zuul fe49687f29 Merge "Add playbook semaphores" 2022-11-10 12:32:07 +00:00
James E. Blair 3a981b89a8 Parallelize some pipeline refresh ops
We may be able to speed up pipeline refreshes in cases where there
are large numbers of items or jobs/builds by parallelizing ZK reads.

Quick refresher: the ZK protocol is async, and kazoo uses a queue to
send operations to a single thread which manages IO.  We typically
call synchronous kazoo client methods which wait for the async result
before returning.  Since this is all thread-safe, we can attempt to
fill the kazoo pipe by having multiple threads call the synchronous
kazoo methods.  If kazoo is waiting on IO for an earlier call, it
will be able to start a later request simultaneously.

Quick aside: it would be difficult for us to use the async methods
directly since our overall code structure is still ordered and
effectively single threaded (we need to load a QueueItem before we
can load the BuildSet and the Builds, etc).

Thus it makes the most sense for us to retain our ordering by using
a ThreadPoolExecutor to run some operations in parallel.

This change parallelizes loading QueueItems within a ChangeQueue,
and also Builds/Jobs within a BuildSet.  These are the points in
a pipeline refresh tree which potentially have the largest number
of children and could benefit the most from the change, especially
if the ZK server has some measurable latency.

Change-Id: I0871cc05a2d13e4ddc4ac284bd67e5e3003200ad
2022-11-09 10:51:29 -08:00
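
A hedged sketch of the parallelization pattern, with a placeholder
read function rather than the kazoo API: blocking per-item reads are
fanned out over a ThreadPoolExecutor while results come back in the
caller's original order.

    from concurrent.futures import ThreadPoolExecutor


    def read_item(path):
        # Placeholder for a synchronous ZooKeeper read (e.g. a blocking
        # client call); each call waits on IO independently.
        return f"data-for-{path}"


    def refresh_queue_items(paths, max_workers=4):
        # Fan the blocking reads out over a few threads so the single
        # ZK IO thread can keep its pipe full, then return results in
        # the original order to preserve our ordered processing model.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(read_item, paths))


    print(refresh_queue_items(["item-0001", "item-0002", "item-0003"]))
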
Zuul 66c0865985 Merge "Don't refresh the pipeline state during reconfig" 2022-11-09 07:50:24 +00:00
James E. Blair faa519520e Update help text for zkprofile command
The help text for this command should indicate that it toggles
rather than simply enables profiling.

Change-Id: I65cc16b745f086450463d44fe62c874471349708
2022-11-08 16:12:37 -08:00
James E. Blair c355adf44e Add playbook semaphores
This adds the ability to specify that the Zuul executor should
acquire a semaphore before running an individual playbook.  This
is useful for long running jobs which need exclusive access to
a resource for only a small amount of time.

Change-Id: I90f5e0f570ef6c4b0986b0143318a78ddc27bbde
2022-11-07 08:41:10 -08:00