Commit Graph

831 Commits

Author SHA1 Message Date
James E. Blair 1350ce8ad6 Use NodesetNotFoundError class
This exception class went unused, likely due to complications
from circular imports.

To resolve this, move all of the configuration error exceptions
into the exceptions.py file so they can be imported in both
model.py and configloader.py.

Change-Id: I19b0f078f4d215a2e14c2c7ed893ab225d1e1084
2024-03-18 15:03:58 -07:00
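
A minimal sketch of the restructuring described above (the module
layout and base class here are illustrative): the configuration error
classes live in exceptions.py so that model.py and configloader.py can
both import them without importing each other.

    # exceptions.py -- single home for configuration error classes.
    class ConfigurationError(Exception):
        """Base class for configuration errors."""


    class NodesetNotFoundError(ConfigurationError):
        def __init__(self, name):
            # Carry the missing nodeset name in the message.
            super().__init__(f"Nodeset {name!r} not found")


    # model.py and configloader.py can now both do:
    #     from zuul.exceptions import NodesetNotFoundError
    # without importing each other, which is what previously left this
    # class unused.
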
Zuul 617bbb229c Merge "Fix validate-tenants isolation" 2024-02-28 02:46:55 +00:00
James E. Blair fa274fcf25 Remove most model_api backwards-compat handling
Since we require deleting the state (except config cache) for the
circular dependency upgrade, we can now remove any zookeeper
backwards compatibility handling not in the config cache.  This
change does so.  It leaves upgrade handling for two attributes
in the branch cache (even though they are old) because it is
plausible that even systems that have been upgrading regularly
may have non-upgraded data in the branch cache for infrequently
used branches, and the cost for retaining the upgrade handling
is small.

Change-Id: I881578159b06c8c7b38a9fa0980aa0626dddad31
2024-02-09 07:39:55 -08:00
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueueing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions which were added to prepare for this
upgrade have been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueueing were not valid.  This has been corrected, and
in doing so, has necessitated some changes around quiet dequeueing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
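
A hedged sketch of the new indexing model: buildset data structures
are keyed by the frozen job's uuid so that two jobs with the same name
(for different changes in a cycle) can coexist.  Class and attribute
names here are illustrative, not Zuul's actual model.

    import uuid
    from dataclasses import dataclass, field


    @dataclass
    class FrozenJob:
        name: str
        change: str     # identifier of the change/ref this job runs for
        uuid: str = field(default_factory=lambda: uuid.uuid4().hex)


    @dataclass
    class BuildSet:
        # Indexed by frozen job uuid, not job name, so jobs with the
        # same name for different changes in a cycle can coexist.
        jobs: dict = field(default_factory=dict)

        def addJob(self, job):
            self.jobs[job.uuid] = job

        def getJob(self, job_uuid=None, name=None, change=None):
            # Either the uuid or the (job name, change) tuple identifies
            # a job within a buildset.
            if job_uuid is not None:
                return self.jobs[job_uuid]
            return next(j for j in self.jobs.values()
                        if j.name == name and j.change == change)
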
James E. Blair fb7d24b245 Fix validate-tenants isolation
The validate-tenants scheduler subcommand is supposed to perform
complete tenant validation, and in doing so, it interacts with zk.
It is supposed to isolate itself from the production data, but
it appears to accidentally use the same unparsed config cache
as the production system.  This is mostly okay, but if the loading
paths are different, it could lead to writing cache errors into
the production file cache.

The error is caused because the ConfigLoader creates an internal
reference to the unparsed config cache and therefore ignores the
temporary/isolated unparsed config cache created by the scheduler.

To correct this, we will always pass the unparsed config cache
into the configloader.

Change-Id: I40bdbef4b767e19e99f58cbb3aa690bcb840fcd7
2024-01-31 14:58:45 -08:00
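
A rough sketch of the isolation fix, using an illustrative ConfigLoader
interface rather than Zuul's actual signatures: the caller always
passes the unparsed config cache in, so validate-tenants can supply a
temporary, isolated cache and never touches the production one.

    class ConfigLoader:
        def __init__(self):
            # No internal reference to a shared unparsed config cache is
            # kept here any more.
            pass

        def loadTenantConfig(self, tenant_name, unparsed_config_cache):
            # Always use the cache the caller passed in; for the
            # validate-tenants subcommand this is a temporary, isolated
            # cache, so nothing is written back to the production cache.
            return unparsed_config_cache.get(tenant_name, {})


    loader = ConfigLoader()
    production_cache = {"example": {"zuul.yaml": "..."}}
    isolated_cache = {}        # what validate-tenants would supply
    loader.loadTenantConfig("example", production_cache)
    loader.loadTenantConfig("example", isolated_cache)
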
James E. Blair 7eb3fbeb97
Use a DummyFrozenJob in build cleanup
There is a place where we try to clean up builds that we may no
longer have FrozenJobs for; but the cleanup methods expect FrozenJob
objects essentially just to get the job name.

To make it more clear that's what's happening, use a dedicated object
to pass that information.

Change-Id: Ieb85bcba6d9a184e28120953adf882e4720621b2
2024-01-15 07:39:51 +01:00
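
A minimal sketch of the dedicated placeholder: a DummyFrozenJob
carrying only the job name is the idea; the surrounding helper here is
purely illustrative.

    from dataclasses import dataclass


    @dataclass
    class DummyFrozenJob:
        # Stand-in for a FrozenJob we no longer have; the cleanup
        # methods essentially only need the job name, and a distinct
        # type makes that intent explicit.
        name: str


    def cleanupOrphanedBuild(build_name, cleanup_method):
        # cleanup_method expects something FrozenJob-shaped.
        cleanup_method(DummyFrozenJob(name=build_name))


    cleanupOrphanedBuild("tox-py311",
                         lambda job: print("cleaning up", job.name))
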
James E. Blair 663931cba3 Include job_uuid in BuildRequests/Events
This is part of the circular dependency refactor.

It updates BuildRequests to include the job uuid (similar to the
previous change with NodeRequests).  We also use the job uuid when
sending BuildResultEvents.

Change-Id: If7a55138d0cb5a358c62e8cd97ee322087b09a6b
2024-01-08 08:54:56 -08:00
James E. Blair 7262ef7f6f Include job_uuid in NodeRequests
This is part of the circular dependency refactor.  It updates the
NodeRequest object to include the job_uuid in addition to the job_name
(which is temporarily kept for backwards compatibility).  When node
requests are completed, we now look up the job by uuid if supplied.

Change-Id: I57d4ab6c241b03f76f80346b5567600e1692947a
2023-12-20 10:44:04 -08:00
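
A hedged sketch of the backward-compatible lookup; the attribute and
dictionary names are assumptions for illustration.

    from types import SimpleNamespace


    def lookupJobForNodeRequest(buildset, request):
        # Prefer the uuid if the request carries one (new-style
        # request); fall back to the job name for requests written
        # before the upgrade, which only have job_name.
        job_uuid = getattr(request, "job_uuid", None)
        if job_uuid is not None:
            return buildset.jobs_by_uuid[job_uuid]
        return buildset.jobs_by_name[request.job_name]


    buildset = SimpleNamespace(jobs_by_uuid={"abc123": "frozen-job"},
                               jobs_by_name={"tox-py311": "frozen-job"})
    new_req = SimpleNamespace(job_uuid="abc123", job_name="tox-py311")
    old_req = SimpleNamespace(job_name="tox-py311")
    assert lookupJobForNodeRequest(buildset, new_req) == "frozen-job"
    assert lookupJobForNodeRequest(buildset, old_req) == "frozen-job"
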
James E. Blair 9201f9ee28 Store builds on buildset by uuid
This is part of the circular dependency refactor.

This updates the buildset object in memory (and zk) to store builds
indexed by frozen job uuid rather than job name.  This also updates
several related fields and also temporary dictionaries to do the same.

This will allow us, in the future, to have more than one job/build
in a buildset with the same name (for different changes/refs).

Change-Id: I70865ec8d70fb9105633f0d03ba7c7e3e6cd147d
2023-12-12 11:58:21 -08:00
James E. Blair 1a226acbd0 Emit stats for more build results
We currently typically only emit build stats for success and failure,
but do not emit stats for NODE_FAILURE, CANCELED, SKIPPED, etc.

To round out the feature so that all build results are emitted, this
includes a small refactor to report build stats at the same time we
report build results to the database (in almost all cases).  One
exception to that is when we receive a non-current build result --
that generally happens after a build is canceled, so we don't need
to emit the stats again.

Change-Id: I3bdf4e2643e151e2eae9f204f62cdc106df876b4
2023-09-26 11:32:03 -07:00
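
A rough sketch of the refactor, with stand-ins for the statsd client
and the SQL reporter: results and stats are reported from one place,
and stats are skipped for non-current results.

    import collections

    stats = collections.Counter()     # stand-in for a statsd client
    database = []                     # stand-in for the SQL reporter


    def reportBuildEnd(build_name, result, current=True):
        # Always record the result in the database.
        database.append((build_name, result))
        if not current:
            # A non-current result generally arrives after the build
            # was canceled; its stats were already emitted, so skip.
            return
        # Emit a stat for every result, not just SUCCESS/FAILURE --
        # NODE_FAILURE, CANCELED, SKIPPED, etc. are counted too.
        stats[f"builds.{result.lower()}"] += 1


    reportBuildEnd("tox-py311", "NODE_FAILURE")
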
Zuul e660979a8c Merge "Use tenant-level layout locks" 2023-08-25 14:44:38 +00:00
James E. Blair eb803984a0 Use tenant-level layout locks
The current "layout_lock" in the scheduler is really an "abide" lock.
We lock it every time we change something in the abide (including
tenant layouts).  The name is inherited from pre-multi-tenant Zuul.

This can cause some less-than-optimal behavior when we need to wait to
acquire the "layout_lock" for a tenant reconfiguration event in one
thread while another thread holds the same lock because it is
reloading the configuration for a different tenant.  Ideally we should
be able to have finer-grained tenant-level locking instead, allowing
for less time waiting to reconfigure.

The following sections describe the layout lock use prior to this
commit and how this commit adjusts the code to make it safe for
finer-grained locking.

1) Tenant iteration

The layout lock is used in some places (notably some cleanup methods)
to avoid having the tenant list change during the method.  However,
the configloader already performs an atomic replacement of the tenant
list making it safe for iteration.  This change adds a lock around
updates to the tenant list to prevent corruption if two threads update
it at the same time.

The semaphore cleanup method indirectly references the abide and
layout for use in global and local semaphores.  This is just for path
construction, and the semaphores exist apart from the abide and layout
configurations and so should not be affected by either changing while
the cleanup method is running.

The node request cleanup method could end up running with outdated
layout objects, including pipelines; however, it should not be a
problem if these orphaned objects end up refreshing data from ZK right
before they are removed.

In these cases, we can simply remove the layout lock.

2) Protecting the unparsed project branch cache

The config cache cleanup method uses the unparsed project branch cache
(that is, the in-memory cache of the contents of zuul config files) to
determine what the active projects are.

Within the configloader, the cache is updated and then also used while
loading tenant configuration.  The layout lock would have made sure
all of these actions were mutually exclusive.  In order to remove the
layout lock here, we need to make sure the Abide's
unparsed_project_branch_cache is safe for concurrent updates.

The unparsed_project_branch_cache attribute is a dictionary that
contains references to UnparsedBranchCache objects.  Previously, the
configloader would delete each UnparsedBranchCache object from the
dictionary, reinitialize it, then incrementally add to it.

This current process has a benign flaw.  The branch cache is cleared,
and then loaded with data based on the tenant project config (TPC)
currently being processed.  Because the cache is loaded based on data
from the TPC, it is really only valid for one tenant at a time despite
our intention that it be valid for the entire abide.  However, since
we do check whether it is valid for a given TPC, and then clear and
reload it if it is not, there is no error in data, merely an
incomplete utilization of the cache.

In order to make the cache safe for use by different tenants at the
same time, we address this problem (and in doing so make the cache
*effective* for different tenants, even at different times).
The cache is updated to store the ltime for each entry in the cache,
and also to store null entries (with ltimes) for files and paths that
have been checked but are not present in the project-cache.  This
means that at any given time we can determine whether the cache is
valid for a given TPC, and support multiple TPCs (i.e., multiple
tenants).

It's okay for the cache to be updated simultaneously by two tenants
since we don't allow the cache contents to go backwards in ltime.  The
cache will either have the data with at least the ltime required, or
if not, that particular tenant load will spawn cat jobs and update it.

3) Protecting Tenant Project Configs (TPCs)

The loadTPC method on the ConfigLoader would similarly clear the TPCs
for a tenant, then add them back.  This could be problematic for any
other thread which might be referencing or iterating over TPCs.  To
correct this, we take a similar approach of atomic replacement.

Because there are two flavors of TPCs (config and untrusted) and they
are stored in two separate dictionaries, in order to atomically update
a complete tenant at once, the storage hierarchy is restructured as
"tenant -> {config/untrusted} -> project" rather than
"{config/untrusted} -> tenant -> project".  A new class named
TenantTPCRegistry holds both flavors of TPCs for a given tenant, and
it is this object that is atomically replaced.

Now that these issues are dealt with, we can implement a tenant-level
thread lock that is used simply to ensure that two threads don't
update the configuration for the same tenant at the same time.

The scheduler's unparsed abide is updated in two places: upon full
reconfiguration, or when another scheduler has performed a full
reconfiguration and updated the copy in ZK.  To prevent these two
methods from performing the same update simultaneously, we add an
"unparsed_abide_lock" and mutually exclude them.

Change-Id: Ifba261b206db85611c16bab6157f8d1f4349535d
2023-08-24 17:32:25 -07:00
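
A minimal sketch of the finer-grained locking, assuming a plain
threading.Lock per tenant plus a separate lock guarding atomic
replacement of the tenant list; this is illustrative rather than
Zuul's exact implementation.

    import threading
    from collections import defaultdict


    class Scheduler:
        def __init__(self):
            # One lock per tenant, so reconfiguring tenant A never
            # waits on a thread that is reloading tenant B.
            self.layout_locks = defaultdict(threading.Lock)
            # Guards atomic replacement of the tenant list itself.
            self.tenant_list_lock = threading.Lock()
            self.tenants = {}

        def reconfigureTenant(self, name, build_layout):
            with self.layout_locks[name]:
                new_tenant = build_layout(name)
                with self.tenant_list_lock:
                    # Atomic replacement: iterating threads see either
                    # the old or the new mapping, never a partial one.
                    self.tenants = {**self.tenants, name: new_tenant}
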
James E. Blair 57a9c13197 Use the GitHub default branch as the default branch
This supplies a per-project default value for Zuul's default-branch
based on what the default branch is set to in GitHub.  This means
that if users omit the default-branch setting on a Zuul project
stanza, Zuul will automatically use the correct value.

If the value in GitHub is changed, an event is emitted which allows
us to automatically reconfigure the tenant.

This could be expanded to other drivers that support an indication
of which branch is default.

Change-Id: I660376ecb3f382785d3bf96459384cfafef200c9
2023-08-23 11:07:08 -07:00
Zuul e62dad266d Merge "Log item related messages with event id" 2023-08-15 09:46:46 +00:00
James E. Blair 76f791e4d3 Fix linting errors
A new pycodestyle release errors on ",\".  We only use that to support
Python <3.10, and since Zuul is now targeting only 3.11+, these
instances are updated to use implicit continuation.

An instance of "==" is changed to "is".

A function definition which overrides an assignment is separated
so that the assignment always occurs regardless of whether it
ends up pointing to the function def.

Finally, though not required, since we're editing the code anyway
for nits, some typing info is removed.

Change-Id: I6bb096b87582ab1450bed02541483fc6f1d6c44a
2023-08-02 10:28:22 -07:00
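
An illustrative before/after of the style change (not taken from the
actual diff): the ",\" explicit continuation is replaced by implicit
continuation inside parentheses, and an identity comparison uses "is".

    # Before: a trailing comma followed by a backslash continuation,
    # which the new pycodestyle flags:
    #     result = some_function(first_argument, \
    #                            second_argument)

    def some_function(a, b):
        return (a, b)


    # After: implicit continuation inside the parentheses.
    result = some_function("first_argument",
                           "second_argument")

    # And identity checks use "is" rather than "==":
    value = None
    if value is None:
        print("unset")
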
Zuul 6c0ffe565f Merge "Report early failure from Ansible task failures" 2023-07-29 18:28:08 +00:00
Simon Westphahl 39e5eb9e2c Don't fail tenant validation for deprecations
Warnings are also added to the list of loading errors. For the tenant
validation we need to distinguish between errors and deprecation
warnings and only fail the validation when there are errors.

Alternatively we could also introduce a flag for the tenant validation
to treat deprecation warnings as errors.

Change-Id: I9c8957520c37157a295627848d30e52a36c8da0a
Co-Authored-By: James E. Blair <jim@acmegating.com>
2023-07-28 07:33:36 -07:00
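
A hedged sketch of the distinction being drawn, with an assumed
severity field: loading accumulates errors and deprecation warnings in
one list, and validation fails only when at least one entry is a real
error.

    from dataclasses import dataclass


    @dataclass
    class LoadingError:
        severity: str      # "error" or "warning"
        message: str


    def validateTenant(loading_errors):
        # Warnings are reported but do not fail validation; only real
        # errors cause a failure.
        for e in loading_errors:
            print(f"{e.severity.upper()}: {e.message}")
        return not any(e.severity == "error" for e in loading_errors)


    ok = validateTenant([
        LoadingError("warning",
                     "queue definition in pipeline is deprecated"),
    ])
    assert ok
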
Zuul c7adffc244 Merge "Deduplicate circular dep jobs in check" 2023-07-11 15:45:31 +00:00
James E. Blair 1170b91bd8 Report early failure from Ansible task failures
We can have the Ansible callback plugin tell the executor to tell
the scheduler that a task has failed and therefore the job will
fail.  This will allow the scheduler to begin a gate reset before
the failing job has finished and potentially save much developer
and CPU time.

We take some extra precautions to try to avoid sending a pre-fail
notification where we think we might end up retrying the job
(either due to a failure in a pre-run playbook, or an unreachable
host).  If that does happen then a gate pipeline might end up
flapping between two different NNFI configurations (ie, it may
perform unnecessary gate resets behind the change with the retrying
job), but should still not produce an incorrect result.  Presumably
the detections here should catch that case sufficiently early, but
due to the nature of these errors, we may need to observe it in
production to be sure.

Change-Id: Ic40b8826f2d54e45fb6c4d2478761a89ef4549e4
2023-06-29 13:40:34 -07:00
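
A rough sketch of the precaution described above (field names are
assumptions): the pre-fail notification is only sent for run-phase
task failures on reachable hosts, since pre-run failures and
unreachable hosts may lead to a retry.

    def should_report_early_failure(phase, task_result):
        # Skip the pre-fail notification when a retry is plausible:
        # pre-run playbook failures and unreachable hosts normally lead
        # to a retry, and an early gate reset for a retried job would
        # just cause flapping.
        if phase == "pre":
            return False
        if task_result.get("unreachable"):
            return False
        return task_result.get("failed", False)


    assert should_report_early_failure("run", {"failed": True})
    assert not should_report_early_failure("pre", {"failed": True})
    assert not should_report_early_failure("run", {"unreachable": True})
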
Simon Westphahl 7992ff08c6
Log item related messages with event id
A few messages in the scheduler so far were not using the annotated
logger. This means that those messages could not be easily correlated
via the event id. Improve this by using the annotated logger when
possible.

Change-Id: I4378e6db7e7fb0af2c29e1a226343b25de2ce508
2023-06-22 11:20:02 +02:00
James E. Blair b913a67889 Deduplicate circular dep jobs in check
If a cycle of dependencies in a dependent pipeline meet certain
conditions and share some jobs, we deduplicate those jobs (i.e.,
we run only one build of the job and share it between multiple
items simultaneously).

This makes the same happen for independent pipelines.  This is
accomplished by searching for identical bundles within the pipeline
and deduplicating the job across bundles.  The deduplication is
no different than dependent pipelines -- we simply store a reference
to the actual build in ZK and use the same Build object in Python.

In dependent pipelines, bundle queue items are removed
simultaneously (they all report at the same time).  However, in
independent pipelines, they may report at different times.  If
a queue item is dequeued while one of its builds is referenced by
a different queue item, then the second queue item will begin to
have errors loading information from ZK.

To deal with this, we utilize a property of how we store items in
ZK: the QueueItem itself is stored underneath the pipeline object
with no reference to its place in the queue.  Separately there are
queue lists which reference the QueueItem zobjects by path.

We will continue to remove the QueueItem from the Queue, but we will
leave it in place in ZK if it is possible that it may be referenced
by other queue items.  Then we will rely more heavily on a garbage
collection system to remove the QueueItems once they, and their builds,
are no longer referenced by any item in the pipeline.

The existing garbage collection happens after loading the pipeline
state from ZK, but this change moves it to the end of pipeline
processing so that we are more likely to write out empty pipelines
at the end of the process once all items are removed.

Change-Id: I7fc897088d720f483e320431fefaa06425a33c86
2023-04-20 16:10:22 -07:00
Simon Westphahl 711e1e5c98
Add missing event id to management events
The change management events via Zuul web and the command socket did not
have an event ID assigned. This made it harder to debug issues where we
need to find the logs related to a certain action.

Change-Id: I05ccbc13c7f906f91e13fb66e4a01a51fc822676
2023-04-14 08:27:29 +02:00
Zuul bb648c8b7d Merge "Match events to pipelines based on topic deps" 2023-02-27 21:49:42 +00:00
James E. Blair 9229e1a774 Match events to pipelines based on topic deps
We distribute tenant events to pipelines based on whether the event
matches the pipeline (ie, patchset-created for a check pipeline) or
if the event is related to a change already in the pipeline.  The
latter condition means that pipelines notice quickly when dependencies
are changed and they can take appropriate action (such as ejecting
changes which no longer have correct dependencies).

For git and commit dependencies, an update to a cycle to add a new
change requires an update to at least one existing change (for example,
adding a new change to a cycle usually requires at least two Depends-On
footers: the new change, as well as one of the changes already in the
cycle).  This means that updates to add new changes to cycles quickly
come to the attention of pipelines.

However, it is possible to add a new change to a topic dependency cycle
without updating any existing changes.  Merely uploading a new change
in the same topic adds it to the cycle.  Since that new change does
not appear in any existing pipelines, pipelines won't notice the update
until their next natural processing cycle, at which point they will
refresh dependencies of any changes they contain, and they will notice
the new dependency and eject the cycle.

To align the behavior of topic dependencies with git and commit
dependencies, this change causes the scheduler to refresh the
dependencies of the change it is handling during tenant trigger event
processing, so that it can then compare that change's dependencies
to changes already in pipelines to determine if this event is
potentially relevant.

This moves some work from pipeline processing (which is highly parallel)
to tenant processing (which is only somewhat parallel).  This could
slow tenant event processing somewhat.  However, the work is
persisted in the change cache, and so it will not need to be repeated
during pipeline processing.

This is necessary because the tenant trigger event processor operates
only with the pipeline change list data; it does not perform a full
pipeline refresh, so it does not have access to the current queue items
and their changes in order to compare the event change's topic with
currently enqueued topics.

There are some alternative ways we could implement this if the additional
cost is an issue:

1) At the beginning of tenant trigger event processing, using the change
   list, restore each of the queue's change items from the change cache
   and compare topics.  For large queues, this could end up generating
   quite a bit of ZK traffic.
2) Add the change topic to the change reference data structure so that
   it is stored in the change list.  This is an abuse of this structure
   which otherwise exists only to store the minimum amount of information
   about a change in order to uniquely identify it.
3) Implement a PipelineTopicList similar to a PipelineChangeList for storing
   pipeline topics and accessing them without a full refresh.

Another alternative would be to accept the delayed event handling of topic
dependencies and elect not to "fix" this behavior.

Change-Id: Ia9d691fa45d4a71a1bc78cc7a4bdec206cc025c8
2023-02-17 10:12:14 -08:00
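
A rough sketch of the relevance check, with illustrative names: the
event change's dependencies are refreshed during tenant trigger event
processing and compared against a pipeline's stored change keys.

    def event_is_relevant(event_change, refresh_dependencies,
                          pipeline_change_keys):
        # Refresh the event change's dependencies up front; the result
        # is persisted in the change cache so pipeline processing does
        # not repeat the work.
        deps = refresh_dependencies(event_change)
        # The event is relevant if the change or any of its
        # dependencies is already in the pipeline's change list.
        candidates = {event_change} | set(deps)
        return bool(candidates & set(pipeline_change_keys))


    print(event_is_relevant("I123", lambda c: ["Iabc"], ["Iabc", "Idef"]))
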
Simon Westphahl b23f76e677
Update reconfig event ltime on (smart) reconfig
Make sure to update the last reconfig event ltime in the layout state
during a (smart) reconfig. This is important in order to unblock
pipelines when a tenant reconfig event is lost.

So far the last reconfig event ltime was passed as -1, but wasn't set in
the layout state since the ltime is not allowed to go backward.

Change-Id: Iab04a962abbfbe901c22e4d5f1d484e3f53b0d33
2023-02-17 08:39:07 +01:00
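
A small sketch of the monotonic-update rule mentioned above
(simplified names): the stored ltime only moves forward, which is why
passing -1 never updated it and the reconfig must supply the real
event ltime.

    class LayoutState:
        def __init__(self):
            self.last_reconfig_event_ltime = -1

        def updateLastReconfigEventLtime(self, ltime):
            # The ltime is never allowed to go backward, so -1 is a
            # no-op; a (smart) reconfig therefore has to pass the real
            # current ltime to unblock pipelines after a lost event.
            self.last_reconfig_event_ltime = max(
                self.last_reconfig_event_ltime, ltime)


    state = LayoutState()
    state.updateLastReconfigEventLtime(-1)       # no effect
    state.updateLastReconfigEventLtime(1234)     # moves forward
    assert state.last_reconfig_event_ltime == 1234
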
Zuul 8c673e0320 Merge "Expand the scope of zkprofile" 2023-02-15 08:50:40 +00:00
Zuul 54ec0b2df7 Merge "Add more log messages for run handler steps" 2023-02-15 08:50:37 +00:00
Zuul af96e5786f Merge "Add scheduler run handler metric" 2023-02-15 08:43:19 +00:00
James E. Blair 3a829196b9 Expand the scope of zkprofile
Currently the zkprofile command only profiles the pipeline refresh
operations.  This expands the scope to include all pipeline operations
(refreshing and then any subsequent changes).  In particular, this
will help identify places where we write too much data.

Change-Id: Ida7bcb311b5037f6ce15a166ddeae6db44ed6709
2023-02-14 16:32:51 -08:00
James E. Blair 98dcd51d90 Fix race condition in pipeline change list init
Simon Westphahl describes the race condition:

> [The race condition] can occur after a reconfiguration while
> some schedulers are updating their local layout and some
> already start processing pipelines in the same tenant.
>
> In this case the pipeline manager's `_postConfig()` method that
> calls `PipelineChangeList.create(...)` races with the pipeline
> processor updating the change keys.
>
> This leads to two change lists being written as separate
> shards, that can't be correctly loaded, as all shards combined
> are expected to form a single JSON object.
>
> The sequence of events seems to be the following:
> 1. S1: pipeline processor needs to update the change keys
> 2. S1 the shard writer will delete the `change_list` key with the old
>    shards
> 3. S2: configloader calls the `_postConfig()` method
> 4. S2: `PipelineChangeList.create()` notices that the `change_list` node
>    doesn't exist in Zookeeper:
>    https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L921
> 6. S2: the shard writer creates the first shard `0000000000`
> 7. S1: the shard writer creates the second shard `0000000001`
>
> The race condition seems to be introduced with
> Ib1e467b5adb907f93bab0de61da84d2efc22e2a7

That change updated the pipeline manager _postConfig method so
that it no longer acquires the pipeline lock when initializing the
pipeline state and change lists.  This greatly reduces potential
delays during reconfiguration, but has, perhaps predictably, led
to the race condition above.

In the commit message for that change, I noted that we might be
able to avoid even more work if we accept some caveats related to
error reporting.  Avoiding that work means avoiding performing any
writes during _postConfig, which addresses the root cause of the
race condition (steps 3-6 above; Ed. note: there is no step 5).

From the commit message:

> We still need to attach new state and change list objects to
> the pipeline in _postConfig (since our pipeline object is new).
> We also should make sure that the objects exist in ZK before we
> leave that method, so that if a new pipeline is created, other
> schedulers will be able to load the (potentially still empty)
> objects from ZK.  As an alternative, we could avoid even this
> work in _postConfig, but then we would have to handle missing
> objects on refresh, and it would not be possible to tell if the
> object was missing due to it being new or due to an error.  To
> avoid masking errors, we keep the current expectation that we
> will create these objects in ZK on the initial reconfiguration.

The current change does exactly that.  We no longer perform any
ZK write operations on the state and change list objects in
_postConfig.  Instead, inside of the refresh methods, we detect
the cases where they should be newly created and do so at that
time.  This happens with the pipeline lock, so is safe against
any simultaneous operation from other components.

There will be "ERROR" level log messages indicating that reading
the state from ZK has failed when these objects are first
initialized.  To indicate that this is probably okay, they will
now be immediately followed by "WARNING" level messages explaining
that.

Strictly speaking, this particular race should only occur for the
change list object, not the pipeline state, since the race
condition above requires a sharded object and of the two, only
the change list is sharded.  However, to keep the lifecycle of
these two objects matched (and to simplify _postConfig) the same
treatment is applied to both.

Note that change I7fa99cd83a857216321f8d946fd42abd9ec427a3 merged
after Ib1e467b and changed the behavior slightly, introducing the
old_state and old_list arguments.  Curiously, the old_list
argument is effectively unused, so it is removed entirely in this
change.  Old_state still has a purpose and is retained.

Change-Id: I519348e7d5d74e675808e990920480fb6e1fb981
2023-02-10 15:03:08 -08:00
Simon Westphahl ec5d50d177
Add more log messages for run handler steps
This change adds distinct log messages for certain run handler steps.
So far some of those phases could only be inferred from other related
log messages.

Change-Id: Icceca78456b20754cbebb5747da3a538586cd5f5
2023-02-06 08:06:22 +01:00
Simon Westphahl 95ecb41c51
Add scheduler run handler metric
In order to better understand the scheduler run handler performance this
change adds a new `zuul.scheduler.run_handler` metric to measure the
duration of one run handler loop.

Change-Id: I77e862cf99d6a8445e71d7daab410d5853487dc3
2023-02-06 08:05:41 +01:00
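
A rough sketch of the new metric, with a stand-in statsd client: one
run handler iteration is timed and emitted as
zuul.scheduler.run_handler.

    import time


    class FakeStatsd:
        # Stand-in for the real statsd client.
        def timing(self, key, ms):
            print(f"{key}: {ms:.1f}ms")


    statsd = FakeStatsd()


    def run_handler_once(process_tenants):
        start = time.monotonic()
        try:
            process_tenants()
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            statsd.timing("zuul.scheduler.run_handler", elapsed_ms)


    run_handler_once(lambda: time.sleep(0.01))
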
Simon Westphahl 9048706d93
Cleanup deleted pipelines and event queues
When a pipeline is removed during a reconfiguration Zuul will cancel
active builds and node requests. However, since we no longer refresh the
pipeline state during a reconfig we can run into errors when Zuul tries
to cancel builds and node requests based on a possibly outdated pipeline
state.

2023-01-17 10:41:32,223 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2007, in run
    self.process_tenant_management_queue(tenant)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2452, in process_tenant_management_queue
    self._process_tenant_management_queue(tenant)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2462, in _process_tenant_management_queue
    self._doTenantReconfigureEvent(event)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1533, in _doTenantReconfigureEvent
    self._reconfigureTenant(ctx, min_ltimes,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1699, in _reconfigureTenant
    self._reconfigureDeletePipeline(old_pipeline)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1804, in _reconfigureDeletePipeline
    self.cancelJob(build.build_set, build.job,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2930, in cancelJob
    build.updateAttributes(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 193, in updateAttributes
    self._save(context, serial)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 392, in _save
    zstat = self._retry(context, self._retryableSave,
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 314, in _retry
    return kazoo_retry(func, *args, **kw)
  File "/opt/zuul/lib/python3.10/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 371, in _retryableSave
    zstat = context.client.set(path, compressed_data,
  File "/opt/zuul/lib/python3.10/site-packages/kazoo/client.py", line 1359, in set
    return self.set_async(path, value, version).get()
  File "/opt/zuul/lib/python3.10/site-packages/kazoo/handlers/utils.py", line 86, in get
    raise self._exception
kazoo.exceptions.BadVersionError

To fix this we need to refresh the pipeline state prior to canceling
those active builds and node requests.

We will also take care of removing the pipeline state and the event
queues from Zookeeper if possible. Errors will be ignored as the
periodic cleanup task takes care of removing leaked pipelines.

Change-Id: I2986419636d8c6557d68d65fb6aff589aa4a680e
2023-01-24 10:25:04 +01:00
Simon Westphahl a2b114e1a3
Periodically cleanup leaked pipeline state
So far we did not cleanup the pipeline state and event queues of deleted
pipelines. To fix that we'll remove the data of pipelines that are no
longer part of the tenant layout in a periodic cleanup task as part of
the general cleanup.

To support the use-case when a pipeline is added back we also need to
initialize the event queues during a tenant reconfiguration. The local
event queue registry usually takes care of creating the queues when the
even queue is first accessed. However, the old event queue object could
still be cached in the registry when we remove and re-add a pipeline.

For the same use-case we also need to remove the pipeline from the list
of watches in the event watcher. Otherwise we won't re-create the
children watch when the pipeline is added.

Change-Id: I02127fe462cc390c81330e717be55780bc2535eb
2023-01-24 10:25:04 +01:00
Simon Westphahl 19668e0bc7
Require latest layout for processing mgmt events
A scheduler should only be allowed to process the management event
queue for a tenant when the local layout is up-to-date. This is already
the case for the tenant trigger event queue.

Otherwise the scheduler could forward management events to pipelines that
no longer exist.

Change-Id: I3c485b9e54a65e7c1c09dc51b1f2b4c83c45de19
2023-01-24 10:25:04 +01:00
Zuul cb40ddc7db Merge "Store pause and resume events on the build and report them" 2023-01-02 20:57:37 +00:00
Felix Edel f9786ac2a8 Store pause and resume events on the build and report them
When a build is paused or resumed, we now store this information on the
build together with the event time. Instead of additional attributes for
each timestamp, we add an "event" list attribute to the build which can
also be used for other events in the future.

The events are stored in the SQL database and added to the MQTT payload
so the information can be used by the zuul-web UI (e.g. in the "build
times" gantt chart) or provided to external services.

Change-Id: I789b4f69faf96e3b8fd090a2e389df3bb9efd602
2023-01-02 10:07:26 +01:00
Zuul 35c68169bd Merge "Fix deduplication exceptions in pipeline processing" 2022-12-20 01:48:36 +00:00
James E. Blair fe04739c78 Reuse queue items after reconfiguration
When we reconfigure, we create new Pipeline objects, empty the values
in the PipelineState and then reload all the objects from ZK.  We then
re-enqueue all the QueueItems to adjust and correct the object
pointers between them (item_ahead and items_behind).  We can avoid
reloading all the objects from ZK if we keep queue items from the
previous layout and rely on the re-enqueue method correctly resetting
any relevant object pointers.

We already defer this re-enqueue work to the next pipeline processing
after a reconfiguration (so the reconfiguration itself doesn't take
very long, but now the first pipeline run after a reconfiguration must
perform a complete refresh).  With this change, that first refresh
is no longer a complete refresh but a normal refresh, so we will get
the benefits of previous reductions in refresh times.

The main risk of this change is that it could introduce a memory leak.
During development, additional debugging was performed to verify that
after a re-enqueue, there are no obsolete layout or pipeline objects
reachable from the pipeline state object.

On schedulers where a re-enqueue does not take place (these schedulers
would simply see the layout update and re-create their PipelineState
python objects and refresh them after another scheduler has already
performed the re-enqueue), we need to ensure that we update any
internal references to Pipeline objects (which then lead to Layout
objects and can cause memory leaks).

To address that, we update the pipeline references in the ChangeQueue
instances underneath a given PipelineState when that state is being
reset after a reconfiguration.

This change also removes the pipeline reference from the QueueItem,
replacing it with a property that uses the pipeline reference
on the ChangeQueue instead.  This removes one extra place where
an incorrect reference could cause a memory leak.

Change-Id: I7fa99cd83a857216321f8d946fd42abd9ec427a3
2022-12-13 13:19:48 -08:00
Zuul e0bd0b6ab0 Merge "Avoid acquiring pipeline locks in manager postConfig" 2022-12-13 18:07:16 +00:00
James E. Blair e25795a120 Avoid acquiring pipeline locks in manager postConfig
After creating a new tenant layout, we call _postConfig on every
pipeline manager.  That creates the shared change queues and then
also attaches new PipelineState and PipelineChangeList objects
to the new Pipeline objects on the new layout.

If the routine detects that it is the first scheduler to deal with
this pipeline under the new layout UUID, it also resets the
pipeline state in ZK (resets flags and moves all queues to the
old_queues attribute so they will be re-enqueued).  It also drops
a PostConfigEvent into the pipeline queues to trigger a run of
the pipeline after the re-enqueues.

The work in the above paragraph means that we must hold the
pipeline lock for each pipeline in turn during the reconfiguration.
Most pipelines should not be processed at this point since we do
hold the tenant write lock, however, some cleanup routines can
be operating, and so we would end up waiting for them to complete
before completing the reconfiguration.  This could end up adding
minutes to a reconfiguration.  Incidentally, these cleanup routines
are written with the expectation that they may be running during
a reconfiguration and handle missing data from refreshes.

We can avoid this by moving the "reset" work into the PipelineState
deserialization method, where we can determine at the moment we
refresh the object whether we need to "reset" it and do so.  We
can tell that a reset needs to happen if the layout uuid of the
state object does not match the running layout of the tenant.

We still need to attach new state and change list objects to the
pipeline in _postConfig (since our pipeline object is new).  We
also should make sure that the objects exist in ZK before we leave
that method, so that if a new pipeline is created, other schedulers
will be able to load the (potentially still empty) objects from ZK.
As an alternative, we could avoid even this work in _postConfig, but
then we would have to handle missing objects on refresh, and it
would not be possible to tell if the object was missing due to it
being new or due to an error.  To avoid masking errors, we keep
the current expectation that we will create these objects in ZK
on the initial reconfiguration.

Finally, we do still want to run the pipeline processing after
a reconfiguration (at least for now -- we may be approaching a
point where that will no longer be necessary).  So we move the
emission of the PostConfigEvent into the scheduler in the cases
where we know it has just updated the tenant layout.

Change-Id: Ib1e467b5adb907f93bab0de61da84d2efc22e2a7
2022-12-12 13:24:22 -08:00
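
A simplified sketch of moving the reset into deserialization
(attribute names are illustrative): when the state's layout uuid
differs from the tenant's running layout uuid, the refresh itself
performs the reset while the pipeline lock is held.

    class PipelineState:
        def __init__(self):
            self.layout_uuid = None
            self.queues = []
            self.old_queues = []

        def refresh(self, zk_data, tenant_layout_uuid):
            # Load whatever is currently in ZK.
            self.__dict__.update(zk_data)
            if self.layout_uuid != tenant_layout_uuid:
                # First refresh under a new layout: reset here, under
                # the pipeline lock, instead of in _postConfig.
                self.old_queues = self.old_queues + self.queues
                self.queues = []
                self.layout_uuid = tenant_layout_uuid


    state = PipelineState()
    state.refresh({"layout_uuid": "old", "queues": ["q1"],
                   "old_queues": []},
                  tenant_layout_uuid="new")
    assert state.queues == [] and state.old_queues == ["q1"]
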
Simon Westphahl 71598ee51b
Allow clean scheduler shutdown when priming fails
If config priming failed we need to unblock the main thread by setting
the primed event so we can join the main thread during shutdown.

So far this was only done after trying to join the main thread, so we
need to set the primed event earlier.

Change-Id: I5aef9a215cfd53baa94e525f8c303592a6c7e4b8
2022-12-01 07:32:55 +01:00
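
A minimal sketch, under assumed names, of why the primed event must be
set even on failure: the main thread blocks on it, and shutdown needs
to join that thread.

    import threading


    class MiniScheduler:
        def __init__(self):
            self.primed = threading.Event()

        def primeConfig(self, load_config):
            try:
                load_config()
            except Exception:
                print("config priming failed")
            finally:
                # Set the event even on failure so the main thread,
                # which waits on it, can still be joined at shutdown.
                self.primed.set()

        def run(self):
            self.primed.wait()     # main loop waits for priming


    def broken_loader():
        raise RuntimeError("boom")


    sched = MiniScheduler()
    main = threading.Thread(target=sched.run)
    main.start()
    sched.primeConfig(broken_loader)
    main.join()                    # clean shutdown despite the failure
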
James E. Blair 279d7fb5cd
Fix deduplication exceptions in pipeline processing
If a build is to be deduplicated and has not started yet and has
a pending node request, we store a dictionary describing the target
deduplicated build in the node_requests dictionary on the buildset.

There were a few places where we directly accessed that dictionary
and assumed the results would be the node request id.  Notably, this
could cause an error in pipeline processing (as well as potentially
some other edge cases such as reconfiguring).

Most of the time we can just ignore deduplicated node requests since
the "real" buildset will take care of them.  This change enriches
the API to help with that.  In other places, we add a check for the
type.

To test this, we enable relative_priority in the config file which
is used in the deduplication tests, and we also add an assertion
which runs at the end of every test that ensures there were no
pipeline exceptions during the test (almost all the existing dedup
tests fail this assertion before this change).

Change-Id: Ia0c3f000426011b59542d8e56b43767fccc89a22
2022-11-21 09:22:25 +01:00
Zuul 57451d04b9 Merge "Add extra logging around tenant reconfiguration events" 2022-11-10 16:29:15 +00:00
Zuul ed013d82cc Merge "Parallelize some pipeline refresh ops" 2022-11-10 15:01:09 +00:00
Zuul fe49687f29 Merge "Add playbook semaphores" 2022-11-10 12:32:07 +00:00
James E. Blair 3a981b89a8 Parallelize some pipeline refresh ops
We may be able to speed up pipeline refreshes in cases where there
are large numbers of items or jobs/builds by parallelizing ZK reads.

Quick refresher: the ZK protocol is async, and kazoo uses a queue to
send operations to a single thread which manages IO.  We typically
call synchronous kazoo client methods which wait for the async result
before returning.  Since this is all thread-safe, we can attempt to
fill the kazoo pipe by having multiple threads call the synchronous
kazoo methods.  If kazoo is waiting on IO for an earlier call, it
will be able to start a later request simultaneously.

Quick aside: it would be difficult for us to use the async methods
directly since our overall code structure is still ordered and
effectively single threaded (we need to load a QueueItem before we
can load the BuildSet and the Builds, etc).

Thus it makes the most sense for us to retain our ordering by using
a ThreadPoolExecutor to run some operations in parallel.

This change parallelizes loading QueueItems within a ChangeQueue,
and also Builds/Jobs within a BuildSet.  These are the points in
a pipeline refresh tree which potentially have the largest number
of children and could benefit the most from the change, especially
if the ZK server has some measurable latency.

Change-Id: I0871cc05a2d13e4ddc4ac284bd67e5e3003200ad
2022-11-09 10:51:29 -08:00
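
A hedged sketch of the parallelization pattern, with a placeholder
read function rather than the kazoo API: blocking per-item reads are
fanned out over a ThreadPoolExecutor while results come back in the
caller's original order.

    from concurrent.futures import ThreadPoolExecutor


    def read_item(path):
        # Placeholder for a synchronous ZooKeeper read (e.g. a blocking
        # client call); each call waits on IO independently.
        return f"data-for-{path}"


    def refresh_queue_items(paths, max_workers=4):
        # Fan the blocking reads out over a few threads so the single
        # ZK IO thread can keep its pipe full, then return results in
        # the original order to preserve our ordered processing model.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(read_item, paths))


    print(refresh_queue_items(["item-0001", "item-0002", "item-0003"]))
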
Zuul 66c0865985 Merge "Don't refresh the pipeline state during reconfig" 2022-11-09 07:50:24 +00:00
James E. Blair faa519520e Update help text for zkprofile command
The help text for this command should indicate that it toggles
rather than simply enables profiling.

Change-Id: I65cc16b745f086450463d44fe62c874471349708
2022-11-08 16:12:37 -08:00
James E. Blair c355adf44e Add playbook semaphores
This adds the ability to specify that the Zuul executor should
acquire a semaphore before running an individual playbook.  This
is useful for long running jobs which need exclusive access to
a resource for only a small amount of time.

Change-Id: I90f5e0f570ef6c4b0986b0143318a78ddc27bbde
2022-11-07 08:41:10 -08:00