245 Commits

Author SHA1 Message Date
Zuul 239a4b9142 Merge "Only return the latest config for project-branch" 2024-03-11 10:57:17 +00:00
Simon Westphahl 76882e1b3a
Only return the latest config for project-branch
In addition to the safeguard in
Iebf49a9efe193788199197bf7846e336d96edf19 we will only return the final
config for a project-branch as part of the merge result.

Change-Id: I1eb3b75d8762aff4e1ebd057661869df985a79e2
2024-03-05 09:01:57 +01:00
James E. Blair 3e4caaac4b Produce consistent merge commit shas
Use a fixed timestamp and merge message so that zuul mergers
produce the exact same commit sha each time they perform a merge
for a queue item.  This can help correlate git repo states for
different jobs in the same change as well as across different
changes in the case of a dependent change series.

The timestamp used is the "configuration time" of the queue item
(ie, the time the buildset was created or reset).  This means
that it will change on gate resets (which could be useful for
distinguishing one run of a build from another).

Change-Id: I3379b19d77badbe2a2ec8347ddacc50a2551e505
2024-02-26 16:32:46 -08:00
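Note on 3e4caaac4b: a git commit id is the SHA-1 of the commit object's contents, which include the author and committer timestamps and the message. The sketch below illustrates why fixing those two inputs makes the sha reproducible. It computes the id by hand rather than using Zuul's merger code; the identity string, the timestamp values, and the function name are assumptions made for illustration.

```python
import hashlib


def commit_sha(tree, parents, ident, timestamp, message):
    """Compute the id git would assign to a commit object with these fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    # Author and committer share one identity and one fixed timestamp (UTC).
    stamp = f"{ident} {timestamp} +0000"
    lines += [f"author {stamp}", f"committer {stamp}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    # Git hashes "commit <size>\0" followed by the object body.
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


# The empty tree id is a well-known git constant; the parents are
# arbitrary shas used only as sample input.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
PARENTS = [
    "39fead1852ef01a716a1c6470cee9e4197ff5587",
    "8a87ff7cc0d0c73ac14217b653f9773a7cfce3a7",
]
```

Two mergers that use the same tree, parents, identity, timestamp, and message get the same id; changing only the timestamp (as happens on a gate reset) yields a different one.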
James E. Blair 1f026bd49c Finish circular dependency refactor
This change completes the circular dependency refactor.

The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.

In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.

In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.

Previously changes were enqueued recursively and then
bundles were made out of the resulting items.  Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.

Some tests exercise situations where Zuul is processing
events for old patchsets of changes.  The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.

This change is not backwards compatible with the existing
ZK schema, and will require Zuul systems to delete all pipeline
states during the upgrade.  A later change will implement
a helper command for this.

All backwards compatibility handling for the last several
model_api versions which was added to prepare for this
upgrade has been removed.  In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).

Job deduplication is simplified and now only needs to
consider jobs within the same buildset.

The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid.  This has been corrected, and
in doing so, has necessitated some changes around quiet dequeuing
when we re-enqueue a change.

The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.

Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
2024-02-09 07:39:40 -08:00
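Note on 1f026bd49c: once a queue item can hold several changes, a job name alone can match more than one job in a buildset. The sketch below shows the two lookups the message describes, by uuid or by the (job name, change) tuple. The class and field names are hypothetical and are not Zuul's model classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenJobSketch:
    # Hypothetical stand-in for a frozen job record.
    uuid: str
    name: str
    change: str


class JobIndex:
    """Index jobs by uuid, with a secondary (name, change) lookup."""

    def __init__(self, jobs):
        self.by_uuid = {job.uuid: job for job in jobs}
        self.by_name_change = {(job.name, job.change): job for job in jobs}

    def names(self):
        # Names may repeat across the changes of one item.
        return [job.name for job in self.by_uuid.values()]


jobs = [
    FrozenJobSketch("u1", "unit-tests", "change-A"),
    FrozenJobSketch("u2", "unit-tests", "change-B"),
]
index = JobIndex(jobs)
```

Here `index.names()` contains `unit-tests` twice, so only the uuid or the tuple identifies a single job.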
Simon Westphahl 74da022ec9
Fix issue with logging when setting refs async
When refs are set asynchronously we don't supply a logger and expect the
`_setRefs()` to return the log messages. This was not the case in the
exception handler when we can't resolve an object.

In addition to this fix another debug message is now also returned as part
of the message list as expected.

Traceback (most recent call last):
  File "/opt/zuul/lib/python3.11/site-packages/zuul/merger/merger.py", line 553, in _setRefs
    repo.odb.info(binsha)
  File "/opt/zuul/lib/python3.11/site-packages/git/db.py", line 40, in info
    hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/git/cmd.py", line 1384, in get_object_header
    return self.__get_object_header(cmd, ref)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/git/cmd.py", line 1371, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/git/cmd.py", line 1332, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'8683bca8c75c1c3ae07730452d93c736b1e899db' could not be resolved, git returned: b'8683bca8c75c1c3ae07730452d93c736b1e899db missing'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/merger/merger.py", line 532, in setRefsAsync
    messages = Repo._setRefs(repo, refs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/merger/merger.py", line 560, in _setRefs
    log.warning("Unable to resolve reference %s at %s in %s",
    ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'warning'

Change-Id: Ieb54a53f1fe09848da0a40fdb0dfcb445c65eded
2024-01-15 09:31:31 +01:00
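Note on 74da022ec9: the failure mode in the traceback is a call on a logger that is `None` in the asynchronous path. A minimal sketch of the pattern the message describes follows: every message goes through one helper that either logs it or appends it to the returned list. The function name, its arguments, and the set standing in for the object database are assumptions, not the code in `merger.py`.

```python
import logging


def set_refs_sketch(refs, resolvable, log=None):
    """Return (applied, messages); messages is filled only when log is None."""
    messages = []
    applied = {}

    def emit(level, fmt, *args):
        # Single exit point, so no branch can touch a missing logger.
        if log is not None:
            log.log(level, fmt, *args)
        else:
            messages.append((level, fmt % args))

    for path, hexsha in refs.items():
        if hexsha not in resolvable:
            emit(logging.WARNING,
                 "Unable to resolve reference %s at %s", path, hexsha)
            continue
        applied[path] = hexsha
        emit(logging.DEBUG, "Set reference %s at %s", path, hexsha)
    return applied, messages
```

A caller in a worker process can pass no logger and replay the returned `(level, text)` pairs through its own logger afterwards.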
James E. Blair 033470e8b3 Fix repo state restore for zuul role tag override
When a repo that is being used for a zuul role has override-checkout
set to a tag, the job would fail because we did not reconstruct the
tag in our zuul-role checkout; we only did that for branches.

This fixes the repo state restore for any type of ref.

There is an untested code path where a zuul role repo is checked
out to a tag using override-checkout.  Add a test for that (and
also the same for a branch, for good measure).

Change-Id: I36f142cd3c4e7d0b930318dddd7276f3635cc3a2
2023-11-30 10:06:03 -08:00
James E. Blair 9409efae1e Add safety when setting refs
The recent update to make setting refs more efficient could encounter
an edge case if a branch or tag was removed from the upstream repo
after the repo state was retrieved by zuul.  If removing the ref caused
the underlying objects to be removed or not be sent in a fetch, then
our blindly setting the repo state with a ref pointed to an unresolvable
object could leave the repository in a corrupted state.

To recover from any potential corruption that may have somehow happened,
this change adds an additional case where we will remove the underlying
repo and re-clone.

To prevent any such corruption from happening in the first place, we add
a check that each hexsha is resolvable before we set it when restoring
the repo state.  This does add a small amount of overhead, but should be
much less than manipulating the loose refs one at a time.  A copy of nova
with 10,000 refs adds 100ms for this checking.

Change-Id: Ifd298905e634f83a147644d35ff3ea1c143b3d1f
2023-11-20 09:50:47 -08:00
James E. Blair 518194af1d Thin the workspace repo clone
When the executor clones a repo from the cache to the workspace,
it performs a lot of unnecessary work:

* It checks out HEAD and prepares a workspace which we will
  immediately change.
* It copies all of the branch refs, half of which we will immediately
  delete, and in some configurations (exclude_unprotected_branches)
  we will immediately delete most of the rest.  Deleting refs with
  gitpython is much more expensive than creating them.

This change updates the initial clone to do none of those, instead
relying on the repo state restoration to take care of that for us.

Change-Id: Ie8846c48ccd6255953f46640f5559bb41491d425
2023-11-10 06:19:47 -08:00
Simon Westphahl 810191b60e
Select correct merge method for Github
Starting with Github Enterprise 3.8[0] and github.com from September
2022 on[1], the merge strategy changed from using merge-recursive to
merge-ort[0].

The merge-ort strategy is available in the Git client since version
2.33.0 and became the default in 2.34.0[2].

If not configured otherwise, we've so far used the default merge
strategy of the Git client (which varies depending on the client
version). With this change, we are now explicitly choosing the default
merge strategy based on the Github version. This way, we can reduce
errors resulting from the use of different merge strategies in Zuul and
Github.

Since the newly added merge strategies must be understood by the mergers
we also need to bump the model API version.

[0] https://docs.github.com/en/enterprise-server@3.8/admin/release-notes
[1] https://github.blog/changelog/2022-09-12-merge-commits-now-created-using-the-merge-ort-strategy/
[2] https://git-scm.com/docs/merge-strategies#Documentation/merge-strategies.txt-recursive

Change-Id: I354a76fa8985426312344818320980c67171d774
2023-10-24 07:15:39 +02:00
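Note on 810191b60e: the selection rule in the message can be written as a small function. This is a sketch under the stated version boundaries only; the function name and the tuple representation of the GitHub Enterprise version are assumptions, and Zuul's driver may represent this differently.

```python
def default_merge_strategy(ghe_version=None):
    """Pick the git merge strategy matching the GitHub side.

    ghe_version is a (major, minor) tuple for GitHub Enterprise,
    or None for github.com.
    """
    if ghe_version is None:
        # github.com has used merge-ort since September 2022.
        return "ort"
    return "ort" if tuple(ghe_version) >= (3, 8) else "recursive"


def merge_argv(strategy, commit):
    # "git merge -s <strategy>" selects the strategy explicitly
    # instead of relying on the client's version-dependent default.
    return ["git", "merge", "-s", strategy, commit]
```

Comparing tuples rather than strings keeps a version such as 3.10 ordered after 3.8.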
Flavio Percoco 334f34bbf4 Redact remote URL before logging
The Github driver prints un-sanitized URLs to log files. This PR
uses the redact_url function to sanitize the remote URL before
logging.

Task: #48344
Signed-off-by: Flavio Percoco <flavio@pacerevenue.com>
Change-Id: Id725c8dfe3e4e782c293ff350fc7e35b23d377ab
2023-08-29 11:30:18 +01:00
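Note on 334f34bbf4: the message names a `redact_url` function but does not show it. The sketch below is a stand-in written with the standard library to illustrate the idea of masking credentials before a remote URL reaches a log line; the function name and the `******` placeholder are assumptions and may not match Zuul's output.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_credentials(url):
    """Replace any user:password part of a URL with a fixed mask."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url
    # Everything after the last "@" is the host (and optional port).
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit(parts._replace(netloc=f"******@{host}"))
```

URLs without credentials pass through unchanged, so the helper can be applied to every remote URL before logging.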
Simon Westphahl c963526560
Add Zuul event id to merge completed events
Return the Zuul event ID that is already part of the merge request with
the merge result event so logs can be correlated.

Change-Id: I018709cd4d4afa562e6851d0d52c1ddd7583dc62
2023-08-08 12:02:36 +02:00
Simon Westphahl 183d221124
Log commit SHA when getting files from repo
From the previous log messages related to a cat job it wasn't clear
which HEAD SHA was used to get the file content.

This change adds a log message to the `getFiles()` method that contains
the HEAD commit SHA.

Change-Id: I02a3a97f9b3dfa70f6e55954ea6ef365289f0046
2023-03-24 11:34:48 +01:00
Joshua Watt 28428942f4 merger: Keep redundant cherry-pick commits
In normal git usage, cherry-picking a commit that has already been
applied and doesn't do anything or cherry-picking an empty commit causes
git to exit with an error to let the user decide what they want to do.
However, this doesn't match the behavior of merges and rebases where
non-empty commits that have already been applied are simply skipped
(empty source commits are preserved).

To fix this, add the --keep-redundant-commits option to `git cherry-pick`
to make git always keep a commit when cherry-picking even when it is
empty for either reason. Then, after the cherry-pick, check if the new
commit is empty and if so back it out if the original commit _wasn't_
empty.

This two step process is necessary because git doesn't have any options
to simply skip cherry-pick commits that have already been applied to the
tree.

Removing commits that have already been applied is particularly
important in a "deploy" pipeline triggered by a Gerrit "change-merged"
event, since the scheduler will try to cherry-pick the change on top of
the commit that just merged. Without this option, the cherry-pick will
fail and the deploy pipeline will fail with a MERGE_CONFLICT.

Change-Id: I326ba49e2268197662d11fd79e46f3c020675f21
2023-03-01 16:22:17 -06:00
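Note on 28428942f4: the second step of the two-step process is a decision on two booleans. The sketch below captures only that decision; how emptiness is detected and how the commit is backed out are not stated in the message, so the comments about them are assumptions.

```python
def keep_cherry_picked_commit(original_is_empty, result_is_empty):
    """Decide whether the commit created by the cherry-pick should stay.

    Step 1 (not shown): git cherry-pick --keep-redundant-commits <sha>
    always produces a commit, even when it ends up empty.
    """
    if not result_is_empty:
        # The cherry-pick changed the tree: a normal result.
        return True
    # An empty result is kept only if the source commit was itself
    # empty; otherwise the change was already applied and the new
    # commit is redundant (one assumed way to back it out would be
    # resetting the branch to its parent).
    return original_is_empty
```

This matches the merge and rebase behavior described above: already-applied commits are skipped, empty source commits are preserved.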
Simon Westphahl ac534577c3
Cleanup old rebase-merge dirs on repo reset
When using the rebase merge-mode a failed "merge" will leave the repo in
a state that Zuul so far could not recover from. The rebase will create
a `.git/rebase-merge` directory which is not removed when the rebase
fails.

To fix this we will abort the rebase when it fails and also remove any
existing `.git/rebase-merge` and `.git/rebase-apply` directories when
resetting the repository.

DEBUG zuul.Merger: [e: ...] Unable to merge {'branch': 'master', 'buildset_uuid': 'f7be4215f37049b4ba0236892a5d8197', 'connection': 'github', 'merge_mode': 5, 'newrev': None, 'number': 71, 'oldrev': None, 'patchset': 'e81d0b248565db290b30d9a638095947b699c76d', 'project': 'org/project', 'ref': 'refs/pull/71/head'}
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 1099, in _mergeChange
    commit = repo.rebaseMerge(
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 626, in rebaseMerge
    repo.git.rebase(*args)
  File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 542, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 1005, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 822, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git rebase 39fead1852ef01a716a1c6470cee9e4197ff5587
  stderr: 'fatal: It seems that there is already a rebase-merge directory, and
I wonder if you are in the middle of another rebase.  If that is the
case, please try
    git rebase (--continue | --abort | --skip)
If that is not the case, please
    rm -fr ".git/rebase-merge"
and run me again.  I am stopping in case you still have something
valuable there.

Change-Id: I8518cc5e4b3f7bbfc2c2283a2b946dee504991dd
2023-02-17 10:13:31 +01:00
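Note on ac534577c3: the reset-time cleanup amounts to deleting two directories if they exist. A minimal sketch, assuming a non-bare repository whose git directory is `.git` under the given path; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def cleanup_rebase_state(repo_path):
    """Remove leftover rebase state directories; return what was removed."""
    removed = []
    for name in ("rebase-merge", "rebase-apply"):
        path = Path(repo_path) / ".git" / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

Calling it on a clean repository is a no-op, so it can run on every reset.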
Simon Westphahl 0066427084
Correctly set the repo remote URL
I8e1b5b26f03cb75727d2b2e3c9310214a3eac447 introduced a regression that
prevented us from re-cloning a repo that no longer exists on the file
system (e.g. deleted by an operator) but where we still have the cached
`Repo` object.

The problem was that we only updated the remote URL of the repo object
after we wrote it to the Git config. Unfortunately, if the repo no
longer existed on the file system we would attempt to re-clone it with a
possibly outdated remote URL.

`test_set_remote_url` is a regression test for the issue described
above.

`test_set_remote_url_invalid` verifies that the original issue is fixed,
where we updated the remote URL attribute of the repo object, but failed
to update the Git config.

Change-Id: I311842ccc7af38664c28177450ea9e80e1371638
2022-12-07 14:54:03 +01:00
Simon Westphahl b17dfc13ed
Cleanup leaked git index.lock files on checkout
When the git command crashes or is aborted due to a timeout we might end
up with a leaked index.lock file in the affected repository.

This has the effect that all subsequent git operations that try to
create the lock will fail. Since Zuul maintains a separate lock for
serializing operations on a repository, we can be sure that the lock
file was leaked in a previous operation and can be removed safely.

Unable to checkout 8a87ff7cc0d0c73ac14217b653f9773a7cfce3a7
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 1045, in _mergeChange
    repo.checkout(ref, zuul_event_id=zuul_event_id)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 561, in checkout
    repo.head.reset(working_tree=True)
  File "/opt/zuul/lib/python3.10/site-packages/git/refs/head.py", line 82, in reset
    self.repo.git.reset(mode, commit, '--', paths, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 542, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 1005, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 822, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git reset --hard HEAD --
  stderr: 'fatal: Unable to create '/var/lib/zuul/merger-git/github/foo/foo%2Fbar/.git/index.lock': File exists.
  Another git process seems to be running in this repository, e.g.
  an editor opened by 'git commit'. Please make sure all processes
  are terminated then try again. If it still fails, a git process
  may have crashed in this repository earlier:
  remove the file manually to continue.'

Change-Id: I97334383df476809c39e0d03b1af50cb59ee0cc7
2022-11-15 07:03:21 +01:00
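Note on b17dfc13ed: the removal itself is a single unlink. The sketch below assumes the caller already holds its own per-repository lock, which is the condition the message relies on to call the removal safe; the function name is hypothetical.

```python
from pathlib import Path


def remove_leaked_index_lock(repo_path):
    """Delete .git/index.lock if present; return True if one was removed.

    Only safe while the caller holds its own lock on the repository,
    so that no live git process can own the file.
    """
    lock = Path(repo_path) / ".git" / "index.lock"
    try:
        lock.unlink()
    except FileNotFoundError:
        return False
    return True
```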
James E. Blair 8a8502f661 Fix race in merger shutdown
We can disconnect from ZK while the merger is still running which
can have some adverse effects and cause tests to never exit.

This moves the zk disconnect in the merger to the join method so
that we ensure that we have exited the main loop.

It also adds some improved logging so that not everything just
says "Stopped".

Change-Id: I459af85ac70ecf1f61645466d0eddc63c7e61ff9
2022-11-08 15:12:22 -08:00
James E. Blair 26b9b0e2fb Add rebase-merge merge mode
GitHub supports a "rebase" merge mode where it will rebase the PR
onto the target branch and fast-forward the target branch to the
result of the rebase.

Add support for this process to the merger so that it can prepare
an effective simulated repo, and map the merge-mode to the merge
operation in the reporter so that gating behavior matches.

This change also makes a few tweaks to the merger to improve
consistency (including renaming a variable ref->base), and corrects
some typos in the similar squash merge test methods.

Change-Id: I9db1d163bafda38204360648bb6781800d2a09b4
2022-10-17 14:27:05 -07:00
James E. Blair e68f2bfdb3 Don't trace merge jobs that we don't lock
We get a trace from every merger (including executors) for every
merge job because we start the trace before attempting the lock.
So essentially, we get one trace from the merger that runs the job,
and one trace from every other merger indicating that it did not
run the job.

This is perhaps too much detail for us.  While it's true that we
can see the response times of every system component here, it may
be sufficient to have only the response time of the first merger.
This will reduce the noise in trace visualizations significantly.

Change-Id: I88c56f00c060eae9316473f4a4e222a0db97e510
2022-10-05 11:16:18 -07:00
Simon Westphahl f1e3d67608
Trace merge requests and merger operations
The span info for the different merger operations is stored on the
request and will be returned to the scheduler via the result event.

This also adds the request UUID to the "refstat" job so that we can
attach that as a span attribute.

Change-Id: Ib6ac7b5e7032d168f53fe32e28358bd0b87df435
2022-09-19 11:25:49 +02:00
Simon Westphahl a97d9f594e Set remote URL after config was updated
To avoid issues with outdated Github access tokens in the Git config we
only update the remote URL on the repo object after the config update
was successful.

This also adds a missing repo lock when building the repo state.

Change-Id: I8e1b5b26f03cb75727d2b2e3c9310214a3eac447
2022-08-18 10:34:40 +02:00
James E. Blair 458ba317fd Add pipeline-based merge op metrics
So that operators can see in aggregate how long merge, files-changes,
and repo-state merge operations take in certain pipelines, add
metrics for the merge operations themselves (these exclude the
overhead of pipeline processing and job dispatching).

Change-Id: I8a707b8453c7c9559d22c627292741972c47c7d7
2022-07-12 10:25:59 -07:00
Zuul 64b80001ee Merge "Add support for GHE repository cache" 2022-05-30 15:56:23 +00:00
Joshua Watt 968c2ffd1f merger: Handle merges with cherry-pick merge-mode
Merges cannot be cherry-picked in git, so if a change is a merge, do a
`git merge` instead of a cherry-pick to match how Gerrit will merge the
change.

Change-Id: I9bc7025d2371913b63f0a6723aff480e7e63d8a3
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
2022-05-11 10:06:18 -05:00
James E. Blair c41fcbe483 Add support for GHE repository cache
Change-Id: Iec87857aa58f71875d780da3698047dae01120d7
2022-05-05 13:39:41 -07:00
Zuul d1db18baa9 Merge "Fix bug in getting changed files" 2022-04-28 15:56:45 +00:00
Dong Zhang 79b6252370 Fix bug in getting changed files
The fix includes two parts:
1. For Github, we pass the base_sha instead of the target branch as
   the "tosha" parameter to get precise changed files.
2. In method getFilesChanges(), use the diff() result to filter out
   those files that changed and reverted between commits.

The reason we do not directly use diff() is that for
drivers other than github, the "base_sha" is not available yet;
using diff() may include unexpected files when the target branch
has diverged from the feature branch.

This solution works for 99.9% of cases; it may still get an
incorrect list of changed files in the following corner case:
1. In a non-github connection, whose base_sha is not implemented, and
2. Files changed and reverted between commits in the change, and
3. The same file has also diverged in the target branch.

The above corner case can be fixed by making base_sha available in
other drivers.

Change-Id: Ifae7018a8078c16f2caf759ae675648d8b33c538
2022-04-25 15:05:48 -07:00
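Note on 79b6252370: the second part of the fix is a set intersection. The per-commit file lists keep target-branch divergence out of the result, and the net diff drops files that were changed and then reverted. The sketch below works on plain lists of paths; the function name and inputs are assumptions, not the signature of getFilesChanges().

```python
def files_changed_sketch(per_commit_files, net_diff_files):
    """Files touched by the change's commits that also differ in the net diff."""
    touched = set()
    for files in per_commit_files:
        touched.update(files)
    # A file added in one commit and removed in a later one is in
    # `touched` but not in the net diff, so it drops out here. A file
    # that differs only because the target branch moved is in the net
    # diff but not in `touched`, so it drops out as well.
    return sorted(touched & set(net_diff_files))
```

The corner case listed in the message corresponds to a file that is in both sets for unrelated reasons: reverted within the change, yet still in the diff because the target branch diverged.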
James E. Blair 1cd1d3f4de Delete repo if unable to reset
If a merger or executor is unable to reset a repo, we currently
simply log the message "Unable to reset repo".  Instead, let's
assume that it is permanently broken and rmtree it so that future
attempts will automatically recover.

Change-Id: I17b051d70a9c5800019bf9ef7e0800558614cadd
2022-04-13 16:53:21 -07:00
Zuul d6ace2cec6 Merge "Create remote ref when it does not exist" 2022-04-13 16:10:03 +00:00
Albin Vass d5bfbf53b6 Recover from broken process pools in merge operations
Some merge operations catch overly generic exceptions, which causes
BrokenProcessPool exceptions to never reach the executor to allow the
executor to recover.

Bubble these exceptions up to the executor for them to be handled.

Change-Id: I77d4d381e12195bcfe7d831a2b9e6d361b90f5a2
2022-03-24 14:23:40 +01:00
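Note on d5bfbf53b6: `BrokenProcessPool` is an ordinary `Exception` subclass, so a broad `except Exception` swallows it unless it is re-raised first. A minimal sketch of the ordering follows; the wrapper name and its None-on-failure convention are assumptions for illustration.

```python
from concurrent.futures.process import BrokenProcessPool


def run_merge_op(op):
    """Run a merge operation, hiding ordinary failures but not a dead pool."""
    try:
        return op()
    except BrokenProcessPool:
        # Must come before the generic handler: the caller needs to see
        # this in order to rebuild its process pool.
        raise
    except Exception:
        # Ordinary failures are reported as "no result".
        return None
```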
Zuul a72c661eb0 Merge "Support non-top-level dirs as extra config path" 2022-03-22 23:25:25 +00:00
Dong Zhang f991c3fdc6 Create remote ref when it does not exist
It can happen that the remote ref (corresponding to the branch in
cache) is not available when local workspace is cloned.

Fix this issue by creating the remote ref when it does not exist.

Change-Id: I68244e0b5aa3c8b6e15693ffc2897d4f416e0d5c
2022-03-21 08:22:46 +01:00
Zuul 30fe3a5039 Merge "Simplify _saveRepoState" 2022-03-14 11:45:38 +00:00
Zuul 60d7e636c4 Merge "Improve performance of _saveRepoState" 2022-03-14 11:21:44 +00:00
Simon Westphahl 2f69e30f90 Support non-top-level dirs as extra config path
This change adds support for configuring non-top-level directories
(e.g. `foobar/zuul.d/`) as an extra config path which did not work so
far.

It's not clear if this was a bug or intended behavior that was just not
documented.

Change-Id: I1bc468130c9324a2e1b5d7f50b42fdc045eaa741
2022-03-08 16:23:18 +01:00
Tobias Henkel 2e785ac7b6
Simplify _saveRepoState
This is an attempt to avoid getting the refs twice on the repo and
simplify _saveRepoState by directly using getPackedRefs.

Change-Id: I27876571451554caca19bdf9ae7ff502d2d4e062
2022-02-25 09:51:49 +01:00
James E. Blair 61cb275480 Report which repo failed initial merge ops
When the initial merge job for a queue item fails, users typically
see a message saying "this project or one of dependencies failed
to merge".  To help users and/or administrators more quickly identify
the problem, include connection, project, and change information in
a warning message posted to the code review system.

Change-Id: If1bced80b87b908f63867083efb306ebe02ed1ee
2022-02-20 13:06:39 -08:00
James E. Blair a160484a86 Add zuul-scheduler tenant-reconfigure
This is a new reconfiguration command which behaves like full-reconfigure
but only for a single tenant.  This can be useful after connection issues
with code hosting systems, or potentially with Zuul cache bugs.

Because this is the first command-socket command with an argument, some
command-socket infrastructure changes are necessary.  Additionally, this
includes some minor changes to make the services more consistent around
socket commands.

Change-Id: Ib695ab8e7ae54790a0a0e4ac04fdad96d60ee0c9
2022-02-08 14:14:17 -08:00
Clark Boylan 1d4a6e0b71 Add a merger graceful command
This command is an alias for merger stop as merger stop is already a
graceful stop. We add this command to make this more clear and
consistent with the executor.

Change-Id: Iffba56b0127575eaadf31753e2a64dfd95f12fa6
2022-02-07 09:39:44 -08:00
Simon Westphahl 9044f7e907 Revert "Fix a bug in getting changed files"
The reverted change can lead to the listing of files that are not
changed in the referenced commit(s). This can e.g. happen if the base
branch (e.g. master) has diverged from the feature branch.

This is now also tested to avoid regressions in the future. The issue
related to files that are added/removed in the same range of commits
(e.g. a PR) needs to be addressed in a separate change.

This reverts commit e63d7b0cdb.

Change-Id: I07bc4a09bf162fdbc4c2daeecb19e12d81241801
2022-01-20 11:42:54 +01:00
James E. Blair 704fef6cb9 Add readiness/liveness probes to prometheus server
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes.  We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).

The prometheus_client library doesn't support this directly, so we need
to handle this ourselves.  We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.

The prometheus_client will actually return the metrics on any path
given to it.  In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI which is what we specified in our documentation,
and also '/' which is very likely accidentally used by users.

Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
2021-12-09 07:37:29 -08:00
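Note on 704fef6cb9: wrapping one WSGI application in another needs no extra server. The sketch below shows the routing described in the message. The probe paths `/health/live` and `/health/ready`, the 503 and 404 responses, and the stand-in metrics application are assumptions; the message itself only fixes `/metrics` and `/` as the metrics paths.

```python
def fake_metrics_app(environ, start_response):
    # Stand-in for the prometheus_client WSGI app, which answers on any path.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"fake_metric 1\n"]


def make_probe_app(metrics_app, get_state):
    """Wrap metrics_app, adding liveness and readiness endpoints."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path in ("/", "/metrics"):
            # Only these two paths reach the metrics app.
            return metrics_app(environ, start_response)
        if path == "/health/live":
            # Answering at all means the process is running.
            status, body = "200 OK", b"OK"
        elif path == "/health/ready":
            if get_state() in ("running", "paused"):
                status, body = "200 OK", b"OK"
            else:
                status, body = "503 Service Unavailable", b"NOT READY"
        else:
            # A mistyped probe path must not return metrics with a 200.
            status, body = "404 Not Found", b"Not found"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]

    return app
```

With this wrapper, a probe configured as `/healthy/ready` fails visibly instead of being served the metrics page.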
Zuul e14a8e65c4 Merge "Fix a bug in getting changed files" 2021-12-08 18:45:53 +00:00
Andy Ladjadj 6b4b293311 Improve performance of _saveRepoState
In the case of a large repository with more than 10k refs,
this method currently uses an async call from GitPython to retrieve each sha1.
GitPython opens a file on the filesystem for each ref.

For example, with a repository with 18k tags,
a merger instance takes 100% of one CPU (not threadless) for ~3 min
to perform the loop.

To improve this, we store the sha1 of every tag directly from
a git command (for_each_ref); this method opens the packed refs
of the repository once to extract all refs.

If a ref is not in the dict, we use the fallback method `ref.object`.

Change-Id: I8b52b39cb79527791a34ac98a25e7ee41c8d4956
2021-12-03 22:54:37 +00:00
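Note on 6b4b293311: the speed-up comes from one bulk listing plus a dictionary lookup per ref. The sketch below parses text in the shape `git for-each-ref --format='%(objectname) %(refname)'` would print and falls back to a per-ref resolver on a miss. The exact format string the change uses is not given in the message, so the format, the function names, and the fallback callable are assumptions.

```python
def parse_for_each_ref(output):
    """Build {refname: hexsha} from '<hexsha> <refname>' lines."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        hexsha, refname = line.split(" ", 1)
        refs[refname] = hexsha
    return refs


def ref_sha(refname, cache, fallback):
    """Look a ref up in the bulk cache, resolving it individually on a miss."""
    try:
        return cache[refname]
    except KeyError:
        return fallback(refname)
```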
Zuul e7555aff6c Merge "Add `playbook_context` zuul variable" 2021-11-30 23:16:58 +00:00
James E. Blair 476800d382 Add `playbook_context` zuul variable
This adds a variable which may be useful for debugging or auditing
the repo state of playbooks or roles for a job.

Change-Id: I86429a06ed8625faa72db6a19630de633f1694b6
2021-11-30 13:34:06 -08:00
James E. Blair b7e2e49f7f Use sort_keys with json almost everywhere we write to ZK
For almost any data we write to ZK (except for long-standing nodepool
classes), add the sort_keys=True so that we can more easily determine
whether an update is required.

This is in service of zkobject, and is not strictly necessary because
the json module follows dict insertion order, and our serialize methods
are obviously internally consistent (at least, if they're going to produce
the same data, which is all we care about).  But that hasn't always been
true and might not be true in the future, so this is good future-proofing.

Based on a similar thought, the argument is also added to several places
which do not use zkobject but which do write to ZK, in case we perform
a similar check in the future.  This seems like a good habit to use
throughout the code base.

Change-Id: Idca67942c057ab0e6b629b50b9b3367ccc0e4ad7
2021-11-12 15:50:02 -08:00
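Note on b7e2e49f7f: with `sort_keys=True` the serialized form no longer depends on dict insertion order, so a byte comparison is enough to tell whether a write is needed. A minimal sketch follows; the helper names are hypothetical and this is not zkobject's implementation.

```python
import json


def serialize(data):
    """Serialize with sorted keys so equal dicts give equal bytes."""
    return json.dumps(data, sort_keys=True).encode("utf8")


def needs_update(stored_bytes, data):
    """True if writing `data` would change what is already stored."""
    return stored_bytes != serialize(data)


first = {"state": "running", "uuid": "abc"}
second = {"uuid": "abc", "state": "running"}  # same data, different order
```

Without the flag, `json.dumps(first)` and `json.dumps(second)` differ even though the dicts compare equal.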
Dong Zhang e63d7b0cdb Fix a bug in getting changed files
The original implementation takes into account the changed files from
all commits of a PR.
It causes a bug when files get changed and reverted in those commits.
e.g. A file is added in the first commit then removed in the second commit;
this file should not be considered as a changed file in the PR.

Change-Id: I7db8b9d3f3267073c5e1a71f52e75939ffa91773
2021-11-11 13:46:01 +08:00
Felix Edel 220534c0f7 Store version information in component registry
This stores the zuul version of each component in the component
registry and updates the API endpoint.

Change-Id:  I1855b2a6db2bd330343cad69d9d6cf21ea35a1f5
2021-10-20 17:17:02 +02:00
James E. Blair 66008900a8 Send synthetic merge completed events on cleanup
When a merger crashes, the scheduler identifies merge jobs which
were left in an incomplete state and cleans them up.  However there
may be queue items waiting for merge complete events, and nothing
generates those in this case.

Update the merge job cleanup procedure to mimic the executor job
cleanup procedure which, in addition to deleting the incomplete job
requests, also creates synthetic complete events in order to prompt
the scheduler to resume processing.

Change-Id: Idea384f636a0cd9e8c82ee92d3f5a65bef0889f2
2021-09-20 10:37:39 -07:00
James E. Blair 97a76de403 Fix race involving job request locks
It's possible for the following sequence to occur (prefixed by
thread ids):

2> process job request cache update

1> finish job
1> set job request state to complete
1> unlock job request
1> delete job request
1> delete job request lock

2> get cached list of running jobs for lostRequests, start examining job
2> check if the job is unlocked (this will re-create the lock dir and return true)
2> attempt to set job request state to complete (this will raise JobRequestNotFound)
2> bail

This leaves a lock node lying around.  We have a cleanup process that
will eventually remove it in production, but its existence can cause
the clean-state checks at the end of unit tests to fail.

To correct this:

a) Try to avoid re-creating the lock (though this is not possible in all cases)
b) If we encounter JobRequestNotFound error in the cleanup, attempt to
   delete the job nonetheless (so that we re-delete the lock dir)

The remove method is also made entirely idempotent to support this.

Change-Id: I49ad5c38a3c6cbaf0962e805b6c228e36b97a3d2
2021-09-14 09:10:34 -07:00
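Note on 97a76de403: an idempotent remove treats "already gone" as success for both the request node and its lock node, which lets the cleanup path call it again after the lock has been re-created. The sketch below uses a plain dict as a stand-in for ZooKeeper; the paths and the function name are assumptions.

```python
def remove_request(store, request_path, lock_path):
    """Delete the request and its lock; missing nodes are not an error."""
    for path in (request_path, lock_path):
        # pop() with a default never raises, so repeated calls are safe.
        store.pop(path, None)


store = {
    "/requests/job-1": b"{}",
    "/locks/job-1": b"",
}
```

Calling `remove_request` a second time, or after another thread has re-created only the lock node, leaves the store clean in both cases.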