In addition to the safeguard in
Iebf49a9efe193788199197bf7846e336d96edf19 we will only return the final
config for a project-branch as part of the merge result.
Change-Id: I1eb3b75d8762aff4e1ebd057661869df985a79e2
Use a fixed timestamp and merge message so that zuul mergers
produce the exact same commit sha each time they perform a merge
for a queue item. This can help correlate git repo states for
different jobs in the same change as well as across different
changes in the case of a dependent change series.
The timestamp used is the "configuration time" of the queue item
(ie, the time the buildset was created or reset). This means
that it will change on gate resets (which could be useful for
distinguishing one run of a build from another).
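As an illustrative sketch (not Zuul's actual code; the identity and message below are made up), git derives a commit's SHA from the full commit object, including both timestamps, so fixing the timestamp and message makes the SHA reproducible:

```python
import hashlib

def commit_sha(tree, parents, message, timestamp):
    # Git hashes the full commit object, including the author and
    # committer timestamps, so a fixed timestamp and message yield a
    # reproducible SHA for the same tree and parents.
    body = "tree %s\n" % tree
    for parent in parents:
        body += "parent %s\n" % parent
    body += "author Zuul <zuul@example.com> %d +0000\n" % timestamp
    body += "committer Zuul <zuul@example.com> %d +0000\n" % timestamp
    body += "\n" + message
    data = body.encode("utf8")
    header = b"commit %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```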
Change-Id: I3379b19d77badbe2a2ec8347ddacc50a2551e505
This change completes the circular dependency refactor.
The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.
In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.
In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.
Previously changes were enqueued recursively and then
bundles were made out of the resulting items. Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueueing the first change in a cycle.
Some tests exercise situations where Zuul is processing
events for old patchsets of changes. The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence, therefore the Gerrit driver has been
updated to query and return more data about non-current
patchsets.
This change is not backwards compatible with the existing
ZK schema, and will require that Zuul systems delete all pipeline
states during the upgrade. A later change will implement
a helper command for this.
All backwards compatibility handling for the last several
model_api versions which were added to prepare for this
upgrade has been removed. In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).
Job deduplication is simplified and now only needs to
consider jobs within the same buildset.
The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueueing were not valid. This has been corrected, and
in doing so, has necessitated some changes around quiet dequeueing
when we re-enqueue a change.
The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.
Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
When refs are set asynchronously we don't supply a logger and instead
expect `_setRefs()` to return the log messages. This was not the case
in the exception handler for when we can't resolve an object.
In addition to this fix, another debug message is now also returned as
part of the message list, as expected.
Traceback (most recent call last):
File "/opt/zuul/lib/python3.11/site-packages/zuul/merger/merger.py", line 553, in _setRefs
repo.odb.info(binsha)
File "/opt/zuul/lib/python3.11/site-packages/git/db.py", line 40, in info
hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/git/cmd.py", line 1384, in get_object_header
return self.__get_object_header(cmd, ref)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/git/cmd.py", line 1371, in __get_object_header
return self._parse_object_header(cmd.stdout.readline())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/git/cmd.py", line 1332, in _parse_object_header
raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'8683bca8c75c1c3ae07730452d93c736b1e899db' could not be resolved, git returned: b'8683bca8c75c1c3ae07730452d93c736b1e899db missing'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/zuul/merger/merger.py", line 532, in setRefsAsync
messages = Repo._setRefs(repo, refs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zuul/lib/python3.11/site-packages/zuul/merger/merger.py", line 560, in _setRefs
log.warning("Unable to resolve reference %s at %s in %s",
^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'warning'
Change-Id: Ieb54a53f1fe09848da0a40fdb0dfcb445c65eded
When a repo that is being used for a zuul role has override-checkout
set to a tag, the job would fail because we did not reconstruct the
tag in our zuul-role checkout; we only did that for branches.
This fixes the repo state restore for any type of ref.
There is an untested code path where a zuul role repo is checked
out to a tag using override-checkout. Add a test for that (and
also the same for a branch, for good measure).
Change-Id: I36f142cd3c4e7d0b930318dddd7276f3635cc3a2
The recent update to make setting refs more efficient could encounter
an edge case if a branch or tag was removed from the upstream repo
after the repo state was retrieved by zuul. If removing the ref caused
the underlying objects to be removed and not sent in a fetch, then
our blindly setting the repo state with a ref pointed to an unresolvable
object could leave the repository in a corrupted state.
To recover from any potential corruption that may have somehow happened,
this change adds an additional case where we will remove the underlying
repo and re-clone.
To prevent any such corruption from happening in the first place, we add
a check that each hexsha is resolvable before we set it when restoring
the repo state. This does add a small amount of overhead, but should be
much less than manipulating the loose refs one at a time. A copy of nova
with 10,000 refs adds 100ms for this checking.
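A minimal sketch of such a pre-check, assuming a GitPython-style object database where `odb.info()` raises ValueError for an unresolvable sha (the function and message wording are illustrative):

```python
def filter_resolvable(repo, refs, messages):
    # Drop any ref whose object cannot be resolved in the object
    # database, recording a message instead of corrupting the repo by
    # setting a ref that points at a missing object.
    good = {}
    for ref, hexsha in refs.items():
        try:
            repo.odb.info(bytes.fromhex(hexsha))
        except ValueError:
            messages.append(
                "Unable to resolve %s for %s; skipping" % (hexsha, ref))
            continue
        good[ref] = hexsha
    return good
```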
Change-Id: Ifd298905e634f83a147644d35ff3ea1c143b3d1f
When the executor clones a repo from the cache to the workspace,
it performs a lot of unnecessary work:
* It checks out HEAD and prepares a workspace which we will
immediately change.
* It copies all of the branch refs, half of which we will immediately
delete, and in some configurations (exclude_unprotected_branches)
we will immediately delete most of the rest. Deleting refs with
gitpython is much more expensive than creating them.
This change updates the initial clone to do none of those, instead
relying on the repo state restoration to take care of that for us.
Change-Id: Ie8846c48ccd6255953f46640f5559bb41491d425
Starting with Github Enterprise 3.8[0] and github.com from September
2022 on[1], the merge strategy changed from using merge-recursive to
merge-ort[0].
The merge-ort strategy is available in the Git client since version
2.33.0 and became the default in 2.34.0[2].
If not configured otherwise, we've so far used the default merge
strategy of the Git client (which varies depending on the client
version). With this change, we are now explicitly choosing the default
merge strategy based on the Github version. This way, we can reduce
errors resulting from the use of different merge strategies in Zuul and
Github.
Since the newly added merge strategies must be understood by the mergers
we also need to bump the model API version.
[0] https://docs.github.com/en/enterprise-server@3.8/admin/release-notes
[1] https://github.blog/changelog/2022-09-12-merge-commits-now-created-using-the-merge-ort-strategy/
[2] https://git-scm.com/docs/merge-strategies#Documentation/merge-strategies.txt-recursive
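The selection logic can be sketched as follows (illustrative only; the version-tuple representation is an assumption, not Zuul's actual data model):

```python
def github_merge_strategy(ghe_version):
    # github.com (ghe_version is None) and GHE >= 3.8 create merge
    # commits with merge-ort; older GHE versions used merge-recursive.
    # ghe_version is a (major, minor) tuple for GitHub Enterprise.
    if ghe_version is None or ghe_version >= (3, 8):
        return "ort"
    return "recursive"
```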
Change-Id: I354a76fa8985426312344818320980c67171d774
The Github driver prints un-sanitized URLs to log files. This PR
uses the redact_url function to sanitize the remote URL before
logging.
Task: #48344
Signed-off-by: Flavio Percoco <flavio@pacerevenue.com>
Change-Id: Id725c8dfe3e4e782c293ff350fc7e35b23d377ab
Signed-off-by: Flavio Percoco <flavio@pacerevenue.com>
Return the Zuul event ID that is already part of the merge request with
the merge result event so logs can be correlated.
Change-Id: I018709cd4d4afa562e6851d0d52c1ddd7583dc62
From the previous log messages related to a cat job it wasn't clear
which HEAD SHA was used to get the file content.
This change adds a log message to the `getFiles()` method that contains
the HEAD commit SHA.
Change-Id: I02a3a97f9b3dfa70f6e55954ea6ef365289f0046
In normal git usage, cherry-picking a commit that has already been
applied (and thus does nothing), or cherry-picking an empty commit,
causes git to exit with an error to let the user decide what they want
to do.
However, this doesn't match the behavior of merges and rebases where
non-empty commits that have already been applied are simply skipped
(empty source commits are preserved).
To fix this, add the --keep-redundant-commits option to `git cherry-pick`
to make git always keep a commit when cherry-picking even when it is
empty for either reason. Then, after the cherry-pick, check if the new
commit is empty and if so back it out if the original commit _wasn't_
empty.
This two-step process is necessary because git doesn't have any option
to simply skip cherry-picked commits that have already been applied to
the tree.
Removing commits that have already been applied is particularly
important in a "deploy" pipeline triggered by a Gerrit "change-merged"
event, since the scheduler will try to cherry-pick the change on top of
the commit that just merged. Without this option, the cherry-pick will
fail and the deploy pipeline will fail with a MERGE_CONFLICT.
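The decision made after the cherry-pick reduces to a small predicate (a hypothetical helper, not Zuul's actual code):

```python
def keep_cherry_pick(orig_is_empty, new_is_empty):
    # After `git cherry-pick --keep-redundant-commits`:
    #  - a commit that was empty at the source is kept (empty source
    #    commits are preserved);
    #  - a non-empty commit that produced an empty result was already
    #    applied, so it is backed out.
    return orig_is_empty or not new_is_empty
```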
Change-Id: I326ba49e2268197662d11fd79e46f3c020675f21
When using the rebase merge-mode a failed "merge" will leave the repo in
a state that Zuul so far could not recover from. The rebase will create
a `.git/rebase-merge` directory which is not removed when the rebase
fails.
To fix this we will abort the rebase when it fails and also remove any
existing `.git/rebase-merge` and `.git/rebase-apply` directories when
resetting the repository.
DEBUG zuul.Merger: [e: ...] Unable to merge {'branch': 'master', 'buildset_uuid': 'f7be4215f37049b4ba0236892a5d8197', 'connection': 'github', 'merge_mode': 5, 'newrev': None, 'number': 71, 'oldrev': None, 'patchset': 'e81d0b248565db290b30d9a638095947b699c76d', 'project': 'org/project', 'ref': 'refs/pull/71/head'}
Traceback (most recent call last):
File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 1099, in _mergeChange
commit = repo.rebaseMerge(
File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 626, in rebaseMerge
repo.git.rebase(*args)
File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 542, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 1005, in _call_process
return self.execute(call, **exec_kwargs)
File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 822, in execute
raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git rebase 39fead1852ef01a716a1c6470cee9e4197ff5587
stderr: 'fatal: It seems that there is already a rebase-merge directory, and
I wonder if you are in the middle of another rebase. If that is the
case, please try
git rebase (--continue | --abort | --skip)
If that is not the case, please
rm -fr ".git/rebase-merge"
and run me again. I am stopping in case you still have something
valuable there.
Change-Id: I8518cc5e4b3f7bbfc2c2283a2b946dee504991dd
I8e1b5b26f03cb75727d2b2e3c9310214a3eac447 introduced a regression that
prevented us from re-cloning a repo that no longer exists on the file
system (e.g. deleted by an operator) but where we still have the cached
`Repo` object.
The problem was that we only updated the remote URL of the repo object
after we wrote it to the Git config. Unfortunately, if the repo no
longer existed on the file system we would attempt to re-clone it with a
possibly outdated remote URL.
`test_set_remote_url` is a regression test for the issue described
above.
`test_set_remote_url_invalid` verifies that the original issue is fixed,
where we updated the remote URL attribute of the repo object but failed
to update the Git config.
Change-Id: I311842ccc7af38664c28177450ea9e80e1371638
When the git command crashes or is aborted due to a timeout we might end
up with a leaked index.lock file in the affected repository.
This has the effect that all subsequent git operations that try to
create the lock will fail. Since Zuul maintains a separate lock for
serializing operations on a repository, we can be sure that the lock
file was leaked in a previous operation and can be removed safely.
Unable to checkout 8a87ff7cc0d0c73ac14217b653f9773a7cfce3a7
Traceback (most recent call last):
File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 1045, in _mergeChange
repo.checkout(ref, zuul_event_id=zuul_event_id)
File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 561, in checkout
repo.head.reset(working_tree=True)
File "/opt/zuul/lib/python3.10/site-packages/git/refs/head.py", line 82, in reset
self.repo.git.reset(mode, commit, '--', paths, **kwargs)
File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 542, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 1005, in _call_process
return self.execute(call, **exec_kwargs)
File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 822, in execute
raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git reset --hard HEAD --
stderr: 'fatal: Unable to create '/var/lib/zuul/merger-git/github/foo/foo%2Fbar/.git/index.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.'
Change-Id: I97334383df476809c39e0d03b1af50cb59ee0cc7
We can disconnect from ZK while the merger is still running which
can have some adverse effects and cause tests to never exit.
This moves the zk disconnect in the merger to the join method so
that we ensure that we have exited the main loop.
It also adds some improved logging so that not everything just
says "Stopped".
Change-Id: I459af85ac70ecf1f61645466d0eddc63c7e61ff9
GitHub supports a "rebase" merge mode where it will rebase the PR
onto the target branch and fast-forward the target branch to the
result of the rebase.
Add support for this process to the merger so that it can prepare
an effective simulated repo, and map the merge-mode to the merge
operation in the reporter so that gating behavior matches.
This change also makes a few tweaks to the merger to improve
consistency (including renaming a variable ref->base), and corrects
some typos in the similar squash merge test methods.
Change-Id: I9db1d163bafda38204360648bb6781800d2a09b4
We get a trace from every merger (including executors) for every
merge job because we start the trace before attempting the lock.
So essentially, we get one trace from the merger that runs the job,
and one trace from every other merger indicating that it did not
run the job.
This is perhaps too much detail for us. While it's true that we
can see the response times of every system component here, it may
be sufficient to have only the response time of the first merger.
This will reduce the noise in trace visualizations significantly.
Change-Id: I88c56f00c060eae9316473f4a4e222a0db97e510
The span info for the different merger operations is stored on the
request and will be returned to the scheduler via the result event.
This also adds the request UUID to the "refstat" job so that we can
attach that as a span attribute.
Change-Id: Ib6ac7b5e7032d168f53fe32e28358bd0b87df435
To avoid issues with outdated Github access tokens in the Git config we
only update the remote URL on the repo object after the config update
was successful.
This also adds a missing repo lock when building the repo state.
Change-Id: I8e1b5b26f03cb75727d2b2e3c9310214a3eac447
So that operators can see in aggregate how long merge, files-changes,
and repo-state merge operations take in certain pipelines, add
metrics for the merge operations themselves (these exclude the
overhead of pipeline processing and job dispatching).
Change-Id: I8a707b8453c7c9559d22c627292741972c47c7d7
Merges cannot be cherry-picked in git, so if a change is a merge, do a
`git merge` instead of a cherry-pick to match how Gerrit will merge the
change.
Change-Id: I9bc7025d2371913b63f0a6723aff480e7e63d8a3
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
The fix includes two parts:
1. For Github, we use the base_sha instead of the target branch as
the "tosha" parameter to get the precise list of changed files.
2. In the method getFilesChanges(), use the diff() result to filter
out files that were changed and reverted between commits.
The reason we do not directly use diff() is that for drivers
other than github the "base_sha" is not available yet, and
using diff() may include unexpected files when the target branch
has diverged from the feature branch.
This solution works for 99.9% of the cases; it may still get an
incorrect list of changed files in the following corner case:
1. In a non-github connection, whose base_sha is not implemented, and
2. Files changed and reverted between commits in the change, and
3. The same file has also diverged in target branch.
The above corner case can be fixed by making base_sha available in
other drivers.
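The filtering in part 2 can be sketched as follows (a hypothetical helper; Zuul's actual implementation works on commit objects rather than plain lists):

```python
def net_changed_files(per_commit_files, diff_files):
    # Intersect the union of per-commit file lists with the overall
    # diff between base and tip: files that were changed and then
    # reverted within the series vanish from the overall diff and are
    # filtered out, while files that only differ because the target
    # branch diverged never appear in the per-commit lists.
    union = set()
    for files in per_commit_files:
        union.update(files)
    return sorted(union & set(diff_files))
```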
Change-Id: Ifae7018a8078c16f2caf759ae675648d8b33c538
If a merger or executor is unable to reset a repo, we currently
simply log the message "Unable to reset repo". Instead, let's
assume that it is permanently broken and rmtree it so that future
attempts will automatically recover.
Change-Id: I17b051d70a9c5800019bf9ef7e0800558614cadd
Some merge operations catch overly generic exceptions, which causes
BrokenProcessPool exceptions to never reach the executor and allow
it to recover.
Bubble these exceptions up to the executor for them to be handled.
Change-Id: I77d4d381e12195bcfe7d831a2b9e6d361b90f5a2
It can happen that the remote ref (corresponding to the branch in the
cache) is not available when the local workspace is cloned.
Fix this issue by creating the remote ref when it does not exist.
Change-Id: I68244e0b5aa3c8b6e15693ffc2897d4f416e0d5c
This change adds support for configuring non-top-level directories
(e.g. `foobar/zuul.d/`) as an extra config path which did not work so
far.
It's not clear if this was a bug or intended behavior that was just not
documented.
Change-Id: I1bc468130c9324a2e1b5d7f50b42fdc045eaa741
This is an attempt to avoid getting the refs twice on the repo and
simplify _saveRepoState by directly using getPackedRefs.
Change-Id: I27876571451554caca19bdf9ae7ff502d2d4e062
When the initial merge job for a queue item fails, users typically
see a message saying "this project or one of dependencies failed
to merge". To help users and/or administrators more quickly identify
the problem, include connection project and change information in
a warning message posted to the code review system.
Change-Id: If1bced80b87b908f63867083efb306ebe02ed1ee
This is a new reconfiguration command which behaves like full-reconfigure
but only for a single tenant. This can be useful after connection issues
with code hosting systems, or potentially with Zuul cache bugs.
Because this is the first command-socket command with an argument, some
command-socket infrastructure changes are necessary. Additionally, this
includes some minor changes to make the services more consistent around
socket commands.
Change-Id: Ib695ab8e7ae54790a0a0e4ac04fdad96d60ee0c9
This command is an alias for merger stop as merger stop is already a
graceful stop. We add this command to make this more clear and
consistent with the executor.
Change-Id: Iffba56b0127575eaadf31753e2a64dfd95f12fa6
The reverted change can lead to the listing of files that are not
changed in the referenced commit(s). This can e.g. happen if the base
branch (e.g. master) has diverged from the feature branch.
This is now also tested to avoid regressions in the future. The issue
related to files that are added/removed in the same range of commits
(e.g. a PR) needs to be addressed in a separate change.
This reverts commit e63d7b0cdb.
Change-Id: I07bc4a09bf162fdbc4c2daeecb19e12d81241801
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes. We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).
The prometheus_client library doesn't support this directly, so we need
to handle this ourselves. We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.
The prometheus_client will actually return the metrics on any path
given to it. In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI which is what we specified in our documentation,
and also '/' which is very likely accidentally used by users.
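A sketch of such a wrapper (the probe paths and component state names here are illustrative assumptions, not Zuul's exact values):

```python
def wrap_with_health(metrics_app, get_state):
    # Wrap the prometheus_client WSGI app: answer liveness/readiness
    # probes ourselves and only forward '/' and '/metrics' to the
    # metrics app, so a typo'd probe path gets a 404 instead of a 200
    # with a metrics payload.
    def application(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path == "/health/live":
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"OK"]
        if path == "/health/ready":
            if get_state() in ("running", "paused"):
                start_response("200 OK", [("Content-Type", "text/plain")])
                return [b"OK"]
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain")])
            return [b"STOPPED"]
        if path in ("/", "/metrics"):
            return metrics_app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"NOT FOUND"]
    return application
```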
Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
In the case of a large repository with more than 10k refs,
this method actually makes an async call through Gitpython to
retrieve each sha1, and Gitpython opens a file on the filesystem
for each ref.
For example, with a repository with 18k tags,
a merger instance takes 100% of one CPU (single-threaded) for
~3 minutes to perform the loop.
To improve this, we store all tag sha1s directly from
a single git command (for_each_ref); this method opens the packed-refs
file of the repository once to extract all refs.
If a ref is not in the dict we use the fallback method `ref.object`.
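The bulk read amounts to a single pass over the packed-refs file, e.g. (illustrative sketch; Zuul shells out to `git for-each-ref` rather than parsing the file itself):

```python
def parse_packed_refs(text):
    # One pass over .git/packed-refs: data lines are "<sha> <refname>";
    # '#' lines are headers and '^' lines are peeled tag objects.
    refs = {}
    for line in text.splitlines():
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        sha, _, refname = line.partition(" ")
        refs[refname] = sha
    return refs
```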
Change-Id: I8b52b39cb79527791a34ac98a25e7ee41c8d4956
This adds a variable which may be useful for debugging or auditing
the repo state of playbooks or roles for a job.
Change-Id: I86429a06ed8625faa72db6a19630de633f1694b6
For almost any data we write to ZK (except for long-standing nodepool
classes), add the sort_keys=True so that we can more easily determine
whether an update is required.
This is in service of zkobject, and is not strictly necessary because
the json module follows dict insertion order, and our serialize methods
are obviously internally consistent (at least, if they're going to produce
the same data, which is all we care about). But that hasn't always been
true and might not be true in the future, so this is good future-proofing.
Based on a similar thought, the argument is also added to several places
which do not use zkobject but which do write to ZK, in case we perform
a similar check in the future. This seems like a good habit to use
throughout the code base.
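The effect, in a minimal sketch:

```python
import json

def serialize(data):
    # sort_keys=True makes the serialized form independent of dict
    # insertion order, so comparing serialized bytes reliably answers
    # "is an update required?"
    return json.dumps(data, sort_keys=True).encode("utf8")
```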
Change-Id: Idca67942c057ab0e6b629b50b9b3367ccc0e4ad7
The original implementation takes into account the changed files from
all commits of a PR.
This causes a bug when files are changed and reverted in those commits;
e.g. a file that is added in the first commit and then removed in the
second commit should not be considered a changed file in the PR.
Change-Id: I7db8b9d3f3267073c5e1a71f52e75939ffa91773
This stores the zuul version of each component in the component
registry and updates the API endpoint.
Change-Id: I1855b2a6db2bd330343cad69d9d6cf21ea35a1f5
When a merger crashes, the scheduler identifies merge jobs which
were left in an incomplete state and cleans them up. However there
may be queue items waiting for merge complete events, and nothing
generates those in this case.
Update the merge job cleanup procedure to mimic the executor job
cleanup procedure which, in addition to deleting the incomplete job
requests, also creates synthetic complete events in order to prompt
the scheduler to resume processing.
Change-Id: Idea384f636a0cd9e8c82ee92d3f5a65bef0889f2
It's possible for the following sequence to occur (prefixed by
thread ids):
2> process job request cache update
1> finish job
1> set job request state to complete
1> unlock job request
1> delete job request
1> delete job request lock
2> get cached list of running jobs for lostRequests, start examining job
2> check if the job is unlocked (this will re-create the lock dir and return true)
2> attempt to set job request state to complete (this will raise JobRequestNotFound)
2> bail
This leaves a lock node laying around. We have a cleanup process that
will eventually remove it in production, but its existence can cause
the clean-state checks at the end of unit tests to fail.
To correct this:
a) Try to avoid re-creating the lock (though this is not possible in all cases)
b) If we encounter a JobRequestNotFound error in the cleanup, attempt to
delete the job nonetheless (so that we re-delete the lock dir)
The remove method is also made entirely idempotent to support this.
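A sketch of the idempotent remove, with a local exception class standing in for kazoo's NoNodeError (client API and paths are illustrative):

```python
class NoNodeError(Exception):
    """Stand-in for kazoo.exceptions.NoNodeError."""

def remove_request(client, path):
    # Delete the request's lock dir and the request node itself,
    # tolerating either already being gone so the cleanup can safely
    # run again after a partial failure.
    for node in (path + "/lock", path):
        try:
            client.delete(node, recursive=True)
        except NoNodeError:
            pass
```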
Change-Id: I49ad5c38a3c6cbaf0962e805b6c228e36b97a3d2