This change completes the circular dependency refactor.
The principal change is that queue items may now include
more than one change simultaneously in the case of circular
dependencies.
In dependent pipelines, the two-phase reporting process is
simplified because it happens during processing of a single
item.
In independent pipelines, non-live items are still used for
linear dependencies, but multi-change items are used for
circular dependencies.
Previously, changes were enqueued recursively and then
bundles were made out of the resulting items. Since we now
need to enqueue entire cycles in one queue item, the
dependency graph generation is performed at the start of
enqueuing the first change in a cycle.
Some tests exercise situations where Zuul is processing
events for old patchsets of changes. The new change query
sequence mentioned in the previous paragraph necessitates
more accurate information about out-of-date patchsets than
the previous sequence did; therefore, the Gerrit driver has been
updated to query and return more data about non-current
patchsets.
This change is not backwards compatible with the existing
ZK schema, and will require that Zuul systems delete all
pipeline states during the upgrade. A later change will
implement a helper command for this.
All backwards compatibility handling for the last several
model_api versions that was added to prepare for this
upgrade has been removed. In general, all model data
structures involving frozen jobs are now indexed by the
frozen job's uuid and no longer include the job name since
a job name no longer uniquely identifies a job in a buildset
(either the uuid or the (job name, change) tuple must be
used to identify it).
Job deduplication is simplified and now only needs to
consider jobs within the same buildset.
The fake github driver had a bug (fakegithub.py line 694) where
it did not correctly increment the check run counter, so our
tests that verified that we closed out obsolete check runs
when re-enqueuing were not valid. This has been corrected, and
in doing so has necessitated some changes around quiet dequeuing
when we re-enqueue a change.
The reporting in several drivers has been updated to support
reporting information about multiple changes in a queue item.
Change-Id: I0b9e4d3f9936b1e66a08142fc36866269dc287f1
Depends-On: https://review.opendev.org/907627
This refactors the sql connection to accommodate multiple
simultaneous changes in a buildset.
The change information is removed from the buildset table and
placed in a ref table. Buildsets are associated with refs
many-to-many via the zuul_buildset_ref table. Builds are also
associated with refs, many-to-one, so that we can support
multiple builds with the same job name in a buildset, but we
still know which change they are for.
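The table relationships described above can be sketched with SQLAlchemy roughly as follows; the column and class names here are illustrative approximations of the schema, not the exact Zuul model.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Association table linking buildsets to refs many-to-many.
zuul_buildset_ref = Table(
    "zuul_buildset_ref", Base.metadata,
    Column("buildset_id", ForeignKey("zuul_buildset.id"), primary_key=True),
    Column("ref_id", ForeignKey("zuul_ref.id"), primary_key=True),
)

class Ref(Base):
    __tablename__ = "zuul_ref"
    id = Column(Integer, primary_key=True)
    project = Column(String(255))
    ref = Column(String(255))

class BuildSet(Base):
    __tablename__ = "zuul_buildset"
    id = Column(Integer, primary_key=True)
    # A buildset now spans one or more refs via the association table.
    refs = relationship("Ref", secondary=zuul_buildset_ref)

class Build(Base):
    __tablename__ = "zuul_build"
    id = Column(Integer, primary_key=True)
    buildset_id = Column(Integer, ForeignKey("zuul_buildset.id"))
    # Each build records the single ref it ran for (many-to-one), so
    # two builds with the same job name in one buildset remain
    # distinguishable.
    ref_id = Column(Integer, ForeignKey("zuul_ref.id"))
    ref = relationship("Ref")
```

With this layout, querying a buildset yields all of its refs, and each build still knows which change it was for.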
In order to maintain a unique index in the new zuul_ref table (so that
we only have one entry for a given ref-like object: change, branch,
tag, or ref), we need to shorten the sha fields to 40 characters (to
accommodate mysql's index size limit) and also avoid nulls (to
accommodate postgres's inability to use null-safe comparison operators
on indexes). So that we can continue to use change=None,
patchset=None, etc., values in Python, we add a sqlalchemy
TypeDecorator to coerce None to and from null-safe values such as 0
or the empty string.
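A minimal sketch of such a coercion, assuming a hypothetical integer column where 0 stands in for NULL (the real decorators may use different sentinel values per column):

```python
from sqlalchemy import Integer
from sqlalchemy.types import TypeDecorator

class NullableInteger(TypeDecorator):
    """Coerce Python None to 0 in the database and back, so the column
    can be declared non-nullable and still participate in a unique
    index under both mysql and postgres."""
    impl = Integer
    cache_ok = True

    def process_bind_param(self, value, dialect):
        # Store the sentinel 0 instead of NULL on the way in.
        return 0 if value is None else value

    def process_result_value(self, value, dialect):
        # Return None to Python code when we read the sentinel back.
        return None if value == 0 else value
```

Python code continues to see change=None, while the index only ever compares concrete values.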
Some previous schema migration tests inserted data with null projects,
which should never have actually happened, so these tests are updated
to be more realistic since the new data migration requires non-null
project fields.
The migration itself has been tested with a data set consisting of
about 3 million buildsets with 22 million builds. The runtime on one
ssd-based test system in mysql is about 22 minutes and in postgres
about 8 minutes.
Change-Id: I21f3f3dfc8f93a23744856e5b82b3c948c118dc2
When a build is paused or resumed, we now store this information on the
build together with the event time. Instead of additional attributes for
each timestamp, we add an "event" list attribute to the build which can
also be used for other events in the future.
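The event-list shape might look like the following sketch; the field names are illustrative, not the exact Zuul model.

```python
import time

class Build:
    def __init__(self):
        # One list holds all lifecycle events, rather than a separate
        # attribute per timestamp (paused_time, resumed_time, ...).
        self.events = []

    def addEvent(self, event_type, description=None):
        self.events.append({
            "event_type": event_type,
            "description": description,
            "event_time": time.time(),
        })
```

A pause/resume cycle then appends two entries to the same list, and future event types need no schema change.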
The events are stored in the SQL database and added to the MQTT payload
so the information can be used by the zuul-web UI (e.g. in the "build
times" gantt chart) or provided to external services.
Change-Id: I789b4f69faf96e3b8fd090a2e389df3bb9efd602
When a build result arrives for a non-current buildset we should skip
the reporting as we can no longer create the reference to the buildset.
Traceback (most recent call last):
File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2654, in _doBuildCompletedEvent
self.sql.reportBuildEnd(
File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/sql/sqlreporter.py", line 143, in reportBuildEnd
db_build = self._createBuild(db, build)
File "/opt/zuul/lib/python3.10/site-packages/zuul/driver/sql/sqlreporter.py", line 180, in _createBuild
tenant=buildset.item.pipeline.tenant.name, uuid=buildset.uuid)
AttributeError: 'NoneType' object has no attribute 'item'
Change-Id: Iccbe9ab8212fbbfa21cb29b84a17e03ca221d7bd
This corrects two shortcomings in the database handling:
1) If we are unable to create a build or buildset and a later operation
attempts to update that build or buildset, it will likely fail, possibly
aborting the pipeline processing run.
2) If a transient db error occurs, we may miss reporting data to the db.
To correct these, this change does the following:
1) Creates missing builds or buildsets at any point we try to update them.
2) Wraps every write operation in a retry loop which attempts to write to
the database 3 times with a 5 second delay. The retry loop is just
outside the transaction block, so the entire transaction will have been
aborted and we will start again.
3) If the retry loop fails, we log the exception but do not raise it
to the level of the pipeline processor.
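The retry behavior described above can be sketched as follows; the function and connection names are illustrative, not the actual sqlreporter API. The transaction block sits inside the retry loop, so a failed attempt is rolled back whole before we try again.

```python
import logging
import time

log = logging.getLogger("zuul.sql")

def retry_write(connection, operation, attempts=3, delay=5):
    """Attempt a database write up to `attempts` times with `delay`
    seconds between tries; log (but do not raise) on final failure."""
    for attempt in range(attempts):
        try:
            # The transaction is opened inside the loop: an aborted
            # transaction is discarded and the whole write is retried.
            with connection.begin():
                return operation(connection)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay)
            else:
                # After the final failure, log the exception but do not
                # re-raise it to the pipeline processor.
                log.exception("Unable to write to database")
```

This keeps a transient database error from aborting an entire pipeline processing run.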
Change-Id: I364010fada8cbdb160fc41c5ef5e25576a654b90
A recent change added extra columns to the buildset table. The
end time of the last job is guaranteed to be at least the start time
of the first job. However, if there are queue items in-flight
during the upgrade, those buildsets will not have the first job
timestamps initialized. This produces the following traceback:
Exception in pipeline processing:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/zuul/scheduler.py", line 1977, in _process_pipeline
while not self._stopped and pipeline.manager.processQueue():
File "/usr/local/lib/python3.8/site-packages/zuul/manager/__init__.py", line 1563, in processQueue
item_changed, nnfi = self._processOneItem(
File "/usr/local/lib/python3.8/site-packages/zuul/manager/__init__.py", line 1498, in _processOneItem
self.reportItem(item)
File "/usr/local/lib/python3.8/site-packages/zuul/manager/__init__.py", line 1793, in reportItem
reported=not self._reportItem(item))
File "/usr/local/lib/python3.8/site-packages/zuul/manager/__init__.py", line 1925, in _reportItem
self.sql.reportBuildsetEnd(item.current_build_set, action, final=True)
File "/usr/local/lib/python3.8/site-packages/zuul/driver/sql/sqlreporter.py", line 91, in reportBuildsetEnd
if build.end_time and build.end_time > end_time:
TypeError: '>' not supported between instances of 'datetime.datetime' and 'NoneType'
This change protects against that error; if the first build start
time is None, then we won't perform the comparison and the last
build end time will also be None.
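A minimal sketch of that guard, using illustrative build objects rather than the actual sqlreporter code:

```python
def last_build_end_time(first_build_start_time, builds):
    """Return the latest build end time, or None if the first build
    start time was never recorded."""
    if first_build_start_time is None:
        # In-flight pre-upgrade buildsets have no first-job timestamps;
        # comparing datetimes against None would raise TypeError.
        return None
    # The end of the last job is guaranteed to be at least the start
    # of the first job, so that is a safe starting point.
    end_time = first_build_start_time
    for build in builds:
        if build.end_time and build.end_time > end_time:
            end_time = build.end_time
    return end_time
```

With the early return, the comparison is only ever performed between real datetime values.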
Change-Id: I78840dc58cd950ba85b0dcf108fc0a659b051e95
Add two columns to the buildset table in the database: the timestamp
of the start of the first build and the end of the last build. These
are calculated from the builds in the webui buildset page, but they
are not available in the buildset listing without performing
a table join on the server side.
To keep the buildset query simple and fast, this adds the columns to
the buildset table (which is a minor data duplication).
Return the new values in the rest api.
Change-Id: Ie162e414ed5cf09e9dc8f5c92e07b80934592fdf
Currently, NODE_FAILURE results are not reported via SQL in case the
node request failed. This is because those results are directly
evaluated in the pipeline manager before the build is even started.
Thus, there are no build result events sent by the executor and the
"normal" build result event handling is skipped for those builds.
As those build results are not stored in the database they are also not
visible in the UI. Thus, there could be cases where a buildset failed
because of a NODE_FAILURE, but all builds that are shown were
successful.
To fix this, we could directly call the SQL reporter when the
NODE_FAILURE is evaluated in the pipeline manager.
Also adapt the reportBuildEnd() method in the sql reporter so that the
build entry is created in case it is not present. This could be the case
if the build started event was not processed or did not happen at all
(e.g. for the NODE_FAILURE results or any result that is created via a
"fake build" directly in the pipeline manager).
Change-Id: I2603a7ccf26a41e6747c9276cb37c9b0fd668f75
From a user's (developer's) point of view, the overall duration is how
much time it takes from the trigger of the build (e.g. a push, a
comment, etc.) until the last build is finished.
It also takes into account the time spent waiting in the queue,
launching nodes, preparing the nodes, etc.
Technically, it measures between the event timestamp and the end time of
the last build in the build set.
This duration reflects how long the user needs to wait.
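The measurement above reduces to a simple calculation; this sketch uses illustrative names, not Zuul's actual model.

```python
def overall_duration(event_timestamp, build_end_times):
    """Seconds from the triggering event (push, comment, ...) to the
    end of the last build in the buildset. Because it starts at the
    event, it includes time spent queued and launching/preparing
    nodes, not just job runtime."""
    if not build_end_times:
        return None
    return max(build_end_times) - event_timestamp
```

For example, an event at t=100 with builds finishing at t=160 and t=400 yields an overall duration of 300 seconds.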
Change-Id: I253d023146c696d0372197e599e0df3c217ef344
We can obtain the same information from the SQL database now, so
do that and remove the filesystem-based time database. This will
help support multiple schedulers (as they will all have access to
the same data).
Nothing in the scheduler uses the state directory anymore, so clean
up the docs around that. The executor still has a state dir where
it may install ansible-related files.
The SQL query was rather slow in practice because it created a
temporary table since it was filtering mostly by buildset fields
then sorting by build.id. We can sort by buildset.id and get nearly
the same results (equally valid from our perspective) much faster.
In some configurations under postgres, we may see a performance
variation in the run-time of the query. In order to keep the time
estimation out of the critical path of job launches, we perform
the SQL query asynchronously. We may be able to remove this added
bit of complexity once the scale-out-scheduler work is finished
(and/or we further define/restrict our database requirements).
Change-Id: Id3c64be7a05c9edc849e698200411ad436a1334d
The executor client still holds a list of local builds objects which is
used in various places. One use case is to look up necessary
information of the original build when a build result event is handled.
Using such a local list won't work with multiple schedulers in place. As
a first step we will avoid using this list for handling build result
events and instead provide all necessary information to the build result
itself and look up the remaining information from the pipeline directly.
This change also improves the log output when processing build result
events in the scheduler.
Change-Id: I9c4e573de2ce63259ec6cfb7d69c2f5be48f33ef
The SQL reporter isn't really a reporter any more, so we don't need
these methods. But we do use the reporter formatting helpers, so
let's keep the class hierarchy for now.
Change-Id: Ic6c9c599cb7ef697f0fdb838180f0f6b5fcf0a5a
We missed some cases where builds might be aborted and the results
not reported to the database. This updates the test framework to
assert that tests end with no open builds or buildsets in the
database.
To fix the actual issues, we need to report some build completions
in the scheduler instead of the pipeline manager. So to do that,
we grab a SQL reporter object when initializing the scheduler
(and we therefore no longer need to do so when initializing the
pipeline manager). The SQL reporter isn't like the rest of the
reporters -- it isn't pipeline specific, so a single global instance
is fine.
Finally, initializing the SQL reporter during scheduler init had
some conflicts with a unit test which tested that the merger could
load "source-only" connections. That test actually verified that
the *scheduler* loaded source-only connections. So to correct this,
it now verifies that the executor (which has a merger and is under
the same constraints as the merger for this purpose) can do so. We
no longer need the source_only flag in tests.
Change-Id: I1a983dcc9f4e5282c11af23813a4ca1c0f8e9d9d
We're currently recording a lot of NO_JOBS buildsets in the db,
and it's likely that no one is interested in that info. Instead,
only add a buildset entry if we know we're going to run jobs.
Change-Id: Ib89c3513a23908befaaea4f09933e846c6477aaa
The name of the nodeset used by a job may be of interest to users
as they look at historical build results (did py38 run on focal or
bionic?). Add a column for that purpose.
Meanwhile, the node_name column has not been used in a very long
time (at least v3). It is universally null now. As a singular value,
it doesn't make a lot of sense for a multinode system. Drop it.
The build object "node_labels" field is also unused; drop it as well.
The MQTT schema is updated to match SQL.
Change-Id: Iae8444dfdd52561928c80448bc3e3158744e08e6
This moves some functions of the SQL reporter into the pipeline
manager, so that builds and buildsets are always recorded in the
database when started and when completed. The 'final' flag is
used to indicate whether a build or buildset result is user-visible
or not.
Change-Id: I053e195d120ecbb2fd89cf7e1e9fc7eccc9dcd2f
Now that the SQL database is required, fail to start if the dburi has
an error (like an incorrect module specification), and wait forever
for a connection to the database before proceeding.
This can be especially helpful in container environments where starting
Zuul may race starting a SQL database.
A test which verified that Zuul would start despite problems with the
SQL connection is removed since that is no longer the desired behavior.
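The startup behavior described above might be sketched like this: validate the dburi immediately (a bad module specification fails fast), then retry the connection indefinitely. The function name and delay are illustrative, not the actual Zuul code.

```python
import time

import sqlalchemy

def connect_forever(dburi, delay=10):
    """Fail fast on a malformed dburi, but wait forever for the
    database itself to become reachable."""
    # create_engine raises immediately on an unknown dialect/module,
    # so a configuration error is reported at startup.
    engine = sqlalchemy.create_engine(dburi)
    while True:
        try:
            # Retry until the database accepts a connection; useful in
            # container environments where the DB may start later.
            with engine.connect():
                return engine
        except sqlalchemy.exc.OperationalError:
            time.sleep(delay)
```

A container orchestrator can then start Zuul and the database in any order.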
Change-Id: Iae8ea420297f6264ae1d265b22b96d81f1df9a12
The boolean "held" attribute is set to True if a build triggered
an autohold request and its nodeset was held.
Allow filtering builds by "held" status.
Change-Id: I6517764c534f3ba8112094177aefbaa144abafae
We already have the infrastructure in place for adding warnings to the
reporting. Plumb that through to zuul_return so jobs can do that on
purpose as well. An example could be a post playbook that analyzes
performance statistics and emits a warning about inefficient usage of
the build node resources.
Change-Id: I4c3b85dc8f4c69c55cbc6168b8a66afce8b50a97
Currently the mqtt reporter uses the report url as log_url. This is
fine as long as report-build-page is disabled. As soon as
report-build-page is enabled on a tenant it reports the url to the
result page of the build. As MQTT is meant to be consumed by machines,
this breaks e.g. log post-processing.
Fix this by reporting the real log url as log_url and adding the field
web_url for use cases where the human-readable url is really required.
This also fixes a wrong indentation in the mqtt driver documentation,
resulting in all buildset.builds.* attributes being listed as buildset.*
attributes.
Change-Id: I91ce93a7000ddd0d70ce504b70742262d8239a8f
Since we added those to the MQTT reporter, we should also store them in
the SQL database. They are stored in the zuul_build table and can be
identified via the new "final" column which is set to False for those
builds (and True for all others).
The final flag can also be used in the API to specifically filter for
those builds or remove them from the result. By default, no builds are
filtered out.
The buildset API includes these retried builds under a dedicated
'retry_builds' key so as not to mix them up with the final builds. Thus,
the JSON format is identical to the one the MQTT reporter uses.
For the migration of the SQL database, all existing builds will be set
to final.
We could also provide a filter mechanism via the web UI, but that should
be part of a separate change (e.g. add a checkbox next to the search
filter saying "Show retried builds").
Change-Id: I5960df8a997c7ab81a07b9bd8631c14dbe22b8ab
This should be stored in the SQL database so that the build page
can present the reason why a particular build failed, instead of
just the result "ERROR".
Change-Id: I4dd25546e27b8d3f3a4e049f9980082a3622079f
Having the zuul event id available in the database and also in the build
and buildset detail page makes debugging a lot easier.
Change-Id: Ia1e4aaf50fb28bb27cbcfcfc3b5a92bba88fc85e
The build page needs the actual log_url returned by the job (without
any modification from success_url or failure_url) in order to create
links to the log site.
The reported success/failure URL isn't as important in this context,
and I suspect their days are numbered once we require the SQL
reporter and report the link to the build page instead. So we just
won't record those in the DB. If we feel that they are important,
we can add new columns for them.
Also, ensure it has a trailing / so that API users (including the JS
pages) are not annoyed by inconsistent data.
Change-Id: I5ea98158d204ae17280c4bf5921e2edf4483cf0a
A recent attempt to use the artifact return feature of zuul_return
exposed some rough edges. These two changes should make it much
easier to use.
First, return artifacts as a dictionary instead of a list. This
requires that they have unique names (which is no bad thing -- what
would two artifacts named "docs" mean anyway?). But mainly it allows
the dict merging behavior of zuul_return to be used, so that one
playbook may use zuul_return with some artifacts, and another playbook
may do the same, without either needing to load in the values of
the other first (assuming, of course, that they use different artifact
names).
Second, add a metadata field. In the database and API, this is JSON
serialized, but in zuul_return and zuul.artifacts, it is exploded into
separate fields. This lets jobs do things like associate versions or
tags with artifacts without having to abuse the url field.
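An illustrative shape of the dictionary form with metadata; the artifact names and fields here are examples, not a fixed schema.

```python
import json

# Artifacts keyed by unique name, so zuul_return's dict-merge behavior
# lets two playbooks each contribute artifacts independently.
artifacts = {
    "docs": {
        "url": "https://logs.example.com/build/docs/",
        "metadata": {"version": "1.2.3"},
    },
    "tarball": {
        "url": "https://logs.example.com/build/dist/app.tar.gz",
        "metadata": {"tag": "v1.2.3"},
    },
}

# In the database and API, the metadata field is JSON-serialized;
# in zuul_return and zuul.artifacts it appears as separate fields.
db_metadata = json.dumps(artifacts["docs"]["metadata"])
```

Merging a second playbook's `{"tarball": {...}}` into this dict leaves the `docs` entry untouched, which is the point of the dictionary form.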
Change-Id: I228687c1bd1c74ebc33b088ffd43f30c7309990d
Adds support for expressing artifact dependencies between jobs
which may run in different projects.
Change-Id: If8cce8750d296d607841800e4bbf688a24c40e08
The plan for the idea of a "promote" pipeline is to fetch
previously uploaded artifacts from the build log server
and move them to the final publication location. However,
jobs which store data (such as documentation builds,
tarballs, or container images) on the log server should not
need to know the configuration of the log server in order
to return the artifact URL to zuul. To support this, if
the job returns a relative URL for an artifact, assume it
is relative to the log URL for the build and combine the
two when storing the artifact info.
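The URL combination described above behaves like standard relative-URL resolution; the values below are illustrative.

```python
from urllib.parse import urljoin

# The build's log URL (known to Zuul, not to the job).
log_url = "https://logs.example.com/123/abc/"

# A job that stores docs on the log server returns a relative URL;
# it is resolved against the log URL when the artifact is stored.
full_url = urljoin(log_url, "docs/html/index.html")

# An absolute URL returned by a job is left unchanged.
external_url = urljoin(log_url, "https://tarballs.example.com/app.tgz")
```

This way jobs never need to know the log server's configuration to return a usable artifact URL.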
Change-Id: I4bce2401c9e59fd469e3b3da2973514c07faecf2
Add an artifact table to the SQL reporter, and allow builds to store
artifact URLs using zuul_return. The web API will return the urls.
Change-Id: I8adfc25cc93327ca73c98bbe170e8f39a0864f7f
We used the ORM to declare the table structure, but just performed
raw queries. As a step to adding more tables and more complex
relationships, fully utilize the ORM in queries.
Change-Id: I260095d908dc9bea03cf825f4cc8aae8ae43ba16
We had cases where zuul used unmerged job descriptions from a trusted
parent job (change A) in unrelated downstream jobs (change B) that have
no zuul.yaml changes. This happened if the trusted parent job is
not defined in the same config repo as the pipeline. E.g. if change A
adds a new post playbook an unrelated change B fails with 'post
playbook not found'. This is caused by the scheduler using the wrong
unmerged job definition of change A but the final workspace contains
the correct state without change A.
In the case of change B there is no dynamic layout, and the currently
active layout should be taken. However, it is taken directly from the
pipeline
object in getLayout (item.queue.pipeline.layout) which doesn't have
the correct layout referenced at any time while the layout referenced
by the tenant object is correct.
Because the pipeline definition is in a different repository than the
proposed config repo change, when the dynamic layout is created for
the config repo change, the previously cached Pipeline objects are used
to build the layout. These objects are the actual live pipelines, and
when they are added to the layout, they have their Pipeline.layout
attributes set to the dynamic layout. This dynamic layout is then not
used further (it is only created for syntax validation), but the pipelines
remain altered.
We could go ahead and just change that to
item.queue.pipeline.layout.tenant.layout but this feels awkward and
would leave the possibility of similar bugs that are hard to find and
debug. Further, pipeline.layout is almost everywhere used just to get
the tenant and not the layout. So this attempt to fix the bug goes
further and completely rips out the layout from the Pipeline object
and replaces it by the tenant. Because the tenant object is never
expected to change during the lifetime of the pipeline object, holding
the reference to the tenant, rather than the layout, is safe.
Change-Id: I1e663f624db5e30a8f51b56134c37cc6e8217029
This change fixes reporting of Tag event where the branch attribute was
expected:
File "zuul/scheduler.py", line 383, in onBuildCompleted
branchname = (build.build_set.item.change.branch.
AttributeError: 'Tag' object has no attribute 'branch'
File "zuul/driver/sql/sqlreporter.py", line 55, in report
branch=item.change.branch,
AttributeError: 'Tag' object has no attribute 'branch'
Change-Id: I5dbe4663c4d1e341b08a32eedbbcfb775330e881
Always store build times in UTC instead of local time. This may lead to
a mix of timezones since the old build times are still stored in local
time.
Change-Id: Ie0cfce385854caa5adbd27f7f13042e7bfd41f1b
inserted_primary_key is a list. When using postgres this leads to the
error:
psycopg2.ProgrammingError: column "buildset_id" is of type integer
but expression is of type integer[]
Using the first value of the list fixes this.
Change-Id: Idd4d1aefebab5791002dda41cce2a4de95a67d40
The schema for the zuul_buildset table has change and patchset
columns as an integer type, so use a NULL default for them when
reporting builds which lack actual data for these fields (e.g.,
those triggered by Gerrit ref-updated events). Previously this
was attempted with an empty string, but the ORM does not coerce
that to NULL so we must use None instead to achieve that.
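A minimal illustration of the coercion point (names are illustrative): the ORM maps Python None to SQL NULL, but passes an empty string through unchanged, which an integer column rejects.

```python
def ref_columns(change, patchset):
    """Build the change/patchset column values for a buildset row.
    Gerrit ref-updated events carry no change or patchset number, so
    use None (-> SQL NULL) rather than "" for the integer columns."""
    return {
        "change": change if change else None,
        "patchset": patchset if patchset else None,
    }
```

With None in place of the empty string, the insert succeeds on both mysql and postgres.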
Change-Id: Ifaa22035182f0848248f394a8d9d4b278ff23583
* Always send a build start event for noop jobs so that we get
start and end times for them
* Handle builds without start or end times in the sql reporter.
These should no longer include noop builds, but may still include
SKIPPED builds.
Test both.
Change-Id: I73eb6bda482ebb515d231492c0769d49cf6ff28a
This change adds oldrev/newrev column to the zuul_buildset sql table to
be able to query post job for a specific commit.
Change-Id: Ic9c0b260560123d080fa47c47a76fcfa31eb3607
This change adds a ref_url column to the zuul_buildset sql table to be able
to render change/ref url when querying buildset.
Change-Id: I91a9a3e5e3b362885e36fa0993b07c750adc69d3