Commit Graph

432 Commits

Author SHA1 Message Date
Zuul 617bbb229c Merge "Fix validate-tenants isolation" 2024-02-28 02:46:55 +00:00
James E. Blair c531adacae Add --keep-config-cache option to delete-state command
The circular dependency refactor will require deleting all of the
pipeline states as well as the event queues from ZK while zuul
is offline during the upgrade.  This is fairly close to the existing
"delete-state" command, except that we can keep the config cache.
Doing so will allow for a faster recovery since we won't need to
issue all of the cat jobs again in order to fetch file contents.

To facilitate this, we add a "--keep-config-cache" argument to
the "delete-state" command which will then remove everything under
/zuul except /zuul/config.

Also, speed up both operations by implementing a fast recursive
delete method which sends async delete ops depth first and only
checks their results at the end (as opposed to the standard kazoo
delete, which checks each operation as it is issued).
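
A minimal sketch of the approach (assuming a connected kazoo client; this
is illustrative, not Zuul's actual implementation):

    # Walk the tree depth first, queue async deletes children-before-parents
    # (ZooKeeper processes a session's requests in order), and only check
    # the results once everything has been submitted.
    def fast_recursive_delete(client, root):
        pending = []

        def _walk(path):
            for child in client.get_children(path):
                _walk(path + '/' + child)
            pending.append(client.delete_async(path))

        _walk(root)
        for result in pending:
            result.get()   # raises here if any delete failed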

This is added without a release note since it's not widely useful
and the upcoming change which requires its use will have a release
note with usage instructions.

Change-Id: I4db43e00a73f5e5b796261ffe7236ed906e6b421
2024-02-02 12:09:52 -08:00
James E. Blair fb7d24b245 Fix validate-tenants isolation
The validate-tenants scheduler subcommand is supposed to perform
complete tenant validation, and in doing so, it interacts with zk.
It is supposed to isolate itself from the production data, but
it appears to accidentally use the same unparsed config cache
as the production system.  This is mostly okay, but if the loading
paths are different, it could lead to writing cache errors into
the production file cache.

The error is caused because the ConfigLoader creates an internal
reference to the unparsed config cache and therefore ignores the
temporary/isolated unparsed config cache created by the scheduler.

To correct this, we will always pass the unparsed config cache
into the configloader.

Change-Id: I40bdbef4b767e19e99f58cbb3aa690bcb840fcd7
2024-01-31 14:58:45 -08:00
James E. Blair ebb7986c6f Client (old): don't translate null to 0000000
Like I9886cd44f8b4bae6f4a5ce3644f0598a73ecfe0a, have the zuul client
send actual null values for oldrev/newrev instead of 0000000 which
could lead to unintended behavior.

Change-Id: I44994426493d05a039b5a1051504958b36729c9d
2024-01-12 06:49:17 -08:00
Zuul 1edf5b6760 Merge "Fix delete-pipeline-state command" 2023-05-22 11:33:40 +00:00
Clark Boylan c1b0a00c60 Only check bwrap execution under the executor
The reason for this is that containers for zuul services need to run
privileged in order to successfully run bwrap. We currently only expect
users to run the executor as privileged, and the new bwrap execution
checks have broken other services as a result. (Other services load the
bwrap driver because it is a normal zuul driver and all drivers are
loaded by all services.)

This works around the problem by adding a check_bwrap flag to connection
setup and only setting it to true on the executor. A better long-term
follow-up would be to only instantiate the bwrap driver on the executor in
the first place. This can probably be accomplished by overriding the
ZuulApp configure_connections method in the executor and dropping bwrap
creation in ZuulApp.
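
A hedged, illustrative sketch of the shape of the change (class and helper
names are assumptions, not Zuul's actual code):

    import subprocess

    class ConnectionRegistry:
        def configure(self, config, check_bwrap=False):
            if check_bwrap:
                # Only the executor, which runs privileged, performs the
                # bwrap execution check.
                subprocess.run(['bwrap', '--version'], check=True)
            self.connections = self._load_drivers(config)

        def _load_drivers(self, config):
            # Placeholder for loading the normal driver connections.
            return {}

    # Executor:            registry.configure(config, check_bwrap=True)
    # Scheduler/web/etc.:  registry.configure(config)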

Temporarily stop running the quick-start job since it's apparently not
using speculative images.

Change-Id: Ibadac0450e2879ef1ccc4b308ebd65de6e5a75ab
2023-05-17 13:45:23 -07:00
Simon Westphahl cc2ff9742c Fix delete-pipeline-state command
This change also extends the test to verify that the pipeline change
list was re-created by asserting that the node exists in ZooKeeper.

Traceback (most recent call last):
  File "/home/westphahl/src/opendev/zuul/zuul/.nox/tests/bin/zuul-admin", line 10, in <module>
    sys.exit(main())
  File "/home/westphahl/src/opendev/zuul/zuul/zuul/cmd/client.py", line 1066, in main
    Client().main()
  File "/home/westphahl/src/opendev/zuul/zuul/zuul/cmd/client.py", line 592, in main
    if self.args.func():
  File "/home/westphahl/src/opendev/zuul/zuul/zuul/cmd/client.py", line 1045, in delete_pipeline_state
    PipelineChangeList.new(context)
  File "/home/westphahl/src/opendev/zuul/zuul/zuul/zk/zkobject.py", line 225, in new
    obj._save(context, data, create=True)
  File "/home/westphahl/src/opendev/zuul/zuul/zuul/zk/zkobject.py", line 507, in _save
    path = self.getPath()
  File "/home/westphahl/src/opendev/zuul/zuul/zuul/model.py", line 982, in getPath
    return self.getChangeListPath(self.pipeline)
AttributeError: 'PipelineChangeList' object has no attribute 'pipeline'

Change-Id: I8d7bf2fdb3ebf4790ca9cf15519dff4b761fbf2e
2023-04-26 15:58:32 +02:00
Zuul 987fba9f28 Merge "Fix prune-database command" 2023-03-30 01:49:54 +00:00
James E. Blair 7153505cd5 Fix prune-database command
This command had two problems:

* It would only delete the first 50 buildsets
* Depending on DB configuration, it may not have deleted anything, or it
  may have left orphan data.

We did not tell sqlalchemy to cascade delete operations, meaning that
when we deleted the buildset, we didn't delete anything else.

If the database enforces foreign keys (innodb, psql) then the command
would have failed.  If it doesn't (myisam) then it would have deleted
the buildset rows but not anything else.

The tests use myisam, so they ran without error and without deleting
the builds.  They check that the builds are deleted, but only through
the ORM via a joined load with the buildsets, and since the buildsets
are gone, the builds weren't returned.

To address this shortcoming, the tests now use distinct ORM methods
which return objects without any joins.  This would have caught
the error had it been in place before.

Additionally, the delete operation retained the default limit of 50
rows (set in place for the web UI), meaning that when it did run,
it would only delete the most recent 50 matching builds.

We now explicitly set the limit to a user-configurable batch size
(by default, 10,000 builds) so that we keep transaction sizes
manageable and avoid monopolizing database locks.  We continue deleting
buildsets in batches as long as any matching buildsets remain. This
should allow users to remove very large amounts of data without
affecting ongoing operations too much.
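
A minimal sketch of the two fixes (cascading deletes plus batched
deletion); the model and column names here are illustrative, not Zuul's
actual schema:

    import sqlalchemy as sa
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class BuildSet(Base):
        __tablename__ = 'example_buildset'
        id = sa.Column(sa.Integer, primary_key=True)
        # Cascade so deleting a buildset also deletes its builds.
        builds = relationship('Build', cascade='all, delete-orphan')

    class Build(Base):
        __tablename__ = 'example_build'
        id = sa.Column(sa.Integer, primary_key=True)
        buildset_id = sa.Column(sa.Integer,
                                sa.ForeignKey('example_buildset.id'))

    def prune(session, cutoff_id, batch_size=10000):
        # Keep deleting matching buildsets in batches until none remain.
        while True:
            batch = (session.query(BuildSet)
                     .filter(BuildSet.id < cutoff_id)
                     .limit(batch_size)
                     .all())
            if not batch:
                break
            for buildset in batch:
                session.delete(buildset)   # ORM cascade removes the builds
            session.commit()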

Change-Id: I4c678b294eeda25589b75ab1ce7c5d0b93a07df3
2023-03-29 17:12:13 -07:00
James E. Blair b1490b1d8e Avoid layout updates after delete-pipeline-state
The delete-pipeline-state command forces a layout update on every
scheduler, but that isn't strictly necessary.  While it may be helpful
for some issues, if it really is necessary, the operator can issue
a tenant reconfiguration after performing the delete-pipeline-state.

In most cases, where only the state information itself is causing a
problem, we can omit the layout updates and assume that the state reset
alone is sufficient.

To that end, this change removes the layout state changes from the
delete-pipeline-state command and instead simply empties and recreates
the pipeline state and change list objects.  This is very similar to
what happens in the pipeline manager _postConfig call, except in this
case, we have the tenant lock so we know we can write with impunity,
and we know we are creating objects in ZK from scratch, so we use
direct create calls.

We set the pipeline state's layout uuid to None, which will cause the
first scheduler that comes across it to (assuming its internal layout
is up to date) perform a pipeline reset (which is almost a noop on an
empty pipeline) and update the pipeline state layout to the current
tenant layout state.

Change-Id: I1c503280b516ffa7bbe4cf456d9c900b500e16b0
2023-03-01 13:54:46 -08:00
James E. Blair 7a8882c642 Set layout state event ltime in delete-pipeline-state
The delete-pipeline-state command updates the layout state in order
to force schedulers to update their local layout (essentially perform
a local-only reconfiguration).  In doing so, it sets the last event
ltime to -1.  This is reasonable for initializing a new system, but
in an existing system, when an event arrives at the tenant trigger
event queue it is assigned the last reconfiguration event ltime seen
by that trigger event queue.  Later, when a scheduler processes such
a trigger event after the delete-pipeline-state command has run, it
will refuse to handle the event since it arrived much later than
its local layout state.

This must then be corrected manually by the operator by forcing a
tenant reconfiguration.  This means that the system essentially suffers
the delay of two sequential reconfigurations before it can proceed.

To correct this, set the last event ltime for the layout state to
the ltime of the layout state itself.  This means that once a scheduler
has updated its local layout, it can proceed in processing old events.

Change-Id: I66e798adbbdd55ff1beb1ecee39c7f5a5351fc4b
2023-02-28 07:11:41 -08:00
James E. Blair 8f774043e6 Use importlib for versioning
The semver parsing in PBR doesn't handle the full suite of pep440
versions (for example: 1.2.3+foo1 is the pep440 recommended way
of handling local versions).

Since we aren't doing anything with the parsed versions anyway,
just return the string we get from importlib.
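
A minimal sketch of the approach (assuming Python 3.8+ importlib.metadata;
not necessarily Zuul's exact code):

    from importlib import metadata

    def get_version():
        try:
            # Return the installed version string as-is, without parsing it.
            return metadata.distribution('zuul').version
        except metadata.PackageNotFoundError:
            return 'unknown'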

Change-Id: I0a838c639333c40db5b12cd852b921f1b1c87fed
2023-01-23 10:51:08 -08:00
James E. Blair 3780ed548c Unpin JWT and use integer IAT values
PyJWT 2.6.0 began performing validation of iat (issued at) claims
in 9cb9401cc5

I believe the intent of RFC7519 is to support any numeric values
(including floating point) for iat, nbf, and exp; however, the
PyJWT library has made the assumption that the values should be
integers, and therefore when we supply an iat with decimal seconds,
PyJWT will round down when validating the value. In our unit tests,
this can cause validation errors.

In order to avoid any issues, we will round down the times that
we supply when generating JWT tokens and supply them as integers
in accordance with the robustness principle.
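
A hedged sketch of the idea (the claim set here is illustrative):

    import time
    import jwt  # PyJWT

    now = int(time.time())  # round down to whole seconds
    token = jwt.encode(
        {'iat': now, 'nbf': now, 'exp': now + 600, 'sub': 'admin'},
        'secret', algorithm='HS256')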

Change-Id: Ia8341b4d5de827e2df8878f11f2d1f52a1243cd4
2022-11-15 13:52:53 -08:00
James E. Blair 3a981b89a8 Parallelize some pipeline refresh ops
We may be able to speed up pipeline refreshes in cases where there
are large numbers of items or jobs/builds by parallelizing ZK reads.

Quick refresher: the ZK protocol is async, and kazoo uses a queue to
send operations to a single thread which manages IO.  We typically
call synchronous kazoo client methods which wait for the async result
before returning.  Since this is all thread-safe, we can attempt to
fill the kazoo pipe by having multiple threads call the synchronous
kazoo methods.  If kazoo is waiting on IO for an earlier call, it
will be able to start a later request simultaneously.

Quick aside: it would be difficult for us to use the async methods
directly since our overall code structure is still ordered and
effectively single threaded (we need to load a QueueItem before we
can load the BuildSet and the Builds, etc).

Thus it makes the most sense for us to retain our ordering by using
a ThreadPoolExecutor to run some operations in parallel.
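
A minimal sketch of the pattern (not the actual Zuul code), assuming a
connected, thread-safe kazoo client:

    from concurrent.futures import ThreadPoolExecutor

    def parallel_get(client, paths, max_workers=4):
        # Each worker makes a blocking kazoo call; the single kazoo IO
        # thread can have several requests in flight at once, so the
        # reads overlap instead of being issued strictly one at a time.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(client.get, paths))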

This change parallelizes loading QueueItems within a ChangeQueue,
and also Builds/Jobs within a BuildSet.  These are the points in
a pipeline refresh tree which potentially have the largest number
of children and could benefit the most from the change, especially
if the ZK server has some measurable latency.

Change-Id: I0871cc05a2d13e4ddc4ac284bd67e5e3003200ad
2022-11-09 10:51:29 -08:00
James E. Blair 1eda9ccf96 Correct exit routine in web, merger
Change I216b76d6aaf7ebd01fa8cca843f03fd7a3eea16d unified the
service stop sequence but omitted changes to zuul-web.  Update
zuul-web to match and make its sequence more robust.

Also remove unnecessary sys.exit calls from the merger.

Change-Id: Ifdebc17878aa44d57996e4bdd46e49e6144b406b
2022-10-05 13:25:07 -07:00
James E. Blair 9a279725f9 Strictly sequence reconfiguration events
In the before times when we only had a single scheduler, it was
naturally the case that reconfiguration events were processed as they
were encountered and no trigger events which arrived after them would
be processed until the reconfiguration was complete.  As we added more
event queues to support SOS, it became possible for trigger events
which arrived at the scheduler to be processed before a tenant
reconfiguration caused by a preceding event to be complete.  This is
now even possible with a single scheduler.

As a concrete example, imagine a change merges which updates the jobs
which should run on a tag, and then a tag is created.  A scheduler
will process both of those events in succession.  The first will cause
it to submit a tenant reconfiguration event, and then forward the
trigger event to any matching pipelines.  The second event will also
be forwarded to pipeline event queues.  The pipeline events will then
be processed, and then only at that point will the scheduler return to
the start of the run loop and process the reconfiguration event.

To correct this, we can take one of two approaches: make the
reconfiguration more synchronous, or make it safer to be
asynchronous.  To make reconfiguration more synchronous, we would need
to be able to upgrade a tenant read lock into a tenant write lock
without releasing it.  The lock recipes we use from kazoo do not
support this.  While it would be possible to extend them to do so, it
would lead us further from parity with the upstream kazoo recipes, so
this approach is not used.

Instead, we will make it safer for reconfiguration to be asynchronous
by annotating every trigger event we forward with the last
reconfiguration event that was seen before it.  This means that every
trigger event now specifies the minimum reconfiguration time for that
event.  If our local scheduler has not reached that time, we should
stop processing trigger events and wait for it to catch up.  This
means that schedulers may continue to process events up to the point
of a reconfiguration, but will then stop.  The already existing
short-circuit to abort processing once a scheduler is ready to
reconfigure a tenant (where we check the tenant write lock contenders
for a waiting reconfiguration) helps us get out of the way of pending
reconfigurations as well.  In short, once a reconfiguration is ready
to start, we won't start processing tenant events anymore because of
the existing lock check.  And up until that happens, we will process
as many events as possible until any further events require the
reconfiguration.

We will use the ltime of the tenant trigger event as our timestamp.
As we forward tenant trigger events to the pipeline trigger event
queues, we decide whether an event should cause a reconfiguration.
Whenever one does, we note the ltime of that event and store it as
metadata on the tenant trigger event queue so that we always know what
the most recent required minimum ltime is (ie, the ltime of the most
recently seen event that should cause a reconfiguration).  Every event
that we forward to the pipeline trigger queue will be annotated to
specify that its minimum required reconfiguration ltime is that most
recently seen ltime.  And each time we reconfigure a tenant, we store
the ltime of the event that prompted the reconfiguration in the layout
state.  If we later process a pipeline trigger event with a minimum
required reconfigure ltime greater than the current one, we know we
need to stop and wait for a reconfiguration, so we abort early.

Because this system involves several event queues and objects each of
which may be serialized at any point during a rolling upgrade, every
involved object needs to have appropriate default value handling, and
a synchronized model api change is not helpful.  The remainder of this
commit message is a description of what happens with each object when
handled by either an old or new scheduler component during a rolling
upgrade.

When forwarding a trigger event and submitting a tenant
reconfiguration event:

The tenant trigger event zuul_event_ltime is initialized
from zk, so will always have a value.

The pipeline management event trigger_event_ltime is initialized to the
tenant trigger event zuul_event_ltime, so a new scheduler will write
out the value.  If an old scheduler creates the tenant reconfiguration
event, it will be missing the trigger_event_ltime.

The _reconfigureTenant method is called with a
last_reconfigure_event_ltime parameter, which is either the
trigger_event_ltime above in the case of a tenant reconfiguration
event forwarded by a new scheduler, or -1 in all other cases
(including other types of reconfiguration, or a tenant reconfiguration
event forwarded by an old scheduler).  If it is -1, it will use the
current ltime so that if we process an event from an old scheduler
which is missing the event ltime, or we are bootstrapping a tenant or
otherwise reconfiguring in a context where we don't have a triggering
event ltime, we will use an ltime which is very new so that we don't
defer processing trigger events.  We also ensure we never go backward,
so that if we process an event from an old scheduler (and thus use the
current ltime) then process an event from a new scheduler with an
older (than "now") ltime, we retain the newer ltime.

Each time a tenant reconfiguration event is submitted, the ltime of
that reconfiguration event is stored on the trigger event queue.  This
is then used as the min_reconfigure_ltime attribute on the forwarded
trigger events.  This is updated by new schedulers, and ignored by old
ones, so if an old scheduler processes a tenant trigger event queue, it
won't update the min ltime.  That will just mean that any events
processed by a new scheduler may continue to use an older ltime as
their minimum, which should not cause a problem.  Any events forwarded
by an old scheduler will omit the min_reconfigure_ltime field; that
field will be initialized to -1 when loaded on a new scheduler.

When processing pipeline trigger events:

In process_pipeline_trigger_queue we compare two values: the
last_reconfigure_event_ltime on the layout state which is either set
to a value as above (by a new scheduler), or will be -1 if it was last
written by an old scheduler (including in the case it was overwritten
by an old scheduler; it will re-initialize to -1 in that case).  The
event.min_reconfigure_ltime field will either be the most recent
reconfiguration ltime seen by a new scheduler forwarding trigger
events, or -1 otherwise.  If the min_reconfigure_ltime of an event is
-1, we retain the old behavior of processing the event regardless.
Only if we have a min_reconfigure_ltime > -1 and it is greater than
the layout state last_reconfigure_event_ltime (which itself may be -1,
and thus less than the min_reconfigure_ltime) do we abort processing
the event.
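
A hedged sketch of the comparison (the attribute names follow the commit
message; the surrounding code is illustrative):

    def should_defer(event, layout_state):
        # -1 means the value was written by an old scheduler (or is unset);
        # in that case we retain the old behavior and process the event.
        min_ltime = getattr(event, 'min_reconfigure_ltime', -1)
        last_ltime = getattr(layout_state,
                             'last_reconfigure_event_ltime', -1)
        return min_ltime > -1 and min_ltime > last_ltime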

(The test_config_update test for the Gerrit checks plugin is updated
to include an extra waitUntilSettled since a potential test race was
observed during development.)

Change-Id: Icb6a7858591ab867e7006c7c80bfffeb582b28ee
2022-07-18 10:51:59 -07:00
Zuul c37047fa92 Merge "Replace 'web' section with 'webclient'" 2022-07-01 08:10:57 +00:00
James E. Blair 603b826911 Add --wait-for-init scheduler option
This instructs the scheduler to wait until all tenants have been
initialized before processing pipelines.  This can be useful for
large systems with excess scheduler capacity to speed up a rolling
restart.

This also removes an unused instance variable from
SchedulerTestManager.

Change-Id: I19e733c881d1abf636674bf572f4764a0d018a8a
2022-06-18 07:57:49 -07:00
Vitaliy Lotorev ab68665f12 Replace 'web' section with 'webclient'
'web' section is used by zuul-web component while zuul REST API
client uses 'webclient' section.

Change-Id: I145c9270ca6676abd0d4977ce1c4c637d304a264
2022-06-05 17:47:17 +03:00
Zuul 6cb2692101 Merge "Add prune-database command" 2022-06-01 21:28:53 +00:00
James E. Blair 3ffbf10f25 Add prune-database command
This adds a zuul-admin command which allows operators to delete old
database entries.

Change-Id: I4e277a07394aa4852a563f4c9cdc39b5801ab4ba
2022-05-30 07:31:16 -07:00
James E. Blair 591d7e624a Unify service stop sequence
We still had some variations in how services stop.  Finger, merger,
and scheduler all used signal.pause in a while loop which is
incompatible with stopping via the command socket (since we would
always restart the pause).  Sending these components a stop or
graceful signal would cause them to wait forever.

Instead of using signal.pause, use the thread.join methods within
a while loop, and if we encounter a KeyboardInterrupt (C-c) during
the join, call our exit handler and retry the join loop.

This maintains the intent of the signal.pause loop (which is to
make C-c exit cleanly) while also being compatible with an internal
stop issued via the command socket.
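
A minimal sketch of the loop described above (not the actual Zuul code):

    def run_until_stopped(threads, exit_handler):
        # Join instead of signal.pause so an internal stop via the command
        # socket also unblocks us, while C-c still exits cleanly.
        while True:
            try:
                for thread in threads:
                    thread.join()
                return
            except KeyboardInterrupt:
                exit_handler()   # then retry the join loop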

The stop sequence is now unified across all components.  The executor
has an additional complication in that it forks a process to handle
streaming.  To keep a C-c shutdown clean, we also handle a keyboard
interrupt in the child process and use it to indicate the start of
a shutdown.  In the main executor process, we now close the socket
which is used to keep the child running and then wait for the child
to exit before the main process exits (so that the child doesn't
keep running and emit a log line after the parent returns control
to the terminal).

Change-Id: I216b76d6aaf7ebd01fa8cca843f03fd7a3eea16d
2022-05-28 10:27:50 -07:00
Matthieu Huin 57c78c08e1 Clarify zuul admin CLI scope
We have two CLIs: zuul-client for REST-related operations, which covers
tenant-scoped, workflow-modifying actions such as enqueue, dequeue and
promote; and zuul, which overlaps with zuul-client and also covers true admin
operations like ZooKeeper maintenance, config checking and issuing auth tokens.
This is a bit confusing for users and operators, and can induce code
duplication.

* Rename the zuul CLI to zuul-admin. zuul is still a valid endpoint
  and will be removed after next release.
* Print a deprecation warning when invoking the admin CLI as zuul
  instead of zuul-admin, and when running autohold-*, enqueue-*,
  dequeue and promote subcommands. These subcommands will need to be
  run with zuul-client after next release.
* Clarify the scopes and deprecations in the documentation.

Change-Id: I90cf6f2be4e4c8180ad0f5e2696b7eaa7380b411
2022-05-19 15:35:30 +02:00
James E. Blair 864a2b7701 Make a global component registry
We generally try to avoid global variables, but in this case, it
may be helpful to set the component registry as a global variable.

We need the component registry to determine the ZK data model API
version.  It's relatively straightforward to pass it through the
zkcontext for zkobjects, but we also may need it in other places
where we might alter processing of data we previously got from zk
(eg, the semaphore cleanup).  Or we might need it in serialize or
deserialize methods of non-zkobjects (for example, ChangeKey).

To account for all potential future uses, instantiate a global
singleton object which holds a registry and use that instead of
local-scoped component registry objects.  We also add a clear
method so that we can be sure unit tests start with clean data.
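
A hedged, illustrative sketch of the pattern (names are assumptions, not
Zuul's actual classes):

    class GlobalComponentRegistry:
        """Module-level holder for the component registry."""

        def __init__(self):
            self.registry = None

        def create(self, registry):
            self.registry = registry
            return self.registry

        def clear(self):
            # Lets unit tests start from clean data.
            self.registry = None

    COMPONENT_REGISTRY = GlobalComponentRegistry()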

Change-Id: Ib764dbc3a3fe39ad6d70d4807b8035777d727d93
2022-02-14 10:58:34 -08:00
James E. Blair a160484a86 Add zuul-scheduler tenant-reconfigure
This is a new reconfiguration command which behaves like full-reconfigure
but only for a single tenant.  This can be useful after connection issues
with code hosting systems, or potentially with Zuul cache bugs.

Because this is the first command-socket command with an argument, some
command-socket infrastructure changes are necessary.  Additionally, this
includes some minor changes to make the services more consistent around
socket commands.

Change-Id: Ib695ab8e7ae54790a0a0e4ac04fdad96d60ee0c9
2022-02-08 14:14:17 -08:00
James E. Blair 29fbee7375 Add a model API version
This is a framework for making upgrades to the ZooKeeper data model
in a manner that can support a rolling Zuul system upgrade.

Change-Id: Iff09c95878420e19234908c2a937e9444832a6ec
2022-01-27 12:19:11 -08:00
Zuul 4808bc025e Merge "Add "zuul delete-pipeline-state" command" 2022-01-27 11:26:26 +00:00
James E. Blair 65da4efdd4 Add "zuul delete-pipeline-state" command
This is intended to aid Zuul developers who are diagnosing a bug
with a running Zuul and who have determined that Zuul may be able to
correct the situation and resume if a pipeline is completely reset.

It is intrusive and not at all guaranteed to work.  It may make things
worse.  It's basically just a convenience method to avoid firing up
the REPL and issuing Python commands directly.  I can't enumerate the
requirements where it may or may not work.  Therefore the documentation
recommends against its use and there is no release note included.

Nevertheless, we may find it useful to have such a command during
a crisis in the future.

Change-Id: Ib637c31ff3ebbb2733a4ad9b903075e7b3dc349c
2022-01-26 16:36:04 -08:00
James E. Blair 215c96f500 Remove gearman server
The gearman server is no longer required.  Remove it from tests and
the scheduler.

Change-Id: I34eda003889305dadec471930ab277e31d78d9fe
2022-01-25 06:44:17 -08:00
James E. Blair 3aa546da86 Remove the rpc client and listener
These are not used any more, remove them from the scheduler and
the "zuul" client.

Change-Id: I5a3217dde32c5f8fefbb0a7a8357a737494d2956
2022-01-25 06:44:09 -08:00
Tristan Cacqueray cb13bdb90c Remove ZooKeeperClient for tenant-conf-check
This change enables running the tenant-conf-check without access
to the ZooKeeper service.

Change-Id: I285cd44f86e5d900715b052b13bf7b2bc58e77a4
2022-01-10 20:04:02 +00:00
James E. Blair 704fef6cb9 Add readiness/liveness probes to prometheus server
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes.  We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).

The prometheus_client library doesn't support this directly, so we need
to handle this ourselves.  We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.

The prometheus_client will actually return the metrics on any path
given to it.  In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI which is what we specified in our documentation,
and also '/' which is very likely accidentally used by users.
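
A hedged sketch of the wrapper (the probe paths and state names are
assumptions beyond what the text above states; not Zuul's actual code):

    from prometheus_client import make_wsgi_app

    def make_health_app(get_state):
        metrics_app = make_wsgi_app()

        def app(environ, start_response):
            path = environ.get('PATH_INFO', '/')
            if path == '/health/live':
                # Live as long as the process can answer at all.
                start_response('200 OK', [('Content-Type', 'text/plain')])
                return [b'OK']
            if path == '/health/ready':
                ready = get_state() in ('running', 'paused')
                status = '200 OK' if ready else '503 Service Unavailable'
                start_response(status, [('Content-Type', 'text/plain')])
                return [b'OK' if ready else b'NOT READY']
            if path in ('/metrics', '/'):
                # Only serve metrics on the documented path (and '/').
                return metrics_app(environ, start_response)
            start_response('404 Not Found',
                           [('Content-Type', 'text/plain')])
            return [b'']
        return app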

Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
2021-12-09 07:37:29 -08:00
Clark Boylan 5b1ba567c8 Prevent duplicate config file entries
It is currently possible to list default zuul config file paths in the
extra-config-paths config directive. Doing so will duplicate the configs
in Zuul which can cause problems. Prevent this entirely via
configuration validation.
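
A minimal sketch of the validation (the schema shape and the default path
list are assumptions, not Zuul's actual code):

    import voluptuous as vs

    DEFAULT_CONFIG_PATHS = ('zuul.yaml', 'zuul.d/', '.zuul.yaml', '.zuul.d/')

    def extra_config_paths(paths):
        if any(p in DEFAULT_CONFIG_PATHS for p in paths):
            raise vs.Invalid(
                'extra-config-paths may not include a default config path')
        return paths

    project_schema = vs.Schema(
        {'extra-config-paths': extra_config_paths}, extra=vs.ALLOW_EXTRA)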

Note: There has been a bit of refactoring to ensure that the voluptuous
schema is validated when reading the config. This ensures that an
invalid config doesn't produce hard-to-understand error messages because
loadTPCs() has attempted to process configuration that isn't valid.
Instead, we can catch schema errors early and report them with
human-friendly messages.

Change-Id: I07e9d4d3614cbc6cdee06b2866f7ae41d7779135
2021-11-15 15:16:25 -08:00
Simon Westphahl 59edeaf3d1 Use pipeline summary from Zookeeper in zuul-web
With this change zuul-web will generate the status JSON on its own by
directly using the data from Zookeeper. This includes the event queue
lengths as well as the pipeline summary.

Change-Id: Ib80d9c019a15dd9de9d694cb62fd34030016c311
2021-11-10 09:49:48 +01:00
Felix Edel 791c99f64f Load system config and tenant layouts in zuul-web
This uses the configloader in zuul-web to load the system config and
tenant layouts directly from ZooKeeper.

Doing so will allow us to provide the necessary information for most API
endpoints directly in zuul-web without the need to ask the scheduler via
RPC for it.

Change-Id: I4fe19c4e41f3357a07b2fda939c5ffb4e7055e37
2021-11-10 09:25:45 +01:00
Felix Edel 3029b16489 Make the ConfigLoader work independently of the Scheduler
This is an early preparation step for removing the RPC calls between
zuul-web and the scheduler.

We want to format the status JSON and do the job freezing (job freezing
API) directly in zuul-web without utilising the scheduler via RPC. In
order to make this work, zuul-web must instantiate a ConfigLoader.
Currently this would require a scheduler instance, which is not available
in zuul-web, so we have to make this parameter optional.

Change-Id: I41214086aaa9d822ab888baf001972d2846528be
2021-11-10 09:15:53 +01:00
Felix Edel 2c900c2c4a Split up registerScheduler() and onLoad() methods
This is an early preparation step for removing the RPC calls between
zuul-web and the scheduler.

In order to do so we must initialize the ConfigLoader in zuul-web which
requires all connections to be available. Therefore, this change ensures
that we can load all connections in zuul-web without providing a
scheduler instance.

To avoid unnecessary traffic from a zuul-web instance, the onLoad()
method initializes the change cache only if a scheduler instance is
available on the connection.

Change-Id: I3c1d2995e81e17763ae3454076ab2f5ce87ab1fc
2021-11-09 09:17:43 +01:00
Clark Boylan d7bca47d35 Cleanup empty secrets dirs when deleting secrets
The zuul delete-keys command can leave us with empty org and project
dirs in ZooKeeper. When this happens, the zuul export-keys command
complains about secrets not being present. Address this by checking if
the project dir and org dir should be cleaned up when calling
delete-keys.
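
A hedged sketch of the cleanup (the keystorage path layout is an
assumption), assuming a connected kazoo client:

    def delete_project_keys(client, connection, org, project):
        base = f'/keystorage/{connection}'
        # Remove the project's keys and its directory node.
        client.delete(f'{base}/{org}/{project}', recursive=True)
        # If the org node is now empty, remove it as well so export-keys
        # doesn't trip over an orphaned, key-less directory.
        if not client.get_children(f'{base}/{org}'):
            client.delete(f'{base}/{org}')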

Note this happened to OpenDev after renaming all projects from foo/* to
bar/*, orphaning the org-level portion of the name.

Change-Id: I6bba5ea29a752593b76b8e58a0d84615cc639346
2021-10-19 09:38:21 -07:00
Albin Vass 6e96fcfc67 Exit successfully when manipulating project keys
Change-Id: Idb2918fab4d17aa611bf81f42d5b86abc865514f
2021-09-21 16:04:29 +02:00
James E. Blair e2dd49b5be Add delete-state command to delete everything from ZK
This will give operators a tool for manual recovery in case of
emergency.

Change-Id: Ia84beb08b685f59a24f76cb0b6adf518f6e64362
2021-08-24 10:07:41 -07:00
James E. Blair a0af6004de Add copy-keys and delete-keys zuul client commands
These can be used when renaming a project.

Change-Id: I98cf304914449622f9db48651b83e0744b676498
2021-08-24 10:07:41 -07:00
James E. Blair 49d945b5bd Add commands to export/import keys to/from ZK
This removes the filesystem-based keystore in favor of only using
ZooKeeper.  Zuul will no longer load missing keys from the filesystem,
nor will it write out decrypted copies of all keys to the filesystem.

This is more secure since it allows sites better control over when and
where secret data are written to disk.

To provide for system backups to aid in disaster recovery in the case
that the ZK data store is lost, two new scheduler commands are added:

* export-keys
* import-keys

These write the password-protected versions of the keys (in fact, a
raw dump of the ZK data) to the filesystem, and read the same data
back in.  An administrator can invoke export-keys before performing a
system backup, and run import-keys to restore the data.

A minor doc change recommending the use of ``zuul-scheduler stop`` was
added as well; this is left over from a previous version of this change
but warrants updating.

This also removes the test_keystore test file; key generation is tested
in test_v3, and key usage is tested by tests which have encrypted secrets.

Change-Id: I5e6ea37c94ab73ec6f850591871c4127118414ed
2021-08-24 10:07:41 -07:00
Zuul 970e4ed438 Merge "Move sigterm_method to zuul.conf" 2021-08-23 18:22:27 +00:00
Zuul 812c2250bc Merge "Add graceful stop environment variable" 2021-08-23 18:22:25 +00:00
James E. Blair d80555a453 Move sigterm_method to zuul.conf
Instead of using an environment variable for this particular
setting, do what we do for every other aspect of Zuul behavior:
use a setting in zuul.conf.

Change-Id: I5c075dce5b6ad23adc863252af67d7ee7ad0d4d5
2021-08-12 14:22:39 -07:00
Zuul cdcb895323 Merge "Move fingergw config to fingergw" 2021-07-24 12:54:01 +00:00
James E. Blair 7256c52c34 Add graceful stop environment variable
Add an environment variable that lets users (especially container
image users) easily select which way they would like zuul-executor
to handle SIGTERM.

Previous change: I8d42ea1c19f3e627bbfd32a535493de0cb8a04be

Change-Id: Ie15b333712302a3d8f468b083d071d29a7b9043d
2021-07-09 10:36:22 -07:00
James E. Blair 657d8c6fb2 Revert "Add graceful stop environment variable"
This reverts commit f1fca03fd1.

This needs more discussion.

Change-Id: Iebf5c01e4436899a9d6e37150337dcdb4cf9705f
2021-07-09 10:25:47 -07:00
Zuul 2743cb269b Merge "Add graceful stop environment variable" 2021-07-09 16:15:18 +00:00
James E. Blair f1fca03fd1 Add graceful stop environment variable
Add an environment variable that lets users (especially container
image users) easily select which way they would like zuul-executor
to handle SIGTERM.

Change-Id: I8d42ea1c19f3e627bbfd32a535493de0cb8a04be
2021-07-09 08:02:15 -07:00