67 Commits

Author SHA1 Message Date
James E. Blair 1eda9ccf96 Correct exit routine in web, merger
Change I216b76d6aaf7ebd01fa8cca843f03fd7a3eea16d unified the
service stop sequence but omitted changes to zuul-web.  Update
zuul-web to match and make its sequence more robust.

Also remove unnecessary sys.exit calls from the merger.

Change-Id: Ifdebc17878aa44d57996e4bdd46e49e6144b406b
2022-10-05 13:25:07 -07:00
James E. Blair 603b826911 Add --wait-for-init scheduler option
This instructs the scheduler to wait until all tenants have been
initialized before processing pipelines.  This can be useful for
large systems with excess scheduler capacity to speed up a rolling
restart.

This also removes an unused instance variable from
SchedulerTestManager.

Change-Id: I19e733c881d1abf636674bf572f4764a0d018a8a
2022-06-18 07:57:49 -07:00
James E. Blair 591d7e624a Unify service stop sequence
We still had some variations in how services stop.  Finger, merger,
and scheduler all used signal.pause in a while loop which is
incompatible with stopping via the command socket (since we would
always restart the pause).  Sending these components a stop or
graceful signal would cause them to wait forever.

Instead of using signal.pause, use the thread.join methods within
a while loop, and if we encounter a KeyboardInterrupt (C-c) during
the join, call our exit handler and retry the join loop.

This maintains the intent of the signal.pause loop (which is to
make C-c exit cleanly) while also being compatible with an internal
stop issued via the command socket.
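
The join loop described above can be sketched roughly as follows. This is only an illustration, not the Zuul implementation: the Service class, run_until_stopped and the one-second join timeout are all hypothetical names and choices made for the example.

```python
import threading


class Service:
    """Hypothetical stand-in for a long-running component."""

    def __init__(self):
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        # Idle until something asks the service to stop.
        self._stop_event.wait()

    def start(self):
        self._thread.start()

    def stop(self):
        # May be called from an exit handler or from a command socket thread.
        self._stop_event.set()

    def join(self, timeout=None):
        # Return True once the worker thread has finished.
        self._thread.join(timeout)
        return not self._thread.is_alive()


def run_until_stopped(service, exit_handler):
    """Block on join(); on C-c call the exit handler and retry the join."""
    while True:
        try:
            if service.join(timeout=1.0):
                return
        except KeyboardInterrupt:
            exit_handler()
```

Unlike a signal.pause loop, this returns as soon as the worker thread ends, whether the stop came from C-c or from another thread.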

The stop sequence is now unified across all components.  The executor
has an additional complication in that it forks a process to handle
streaming.  To keep a C-c shutdown clean, we also handle a keyboard
interrupt in the child process and use it to indicate the start of
a shutdown.  In the main executor process, we now close the socket
which is used to keep the child running and then wait for the child
to exit before the main process exits (so that the child doesn't
keep running and emit a log line after the parent returns control
to the terminal).

Change-Id: I216b76d6aaf7ebd01fa8cca843f03fd7a3eea16d
2022-05-28 10:27:50 -07:00
James E. Blair a160484a86 Add zuul-scheduler tenant-reconfigure
This is a new reconfiguration command which behaves like full-reconfigure
but only for a single tenant.  This can be useful after connection issues
with code hosting systems, or potentially with Zuul cache bugs.

Because this is the first command-socket command with an argument, some
command-socket infrastructure changes are necessary.  Additionally, this
includes some minor changes to make the services more consistent around
socket commands.

Change-Id: Ib695ab8e7ae54790a0a0e4ac04fdad96d60ee0c9
2022-02-08 14:14:17 -08:00
James E. Blair 29fbee7375 Add a model API version
This is a framework for making upgrades to the ZooKeeper data model
in a manner that can support a rolling Zuul system upgrade.

Change-Id: Iff09c95878420e19234908c2a937e9444832a6ec
2022-01-27 12:19:11 -08:00
James E. Blair 215c96f500 Remove gearman server
The gearman server is no longer required.  Remove it from tests and
the scheduler.

Change-Id: I34eda003889305dadec471930ab277e31d78d9fe
2022-01-25 06:44:17 -08:00
James E. Blair 704fef6cb9 Add readiness/liveness probes to prometheus server
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes.  We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).

The prometheus_client library doesn't support this directly, so we need
to handle this ourselves.  We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.

The prometheus_client will actually return the metrics on any path
given to it.  In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI which is what we specified in our documentation,
and also '/' which is very likely accidentally used by users.
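
A minimal sketch of the wrapping approach, under several assumptions: metrics_app is a stand-in for the prometheus_client WSGI app, the probe paths '/health/live' and '/health/ready' are guessed from the typo example above, and the 503 and 404 responses are illustrative choices rather than confirmed behavior.

```python
def metrics_app(environ, start_response):
    """Stand-in for the metrics WSGI app, which answers on any path."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"# metrics\n"]


class ProbeWrapper:
    """Wrap a metrics WSGI app and add liveness/readiness endpoints."""

    READY_STATES = {"running", "paused"}

    def __init__(self, wrapped, get_state):
        self.wrapped = wrapped
        self.get_state = get_state  # callable returning the component state

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path in ("/metrics", "/"):
            # Only these two paths reach the wrapped metrics app.
            return self.wrapped(environ, start_response)
        if path == "/health/live":
            # Answering at all means the process is running.
            return self._reply(start_response, "200 OK", b"OK\n")
        if path == "/health/ready":
            if self.get_state() in self.READY_STATES:
                return self._reply(start_response, "200 OK", b"OK\n")
            return self._reply(
                start_response, "503 Service Unavailable", b"Not ready\n")
        return self._reply(start_response, "404 Not Found", b"Not found\n")

    @staticmethod
    def _reply(start_response, status, body):
        headers = [("Content-Type", "text/plain"),
                   ("Content-Length", str(len(body)))]
        start_response(status, headers)
        return [body]


def call(app, path):
    """Invoke a WSGI app for one path and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

With this shape, a mistyped probe path such as '/healthy/ready' gets a 404 instead of the metrics page.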

Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
2021-12-09 07:37:29 -08:00
Felix Edel 2c900c2c4a Split up registerScheduler() and onLoad() methods
This is an early preparation step for removing the RPC calls between
zuul-web and the scheduler.

In order to do so we must initialize the ConfigLoader in zuul-web which
requires all connections to be available. Therefore, this change ensures
that we can load all connections in zuul-web without providing a
scheduler instance.

To avoid unnecessary traffic from a zuul-web instance the onLoad()
method initializes the change cache only if a scheduler instance is
available on the connection.

Change-Id: I3c1d2995e81e17763ae3454076ab2f5ce87ab1fc
2021-11-09 09:17:43 +01:00
Simon Westphahl db0bf681d5 Move tenant validation to separate method
In order to simplify and clean up the reconfiguration event handling we
move tenant validation to a separate method. This method will be
directly called from the scheduler cmd app, instead of handling the
validation via the reconfiguration event.

This should give us a clearer picture on the differences of smart and
full reconfigurations, so later on we might be able to have a single
method for handling the different types of reconfigs.

Change-Id: Ifb8715ea1436d1f7f3cc127a2be88d4f5f89e73d
2021-07-06 07:23:51 +02:00
Tristan Cacqueray 6eb2a3eb31 scheduler: call stop on SIGTERM
This change attempts to cleanly stop the scheduler when
the service receives a SIGTERM.

Change-Id: I9f8ccb9aa5f8bd998639919d760aafd8a8b47aa5
2021-06-01 19:49:20 +00:00
Tristan Cacqueray 0dbd8c0784 prometheus: add options to start the server and process collector
This change adds a new prometheus_port option to start a metric server
to be scraped by a prometheus service. By default, the server exposes
process information.

Change-Id: Ie329df6adc69768dfdb158d00283161f8b70f07a
2021-04-26 14:47:36 +00:00
Simon Westphahl 50cf6a994a Switch to Zookeeper backed management event queues
Management events will now be dispatched via Zookeeper. The event queues
are namespaced by tenant since the event processing will later require a
tenant lock in a multi scheduler deployment.

Events are considered immutable once in the queue which eliminates the
need for a separate read/write lock. Because of this tenant
reconfiguration events are merged on-the-fly when consuming from the
queue instead of merging them on insert.
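
Merging on consume can be sketched as below. The Event class, its fields and the "tenant-reconfig" kind are hypothetical; the point is only that queued events are never modified and that adjacent reconfiguration events collapse into one when read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical immutable management event for one tenant queue."""
    kind: str
    projects: frozenset = frozenset()


def consume(queue):
    """Yield events in order, merging runs of reconfiguration events."""
    pending = None
    for event in queue:
        if event.kind == "tenant-reconfig":
            if pending is None:
                pending = event
            else:
                # Build a new merged event; the queued ones stay untouched.
                pending = Event(event.kind, pending.projects | event.projects)
            continue
        if pending is not None:
            yield pending
            pending = None
        yield event
    if pending is not None:
        yield pending
```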

Change-Id: Ia79089ce87ab9f4921c38b4542bbf2ea3e655055
2021-03-18 09:24:09 +01:00
Simon Westphahl 2e6cfff818 Switch to Zookeeper backed trigger event queues
Trigger events will now be dispatched via Zookeeper. The event queues
are namespaced by tenant since the event processing will later require a
tenant lock in a multi scheduler deployment.

Gitlab events hold their labels as a non-serializable set attribute; this
change adjusts them to be held in a list (but set operations are still
used for de-duplication).
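
A small illustration of the idea, assuming JSON as the serialization format (the commit message does not name one) and using hypothetical helper names: labels are stored as a list, while set operations do the de-duplication.

```python
import json


def add_labels(labels, new_labels):
    """Return a de-duplicated, sorted list containing both label lists."""
    return sorted(set(labels) | set(new_labels))


def remove_labels(labels, removed):
    """Return the labels list without the removed entries."""
    return sorted(set(labels) - set(removed))


def serialize(labels):
    """A list serializes cleanly; a set would raise TypeError here."""
    return json.dumps({"labels": labels})
```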

Change-Id: Ie54fc16488ab8cbc15f97d003f36c12b8a648ed4
2021-03-18 09:24:09 +01:00
Jan Kubovy e7e1fa2660 Instantiate executor client, merger, nodepool and app within Scheduler
Executor client, merger, nodepool and app were instantiated outside the
scheduler and then set using "setX" methods.

These components are considered mandatory and should therefore
be part of all scheduler instances.

This was useful for layout validation where the scheduler was not run
but just instantiated. Since the layout validation does not need to
instantiate a scheduler anymore, this can be simplified by instantiating
these components within the scheduler's constructor.

Change-Id: Ide96a85d17820e3950704577ca6fd0d082e26182
2021-03-09 16:06:29 -08:00
Jan Kubovy 5d1aeeffb5 Make ConnectionRegistry mandatory for Scheduler
So far the connection registry was added after the Scheduler was
instantiated.

We can make the ConnectionRegistry mandatory to simplify the
Scheduler instantiation.

Change-Id: Iff7b1a597c97f2cd13bea75f9f23585b0e7f76b3
2021-03-08 18:51:32 -08:00
Felix Edel 2dfb34a818 Initialize ZooKeeper connection in server rather than in cmd classes
Currently, the ZooKeeper connection is initialized directly in the cmd
classes like zuul.cmd.scheduler or zuul.cmd.merger and then passed to
the server instance.

Although this makes it easy to reuse a single ZooKeeper connection for
multiple components in the tests it's not very realistic.
A better approach would be to initialize the connection directly in the
server classes so that each component has its own connection to
ZooKeeper.

Those classes already get all necessary parameters, so we could get rid
of the additional "zk_client" parameter.

Furthermore it would allow us to use a dedicated ZooKeeper connection
for each component in the tests which is more realistic than sharing a
single connection between all components.

Change-Id: I12260d43be0897321cf47ef0c722ccd74599d43d
2021-03-08 07:15:32 -08:00
Markus Hosch 53ca90b3d3 Add --validate-tenants option to zuul scheduler
This option can be used to check whether a given tenant configuration
results in a valid configuration without errors. This can be used for
validating changes to the tenant config prior to merging and
reconfiguring Zuul.

Change-Id: I9d27f8c2cc3c5a6286b643e8032a94dcd6bd5876
Co-authored-by: Tobias Henkel <tobias.henkel@bmw.de>
2021-02-25 10:11:42 +01:00
Felix Edel b4d8a4e74b Simplify ZooKeeper client initialization
The ZooKeeperClient now provides a fromConfig() method that parses all
necessary configuration values to instantiate a ZooKeeperClient.
Previously, this needed to be done in every component to initialize the
connection to ZooKeeper.
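
The pattern can be sketched like this. ZKClient, from_config and the option names (hosts, session_timeout, tls_cert, tls_key, tls_ca) are assumptions for illustration and are not taken from the Zuul source; the class only stores the parsed values and opens no connection.

```python
import configparser


class ZKClient:
    """Hypothetical client that only holds connection parameters."""

    def __init__(self, hosts, session_timeout=10.0,
                 tls_cert=None, tls_key=None, tls_ca=None):
        self.hosts = hosts
        self.session_timeout = session_timeout
        self.tls_cert = tls_cert
        self.tls_key = tls_key
        self.tls_ca = tls_ca

    @classmethod
    def from_config(cls, config):
        """Parse the [zookeeper] section once, for every component."""
        section = "zookeeper"
        hosts = config.get(section, "hosts", fallback=None)
        if not hosts:
            raise ValueError("No zookeeper hosts configured")
        return cls(
            hosts,
            session_timeout=config.getfloat(
                section, "session_timeout", fallback=10.0),
            tls_cert=config.get(section, "tls_cert", fallback=None),
            tls_key=config.get(section, "tls_key", fallback=None),
            tls_ca=config.get(section, "tls_ca", fallback=None),
        )
```

Each component then calls the class method with its parsed configuration instead of repeating the option handling.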

Change-Id: I5fa4ddab5f85c658291f1262ee0392a60086846e
2021-02-21 07:41:43 -08:00
James E. Blair 24405c9c74 Require TLS for zookeeper connections
Change-Id: I1d42b3425c948e1e735ba3acaa2ede2b92b050c7
2021-02-17 09:47:11 -08:00
James E. Blair a36400c0f3 Remove SIGHUP handling in scheduler
This was deprecated in 3.3.0; remove it for 4.0.

Change-Id: I75463f2c29f6399cba171386bd475a063fa01ef1
2021-02-15 09:53:55 -08:00
Jan Kubovy d518e56208 Prepare Zookeeper for scale-out scheduler
This change is a common root for other
Zookeeper related changes regarding
scale-out-scheduler. Zookeeper becoming
a central component requires increasing
"maxClientCnxns".

Since the ZooKeeper class is expected to grow
significantly (ZooKeeper is becoming a central part
of Zuul) a split of the ZooKeeper class (zk.py) into
zk module is done here to avoid the current god-class.

Also the zookeeper log is copied to the "zuul_output_dir".

Change-Id: I714c06052b5e17269a6964892ad53b48cf65db19
Story: 2007192
2021-02-15 14:44:18 +01:00
Jan Kubovy 9ab527971f Required SQL reporters
On the way towards a fully scale out scheduler we need to move the
times database from the local filesystem into the SQL
database. Therefore we need to make at least one SQL connection
mandatory.

SQL reporters are required (an implied sql reporter is added to
every pipeline, explicit sql reporters ignored)

Change-Id: I30723f9b320b9f2937cc1d7ff3267519161bc380
Depends-On: https://review.opendev.org/621479
Story: 2007192
Task: 38329
2021-02-03 13:41:55 -08:00
Fabien Boucher 31b83dd2e8 Remove unnecessary shebangs
The commands are managed as entry-points so remove
unnecessary shebangs. Also lib/re2util.py does not
require a shebang either.

zuul_return.py does not have a main and is not supposed
to be run directly.

Unnecessary shebangs for non-executable scripts cause
rpmlint issues.

Change-Id: I6015daaa0fe35b6935fcbffca1907c01c9a26134
2020-05-18 19:10:33 +02:00
Zuul 0962a4fd4a Merge "Add TLS support for ZooKeeper" 2020-04-14 21:35:33 +00:00
Jan Kubovy 9b612c0a95 Consolidate scheduler pause/exit as hibernation
The two properties, `_pause` and `_exit`, in scheduler are
both used for "hibernation", i.e., saving the queue in a
pickled file. The `resume` method is used to wake up the scheduler
(load the queue from the pickled file).

This is a preparation for pause/resume of a scheduler
which is needed in a multi-scheduler operation. This
hibernation functionality will be removed in the near
future and replaced with keeping the queue in Zookeeper.

Change-Id: I93343272b04eedcf10e963b3ba47042a287b6e9e
Story: 2007192
2020-04-03 14:49:59 +02:00
James E. Blair 93ec3daf47 Add TLS support for ZooKeeper
This adds a script to generate TLS certs for zookeeper.

It also adds new config file options for specifying certs for a
TLS connection, adds a howto document to advise admins on how
to configure ZK for TLS.

It also removes the 'required' flag for the SASL auth parameters,
since they are not actually required.

Include the default openssl.cnf file since some distros modify it
to specify paths that are incompatible with the zk-ca.sh script.

Change-Id: Icd976cc32dfd9f75f8cfb1c9ad11e08af31723d6
2020-03-18 14:47:37 -07:00
Antoine Musso 7969b96a86 gear: remove support for custom MASS_DO packet
Zuul 2.5 Ansible launcher registered tens of thousands of functions on
each node which, when done serially, took a while.  To alleviate that
issue the Gear protocol had been extended with a custom MASS_DO packet
to register several functions in a single call (see d437159887).

The Ansible launcher has been superseded by the executor server
removing the sole use of MASS_DO.  The extended Gear.Server had not been
cleaned up though.

Replace custom zuul.lib.gearserver.GearServer() with gear.Server() and
remove code.

For posterity, the MASS_DO idea is captured in Gearman upstream issue
tracker:
https://github.com/gearman/gearmand/issues/6

Change-Id: Ifc57f9b7a17d1d9291a535eb0d9f5e1da3713241
2020-01-29 09:28:53 +01:00
Tobias Henkel 0336205981 Add support for smart reconfigurations
Currently we can only modify the tenant configuration by triggering a
full reconfiguration. However with many large tenants this can take a
long time to finish. Zuul is stalled during this process. Especially
when the system is at quota this can lead to long job queues that
build up just after the reconfiguration. This adds support for a smart
reconfiguration that only reconfigures tenants that changed their
config. This can speed up the reconfiguration a lot in large
multi-tenant systems.
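
The selection step can be pictured as a comparison of the old and new tenant configuration. This is a rough sketch with hypothetical names and plain dictionaries, not the actual Zuul logic.

```python
def changed_tenants(old, new):
    """Return names of tenants that were added, removed, or altered.

    Both arguments map a tenant name to its parsed configuration.
    """
    names = set(old) | set(new)
    return sorted(name for name in names if old.get(name) != new.get(name))
```

Only the tenants returned here would need to be reconfigured; the rest keep running with their existing layout.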

Change-Id: I6240b2850d8961a63c17d799f9bec96705435f19
2019-12-16 17:31:50 +00:00
David Shrewsbury f6b6991af2 Add caching of autohold requests
Change-Id: I94d4a0d2e8630d360ad7c5d07690b6ed33b22f75
2019-09-16 10:46:36 -04:00
James E. Blair a6b48d640c Remove default zookeeper hosts
This default is unlikely to be correct and has caused confusion
for us in the past.  Remove it (which matches the documentation).

Change-Id: I3453b0e918fb1c6783514c470f40f4e973fd683a
2019-03-07 07:49:52 -08:00
Tobias Henkel e20ebbe5cc Add command socket handler for full reconfiguration
In future we want to support different types of reconfigurations so
only relying on signals won't scale. Thus we should make the full
reconfiguration available via the command socket which will be
extensible in the future. A later change will add a reconfiguration
without clearing the cache to be able to quickly add or remove
projects from the tenant configuration without having too much impact
on the system.

Change-Id: I9748ecbcffa8c9b65f98d8768735bdf00e78cf25
2018-08-09 20:36:11 +00:00
James E. Blair ea267a9279 Tell geard to use keepalives
This should start sending keepalives after the connection has been
open for 5 minutes, and should detect a failed connection after
5 minutes.
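
For illustration, TCP keepalive timing of this kind is usually set through socket options. The helper below is a sketch with assumed values (300 seconds idle, then 5 probes 60 seconds apart) chosen to match the description; the values geard actually uses are not stated here, and the TCP_KEEP* options are platform dependent.

```python
import socket


def enable_keepalive(sock, idle=300, interval=60, count=5):
    """Enable TCP keepalives; timing options are applied where available."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    options = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before probing
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before giving up
    )
    for name, value in options:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```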

Change-Id: I5c30a9ab20551b276d195b83939b3a9b71c2c944
2018-04-13 07:01:49 -07:00
Antoine Musso 1c80f506db Import Zuul modules at top of files
We had to make sure to not import Paramiko before daemonization.  The
zuul-server would hang when establishing a Gerrit ssh connection due to
Random.Crypto() failing to acquire random numbers from /dev/urandom. It
would block on read() and never proceed.

When the Server command line invokes the daemonization, python-daemon
closes all file descriptors, including /dev/urandom. Then the daemonized
process establishes the SSH connection and fails to get a random number
because Random.Crypto() locks on read() on a closed file descriptor.

Paramiko issue is https://github.com/paramiko/paramiko/issues/59 and the
fix is to use os.urandom:
6f211115f4

That has been released with Paramiko 1.11.6 and we now require 2.0+.

Move Zuul imports to the top of files and drop the comments hinting at the
Paramiko bug.

Change-Id: I5fe956df74815761e3eac2bd25f6fd7f167fc854
2018-03-05 11:23:32 +01:00
Markus Hosch 12c51791c2 Cleanly shutdown zuul scheduler if startup fails
At the moment, sys.exit is used to terminate the scheduler in case
an exception is thrown during startup phase. However, since not
all remaining threads were either daemonized or stopped, sys.exit
waited indefinitely. This change uses Scheduler.stop before exiting
so that all non-daemonized threads are terminated before exit.
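
The failure mode and the fix can be sketched as follows, using a hypothetical Scheduler with a single non-daemon worker thread. Without the stop() call, sys.exit would wait for that thread indefinitely.

```python
import sys
import threading


class Scheduler:
    """Hypothetical scheduler with one non-daemon worker thread."""

    def __init__(self, fail=False):
        self._stopped = threading.Event()
        # A non-daemon thread keeps the interpreter alive until it ends.
        self._worker = threading.Thread(target=self._stopped.wait)
        self._fail = fail

    def start(self):
        self._worker.start()
        if self._fail:
            raise RuntimeError("startup failed")

    def stop(self):
        self._stopped.set()
        self._worker.join()


def main(scheduler):
    try:
        scheduler.start()
    except Exception:
        # Stop the worker first so that the exit is not blocked.
        scheduler.stop()
        sys.exit(1)
```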

Change-Id: I9e1a753e897276b0b0f5c1b5735d05f1cfa8f9f1
2018-02-08 13:55:52 +01:00
Zuul 17d2c36d44 Merge "Remove webapp" 2018-02-03 00:16:49 +00:00
Zuul 438edce212 Merge "Move github webhook from webapp to zuul-web" 2018-02-03 00:16:44 +00:00
Tobias Henkel e0bad8dc05 Remove webapp
The webapp has been superseded by zuul-web now so remove it
completely.

Change-Id: I8125a0d7f3aef8fa7982c75d4650776b6906a612
2018-01-29 21:21:00 +01:00
Jesse Keating 80730e6c79 Move github webhook from webapp to zuul-web
We want to have zuul-web to handle all http serving stuff so also the
github webhook handling needs to be moved to zuul-web.

Note that this changes the url of the github webhooks to
/driver/github/<connection_name>/payload.

Change-Id: I6482de6c5b9655ac0b9bf353b37a59cd5406f1b7
Signed-off-by: Jesse Keating <omgjlk@us.ibm.com>
Co-Authored-by: Tobias Henkel <tobias.henkel@bmw.de>
2018-01-29 14:16:27 +01:00
Tobias Henkel b3cab1d779 Revert "Register term_handler for all zuul apps"
This reverts commit 00d7ea51fd. It
intended to refactor common code paths for signal handling. However in
our dockerized deployment this seems to completely break signal
handling. Thus it needs to be reverted.

Change-Id: Id5731557ff9a363c7a3d9438a8efcd476e38380c
2018-01-22 15:20:24 +01:00
Tobias Henkel 00d7ea51fd Register term_handler for all zuul apps
Almost all zuul apps use the method term_handler for SIGINT and
SIGTERM. Defining this centrally in ZuulDaemonApp makes this much
simpler and without repetition.

Change-Id: I68f8d1bf52b0e16340818d2bcc44cd9fc5868ca7
2017-12-27 10:45:36 +01:00
Tobias Henkel 30cbb65b43 Centrally register stack dump handler
We want the stack dump handler to be present in all zuul apps so this
can be registered in a central place.

Change-Id: I0c4a97d6ee983aa4d57928682dfb6eeffd050197
2017-12-27 10:17:29 +01:00
Tobias Henkel a56402e4d6 Remove unused method term_handler
This method seems to be superseded by exit_handler and grep tells me
it is unused.

Change-Id: I5a4dc126acbbe1ac2f99153bc7757c4f6e46fc8c
2017-12-27 10:00:29 +01:00
Tobias Henkel fd7101b3b5 Handle sigterm in nodaemon mode
When running zuul within a container it normally runs in nodaemon mode
as pid 1. Currently in this mode zuul just ignores SIGTERM which is
used normally to stop containers. Thus when running within OpenShift
it waits for a timeout until it gets killed forcefully.

Fix this by handling SIGINT and SIGTERM equally.
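
Handling both signals equally amounts to registering one handler for each. A minimal sketch, with hypothetical function names and a handler that simply exits:

```python
import signal
import sys


def exit_handler(signum, frame):
    """Shared handler: exit cleanly on either signal."""
    sys.exit(0)


def install_exit_handler(handler=exit_handler):
    # Must be called from the main thread, as signal.signal requires.
    for signum in (signal.SIGINT, signal.SIGTERM):
        signal.signal(signum, handler)
```

With this in place, a container runtime sending SIGTERM to pid 1 triggers the same shutdown path as C-c.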

Change-Id: I24bd8c953e734fdb9545714126d77cbcdc161bbd
2017-12-18 08:41:47 +01:00
Paul Belanger 40d3ce640c Add command socket support to zuul-scheduler
Bring online commandsocket support for the scheduler.

Change-Id: Ia1719650623e79d40f239776eb770550bb73169b
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-12-06 17:19:40 -05:00
James E. Blair 7cbcbfa44a Fix scheduler reconfiguration handler
This method rename was missed in a previous change.

Change-Id: Idfcbc2a600b0933ac1158c612ac754932bd12950
2017-12-01 11:47:57 -08:00
James E. Blair de0248e0d5 Normalize daemon process handling
Adopt some of the structure from nodepool to make daemon process
handling more consistent.  Handle some argument parsing centrally.

Change the default pid file structure to match nodepool:
  /var/run/zuul/<processname>

Attempt to use the pidfile before daemonizing so that errors are
immediately reported.

Drop the config validation test since it is almost useless at this
point.

Change-Id: I4a9d9473ce028e0b0cd32a8c48598c1682e1c329
2017-11-29 10:03:03 -08:00
James E. Blair bdd50e6cad Add stats for executor and merger count
Report how many executors and mergers are online at any time.  Also
report how many executors are accepting new jobs.  Report the queues
associated with each.

This could be done in the executor or merger clients rather than
centrally in the scheduler, however, determining how many are online
requires a gear admin function, which is relatively expensive.  To
reduce the cost, do it once for the whole system in the scheduler.

The scheduler doesn't directly have a gearman connection (though its
associated rpc listener, merge client, and executor clients do).
It didn't seem necessary to add another client for this, and the
rpc listener seemed the most appropriate connection to borrow (its
purpose is to expose scheduler functions over gearman).  The method
to count functions is added to it, and it in turn is now started
directly from the scheduler.

Change-Id: I09659c29431ebac7ecd1869cc4c1356026c03d26
2017-10-21 09:45:52 -07:00
James E. Blair ded241e598 Switch statsd config to zuul.conf
The automatic statsd configuration based on env variables has
proven cumbersome and counter-intuitive.  Move its configuration
into zuul.conf in preparation for other components emitting stats.

Change-Id: I3f6b5010d31c05e295f3d70925cac8460d334283
2017-10-13 14:04:42 -07:00
James E. Blair e2f0a87ad8 Add ZK session timeout option
Change-Id: If804c18f2103baa12c9c3bd0344a166fac1ea749
2017-09-28 10:35:12 -07:00
Tristan Cacqueray a7586c96a7 Add gearman server port configuration
This change adds the port configuration option to set a custom port
for the gearman server.

Change-Id: I1b65f93fa0403ff10e00a97afcdb4a3b512eb372
2017-08-29 11:08:39 +00:00