28 Commits

Author SHA1 Message Date
Clark Boylan 2747ea6f56 Fix DeprecationWarning: ssl.PROTOCOL_TLS is deprecated
Since python 3.10 ssl.PROTOCOL_TLS has been deprecated. We are expected
to use ssl.PROTOCOL_TLS_CLIENT and ssl.PROTOCOL_TLS_SERVER depending on
how the sockets are to be used. Switch over to these new constants to
avoid the DeprecationWarning.

One thing to note is that PROTOCOL_TLS_CLIENT has default behaviors
around cert verification and hostname checking. Zuul is already
explicitly setting those options the way it wants to and I've left that
alone to avoid trouble if the defaults change later.
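For illustration only, a minimal sketch of the pattern described above,
using the standard library ssl module. The helper names
make_client_context and make_server_context are hypothetical and are not
taken from Zuul.

```python
import ssl


def make_client_context(check_hostname=True):
    """Build a client-side context (hypothetical helper, not Zuul's code)."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # PROTOCOL_TLS_CLIENT turns on hostname checking and CERT_REQUIRED by
    # default; set both explicitly so we do not depend on those defaults.
    context.check_hostname = check_hostname
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def make_server_context(require_client_cert=True):
    """Build a server-side context (hypothetical helper, not Zuul's code)."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Servers never check hostnames; only decide whether clients must
    # present a certificate.
    if require_client_cert:
        context.verify_mode = ssl.CERT_REQUIRED
    else:
        context.verify_mode = ssl.CERT_NONE
    return context
```

Certificates and CA bundles would still need to be loaded with
load_cert_chain() and load_verify_locations() before use.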

Finally, this doesn't fix the occurrence of this error that happens
within kazoo. A separate PR has been made upstream to kazoo and this
should be fixed in the next kazoo release.

Change-Id: Ib41640f1d33d60503066464c8c98f865a74f003a
2023-02-07 16:37:20 -08:00
Clark Boylan 26523d8e56 Fix ResourceWarnings in fingergw
The fingergw (and its associated testing) was not properly managing ssl
sockets. The issue was we were in a context manager for the underlying
tcp socket which will get closed, but that doesn't call close() on the
ssl socket wrapping the tcp socket. Fix this by moving common recv()
code into a function then use the ssl socket in an inner context manager
if we are using ssl.

Both ssl and plain tcp will close() properly and we avoid duplicating
common code.
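A rough sketch of the shape of this fix, assuming a simple
request/response exchange. The names fetch and _recv_all are
hypothetical; the real fingergw code may be structured differently.

```python
import socket


def _recv_all(sock, bufsize=4096):
    """Read until the peer closes; shared by the plain and TLS paths."""
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def fetch(server, port, request, tls_context=None, timeout=10):
    """Send a request and return the full reply (illustrative only)."""
    with socket.create_connection((server, port), timeout=timeout) as tcp:
        if tls_context is not None:
            # The inner context manager makes sure close() is called on
            # the ssl socket itself, not just on the tcp socket under it.
            with tls_context.wrap_socket(tcp, server_hostname=server) as tls:
                tls.sendall(request)
                return _recv_all(tls)
        tcp.sendall(request)
        return _recv_all(tcp)
```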

Change-Id: I1feefbd03a90734cf3c16baa6ed8f52cd8e00d14
2023-02-07 16:17:14 -08:00
James E. Blair 591d7e624a Unify service stop sequence
We still had some variations in how services stop.  Finger, merger,
and scheduler all used signal.pause in a while loop which is
incompatible with stopping via the command socket (since we would
always restart the pause).  Sending these components a stop or
graceful signal would cause them to wait forever.

Instead of using signal.pause, use the thread.join methods within
a while loop, and if we encounter a KeyboardInterrupt (C-c) during
the join, call our exit handler and retry the join loop.

This maintains the intent of the signal.pause loop (which is to
make C-c exit cleanly) while also being compatible with an internal
stop issued via the command socket.
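As a sketch of the join loop described above (join_until_stopped is a
hypothetical name, and the component only needs a join() method such as
the one on threading.Thread):

```python
def join_until_stopped(component, exit_handler):
    """Wait for a component to finish, treating C-c as a stop request."""
    while True:
        try:
            # Returns normally once the component has stopped, whether the
            # stop came from C-c or from the command socket.
            component.join()
            return
        except KeyboardInterrupt:
            # Ask the component to stop, then go back to waiting for it.
            exit_handler()
```

Unlike a signal.pause loop, this returns as soon as the component exits,
no matter what triggered the stop.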

The stop sequence is now unified across all components.  The executor
has an additional complication in that it forks a process to handle
streaming.  To keep a C-c shutdown clean, we also handle a keyboard
interrupt in the child process and use it to indicate the start of
a shutdown.  In the main executor process, we now close the socket
which is used to keep the child running and then wait for the child
to exit before the main process exits (so that the child doesn't
keep running and emit a log line after the parent returns control
to the terminal).

Change-Id: I216b76d6aaf7ebd01fa8cca843f03fd7a3eea16d
2022-05-28 10:27:50 -07:00
James E. Blair 864a2b7701 Make a global component registry
We generally try to avoid global variables, but in this case, it
may be helpful to set the component registry as a global variable.

We need the component registry to determine the ZK data model API
version.  It's relatively straightforward to pass it through the
zkcontext for zkobjects, but we also may need it in other places
where we might alter processing of data we previously got from zk
(eg, the semaphore cleanup).  Or we might need it in serialize or
deserialize methods of non-zkobjects (for example, ChangeKey).

To account for all potential future uses, instantiate a global
singleton object which holds a registry and use that instead of
local-scoped component registry objects.  We also add a clear
method so that we can be sure unit tests start with clean data.
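A minimal sketch of such a singleton holder. The class name, the
create() signature and the factory argument are assumptions made for
illustration and may not match the actual implementation.

```python
class _GlobalRegistryHolder:
    """Holds one shared registry instance (illustrative only)."""

    def __init__(self):
        self.registry = None

    def create(self, factory):
        # factory is any zero-argument callable that builds a registry;
        # this is a stand-in for however the real registry is constructed.
        self.registry = factory()
        return self.registry

    def clear(self):
        # Lets unit tests start from clean data.
        self.registry = None


# Module-level singleton that other code can import and use.
COMPONENT_REGISTRY = _GlobalRegistryHolder()
```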

Change-Id: Ib764dbc3a3fe39ad6d70d4807b8035777d727d93
2022-02-14 10:58:34 -08:00
James E. Blair a160484a86 Add zuul-scheduler tenant-reconfigure
This is a new reconfiguration command which behaves like full-reconfigure
but only for a single tenant.  This can be useful after connection issues
with code hosting systems, or potentially with Zuul cache bugs.

Because this is the first command-socket command with an argument, some
command-socket infrastructure changes are necessary.  Additionally, this
includes some minor changes to make the services more consistent around
socket commands.

Change-Id: Ib695ab8e7ae54790a0a0e4ac04fdad96d60ee0c9
2022-02-08 14:14:17 -08:00
James E. Blair 704fef6cb9 Add readiness/liveness probes to prometheus server
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes.  We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).

The prometheus_client library doesn't support this directly, so we need
to handle this ourselves.  We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.

The prometheus_client will actually return the metrics on any path
given to it.  In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI which is what we specified in our documentation,
and also '/' which is very likely accidentally used by users.
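A sketch of the wrapping approach, written against the plain WSGI
interface so it does not depend on prometheus_client. The probe paths
'/health/live' and '/health/ready', the 503 and 404 responses, and the
names make_health_app and is_ready are assumptions for illustration.

```python
def make_health_app(metrics_app, is_ready):
    """Wrap a metrics WSGI app with liveness and readiness endpoints.

    metrics_app: any WSGI callable that serves the metrics page.
    is_ready: zero-argument callable returning True when the component
              is running or paused.
    """
    plain = [("Content-Type", "text/plain")]

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == "/health/live":
            # If we can answer at all, the process is running.
            start_response("200 OK", plain)
            return [b"OK"]
        if path == "/health/ready":
            if is_ready():
                start_response("200 OK", plain)
                return [b"OK"]
            start_response("503 Service Unavailable", plain)
            return [b"Not ready"]
        if path in ("/", "/metrics"):
            return metrics_app(environ, start_response)
        # Anything else (for example a mistyped probe path) is an error
        # rather than a metrics page with a 200 response.
        start_response("404 Not Found", plain)
        return [b"Not found"]

    return app
```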

Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
2021-12-09 07:37:29 -08:00
Felix Edel 220534c0f7 Store version information in component registry
This stores the zuul version of each component in the component
registry and updates the API endpoint.

Change-Id: I1855b2a6db2bd330343cad69d9d6cf21ea35a1f5
2021-10-20 17:17:02 +02:00
Felix Edel df32d4cf58 Let zuul-web look up the live log streaming address from ZooKeeper
This removes the RPC call (Gearman) in zuul-web to look up the live log
streaming address from the build objects in the scheduler and instead
uses the build requests stored in ZooKeeper.

As the address lookup is implemented as a shared library function which
is used by zuul-web and the fingergw, the fingergw is also switched from
RPC to ZooKeeper. The implementation itself was moved from
zuul.rpclistener to zuul.lib.streamer_utils.

To make the lookup via ZooKeeper work, the executor now stores its
worker information (hostname, log_port) on the build request when it
locks the request.

Additionally, the rpc client was removed from the fingergw as it's not
needed anymore. Instead, the fingergw now has access to the component
registry and the executor api in ZooKeeper as both are needed to look up
the streaming address.

To not create unnecessary watches for build requests in each fingergw
and zuul-web component, the executor api (resp. the job_request_queue
base class) now provides a "use_cache" flag. The cache is enabled by
default, but if the flag is set to False, no watches will be created.

Overall this should reduce the load on the scheduler as it doesn't need
to handle the related RPC calls anymore.

Change-Id: I359b70f2d5700b0435544db3ce81d64cb8b73464
2021-09-22 07:25:13 +02:00
James E. Blair 51ef833eb4 Add option to check fingergw hostnames
It's conceivable that someone may want to use a public CA (like
letsencrypt) to provide certs for the finger gateway.  In that
case, we should check hostnames.  Add an option for that (and
make it the default).

Change-Id: I41f6f6f20ca9a4ecdb562e1760d8509e44b258f3
2021-07-28 07:22:36 -07:00
James E. Blair e047fc42c6 Combine fingergw certificate options
This combines the client and server certificate options to make
typical deployments simpler.  The same certificate will be used by
a fingergw acting as a client or a server.

A new option is added to tell fingergw to use the cert only for
client use; that way a fingergw can act as an unencrypted end-user
gateway while still able to connect to encrypted servers.

The options are renamed to tls_* to match zookeeper; once gearman
is removed, we will have no ssl_* options.

Documentation and a release note for TLS fingergw support is added.

Change-Id: If3e445336de4644a5303f2ecc7c4a27e4320d042
2021-07-27 15:38:49 -07:00
Tobias Henkel 496e9e3514 Support ssl encrypted fingergw
When using fingergw for inter region log streaming it can be desirable
to support ssl encrypted connections with client auth just like we do
with gearman. This will also make it easy to route traffic to the
finger gateway via an openshift route using SNI and pass-through.

Docs and release note added in a subsequent change.

Change-Id: Ia5c739a3fcf229da140c4e2ebbe1a771c63b0489
2021-07-27 15:38:46 -07:00
James E. Blair a0974f9f8c Use component registry in fingergw routing
This uses the component registry rather than gearman to perform
fingergw routing lookups.  It also adjusts the logic for routing
to match the latest version of the spec, where unzoned fingergw
processes are expected to route to zoned fingergws if they exist
(because the unzoned fingergw might be a public gateway outside
of the zone).
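One simplified reading of that routing rule, as a sketch. The function
name, its arguments and the dictionary of zoned gateways are all
hypothetical; the spec itself remains the authority on the details.

```python
def pick_stream_target(own_zone, executor_zone, executor_addr,
                       gateways_by_zone):
    """Return the (host, port) a fingergw should connect to.

    own_zone: zone of this fingergw, or None if it is unzoned.
    executor_zone: zone of the executor running the build, or None.
    executor_addr: (host, port) of the executor's log streamer.
    gateways_by_zone: maps zone name -> (host, port) of a zoned fingergw.
    """
    if (executor_zone is not None
            and own_zone != executor_zone
            and executor_zone in gateways_by_zone):
        # We are outside the executor's zone (possibly unzoned) and a
        # gateway exists for that zone, so go through it.
        return gateways_by_zone[executor_zone]
    # Same zone, unzoned executor, or no gateway: connect directly.
    return executor_addr
```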

Change-Id: I2f9fed03159db59cc4e496802b9dab05f746e1a2
2021-06-21 13:38:03 -07:00
Tobias Henkel 5c4e8d7ddd Route streams to different zones via finger gateway
In some distributed deployments we need to route traffic via single
entry points that need to dispatch the traffic. For this use case make
all components aware of their zone so it is possible to compute if
traffic needs to go via an intermediate finger gateway or not.

Therefore we register the gearman function 'fingergw:info:<zone>' if
the fingergw is zoned. That way the scheduler will be able to route
streams from different zones via finger gateways that are responsible
for their zone.

Change-Id: I655427283205ea02de6f0f271b4aa5092ac05278
2021-06-10 14:09:37 +02:00
Tobias Henkel 46d0ed8e8f Move fingergw config to fingergw
We currently read the config in the fingergw app and put the config
into the fingergw object. This gets ugly when adding more config
options so move evaluation of the config file into the FingerGateway
class. This is a preparation for adding ssl support to the
FingerGateway which will need more config options.

Depends-On: https://review.opendev.org/663413
Change-Id: I83f3863586b85f8befd84eb8f6079fa35ee3a8cb
2021-05-29 09:30:14 -07:00
Felix Edel 040f403e7f Improve component registry
This improves the usage of the component registry in various ways:

1. It adds a tree cache to the registry. The cache is eventually
   consistent, which should be sufficient for most use cases like
   calculating stats in the scheduler and getting a list of components
   without the need to ask ZooKeeper every time for the list of
   components.

2. Components can now be used as classes rather than dictionaries, which
   makes using and updating them much easier and nicer.

3. Components can be used without a registry. This makes registering
   components easier and you only need to instantiate a registry when
   you need the registry itself (e.g. in the scheduler).

With that change the registry itself is not used anywhere in the
production code because it's not required at this point. I will add this
in the next commit.

Change-Id: Ia8efba26114119eecffb9a89264083e4b8a80de0
2021-05-17 16:47:13 -07:00
Jan Kubovy 22935c1177 Component Registry in ZooKeeper
This change adds a component registry which can be used by different
components, such as executors, mergers and others to register
themselves, report their state and store arbitrary runtime information.

This is needed, e.g., to monitor components or to share the
"accepting_work" state of executors later on.

Change-Id: I4b7197d6cb399513e30d314f8a5f4f55ad9266f8
2021-03-12 13:51:48 -08:00
Felix Edel 2dfb34a818 Initialize ZooKeeper connection in server rather than in cmd classes
Currently, the ZooKeeper connection is initialized directly in the cmd
classes like zuul.cmd.scheduler or zuul.cmd.merger and then passed to
the server instance.

Although this makes it easy to reuse a single ZooKeeper connection for
multiple components in the tests, it's not very realistic.
A better approach would be to initialize the connection directly in the
server classes so that each component has its own connection to
ZooKeeper.

Those classes already get all necessary parameters, so we could get rid
of the additional "zk_client" parameter.

Furthermore it would allow us to use a dedicated ZooKeeper connection
for each component in the tests which is more realistic than sharing a
single connection between all components.

Change-Id: I12260d43be0897321cf47ef0c722ccd74599d43d
2021-03-08 07:15:32 -08:00
Jan Kubovy 7ae2805a5a Connect merger to Zookeeper
Part of point 5 in https://etherpad.openstack.org/p/zuulv4

Connection is idle for now.

Also update component documentation.

Change-Id: I97a97f61940fab2a555c3651e78fa7a929e8ebfb
2021-02-15 14:44:18 +01:00
Tobias Henkel 545bb19459 Join command thread on exit
The executor and fingergw miss joining the command thread. This could
lead to random test failures like [1].

[1] Trace:
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/zuul/zuul/tests/base.py", line 4187, in shutdown
    raise Exception("More than one thread is running: %s" % threads)
Exception: More than one thread is running:
[<_MainThread(MainThread, started 140602311071488)>, <Thread(command, started daemon 140601232180992)>]

Change-Id: I5246b686fe708444ffaf9d94ef4321b304f1754e
2020-07-15 17:17:44 +02:00
Antoine Musso 4aca407bd6 Add client_id to RPC client
A Gearman client can set a client id which is then used on the server
side to identify the connection. Lack of a client_id makes it harder to
follow the flow when looking at logs:

 gear.Connection.b'unknown'  INFO Connected to 127.0.0.1 port 4730
 gear.Server Accepted connection
  <gear.ServerConnection ... name: None ...>
                                   ^^^^

In RPCClient, introduce a client_id argument which is passed to
gear.Client().
Update callers to set a meaningful client_id.

Change-Id: Idbd63f15b0cde3d77fe969c7650f4eb18aec1ef6
2020-01-28 10:16:19 +01:00
James E. Blair 3dc386626b Fix secondary exception in fingergw
A recent change added more information to the fingergw exception
handler.  In some cases those variables can be undefined.  Ensure
we use initialized variables in the exception handler.
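A sketch of the general pattern, not the actual handler. The names
stream_build, lookup and connect are hypothetical stand-ins.

```python
import logging

log = logging.getLogger("example.fingergw")


def stream_build(lookup, connect, build_uuid):
    """Look up and connect to a build's streamer (illustrative only)."""
    # Initialize before the try block so the handler below can always
    # refer to these names, even if lookup() fails before assigning them.
    server = None
    port = None
    try:
        server, port = lookup(build_uuid)
        return connect(server, port)
    except Exception:
        log.exception("Finger request failed for build %s (%s:%s)",
                      build_uuid, server, port)
        return None
```

Without the two initial assignments, a failure inside lookup() would
raise a NameError from within the exception handler.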

Change-Id: I3636f4d208a1c35245581129c5690dc70e39336a
2019-01-22 15:35:44 -08:00
Paul Belanger 669e2d66fa Improve exception handling of fingerclient
This improves the following traceback. Today we don't display which
server and port were used.

  Traceback (most recent call last):
    File "/opt/venv/zuul-3.4.0/lib/python3.6/site-packages/zuul/lib/fingergw.py", line 80, in handle
      build_uuid,
    File "/opt/venv/zuul-3.4.0/lib/python3.6/site-packages/zuul/lib/fingergw.py", line 51, in _fingerClient
      with socket.create_connection((server, port), timeout=10) as s:
    File "/usr/lib/python3.6/socket.py", line 704, in create_connection
      for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
      for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
  socket.gaierror: [Errno -3] Temporary failure in name resolution

Change-Id: I2adb79b8fc3e12cb971b59e3d89c3dfc24a10a67
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2019-01-21 13:11:24 -05:00
Paul Belanger 47aa6b12b2 Ensure command_socket is last thing to close
This updates all services to match how zuul-scheduler works: we close
the command_socket at the last possible moment. This also means we can
now use the command socket on the filesystem as an indicator that zuul
shut down properly.

Change-Id: I5fe1bc96c87e1177a2b94d73a9cbe505a7807202
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2019-01-07 10:19:48 -05:00
Fabien Boucher bc20de95e5 Remove unnecessary shebang and exec bit
Change-Id: I54de68b11f055a9269ca5efb8a57f81d57f9d55f
2018-07-26 07:12:24 +00:00
David Shrewsbury 40d1204c2a Unset finger client timeout after connect
Our finger client will timeout after 10 seconds if no data is
received from the executor after we connect. We really only
want the timeout on the connection portion and just wait forever
until some data starts streaming back. Unset the socket timeout
after connecting.
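In standard library terms, the idea looks roughly like this
(connect_for_streaming is a hypothetical name):

```python
import socket


def connect_for_streaming(server, port, connect_timeout=10):
    """Connect with a timeout, then block indefinitely on reads."""
    # The timeout applies while the connection is being established.
    sock = socket.create_connection((server, port), timeout=connect_timeout)
    # None puts the socket back into blocking mode, so recv() waits
    # as long as needed for data to start streaming.
    sock.settimeout(None)
    return sock
```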

Change-Id: I94398f78ac6e36715eed830c70ef2b178b310a34
2018-02-21 13:43:41 -05:00
David Shrewsbury 198d9e471b Don't treat finger client disconnect as exception
The fingergw currently logs client disconnects as exceptions.
This makes the log unnecessarily noisy. Just ignore them.

Change-Id: Ic28acabcb47359d4b7077a1eecddefe0f7094212
2018-01-05 11:12:38 -05:00
David Shrewsbury 1c7d1e1ba1 Handle invalid build UUID in finger gateway
The RPC call will return an empty dict if the build UUID
cannot be found. We should handle that gracefully.

Change-Id: Ie0fa49e08d9213bf7226c6301896507866c36e28
2018-01-03 11:39:52 -05:00
David Shrewsbury fe1f1944a6 Add finger gateway
This adds the zuul-fingergw app that should be run as root (so that
it can connect to the standard finger port 79), but changes user privs
immediately after binding that port.
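A sketch of the bind-then-drop pattern on a POSIX system. The function
name and arguments are hypothetical, and the real app may handle
privileges differently.

```python
import os
import pwd
import socket


def bind_then_drop_privileges(port, user, host=""):
    """Bind a listening socket, then give up root if we have it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Ports below 1024 (such as finger's port 79) normally need root.
    sock.bind((host, port))
    sock.listen()
    if os.getuid() == 0:
        pw = pwd.getpwnam(user)
        # Order matters: groups first, then gid, then uid. Once the uid
        # is dropped we no longer have permission to change the others.
        os.initgroups(user, pw.pw_gid)
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)
    return sock
```

The socket stays usable after the switch because the bind has already
happened by the time privileges are dropped.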

Common streaming functions have been moved to streamer_utils.py to
be shared among modules.

Support for CommandSocket has been included.

Change-Id: Ia35492fe951e7b9367eeab0b145d96189d72c364
2017-12-13 10:07:37 -05:00