Commit Graph

176 Commits

Author SHA1 Message Date
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Zuul fc2e592d0d Merge "Add zookeeper-timeout connection config" 2022-03-24 15:23:02 +00:00
Benjamin Schanzel 7ae6b34e71
Pass requestor data to Nodes
Propagate the NodeRequest.requestor to a new property Node.requestor
which usually holds the zuul_system_id, or "NodePool:min_ready".
This is set initially when first creating a node, and is updated when a
ready node is assigned to a request. This way we always know for which
requestor a node was allocated.

Change-Id: Ifd54a94bae39f31a70bdedce384a64c2a04495c1
2022-03-21 10:19:54 +01:00
James E. Blair 2777b84f07 Retain created_time in node requests
This attribute is set by Zuul, and it is useful for Zuul to be able
to report statistics.  It also may be useful for Nodepool for the
same purposes, or potentially for future information display.

To facilitate all of the above, retain the attribute across read/
write cycles in Nodepool.

Change-Id: Ic1352d9a25a1f5f3fbbab1313ec832b151ac1cc2
2022-02-28 10:03:03 -08:00
Tobias Henkel ec55126f6b
Add zookeeper-timeout connection config
The default zookeeper session timeout is 10 seconds, which is not enough
on a highly loaded nodepool. As in zuul, make this configurable so we
can avoid session losses.

Change-Id: Id7087141174c84c6cdcbb3933c233f5fa0e7d569
2022-02-23 23:01:11 +01:00
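A minimal sketch of how such a timeout can be passed to kazoo, assuming a
``zookeeper-timeout`` key in the connection config; the surrounding config
plumbing is illustrative, not Nodepool's actual code:

    from kazoo.client import KazooClient

    def build_zk_client(hosts, conn_config):
        # kazoo's session timeout defaults to 10 seconds; make it configurable.
        timeout = conn_config.get('zookeeper-timeout', 10.0)
        client = KazooClient(hosts=hosts, timeout=timeout)
        client.start()
        return client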
Benjamin Schanzel 1a73a7a33e
Avoid runtime errors in zk node iterator
While iterating over zk nodes with node IDs from the node
cache, there can be runtime errors when the cache was updated during the
iteration process.
(``RuntimeError: dictionary changed size during iteration``)

Avoid this by iterating over a copy of the node IDs, rather than the
dict_keys directly.

Change-Id: Iecd88b4484cf48ea2127348bfb2905443eaaf49f
2022-01-28 13:32:10 +01:00
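A minimal sketch of the fix described above; the cache dict name is
illustrative:

    def iter_cached_nodes(node_cache):
        # Snapshot the IDs so a concurrent cache update cannot raise
        # "dictionary changed size during iteration".
        for node_id in list(node_cache.keys()):
            node = node_cache.get(node_id)
            if node is None:
                # The entry was removed while we were iterating.
                continue
            yield node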
James E. Blair 5862bef141 Add metastatic driver
This driver supplies "static" nodes that are actually backed by
another nodepool node.  The use case is to be able to request a single
large node (a "backing node") from a cloud provider, and then divide
that node up into smaller nodes that are actually used ("requested
nodes").  A backing node can support one or more requested nodes, and
backing nodes should scale up or down as necessary.

Change-Id: I29d78705a87a53ee07dce6022b81a1ce97c54f1d
2021-12-09 11:08:48 -08:00
Zuul 56ebfbf885 Merge "Add commands to export/import image data from ZK" 2021-09-15 22:01:46 +00:00
James E. Blair d1c14beafb Add user_data field to Node
This will allow users to record information about their use of the
node.  Nodepool will ignore this.

This is similar to the comment field, except that it won't show up
in the node listing output.  It's opaque data to nodepool.

This is also similar to requestor_data for the node request.

We can use this in Zuul to record the tenant/project information,
then just use the existing nodepool node records to calculate
resource usage by tenant.  Without this, we would need to create
a nearly parallel data structure for that purpose alone.  Bonus:
it will let Zuul include autohold node resource usage in its
stats (if desired).

Change-Id: If382ca1c554ddd2ef9be8ad87049b133c8b3873c
2021-09-04 11:34:28 -07:00
Benjamin Schanzel ee90100852 Add Tenant-Scoped Resource Quota
This change adds the option to put quota on resources on a per-tenant
basis (i.e. Zuul tenants).

It adds a new top-level config structure ``tenant-resource-limits``
under which one can specify a number of tenants, each with
``max-servers``, ``max-cores``, and ``max-ram`` limits.  These limits
are valid globally, i.e., for all providers. This is contrary to
the existing provider and pool quotas, which are only considered
for nodes of the same provider.

Change-Id: I0c0154db7d5edaa91a9fe21ebf6936e14cef4db7
2021-09-01 09:07:43 +02:00
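An illustrative sketch of a global, per-tenant limit check along these
lines; the field names and dict layout are assumptions, not the actual
Nodepool implementation:

    def tenant_has_quota(tenant_limits, tenant_usage, requested):
        # tenant_limits: {'max-servers': ..., 'max-cores': ..., 'max-ram': ...}
        # tenant_usage / requested: {'servers': ..., 'cores': ..., 'ram': ...}
        checks = (('max-servers', 'servers'),
                  ('max-cores', 'cores'),
                  ('max-ram', 'ram'))
        for limit_key, usage_key in checks:
            limit = tenant_limits.get(limit_key)
            if limit is None:
                continue  # unset limits mean unlimited
            if tenant_usage.get(usage_key, 0) + requested.get(usage_key, 0) > limit:
                return False
        return True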
James E. Blair 0b1fa1d57d Add commands to export/import image data from ZK
Change-Id: Id1ac6403f4fe80059b90900c519e56bca7dee0a0
2021-08-24 10:28:39 -07:00
James E. Blair 405beba83a Add requestor_data field to node request
Zuul currently keeps track in memory of the buildset+job for which any
given node request was submitted.  To support multiple schedulers,
we need to put that in persistent storage.  We could create a new
kind of object in ZK, but it would be simpler if we could simply store
a memo on the node request itself.

This adds a requestor_data field to the node request object in
ZooKeeper.  It is opaque data to Nodepool, but will allow Zuul to record
the information it needs.

Change-Id: Ie6e29e88f15709829d24dcd063d6afb37778a72e
2021-06-29 13:48:26 -07:00
Simon Westphahl 56acaf51d3 Assign waiting static nodes by request priority
The static provider will create nodes in state "building" in case a
label is currently not available. When a freed-up node can be assigned
to a waiting node, we must use the request priority to decide which node
to update.

Previously, nodepool ordered the waiting nodes by creation time. This could
lead to node requests being fulfilled in the wrong order in certain
cases.

Change-Id: Iae4091b1055d6bb0933f51ce1bbf860e62843206
2021-05-11 14:04:02 +02:00
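A minimal sketch of choosing a waiting node by request priority rather
than creation time; how the priority is looked up per node is an
assumption:

    def pick_waiting_node(waiting_nodes, request_priority):
        # request_priority maps a waiting node to its request's priority,
        # where a lower number means a higher priority.
        return min(waiting_nodes, key=request_priority, default=None)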
Albin Vass 0c84b7fa4e Add shell-type config
Ansible needs to know which shell type the node uses to operate
correctly, especially for ssh connections to windows nodes, because
otherwise ansible defaults to trying bash.

Change-Id: I71abfefa57aaafd88f199be19ee7caa64efda538
2021-03-05 15:14:29 +01:00
James E. Blair 4c5fa46540 Require TLS
Require TLS Zookeeper connections before making the 4.0 release.

Change-Id: I69acdcec0deddfdd191f094f13627ec1618142af
Depends-On: https://review.opendev.org/776696
2021-02-19 18:42:33 +00:00
Clark Boylan dfcf770260 Ignore unparsable/empty image build ZNode data
As with the parent change, it seems that we can end up with empty image build
records as well as empty image upload records. Handle the build case as
well so that we can still list image builds properly when in this state.
This logs the problem as well so that you can debug it further.

Change-Id: Ic5521f5e2d2b65db94c2a5794f474913631a7a1b
2020-10-06 12:44:18 +02:00
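A sketch of the defensive parsing described above; the function and logger
names are illustrative:

    import json
    import logging

    log = logging.getLogger("nodepool.zk")

    def parse_build_znode(path, raw):
        if not raw:
            log.warning("Ignoring empty build znode %s", path)
            return None
        try:
            return json.loads(raw.decode('utf8'))
        except json.JSONDecodeError:
            log.exception("Ignoring unparsable build znode %s", path)
            return None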
Zuul 0bf8712c80 Merge "Ignore unparsable/empty image upload ZNode data" 2020-09-03 05:41:59 +00:00
James E. Blair 00d62af3c3 Add image-pause CLI command
This adds a CLI command to set a flag in ZK for images indicating
that the image should be paused.  This can be used to quickly pause
the building and uploading of one or more images globally.  This
will effectively be boolean OR'd with the pause value for diskimage
builds in the config file.

In particular, this can be used to pause images for short durations,
either because a fix is imminent, or to allow the system to remain
stable while a configuration change goes through the CI/CD workflow.

Change-Id: I21a573dfc337c51f319afe3695d5446b2c91d70b
2020-08-20 15:48:03 -07:00
Simon Westphahl 638c6dbb1f Ignore unparsable/empty image upload ZNode data
Multiple builders running cleanup of uploads in parallel can lead to a
race condition which results in empty image upload nodes in Zookeeper.

The problem is the creation of the image upload number lock inside the
image upload. The lock creation can re-create an already deleted image
upload without any data.

Later iterations will fail to parse the empty znode data.

It's not 100% clear how we arrive at this race condition, but we
definitely see multiple builders in the 'guarded' section trying to
delete an upload.

The following exception shows that there is some kind of race:

    2020-06-23 11:00:45,225 ERROR nodepool.builder.CleanupWorker.0:
        Exception cleaning up build <ImageBuild {'state': 'ready',
        'state_time': 1592718469.360345, 'builder':
        'nodepool-builder-2', 'builder_id':
        '8e2c599b-fbfd-4d68-b8df-2fd2efdaa705', 'formats': 'qcow2,raw',
        'username': 'Administrator', 'python_path': 'auto', 'id':
        '0000013010', 'stat': ZnodeStat(czxid=129130944630,
        mzxid=129131204208, ctime=1592712393474, mtime=1592718469363,
        version=1, cversion=1, aversion=0, ephemeralOwner=0,
        dataLength=214, numChildren=1, pzxid=129131204230)}> of image
        example-image in provider <Provider example-provider>:
    Traceback (most recent call last):
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 499, in _cleanupImage
        self._cleanupProvider(provider, image, build.id)
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 292, in _cleanupProvider
        self._deleteUpload(upload)
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 345, in _deleteUpload
        upload.provider_name, upload.id)
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 1572, in deleteUpload
        self.client.delete(path, recursive=True)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1341, in delete
        return self._delete_recursive(path)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1374, in _delete_recursive
        self._delete_recursive(child_path)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1376, in _delete_recursive
        self.delete(path)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1343, in delete
        return self.delete_async(path, version).get()
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/handlers/utils.py", line 75, in get
        raise self._exception
    kazoo.exceptions.NotEmptyError

Order of events that might have led to the above exception:
  - A locks upload
  - A deletes children of upload (including lock of A)
  - B locks upload (possible, because lock was deleted by A)
  - A tries to delete the parent node, which leads to the above exception
  - ...

Related change: https://review.opendev.org/#/c/716566/
Change-Id: I5989c9f5f4c9f1615ae7372250e2e5fdad720256
2020-08-14 07:32:50 +02:00
Ian Wienand ae784c57a2 Handle ipv6 literal zookeeper addresses
If you use an IPv6 literal as the host: parameter in the zookeeper-servers
section, it gets incorrectly concatenated with the port when building
the host list.  Detect IPv6 literals and quote them with RFC 2732 []'s
when building the host string.

Change-Id: I49a749142e8031be37a08960bf645faca4e97149
2020-05-04 06:52:07 -07:00
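A sketch of RFC 2732 quoting when assembling the host string; detecting
literals via the stdlib ipaddress module is one possible approach, not
necessarily the one used in the change:

    import ipaddress

    def format_zk_host(host, port):
        try:
            if ipaddress.ip_address(host).version == 6:
                return '[%s]:%s' % (host, port)
        except ValueError:
            pass  # hostname or IPv4 address, no quoting needed
        return '%s:%s' % (host, port)

    # format_zk_host('::1', 2181)          -> '[::1]:2181'
    # format_zk_host('zk1.example', 2181)  -> 'zk1.example:2181'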
Zuul 28b1f67dd0 Merge "Add debug info on json parse error of image upload" 2020-04-15 12:39:18 +00:00
Zuul 775cd32028 Merge "Add ZooKeeper TLS support" 2020-04-15 01:41:47 +00:00
James E. Blair b62fa3313d Add ZooKeeper TLS support
Change-Id: I009d9f90b32881aaef2d0694da6ff28074f48f8e
2020-04-14 16:03:53 -07:00
Tobias Henkel e57b806bb2
Add debug info on json parse error of image upload
We encountered a broken upload entry which led to an exception in
the cleanup worker and when querying the image-list [1]. Add more debug
info so we can determine the broken upload entry in zk.

[1] Trace:
2020-04-01 12:05:05,380 ERROR nodepool.builder.CleanupWorker.0: Exception in CleanupWorker:
Traceback (most recent call last):
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 557, in run
    self._run()
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 582, in _run
    self._cleanup()
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 379, in _cleanup
    self._buildUploadRecencyTable()
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 194, in _buildUploadRecencyTable
    2, image, build.id, provider, zk.READY)
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 1404, in getMostRecentBuildImageUploads
    uploads = self.getUploads(image, build_number, provider, states)
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 1373, in getUploads
    data = self.getImageUpload(image, build_number, provider, upload)
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 1337, in getImageUpload
    d = ImageUpload.fromDict(self._bytesToDict(data),
  File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 810, in _bytesToDict
    return json.loads(data.decode('utf8'))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Change-Id: Ic0f0233edfb87204cca68ae8a36416e13d05b311
2020-04-01 14:13:17 +02:00
David Shrewsbury 394d549c07 Support image uploads in 'info' CLI command
Change the 'info' command output to include image upload data.
For each image, we'll now output each build and the uploads for the build.

Change-Id: Ib25ce30d30ed718b2b6083c2127fdb214c3691f4
2020-03-19 15:03:34 -04:00
Simon Westphahl 72dd65ac07 Annotate logs in zk module
Request-related logs are annotated with the event and node request
id to improve traceability.

Change-Id: Ifc83a1d49c3fef2a16d150b8018cda9aff85c8b0
2020-01-27 15:27:58 +01:00
Simon Westphahl 94441b7997 Handle event id in node requests
Zuul is already creating node requests containing the event id. To expand
the event id logging to nodepool we need to set and de-/serialize the
event_id field.

For node requests that originate from within nodepool, we will generate a
random UUID as an event id.

Change-Id: I56b0bc18f3a55b8c3e3315a46b6276811e95a54f
2020-01-21 07:22:54 +01:00
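A minimal sketch of the fallback described above; the dict layout is
illustrative:

    import uuid

    def event_id_from_request(data):
        # Use the event id Zuul supplied, or generate one for requests
        # that originate from within nodepool (e.g. min-ready).
        return data.get('event_id') or uuid.uuid4().hex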
Fabien Boucher f57ac1881a
Remove unneeded shebang and exec bit on some files
Having python files with the exec bit and a shebang defined in
/usr/lib/python-*/site-package/ is not fine in an RPM package.

Instead of carrying a patch in the nodepool RPM packaging, it is better
to fix this directly upstream.

Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
2019-12-13 19:30:03 +01:00
Zuul c4db2a168d Merge "Do not overwrite image upload ZK data on delete" 2019-10-07 14:42:39 +00:00
David Shrewsbury 51030bcc6f Deregister a launcher when removed from config
We were depending only on the ephemeralness of launcher registration
nodes to indicate when a provider goes away. That only works when
nodepool-launcher is restarted since the ZK connection is shared
among all provider threads. This adds an API method to manually
deregister and modifies a test to ensure it works.

Change-Id: I363fc4a67a83b49b515b963f5d5accdcf1cb758c
2019-10-02 12:07:08 -04:00
David Shrewsbury 91dc050c3c Do not overwrite image upload ZK data on delete
This code is setting the ImageUpload state field to DELETING
and updating the data in ZooKeeper. However, it was using a new
ImageUpload object to do so, and we lose all other data, like the
external_id, external_name, etc. If the delete should fail in the
provider on the first attempt, the next attempt to delete will not
have these attributes, some of which (external_name) are needed to
actually delete the image. This means we leak the image in the
provider because the 2nd attempt will not try to delete in the
provider and will remove the ZooKeeper record.

During review of this fix, I realized that we were not locking
specific image uploads before modifying them or deleting them.
Although this likely would not cause any problems, it may have
caused the odd error in competing builder processes. This adds
a locking mechanism for that data, and tests to validate the lock.

Change-Id: I3fbf1c1a3ca10cc397facd4e939cbf4cbb2d8305
2019-09-24 10:43:25 -04:00
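A sketch of the read-modify-write pattern the fix describes: load the
existing upload record, change only its state, and write it back while
holding the new per-upload lock. The method names here are assumptions
modeled on the zk API seen in the tracebacks above:

    def mark_upload_deleting(zk, image, build_id, provider, upload_id):
        # Hypothetical lock/get/store helpers; the real method names may differ.
        with zk.lockImageUpload(image, build_id, provider, upload_id):
            upload = zk.getImageUpload(image, build_id, provider, upload_id)
            upload.state = 'deleting'  # keep external_id, external_name, etc.
            zk.storeImageUpload(image, build_id, provider, upload, upload_id)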
Simon Westphahl cb03f8b280 Don't pause static pool on single label quota
Nodes in a static pool can potentially have multiple labels. The current
implementation pauses any further request processing in case a label
could not be served immediately.

With the proposed change we will re-use the existing launch logic to
create a placeholder node in BUILDING state. As soon as a used node is
re-registered, this placeholder node will be used and the pending request
fulfilled.

This allows having a single pool for all static nodes, which prevents
orphaned Zookeeper entries. That can happen when, e.g., a label-specific
pool is removed or renamed.

Change-Id: If69cbeb33b8abb4f8e36146b9d849403dccb7d32
2019-07-12 14:33:36 +02:00
Tristan Cacqueray 76aa62230c Add python-path option to node
This change adds a new python_path Node attribute so that zuul executor
can remove the default hard-coded ansible_python_interpreter.

Change-Id: Iddf2cc6b2df579636ec39b091edcfe85a4a4ed10
2019-05-07 02:22:45 +00:00
James E. Blair 3561e278c6 Support requests for specific providers
In Zuul we support a paradigm where a job can be paused, whereupon
dependent jobs then request nodes and are started.  It is nearly
always the case that the user would like the nodes for the dependent
jobs to be in the same provider as the parent, as the use cases
generally involve transferring data between the two.

To support this, add a 'provider' attribute to the node request which,
if present, means that all on-line launchers for that provider must
decline the request before anyone else processes it.  This will cause
the desired behavior if everything is working, and if some calamity
befalls that launcher, other launchers can still attempt to fulfill
the request, which might work or might not, but performing that
last-ditch effort is fine once there are no alternatives.

Change-Id: I91fe05081695d454651f6068eac5c08ac30ff899
2019-02-26 16:00:35 -08:00
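A sketch of the deferral logic described above; the attribute names are
illustrative:

    def should_defer(request, my_provider, provider_launchers):
        # Defer handling if the request names another provider and that
        # provider's online launchers have not all declined it yet.
        if not request.provider or request.provider == my_provider:
            return False
        matching = set(provider_launchers.get(request.provider, []))
        return not matching.issubset(set(request.declined_by))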
Paul Belanger cc994d3a6a Include host_id for openstack provider
When we create a server using the openstack provider, in certain
conditions it is helpful to collect the host_id value. As an example,
Zuul will pass through this information into a job inventory file which
will allow an operator to better profile where jobs run within a cloud.

This can be helpful when trying to debug random jobs timing out within a
cloud.

Change-Id: If29397e67a470462561f24c746375b8291ac43ab
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-12-06 18:49:48 -05:00
David Shrewsbury 387b134a52 Add cleanup routine to delete empty nodes
We've discovered that our node deletion process has the possibility
to leave empty (i.e., no data) node znodes in ZooKeeper. Although a
fix for this has been merged, we need a way to remove this extraneous
data.

Change-Id: I6596060f5026088ce987e5d0d7c18b00a6b77c5a
2018-12-04 17:33:33 -05:00
James E. Blair b3053779a6 Fix race when deleting Node znodes
There is a race where:

  Thread A               Thread B
  locks the node
  deletes the instance
  deletes the lock
                         locks the node
  deletes the node data
                         loads empty data from the node

Thread A only deletes the node data because during its recursive
delete, a new child node (the lock from B) has appeared.  Thread B
proceeds with invalid data and errors out.

We can detect this condition because the node has no data.  It may
be theoretically possible to lock the node and load the data before
the node data are deleted as well, so to protect against this case,
we set a new node state, "deleted", before we start deleting anything.

If thread B encounters either of those two conditions (no data, or
a "deleted" state), we know we've hit this race and can safely attempt
to recursively delete the node again.

Change-Id: Iea5558f9eb471cf1096120b06c098f8f41ab59d9
2018-12-04 09:49:40 -08:00
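A sketch of detecting this race after locking a node; the helper and state
names are illustrative:

    def proceed_or_cleanup(zk, node_id, node_data):
        # node_data is the parsed znode content; it is empty (or marked
        # "deleted") when another thread already removed the instance.
        if not node_data or node_data.get('state') == 'deleted':
            zk.deleteRawNode(node_id)  # hypothetical: retry the recursive delete
            return None
        return node_data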
Tobias Henkel 3e675082ba
Don't update caches with empty zNodes
We found out that we leak some empty znodes that were just ignored by
nodepool before the caching changes. Now that we know these exist,
ignore them as well so we don't get spammed by exceptions.

Change-Id: I00a0ad2c7f645a2d03bd1674bf5d050c38b1dd50
2018-11-30 23:26:19 +01:00
James E. Blair a0e4ff87f7 Log exceptions in cache listener events
Change-Id: Ife566e09c23b644d8d777c0f59f1effb6be3ec6c
2018-11-30 13:47:57 -08:00
Zuul 7111fcb407 Merge "Support relative priority of node requests" 2018-11-30 01:58:40 +00:00
James E. Blair 854f638e35 Support relative priority of node requests
The launcher now processes node requests in relative priority
order.  This relies on the new node request cache because the
relative priority field may be updated at any time by the
requestor.

Needed-By: https://review.openstack.org/615356
Change-Id: If893c34c6652b9649bfb6f1d9f7b942c549c98b4
2018-11-29 14:26:11 -08:00
Zuul 81936cd818 Merge "Asynchronously update node statistics" 2018-11-29 21:00:14 +00:00
Zuul 5f2281a59e Merge "Add arbitrary node attributes config option" 2018-11-29 20:14:31 +00:00
Tobias Henkel 64487baef0
Asynchronously update node statistics
We currently update the node statistics on every node launch or
delete. This cannot use caching at the moment because when the
statistics are updated we might end up pushing slightly outdated
data. If there is then no further update for a longer time, we end up
with broken gauges. We already get update events from the node cache
so we can use that to centrally trigger node statistics updates.

This is combined with leader election so there is only a single
launcher that keeps the statistics up to date. This will ensure that
the statistics are not cluttered because of several launchers pushing
their own slightly different view into the stats.

As a side effect this reduces the runtime of a test that creates 200
nodes from 100s to 70s on my local machine.

Change-Id: I77c6edc1db45b5b45be1812cf19eea66fdfab014
2018-11-29 16:48:30 +01:00
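A sketch of the leader-election approach using kazoo's Election recipe, so
only one launcher pushes the gauges; the election path and the update loop
are assumptions:

    import time

    def run_stats_leader(client, hostname, update_stats):
        election = client.Election("/nodepool/elections/stats", hostname)

        def lead():
            # Only the elected leader runs this loop and pushes statistics.
            while True:
                update_stats()
                time.sleep(10)

        # Blocks until this launcher is elected, then invokes lead().
        election.run(lead)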
David Shrewsbury 16325d5c4c Add arbitrary node attributes config option
This config option, available under each provider pool section, can
contain static key-value pairs that will be stored in ZooKeeper on
each Node znode. This will allow us to pass along arbitrary data from
nodepool to any user of nodepool (specifically, zuul). Initially, this
will be used to pass along zone information to zuul executors.

Change-Id: I126d37a8c0a4f44dca59c11f76a583b9181ab653
2018-11-29 09:35:59 -05:00
Tobias Henkel 9d77f05d8e
Only setup zNode caches in launcher
We currently only need to set up the zNode caches in the
launcher. Within the command-line client and the builders this is just
unnecessary work.

Change-Id: I03aa2a11b75cab3932e4b45c5e964811a7e0b3d4
2018-11-26 20:13:39 +01:00
Tobias Henkel 6eb80deb36
Add second level cache to node requests
When implementing dynamic node request priorities we need to be able
to quickly get all requests and reorder them in a priority queue. This
won't be efficient if this involves excessive json parsing. So this
adds an event-based second-level cache.

Change-Id: I923195de1890fdb74f7e5b33a0165f400dbbf374
2018-11-26 20:10:35 +01:00
Tobias Henkel 35094dbb62
Add second level cache of nodes
An earlier change introduced zNode caching of Nodes. This first
version still required frequent re-parsing of the node json data. This
change introduces a second level cache that updates a dict of the
current nodes based on the events emitted by the TreeCache. Using this
we can reduce json parsing and make node caching more effective.

Change-Id: I4834a7ea722cf2ac7df79455ce077832ae966e63
2018-11-26 20:04:04 +01:00
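A sketch of such a second-level cache: parse each node znode once when the
TreeCache emits an event and keep the result in a dict keyed by node ID.
The paths and parsing are simplified assumptions:

    import json
    from kazoo.recipe.cache import TreeCache, TreeEvent

    class NodeCache:
        def __init__(self, client, root="/nodepool/nodes"):
            self._nodes = {}
            self._cache = TreeCache(client, root)
            self._cache.listen(self._on_event)
            self._cache.start()

        def _on_event(self, event):
            if event.event_type not in (TreeEvent.NODE_ADDED,
                                        TreeEvent.NODE_UPDATED,
                                        TreeEvent.NODE_REMOVED):
                return
            path = event.event_data.path
            data = event.event_data.data
            node_id = path.rsplit('/', 1)[-1]
            if event.event_type == TreeEvent.NODE_REMOVED:
                self._nodes.pop(node_id, None)
            elif data:  # skip empty znodes, as in the related fixes above
                self._nodes[node_id] = json.loads(data.decode('utf8'))

        def cached_nodes(self):
            return list(self._nodes.values())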
Tobias Henkel 1d278ffaee
Update node request during locking
After locking a node request we usually update it to double-check
whether the state changed on us. This is important so we do not operate
on stale data, as we only have a guarantee that the node request data is
up to date if we refresh it after getting the lock (regardless of whether
we operate on cached data or not). However it is cumbersome to do this
after every lockNodeRequest call, so doing it centrally reduces the
risk of forgetting it.

This does not introduce any behavior change.

Change-Id: Ie6713fdd01225bbcff7839605ac9f94dc6cb2f78
2018-11-26 19:40:43 +01:00
Tobias Henkel 36666800ff
Cache node request zNodes
When implementing dynamic priority queues in nodepool we need to cache
the requests. This introduces simple node request caching using the
TreeCache supplied by kazoo. This currently only caches the raw
zookeeper data and still requires json parsing when traversing the
requests. In a later change we can introduce a second layer of caching
that updates the NodeRequest objects in place so we can avoid
excessive json parsing.

Change-Id: I84b3474636ea1f14faa6bf7c8a6c6f83598c3741
2018-11-26 19:38:10 +01:00