3250 Commits

Author SHA1 Message Date
Zuul 75ebae3162 Merge "gce: clear machine type cache on bad data" 2024-03-25 19:35:17 +00:00
Zuul 28c6c1d4f7 Merge "Continuously ensure the component registry is up to date" 2024-03-25 18:17:01 +00:00
Zuul 22cd51fdf6 Merge "Fix leaked upload cleanup" 2024-03-20 13:04:01 +00:00
Zuul b1a40f1fd3 Merge "Add delete-after-upload option" 2024-03-18 21:06:46 +00:00
Benjamin Schanzel 1199ac77e5 Fix nodepool builder stats test
In the metric name, we use the builder's FQDN as a key, but in the test
we used the hostname. So this test fails on systems where the two are
not the same.

Change-Id: If286f19371d1fd70dc9bee4b7af814d13396357b
2024-03-18 15:33:03 +01:00
Zuul 17c57eea85 Merge "Improve logging around manual/scheduled image builds" 2024-03-11 10:43:02 +00:00
James E. Blair fc95724601 Fix leaked upload cleanup
The cleanup routine for leaked image uploads based its detection
on upload ids, but they are not unique except in the context of
a provider and build.  This meant that, for example, as long as
there was an upload with id 0000000001 for any image build for
the provider (very likely!) we would skip cleaning up any leaked
uploads with id 0000000001.

Correct this by using a key generated on build+upload (provider
is implied because we only consider uploads for our current
provider).

Update the tests relevant to this code to exercise this condition.

Change-Id: Ic68932b735d7439ca39e2fbfbe1f73c7942152d6
2024-03-09 13:46:17 -08:00
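The keying change described in the commit above can be sketched as follows. This is a minimal illustration, not nodepool's code: the `Upload` type and `find_leaked` helper are hypothetical names, and the provider is left implicit, as in the commit message.

```python
from typing import Iterable, List, NamedTuple


class Upload(NamedTuple):
    # Hypothetical record; upload ids are only unique within one build.
    build_id: str
    upload_id: str


def find_leaked(known: Iterable[Upload], found: Iterable[Upload]) -> List[Upload]:
    """Return uploads seen in the cloud that have no matching known record."""
    # Key on (build, upload) rather than on the upload id alone.
    known_keys = {(u.build_id, u.upload_id) for u in known}
    return [u for u in found if (u.build_id, u.upload_id) not in known_keys]


known = [Upload("0000000007", "0000000001")]
found = [
    Upload("0000000007", "0000000001"),  # still tracked
    Upload("0000000003", "0000000001"),  # same upload id, different build
]
leaked = find_leaked(known, found)
```

Keying on the upload id alone would treat both entries in `found` as known; the composite key reports only the second one as leaked.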
James E. Blair fd454706ca Add delete-after-upload option
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.

A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.

Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
2024-03-09 06:51:56 -08:00
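A rough sketch of the selection logic this option implies, assuming a list of built formats and a list of formats to keep. The function and parameter names are hypothetical and do not reflect nodepool's actual configuration keys.

```python
from typing import Iterable, List


def formats_to_delete(built_formats: Iterable[str],
                      keep_formats: Iterable[str]) -> List[str]:
    """Return the image formats whose files may be removed after upload."""
    keep = set(keep_formats)
    return sorted(fmt for fmt in built_formats if fmt not in keep)


# Example: drop the large raw file but retain the qcow2 copy.
deletable = formats_to_delete(["raw", "qcow2", "vhd"], keep_formats=["qcow2"])
```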
Zuul f4941d4f03 Merge "Add some builder operational stats" 2024-03-07 20:36:14 +00:00
Zuul 9764937c49 Merge "Add rackspaceauth as a nodepool dependency" 2024-03-07 18:09:28 +00:00
James E. Blair 1ac3f8bda4 Pin kazoo
To match Zuul, pin the version of kazoo we're using.

Change-Id: I104faa50c6a2fb4bfa7955c0ead34453a0505672
2024-03-06 15:01:28 -08:00
James E. Blair 963cb3f5b4 gce: clear machine type cache on bad data
We have observed GCE returning bad machine type data which we
then cache.  If that happens, clear the cache to avoid getting
stuck with the bad data.

Change-Id: I32fac2a92d4f9d400fe2db41fffd8d189d097542
2024-03-05 16:05:18 -08:00
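The general pattern in the commit above (drop cached data once it is found to be bad) might look like the sketch below. The commit does not say what "bad data" means, so the validity check here, which requires the `guestCpus` and `memoryMb` fields, is an assumption, as are the class and method names.

```python
from typing import Callable, Dict


class MachineTypeCache:
    """Hypothetical cache that clears itself when it holds invalid data."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}

    @staticmethod
    def _is_valid(data: dict) -> bool:
        # Assumed check: a usable record carries CPU and memory figures.
        return "guestCpus" in data and "memoryMb" in data

    def get(self, name: str) -> dict:
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        data = self._cache[name]
        if not self._is_valid(data):
            # Clear the cache so the next call fetches fresh data
            # instead of returning the same bad record forever.
            self._cache.clear()
            raise ValueError("bad machine type data for %s" % name)
        return data
```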
Clark Boylan 6820b309a1 Add rackspaceauth as a nodepool dependency
Rackspace has announced that MFA will be required starting on March 26,
2024. When MFA is enabled on an account you will no longer be able to
log in to rackspace using a username and password with
openstacksdk/openstackclient/etc as the APIs apparently don't support
negotiating the MFA token. Instead we can either use a rackspace
specific api_key or a keystone bearer token.

We opt for the rackspace specific api_key because it doesn't expire
like the bearer tokens do. But using the rackspace api_key does require
a keystoneauth1 plugin called `rackspaceauth` to be installed which this
change adds to nodepool.

This new dep is Apache2 licensed according to the License file in the
sdist. The new dep has minimal deps of its own and they are all already
shared by the existing dep tree. Seems reasonable to install this small
lib in hopes that we can keep rackspace working with nodepool.

As a final note the OpenDev team plans to test use of the api_key with
this library against a single rackspace region. It is possible this
won't work out of the box and we may need to make additional updates.
Unfortunately, it isn't easy to test this without talking directly to
rax so we opt for the lib install and testing via OpenDev.

Change-Id: Ibff32bb44e05413391dd7a320ba356f521bb30e8
2024-03-05 09:49:04 -08:00
Zuul eef07d21f3 Merge "Metastatic: Copy cloud attribute from backing node" 2024-03-05 16:51:00 +00:00
Zuul dfd2877c81 Merge "Use seprate log package for NodescanRequest messages" 2024-03-05 15:32:18 +00:00
James E. Blair 619dee016c Continuously ensure the component registry is up to date
On startup, the launcher waits up to 5 seconds until it has seen
its own registry entry because it uses the registry to decide if
other components are able to handle a request, and if not, fail
the request.

In the case of a ZK disconnection, we will lose all information
about registered components as well as the tree caches.  Upon
reconnection, we will repopulate the tree caches and re-register
our component.

If the tree cache repopulation happens first, our component
registration may be in line behind several thousand ZK events.  It
may take more than 5 seconds to repopulate and it would be better
for the launcher to wait until the component registry is up to date
before it resumes processing.

To fix this, instead of only waiting on the initial registration,
we check each time through the launcher's main loop that the registry
is up-to-date before we start processing.  This should include
disconnections because we expect the main loop to abort with an
error and restart in those cases.

This operates only on local cached data, so it doesn't generate any
extra ZK traffic.

Change-Id: I1949ec56610fe810d9e088b00666053f2cc37a9a
2024-03-04 14:28:11 -08:00
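One way to picture the per-iteration check described in the commit above is sketched below. The `Registry` class, its `is_up_to_date` method, and the choice to skip an iteration when the registry is stale are all assumptions for illustration; the launcher's real interfaces are not shown in the commit message.

```python
from typing import Callable, Set


class Registry:
    """Hypothetical local cache of registered component ids."""

    def __init__(self) -> None:
        self.entries: Set[str] = set()

    def is_up_to_date(self, own_id: str) -> bool:
        # Reads only the local cache, so no extra ZooKeeper traffic.
        return own_id in self.entries


def main_loop_iteration(registry: Registry, own_id: str,
                        process_requests: Callable[[], None]) -> bool:
    """Run one pass of the main loop; return True if requests were processed."""
    if not registry.is_up_to_date(own_id):
        # Our own registration has not been seen yet (for example after a
        # reconnect), so do not decide anything based on a stale registry.
        return False
    process_requests()
    return True
```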
Zuul 392cf017c3 Merge "Add support for AWS IMDSv2" 2024-02-28 02:46:53 +00:00
Zuul 8775e54e5d Merge "Remove hostname-format option" 2024-02-28 02:46:52 +00:00
Zuul fe70068909 Merge "Add host-key-checking to metastatic driver" 2024-02-27 19:03:48 +00:00
Zuul 202188b2f5 Merge "Reconcile docs/validation for some options" 2024-02-27 18:19:33 +00:00
Benjamin Schanzel 4c4e8aefdb Metastatic: Copy cloud attribute from backing node
As is done with several other metadata fields, copy the `cloud` attribute
from the backing node to the metastatic node.

Change-Id: Id83b3e09147baaab8a85ace4d5beba77d1eb87bd
2024-02-23 14:20:09 +01:00
James E. Blair 8259170516 Change the AWS default image volume-type from gp2 to gp3
gp3 is better in almost every way (cheaper, faster, more configurable).
It seems difficult to find a situation where gp2 would be a better
choice, so update the default when creating images to use gp3.

There are two locations where we can specify volume-type: image creation
(where the volume type becomes the default type for the image) and
instance creation (where we can override what the image specifies).
This change updates only the first (image creation), but not the second,
which has no default (which means to use whatever the image specified).

https://aws.amazon.com/ebs/general-purpose/

Change-Id: Ibfc5dfd3958e5b7dbd73c26584d6a5b8d3a1b4eb
2024-02-20 13:04:26 -08:00
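The two-level precedence described in the commit above can be illustrated as follows. This is a sketch with hypothetical function names: only the image-creation default is `gp3`, while instance creation has no default and falls back to whatever the image specifies.

```python
from typing import Optional

IMAGE_DEFAULT_VOLUME_TYPE = "gp3"  # the new default at image creation


def image_volume_type(configured: Optional[str] = None) -> str:
    """Volume type recorded on the image when it is created."""
    return configured or IMAGE_DEFAULT_VOLUME_TYPE


def instance_volume_type(override: Optional[str], image_type: str) -> str:
    """Volume type used at instance creation; no default of its own."""
    return override or image_type


image_type = image_volume_type()
plain_instance = instance_volume_type(None, image_type)
overridden_instance = instance_volume_type("gp2", image_type)
```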
James E. Blair 646b7f4927 Add some builder operational stats
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.

Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
2024-02-19 15:47:17 -08:00
Zuul 46268e56ee Merge "Switch functional openstack job to jammy" 2024-02-16 00:18:02 +00:00
Zuul 2240f001eb Merge "Refactor config loading in builder and launcher" 2024-02-15 10:23:51 +00:00
James E. Blair 21f1b88b75 Add host-key-checking to metastatic driver
If a long-running backing node used by the metastatic driver develops
problems, performing a host-key-check each time we allocate a new
metastatic node may detect these problems.  If that happens, mark
the backing node as failed so that no more nodes are allocated to
it and it is eventually removed.

Change-Id: Ib1763cf8c6e694a4957cb158b3b6afa53d20e606
2024-02-13 14:12:52 -08:00
Zuul 8572e1874a Merge "Fix max concurrency log line" 2024-02-12 19:51:08 +00:00
Clark Boylan c7f52ed97f Fetch compatible dnf download command in container image
The dnf-plugins-core repo updates its download command to use a
dnf.utils method that is not present in the dnf version installed by
Debian packages. Update the fetch of dnf-plugins-core to use the last
version of the download plugin that is compatible with the dnf package
in Debian.

Note that we don't use the bookworm dnf-plugins-core package to address
this because dnf-plugins-core specifies that it breaks and replaces
zypper. There doesn't seem to be a good reason for this as there is no
file overlap between the packages according to `apt-file list`.

Change-Id: I6fbf7db87a8272dae2552f9075addec2d5c82e56
2024-02-09 11:55:17 -08:00
James E. Blair 53c3e5b221 Switch functional openstack job to jammy
It appears that centos-9 stream image builds are broken.  We don't
actually care what image we build in this job, so switch to jammy
which should be working.

Change-Id: If574a4b6d26230d7bb98cb2c9eab819a08f10eff
2024-02-09 09:45:45 -08:00
James E. Blair e097731339 Remove hostname-format option
This option has not been used since at least the migration to the
statemachine framework.

Change-Id: I7a0e928889f72606fcbba0c94c2d49fbb3ffe55f
2024-02-08 09:40:41 -08:00
James E. Blair f89b41f6ad Reconcile docs/validation for some options
Some drivers were missing docs and/or validation for options that
they actually support.  This change:

adds launch-timeout to:
  metastatic docs and validation
  aws validation
  gce docs and validation
adds post-upload-hook to:
  aws validation
adds boot-timeout to:
  metastatic docs and validation
adds launch-retries to:
  metastatic docs and validation

Change-Id: Id3f4bb687c1b2c39a1feb926a50c46b23ae9df9a
2024-02-08 09:36:35 -08:00
Zuul 7abd12906f Merge "Introduce error.capacity states_key for InsufficientInstanceCapacity error" 2024-02-08 08:23:17 +00:00
James E. Blair c78fe769f2 Allow custom k8s pod specs
This change adds the ability to use the k8s (and friends) drivers
to create pods with custom specs.  This will allow nodepool admins
to define labels that create pods with options not otherwise supported
by Nodepool, as well as pods with multiple containers.

This can be used to implement the versatile sidecar pattern, which,
in a system where it is difficult to background a system process (such
as a database server or container runtime) is useful to run jobs with
such requirements.

It is still the case that a single resource is returned to Zuul, so
a single pod will be added to the inventory.  Therefore, the expectation
that it should be possible to shell into the first container in the
pod is documented.

Change-Id: I4a24a953a61239a8a52c9e7a2b68a7ec779f7a3d
2024-01-30 15:59:34 -08:00
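For orientation, a two-container pod spec of the kind the sidecar pattern calls for might look like the structure below, written as a Python dictionary. The field names (`containers`, `name`, `image`, `command`) are standard Kubernetes pod spec fields; the container names, images, and commands are placeholders, and how such a spec is attached to a nodepool label is not shown here.

```python
# The first container is the one jobs are expected to shell into.
pod_spec = {
    "containers": [
        {
            "name": "main",              # placeholder name
            "image": "example/worker:latest",
            "command": ["sleep", "infinity"],
        },
        {
            "name": "database-sidecar",  # placeholder sidecar
            "image": "example/database:latest",
        },
    ],
}

first_container = pod_spec["containers"][0]["name"]
```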
Simon Westphahl 4ea2e82e56 Improve logging around manual/scheduled image builds
Change-Id: I51cff9785842feb927d3e9740283309d80773bf6
2024-01-30 14:01:05 +01:00
Simon Westphahl 4ae0a6f9a6 Refactor config loading in builder and launcher
In I93400cc156d09ea1add4fc753846df923242c0e6 we refactored the
launcher config loading to use the last modified timestamps of the
config files to detect if a reload is necessary.

In the builder the situation is even worse as we reload and compare the
config much more often e.g. in the build worker when checking for manual
or scheduled image updates.

With a larger config (2-3MB range) this is a significant performance
problem that can lead to builders being busy with config loading instead
of building images.

Yappi profile (performed with the optimization proposed in
I786daa20ca428039a44d14b1e389d4d3fd62a735, which doesn't fully solve the
problem):

name                                  ncall  tsub      ttot      tavg
..py:880 AwsProviderDiskImage.__eq__  812..  17346.57  27435.41  0.000034
..odepool/config.py:281 Label.__eq__  155..  1.189220  27403.11  0.176285
..643 BuildWorker._checkConfigRecent  58     0.000000  27031.40  466.0586
..depool/config.py:118 Config.__eq__  58     0.000000  26733.50  460.9225

Change-Id: I929bdb757eb9e077012b530f6f872bea96ec8bbc
2024-01-30 13:59:36 +01:00
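The timestamp-based approach mentioned in the commit above can be sketched with the standard library alone. The `ConfigWatcher` class is a hypothetical illustration, not the builder's implementation: it compares file modification times instead of parsing the configuration and comparing whole config objects.

```python
import os
from typing import Dict, Iterable, Optional


class ConfigWatcher:
    """Report whether any watched config file changed since the last check."""

    def __init__(self, paths: Iterable[str]):
        self.paths = list(paths)
        self._seen: Optional[Dict[str, int]] = None

    def _mtimes(self) -> Dict[str, int]:
        # One stat() per file is far cheaper than loading a multi-megabyte
        # config and running deep equality checks on the result.
        return {p: os.stat(p).st_mtime_ns for p in self.paths}

    def needs_reload(self) -> bool:
        current = self._mtimes()
        changed = current != self._seen
        self._seen = current
        return changed
```

The first call always reports a reload, since nothing has been seen yet; later calls report one only when a modification time differs.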
Zuul e59bd1e331 Merge "Fix duplicate fd registration in nodescan" 2024-01-30 00:34:18 +00:00
Clark Boylan e5c1790be7 Rollback to 1.28/stable microk8s in functional testing
We use latest/stable by default which very recently updated to
1.29/stable. Unfortunately it appears there are issues [0] with this
version on Debian Bookworm which also happens to be the platform we test
on. Our jobs have been consistently failing in a manner that appears
related to this issue. Update the job to collect logs so that we can
better confirm this is the case and rollback to 1.28 which should be
working.

Also update the AWS tests to handle a recent moto release which
requires us to use mock_aws rather than individual mock_* classes.

[0] https://github.com/canonical/microk8s/issues/4361

Change-Id: I72310521bdabfc3e34a9f2e87ff80f6d7c27c180
Co-Authored-By: James E. Blair <jim@acmegating.com>
Co-Authored-By: Jeremy Stanley <fungi@yuggoth.org>
2024-01-29 14:15:54 -08:00
James E. Blair 3f4fb008b0 Add support for AWS IMDSv2
This is an authenticated http metadata service which is typically
available by default, but a more secure setup is to enforce its
usage.

This change adds the ability to do that for both instances and
AMIs.

Change-Id: Ia8554ff0baec260289da0574b92932b37ffe5f04
2024-01-24 15:11:35 -08:00
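At the EC2 API level, requiring IMDSv2 corresponds to the `MetadataOptions` parameter (with `HttpTokens` set to `required`) when launching instances and the `ImdsSupport` parameter (value `v2.0`) when registering an AMI. The sketch below only builds those keyword arguments; the helper names and the `imds_required` flag are hypothetical, and how nodepool wires them into its driver is not shown.

```python
from typing import Any, Dict


def instance_metadata_kwargs(imds_required: bool) -> Dict[str, Any]:
    """Extra keyword arguments for an EC2 RunInstances call."""
    if not imds_required:
        return {}
    return {
        "MetadataOptions": {
            "HttpTokens": "required",   # reject unauthenticated (v1) requests
            "HttpEndpoint": "enabled",
        }
    }


def image_registration_kwargs(imds_required: bool) -> Dict[str, Any]:
    """Extra keyword arguments for an EC2 RegisterImage call."""
    return {"ImdsSupport": "v2.0"} if imds_required else {}
```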
James E. Blair 70be90e742 Fix duplicate fd registration in nodescan
In an attempt to make the nodescan process as quick as possible,
we start the connection in the provider statemachine thread before
handing the remaining work off to the nodescan statemachine thread.

However, if the nodescan worker is near the end of its request list
when the provider adds the request, then it may end up performing
the initial connection nearly simultaneously with the provider
thread.  They may both create a socket and attempt to register
the FD.  If the race results in them registering the same FD,
the following exception occurs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 253, in runStateMachine
    keys = self.nodescan_request.result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1295, in result
    raise self.exception
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1147, in addRequest
    self._advance(request, False)
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1187, in _advance
    request.advance(socket_ready)
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1379, in advance
    self._connect()
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1340, in _connect
    self.worker.registerDescriptor(self.sock)
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1173, in registerDescriptor
    self.poll.register(
FileExistsError: [Errno 17] File exists

To address this, rather than attempting to coordinate work between
these two threads, let's just let the nodescan worker handle it.
To try to keep the process responsive, we'll wake the nodescan worker
if it's sleeping.

Change-Id: I5ceda68b856c09bf7606e62ac72ca5c5c76d2661
2024-01-24 14:54:43 -08:00
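The error in the traceback above can be reproduced in isolation, assuming an epoll-based poller (which the `FileExistsError` suggests). This sketch is Linux-only and is not nodepool code; it simply registers the same file descriptor twice, as two racing threads might.

```python
import select
import socket
from typing import Optional


def register_twice() -> Optional[int]:
    """Register one fd twice with epoll; return the resulting errno.

    Returns None on platforms without epoll, 0 if no error occurred.
    """
    if not hasattr(select, "epoll"):
        return None
    poller = select.epoll()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        poller.register(sock.fileno(), select.EPOLLIN)
        try:
            # Second registration of the same fd, as in the race above.
            poller.register(sock.fileno(), select.EPOLLIN)
        except FileExistsError as e:
            return e.errno
        return 0
    finally:
        poller.close()
        sock.close()


result = register_twice()
```

Having a single thread own all registrations, as the fix does, removes the possibility of the second call.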
Dong Zhang aa113be19f Introduce error.capacity states_key for InsufficientInstanceCapacity error
We want to handle the "InsufficientInstanceCapacity" error differently from
other "error.unknown" errors in our monitoring/alerting system. With this
change, it produces an "error.capacity" key instead of "error.unknown".

Change-Id: Id3a49d4b2d4b4733f801e65df69b505e913985a7
2024-01-04 15:37:51 +01:00
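The mapping described in the commit above amounts to something like the following. The function name is hypothetical, and only the two keys named in the commit message are covered.

```python
def stats_key_for_error(error_code: str) -> str:
    """Map a cloud error code to the stats key used for monitoring."""
    if error_code == "InsufficientInstanceCapacity":
        return "error.capacity"
    # Anything not specifically recognized keeps the generic key.
    return "error.unknown"
```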
James E. Blair cc51696a33 Fix max concurrency log line
The extra comma produces a (non-fatal) log formatting error.  Correct it.

Change-Id: I144347da2ac99cba788da6e60889d2b2bc320c6e
2023-12-21 07:45:14 -08:00
Zuul 42f9100d82 Merge "Fix gpu parameter type in openshiftpods" 2023-12-12 20:36:18 +00:00
Zuul ac703de734 Merge "Resolve statsd client once at startup" 2023-12-09 16:18:57 +00:00
James E. Blair 9c496e2939 Redact k8s connection info in node list
The node list (web and cli) displays the connection port for the
node, but the k8s drivers use that to send service account
credential info to zuul.

To avoid exposing this to users if operators have chosen to make
the nodepool-launcher webserver accessible, redact the connection
port if it is not an integer.

This also affects the command-line nodepool-list in the same way.

Change-Id: I7a309f95417d47612e40d983b3a2ec6ee4d0183a
2023-12-01 10:30:03 -08:00
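The redaction rule in the commit above reduces to a type check. In this sketch the function name and the placeholder string are assumptions; the commit only states that non-integer connection ports are redacted.

```python
from typing import Any, Union


def display_port(connection_port: Any) -> Union[int, str]:
    """Return the port for display, hiding anything that is not an integer."""
    if isinstance(connection_port, int):
        return connection_port
    # The k8s drivers store credential info here, so never show it.
    return "redacted"
```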
James E. Blair 5849c3b9a7 Fix gpu parameter type in openshiftpods
In config validation, the gpu parameter type was specified as str
rather than float.  This is corrected.

This was not discovered in testing because the only tests which use
the gpu parameter for the other k8s drivers are not present in the
openshiftpods driver.  This change also adds the missing tests for
the default resource and resource limits feature which exercises the
gpu limits.

Change-Id: Ife932acaeb5a90ebc94ad36c3b4615a4469f0c40
2023-12-01 08:06:26 -08:00
Zuul f39d2b955a Merge "Fix logging of failed node launches" 2023-11-29 16:19:18 +00:00
James E. Blair cb8366d70c Use backing node attributes as metastatic default
To support the use case where one has multiple pools providing
metastatic backing nodes, and those pools are in different regions,
and a user wishes to use Zuul executor zones to communicate with
whatever metastatic nodes are eventually produced from those regions,
this change updates the launcher and metastatic driver to use
the node attributes (where zuul executor region names are specified)
as default values for metastatic node attributes.  This lets users
configure nodepool with zuul executor zones only on the backing pools.

Change-Id: Ie6bdad190f8f0d61dab0fec37642d7a078ab52b3
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:24 -08:00
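The defaulting behaviour described in the commit above is essentially a dictionary merge in which the metastatic pool's own attributes win. The function name and the attribute key used in the example are hypothetical.

```python
from typing import Dict, Optional


def metastatic_node_attributes(backing_attrs: Optional[Dict[str, str]],
                               pool_attrs: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Use backing node attributes as defaults for the metastatic node."""
    merged = dict(backing_attrs or {})
    # Attributes set on the metastatic pool override the inherited defaults.
    merged.update(pool_attrs or {})
    return merged


attrs = metastatic_node_attributes({"executor-zone": "region-a"}, None)
```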
James E. Blair 7a1c75f918 Fix metastatic missing pool config
The metastatic driver was ignoring the 3 standard pool configuration
options (max-servers, priority, and node-attributes) due to a missing
superclass method call.  Correct that and update tests to validate.

Further, the node-attributes option was undocumented for the metastatic
driver, so add it to the docs.

Change-Id: I6a65ea5b8ddb319bc131f87e0793f3626379e15f
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:19 -08:00
Zuul 28d36ad5a5 Merge "Fix UnboundLocalError in error handling of runStateMachine" 2023-11-20 15:27:35 +00:00
Benjamin Schanzel 65a81ad7b5 Use seprate log package for NodescanRequest messages
The state transition log messages for the Nodescan statemachine can be
quite excessive. While they might be useful for debugging, it's not
always needed to have all the log messages available.
To provide an easier way to filter these messages, use a dedicated log
package in the NodescanRequest class.

Change-Id: I2b1a625f5e5e375317951e410a27ff4243d4a0ef
2023-11-20 14:46:36 +01:00
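With a dedicated logger, the noisy messages can be filtered independently using the standard `logging` module. The logger names below are hypothetical, since the commit message does not give the real package names.

```python
import logging

# Hypothetical names: one logger for the state machine in general and a
# separate, non-child logger for per-request nodescan messages.
statemachine_log = logging.getLogger("example.statemachine")
nodescan_log = logging.getLogger("example.nodescan.request")

# Keep full debug output for the state machine while silencing the
# per-transition debug messages from nodescan requests.
statemachine_log.setLevel(logging.DEBUG)
nodescan_log.setLevel(logging.INFO)
```

Because the second logger is not a child of the first, changing one level does not affect the other.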