Commit Graph

242 Commits

Author SHA1 Message Date
James E. Blair 6a71686b6b Fix deletion of intermediate image builds
The delete-after-upload option would sometimes delete before
upload if there are intermediate build files which match its
criteria.  To ensure that we don't delete any build files until
the build is at least complete, only run that method if the ZK
build state is marked as ready.

Change-Id: Ia263b478b0a2b9d77833bc19d0967a65dcbf5b27
2024-04-02 15:26:41 -07:00
Zuul b1a40f1fd3 Merge "Add delete-after-upload option" 2024-03-18 21:06:46 +00:00
Zuul 17c57eea85 Merge "Improve logging around manual/scheduled image builds" 2024-03-11 10:43:02 +00:00
James E. Blair fd454706ca Add delete-after-upload option
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.

A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.

Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
2024-03-09 06:51:56 -08:00
James E. Blair 646b7f4927 Add some builder operational stats
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.

Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
2024-02-19 15:47:17 -08:00
Simon Westphahl 4ea2e82e56
Improve logging around manual/scheduled image builds
Change-Id: I51cff9785842feb927d3e9740283309d80773bf6
2024-01-30 14:01:05 +01:00
Simon Westphahl 4ae0a6f9a6
Refactor config loading in builder and launcher
In I93400cc156d09ea1add4fc753846df923242c0e6 we've refactore the
launcher config loading to use the last modified timestamps of the
config files to detect if a reload is necessary.

In the builder the situation is even worse as we reload and compare the
config much more often e.g. in the build worker when checking for manual
or scheduled image updates.

With a larger config (2-3MB range) this is a significant performance
problem that can lead to builders being busy with config loading instead
of building images.

Yappi profile (performed with the optimization proposed in
I786daa20ca428039a44d14b1e389d4d3fd62a735, which doesn't fully solve the
problem):

name                                  ncall  tsub      ttot      tavg
..py:880 AwsProviderDiskImage.__eq__  812..  17346.57  27435.41  0.000034
..odepool/config.py:281 Label.__eq__  155..  1.189220  27403.11  0.176285
..643 BuildWorker._checkConfigRecent  58     0.000000  27031.40  466.0586
..depool/config.py:118 Config.__eq__  58     0.000000  26733.50  460.9225

Change-Id: I929bdb757eb9e077012b530f6f872bea96ec8bbc
2024-01-30 13:59:36 +01:00
James E. Blair 3b434098c6 Add an image upload timeout to the openstack driver
Some uploads in opendev are taking hours.

We used to wait 6 hours for this, but we ended up using the SDK
default of 1 hour in recent versions.  Since we're seeing so much
disparity in time, make it user configurable.

Remove the unused 6 hour constant.

Change-Id: I9ca5fdbf7c66f176eb4f650fd287514708f46c16
2023-09-06 08:04:51 -07:00
James E. Blair 4ef3ebade8 Update references of build "number" to "id"
This follows the previous change and is intended to have little
or no behavior changes (only a few unit tests are updated to use
different placeholder values).  It updates all textual references
of build numbers to build ids to better reflect that they are
UUIDs instead of integers.

Change-Id: I04b5eec732918f5b9b712f8caab2ea4ec90e9a9f
2023-08-02 11:18:15 -07:00
James E. Blair 3815cce7aa Change image ID from int sequence to UUID
When we export and import image data (for backup/restore purposes),
we need to reset the ZK sequence counter for image builds in order
to avoid collisions.  The only way we can do that is to create and
then delete a large number of znodes.  Some sites (including
OpenDev) have sequence numbers that are in the hundreds of thousands.

To avoid this time-consuming operation (which is only intended to
be run to restore from backup -- when operators are already under
additional stress!), this change switches the build IDs from integer
sequences to UUIDs.  This avoids the problem with collisions after
import (at least, to the degree that UUIDs avoid collisions).

The actual change is fairly simple, but many unit tests need to be
updated.

Since the change is user-visible in the command output (image lists,
etc), a release note is added.

A related change which updates all of the textual references of
build "number" to build "id" follows this one for clarity and ease
of review.

Change-Id: Ie7c68b094bc9733914a808756eeee8b62f696713
2023-08-02 11:18:15 -07:00
James E. Blair 084c6d56c2 Add a cache for image lists
For sites with large numbers of image uploads, performing a web API
request of /image-list can be time-consuming.  To improve response
time, maintain a TreeCache of image uploads and use that when listing
images.

The "image-list" CLI is unable to use this cache, and is already as
fast as it can be given the number of ZK requests it needs to issue.

The same is true for "dib-image-list", and this change adds a cache
for that as well.

Finally, since we otherwise would have three or four nearly identical
cache implementations, the TreeCache system is refactored into a base
class that can be used for all the caches.  It has some hook points
to deal with the very slight behavior differences.

Change-Id: Ibff0a9016936b461eccb1b48dcf42f5ad8d8434e
2023-01-31 16:43:43 -08:00
Benedikt Loeffler 44c708dd26 Cleanup local builds without .d folder
When using a custom image build tool the .d folder does not exists and
the cleanup of those local builds is skipped.
For these reason we should not relei on the .d folder, instead we
should look on all create images files and determinate the local builds
based on these.

Change-Id: I1c60af3347868089ebe489ddcadfbad2dc8fadde
2022-10-13 14:50:49 +02:00
James E. Blair 916d62a374 Allow specifying diskimage metadata/tags
For drivers that support tagging/metadata (openstack, aws, azure),
Add or enhance support for supplying tags for uploaded diskimages.

This allows users to set metadata on the global diskimage object
which will then be used as default values for metadata on the
provider diskimage values.  The resulting merged dictionary forms
the basis of metadata to be associated with the uploaded image.

The changes needed to reconcile this for the three drivers mentioned
above are:

All: the diskimages[].meta key is added to supply the default values
for provider metadata.

OpenStack: provider diskimage metadata is already supported using
providers[].diskimages[].meta, so no further changes are needed.

AWS, Azure: provider diskimage tags are added using the key
providers[].diskimages[].tags since these providers already use
the "tags" nomenclature for instances.

This results in the somewhat incongruous situation where we have
diskimage "metadata" being combined with provider "tags", but it's
either that or have images with "metadata" while we have instances
with "tags", both of which are "tags" in EC2.  The chosen approach
has consistency within the driver.

Change-Id: I30aadadf022af3aa97772011cda8dbae0113a3d8
2022-08-23 06:39:08 -07:00
James E. Blair 6a56940275 Fix race with two builders deleting images
In a situation with multiple builders, each configured with different
providers, it is possible for one builder to delete the ZK ImageBuild
record for a build from another builder between the time that the build
is completed but before the first upload starts.

This is because every builder looks for images to delete from ZK.  It
keeps the 2 most recent ready images (this should normally cover the
time period between a build and upload), unless the image is not
configured for any provider this builder knows about.  This is where
the disjoint providers come into play -- builder1 in our scenario
is not expected to have a configuration for provider2.

To correct this, we adjust this check so that the only time we
bypass the 2-most-recent-ready-images check is if the diskimage is
not configured at all.

That means that we still expect all builders to have a "diskimage"
entry for every image, but we don't need those to be configured
for any providers which this builder is not expected to handle.

Change-Id: Ic2fefda293fa0bcbc98ee7313198b37df0576299
2022-07-25 13:06:25 -07:00
Zuul e668905f08 Merge "Update ZooKeeper class connection methods" 2022-06-29 20:40:40 +00:00
James E. Blair 7bbdfdc9fd Update ZooKeeper class connection methods
This updates the ZooKeeper class to inherit from ZooKeeperBase
and utilize its connection methods.

It also moves the connection loss detection used by the builder
to be more localized and removes unused methods.

Change-Id: I6c9dbe17976560bc024f74cd31bdb6305d51168d
2022-06-29 07:46:34 -07:00
James E. Blair 6386170914 Support deleting DIB images while builder is offline
We don't currently support deleting a DIB image while the builder
that built it is offline.  The reason for this is to ensure that
we actually remove the files from disk on the builder.  The mechanism
is for all other builders to defer handling "DELETING" image nodes
in ZK to the builder which built them.

This can be problematic if the builder is offline for an extended
period, or permanently.

To address this case without compromising the original goal, we now
let any builder delete the uploads and ZK nodes for a DIB build.
Subsequently, every builder will now look for DIB manifest directories
within its image-dir, and if it finds one that does not have a
corresponding ZK node, it garbage collects that image from disk.

Change-Id: I65efb31ca02cea4bcf7ef8f962a00b5263ccf69c
2022-06-27 13:03:27 -07:00
James E. Blair f615ad922f Add nodepool.image_build_requests metric
This reports the number of outstanding manual image build requests.

Change-Id: I365516bdb1fa20a3129099a81825e8506b3af4df
2022-06-21 14:52:53 -07:00
James E. Blair a612aa603c Add the component registry from Zuul
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).

This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.

Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
2022-05-23 07:41:27 -07:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Zuul fc2e592d0d Merge "Add zookeeper-timeout connection config" 2022-03-24 15:23:02 +00:00
Tobias Henkel ec55126f6b
Add zookeeper-timeout connection config
The default zookeeper session timout is 10 seconds which is not enough
on a highly loaded nodepool. Like in zuul make this configurable so we
can avoid session losses.

Change-Id: Id7087141174c84c6cdcbb3933c233f5fa0e7d569
2022-02-23 23:01:11 +01:00
James E. Blair 86631344e3 Add provider image config to statemachine image upload
This adds an extra argument to the provider image upload method
so that it can have access to the provider image configuration
which it may need in order to obtain extra information such as
the architecture.

It also adds the upload number to the image name format so that
we may name image uploads by their number like we do instances.

Change-Id: I0f47b4443d86f021641f315af4b69da26c4713a6
2022-02-22 13:33:31 -08:00
Ian Wienand 0f257982f4 builder: Remove optional extension from DibImageFile
A DibImageFile represents one dib image-file on disk, so the extension
is required (see prior change
I214581ad80b7740e7ca749b574672d2c33b92474 where we modified callers
who were using this interface).

This fixes a bug by removing code; the pathlib with_suffix replacement
is not safe for image names with a period in them; consider

>>> pathlib.Path('image-v1.2-foo').with_suffix('.vhd')
PosixPath('image-v1.vhd')

We can now simply unconditionally append the extension in
DibImageFile.to_path().

Change-Id: I1bc812ddffacbcc414b8f7f372d9fca78bd87292
2021-12-22 15:13:39 +11:00
Ian Wienand 5704ce75fc builder: don't use DibImageFile to construct dib image path
This refactors the _buildImage function to not use DibImageFile to
construct the path to the dib output files.

DibImageFile represents one on-disk image.  The dib build argument is
different -- you give it the basename in "-o" and the output types,
and it creates "basename.ext" where "ext" represents the varioius
output types it produced.

This converts _buildImage to the simple thing and makes it a bit
clearer this is a basename.

Change-Id: I214581ad80b7740e7ca749b574672d2c33b92474
2021-12-21 16:41:47 +11:00
Ian Wienand e0a16431d3 builder: remove from_path interface
The static method from_path is only used from one place and is simply
joining a path; we can inline this and remove it for clarity.

Change-Id: Iade6e024516bf9ce212491d6461e00affb5971a0
2021-12-21 16:41:26 +11:00
Ian Wienand da8780526a builder: remove unused from_images_dir interface
Use of this function was removed in change
I210bb6ba1580403ddc3e5d87eb9bf60ffc7433d8

Change-Id: Iddf7a7b5bab75c98f01d960401ec093b5636f415
2021-12-21 14:52:58 +11:00
Albin Vass 0c84b7fa4e Add shell-type config
Ansible needs to know which shell type the node uses to operate
correctly, especially for ssh connections for windows nodes because
otherwise ansible defaults to trying bash.

Change-Id: I71abfefa57aaafd88f199be19ee7caa64efda538
2021-03-05 15:14:29 +01:00
Ian Wienand 45704178e9 rename imagesdir/elementdir with underscore
This makes absolutely no functional difference, but in a follow-on I
wish to add images_dir_required_free and the mix of underscores with
these values is inconsistent.

Change-Id: I2566afbd5c78c39bfa8f2fce32e2374a589b45aa
2020-11-26 15:15:37 +11:00
Zuul 62f52878d2 Merge "Fix fake image build under Mac OS" 2020-11-12 11:36:58 +00:00
Tobias Henkel 23b8c31216
Fix fake image build under Mac OS
Change I7f0617b91e071294fe6051d14475ead1d7df56b7 broke test cases
under Mac OS. This happens because the POLLIN and PULLHUP flags can be
set in the same event. Handling both flags for every event fixes this.

Change-Id: Ice9d05eb444ee8e84228c7464d7ebb0a6153b56d
2020-10-30 13:12:07 +01:00
Clark Boylan dfcf770260 Ignore unparsable/empty image build ZNode data
As with the parent it seems that we can end up with empty image build
records as well as empty image upload records. Handle the build case as
well so that we can still list image builds properly when in this state.
This logs the problem as well so that you can debug it further.

Change-Id: Ic5521f5e2d2b65db94c2a5794f474913631a7a1b
2020-10-06 12:44:18 +02:00
Zuul e0e4d08a87 Merge "Load diskimage configs before building them" 2020-09-21 16:09:04 +00:00
Zuul 0bf8712c80 Merge "Ignore unparsable/empty image upload ZNode data" 2020-09-03 05:41:59 +00:00
Clark Boylan 3f22d3f927 Load diskimage configs before building them
Currently we load our diskimage configs then iterate over all of them
rebuilding them if necessary. The problem is that since disk image
builds take so long we don't pick up pause configurations until we finish
attempting to build all images. Fix this by checking if our config
is current before building images. If not, return and force a rerun
through the rebuilding process which will load new configs.

This also fixes a test race that was noted in comments but a non issue
until this change. This change causes us to trip over the test race
frequently.

Change-Id: If0b7321ee2c804a4ae8123f2369a660780272b52
2020-08-21 10:37:34 -07:00
James E. Blair 00d62af3c3 Add image-pause CLI command
This adds a CLI commend to set a flag in ZK for images indicating
that the image should be paused.  This can be used to quickly pause
the building and uploading of one or more images globally.  This
will effectively be boolean OR'd with the pause value for diskimage
builds in the config file.

In particular, this can be used to pause images for short durations,
either because a fix is imminent, or to allow the system to remain
stable while a configuration change goes through the CI/CD workflow.

Change-Id: I21a573dfc337c51f319afe3695d5446b2c91d70b
2020-08-20 15:48:03 -07:00
Simon Westphahl 638c6dbb1f Ignore unparsable/empty image upload ZNode data
Multiple builders running cleanup of uploads in parallel can lead to a
race condition which results in empty image upload nodes in Zookeeper.

The problem is the creation of the image upload number lock inside the
image upload. The lock creation can re-created an already deleted image
upload w/o any data.

Later iterations will fail to parse the empty znode data.

It's not 100% clear how we arrive at this race condition, but we
definitely see multiple builders in the 'guarded' section trying to
delete an upload.

The following exception shows that there is some kind of race:

    2020-06-23 11:00:45,225 ERROR nodepool.builder.CleanupWorker.0:
        Exception cleaning up build <ImageBuild {'state': 'ready',
        'state_time': 1592718469.360345, 'builder':
        'nodepool-builder-2', 'builder_id':
        '8e2c599b-fbfd-4d68-b8df-2fd2efdaa705', 'formats': 'qcow2,raw',
        'username': 'Administrator', 'python_path': 'auto', 'id':
        '0000013010', 'stat': ZnodeStat(czxid=129130944630,
        mzxid=129131204208, ctime=1592712393474, mtime=1592718469363,
        version=1, cversion=1, aversion=0, ephemeralOwner=0,
        dataLength=214, numChildren=1, pzxid=129131204230)}> of image
        example-image in provider <Provider example-provider>:
    Traceback (most recent call last):
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 499, in _cleanupImage
        self._cleanupProvider(provider, image, build.id)
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 292, in _cleanupProvider
        self._deleteUpload(upload)
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 345, in _deleteUpload
        upload.provider_name, upload.id)
      File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 1572, in deleteUpload
        self.client.delete(path, recursive=True)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1341, in delete
        return self._delete_recursive(path)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1374, in _delete_recursive
        self._delete_recursive(child_path)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1376, in _delete_recursive
        self.delete(path)
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1343, in delete
        return self.delete_async(path, version).get()
      File "/opt/nodepool/lib/python3.6/site-packages/kazoo/handlers/utils.py", line 75, in get
        raise self._exception
    kazoo.exceptions.NotEmptyError

Order of events that might have led to the above exception:
  - A locks upload
  - A deletes children of upload (including lock of A)
  - B locks upload (possible, because lock was deleted by A)
  - A tries do delete the parent node, which leads to the above exception
  - ...

Related change: https://review.opendev.org/#/c/716566/
Change-Id: I5989c9f5f4c9f1615ae7372250e2e5fdad720256
2020-08-14 07:32:50 +02:00
Simon Westphahl 2ec2661655 Remove default qcow2 format in diskimage config
When removing a label from a provider that previously required raw
images (while still keeping the diskimage config), the image was
automatically rebuilt in qcow2 format.

It seems the original intent [0] of having the diskimage formats was to
allow building diskimages without needing a provider.

Because manually triggering a diskimage build without a format lead to a
failure, the qcow2 default was added [1] and later fixed [2] to only
provide a default when the diskimage wasn't used by any provider.

By removing the qcow2 default and preventing builds without a format, we
retain the ability to allow diskimage only builds when a format is
given. Otherwise we don't assume a default image format and prevent
builds with no image format.

[0] https://review.opendev.org/#/c/412160/
[1] https://review.opendev.org/#/c/566437/
[2] https://review.opendev.org/#/c/572836/

Change-Id: I374f40b5f9cfcd55e7a4f567fd6480c940f2bc20
2020-07-15 14:31:07 +02:00
Zuul 47193a46b9 Merge "Allow disabling build-log-retention" 2020-05-12 20:59:37 +00:00
Ian Wienand e8bb64a52a Use fqdn for builder hostname info
After Ia94f20b15ab9cf2ae4969891eedccec8d5291d36 the hostname fields
are just used for informational purposes.  Use the FQDN so something
like 'nb01.opendev.org' and 'nb01.openstack.org' are differentiated in
the output of something like 'nodepool dib-image-list'.

Change-Id: I1f061958ff271f707fddbe9ef74fd2e2a228e4ca
2020-05-07 12:47:25 +10:00
Ian Wienand b9f6f6bf62 Allow disabling build-log-retention
This allows setting build-log-retention to -1 to disable automatic
collection of logs.  This would facilitate managing these logs with an
external tool like logrotate.  Another case is where you have the
builds failing very quickly -- say, one of the builds has destroyed
the container and so builds fail to even exec dib correctly.  In this
case it's difficult to get to the root-cause of the problem because
the first build's logs (the one that destroyed the container) have
been repead just seconds after the failure.

Change-Id: I259c78e6a0e30b4c0a8d2f4c12a6941a2d227c38
2020-04-29 13:07:07 +10:00
Zuul a5924345d5 Merge "Mark dib as deleting before erasing files from disk" 2020-04-15 18:26:58 +00:00
James E. Blair fe5685237c Mark dib as deleting before erasing files from disk
To avoid a situation where nodepool deletes an image build but
continues to indicate that it is "ready" in the ui, make sure that
we set the state to "deleting" before we start deleting.

Change-Id: I0f87dd93262ba46f42931d83d123244dfe85cd2f
2020-04-15 10:29:27 -07:00
James E. Blair b62fa3313d Add ZooKeeper TLS support
Change-Id: I009d9f90b32881aaef2d0694da6ff28074f48f8e
2020-04-14 16:03:53 -07:00
David Shrewsbury 719730e534 Stop comparing hostnames to determine ownership
We shouldn't need to compare hostnames at this point since that
was an artifact of maintaining compatibility with existing ZK
data before we added unique build host IDs. That data should long
since be gone everywhere.

Change-Id: Ia94f20b15ab9cf2ae4969891eedccec8d5291d36
2020-03-13 16:13:38 -04:00
Simon Westphahl 5c309f44ac Make flake8 config compatible with latest version
New ignored errors/warning:
- W504: line break after binary operator
- E741: do not use variables named ‘l’, ‘O’, or ‘I’

Fixed issues:
- E117: over-indented

This was tested locally with flake8==3.7.9 with Python 3.8.

The currently used version of flake8 (2.6.2) is not compatible with
Python 3.8. Unfortunately the dependency is coming in via
diskimage-builder>=2.0 -> hacking<1.2.0,>=1.1.0 -> flake8<2.7.0,>=2.6.0

Change-Id: Idf8b6033232fdce91cb37c8396dab6cc27c75968
2020-01-20 15:52:28 +01:00
Clark Boylan 4ac8bf0cd9 Delete dib images when all uploads set to deleting
We have noticed that in the real world that sometimes images refuse to
delete. Boot from volume with leaked volumes can cause this as well as
other unknown issues. When this happens we keep old dib builds around
longer than expected which eventually leads to full builder disks and
sadness.

Once we've set all provider uploads to the delete state we know we no
longer what that image at all either locally or in the cloud. We can
delete the local copies of the files as soon as we reach this state so
that we aren't waiting for the provider image uploads to go away first.

This should make the builders more reliable as they'll clean their disks
as soon as is possible.

Change-Id: I7549dca5172bc82f143463994ac8578cc98a4375
2020-01-13 13:22:16 -08:00
Fabien Boucher f57ac1881a
Remove uneeded shebang and exec bit on some files
Having python files with exec bit and shebang defined in
/usr/lib/python-*/site-package/ is not fine in a RPM package.

Instead of carrying a patch in nodepool RPM packaging better
to fix this directly upstream.

Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
2019-12-13 19:30:03 +01:00
Zuul ab1dc76dc7 Merge "Support optional post upload hooks" 2019-12-03 17:51:52 +00:00
Tobias Henkel 0dc40d33e4
Support optional post upload hooks
There are several scenarios where it can be useful hook into nodepool
after an image got uploaded but before it is taken into use by the
launchers. One use case is to be able to run validations on the image
(e.g. image size, boot test, etc.) before nodepool tries to use that
image and causing potentially node_failures. Another more advanced use
case is to be able to pre-distribute an image to all compute nodes in
a cloud before an image is used at scale.

To facilitate these use cases this adds a new config option
post-upload-hook to the provider config. This takes a path to a user
defined executable script which then can perform various tasks. If the
process fails with an rc != 0 the image gets deleted again and the
upload fails.

Change-Id: I099cf1243b1bd262b8ee96ab323dbd34c7578c10
2019-11-25 13:37:28 +01:00