nodepool

Commit Graph

Author	SHA1	Message	Date
James E. Blair	6a71686b6b	Fix deletion of intermediate image builds The delete-after-upload option would sometimes delete before upload if there are intermediate build files which match its criteria. To ensure that we don't delete any build files until the build is at least complete, only run that method if the ZK build state is marked as ready. Change-Id: Ia263b478b0a2b9d77833bc19d0967a65dcbf5b27	2024-04-02 15:26:41 -07:00
Zuul	b1a40f1fd3	Merge "Add delete-after-upload option"	2024-03-18 21:06:46 +00:00
Benjamin Schanzel	1199ac77e5	Fix nodepool builder stats test In the metric name, we use the builders fqdn as a key, but in the test we used the hostname. So this test fails on systems where that's not the same. Change-Id: If286f19371d1fd70dc9bee4b7af814d13396357b	2024-03-18 15:33:03 +01:00
James E. Blair	fd454706ca	Add delete-after-upload option This allows operators to delete large diskimage files after uploads are complete, in order to save space. A setting is also provided to keep certain formats, so that if operators would like to delete large formats such as "raw" while retaining a qcow2 copy (which, in an emergency, could be used to inspect the image, or manually converted and uploaded for use), that is possible. Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf	2024-03-09 06:51:56 -08:00
James E. Blair	646b7f4927	Add some builder operational stats This adds some stats keys that may be useful when monitoring the operation of individual nodepool builders. Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3	2024-02-19 15:47:17 -08:00
James E. Blair	3815cce7aa	Change image ID from int sequence to UUID When we export and import image data (for backup/restore purposes), we need to reset the ZK sequence counter for image builds in order to avoid collisions. The only way we can do that is to create and then delete a large number of znodes. Some sites (including OpenDev) have sequence numbers that are in the hundreds of thousands. To avoid this time-consuming operation (which is only intended to be run to restore from backup -- when operators are already under additional stress!), this change switches the build IDs from integer sequences to UUIDs. This avoids the problem with collisions after import (at least, to the degree that UUIDs avoid collisions). The actual change is fairly simple, but many unit tests need to be updated. Since the change is user-visible in the command output (image lists, etc), a release note is added. A related change which updates all of the textual references of build "number" to build "id" follows this one for clarity and ease of review. Change-Id: Ie7c68b094bc9733914a808756eeee8b62f696713	2023-08-02 11:18:15 -07:00
James E. Blair	b0a40f0b47	Use image cache when launching nodes We consult ZooKeeper to determine the most recent image upload when we decide whether we should accept or decline a request. If we accept the request, we also consult it again for the same information when we start building the node. In both cases, we can use the cache to avoid what may potentially be (especially in the case of a large number of images or uploads) quite a lot of ZK requests. Our cache should be almost up to date (typically milliseconds, or at the worst, seconds behind), and the worst case is equivalent to what would happen if an image build took just a few seconds longer. The tradeoff is worth it. Similarly, when we create min-ready requests, we can also consult the cache. With those 3 changes, all references to getMostRecentImageUpload in Nodepool use the cache. The original un-cached method is kept as well, because there are an enormous number of references to it in the unit tests and they don't have caching enabled. In order to reduce the chances of races in many tests, the startup sequence is normalized to: 1) start the builder 2) wait for an image to be available 3) start the launcher 4) check that the image cache in the launcher matches what is actually in ZK This sequence (apart from #4) was already used by a minority of tests (mostly newer tests). Older tests have been updated. A helper method, startPool, implements #4 and additionally includes the wait_for_config method which was used by a random assortment of tests. Change-Id: Iac1ff8adfbdb8eb9a286929a59cf07cd0b4ac7ad	2023-04-10 15:57:01 -07:00
James E. Blair	be3edd3e17	Convert openstack driver to statemachine This updates the OpenStack driver to use the statemachine framework. The goal is to revise all remaining drivers to use the statemachine framework for two reasons: 1) We can dramatically reduce the number of threads in Nodepool which is our biggest scaling bottleneck. The OpenStack driver already includes some work in that direction, but in a way that is unique to it and not easily shared by other drivers. The statemachine framework is an extension of that idea implemented so that every driver can use it. This change further reduces the number of threads needed even for the openstack driver. 2) By unifying all the drivers with a simple interface, we can prepare to move them into Zuul. There are a few updates to the statemachine framework to accomodate some features that only the OpenStack driver used to date. A number of tests need slight alteration since the openstack driver is the basis of the "fake" driver used for tests. Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2	2023-01-10 10:30:14 -08:00
James E. Blair	6a56940275	Fix race with two builders deleting images In a situation with multiple builders, each configured with different providers, it is possible for one builder to delete the ZK ImageBuild record for a build from another builder between the time that the build is completed but before the first upload starts. This is because every builder looks for images to delete from ZK. It keeps the 2 most recent ready images (this should normally cover the time period between a build and upload), unless the image is not configured for any provider this builder knows about. This is where the disjoint providers come into play -- builder1 in our scenario is not expected to have a configuration for provider2. To correct this, we adjust this check so that the only time we bypass the 2-most-recent-ready-images check is if the diskimage is not configured at all. That means that we still expect all builders to have a "diskimage" entry for every image, but we don't need those to be configured for any providers which this builder is not expected to handle. Change-Id: Ic2fefda293fa0bcbc98ee7313198b37df0576299	2022-07-25 13:06:25 -07:00
James E. Blair	10df93540f	Use Zuul-style ZooKeeper connections We have made many improvements to connection handling in Zuul. Bring those back to Nodepool by copying over the zuul/zk directory which has our base ZK connection classes. This will enable us to bring other Zuul classes over, such as the component registry. The existing connection-related code is removed and the remaining model-style code is moved to nodepool.zk.zookeeper. Almost every file imported the model as nodepool.zk, so import adjustments are made to compensate while keeping the code more or less as-is. Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3	2022-05-23 07:40:20 -07:00
Ian Wienand	0f257982f4	builder: Remove optional extension from DibImageFile A DibImageFile represents one dib image-file on disk, so the extension is required (see prior change I214581ad80b7740e7ca749b574672d2c33b92474 where we modified callers who were using this interface). This fixes a bug by removing code; the pathlib with_suffix replacement is not safe for image names with a period in them; consider >>> pathlib.Path('image-v1.2-foo').with_suffix('.vhd') PosixPath('image-v1.vhd') We can now simply unconditionally append the extension in DibImageFile.to_path(). Change-Id: I1bc812ddffacbcc414b8f7f372d9fca78bd87292	2021-12-22 15:13:39 +11:00
Ian Wienand	e0a16431d3	builder: remove from_path interface The static method from_path is only used from one place and is simply joining a path; we can inline this and remove it for clarity. Change-Id: Iade6e024516bf9ce212491d6461e00affb5971a0	2021-12-21 16:41:26 +11:00
Ian Wienand	da8780526a	builder: remove unused from_images_dir interface Use of this function was removed in change I210bb6ba1580403ddc3e5d87eb9bf60ffc7433d8 Change-Id: Iddf7a7b5bab75c98f01d960401ec093b5636f415	2021-12-21 14:52:58 +11:00
Ian Wienand	45704178e9	rename imagesdir/elementdir with underscore This makes absolutely no functional difference, but in a follow-on I wish to add images_dir_required_free and the mix of underscores with these values is inconsistent. Change-Id: I2566afbd5c78c39bfa8f2fce32e2374a589b45aa	2020-11-26 15:15:37 +11:00
Clark Boylan	412b78701c	Followup test improvement to use iterate_timeout This was suggested in review. Pushing this so we don't forget. Change-Id: If899965db23d821dfc45d77629a86e3c854be172	2020-09-17 10:34:38 -07:00
Clark Boylan	3f22d3f927	Load diskimage configs before building them Currently we load our diskimage configs then iterate over all of them rebuilding them if necessary. The problem is that since disk image builds take so long we don't pick up pause configurations until we finish attempting to build all images. Fix this by checking if our config is current before building images. If not, return and force a rerun through the rebuilding process which will load new configs. This also fixes a test race that was noted in comments but a non issue until this change. This change causes us to trip over the test race frequently. Change-Id: If0b7321ee2c804a4ae8123f2369a660780272b52	2020-08-21 10:37:34 -07:00
Simon Westphahl	2ec2661655	Remove default qcow2 format in diskimage config When removing a label from a provider that previously required raw images (while still keeping the diskimage config), the image was automatically rebuilt in qcow2 format. It seems the original intent [0] of having the diskimage formats was to allow building diskimages without needing a provider. Because manually triggering a diskimage build without a format lead to a failure, the qcow2 default was added [1] and later fixed [2] to only provide a default when the diskimage wasn't used by any provider. By removing the qcow2 default and preventing builds without a format, we retain the ability to allow diskimage only builds when a format is given. Otherwise we don't assume a default image format and prevent builds with no image format. [0] https://review.opendev.org/#/c/412160/ [1] https://review.opendev.org/#/c/566437/ [2] https://review.opendev.org/#/c/572836/ Change-Id: I374f40b5f9cfcd55e7a4f567fd6480c940f2bc20	2020-07-15 14:31:07 +02:00
Ian Wienand	b5b20b6e2c	Add parent and abstract flags for diskimages While YAML does have inbuilt support for anchors to greatly reduce duplicated sections, anchors have no support for merging values. For diskimages, this can result in a lot of duplicated values for each image which you can not otherwise avoid. This provides two new values for diskimages; a "parent" and "abstract". Specifying a parent means you inherit all the configuration values from that image. Anything specified within the child image overwrites the parent values as you would expect; caveats, as described in the documentation, are that the elements field appends and the env-vars field has update() semantics. An "abstract" diskimage is not instantiated into a real image, it is only used for configuration inheritance. This way you can make a abstrat "base" image with common values and inherit that everywhere without having to worry about bringing in values you don't want. You can also chain parents together and the inheritance flows through. Documentation is updated, and several tests are added to ensure the correct parenting, merging and override behaviour of the new values. Change-Id: I170016ef7d8443b9830912b9b0667370e6afcde7	2020-03-20 07:53:08 +11:00
Clark Boylan	4ac8bf0cd9	Delete dib images when all uploads set to deleting We have noticed that in the real world that sometimes images refuse to delete. Boot from volume with leaked volumes can cause this as well as other unknown issues. When this happens we keep old dib builds around longer than expected which eventually leads to full builder disks and sadness. Once we've set all provider uploads to the delete state we know we no longer what that image at all either locally or in the cloud. We can delete the local copies of the files as soon as we reach this state so that we aren't waiting for the provider image uploads to go away first. This should make the builders more reliable as they'll clean their disks as soon as is possible. Change-Id: I7549dca5172bc82f143463994ac8578cc98a4375	2020-01-13 13:22:16 -08:00
Tobias Henkel	0dc40d33e4	Support optional post upload hooks There are several scenarios where it can be useful hook into nodepool after an image got uploaded but before it is taken into use by the launchers. One use case is to be able to run validations on the image (e.g. image size, boot test, etc.) before nodepool tries to use that image and causing potentially node_failures. Another more advanced use case is to be able to pre-distribute an image to all compute nodes in a cloud before an image is used at scale. To facilitate these use cases this adds a new config option post-upload-hook to the provider config. This takes a path to a user defined executable script which then can perform various tasks. If the process fails with an rc != 0 the image gets deleted again and the upload fails. Change-Id: I099cf1243b1bd262b8ee96ab323dbd34c7578c10	2019-11-25 13:37:28 +01:00
David Shrewsbury	91dc050c3c	Do not overwrite image upload ZK data on delete This code is setting the ImageUpload state field to DELETING and updating the data in ZooKeeper. However, it was using a new ImageUpload object to do so, and we lose all other data, like the external_id, external_name, etc. If the delete should fail in the provider on the first attempt, the next attempt to delete will not have these attributes, some of which (external_name) are needed to actually delete the image. This means we leak the image in the provider because the 2nd attempt will not try to delete in the provider and will remove the ZooKeeper record. During review of this fix, I realized that we were not locking specific image uploads before modifying them or deleting them. Although this likely would not cause any problems, it may have caused the odd error in competing builder processes. This adds a locking mechanism for that data, and tests to validate the lock. Change-Id: I3fbf1c1a3ca10cc397facd4e939cbf4cbb2d8305	2019-09-24 10:43:25 -04:00
David Shrewsbury	d0bb3d004d	Use py3 pathlib in DibImageFile The pathlib library was introduced in python 3.4 and is a more concise way to deal with paths versus the overburdened os lib. Change-Id: I468c4aa0a41f07c0b0009c2c387712b001a03e5d	2019-05-21 12:21:31 -04:00
Zuul	ba8dd4a354	Merge "Update dib stats"	2019-04-04 18:14:55 +00:00
David Shrewsbury	fa2d4bd17c	Fix for image build leaks If, during a long DIB image build, we lose the ZooKeeper session, it's likely that the CleanupWorker thread could have run and removed the ZK record for the build (its state would be BUILDING and unlocked, indicating something went wrong). In that scenario, when the DIB process finishes (possibly writing out DIB files), it will never get cleaned up since the ZK record would now be gone. If we fail to update the ZK record at the end of the build, just delete the leaked DIB files immediately after the build. Change-Id: I5cb58318efe51b5b0c3413b7a01f02a50215a8b6	2019-04-01 15:44:31 -04:00
Ian Wienand	6fa73eac26	Update dib stats This updates dib stats after creating a dashboard to use them. Firstly, the individual return codes and runtime for each image type are unnecessary, because they call come from the same invocation of dib. While it is definitely useful to track the size of each output image, the overall status for a build is only a single value. This moves these duplciated values to ".status.<rc\|duration>". Unfortunately, there's really no way to say "what was the time of the last non-null value" in grafana+graphite [1]. This means you can't do something useful like show a singlestat of the relative time of the last build "X hours ago" using the timer value. We can work around this by putting the timestamp of the last build in a gauge value; this monotonically increases and is easy to turn into a relative time. [1] https://github.com/grafana/grafana/issues/10550 Change-Id: Ia9518b6faecb30d45e0509bda4a9b2ab7fdc6261	2019-02-22 13:26:05 +11:00
Ian Wienand	c68dbb9636	Use a pipeline for dib stats I noticed in OpenStack production we don't seem to be getting all the stats from dib, particularly from our very remote builder. This is likely because there is some packet loss quickly blasting out small UDP packets with the stats. A pipeline bundles the stats together into the largest size packets it can (this has been a problem before; see I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43). Add some additional checks for the size stats which did not seem to be covered by existing testing. I also noticed that the documentation had an extra ".builder." in the key which isn't actually there in the stats output. Change-Id: Ib744f19385906d1e72231958d11c98f15b72d6bd	2019-02-22 11:36:27 +11:00
David Shrewsbury	890ea4975e	Revert "Revert "Add a timeout for the image build"" This reverts commit `ccf40a462a`. The previous version would not work properly when daemonized because there was no stdout. This version maintains stdout and uses select/poll with non-blocking stdout to capture the output to a log file. Depends-On: https://review.openstack.org/634266 Change-Id: I7f0617b91e071294fe6051d14475ead1d7df56b7	2019-01-31 11:36:47 -05:00
David Shrewsbury	ccf40a462a	Revert "Add a timeout for the image build" This reverts commit `7225354ec0`. The disk-image-create command does not appear to be starting. Change-Id: I81abe25a253a385cae08a57561129a678546f18f	2019-01-25 17:36:31 +00:00
David Shrewsbury	7225354ec0	Add a timeout for the image build A builder thread can wedge if the build process wedges. Add a timeout to the subprocess. Since it was the call to readline() that would block, we change the process to have DIB write directly to the log. This allows us to set a timeout in the Popen.wait() call. And we kill the dib subprocess, as well. The timeout value can be controlled in the diskimage configuration and defaults to 8 hours. Change-Id: I188e8a74dc39b55a4b50ade5c1a96832fea76a7d	2019-01-23 16:27:19 -05:00
David Shrewsbury	511ffd9c29	Add tox functional testing for drivers Reorganizes the unit tests into a new subdirectory, and adds a new functional test subdirectory (that will be further divided into subdirs for drivers). Change-Id: I027bba619deef9379d293fdd0457df6a1377aea8	2018-11-01 15:33:44 +01:00

30 Commits