Commit Graph

173 Commits

Author SHA1 Message Date
Zuul b1a40f1fd3 Merge "Add delete-after-upload option" 2024-03-18 21:06:46 +00:00
James E. Blair fd454706ca Add delete-after-upload option
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.

A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.

Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
2024-03-09 06:51:56 -08:00
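
A minimal sketch of the keep-formats idea described in the commit above. The function and parameter names are hypothetical and are not taken from Nodepool; it only shows how a list of formats to retain could be subtracted from the formats that were built.

```python
def formats_to_delete(built_formats, keep_formats=()):
    """Return the built formats that may be removed after upload.

    Both arguments are plain lists of format names (an assumption made
    for this sketch), e.g. ["raw", "qcow2"].
    """
    keep = set(keep_formats)
    # Preserve the original order so the result is predictable.
    return [fmt for fmt in built_formats if fmt not in keep]


# Delete the large raw file but retain the qcow2 copy.
print(formats_to_delete(["raw", "qcow2"], keep_formats=["qcow2"]))
```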
Zuul f4941d4f03 Merge "Add some builder operational stats" 2024-03-07 20:36:14 +00:00
Zuul 392cf017c3 Merge "Add support for AWS IMDSv2" 2024-02-28 02:46:53 +00:00
Zuul 8775e54e5d Merge "Remove hostname-format option" 2024-02-28 02:46:52 +00:00
Zuul fe70068909 Merge "Add host-key-checking to metastatic driver" 2024-02-27 19:03:48 +00:00
James E. Blair 8259170516 Change the AWS default image volume-type from gp2 to gp3
gp3 is better in almost every way (cheaper, faster, more configurable).
It seems difficult to find a situation where gp2 would be a better
choice, so update the default when creating images to use gp3.

There are two locations where we can specify volume-type: image creation
(where the volume type becomes the default type for the image) and
instance creation (where we can override what the image specifies).
This change updates only the first (image creation), but not the second,
which has no default (which means to use whatever the image specified).

https://aws.amazon.com/ebs/general-purpose/

Change-Id: Ibfc5dfd3958e5b7dbd73c26584d6a5b8d3a1b4eb
2024-02-20 13:04:26 -08:00
James E. Blair 646b7f4927 Add some builder operational stats
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.

Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
2024-02-19 15:47:17 -08:00
James E. Blair 21f1b88b75 Add host-key-checking to metastatic driver
If a long-running backing node used by the metastatic driver develops
problems, performing a host-key-check each time we allocate a new
metastatic node may detect these problems.  If that happens, mark
the backing node as failed so that no more nodes are allocated to
it and it is eventually removed.

Change-Id: Ib1763cf8c6e694a4957cb158b3b6afa53d20e606
2024-02-13 14:12:52 -08:00
James E. Blair e097731339 Remove hostname-format option
This option has not been used since at least the migration to the
statemachine framework.

Change-Id: I7a0e928889f72606fcbba0c94c2d49fbb3ffe55f
2024-02-08 09:40:41 -08:00
James E. Blair c78fe769f2 Allow custom k8s pod specs
This change adds the ability to use the k8s (and friends) drivers
to create pods with custom specs.  This will allow nodepool admins
to define labels that create pods with options not otherwise supported
by Nodepool, as well as pods with multiple containers.

This can be used to implement the versatile sidecar pattern, which,
in a system where it is difficult to background a system process (such
as a database server or container runtime) is useful to run jobs with
such requirements.

It is still the case that a single resource is returned to Zuul, so
a single pod will be added to the inventory.  Therefore, the expectation
that it should be possible to shell into the first container in the
pod is documented.

Change-Id: I4a24a953a61239a8a52c9e7a2b68a7ec779f7a3d
2024-01-30 15:59:34 -08:00
James E. Blair 3f4fb008b0 Add support for AWS IMDSv2
This is an authenticated http metadata service which is typically
available by default, but a more secure setup is to enforce its
usage.

This change adds the ability to do that for both instances and
AMIs.

Change-Id: Ia8554ff0baec260289da0574b92932b37ffe5f04
2024-01-24 15:11:35 -08:00
James E. Blair cb8366d70c Use backing node attributes as metastatic default
To support the use case where one has multiple pools providing
metastatic backing nodes, and those pools are in different regions,
and a user wishes to use Zuul executor zones to communicate with
whatever metastatic nodes are eventually produced from those regions,
this change updates the launcher and metastatic driver to use
the node attributes (where zuul executor region names are specified)
as default values for metastatic node attributes.  This lets users
configure nodepool with zuul executor zones only on the backing pools.

Change-Id: Ie6bdad190f8f0d61dab0fec37642d7a078ab52b3
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:24 -08:00
Zuul 2813a7df1f Merge "Kubernetes/OpenShift drivers: allow setting dynamic k8s labels" 2023-09-25 07:43:11 +00:00
Benjamin Schanzel 4660bb9aa7 Kubernetes/OpenShift drivers: allow setting dynamic k8s labels
Just like for the OpenStack/AWS/Azure drivers, allow configuring
dynamic metadata (labels) for kubernetes resources with information
about the corresponding node request.

Change-Id: I5d174edc6b7a49c2ab579a9a0b1b560389d6de82
2023-09-11 10:49:27 +02:00
James E. Blair 3b434098c6 Add an image upload timeout to the openstack driver
Some uploads in opendev are taking hours.

We used to wait 6 hours for this, but we ended up using the SDK
default of 1 hour in recent versions.  Since we're seeing so much
disparity in time, make it user configurable.

Remove the unused 6 hour constant.

Change-Id: I9ca5fdbf7c66f176eb4f650fd287514708f46c16
2023-09-06 08:04:51 -07:00
Zuul 785f7dcbc9 Merge "AWS: Add support for retrying image imports" 2023-08-28 18:43:56 +00:00
Zuul 7c9e1bc0d8 Merge "Add AWS volume quota support" 2023-08-23 00:22:47 +00:00
Zuul 909973ff06 Merge "Update Azure API and add volume-size" 2023-08-16 23:28:14 +00:00
Zuul 0c9099a20d Merge "Add Azure gallery image support" 2023-08-16 23:28:12 +00:00
James E. Blair 98994f791d Only support Python 3.11
To match Zuul, update the pypi classifiers and testing to indicate
that only Python 3.11 is tested and supported.

Change-Id: Id7d422aaae94961a7ee746e7c69308f04d94954d
Depends-On: https://review.opendev.org/891339
2023-08-14 17:54:07 +00:00
James E. Blair c2d9c45655 AWS: Add support for retrying image imports
AWS has limits on the number of image import tasks that can run
simultaneously.  In a busy system with large images, it would be
better to wait until those limits clear rather than delete the
uploaded s3 object and start over, uploading it again.  To support
this, we now detect that condition and optionally retry for a
specified amount of time.

The default remains to bail on the first error.

Change-Id: I6aa7f79b2f73c4aa6743f11221907a731a82be34
2023-08-12 11:45:22 -07:00
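
The retry behavior described in the commit above might look roughly like the following sketch. It is not the driver's code: `ImportLimitExceeded`, `import_with_retry`, and the `timeout`/`interval` parameters are hypothetical names chosen for illustration. A timeout of 0 models the default of bailing on the first error.

```python
import time


class ImportLimitExceeded(Exception):
    """Hypothetical stand-in for the provider's task-limit error."""


def import_with_retry(start_import, timeout=0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call start_import(), retrying limit errors for up to `timeout` seconds.

    With timeout=0 (the default here) the first limit error is re-raised.
    """
    deadline = clock() + timeout
    while True:
        try:
            return start_import()
        except ImportLimitExceeded:
            # Give up once the retry window has elapsed.
            if clock() >= deadline:
                raise
            sleep(interval)
```

The point of retrying rather than failing is that the already uploaded object can be reused instead of being deleted and uploaded again.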
James E. Blair 3815cce7aa Change image ID from int sequence to UUID
When we export and import image data (for backup/restore purposes),
we need to reset the ZK sequence counter for image builds in order
to avoid collisions.  The only way we can do that is to create and
then delete a large number of znodes.  Some sites (including
OpenDev) have sequence numbers that are in the hundreds of thousands.

To avoid this time-consuming operation (which is only intended to
be run to restore from backup -- when operators are already under
additional stress!), this change switches the build IDs from integer
sequences to UUIDs.  This avoids the problem with collisions after
import (at least, to the degree that UUIDs avoid collisions).

The actual change is fairly simple, but many unit tests need to be
updated.

Since the change is user-visible in the command output (image lists,
etc), a release note is added.

A related change which updates all of the textual references of
build "number" to build "id" follows this one for clarity and ease
of review.

Change-Id: Ie7c68b094bc9733914a808756eeee8b62f696713
2023-08-02 11:18:15 -07:00
James E. Blair acb6772c3a Add AWS volume quota support
Like the OpenStack driver, this automatically applies volume quota
limits if specified in the label configuration.

Change-Id: I71c1b95de08dc72cc777099952892de659d45d41
2023-07-17 15:17:50 -07:00
mbecker 3fa6821437 Add gpu support for k8s/openshift pods
This adds the option to request GPUs for kubernetes and openshift pods.

Since the resource name depends on the GPU vendor and the cluster
installation, this option is left for the user to define in the
node pool.
To leverage the ability of some schedulers to use fractional GPUs,
the actual GPU value is read as a string.

For GPUs, requests and limits cannot be decoupled (cf.
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/),
so the same value will be used for requests and limits.

Change-Id: Ibe33b06c374a431f164080edb34c3a501c360df7
2023-07-11 07:10:30 -07:00
James E. Blair 77d9512764 Update Azure API and add volume-size
The addition of the volume-size attribute necessitates an upgrade
to the Azure API.  The newer version of the API allows us to
remove the individual handling we have for NICs, PIPs, and Disks.
That simplifies the driver greatly, but comes with some caveats
that are noted in the docs and release notes.

Finally, the volume-size attribute is added as well.

Change-Id: I6e335318cedbf0ac8944107aff9d1a2cfcab271a
2023-06-21 18:22:47 -07:00
James E. Blair 4279b4766d Add Azure gallery image support
The shared and community gallery images are another way to specify
what image to use when creating a VM.  Shared galleries are intended
for use within an organization, and community galleries are public.

This adds support for using these images.  It requires an API version
bump since the virtual machine attributes to specify them are new.

Change-Id: Ia981fcbeea6680a9d14ee8e4ec401bf227a7cc12
2023-06-21 18:21:40 -07:00
Zuul 7150db6005 Merge "Add ZK 3.6.0 release note" 2023-05-03 00:43:15 +00:00
Zuul 63900e9bc4 Merge "Report leaked resource metrics in statemachine driver" 2023-05-02 23:29:19 +00:00
James E. Blair ad52c68321 Add ZK 3.6.0 release note
Because we are using persistent recursive watches, we require
at least version 3.6.0.  Highlight that in case someone isn't
keeping up with releases.

Change-Id: I2cc3a435989cdd798d209bf8264fc8f399889fea
2023-05-02 15:24:49 -07:00
James E. Blair d4f2c8b9e7 Report leaked resource metrics in statemachine driver
The OpenStack driver reports some leaked metrics.  Extend that in
a generic way to all statemachine drivers.  Doing so also adds
some more metrics to the OpenStack driver.

Change-Id: I97c01b54b576f922b201b28b117d34b5ee1a597d
2023-04-26 06:40:12 -07:00
Christian Mueller 36dbff84ba Amazon EC2 Spot support
This adds support for launching Amazon EC2 Spot instances
(https://aws.amazon.com/ec2/spot/), which comes with huge cost saving
opportunities.

Amazon EC2 Spot instances are spare Amazon EC2 capacity that you can get
at a discount of up to 90% compared to on-demand pricing.
In contrast to on-demand instances, Spot instances can be reclaimed with a
2 minute notification in advance
(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html).

When :attr:`providers.[aws].pools.labels.use-spot` is set to True, the AWS
driver will launch Spot instances. If an instance gets interrupted, it will be
terminated and no replacement instance will be launched.

Change-Id: I9868d014991d78e7b2421439403ae1371b33524c
2023-04-16 21:12:06 +02:00
Zuul 0e7de19664 Merge "Add support for specifying pod resource limits" 2023-03-13 08:38:55 +00:00
Zuul 7ffb3f1fd6 Merge "Add scheduler, volumes, and labels to k8s/openshift" 2023-03-13 08:32:16 +00:00
Zuul 13007b6825 Merge "Add OpenStack volume quota" 2023-02-24 20:12:41 +00:00
Zuul aaecb9659e Merge "Add import_image support to AWS" 2023-02-24 16:05:08 +00:00
James E. Blair de02ac5a20 Add OpenStack volume quota
This adds support for staying within OpenStack volume quota limits
on instances that utilize boot-from-volume.

Change-Id: I1b7bc177581d23cecd9443a392fb058176409c46
2023-02-13 06:56:03 -08:00
James E. Blair 669552f6f9 Add support for specifying pod resource limits
We currently allow users to specify pod resource requests and limits
for cpu, ram, and ephemeral storage.  But if a user specifies one of
these, the value is used for both the request and the limit.

This updates the specification to allow the use of separate request
and limit values.

It also normalizes related behavior across all 3 pod drivers,
including adding resource reporting to the openshift drivers.

Change-Id: I49f918b01f83d6fd0fd07f61c3e9a975aa8e59fb
2023-02-12 07:14:30 -08:00
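
As an illustration of separate request and limit values, the sketch below builds a Kubernetes-style `resources` mapping. The helper name and its arguments are hypothetical. The fallback of a missing limit to the matching request is an assumption modeled on the earlier behavior described above, not a statement of what the drivers do.

```python
def container_resources(requests=None, limits=None):
    """Build a Kubernetes-style resources dict from two plain mappings.

    Assumption for this sketch: a limit that is not given falls back to
    the matching request, mirroring the old single-value behavior.
    """
    requests = dict(requests or {})
    merged_limits = dict(requests)
    merged_limits.update(limits or {})
    resources = {}
    if requests:
        resources["requests"] = requests
    if merged_limits:
        resources["limits"] = merged_limits
    return resources


# Request 1 CPU but allow bursting to 2; memory uses the same value for both.
print(container_resources(
    requests={"cpu": "1", "memory": "1024Mi"},
    limits={"cpu": "2"},
))
```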
James E. Blair 9bf44b4a4c Add scheduler, volumes, and labels to k8s/openshift
This adds support for specifying the scheduler name, volumes (and
volume mounts), and additional metadata labels to the Kubernetes
and OpenShift (and OpenShift pods) drivers.

This also extends the k8s and openshift test frameworks so that we
can exercise the new code paths (as well as some previous similar
settings).  Tests and assertions for both a minimal (mostly defaults)
configuration as well as a configuration that uses all the optional
settings are added.

Change-Id: I648e88a518c311b53c8ee26013a324a5013f3be3
2023-02-11 12:03:45 -08:00
James E. Blair fdc093a8de Add import_image support to AWS
In I9478c0050777bf35e1201395bd34b9d01b8d5795 we switched from using the
import_image method to import_snapshot in the AWS driver.  This method
is faster and more like other drivers in Nodepool.  However, some operating
systems (such as Windows, RHEL or SLES) require licensing metadata
associated with an AMI which is not available to be set when we register
an AMI from a snapshot.  For these systems, the only viable way to upload
images is with the import_image method.

This change restores the previous method as an option, but keeps the
"snapshot" method as the default.

Change-Id: I81daabebbc9dbe968d8aaf65e6b70f5cdfdd01bf
2023-01-30 20:25:56 -08:00
James E. Blair aa8580ce32 Add support for privileged containers
To allow users to run docker-in-docker style workloads on k8s
and openshift clusters, add support for adding the privileged
flag to containers created in k8s and openshift pods.

Change-Id: I349d61bf200d7fb6d1effe112f7505815b06e9a8
2023-01-25 11:09:25 -08:00
Ian Wienand 46f44b8669 driver/openstack: order flavor results
I hit a situation in a new cloud where I had defined two flavors with
the same amount of RAM "opendev-control" and "opendev".  The config
had min-ram set [1] with a flavor-name of "opendev" -- I expected it
to match the exact name first, but nodepool was choosing
"opendev-control".

I guess the default order returned by the cloud is flavorid [2].  I'd
propose that sorting on a tuple of (ram, name) -- so that we match
names in alphabetical order -- is a more intuitive way to run the
match.

Documentation is updated, and a release note added.

[1] which actually we didn't want, really, because we wanted to
    exactly match the flavor:
     https://review.opendev.org/c/openstack/project-config/+/870677
[2] https://docs.openstack.org/api-ref/compute/?expanded=list-flavors-detail#list-flavors

Change-Id: I268dd598ca9f1b617c5062b41ad27d0305df60b9
2023-01-17 11:57:40 +11:00
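
The (ram, name) ordering proposed in the commit above can be sketched as follows. This is an illustration only: flavors are modeled as plain dicts, and `find_flavor` with its substring match on the name is a hypothetical helper rather than the driver's implementation.

```python
def find_flavor(flavors, min_ram, flavor_name=None):
    """Return the first flavor, ordered by (ram, name), that satisfies
    min_ram and, if given, contains flavor_name in its name."""
    for flavor in sorted(flavors, key=lambda f: (f["ram"], f["name"])):
        if flavor["ram"] < min_ram:
            continue
        if flavor_name and flavor_name not in flavor["name"]:
            continue
        return flavor
    return None


# Two flavors with the same RAM, listed in the order the cloud might return.
flavors = [
    {"name": "opendev-control", "ram": 8192},
    {"name": "opendev", "ram": 8192},
]
print(find_flavor(flavors, min_ram=8192, flavor_name="opendev")["name"])
```

Because "opendev" sorts before "opendev-control", the shorter exact name is tried first when RAM is equal.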
James E. Blair fcabfcac6a Add release notes for 8.0.1
There are a few changes that were worth noting, but notes were missed.

Change-Id: I8dcb633b9dfcdcaf0fe7ac75a91d56fa0d98d509
2023-01-10 13:48:26 -08:00
Zuul d130c0bee3 Merge "Aws: add support for volume iops and throughput" 2022-10-26 21:08:47 +00:00
Zuul 67c4ee1b29 Merge "Add config option to limit ephemeral storage on K8s Pod labels" 2022-10-24 06:48:55 +00:00
James E. Blair 4ea824cfa9 Aws: add support for volume iops and throughput
Users can request specific IOPS and throughput allocations from EC2.
The availability and defaults vary for volume type, but IOPS are
available for all volumes, and throughput is available on gp3 volumes.

Change-Id: Icc7432d8ce1c3514bfe9d8fda20bd399b67ede7a
2022-10-14 07:08:30 -07:00
Benjamin Schanzel 6c9c219eb0 Add config option to limit ephemeral storage on K8s Pod labels
This adds config options for limiting the amount of ephemeral storage
allocatable by a container of a K8s Pod-type label.
This optional config translates to K8s settings

* spec.containers[].resources.limits.ephemeral-storage
* spec.containers[].resources.requests.ephemeral-storage

This is to provide a mechanism that prevents Pods from filling up their
host's storage and thereby interfering with or breaking other workloads
on the same host (esp. on shared clusters).

Like for cpu and memory limits, a pool-scoped default can also be
specified.

Change-Id: I23e90ae53cc2b2eb0e51cc9e3dc5802c86cc0ac9
2022-10-13 13:56:43 +02:00
James E. Blair 6320b06950 Add support for dynamic tags
This allows users to create tags (or properties in the case of OpenStack)
on instances using string interpolation values.  The use case is to be
able to add information about the tenant* which requested the instance
to cloud-provider tags.

* Note that ultimately Nodepool may not end up using a given node for
the request which originally prompted its creation, so care should be
taken when using information like this.  The documentation notes that.

This feature uses a new configuration attribute on the provider-label
rather than the existing "tags" or "instance-properties" because existing
values may not be safe for use as Python format strings (e.g., an
existing value might be a JSON blob).  This could be solved with YAML
tags (like !unsafe) but the most sensible default for that would be to
assume format strings and use a YAML tag to disable formatting, which
doesn't help with our backwards-compatibility problem.  Additionally,
Nodepool configuration does not use YAML anchors (yet), so this would
be a significant change that might affect people's use of external tools
on the config file.

Testing this was beyond the ability of the AWS test framework as written,
so some redesign for how we handle patching boto-related methods is
included.  The new approach is simpler, more readable, and flexible
in that it can better accommodate future changes.

Change-Id: I5f1befa6e2f2625431523d8d94685f79426b6ae5
2022-08-23 11:06:55 -07:00
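
To illustrate why the commit above uses a separate attribute for format strings, here is a small sketch. The helper name, the tag key, and the `request.tenant_name` field are assumptions for illustration. The sketch shows an interpolated tag, and then shows that an existing JSON-like value is not safe to pass through Python's `str.format`.

```python
from types import SimpleNamespace


def render_dynamic_tags(dynamic_tags, request):
    """Interpolate each tag value as a Python format string."""
    return {key: value.format(request=request)
            for key, value in dynamic_tags.items()}


# Hypothetical request object carrying only the field used below.
request = SimpleNamespace(tenant_name="example-tenant")

print(render_dynamic_tags({"zuul-tenant": "{request.tenant_name}"}, request))

# A pre-existing static tag holding a JSON blob contains braces, so
# treating it as a format string fails instead of passing through.
try:
    render_dynamic_tags({"info": '{"team": "infra"}'}, request)
except (KeyError, ValueError, IndexError) as exc:
    print("not a safe format string:", type(exc).__name__)
```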
James E. Blair 916d62a374 Allow specifying diskimage metadata/tags
For drivers that support tagging/metadata (openstack, aws, azure),
add or enhance support for supplying tags for uploaded diskimages.

This allows users to set metadata on the global diskimage object
which will then be used as default values for metadata on the
provider diskimage values.  The resulting merged dictionary forms
the basis of metadata to be associated with the uploaded image.

The changes needed to reconcile this for the three drivers mentioned
above are:

All: the diskimages[].meta key is added to supply the default values
for provider metadata.

OpenStack: provider diskimage metadata is already supported using
providers[].diskimages[].meta, so no further changes are needed.

AWS, Azure: provider diskimage tags are added using the key
providers[].diskimages[].tags since these providers already use
the "tags" nomenclature for instances.

This results in the somewhat incongruous situation where we have
diskimage "metadata" being combined with provider "tags", but it's
either that or have images with "metadata" while we have instances
with "tags", both of which are "tags" in EC2.  The chosen approach
has consistency within the driver.

Change-Id: I30aadadf022af3aa97772011cda8dbae0113a3d8
2022-08-23 06:39:08 -07:00
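
The merge described in the commit above, where global diskimage metadata supplies defaults for the provider diskimage values, can be sketched as a plain dictionary merge. The function name and sample keys are hypothetical.

```python
def merged_image_metadata(diskimage_meta, provider_meta):
    """Combine global diskimage metadata with provider-level values.

    Global values act as defaults; provider values win on conflict.
    """
    merged = dict(diskimage_meta or {})
    merged.update(provider_meta or {})
    return merged


print(merged_image_metadata(
    {"owner": "ci", "os": "linux"},   # global diskimage "meta"
    {"owner": "team-a"},              # provider "meta" or "tags"
))
```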
Zuul 123a32f922 Merge "AWS multi quota support" 2022-07-29 17:01:09 +00:00