439 Commits

Author SHA1 Message Date
Zuul b1a40f1fd3 Merge "Add delete-after-upload option" 2024-03-18 21:06:46 +00:00
James E. Blair fd454706ca Add delete-after-upload option
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.

A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.

Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
2024-03-09 06:51:56 -08:00
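
As an illustration of the behavior described in the commit above, the following minimal Python sketch selects which built formats could be removed after upload while retaining a configured set. The function name `formats_to_delete` and its parameters are hypothetical and are not taken from Nodepool's code.

```python
from typing import Iterable, List, Set


def formats_to_delete(built_formats: Iterable[str],
                      keep_formats: Set[str]) -> List[str]:
    """Return the image formats whose files may be deleted after upload.

    Hypothetical helper: any format not explicitly kept is eligible.
    """
    return sorted(fmt for fmt in built_formats if fmt not in keep_formats)


# Delete the large raw file but retain the qcow2 copy for emergencies.
print(formats_to_delete(["raw", "qcow2"], keep_formats={"qcow2"}))
```
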
Zuul f4941d4f03 Merge "Add some builder operational stats" 2024-03-07 20:36:14 +00:00
Zuul 392cf017c3 Merge "Add support for AWS IMDSv2" 2024-02-28 02:46:53 +00:00
Zuul 8775e54e5d Merge "Remove hostname-format option" 2024-02-28 02:46:52 +00:00
Zuul fe70068909 Merge "Add host-key-checking to metastatic driver" 2024-02-27 19:03:48 +00:00
Zuul 202188b2f5 Merge "Reconcile docs/validation for some options" 2024-02-27 18:19:33 +00:00
James E. Blair 8259170516 Change the AWS default image volume-type from gp2 to gp3
gp3 is better in almost every way (cheaper, faster, more configurable).
It seems difficult to find a situation where gp2 would be a better
choice, so update the default when creating images to use gp3.

There are two locations where we can specify volume-type: image creation
(where the volume type becomes the default type for the image) and
instance creation (where we can override what the image specifies).
This change updates only the first (image creation), but not the second,
which has no default (which means to use whatever the image specified).

https://aws.amazon.com/ebs/general-purpose/

Change-Id: Ibfc5dfd3958e5b7dbd73c26584d6a5b8d3a1b4eb
2024-02-20 13:04:26 -08:00
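
The relationship between the two volume-type settings in the commit above can be sketched as follows. This is an illustration only: `effective_volume_type` is a hypothetical helper, and the sketch assumes that an unset instance-level value simply defers to the image.

```python
from typing import Optional


def effective_volume_type(image_volume_type: str = "gp3",
                          instance_volume_type: Optional[str] = None) -> str:
    """Resolve the volume type an instance ends up with (illustrative only)."""
    # No instance-level value means "use whatever the image specified".
    if instance_volume_type is None:
        return image_volume_type
    # An explicit instance-level value overrides the image's type.
    return instance_volume_type


print(effective_volume_type())                  # image default applies
print(effective_volume_type("gp3", "io2"))      # instance-level override
```
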
James E. Blair 646b7f4927 Add some builder operational stats
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.

Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
2024-02-19 15:47:17 -08:00
James E. Blair 21f1b88b75 Add host-key-checking to metastatic driver
If a long-running backing node used by the metastatic driver develops
problems, performing a host-key-check each time we allocate a new
metastatic node may detect these problems.  If that happens, mark
the backing node as failed so that no more nodes are allocated to
it and it is eventually removed.

Change-Id: Ib1763cf8c6e694a4957cb158b3b6afa53d20e606
2024-02-13 14:12:52 -08:00
James E. Blair e097731339 Remove hostname-format option
This option has not been used since at least the migration to the
statemachine framework.

Change-Id: I7a0e928889f72606fcbba0c94c2d49fbb3ffe55f
2024-02-08 09:40:41 -08:00
James E. Blair f89b41f6ad Reconcile docs/validation for some options
Some drivers were missing docs and/or validation for options that
they actually support.  This change:

adds launch-timeout to:
  metastatic docs and validation
  aws validation
  gce docs and validation
adds post-upload-hook to:
  aws validation
adds boot-timeout to:
  metastatic docs and validation
adds launch-retries to:
  metastatic docs and validation

Change-Id: Id3f4bb687c1b2c39a1feb926a50c46b23ae9df9a
2024-02-08 09:36:35 -08:00
James E. Blair c78fe769f2 Allow custom k8s pod specs
This change adds the ability to use the k8s (and friends) drivers
to create pods with custom specs.  This will allow nodepool admins
to define labels that create pods with options not otherwise supported
by Nodepool, as well as pods with multiple containers.

This can be used to implement the versatile sidecar pattern, which,
in a system where it is difficult to background a system process (such
as a database server or container runtime) is useful to run jobs with
such requirements.

It is still the case that a single resource is returned to Zuul, so
a single pod will be added to the inventory.  Therefore, the expectation
that it should be possible to shell into the first container in the
pod is documented.

Change-Id: I4a24a953a61239a8a52c9e7a2b68a7ec779f7a3d
2024-01-30 15:59:34 -08:00
James E. Blair 3f4fb008b0 Add support for AWS IMDSv2
This is an authenticated http metadata service which is typically
available by default, but a more secure setup is to enforce its
usage.

This change adds the ability to do that for both instances and
AMIs.

Change-Id: Ia8554ff0baec260289da0574b92932b37ffe5f04
2024-01-24 15:11:35 -08:00
James E. Blair cb8366d70c Use backing node attributes as metastatic default
To support the use case where one has multiple pools providing
metastatic backing nodes, and those pools are in different regions,
and a user wishes to use Zuul executor zones to communicate with
whatever metastatic nodes are eventually produced from those regions,
this change updates the launcher and metastatic driver to use
the node attributes (where zuul executor region names are specified)
as default values for metastatic node attributes.  This lets users
configure nodepool with zuul executor zones only on the backing pools.

Change-Id: Ie6bdad190f8f0d61dab0fec37642d7a078ab52b3
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:24 -08:00
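
The defaulting described in the commit above amounts to a dictionary merge in which backing node attributes are used unless the metastatic configuration overrides them. The sketch below is an assumption-laden illustration: the helper name and the attribute key `executor-zone` are chosen for the example and do not come from the commit.

```python
from typing import Dict, Optional


def metastatic_node_attributes(
        backing_attrs: Dict[str, str],
        metastatic_attrs: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Use backing node attributes as defaults for a metastatic node."""
    merged = dict(backing_attrs)           # start from the backing node
    merged.update(metastatic_attrs or {})  # explicit values take precedence
    return merged


# The zone only needs to be configured on the backing pool.
print(metastatic_node_attributes({"executor-zone": "region-a"}))
```
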
James E. Blair 7a1c75f918 Fix metastatic missing pool config
The metastatic driver was ignoring the 3 standard pool configuration
options (max-servers, priority, and node-attributes) due to a missing
superclass method call.  Correct that and update tests to validate.

Further, the node-attributes option was undocumented for the metastatic
driver, so add it to the docs.

Change-Id: I6a65ea5b8ddb319bc131f87e0793f3626379e15f
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:19 -08:00
Zuul 2813a7df1f Merge "Kubernetes/OpenShift drivers: allow setting dynamic k8s labels" 2023-09-25 07:43:11 +00:00
Benjamin Schanzel 4660bb9aa7 Kubernetes/OpenShift drivers: allow setting dynamic k8s labels
Just like for the OpenStack/AWS/Azure drivers, allow configuring
dynamic metadata (labels) for Kubernetes resources with information
about the corresponding node request.

Change-Id: I5d174edc6b7a49c2ab579a9a0b1b560389d6de82
2023-09-11 10:49:27 +02:00
James E. Blair 3b434098c6 Add an image upload timeout to the openstack driver
Some uploads in opendev are taking hours.

We used to wait 6 hours for this, but we ended up using the SDK
default of 1 hour in recent versions.  Since we're seeing so much
disparity in time, make it user configurable.

Remove the unused 6 hour constant.

Change-Id: I9ca5fdbf7c66f176eb4f650fd287514708f46c16
2023-09-06 08:04:51 -07:00
Benjamin Schanzel 3328e22d53 Fix sphinx doc build
Version 7.2.5 breaks with a module load error,
cf. https://github.com/sphinx-doc/sphinx/issues/11662

Change-Id: I49146695305351661183cacdc3cb4a2503b49687
2023-09-01 11:02:05 +02:00
Zuul 785f7dcbc9 Merge "AWS: Add support for retrying image imports" 2023-08-28 18:43:56 +00:00
Zuul d6c2422bc3 Merge "Use diskimage username in AWS and Azure drivers" 2023-08-23 00:57:54 +00:00
Zuul 909973ff06 Merge "Update Azure API and add volume-size" 2023-08-16 23:28:14 +00:00
Zuul 0c9099a20d Merge "Add Azure gallery image support" 2023-08-16 23:28:12 +00:00
James E. Blair c2d9c45655 AWS: Add support for retrying image imports
AWS has limits on the number of image import tasks that can run
simultaneously.  In a busy system with large images, it would be
better to wait until those limits clear rather than delete the
uploaded s3 object and start over, uploading it again.  To support
this, we now detect that condition and optionally retry for a
specified amount of time.

The default remains to bail on the first error.

Change-Id: I6aa7f79b2f73c4aa6743f11221907a731a82be34
2023-08-12 11:45:22 -07:00
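
The retry behavior described in the commit above can be sketched as a loop bounded by a deadline. This is a generic illustration, not the driver's code: `ImportLimitExceeded`, `import_with_retry`, and the parameters are hypothetical names. A timeout of zero reproduces the default of bailing on the first error.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ImportLimitExceeded(Exception):
    """Hypothetical error: too many import tasks are already running."""


def import_with_retry(start_import: Callable[[], T],
                      timeout: float = 0.0,
                      interval: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry start_import until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return start_import()
        except ImportLimitExceeded:
            # With timeout=0 the deadline has already passed, so the
            # first error is re-raised immediately.
            if time.monotonic() >= deadline:
                raise
            sleep(interval)


attempts: List[int] = []


def fake_import() -> str:
    """Stand-in for an import call that hits the limit twice."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ImportLimitExceeded()
    return "import-task-done"


# A no-op sleep keeps the demonstration instantaneous.
print(import_with_retry(fake_import, timeout=60, sleep=lambda _: None))
```

Keeping the uploaded object in place while waiting is the point of the change; the sketch only models the waiting, not the upload itself.
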
James E. Blair 202230e16b Use diskimage username in AWS and Azure drivers
The AWS and Azure drivers incorrectly required the user to supply
the username in the pool configuration when using diskimages.
The OpenStack and IBMVPC drivers correctly use the top-level
diskimage configuration to determine the username.

Correct this by deprecating the pool-level configuration in the
drivers that offer it, and default it to using the top-level
configuration.

Change-Id: I4e6b4d4268b32ab7b397a11dd0ccd08b18c09a86
2023-08-03 12:31:31 -07:00
James E. Blair 07c83f555d Add ZK cache stats
To observe the performance of the ZK connection and the new tree
caches, add some statsd metrics for each of these.  This will
let us monitor queue size over time.

Also, update the assertReportedStat method to output all received
stats if the expected stat was not found (like Zuul).

Change-Id: Ia7e1e0980fdc34007f80371ee0a77d4478948518
Depends-On: https://review.opendev.org/886552
2023-08-03 10:27:25 -07:00
mbecker 3fa6821437 Add gpu support for k8s/openshift pods
This adds the option to request GPUs for kubernetes and openshift pods.

Since the resource name depends on the GPU vendor and the cluster
installation, it is left to the user to define this option in the
node pool.
To leverage the ability of some schedulers to use fractional GPUs,
the actual GPU value is read as a string.

For GPUs, requests and limits cannot be decoupled (cf.
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/),
so the same value will be used for requests and limits.

Change-Id: Ibe33b06c374a431f164080edb34c3a501c360df7
2023-07-11 07:10:30 -07:00
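
A minimal sketch of the GPU handling described in the commit above: the same string value is placed in both the requests and the limits of a container's resources. The helper name is hypothetical; `nvidia.com/gpu` is used here only as an example of a vendor-specific resource name.

```python
from typing import Dict


def gpu_resources(resource_name: str, amount: str) -> Dict[str, Dict[str, str]]:
    """Build a container resources mapping with identical GPU request and limit."""
    # The amount stays a string so that fractional values such as "0.5"
    # can be passed through unchanged to schedulers that support them.
    return {
        "requests": {resource_name: amount},
        "limits": {resource_name: amount},
    }


print(gpu_resources("nvidia.com/gpu", "1"))
```
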
James E. Blair eedd6b9d2a Add extra-resources handling to openshift drivers
This adds the extra-resources handling that was just added to the
k8s driver to openshift.

Change-Id: I56e5eaf6ec22d10e88420094e92041c0b39b04e5
2023-06-27 14:06:11 -07:00
James E. Blair ac187302a3 Add extra-resources quota handling to the k8s driver
Some k8s schedulers like run.ai use custom pod annotations rather
than standard k8s resources to specify required resources such as
gpus.  To facilitate quota handling for these resources in nodepool,
this change adds an extra-resources attribute to labels that can be
used to ensure nodepool doesn't try to launch more resources than
can be handled.

Users can already specify a 'max-resources' limit for arbitrary
resources in the nodepool config; this change allows them to also
specify arbitrary resource consumption with 'extra-resources'.

Change-Id: I3d2612a7d168bf415d58029aa295e60c3c83cecd
2023-06-27 14:06:08 -07:00
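
The interplay of 'max-resources' and 'extra-resources' in the commit above can be sketched as a simple quota check. The function and the resource name `gpus` are illustrative assumptions; the real driver's accounting is more involved.

```python
from typing import Dict


def can_launch(max_resources: Dict[str, float],
               in_use: Dict[str, float],
               extra_resources: Dict[str, float]) -> bool:
    """Return True if a label's extra resources fit under the configured limits."""
    for name, needed in extra_resources.items():
        limit = max_resources.get(name)
        if limit is None:
            continue  # no limit configured for this resource
        if in_use.get(name, 0) + needed > limit:
            return False
    return True


# Three of four GPUs are in use; a label consuming two more does not fit.
print(can_launch({"gpus": 4}, {"gpus": 3}, {"gpus": 2}))
```
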
mbecker 1822976350 Add k8s annotations to pods
This allows adding key/value pairs under
metadata.annotations in the kubernetes
resource specification.
This information can be used by different tools
to govern handling of resources.

One particular use-case is the runai-scheduler which
uses annotations to allocate fractional GPU resources
to a pod.

Change-Id: Ib319caffe51e00bedda2861e8e1f2bbe04340322
2023-06-27 14:06:01 -07:00
James E. Blair 77d9512764 Update Azure API and add volume-size
The addition of the volume-size attribute necessitates an upgrade
to the Azure API.  The newer version of the API allows us to
remove the individual handling we have for NICs, PIPs, and Disks.
That simplifies the driver greatly, but comes with some caveats
that are noted in the docs and release notes.

Finally, the volume-size attribute is added as well.

Change-Id: I6e335318cedbf0ac8944107aff9d1a2cfcab271a
2023-06-21 18:22:47 -07:00
James E. Blair 4279b4766d Add Azure gallery image support
The shared and community gallery images are another way to specify
what image to use when creating a VM.  Shared galleries are intended
for use within an organization, and community galleries are public.

This adds support for using these images.  It requires an API version
bump since the virtual machine attributes to specify them are new.

Change-Id: Ia981fcbeea6680a9d14ee8e4ec401bf227a7cc12
2023-06-21 18:21:40 -07:00
Zuul 63900e9bc4 Merge "Report leaked resource metrics in statemachine driver" 2023-05-02 23:29:19 +00:00
James E. Blair d4f2c8b9e7 Report leaked resource metrics in statemachine driver
The OpenStack driver reports some leaked metrics.  Extend that in
a generic way to all statemachine drivers.  Doing so also adds
some more metrics to the OpenStack driver.

Change-Id: I97c01b54b576f922b201b28b117d34b5ee1a597d
2023-04-26 06:40:12 -07:00
Christian Mueller 36dbff84ba Amazon EC2 Spot support
This adds support for launching Amazon EC2 Spot instances
(https://aws.amazon.com/ec2/spot/), which comes with huge cost saving
opportunities.

Amazon EC2 Spot instances are spare Amazon EC2 capacity that you can get
at a discount of up to 90% compared to on-demand pricing.
In contrast to on-demand instances, Spot instances can be reclaimed with a
2-minute advance notification
(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html).

When :attr:`providers.[aws].pools.labels.use-spot` is set to True, the AWS
driver will launch Spot instances. If an instance gets interrupted, it will be
terminated and no replacement instance will be launched.

Change-Id: I9868d014991d78e7b2421439403ae1371b33524c
2023-04-16 21:12:06 +02:00
Zuul 0e7de19664 Merge "Add support for specifying pod resource limits" 2023-03-13 08:38:55 +00:00
Zuul 7ffb3f1fd6 Merge "Add scheduler, volumes, and labels to k8s/openshift" 2023-03-13 08:32:16 +00:00
Zuul 13007b6825 Merge "Add OpenStack volume quota" 2023-02-24 20:12:41 +00:00
Zuul aaecb9659e Merge "Add import_image support to AWS" 2023-02-24 16:05:08 +00:00
James E. Blair de02ac5a20 Add OpenStack volume quota
This adds support for staying within OpenStack volume quota limits
on instances that utilize boot-from-volume.

Change-Id: I1b7bc177581d23cecd9443a392fb058176409c46
2023-02-13 06:56:03 -08:00
James E. Blair 669552f6f9 Add support for specifying pod resource limits
We currently allow users to specify pod resource requests and limits
for cpu, ram, and ephemeral storage.  But if a user specifies one of
these, the value is used for both the request and the limit.

This updates the specification to allow the use of separate request
and limit values.

It also normalizes related behavior across all 3 pod drivers,
including adding resource reporting to the openshift drivers.

Change-Id: I49f918b01f83d6fd0fd07f61c3e9a975aa8e59fb
2023-02-12 07:14:30 -08:00
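
To illustrate the separation of requests and limits described in the commit above, the sketch below builds a Kubernetes-style container resources mapping. It assumes that an unspecified limit falls back to the request value, mirroring the earlier single-value behavior; the helper name and parameters are hypothetical.

```python
from typing import Dict, Optional


def pod_resources(requests: Dict[str, str],
                  limits: Optional[Dict[str, str]] = None
                  ) -> Dict[str, Dict[str, str]]:
    """Combine requests with optional, separately specified limits."""
    merged_limits = dict(requests)       # assumed default: limit == request
    merged_limits.update(limits or {})   # explicit limits take precedence
    return {"requests": dict(requests), "limits": merged_limits}


# Request 1Gi of memory but allow bursting to 2Gi; cpu uses one value.
print(pod_resources({"cpu": "2", "memory": "1Gi"}, {"memory": "2Gi"}))
```
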
James E. Blair 9bf44b4a4c Add scheduler, volumes, and labels to k8s/openshift
This adds support for specifying the scheduler name, volumes (and
volume mounts), and additional metadata labels to the Kubernetes
and OpenShift (and OpenShift pods) drivers.

This also extends the k8s and openshift test frameworks so that we
can exercise the new code paths (as well as some previous similar
settings).  Tests and assertions for both a minimal (mostly defaults)
configuration as well as a configuration that uses all the optional
settings are added.

Change-Id: I648e88a518c311b53c8ee26013a324a5013f3be3
2023-02-11 12:03:45 -08:00
James E. Blair fdc093a8de Add import_image support to AWS
In I9478c0050777bf35e1201395bd34b9d01b8d5795 we switched from using the
import_image method to import_snapshot in the AWS driver.  This method
is faster and more like other drivers in Nodepool.  However, some operating
systems (such as Windows, RHEL or SLES) require licensing metadata
associated with an AMI which is not available to be set when we register
an AMI from a snapshot.  For these systems, the only viable way to upload
images is with the import_image method.

This change restores the previous method as an option, but keeps the
"snapshot" method as the default.

Change-Id: I81daabebbc9dbe968d8aaf65e6b70f5cdfdd01bf
2023-01-30 20:25:56 -08:00
James E. Blair aa8580ce32 Add support for privileged containers
To allow users to run docker-in-docker style workloads on k8s
and openshift clusters, add support for adding the privileged
flag to containers created in k8s and openshift pods.

Change-Id: I349d61bf200d7fb6d1effe112f7505815b06e9a8
2023-01-25 11:09:25 -08:00
James E. Blair 89f8d7a000 Fix openshift doc attribute hierarchy
The entire provider.pools.labels configuration attribute hierarchy was
outdented one level in the openshift and openshiftpods documentation,
causing it to incorrectly appear as provider.labels.  This change
corrects this and any existing references.

Change-Id: I6b682a05bc1d7622038ea6c62935259f0cffc585
2023-01-25 11:07:52 -08:00
Zuul ad7bf9aaeb Merge "Fix AWS quota limits for vCPUs" 2023-01-19 07:29:54 +00:00
Ian Wienand 46f44b8669 driver/openstack: order flavor results
I hit a situation in a new cloud where I had defined two flavors with
the same amount of RAM "opendev-control" and "opendev".  The config
had min-ram set [1] with a flavor-name of "opendev" -- I expected it
to match the exact name first, but nodepool was choosing
"opendev-control".

I guess the default order returned by the cloud is flavorid [2].  I'd
propose that sorting on a tuple of (ram, name) -- so that we match
names in alphabetical order -- is a more intuitive way to run the
match.

Documentation is updated, and a release note added.

[1] which actually we didn't want, really, because we wanted to
    exactly match the flavor:
     https://review.opendev.org/c/openstack/project-config/+/870677
[2] https://docs.openstack.org/api-ref/compute/?expanded=list-flavors-detail#list-flavors

Change-Id: I268dd598ca9f1b617c5062b41ad27d0305df60b9
2023-01-17 11:57:40 +11:00
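
The ordering proposed in the commit above can be sketched as a sort on a (ram, name) tuple followed by a first-match search. The `Flavor` type and `find_flavor` helper are hypothetical, and the sketch assumes that a flavor name given together with min-ram is matched as a substring.

```python
from typing import List, NamedTuple, Optional


class Flavor(NamedTuple):
    name: str
    ram: int  # in MB


def find_flavor(flavors: List[Flavor], min_ram: int,
                flavor_name: Optional[str] = None) -> Optional[Flavor]:
    """Return the first flavor meeting min_ram, ordered by (ram, name)."""
    for flavor in sorted(flavors, key=lambda f: (f.ram, f.name)):
        if flavor.ram < min_ram:
            continue
        if flavor_name is None or flavor_name in flavor.name:
            return flavor
    return None


flavors = [Flavor("opendev-control", 8192), Flavor("opendev", 8192)]
# With equal RAM, alphabetical order puts "opendev" ahead of "opendev-control".
print(find_flavor(flavors, min_ram=8192, flavor_name="opendev").name)
```
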
Christian von Schultz a828513ae8 Fix AWS quota limits for vCPUs
In the AWS adapter, when getting the quota for an instance type, set
the quota for the AWS service quota code to be the number of vCPUs
rather than the number of cores. The number of vCPUs is typically
twice the number of cores. This fixes "VcpuLimitExceeded" errors from
AWS.

Change-Id: I880e6abb84b0527363893576057aa105a5a448a5
2022-12-14 14:13:47 +01:00
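
The difference between counting cores and counting vCPUs, as fixed in the commit above, can be illustrated with a small quota check. The names are hypothetical, and two threads per core is only the typical case mentioned in the commit message.

```python
def vcpus_needed(cores: int, threads_per_core: int = 2) -> int:
    """vCPUs consumed by an instance; typically twice the core count."""
    return cores * threads_per_core


def fits_quota(limit_vcpus: int, used_vcpus: int, needed: int) -> bool:
    """Return True if the requested amount stays within the vCPU limit."""
    return used_vcpus + needed <= limit_vcpus


cores = 8
# Counting cores underestimates usage and appears to fit ...
print(fits_quota(32, 24, cores))
# ... while counting vCPUs shows the launch would exceed the limit.
print(fits_quota(32, 24, vcpus_needed(cores)))
```
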
Simon Westphahl 887fea5706 Correct documentation for image upload metric
The metric for the time spent uploading an image is in milliseconds, not
seconds.

Change-Id: I151bf774ca17bef34ce2d5ac794e2187da9a9b07
2022-12-05 08:35:11 +01:00