Commit Graph

600 Commits

Author SHA1 Message Date
James E. Blair 8d74042c30 Demote launch keyscan exceptions to warnings
Similarly to the recently demoted timeouts, these exceptions are
not useful since they are internally generated.  Log them as
warnings without tracebacks.

Change-Id: I84c04b65c3006f9173e5880b38694acc368b8f44
2024-04-18 14:12:12 -07:00
James E. Blair 64452f1a26 Demote launch/delete timeouts to warnings
If we hit the internal timeout while launching or deleting a server,
we raise an exception and then log the traceback.  This is not an
unexpected occurrence, and the traceback is not useful since it's
just one stack frame within the same class, so instead, let's log
these timeouts at warning level without the traceback.

Change-Id: Id4806d8ea2d0a232504e5a75d69cec239bcac670
2024-04-17 14:46:36 -07:00
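Both demotions above amount to catching the internally generated timeout and logging it without a traceback. A minimal sketch of that pattern (illustrative names only, not Nodepool's actual code):

```python
import logging

log = logging.getLogger("nodepool.example")

def run_launch(launch, timeout_exceptions):
    # Expected, internally generated timeouts are logged as warnings
    # without a traceback; anything unexpected keeps the full traceback.
    try:
        launch()
    except timeout_exceptions as e:
        log.warning("Launch timed out: %s", e)
    except Exception:
        log.exception("Launch failed")
```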
Zuul 75ebae3162 Merge "gce: clear machine type cache on bad data" 2024-03-25 19:35:17 +00:00
Zuul 22cd51fdf6 Merge "Fix leaked upload cleanup" 2024-03-20 13:04:01 +00:00
Zuul b1a40f1fd3 Merge "Add delete-after-upload option" 2024-03-18 21:06:46 +00:00
James E. Blair fc95724601 Fix leaked upload cleanup
The cleanup routine for leaked image uploads based its detection
on upload ids, but they are not unique except in the context of
a provider and build.  This meant that, for example, as long as
there was an upload with id 0000000001 for any image build for
the provider (very likely!) we would skip cleaning up any leaked
uploads with id 0000000001.

Correct this by using a key generated on build+upload (provider
is implied because we only consider uploads for our current
provider).

Update the tests relevant to this code to exercise this condition.

Change-Id: Ic68932b735d7439ca39e2fbfbe1f73c7942152d6
2024-03-09 13:46:17 -08:00
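A minimal sketch of the keying problem and fix described above, with hypothetical ids (not Nodepool's actual cleanup code):

```python
def upload_key(build_id, upload_id):
    # Upload ids such as "0000000001" repeat across builds, so a key
    # unique within a provider must combine build and upload ids.
    return f"{build_id}-{upload_id}"

known_uploads = {upload_key("0000000123", "0000000001")}

leaked = ("0000000456", "0000000001")
if upload_key(*leaked) not in known_uploads:
    print("would clean up leaked upload", leaked)  # now correctly detected
```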
James E. Blair fd454706ca Add delete-after-upload option
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.

A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.

Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
2024-03-09 06:51:56 -08:00
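A rough sketch of the delete-after-upload idea, assuming image files are named <image>.<format>; the function and arguments here are illustrative, not Nodepool's configuration syntax:

```python
import pathlib

def prune_image_files(image_dir, image_name, keep_formats=("qcow2",)):
    # After uploads complete, remove large artifacts (e.g. raw) while
    # keeping the formats an operator wants to retain for emergencies.
    for path in pathlib.Path(image_dir).glob(f"{image_name}.*"):
        if path.suffix.lstrip(".") not in keep_formats:
            path.unlink()
```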
James E. Blair 963cb3f5b4 gce: clear machine type cache on bad data
We have observed GCE returning bad machine type data which we
then cache.  If that happens, clear the cache to avoid getting
stuck with the bad data.

Change-Id: I32fac2a92d4f9d400fe2db41fffd8d189d097542
2024-03-05 16:05:18 -08:00
Zuul eef07d21f3 Merge "Metastatic: Copy cloud attribute from backing node" 2024-03-05 16:51:00 +00:00
Zuul dfd2877c81 Merge "Use separate log package for NodescanRequest messages" 2024-03-05 15:32:18 +00:00
Zuul 392cf017c3 Merge "Add support for AWS IMDSv2" 2024-02-28 02:46:53 +00:00
Zuul 8775e54e5d Merge "Remove hostname-format option" 2024-02-28 02:46:52 +00:00
Zuul fe70068909 Merge "Add host-key-checking to metastatic driver" 2024-02-27 19:03:48 +00:00
Zuul 202188b2f5 Merge "Reconcile docs/validation for some options" 2024-02-27 18:19:33 +00:00
Benjamin Schanzel 4c4e8aefdb Metastatic: Copy cloud attribute from backing node
As is done with several other metadata fields, copy the `cloud` attribute from
the backing node to the metastatic node.

Change-Id: Id83b3e09147baaab8a85ace4d5beba77d1eb87bd
2024-02-23 14:20:09 +01:00
James E. Blair 8259170516 Change the AWS default image volume-type from gp2 to gp3
gp3 is better in almost every way (cheaper, faster, more configurable).
It seems difficult to find a situation where gp2 would be a better
choice, so update the default when creating images to use gp3.

There are two locations where we can specify volume-type: image creation
(where the volume type becomes the default type for the image) and
instance creation (where we can override what the image specifies).
This change updates only the first (image creation), but not the second,
which has no default (which means to use whatever the image specified).

https://aws.amazon.com/ebs/general-purpose/

Change-Id: Ibfc5dfd3958e5b7dbd73c26584d6a5b8d3a1b4eb
2024-02-20 13:04:26 -08:00
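A hedged boto3 sketch of the first location (image creation), where the registered AMI's root volume defaults to gp3; the ids are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The volume type recorded here becomes the default for instances
# booted from the image; launches may still override it.
ec2.register_image(
    Name="example-image",
    RootDeviceName="/dev/sda1",
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {
            "SnapshotId": "snap-0123456789abcdef0",  # placeholder
            "VolumeType": "gp3",
        },
    }],
)
```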
James E. Blair 21f1b88b75 Add host-key-checking to metastatic driver
If a long-running backing node used by the metastatic driver develops
problems, performing a host-key-check each time we allocate a new
metastatic node may detect these problems.  If that happens, mark
the backing node as failed so that no more nodes are allocated to
it and it is eventually removed.

Change-Id: Ib1763cf8c6e694a4957cb158b3b6afa53d20e606
2024-02-13 14:12:52 -08:00
James E. Blair e097731339 Remove hostname-format option
This option has not been used since at least the migration to the
statemachine framework.

Change-Id: I7a0e928889f72606fcbba0c94c2d49fbb3ffe55f
2024-02-08 09:40:41 -08:00
James E. Blair f89b41f6ad Reconcile docs/validation for some options
Some drivers were missing docs and/or validation for options that
they actually support.  This change:

adds launch-timeout to:
  metastatic docs and validation
  aws validation
  gce docs and validation
adds post-upload-hook to:
  aws validation
adds boot-timeout to:
  metastatic docs and validation
adds launch-retries to:
  metastatic docs and validation

Change-Id: Id3f4bb687c1b2c39a1feb926a50c46b23ae9df9a
2024-02-08 09:36:35 -08:00
Zuul 7abd12906f Merge "Introduce error.capacity states_key for InsufficientInstanceCapacity error" 2024-02-08 08:23:17 +00:00
James E. Blair c78fe769f2 Allow custom k8s pod specs
This change adds the ability to use the k8s (and friends) drivers
to create pods with custom specs.  This will allow nodepool admins
to define labels that create pods with options not otherwise supported
by Nodepool, as well as pods with multiple containers.

This can be used to implement the versatile sidecar pattern, which is
useful for running jobs that need a background system process (such as
a database server or container runtime) in an environment where it is
difficult to background such a process.

It is still the case that a single resource is returned to Zuul, so
a single pod will be added to the inventory.  Therefore, the expectation
that it should be possible to shell into the first container in the
pod is documented.

Change-Id: I4a24a953a61239a8a52c9e7a2b68a7ec779f7a3d
2024-01-30 15:59:34 -08:00
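A minimal sketch of the multi-container (sidecar) case this enables, using the upstream Kubernetes Python client directly rather than Nodepool's label configuration; names and images are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

# A pod with a main container plus a database sidecar; Zuul is expected
# to shell into the first container ("main").
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-pod"},
    "spec": {
        "containers": [
            {"name": "main", "image": "ubuntu:22.04",
             "command": ["sleep", "infinity"]},
            {"name": "db-sidecar", "image": "mariadb:10.11",
             "env": [{"name": "MARIADB_RANDOM_ROOT_PASSWORD",
                      "value": "1"}]},
        ],
    },
}
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```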
James E. Blair 3f4fb008b0 Add support for AWS IMDSv2
This is an authenticated http metadata service which is typically
available by default, but a more secure setup is to enforce its
usage.

This change adds the ability to do that for both instances and
AMIs.

Change-Id: Ia8554ff0baec260289da0574b92932b37ffe5f04
2024-01-24 15:11:35 -08:00
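A hedged boto3 sketch of enforcing IMDSv2 on an instance (the image id is a placeholder; the AMI-side setting is analogous):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Require the token-based (authenticated) metadata service.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    MetadataOptions={
        "HttpTokens": "required",   # enforce IMDSv2
        "HttpEndpoint": "enabled",
    },
)
```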
James E. Blair 70be90e742 Fix duplicate fd registration in nodescan
In an attempt to make the nodescan process as quick as possible,
we start the connection in the provider statemachine thread before
handing the remaining work off to the nodescan statemachine thread.

However, if the nodescan worker is near the end of its request list
when the provider adds the request, then it may end up performing
the initial connection nearly simultaneously with the provider
thread.  They may both create a socket and attempt to register
the FD.  If the race results in them registering the same FD,
the following exception occurs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 253, in runStateMachine
    keys = self.nodescan_request.result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1295, in result
    raise self.exception
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1147, in addRequest
    self._advance(request, False)
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1187, in _advance
    request.advance(socket_ready)
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1379, in advance
    self._connect()
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1340, in _connect
    self.worker.registerDescriptor(self.sock)
  File "/usr/local/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 1173, in registerDescriptor
    self.poll.register(
FileExistsError: [Errno 17] File exists

To address this, rather than attempting to coordinate work between
these two threads, let's just let the nodescan worker handle it.
To try to keep the process responsive, we'll wake the nodescan worker
if it's sleeping.

Change-Id: I5ceda68b856c09bf7606e62ac72ca5c5c76d2661
2024-01-24 14:54:43 -08:00
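The race's symptom is easy to reproduce in isolation: epoll refuses a second registration of the same descriptor. A small Linux-only demonstration (not Nodepool code):

```python
import select
import socket

sock = socket.socket()
poll = select.epoll()
poll.register(sock.fileno(), select.EPOLLOUT)
try:
    # A second registration of the same fd is exactly the failure mode
    # seen in the traceback above.
    poll.register(sock.fileno(), select.EPOLLOUT)
except FileExistsError as e:
    print("duplicate registration:", e)
finally:
    poll.unregister(sock.fileno())
    poll.close()
    sock.close()
```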
Dong Zhang aa113be19f Introduce error.capacity states_key for InsufficientInstanceCapacity error
We want to handle the "InsufficientInstanceCapacity" error differently than
other "error.unknown" errors in our monitoring/alerting system.  With this
change, it produces "error.capacity" instead of "error.unknown".

Change-Id: Id3a49d4b2d4b4733f801e65df69b505e913985a7
2024-01-04 15:37:51 +01:00
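A minimal sketch of the classification, assuming a botocore-style ClientError whose response carries an error code (the function name is hypothetical):

```python
def states_key_for_error(exc):
    # Capacity shortages get their own statsd key so monitoring can
    # treat them differently from genuinely unknown errors.
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    if code == "InsufficientInstanceCapacity":
        return "error.capacity"
    return "error.unknown"
```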
Zuul 42f9100d82 Merge "Fix gpu parameter type in openshiftpods" 2023-12-12 20:36:18 +00:00
Zuul ac703de734 Merge "Resolve statsd client once at startup" 2023-12-09 16:18:57 +00:00
James E. Blair 5849c3b9a7 Fix gpu parameter type in openshiftpods
In config validation, the gpu parameter type was specified as str
rather than float.  This is corrected.

This was not discovered in testing because the tests which exercise
the gpu parameter in the other k8s drivers are not present in the
openshiftpods driver.  This change also adds the missing tests for the
default resources and resource limits feature, which exercise the gpu
limits.

Change-Id: Ife932acaeb5a90ebc94ad36c3b4615a4469f0c40
2023-12-01 08:06:26 -08:00
Zuul f39d2b955a Merge "Fix logging of failed node launches" 2023-11-29 16:19:18 +00:00
James E. Blair cb8366d70c Use backing node attributes as metastatic default
To support the use case where one has multiple pools providing
metastatic backing nodes, and those pools are in different regions,
and a user wishes to use Zuul executor zones to communicate with
whatever metastatic nodes eventually produced from those regions,
this change updates the launcher and metastatic driver to use
the node attributes (where zuul executor region names are specified)
as default values for metastatic node attributes.  This lets users
configure nodepool with zuul executor zones only on the backing pools.

Change-Id: Ie6bdad190f8f0d61dab0fec37642d7a078ab52b3
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:24 -08:00
James E. Blair 7a1c75f918 Fix metastatic missing pool config
The metastatic driver was ignoring the 3 standard pool configuration
options (max-servers, priority, and node-attributes) due to a missing
superclass method call.  Correct that and update tests to validate.

Further, the node-attributes option was undocumented for the metastatic
driver, so add it to the docs.

Change-Id: I6a65ea5b8ddb319bc131f87e0793f3626379e15f
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:19 -08:00
Zuul 28d36ad5a5 Merge "Fix UnboundLocalError in error handling of runStateMachine" 2023-11-20 15:27:35 +00:00
Benjamin Schanzel 65a81ad7b5 Use separate log package for NodescanRequest messages
The state transition log messages for the Nodescan statemachine can be
quite excessive.  While they might be useful for debugging, it is not
always necessary to have all of them available.
To provide an easier way to filter these messages, use a dedicated log
package in the NodescanRequest class.

Change-Id: I2b1a625f5e5e375317951e410a27ff4243d4a0ef
2023-11-20 14:46:36 +01:00
Simon Westphahl 2aeaee92f1 Ignore unrelated error labels in request handler
Nodepool was declining node requests when other unrelated instance types
of a provider were unavailable:

    Declining node request <NodeRequest {... 'node_types': ['ubuntu'],
    ... }> due to ['node type(s) [ubuntu-invalid] not available']

To fix this, we check the error labels against the requested labels
before including them in the list of invalid node types.

Change-Id: I7bbb3b813ca82baf80821a9e84cc10385ea95a01
2023-11-09 13:58:54 +01:00
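A minimal sketch of the filtering, assuming plain lists of label names (not the request handler's actual data structures):

```python
def invalid_node_types(requested_labels, error_labels):
    # Only labels that were actually requested should cause a decline;
    # unrelated provider errors are ignored.
    return sorted(set(requested_labels) & set(error_labels))

print(invalid_node_types(["ubuntu"], ["ubuntu-invalid"]))  # [] -> no decline
```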
James E. Blair 49e7dab5f5 Minor improvements to nodescan state machine
* Change the state change logging level to debug -- it's chatty

* Don't allow individual connection attempts to take > 10 seconds

  This behavior existed in the old nodescan method but was not
  ported over; it should be.  As a port comes online as part of
  the boot process, early connection attempts may hang while later
  ones may succeed.  We want to continually try new connections
  whether they return an error or hang.

* Fall through to the complete state even if the last key is
  ignored

  Previously, if the last key we scanned was not compatible, the
  state machine would need to go through one extra state
  transition in order to set the complete flag, due to an early
  return call.  We now rearrange that state transition so that we
  fall through to completion regardless of whether the last key
  was added.

Change-Id: Ic6fd1551c3ef1bbd8eaf3b733e9ecc2609bce47f
2023-11-07 11:20:45 -08:00
James E. Blair 5984a2638a Fix AWS external id setting
We set the AWS external id to the hostname when building, but that
causes problems if we need to retry the build -- we won't delete
the instance we're trying to abort because we don't have the actual
external id (InstanceId).

Instead, delay setting it just a little longer until we get the real
InstanceId back from AWS.

Change-Id: Ibc7ab55ccd54c22ad006c13a0af3e9598056f7a4
2023-11-04 08:54:24 -07:00
James E. Blair 8669acfe6b Use a state machine for ssh key scanning
We currently use a threadpool executor to scan up to 10 nodes at
a time for ssh keys.  If they are slow to respond, that can create
a bottleneck.  To alleviate this, use a state machine model for
managing the process, and drive each state machine from a single
thread.

We use select.epoll() to handle the potentially large number of
connections that could be happening simultaneously.

Note: the paramiko/ssh portion of this process spawns its own
thread in the background (and always has).  Since we are now allowing
more keyscan processes in parallel, we could end up with an
unbounded set of paramiko threads in the background.  If this is
a concern we may need to cap the number of requests handled
simultaneously.  Even if we do that, this will still result in
far fewer threads than simply increasing the cap on the threadpool
executor.

Change-Id: I42b76f4c923fd9441fb705e7bffd6bc9ea7240b1
2023-11-04 08:54:20 -07:00
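A condensed, Linux-only sketch of the epoll-driven approach: many non-blocking connection attempts are multiplexed from a single thread (the actual key exchange via paramiko is omitted):

```python
import errno
import select
import socket

def scan_hosts(addresses, port=22, timeout=10.0):
    """Return the addresses whose TCP port accepted a connection."""
    poll = select.epoll()
    pending = {}
    for addr in addresses:
        sock = socket.socket()
        sock.setblocking(False)
        err = sock.connect_ex((addr, port))
        if err not in (0, errno.EINPROGRESS):
            sock.close()
            continue
        poll.register(sock.fileno(), select.EPOLLOUT)
        pending[sock.fileno()] = (addr, sock)
    reachable = []
    for fd, _ in poll.poll(timeout):
        addr, sock = pending.pop(fd)
        if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0:
            reachable.append(addr)
        poll.unregister(fd)
        sock.close()
    for _, sock in pending.values():  # connections that never completed
        sock.close()
    poll.close()
    return reachable
```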
James E. Blair 7d7d81cd46 AWS: improve service quota handling
The AWS API call to get the service quota has its own rate limit
that is separate from EC2.  It is not documented, but the defaults
appear to be very small; experimentally it appears to be something
like a bucket size of 30 tokens and a refill rate somewhere
between 3 and 10 tokens per minute.

This change moves the quota lookup calls to their own rate limiter
so they are accounted for separately from other calls.

We could configure that rate limiter with these very low values;
however, that would significantly slow startup since we need to issue
several calls at once when we start; after that we are not sensitive
to a delay.  The API can handle a burst at startup (with a bucket
size of 30) but our rate limiter doesn't have a burst option.  Instead
of configuring it properly, we will just configure it with the rate
limit we use for normal operations (so that we at least have some
delay), but otherwise, rely on caching so that we know that we won't
actually exceed the rate limit.

This change therefore also adds a Lazy Executor TTL cache to the
operations with a timeout of 5 minutes.  This means that we will issue
bursts of requests every 5 minutes, and as long as the number of
requests is less than the token replacement rate, we'll be fine.

Because this cache is on the adapter, multiple pool workers will use
the same cache.  This will cause a reduction in API calls since
currently there is only pool-worker level caching of nodepool quota
information objects.  When the 5 minute cache on the nodepool quota
info object expires, we will now hit the adapter cache (with its own
5 minute timeout) rather than go directly to the API repeatedly for
each pool worker.  This does mean that quota changes may take between
5 and 10 minutes to appear in nodepool.

The current code only looks up quota information for instance and
volume types actually used.  If that number is low, all is well, but
if it is high, then we could potentially approach or exceed the token
replacement rate.  To make this more predictable, we will switch the
API call to list all quotas instead of fetching only the ones we need.
Due to pagination, this results in a total of 8 API calls as of writing;
5 for ec2 quotas and 3 for ebs.  These are likely to grow over time,
but very slowly.

Taken all together, these changes mean that a single launcher should
issue at most 8 quota service api requests every 5 minutes, which is
below the lowest observed token replacement rate.

Change-Id: Idb3fb114f5b8cda8a7b6d5edc9c011cb7261be9f
2023-10-17 14:36:37 -07:00
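A minimal TTL-cache sketch showing the caching idea (Nodepool uses a Lazy Executor TTL cache; this decorator is only an illustration):

```python
import time

def ttl_cache(ttl):
    # Repeat calls with the same arguments within `ttl` seconds reuse
    # the cached result, so bursts of quota lookups hit the API at most
    # once per window.
    def decorator(func):
        cache = {}
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]
            value = func(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl=300)
def list_service_quotas(service_code):
    ...  # stand-in for the expensive, rate-limited API call
```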
Simon Westphahl c973be0a1b Log all AWS API list operation times
Change-Id: I582fa409d4f8cb7d3e25e7201c514c7e040f98a0
2023-10-17 14:36:37 -07:00
Simon Westphahl 3c71fc9f4b Use thread pool executor for AWS API requests
So far we've cached most of the AWS API listings (instances, volumes,
AMIs, snapshots, objects) but with refreshes happening synchronously.

Since some of those methods are used as part of other methods during
request handling, we make the refreshes asynchronous.

Change-Id: I22403699ebb39f3e4dcce778efaeb09328acd932
2023-10-17 14:36:37 -07:00
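A rough sketch of the asynchronous-refresh pattern: request handling reads the last cached listing immediately while a background refresh runs in a thread pool (class and method names are illustrative):

```python
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

class CachedListing:
    def __init__(self, fetch):
        self._fetch = fetch   # e.g. a function wrapping a list API call
        self._value = []
        self._future = None

    def get(self):
        # Return the most recent result without blocking; kick off a new
        # refresh whenever the previous one has finished (or never ran).
        if self._future is None or self._future.done():
            if self._future is not None:
                self._value = self._future.result()
            self._future = executor.submit(self._fetch)
        return self._value
```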
James E. Blair ee5cd42292 Use ec2_client for images and snapshots
This expands on the previous commit and removes the remaining use
of the EC2 resource API in favor of the lower-level and more
explicit ec2_client API.

Change-Id: Ic176a3702018233d353752f4b108eb3aa992d07b
2023-10-17 14:36:37 -07:00
James E. Blair 40eb0fd8d4 Use ec2_client for instances and volumes
There are two modes of use for the boto3 interface library for AWS:
resource or client calls.  The resource approach returns magic objects
representing instances of each type of resource which have their own
methods and can lazy-load relationships to other objects.  The client
approach is more low-level and returns dictionary representations of
the HTTP results.

In general in the AWS driver, we try to use resources when we can
and fall back to client methods otherwise.

When we construct our own AwsInstance objects (which translates what we
get from AWS to what nodepool expects), we filled the "az" field by
getting the availability zone from the subnet.  The subnet is an object
relationship that is lazy-loaded, which means that each time we
retrieved an instance record, we issued another API call to look up
the subnet (and this information is not cached).

To correct this, we switch to getting the subnet from the placement
field of the instance (which is more sensible anyway).

To prevent future accidental lazy-loads of un-cached data, we switch
instance and volume operations from the resource style of usage to
the client.

The result is a massive increase in performance (as the subnet lazy-load
could take 0.1 seconds for each instance running, and we very frequently
list all instances).

Change-Id: I529318896fc8096bbd9dbdac60d1a29c3ac641b6
2023-10-17 14:36:37 -07:00
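A hedged boto3 client-style sketch: the availability zone comes from the instance's Placement field in the paginated listing, with no lazy subnet lookup:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            az = instance["Placement"]["AvailabilityZone"]
            print(instance["InstanceId"], az)
```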
Benjamin Schanzel f0b5d1f149 Fix logging of failed node launches
A Node has no attribute `label`; instead this information is taken from
`node.type`.  Additionally, print the node id.

2023-10-06 16:26:27,834 ERROR nodepool.NodeLauncher: [e: 1bfeefa5044b481fa3e18781a9972773] [node_request: 200-0044066729] [node: 0045664756] Launch attempt 3/3 failed for node 0045664756:
[...snip...]
kubernetes.client.exceptions.ApiException: (409)
[...snip...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/nodepool/lib/python3.11/site-packages/nodepool/driver/utils.py", line 101, in run
    self.node.label)
    ^^^^^^^^^^^^^^^
AttributeError: 'Node' object has no attribute 'label'

Change-Id: I42913c26318247f6d352308fa2d7b5fdf8a5fcd0
2023-10-12 14:14:12 +02:00
Benjamin Schanzel a159e40f85 AWS driver: Check for volume existence before calculating quota
The quota calculation of the AWS driver occasionally raises NoneType
errors because no volume was found for the VolumeId attached to the
instance.  This can happen because of an outdated lru_cache, where
volumes of new instances aren't available until the cache's TTL expires.
In this case, skip the quota calculation for the volume and log a
warning with the instance information.

Change-Id: Ib58dad787483a1a2e873216b25f4d4aa2abbb47c
2023-10-11 16:31:07 +02:00
Zuul 76fb25d529 Merge "Handle invalid AWS instance in quota lookup" 2023-09-25 21:38:34 +00:00
Zuul 2813a7df1f Merge "Kubernetes/OpenShift drivers: allow setting dynamic k8s labels" 2023-09-25 07:43:11 +00:00
Dong Zhang 087f163305 Avoid unnecessary sorting of flavors
The original implementation sorts the flavors list every time it is
retrieved from the cache.  This has some drawbacks and could potentially
cause issues:

1. Since the list is cached, sorting it every time is not necessary.
2. When the list is accessed while it is being sorted in another thread,
   it might return an empty list (when a key function is supplied, the
   sort may release the GIL).

To fix this, the list is sorted right after the API call and before it
is put into the cache.

Change-Id: If3461f88844d7c2e139e3fc4a076abd7fdff66a7
2023-09-15 09:08:55 +02:00
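A minimal sketch of sorting once at refresh time rather than on every cache read (the sort key is illustrative, not the driver's actual one):

```python
def refresh_flavors(fetch_flavors, cache):
    # Sort immediately after the API call and store the sorted list, so
    # cache readers never observe a list that is mid-sort.
    flavors = sorted(fetch_flavors(), key=lambda f: (f["ram"], f["name"]))
    cache["flavors"] = flavors
    return flavors
```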
Benjamin Schanzel 9249e125a7 Fix UnboundLocalError in error handling of runStateMachine
This fixes an UnboundLocalError that can arise while handling other
errors.

```
[...]
Traceback (most recent call last):
  File "/opt/nodepool/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 741, in _runStateMachines
    sm.runStateMachine()
  File "/opt/nodepool/lib/python3.11/site-packages/nodepool/driver/statemachine.py", line 326, in runStateMachine
    if state_machine and state_machine.external_id:
       ^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'state_machine' where it is not associated with a value
```

Change-Id: I06ab0851cb2803f67d1d35fc12fabea9ea931501
2023-09-13 17:04:49 +02:00
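A minimal sketch of the fix's shape: binding the variable before the try block keeps the error handler itself from raising (names here are illustrative, not the statemachine module's actual code):

```python
def run_state_machine(build_state_machine):
    state_machine = None   # bound up front so the handler can test it
    try:
        state_machine = build_state_machine()
        state_machine.advance()
    except Exception:
        if state_machine and state_machine.external_id:
            print("cleaning up", state_machine.external_id)
        raise
```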
Zuul e4a0acf7c5 Merge "AWS: handle 'InvalidConversionTaskId.Malformed' error" 2023-09-11 22:24:55 +00:00
Benjamin Schanzel 4660bb9aa7 Kubernetes/OpenShift drivers: allow setting dynamic k8s labels
Just like for the OpenStack/AWS/Azure drivers, allow configuring
dynamic metadata (labels) for Kubernetes resources with information
about the corresponding node request.

Change-Id: I5d174edc6b7a49c2ab579a9a0b1b560389d6de82
2023-09-11 10:49:27 +02:00
Zuul 5c8f007e83 Merge "Add an image upload timeout to the openstack driver" 2023-09-07 00:57:38 +00:00