Commit Graph

33 Commits

Author SHA1 Message Date
Zuul ac703de734 Merge "Resolve statsd client once at startup" 2023-12-09 16:18:57 +00:00
Benjamin Schanzel f0b5d1f149
Fix logging of failed node launches
A Node has no attribute `label`; instead, this information is taken from
`node.type`. Additionally, print the node id.

2023-10-06 16:26:27,834 ERROR nodepool.NodeLauncher: [e: 1bfeefa5044b481fa3e18781a9972773] [node_request: 200-0044066729] [node: 0045664756] Launch attempt 3/3 failed for node 0045664756:
[...snip...]
kubernetes.client.exceptions.ApiException: (409)
[...snip...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/nodepool/lib/python3.11/site-packages/nodepool/driver/utils.py", line 101, in run
    self.node.label)
    ^^^^^^^^^^^^^^^
AttributeError: 'Node' object has no attribute 'label'

Change-Id: I42913c26318247f6d352308fa2d7b5fdf8a5fcd0
2023-10-12 14:14:12 +02:00
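To illustrate the fix above, here is a minimal sketch of a launch-failure log call that uses node.type and node.id; the class, method name, and label value are hypothetical stand-ins, not the actual NodeLauncher code:

import logging
from types import SimpleNamespace


class NodeLauncherSketch:
    log = logging.getLogger("nodepool.NodeLauncher")

    def __init__(self, node, retries=3):
        self.node = node
        self.retries = retries

    def logLaunchFailure(self, attempt):
        # A Node carries `type` (and `id`), not `label`.
        self.log.error("Launch attempt %d/%d failed for node %s of type %s",
                       attempt, self.retries, self.node.id, self.node.type)


logging.basicConfig(level=logging.ERROR)
node = SimpleNamespace(id="0045664756", type=["example-label"])  # stand-in Node
NodeLauncherSketch(node).logLaunchFailure(3)
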
Tristan Cacqueray 8dc55c231f Add tenant and label name to Launch failed error
This change improves the launch failure logs to include the tenant
and the label name of the failed node.

Change-Id: I13ddfb1333a4865d3b117aa68fda14d3123fe451
2023-08-23 18:06:06 +00:00
James E. Blair 21b8451947 Resolve statsd client once at startup
We currently create a new statsd client each time we replace a
provider manager.  This means that if we are unable to resolve
DNS at that time, the new provider may crash due to unhandled
exceptions.

To resolve this, let's adopt the same behavior we have in Zuul
which is to set up the statsd client once at startup and continue
to use the same client object for the life of the process.

This means that operators will still see errors at startup in the case of
a misconfiguration, but external changes after that will not affect
nodepool.

Change-Id: I65967c71e859fddbea15aee89f6ddae44344c87b
2023-08-14 10:47:53 -07:00
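A minimal sketch of the pattern described above, assuming the `statsd` package and the usual STATSD_HOST/STATSD_PORT environment variables; the helper and class names are illustrative, not Nodepool's real API:

import os


def make_statsd_client():
    # Resolve the statsd target a single time at startup.
    host = os.environ.get("STATSD_HOST")
    if not host:
        return None
    import statsd
    return statsd.StatsClient(host, int(os.environ.get("STATSD_PORT", 8125)))


class LauncherSketch:
    def __init__(self):
        # Created once and reused for the life of the process.
        self.statsd = make_statsd_client()

    def replace_provider_manager(self, provider_config):
        # Hypothetical: the new manager gets the shared client instead of
        # resolving DNS again, so a transient DNS failure cannot crash it.
        return {"provider": provider_config, "statsd": self.statsd}
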
James E. Blair 99d2a361a1 Use cached ids in node iterator more often
There are several places where it is now probably safe to use
cached ids when iterating over ZK nodes.  The main reasons not to
use cached ids are in the case of race conditions or in case the
tree cache may have missed an event and is unaware of a node.  We
have increased confidence in the accuracy of our cache now, so at
least in the cases where we know that races are not an issue, we
can switch those to use cached ids and save a ZK round trip (a
possibly significant one if there is a long list of children).

This change adds the flag in the following places (with
explanations of why it's safe):

* State machine cleanup routines

    Leaked instances have to show up on two subsequent calls to
    be acted upon, so this is not sensitive to timing

* Quota calculation

    If we do get the quota wrong, drivers are expected to handle
    that gracefully anyway.

* Held node cleanup

    Worst case is we wait until next iteration to clean up.

* Stats

    They're a snapshot anyway, so a cache mismatch is really just
    a shift in the snapshot time.

Change-Id: Ie7af2f62188951bf302ffdb64827d868609a1e3c
2023-05-30 13:27:45 -07:00
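Schematically, the idea reads like the sketch below; this is not Nodepool's actual ZK API, and the path, flag, and class names are hypothetical:

class NodeStoreSketch:
    def __init__(self, zk_client, tree_cache):
        self.zk = zk_client            # e.g. a kazoo client
        self.cache = tree_cache        # node_id -> cached node data

    def node_ids(self, cached_ids=False):
        if cached_ids:
            # No ZK round trip; acceptable for cleanup, quota and stats
            # paths where a brief mismatch only delays work one iteration.
            return list(self.cache.keys())
        # Authoritative: ask ZooKeeper for the current children.
        return self.zk.get_children("/nodepool/nodes")

    def iter_cached_nodes(self, cached_ids=False):
        for node_id in self.node_ids(cached_ids=cached_ids):
            data = self.cache.get(node_id)
            if data is not None:
                yield node_id, data
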
James E. Blair eda7b0f609 Add a LazyExecutorTTLCache to the OpenStack driver
See the docstring for an explanation of what a Lazy Executor TTL
Cache is.  By switching the caching of the server list method (and
also volumes and fips) to the lazy cache, we will make all of the
methods called by the state machines asynchronous.  This means
that both the create and delete state machine threads should be
able to spin through all of their state machines as quickly as
Python and ZooKeeper overhead will allow.

Change-Id: Ibce6b4d82929e6a764fdbc025990f7e01060b509
2023-03-15 15:25:52 -07:00
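A condensed, illustrative sketch of the lazy executor TTL cache idea (not the class actually added by this change): callers always get the last cached value immediately, and an expired entry triggers a single background refresh instead of blocking the caller:

import functools
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def lazy_ttl_cache(ttl, executor):
    def decorator(func):
        lock = threading.Lock()
        state = {"value": None, "ts": None, "future": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                now = time.monotonic()
                if state["ts"] is None:
                    # Very first call has nothing cached yet, so block once.
                    state["value"] = func(*args, **kwargs)
                    state["ts"] = now
                else:
                    expired = (now - state["ts"]) >= ttl
                    refreshing = (state["future"] is not None
                                  and not state["future"].done())
                    if expired and not refreshing:
                        def refresh():
                            value = func(*args, **kwargs)
                            with lock:
                                state["value"] = value
                                state["ts"] = time.monotonic()
                        state["future"] = executor.submit(refresh)
                return state["value"]
        return wrapper
    return decorator


executor = ThreadPoolExecutor(max_workers=2)


@lazy_ttl_cache(ttl=10, executor=executor)
def list_servers():
    time.sleep(1)              # stands in for the slow cloud list call
    return ["server-1", "server-2"]


print(list_servers())          # blocks once, then later calls return instantly
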
James E. Blair de02ac5a20 Add OpenStack volume quota
This adds support for staying within OpenStack volume quota limits
on instances that utilize boot-from-volume.

Change-Id: I1b7bc177581d23cecd9443a392fb058176409c46
2023-02-13 06:56:03 -08:00
James E. Blair 207d8ac63c AWS multi quota support
This adds support for AWS quotas that are specific to instance types.

The current quota support in AWS assumes only the "standard" instance types,
but AWS has several additional types with particular specialties (high memory,
GPU, etc).  This adds automatic support for those by encoding their service
quota codes (like 'L-1216C47A') into the QuotaInformation object.

QuotaInformation accepts not only cores, ram, and instances as resource
values, but now also accepts arbitrary keys such as 'L-1216C47A'.
Extra testing of QI is added to ensure we handle the arithmetic correctly
in cases where one or the other operand does not have a resource counter.

The statemachine drivers did not encode their resource information into
the ZK Node record, so tenant quota was not operating correctly.  This is
now fixed.

The AWS driver now accepts max_cores, _instances, and _ram values similar
to the OpenStack driver.  It additionally accepts max_resources which can
be used to specify limits for arbitrary quotas like 'L-1216C47A'.

The tenant quota system now also accepts arbitrary keys such as 'L-1216C47A'
so that, for example, high memory nodes may be limited by tenant.

The mapping of instance types to quota is manually maintained, however,
AWS doesn't seem to add new instance types too often, and those it does are
highly specialized.  If a new instance type is not handled internally, the
driver will not be able to calculate expected quota usage, but will still
operate until the new type is added to the mapping.

Change-Id: Iefdc8f3fb8249c61c43fe51b592f551e273f9c36
2022-07-25 14:41:07 -07:00
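The arithmetic described above can be sketched as follows; the class and method names are illustrative, and the real QuotaInformation class may differ:

class QuotaSketch:
    def __init__(self, **resources):
        # e.g. QuotaSketch(cores=8, instances=1, **{'L-1216C47A': 4})
        self.quota = dict(resources)

    def add(self, other):
        for key, value in other.quota.items():
            # A counter present on only one operand simply carries over.
            self.quota[key] = self.quota.get(key, 0) + value

    def subtract(self, other):
        for key, value in other.quota.items():
            self.quota[key] = self.quota.get(key, 0) - value

    def non_negative(self):
        return all(v >= 0 for v in self.quota.values())


available = QuotaSketch(cores=128, instances=20, **{'L-1216C47A': 64})
used = QuotaSketch(cores=96, **{'L-1216C47A': 96})   # no 'instances' counter
available.subtract(used)
print(available.non_negative())   # False: the 'L-1216C47A' pool is exhausted
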
James E. Blair 4aaacc7919 Fix ignore_provider_quota in statemachine drivers
The statemachine driver copied the quota handling from OpenStack,
which also included a check for the presence of the ignore_provider_quota
configuration attribute.  This attribute does not exist in any of
the statemachine drivers, and due to the way the check was constructed,
we defaulted to ignoring the provider quota rather than actually
checking it.

We should perform this check so that providers can exclude themselves
if there is no possibility of satisfying a request.  This corrects the
check.

Change-Id: I7ced3f749a1646ecbb2b80f93b26e61a4e8cd69a
2022-06-29 10:47:49 -07:00
James E. Blair d5b0dee642 AWS driver create/delete improvements
The default AWS rate limit is 2 instances/sec, but in practice, we
can achieve something like 0.6 instances/sec with the current code.
That's because the create instance REST API call itself takes more
than a second to return.  To achieve even the default AWS rate
(much less a potentially faster one which may be obtainable via
support request), we need to alter the approach.  This change does
the following:

* Parallelizes create API calls.  We create a threadpool with
  (typically) 8 workers to execute create instance calls in the
  background.  2 or 3 workers should be sufficient to meet the
  2/sec rate, more allows for the occasional longer execution time
  as well as a customized higher rate.  We max out at 8 to protect
  nodepool from too many threads.
* The state machine uses the new background create calls instead
  of synchronously creating instances.  This allows other state
  machines to progress further (i.e., advance to ssh keyscan faster
  in the case of a rush of requests).
* Delete calls are batched.  They don't take as long as create calls,
  yet their existence at all uses up rate limiting slots which could
  be used for creating instances.  By batching deletes, we make
  more room for creates.
* A bug in the RateLimiter could cause it not to record the initial
  time and therefore avoid actually rate limiting.  This is fixed.
* The RateLimiter is now thread-safe.
* The default rate limit for AWS is changed to 2 requests/sec.
* Documentation for the 'rate' parameter for the AWS driver is added.
* Documentation for the 'rate' parameter for the Azure driver is
  corrected to describe the rate as requests/sec instead of delay
  between requests.

Change-Id: Ida2cbc59928e183eb7da275ff26d152eae784cfe
2022-06-22 13:28:58 -07:00
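A hedged sketch of the create/delete handling described above; the adapter structure is hypothetical, though run_instances and terminate_instances are the usual boto3 EC2 client calls:

from concurrent.futures import ThreadPoolExecutor

CREATE_WORKERS = 8   # capped so nodepool is not flooded with threads


class AwsAdapterSketch:
    def __init__(self, ec2_client):
        self.ec2 = ec2_client
        self.create_executor = ThreadPoolExecutor(max_workers=CREATE_WORKERS)

    def submit_create(self, run_args):
        # Returns a Future immediately; the create state machine polls
        # future.done() instead of blocking on the slow RunInstances call.
        return self.create_executor.submit(self.ec2.run_instances, **run_args)

    def delete_batch(self, instance_ids):
        # Batched deletes spend one rate-limit slot on many instances,
        # leaving more slots free for creates.
        self.ec2.terminate_instances(InstanceIds=list(instance_ids))
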
Zuul a4acb5644e Merge "Use Zuul-style ZooKeeper connections" 2022-05-23 22:56:54 +00:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Zuul 7129681181 Merge "Do not reset quota cache timestamp when invalid" 2022-05-19 15:46:57 +00:00
Joshua Watt 2c632af426 Do not reset quota cache timestamp when invalid
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.

This bug was preventing Quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, it prevented the QuotaException from being raised and the
node allocation failed outright.

A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.

This fixes that bug, but it does change the behavior of the OpenStack
driver when ignore-provider-quota is True and the cloud returns a quota
error.

Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
2022-05-10 15:04:25 -05:00
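A minimal sketch of the guard this change describes; the attribute and method names here are illustrative:

class QuotaCacheSketch:
    def __init__(self):
        self._cached_quota = None   # e.g. {'timestamp': ..., 'quota': ...}

    def invalidateQuotaCache(self):
        # Before the fix, this assumed a dict and raised TypeError when the
        # cache was None (ignore-provider-quota), masking the QuotaException.
        if self._cached_quota is not None:
            self._cached_quota['timestamp'] = 0
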
James E. Blair a62060a14e Measure rate limits from start times
The RateLimiter utility used by several drivers measures time from
the exit of one context manager to the beginning of the next.  This
means that if the API call itself takes substantial time, it will
be added to the expected delay.  In other words, if the expected rate
is one API call every two seconds, and the API call itself takes
one second, the actual rate will be one call every three seconds.

To more closely approximate the overall expected rate, measure from
start to start, so that the duration of the API call itself is
included.

Change-Id: Ia62a6bfa6a3e6cac65f0c20179edcfbc94a5dcc5
2022-05-02 13:29:23 -07:00
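Illustratively, a start-to-start rate limiter along the lines described above might look like this; a sketch, not the RateLimiter class itself:

import threading
import time


class StartToStartRateLimiter:
    def __init__(self, rate_per_second):
        self.delta = 1.0 / rate_per_second
        self.lock = threading.Lock()
        self.last_start = None

    def __enter__(self):
        with self.lock:
            now = time.monotonic()
            if self.last_start is not None:
                wait = self.delta - (now - self.last_start)
                if wait > 0:
                    time.sleep(wait)
            self.last_start = time.monotonic()

    def __exit__(self, exc_type, exc, tb):
        # Nothing recorded on exit: timing is start-to-start, so the API
        # call's own duration counts toward the interval.
        return False


limiter = StartToStartRateLimiter(rate_per_second=2)
with limiter:
    pass   # the API call goes here
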
Zuul 17d0112017 Merge "Add support for label quota" 2022-04-27 20:36:49 +00:00
James E. Blair 46e130fe1a Add more debug info to AWS driver
These changes are all in service of being able to better understand
AWS driver log messages:

* Use annotated loggers in the statemachine provider framework
  so that we see the request, node, and provider information
* Have the statemachine framework pass annotated loggers to the
  state machines themselves so that the above information is available
  for log messages on individual API calls
* Add optional performance information to the rate limit handler
  (delay and API call duration)
* Add some additional log entries to the AWS adapter

Also:

* Suppress boto logging by default in unit tests (it is verbose and
  usually not helpful)
* Add coverage of node deletion in the AWS driver tests

Change-Id: I0e6b4ad72d1af7f776da73c5dd2a50b40f60e4a2
2022-04-11 10:14:20 -07:00
Simon Westphahl 3874d0b5c0 Add support for label quota
This change adds support for deferring requests when a provider temporarily
cannot fulfill a request based on the requested labels. How the
label quota is used depends on the driver.

The reason for introducing the concept of label quota has to do with how
the static driver handled requests when a label wasn't available
immediately. Previously, Nodepool locked the request and created a node with
the requested label(s) in state BUILDING. When a node with matching
label(s) was returned, the static driver re-registered it using the
waiting nodes in state BUILDING.

The problem with this approach was that, as soon as the request was
locked, Zuul could no longer revise the relative priority of the
request. This could lead to the situation that later node requests were
prioritized when they had a lower initial relative priority.

To fix this the static provider will now use the label quota to defer
requests that can't be fulfilled with the existing free nodes.

Change-Id: I55f6655c0e34b7a3c136b270658009b4f4d76394
2022-03-31 15:03:41 +02:00
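A toy sketch of the label-quota check described above; the function and data shapes are hypothetical:

from collections import Counter


def has_label_quota(request_labels, free_node_labels):
    """free_node_labels: labels of currently registered, unassigned nodes."""
    available = Counter(free_node_labels)
    needed = Counter(request_labels)
    return all(available[label] >= count for label, count in needed.items())


# A request for two 'small-static' nodes would be deferred (left unlocked,
# so Zuul can still revise its relative priority) while only one matching
# node is free:
print(has_label_quota(['small-static'] * 2, ['small-static', 'large-static']))
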
Zuul f7b109abc9 Merge "QuotaInformation : abstract resource recording" 2021-06-28 21:46:20 +00:00
James E. Blair 667c239354 Azure: add quota support
This adds support for querying the provider quota as well as
getting resource usage information for hardware profiles.

This also includes a change to the general quota utils for
all drivers:

If a driver raises an exception when querying available quota
and we have a cached value, use the cached value.  This avoids
errors from transient provider API/network issues.

Change-Id: I7199ced512226aaf8bdb6106d41796a5cbca97c0
2021-05-27 16:18:33 -07:00
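A minimal sketch of the cached-value fallback described above; the adapter call and attribute names are assumptions, not the real quota utility code:

import logging
import time

log = logging.getLogger("sketch.quota")


class ProviderQuotaSketch:
    def __init__(self, adapter, ttl=300):
        self.adapter = adapter
        self.ttl = ttl
        self._cache = None
        self._cache_time = 0

    def get_provider_quota(self):
        if self._cache is not None and time.time() - self._cache_time < self.ttl:
            return self._cache
        try:
            quota = self.adapter.query_quota()   # hypothetical adapter call
        except Exception:
            if self._cache is not None:
                # Transient API/network issue: keep serving the cached value.
                log.exception("Quota query failed; using cached value")
                return self._cache
            raise
        self._cache = quota
        self._cache_time = time.time()
        return self._cache
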
Ian Wienand bf40e145e4 QuotaInformation : abstract resource recording
Change I670d4bfd7d35ce1109b3ee9b3342fb45ee283a79 added the
quota['compute'] property to ZK nodes for Zuul to keep track of.

Instead of accessing the property directly, this adds a
get_resources() function to QuotaInformation which returns it, and
makes a point of noting that the property is in use so that we don't
modify it in incompatible ways in the future.

This is a no-op refactor, but helpful in a follow-on change that also
adds this field to leaked nodes.

Change-Id: Id78b059cf2121e01e4cd444f6ad3834373cf7fb6
2021-04-20 16:09:18 +10:00
James E. Blair 63f38dfd6c Support threadless deletes
The launcher implements deletes using threads, and unlike with
launches, does not give drivers an opportunity to override that
and handle them without threads (as we want to do in the state
machine driver).

To correct this, we move the NodeDeleter class from the launcher
to driver utils, and add a new driver Provider method that returns
the NodeDeleter thread.  This is added in the base Provider class
so all drivers get this behavior by default.

In the state machine driver, we override the method so that instead
of returning a thread, we start a state machine and add it to a list
of state machines that our internal state machine runner thread
should drive.

Change-Id: Iddb7ed23c741824b5727fe2d89c9ddbfc01cd7d7
2021-03-21 14:39:01 -07:00
James E. Blair 7ce4dbe26d Add a state machine driver framework
This is intended to simplify efficient driver implementation by
isolating ZooKeeper and other internal Nodepool logic from cloud
implementation, while at the same time reducing the need for
threads.

It is based on the simple driver interface, but by using state
machines, it can accommodate clouds (like Azure and OpenStack) which
require multiple steps to create an instance.

It's currently only suitable for use with a single pool, because
the launcher's poolworkers are multiple threads.  However, we should
be able to collapse them into one thread which would make this safe
for use with multiple pools.

We may also need to adjust some of the sleep times in the launcher
to accommodate the idea that the pool worker threads will be more
active in driving the state machine polling.

Change-Id: Ia179220a71653c9e261342a262ff1e23e5408215
2021-03-19 10:47:10 -07:00
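A condensed sketch of the framework idea: each launch is a small state machine whose advance() performs one non-blocking step, and a single runner loop drives many machines. All names and the cloud interface are illustrative:

import time


class CreateStateMachineSketch:
    START, REQUESTED, COMPLETE = 'start', 'requested', 'complete'

    def __init__(self, cloud, label):
        self.cloud = cloud
        self.label = label
        self.state = self.START
        self.external_id = None

    def advance(self):
        # One small, non-blocking step per call; returns True when done.
        if self.state == self.START:
            self.external_id = self.cloud.submit_create(self.label)
            self.state = self.REQUESTED
        elif self.state == self.REQUESTED:
            if self.cloud.is_active(self.external_id):
                self.state = self.COMPLETE
        return self.state == self.COMPLETE


def run_state_machines(machines, interval=1.0):
    # A single thread can drive many pending create/delete machines.
    while machines:
        for sm in list(machines):
            if sm.advance():
                machines.remove(sm)
        time.sleep(interval)
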
James E. Blair 120f399ec6 Finish adding GCE quota support
This expands the simple driver with the calls necessary to get
the current quota limits and expected and current usage of flavors.
That's enough to handle real cloud quota.

This change also removes a leftover debug line.

Change-Id: I837f5ad8e697b48ad8138021e64e799c8be3c6fc
2020-08-12 11:00:42 -07:00
James E. Blair 9e9a5b9bfd Improve max-servers handling for GCE
The quota handling for the simple driver (used by GCE) only handles
max-servers, but even so, it still didn't take into consideration
currently building servers.  If a number of simultaneous requests
were received, it would try to build them all and eventually return
node failures for the ones that the cloud refused to build.

The OpenStack driver has a lot of nice quota handling methods which
do take currently building nodes into account.  This change moves
some of those methods into a new Provider mixin class for quota
support.  This class implements some handy methods which perform
the calculations and provides some abstract methods which providers
will need to implement in order to supply information.

The simple driver is updated to use this system, though it still
only supports max-servers for the moment.

Change-Id: I0ce742452914301552f4af5e92a3e36304a7e291
2020-06-21 06:38:50 -07:00
Simon Westphahl 52709d7070 Pass node request handler to launcher base class
All implementations of the NodeLauncher have access to the node request
handler instance, but it was not passed through to the base class.

In order to initialize the event log adapter, the NodeLauncher will now
receive the node request handler instead of the Zookeeper connection.

The Zookeeper connection is directly extracted from the request handler.

Change-Id: I8ee7335f50b64435c82a5f039989f98750874466
2020-01-22 09:49:52 +01:00
Simon Westphahl d90f0e745a Centralize logging adapters
Logging adapters were already in use for the NodeLauncher. In order to
expand logging of Zuul's event id to nodepool, we will first centralize
the existing adapters and refactor them similarly to how this is already
done in Zuul.

This will allow us to correlate node requests with other log messages
from Zuul via the event id, for faster and easier debugging.

Change-Id: Idc7468239a055c75074e876ea8c445b49c791131
2020-01-22 09:13:21 +01:00
Tobias Henkel f58a2a2b68
Fix leaked loggers
In the NodeLauncher we currently create a specific logger per node we
create. However loggers cannot be destroyed and will live for the
whole lifetime of the application. This means that we essentially leak
one Logger object per node we create.

The reason for introducing per-node loggers was that we could easily
filter the log by node id and get every message related to that node's
creation. This can also be accomplished with a LoggerAdapter, which is
already in use by Zuul for the same purpose.

Change-Id: Ieff99c242c1ec1ca89e2c16929cfc3237d6f1a78
2019-04-30 21:08:28 +02:00
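A minimal sketch of the LoggerAdapter approach described above; the adapter class and helper function are illustrative:

import logging


class NodeLoggingAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        return '[node: %s] %s' % (self.extra['node_id'], msg), kwargs


base_log = logging.getLogger("nodepool.NodeLauncher")   # one shared Logger


def log_for_node(node_id):
    # Cheap per-launch object; the underlying Logger is reused, not leaked.
    return NodeLoggingAdapter(base_log, {'node_id': node_id})


logging.basicConfig(level=logging.INFO)
log_for_node("0000001234").info("Launching node")
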
Tobias Henkel 64487baef0
Asynchronously update node statistics
We currently update the node statistics on every node launch or
delete. This cannot use caching at the moment because when the
statistics are updated we might end up pushing slightly outdated
data. If there is then no further update for a long time, we end up
with broken gauges. We already get update events from the node cache,
so we can use those to centrally trigger node statistics updates.

This is combined with leader election so there is only a single
launcher that keeps the statistics up to date. This ensures that
the statistics are not cluttered by several launchers pushing their
own, slightly different views into the stats.

As a side effect this reduces the runtime of a test that creates 200
nodes from 100s to 70s on my local machine.

Change-Id: I77c6edc1db45b5b45be1812cf19eea66fdfab014
2018-11-29 16:48:30 +01:00
Paul Belanger 87b0f66353
Use hostname for launch exception
By default, the hostname contains information about the provider and node
id; include this information to aid in debugging.

Change-Id: I4ddd8a83a9f50d856f78c562d86a83da6d78c759
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-07-25 13:15:28 -04:00
Tobias Henkel 31c276c234
Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the
tenant or with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request, and thus the zuul job, fail with
NODE_FAILURE.

Instead, we need to pause the handler, just as would happen due to the
quota calculation. We mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
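Schematically, the new handling looks like the sketch below; the exception class, state value, and pause flag are illustrative stand-ins for Nodepool's real ones:

class QuotaExceededSketch(Exception):
    pass


def launch_with_quota_handling(handler, node, launch):
    try:
        launch(node)
    except QuotaExceededSketch:
        # Not a NODE_FAILURE: the node is marked aborted and the handler is
        # paused, so the launch is rescheduled once quota frees up.
        node.state = 'aborted'
        handler.paused = True
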
Tobias Henkel e2f279275d
Move QuotaInformation to driver utils
The QuotaInformation class is a quite generic container for quota
information. This might be useful in future drivers like AWS or
Azure. Further, moving it to utils resolves a cyclic import between
the openstack handler and provider.

Change-Id: Ibbece39a0da74ae7b54e3a1088057e683f158a66
2018-06-06 20:21:28 +02:00
David Shrewsbury be0c59535a Simplify driver API
Further decouple the driver API from the provider configuration
by simplifying the driver interface (which will eventually lead to
the driver deciding whether or not to support multiple labels) and
removing config references in NodeLauncher.

This significantly simplifies the driver API by:

  * Reducing the launching of nodes to a launch() and launchesCompleted()
    interface. The launchesCompleted() method acts as a periodic check on
    whether the nodeset launches have all completed.

  * How the drivers actually launch nodes is decoupled from the driver
    API by not having the return of NodeRequestHandler.launch() be either
    a thread or None. This leaves it entirely up to the driver to decide
    the best way to manage launches... threads, subprocesses, messaging,
    etc. E.g., thread handling is moved from driver API into OpenStack driver.

  * Drivers no longer need to implement the very confusing pollLauncher()
    method. This was originally meant to poll the launch threads, but
    even drivers not using threads (e.g., the static driver) had to
    reimplement this method. That made no sense.

The NodeLauncher class (currently used only by the OpenStack driver),
is moved to a driver utility module that other drivers may choose to
use or ignore. And it no longer directly accesses the provider config.

The test_nodelaunchmanager.py tests were originally testing the
NodeLaunchManager class (removed quite some time ago), which managed
launch threads. This was eventually moved to the driver interface.
This change further moves thread management into the OpenStack driver.
As a result, these tests are redundant since the test_launcher.py
tests cover all of this. So remove test_nodelaunchmanager.py.

Change-Id: I407d999b3608e1614c0dbe49808b2a64756dde58
2018-05-18 11:24:59 -04:00
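A schematic sketch of the simplified interface, using the launch()/launchesCompleted() names from the description above; the thread-based body is just one possible driver choice, not the required implementation:

import threading


class RequestHandlerSketch:
    def __init__(self):
        self._threads = []

    def launch(self, node):
        # The driver decides how to launch: threads, subprocesses, state
        # machines, ...; nothing has to be returned to the framework.
        t = threading.Thread(target=self._launch_node, args=(node,))
        t.start()
        self._threads.append(t)

    def launchesCompleted(self):
        # Polled periodically by the framework; True ends the wait.
        return all(not t.is_alive() for t in self._threads)

    def _launch_node(self, node):
        pass   # provider-specific launch work would happen here
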