Commit Graph

85 Commits

Simon Westphahl 2aeaee92f1
Ignore unrelated error labels in request handler
Nodepool was declining node requests when other unrelated instance types
of a provider were unavailable:

    Declining node request <NodeRequest {... 'node_types': ['ubuntu'],
    ... }> due to ['node type(s) [ubuntu-invalid] not available']

To fix this we will check the error labels against the requested labels
before including them in the list of invalid node types.

Change-Id: I7bbb3b813ca82baf80821a9e84cc10385ea95a01
2023-11-09 13:58:54 +01:00
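
A minimal, self-contained sketch of the check described in the commit above (hypothetical function name, not the actual Nodepool handler code): only error labels that were actually requested are treated as invalid node types, so errors on unrelated instance types no longer cause a decline.

    def invalid_node_types(requested_labels, error_labels):
        """Return only the error labels the request actually asked for."""
        requested = set(requested_labels)
        return [label for label in error_labels if label in requested]


    # Mirrors the log excerpt above: the request asks for 'ubuntu' while an
    # unrelated 'ubuntu-invalid' type is erroring in the same provider.
    print(invalid_node_types(['ubuntu'], ['ubuntu-invalid']))  # -> []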
James E. Blair a602654482 Handle invalid AWS instance in quota lookup
Early nodepool performed all cloud operations within the context of
an accepted node request.  This meant that any disagreement between
the nodepool configuration and the cloud (such as what instance types,
images, networks, or other resources are actually available) would
be detected within that context and the request would be marked as
completed and failed.

When we added tenant quota support, we also added the possibility
of needing to interact with the cloud before accepting a request.
Specifically, we may now ask the cloud what resources are needed
for a given instance type (and volume, etc) before deciding whether
to accept a request.  If we raise an exception here, the launcher
will simply loop indefinitely.

To avoid this, we will add a new exception class to indicate a
permanent configuration error which was detected at runtime.  If
AWS says an instance type doesn't exist when we try to calculate
its quota, we mark it as permanently errored in the provider
configuration, then return empty quota information back to the
launcher.

This allows the launcher to accept the request, but then immediately
mark it as failed because the label isn't available.

The state of this error is stored on the provider manager, so the
typical corrective action of updating the configuration to correct
the label config means that a new provider will be spawned with an
empty error label list; the error state will be cleared and the
launcher will try again.

Finally, an extra exception handler is added to the main launcher
loop so that if any other unhandled errors slip through, the
request will be deferred and the launcher will continue processing
requests.

Change-Id: I9a5349203a337ab23159806762cb46c059fe4ac5
2023-07-18 13:51:13 -07:00
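
A rough sketch of the control flow described above, with made-up names (RuntimeConfigurationError, ProviderQuota) rather than the real driver classes: an instance type the cloud rejects is recorded as permanently errored and empty quota information is returned, so the launcher accepts the request and fails it cleanly instead of looping.

    class RuntimeConfigurationError(Exception):
        """A permanent configuration problem detected at runtime."""


    class ProviderQuota:
        def __init__(self, lookup):
            self._lookup = lookup      # e.g. a call into the cloud API
            self.error_labels = set()  # labels known to be misconfigured

        def quota_for(self, instance_type):
            if instance_type in self.error_labels:
                return {}              # empty quota; label is unavailable
            try:
                return self._lookup(instance_type)
            except RuntimeConfigurationError:
                # Remember the bad label.  A config update spawns a new
                # provider object with an empty error list, clearing this.
                self.error_labels.add(instance_type)
                return {}


    def fake_lookup(instance_type):
        if instance_type == 't3.bogus':
            raise RuntimeConfigurationError(instance_type)
        return {'cores': 2, 'ram': 4096, 'instances': 1}


    quota = ProviderQuota(fake_lookup)
    print(quota.quota_for('t3.bogus'))   # {}
    print(quota.quota_for('t3.micro'))   # {'cores': 2, 'ram': 4096, 'instances': 1}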
Zuul 2ea6f1ed15 Merge "Fix race condition with node reuse and multiple providers" 2023-06-28 23:56:14 +00:00
James E. Blair 218da4aa81 Use cache more aggressively when searching for ready nodes
When assigning node requests, the pool main loop attempts to find
existing ready nodes for each request it accepts before creating
a NodeLauncher (which is the class which usually causes asynchronous
node creation to happen).

Even though we iterate over every ready node in the system looking
for a match, this should be fast because:

* We only consider nodes from our provider+pool.
* We only consider nodes which are not already assigned (and newly
  created nodes are created with assigned_to already set).
* We use the node cache.

However, there is room for improvement:

* We could use cached_ids so that we don't have to list children.
* We can avoid trying to lock nodes that the cache says are already
  locked (this should generally be covered by the "assigned_to" check
  mentioned above, but this is a good extra check and is consistent
  with other new optimizations).
* We can ignore nodes in the cache with incomplete data.

The first and last items are likely to make the most difference.

It has been observed that in a production system we can end up spending
much more time than expected looking for ready nodes.  The most likely
cause for this is cache misses, since if we are able to use cached data
we would expect all the other checks to be dealt with quickly.  That
leaves nodes that have appeared very recently (so they show up in a
live get_children call but aren't in the cache yet, or if they are,
are in the cache with incomplete (empty) data) or nodes that have
disappeared very recently (so they showed up in our initial listing
but by the time we get around to trying to lock them, they have been
removed from the cache).

This also adds a log entry to the very chatty "cache event" log handler
in the case of a cache miss in order to aid future debugging.

Change-Id: I32a0f2d0c2f1b8bbbeae1441ee48c8320e3b9129
2023-05-30 14:43:12 -07:00
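
An illustrative, toy version of the three optimizations (the cache here is a plain dict keyed by node id, not Nodepool's ZooKeeper tree cache; all names are invented):

    def find_ready_node(cache, provider, pool, label):
        """Search cached node data for a reusable ready node."""
        for node_id, node in cache.items():   # cached ids; no live ZK listing
            if not node:                       # incomplete/empty cache data
                continue
            if node.get('locked') or node.get('assigned_to'):
                continue                       # don't even try to lock these
            if (node['provider'], node['pool']) != (provider, pool):
                continue
            if node['state'] == 'ready' and label in node['types']:
                return node_id
        return None


    cache = {
        '0001': {},                            # appeared recently; no data yet
        '0002': {'provider': 'cloud1', 'pool': 'main', 'state': 'ready',
                 'types': ['ubuntu'], 'locked': True, 'assigned_to': None},
        '0003': {'provider': 'cloud1', 'pool': 'main', 'state': 'ready',
                 'types': ['ubuntu'], 'locked': False, 'assigned_to': None},
    }
    print(find_ready_node(cache, 'cloud1', 'main', 'ubuntu'))  # -> '0003'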
Tobias Henkel b1351389e9
Fix race condition with node reuse and multiple providers
While running multiple nodepool replicas on the same provider we faced
a race condition that led to inconsistent nodes and an error during
locking nodes in zuul [1].

Upon log analysis, this is the sequence that led to the issue:
- provider 1 gets 200-0030735612
- provider 1 creates node 0032198309, attaches it to the node request and unlocks it
- provider 2 re-uses the node for 100-0030736038 because it can lock the node but doesn't
  yet have the updated node data, which makes it look unallocated to a node request
- zuul tries to lock the node and faces [1].

This can be solved by checking whether the node is still unallocated after
locking it, to make sure the node didn't change under us while we waited
for the lock.

[1] Zuul exception
2023-05-16 11:36:16,292 ERROR zuul.nodepool: [e: b1dcc890-f3da-11ed-852e-3c4f498e591d] Error locking nodes:
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/nodepool.py", line 429, in acceptNodes
    nodes = self.lockNodes(request, nodeset)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/nodepool.py", line 381, in lockNodes
    raise Exception("Node %s allocated to %s, not %s" %
Exception: Node 0032198309 allocated to 100-0030736038, not 200-0030735612

Change-Id: I774616a0481721e27e6b0db5168d3a248ba0b818
2023-05-16 14:14:21 +02:00
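
A simplified sketch of the lock-then-recheck pattern the fix describes, using a plain threading.Lock and toy objects instead of the ZooKeeper-backed Node class: after acquiring the lock, the allocation is checked again and the node is only reused if it is still unallocated.

    import threading


    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.allocated_to = None
            self.lock = threading.Lock()


    def try_reuse(node, request_id):
        """Reuse `node` for `request_id` only if it is still unallocated."""
        if not node.lock.acquire(blocking=False):
            return False
        try:
            # Re-check after locking: another provider may have allocated
            # the node between our (stale) read and the lock acquisition.
            if node.allocated_to is not None:
                return False
            node.allocated_to = request_id
            return True
        finally:
            node.lock.release()


    node = Node('0032198309')
    node.allocated_to = '100-0030736038'        # raced: reused elsewhere
    print(try_reuse(node, '200-0030735612'))    # False; leave the node alone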
James E. Blair be3edd3e17 Convert openstack driver to statemachine
This updates the OpenStack driver to use the statemachine framework.

The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:

1) We can dramatically reduce the number of threads in Nodepool which
is our biggest scaling bottleneck.  The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers.  The statemachine
framework is an extension of that idea implemented so that every driver
can use it.  This change further reduces the number of threads needed
even for the openstack driver.

2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.

There are a few updates to the statemachine framework to accommodate some
features that only the OpenStack driver used to date.

A number of tests need slight alteration since the openstack driver is
the basis of the "fake" driver used for tests.

Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
2023-01-10 10:30:14 -08:00
Clark Boylan 2a231a08c9 Add idle state to driver providers
This change adds an idle state to driver providers which is used to
indicate that the provider should stop performing actions that are not
safe to perform while we bootstrap a second newer version of the
provider to handle a config update.

This is particularly interesting for the static driver because it is
managing all of its state internally to nodepool and not relying on
external cloud systems to track resources. This means it is important
for the static provider to not have an old provider object update
zookeeper at the same time as a new provider object. This was previously
possible and created situations where the resources in zookeeper did
not reflect our local config.

Since all other drivers rely on external state the primary update here
is to the static driver. We simply stop performing config
synchronization if the idle flag is set on a static provider. This will
allow the new provider to take over reflecting the new config
consistently.

Note, we don't take other approaches and essentially create a system
specific to the static driver because we're trying to avoid modifying
the nodepool runtime significantly to fix a problem that is specific to
the static driver.

Change-Id: I93519d0c6f4ddf8a417d837f6ae12a30a55870bb
2022-10-24 15:30:31 -07:00
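
A condensed sketch of how the idle flag affects the static provider's config synchronization (class and method names are invented for illustration):

    class StaticProviderSketch:
        def __init__(self, config):
            self.config = config
            self._idle = False

        def idle(self):
            """Called before a replacement provider starts with new config."""
            self._idle = True

        def syncNodes(self, zk):
            # Once idled, stop reconciling ZooKeeper with our (now stale)
            # config; the replacement provider owns that from here on.
            if self._idle:
                return
            zk.registerStaticNodes(self.config)


    class FakeZK:
        def registerStaticNodes(self, config):
            print('syncing', config)


    provider = StaticProviderSketch({'nodes': ['host1.example.com']})
    provider.syncNodes(FakeZK())   # syncs
    provider.idle()
    provider.syncNodes(FakeZK())   # no-op after idling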
Zuul 04e5b2c2c6 Merge "Don't pause the static provider at quota" 2022-10-06 17:51:56 +00:00
James E. Blair f31a0dadf8 Cause providers to continue to decline requests when at quota
When a provider is at quota, we pause it, and paused means paused.
That means we don't do anything with any other requests.

Unfortunately, that includes requests that the given provider can't
even handle.  So if a provider pauses because it is at quota while
other providers continue to operate, and a request arrives for a node
type that no provider can handle, then that request will remain
outstanding until this provider becomes unpaused and can decline it.

Requests shouldn't need to wait so long to be declined by providers
which can never under any circumstances handle them.  To address this,
we will now run the assignHandlers method whether we are paused or
not.  Within assignHandlers, we will process all requests regardless
of whether we are paused (but we won't necessarily accept them yet).
We will decide whether a request will be declined or not, and if it
will be declined, we will do so regardless of whether we are paused.
Finally, only if we are unpaused and do not expect to decline the
request will we accept it.

Change-Id: Ied9e4577670ea65b1d5ecfef95a7f837a7b6ac61
2022-09-16 17:16:09 -07:00
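
The decision order can be sketched as follows (simplified and with invented names; the real assignHandlers works on handler objects, not bare labels):

    def assign_handlers(requests, paused, can_handle):
        """Process every request; decline unserviceable ones even when paused.

        `can_handle(request)` says whether this provider could ever satisfy
        the request.  Returns (declined, accepted) lists.
        """
        declined, accepted = [], []
        for request in requests:
            if not can_handle(request):
                declined.append(request)      # decline even while paused
            elif not paused:
                accepted.append(request)      # accept only when unpaused
            # else: leave the request alone until we unpause
        return declined, accepted


    declined, accepted = assign_handlers(
        ['ubuntu', 'no-such-label'], paused=True,
        can_handle=lambda label: label == 'ubuntu')
    print(declined, accepted)   # ['no-such-label'] []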
James E. Blair 1724d3a60f Don't pause the static provider at quota
The static provider avoids accepting requests for a label when
all the nodes of that label are busy.  In other words, it only
accepts requests when it expects to be able to fulfill them.

However, it also checks whether a node is reachable before fulfilling
a request.  And if the reachability of a node changes before a
request is fulfilled, then Nodepool will pause the handler until
the request is fulfilled.  If the node never comes back online,
no other nodes in the provider will be handled.

In general, we should never pause the static provider since pausing
is designed to allow for the accumulation of sufficient free resources
to fulfill the next request.  Static provider resources are not
fungible, so pausing until label A is available doesn't make label
B any less available.

To correct the problem, we no longer pause the static provider when
there is insufficient quota; instead we simply release the current
node request without declining it, and then allow the provider to
either defer it (if the node continues to be unreachable) or accept
it (if the node later becomes available again).

In other words, the handling of a request for a node with connection
problems will now be the same regardless of whether it is detected
on startup or while running.

Change-Id: I61565e9864efcfc81a205318d30972c25d4ea880
2022-09-12 13:02:40 -07:00
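
A toy decision function capturing the behavior change (invented names; the real handler deals with locks and ZooKeeper state): an unfulfillable static-node request is handed back unlocked and undeclined rather than pausing the pool.

    def handle_static_request(node_ready_and_reachable):
        """Return the handler's action for a single static-node request."""
        if node_ready_and_reachable:
            return 'fulfill'
        # Old behavior: 'pause' (blocking every other label in the pool).
        # New behavior: release the request so it can be deferred now and
        # accepted later, once the node is reachable again.
        return 'release'


    print(handle_static_request(False))   # release
    print(handle_static_request(True))    # fulfill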
James E. Blair 4aaacc7919 Fix ignore_provider_quota in statemachine drivers
The statemachine driver copied the quota handling from OpenStack,
which also included a check for the presence of the ignore_provider_quota
configuration attribute.  This attribute does not exist in any of
the statemachine drivers, and due to the way the check was constructed,
it meant that we defaulted to ignoring the provider quota rather than
actually checking it.

We should perform this check so that providers can exclude themselves
if there is no possibility of satisfying a request.  This corrects the
check.

Change-Id: I7ced3f749a1646ecbb2b80f93b26e61a4e8cd69a
2022-06-29 10:47:49 -07:00
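
One way to picture the described default inversion with getattr (this is an illustration of the bug pattern, not the actual driver code): when a config class doesn't define ignore_provider_quota at all, a permissive fallback silently disables the quota check.

    class StatemachinePoolConfig:
        """Has no ignore_provider_quota attribute at all."""


    pool = StatemachinePoolConfig()

    # Buggy pattern: a missing attribute falls back to True, so the
    # provider quota is never consulted.
    ignore_quota_buggy = getattr(pool, 'ignore_provider_quota', True)

    # Corrected pattern: default to actually checking the provider quota.
    ignore_quota_fixed = getattr(pool, 'ignore_provider_quota', False)

    print(ignore_quota_buggy, ignore_quota_fixed)   # True False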
James E. Blair 1323d0b556 Update some variable names
Now that the component we registered is a "pool", change the call
sites to use "launcher_pools" instead of "launchers".  This may
reduce some ambiguity.

(s/launcher/pool/ might still be ambiguous since it may not be clear
whether we're talking about our own pools or other pools; thus the
choice of "launcher_pool" for the variable name.)

Also, remove a redundant test assertion.

Change-Id: I865883cdb115bf72a3bd034d9290f60666d64b66
2022-05-23 13:30:50 -07:00
James E. Blair ea35fd5152 Add provider/pool priority support
This lets users configure providers which should fulfill requests
before other providers.  This facilitates using a less expensive
cloud before using a more expensive one.

The default priority is 100, to facilitate either raising above
or lowering below the default (while using only positive integers
in order to avoid confusion).

Change-Id: I969ea821e10a7773a0a8d135a4f13407319362ee
2022-05-23 13:28:21 -07:00
James E. Blair a612aa603c Add the component registry from Zuul
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).

This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.

Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
2022-05-23 07:41:27 -07:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Benjamin Schanzel 7ae6b34e71
Pass requestor data to Nodes
Propagate the NodeRequest.requestor to a new property Node.requestor
which usually holds the zuul_system_id, or "NodePool:min_ready".
This is set initially when first creating a node, and is updated when a
ready node is assigned to a request. This way we always know for which
requestor a node got allocated.

Change-Id: Ifd54a94bae39f31a70bdedce384a64c2a04495c1
2022-03-21 10:19:54 +01:00
James E. Blair e95b146d10 Switch docs theme to versioned RTD
To match change I2870450ffd02f55509fcc1297d050b09deafbfb9 in Zuul.

The default domain is changed to zuul, which uncovered a reference
error that is also fixed.

Change-Id: I71db35252d018feed41d9e87aa702c6daa61902b
2021-12-16 11:23:30 -08:00
Benjamin Schanzel ee90100852 Add Tenant-Scoped Resource Quota
This change adds the option to put quota on resources on a per-tenant
basis (i.e. Zuul tenants).

It adds a new top-level config structure ``tenant-resource-limits``
under which one can specify a number of tenants, each with
``max-servers``, ``max-cores``, and ``max-ram`` limits.  These limits
are valid globally, i.e., for all providers. This is contrary to
currently existing provider and pool quotas, which only are consindered
for nodes of the same provider.

Change-Id: I0c0154db7d5edaa91a9fe21ebf6936e14cef4db7
2021-09-01 09:07:43 +02:00
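
A small sketch of what a per-tenant, cross-provider limit check amounts to (invented function; the configured keys max-servers, max-cores and max-ram come from the commit message, the rest is illustrative):

    def tenant_has_capacity(tenant_limits, tenant_usage, needed):
        """Check a request against tenant limits that span all providers."""
        for key in ('servers', 'cores', 'ram'):
            limit = tenant_limits.get('max-' + key)
            if limit is None:
                continue                       # unlimited unless configured
            if tenant_usage.get(key, 0) + needed.get(key, 0) > limit:
                return False
        return True


    limits = {'max-servers': 10, 'max-cores': 40, 'max-ram': 81920}
    usage = {'servers': 9, 'cores': 36, 'ram': 4096}     # across providers
    print(tenant_has_capacity(limits, usage,
                              {'servers': 1, 'cores': 8, 'ram': 8192}))  # False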
Zuul d08f79d09e Merge "Azure: add quota support" 2021-06-18 21:47:24 +00:00
Zuul 27bbc91799 Merge "Azure: reconcile config objects" 2021-06-15 01:18:31 +00:00
Zuul ee22b88ab5 Merge "Support threadless deletes" 2021-06-11 14:19:00 +00:00
James E. Blair 667c239354 Azure: add quota support
This adds support for querying the provider quota as well as
getting resource usage information for hardware profiles.

This also includes a change to the general quota utils for
all drivers:

If a driver raises an exception when querying available quota
and we have a cached value, use the cached value.  This avoids
errors from transient provider API/network issues.

Change-Id: I7199ced512226aaf8bdb6106d41796a5cbca97c0
2021-05-27 16:18:33 -07:00
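
The cached-value fallback is generic enough to sketch in a few lines (illustrative class, not the actual quota utility code):

    class QuotaCache:
        def __init__(self, fetch):
            self._fetch = fetch      # queries the cloud for available quota
            self._cached = None

        def get(self):
            try:
                self._cached = self._fetch()
            except Exception:
                # Transient API/network problem: fall back to the last
                # good value if we have one, otherwise re-raise.
                if self._cached is None:
                    raise
            return self._cached


    calls = {'n': 0}

    def flaky_fetch():
        calls['n'] += 1
        if calls['n'] > 1:
            raise RuntimeError('transient API error')
        return {'cores': 100, 'instances': 20}


    cache = QuotaCache(flaky_fetch)
    print(cache.get())   # fresh value from the cloud
    print(cache.get())   # fetch fails; the cached value is returned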
James E. Blair f0eb830bb7 Log decline reason at info
Admins need to know why requests were declined, and major decisions
should be logged at info level.

Note that the complementary "Accepting node request" may be okay at
debug level as there is a more informative "Assigning node request"
message at info.

Change-Id: I67ac8c75e5581cc4da77cba42a98c5c3785640ff
2021-04-15 16:25:01 -07:00
James E. Blair 94fcc70a59 Azure: reconcile config objects
The config objects in the Azure driver have drifted a bit.  This
updates them to match the actual used configuration.  It also
reorganizes them to be a little easier to maintain by moving the
initializers into the individual objects.

Finally, the verbose __eq__ methods are removed in favor of a
simpler __eq__ method in the superclass.

Since the OpenStack, k8s, and OpenShift drivers call super() in their
__eq__ methods, they need to be updated at the same time.

This also corrects an unrelated error with a misnamed parameter
in the fake k8s used in the k8s tests.

Change-Id: Id6971ca002879d3fb056fedc7e4ca6ec35dd7434
2021-03-22 10:39:53 -07:00
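
The "simpler __eq__ method in the superclass" idea can be sketched like this (not the exact Nodepool implementation):

    class ConfigValueSketch:
        def __eq__(self, other):
            # One generic comparison of all instance attributes replaces
            # the verbose, hand-written field-by-field __eq__ methods.
            return isinstance(other, type(self)) and vars(self) == vars(other)

        def __ne__(self, other):
            return not self.__eq__(other)


    class AzurePoolSketch(ConfigValueSketch):
        def __init__(self, name, max_servers):
            self.name = name
            self.max_servers = max_servers


    print(AzurePoolSketch('main', 10) == AzurePoolSketch('main', 10))   # True
    print(AzurePoolSketch('main', 10) == AzurePoolSketch('main', 20))   # False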
James E. Blair 63f38dfd6c Support threadless deletes
The launcher implements deletes using threads, and unlike with
launches, does not give drivers an opportunity to override that
and handle them without threads (as we want to do in the state
machine driver).

To correct this, we move the NodeDeleter class from the launcher
to driver utils, and add a new driver Provider method that returns
the NodeDeleter thread.  This is added in the base Provider class
so all drivers get this behavior by default.

In the state machine driver, we override the method so that instead
of returning a thread, we start a state machine and add it to a list
of state machines that our internal state machine runner thread
should drive.

Change-Id: Iddb7ed23c741824b5727fe2d89c9ddbfc01cd7d7
2021-03-21 14:39:01 -07:00
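
A schematic of the override point described above (class and method names are invented): the base provider keeps the one-thread-per-delete behavior, while the state machine driver returns no thread and queues a state machine for its runner thread instead.

    import threading


    class BaseProviderSketch:
        def startNodeDeletion(self, node):
            # Default behavior: one thread per delete, as before.
            thread = threading.Thread(target=lambda: print('deleting', node))
            thread.start()
            return thread


    class StateMachineProviderSketch(BaseProviderSketch):
        def __init__(self):
            self.delete_state_machines = []   # driven by one runner thread

        def startNodeDeletion(self, node):
            # No thread: enqueue a delete state machine instead.
            self.delete_state_machines.append(('delete', node))
            return None


    provider = StateMachineProviderSketch()
    provider.startNodeDeletion('0000000001')
    print(provider.delete_state_machines)   # [('delete', '0000000001')]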
Tobias Henkel 64a8c138dd
Mention node id when unlock failed
Nodepool currently logs an exception trace when it cannot unlock a
node. However, this doesn't include the node id, which would be useful
for debugging, so add that information. Do the same for failed node
allocation clearing.

Change-Id: I437f4be6cdd214f598410536753cc1edbcf730f8
2021-02-25 22:01:57 +01:00
James E. Blair 9e9a5b9bfd Improve max-servers handling for GCE
The quota handling for the simple driver (used by GCE) only handles
max-servers, but even so, it still didn't take into consideration
currently building servers.  If a number of simultaneous requests
were received, it would try to build them all and eventually return
node failures for the ones that the cloud refused to build.

The OpenStack driver has a lot of nice quota handling methods which
do take currently building nodes into account.  This change moves
some of those methods into a new Provider mixin class for quota
support.  This class implements some handy methods which perform
the calculations and provides some abstract methods which providers
will need to implement in order to supply information.

The simple driver is updated to use this system, though it still
only supports max-servers for the moment.

Change-Id: I0ce742452914301552f4af5e92a3e36304a7e291
2020-06-21 06:38:50 -07:00
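
A skeleton of the mixin idea (simplified; the abstract hooks and names are illustrative, not the real quota-support API):

    from abc import ABC, abstractmethod


    class QuotaSupportSketch(ABC):
        """Shared quota arithmetic; drivers supply provider-specific data."""

        @abstractmethod
        def quotaNeededByLabel(self, label):
            """Resources one node of `label` would consume."""

        def hasRemainingQuota(self, label, limit, used):
            # `used` must include nodes that are still building, which is
            # exactly what the old max-servers-only check missed.
            needed = self.quotaNeededByLabel(label)
            return all(used.get(k, 0) + v <= limit.get(k, 0)
                       for k, v in needed.items())


    class SimpleDriverQuotaSketch(QuotaSupportSketch):
        def quotaNeededByLabel(self, label):
            return {'instances': 1}     # max-servers only, for now


    q = SimpleDriverQuotaSketch()
    print(q.hasRemainingQuota('debian-stretch',
                              limit={'instances': 8},
                              used={'instances': 8}))   # False: 8 ready+building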
David Shrewsbury aba9b4e134 Fix for clearing assigned nodes that have vanished
If a node request disappears, we are assuming the node znodes that
have been assigned to it are still around, but it turns out they
may not be, for various odd reasons. Attempting to store to them will
cause errors. We need to handle that.

Change-Id: I10dbce96e4d789a0d1d8d82d72a780fb63b66d62
2020-02-27 17:31:49 -05:00
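
The defensive handling amounts to tolerating a znode that has already vanished when storing; with the kazoo client, for example, it looks roughly like this (a sketch, assuming a connected kazoo client is passed in as `zk`):

    import json

    from kazoo.exceptions import NoNodeError


    def store_node(zk, path, data):
        """Store node data, tolerating a znode that no longer exists."""
        try:
            zk.set(path, json.dumps(data).encode('utf8'))
            return True
        except NoNodeError:
            # The node request disappeared and its znodes were cleaned up
            # underneath us; there is nothing left to update.
            return False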
Simon Westphahl b628195d04 Annotate logs in node request handler
Request-related logs are annotated with the event and node request
id to improve traceability.

Change-Id: I8fdb00f6a446db87e05d73b93afec815ead0f3ee
2020-01-27 15:27:58 +01:00
Zuul 2b4fdb03d9 Merge "Remove unused use_taskmanager flag" 2019-04-05 19:58:37 +00:00
David Shrewsbury 7a650001e9 Fix debug log message
The parameter was missing.

Change-Id: If4de197e54f152ef791748783b22e19cbbd98186
2019-04-04 14:05:19 -04:00
Monty Taylor 7618b714e2 Remove unused use_taskmanager flag
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.

Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
2019-04-02 12:11:07 +00:00
David Shrewsbury d6ef934b70 Extract common config parsing for ProviderConfig
Adds a ProviderConfig class method that can be called to get
the config schema for the common config options in a Provider.
Drivers are modified to call this method.

Change-Id: Ib67256dddc06d13eb7683226edaa8c8c10a73326
2019-01-07 12:34:05 +00:00
David Shrewsbury a19dffd916 Extract out common config parsing for ConfigPool
Our driver code is in a less-than-ideal situation where each driver
is responsible for parsing config options that are common to all
drivers. This change begins to correct that, starting with ConfigPool.
It changes the driver API in the following ways:

1) Forces objects derived from ConfigPool to implement a load() method
   that should call super's method, then handle loading driver-specific
   options from the config.

2) Adds a ConfigPool class method that can be called to get the config
   schema for the common config options, leaving drivers to define only
   the schema for their own config options.

Other base config objects will be modeled after this pattern in
later changes.

Change-Id: I41620590c355cacd2c4fbe6916acfe80f20e3216
2019-01-03 11:05:26 -05:00
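
The pattern described in (1) and (2) looks roughly like this (simplified; the real code builds voluptuous schemas, here replaced by a plain dict):

    class ConfigPoolSketch:
        @classmethod
        def getCommonSchemaDict(cls):
            # Options every driver's pool section shares.
            return {'name': str, 'max-servers': int, 'node-attributes': dict}

        def load(self, pool_config):
            # Common options are parsed here, once, for all drivers.
            self.name = pool_config['name']
            self.max_servers = pool_config.get('max-servers')


    class StaticPoolSketch(ConfigPoolSketch):
        @classmethod
        def getSchema(cls):
            schema = cls.getCommonSchemaDict()
            schema['nodes'] = list          # driver-specific option
            return schema

        def load(self, pool_config):
            super().load(pool_config)       # required: call super's load()
            self.nodes = pool_config.get('nodes', [])


    pool = StaticPoolSketch()
    pool.load({'name': 'main', 'max-servers': 2,
               'nodes': ['host1.example.com']})
    print(pool.name, pool.max_servers, pool.nodes)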
David Shrewsbury 16325d5c4c Add arbitrary node attributes config option
This config option, available under each provider pool section, can
contain static key-value pairs that will be stored in ZooKeeper on
each Node znode. This will allow us to pass along arbitrary data from
nodepool to any user of nodepool (specifically, zuul). Initially, this
will be used to pass along zone information to zuul executors.

Change-Id: I126d37a8c0a4f44dca59c11f76a583b9181ab653
2018-11-29 09:35:59 -05:00
Tobias Henkel e4f245eb13
Add extra safety belt when reusing a node
When reusing a node we lock it but don't check if its state is still
ready. There are probably few, if any, ways this can happen,
but it's better to guarantee this condition than to assume that it
should be ok.

Change-Id: I365de6accb28b538dae5ff1dfb6c4a9966e50b01
2018-11-26 19:38:10 +01:00
Tobias Henkel 25f26165eb Remove unneeded todo comment
This removes a todo about request deferring. This is not valid at this
point because we have already locked the request and, furthermore, are in a loop
that must run fast.

Change-Id: I2858a418fe0ed953e4a300c6d27fca10b65194ef
2018-10-17 14:22:51 +00:00
Tobias Henkel 552cc28f7f
Use quota handling code for min-ready declines
We currently check the current pool nodes against max_servers in order
to decline min-ready requests when we're at quota. This check predates
the real quota handling, when max-servers was the only limit. Now that
we have real quota handling, we should defer this min-ready check to
the point where we know that we're at quota (independent of self-defined
pool quota or real cloud quota).

Change-Id: Ic36ecfaf441436db6699297ef459c586f14c2dfc
2018-10-17 16:20:42 +02:00
David Shrewsbury 81436b58c6 Add logging when pausing due to node abort
We can see the unpause in the logs, but not the pause.

Change-Id: I100764bc7a8117b63e3a644c512722d5a6b255c3
2018-07-23 15:17:00 -04:00
Tobias Henkel ce9120192b
Cleanup test_over_quota
The test_over_quota has some debug lines that should be removed. Also
make the copy of the nodeset more intuitive.

Change-Id: I9a168bc1446e3bb479cace68ed932401ebcdd2bf
2018-07-06 19:35:02 +02:00
Tobias Henkel 31c276c234
Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool never can be perfectly accurate
because there still can be some races with other launchers in the
tenant or by other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request and thus the zuul job fail with
NODE_FAILURE.

Instead we need to pause the handler, just as would happen because of
the quota calculation. We mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
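
A condensed sketch of the new relaunch flow (toy state strings and an invented QuotaException; the real driver uses its own exception and node states):

    class QuotaException(Exception):
        """Stand-in for the cloud refusing a launch because of quota."""


    def launch(node, handler, cloud_create):
        try:
            cloud_create(node)
            node['state'] = 'ready'
        except QuotaException:
            # Not a failure: abort the node and pause the handler so the
            # launch is rescheduled once the quota calculation allows it,
            # instead of retrying immediately and failing the request.
            node['state'] = 'aborted'
            handler['paused'] = True


    def refuse(node):
        raise QuotaException('quota exceeded')


    node, handler = {'state': 'building'}, {'paused': False}
    launch(node, handler, refuse)
    print(node['state'], handler['paused'])   # aborted True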
Zuul 729f34fced Merge "launcher: add pool quota debug and log information" 2018-07-03 07:28:50 +00:00
Tristan Cacqueray 24bf9d1f0e launcher: add pool quota debug and log information
This change logs the current pool quota at DEBUG and adds an
INFO log when a pool doesn't have enough quota to satisfy a request.

Change-Id: I78445f21f251d206d4fb0dce817a992685c8b319
2018-07-01 01:16:32 +00:00
David Shrewsbury c08eb291f7 Fix for pools with different labels
Pools within the same provider that have different labels are
attempting to handle requests for ALL labels defined within the provider.
This is due to change I61263f51be5e957de71d1e2dabaa7391bbe7bddf, which
incorrectly changed the source of supported labels from per-pool to
per-provider. This modifies the getSupportedLabels() API to support
pools.

Change-Id: I7a0d472928c6b528f6faa6dd3b9cf1479874cb22
2018-06-28 15:04:35 -04:00
Zuul 3c92ac841e Merge "Make a proper driver notification API" 2018-06-21 19:01:29 +00:00
Zuul 4b7f348b76 Merge "Pass zk connection to ProviderManager.start()" 2018-06-21 19:00:24 +00:00
David Shrewsbury 58ccd2c718 Fix paused handler exception handling
If a request handler is paused and an exception is thrown during
normal event processing, the exception handling code was not properly unpausing
the request and removing it from the list of active handlers.

Change-Id: I0cfd8294f9ce88e5918654a4847776c981ee4fb3
2018-06-15 09:05:04 -04:00
David Shrewsbury 326de1eedb Make a proper driver notification API
We sometimes have need to send notifications to certain driver objects.
A "notification" is a call to an object to indicate some condition has
changed, and the callee may choose to act on that, or totally ignore it.
Currently, the only notification is NodeRequestHandler.nodeReused() but
this adds another notification for the Provider objects in preparation
for static node driver changes.

Since it is entirely probable we may want more in the future, this
change organizes the notifications into proper classes for easy
identification/manipulation. It will be a naming standard to end these
notification method names with "Notification".

Change-Id: I905ef42bb434435cbe3cec0f5007981a5b789769
2018-06-13 10:22:11 -04:00
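
The naming standard can be illustrated with a pair of notification classes (the class and method names below are examples of the convention, not a listing of the actual API):

    class ProviderNotificationsSketch:
        """Hooks a driver's Provider may act on or simply ignore."""

        def nodeDeletedNotification(self, node):
            pass


    class NodeRequestHandlerNotificationsSketch:
        """Hooks for request handlers, e.g. when a ready node is reused."""

        def nodeReusedNotification(self, node):
            pass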
David Shrewsbury a418aabb7a Pass zk connection to ProviderManager.start()
In order to support static node pre-registration, we need to give
the provider manager the opportunity to register/deregister any
nodes in its configuration file when it starts (on startup or when
the config changes). It will need a ZooKeeper connection to do this.
The OpenStack driver will ignore this parameter.

Change-Id: Idd00286b2577921b3fe5b55e8f13a27f2fbde5d6
2018-06-12 12:04:16 -04:00
Zuul e17e4c36b3 Merge "Fix 'satisfy' spelling errors" 2018-06-11 13:42:28 +00:00