Commit Graph

85 Commits

Simon Westphahl 2aeaee92f1
Ignore unrelated error labels in request handler
Nodepool was declining node requests when other unrelated instance types
of a provider were unavailable:

    Declining node request <NodeRequest {... 'node_types': ['ubuntu'],
    ... }> due to ['node type(s) [ubuntu-invalid] not available']

To fix this we will check the error labels against the requested labels
before including them in the list of invalid node types.

Change-Id: I7bbb3b813ca82baf80821a9e84cc10385ea95a01
2023-11-09 13:58:54 +01:00
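
A minimal, self-contained sketch of the check described in the commit above (hypothetical function name, not the actual Nodepool handler code): only error labels that were actually requested are treated as invalid node types, so errors on unrelated instance types no longer cause a decline.

    def invalid_node_types(requested_labels, error_labels):
        """Return only the error labels the request actually asked for."""
        requested = set(requested_labels)
        return [label for label in error_labels if label in requested]


    # Mirrors the log excerpt above: the request asks for 'ubuntu' while an
    # unrelated 'ubuntu-invalid' type is erroring in the same provider.
    print(invalid_node_types(['ubuntu'], ['ubuntu-invalid']))  # -> []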
James E. Blair a602654482 Handle invalid AWS instance in quota lookup
Early nodepool performed all cloud operations within the context of
an accepted node request.  This meant that any disagreement between
the nodepool configuration and the cloud (such as what instance types,
images, networks, or other resources are actually available) would
be detected within that context and the request would be marked as
completed and failed.

When we added tenant quota support, we also added the possibility
of needing to interact with the cloud before accepting a request.
Specifically, we may now ask the cloud what resources are needed
for a given instance type (and volume, etc) before deciding whether
to accept a request.  If we raise an exception here, the launcher
will simply loop indefinitely.

To avoid this, we will add a new exception class to indicate a
permanent configuration error which was detected at runtime.  If
AWS says an instance type doesn't exist when we try to calculate
its quota, we mark it as permanently errored in the provider
configuration, then return empty quota information back to the
launcher.

This allows the launcher to accept the request, but then immediately
mark it as failed because the label isn't available.

The state of this error is stored on the provider manager, so the
typical corrective action of updating the configuration to correct
the label config means that a new provider will be spawned with an
empty error label list; the error state will be cleared and the
launcher will try again.

Finally, an extra exception handler is added to the main launcher
loop so that if any other unhandled errors slip through, the
request will be deferred and the launcher will continue processing
requests.

Change-Id: I9a5349203a337ab23159806762cb46c059fe4ac5
2023-07-18 13:51:13 -07:00
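
A rough sketch of the control flow described above, with made-up names (RuntimeConfigurationError, ProviderQuota) rather than the real driver classes: an instance type the cloud rejects is recorded as permanently errored and empty quota information is returned, so the launcher accepts the request and fails it cleanly instead of looping.

    class RuntimeConfigurationError(Exception):
        """A permanent configuration problem detected at runtime."""


    class ProviderQuota:
        def __init__(self, lookup):
            self._lookup = lookup      # e.g. a call into the cloud API
            self.error_labels = set()  # labels known to be misconfigured

        def quota_for(self, instance_type):
            if instance_type in self.error_labels:
                return {}              # empty quota; label is unavailable
            try:
                return self._lookup(instance_type)
            except RuntimeConfigurationError:
                # Remember the bad label.  A config update spawns a new
                # provider object with an empty error list, clearing this.
                self.error_labels.add(instance_type)
                return {}


    def fake_lookup(instance_type):
        if instance_type == 't3.bogus':
            raise RuntimeConfigurationError(instance_type)
        return {'cores': 2, 'ram': 4096, 'instances': 1}


    quota = ProviderQuota(fake_lookup)
    print(quota.quota_for('t3.bogus'))   # {}
    print(quota.quota_for('t3.micro'))   # {'cores': 2, 'ram': 4096, 'instances': 1}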
Zuul 2ea6f1ed15 Merge "Fix race condition with node reuse and multiple providers" 2023-06-28 23:56:14 +00:00
James E. Blair 218da4aa81 Use cache more aggressively when searching for ready nodes
When assigning node requests, the pool main loop attempts to find
existing ready nodes for each request it accepts before creating
a NodeLauncher (which is the class which usually causes asynchronous
node creation to happen).

Even though we iterate over every ready node in the system looking
for a match, this should be fast because:

* We only consider nodes from our provider+pool.
* We only consider nodes which are not already assigned (and newly
  created nodes are created with assigned_to already set).
* We use the node cache.

However, there is room for improvement:

* We could use cached_ids so that we don't have to list children.
* We can avoid trying to lock nodes that the cache says are already
  locked (this should generally be covered by the "assigned_to" check
  mentioned above, but this is a good extra check and is consistent
  with other new optimizations).
* We can ignore nodes in the cache with incomplete data.

The first and last items are likely to make the most difference.

It has been observed that in a production system we can end up spending
much more time than expected looking for ready nodes.  The most likely
cause for this is cache misses, since if we are able to use cached data
we would expect all the other checks to be dealt with quickly.  That
leaves nodes that have appeared very recently (so they show up in a
live get_children call but aren't in the cache yet, or if they are,
are in the cache with incomplete (empty) data) or nodes that have
disappeared very recently (so they showed up in our initial listing
but by the time we get around to trying to lock them, they have been
removed from the cache).

This also adds a log entry to the very chatty "cache event" log handler
in the case of a cache miss in order to aid future debugging.

Change-Id: I32a0f2d0c2f1b8bbbeae1441ee48c8320e3b9129
2023-05-30 14:43:12 -07:00
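
An illustrative, toy version of the three optimizations (the cache here is a plain dict keyed by node id, not Nodepool's ZooKeeper tree cache; all names are invented):

    def find_ready_node(cache, provider, pool, label):
        """Search cached node data for a reusable ready node."""
        for node_id, node in cache.items():   # cached ids; no live ZK listing
            if not node:                       # incomplete/empty cache data
                continue
            if node.get('locked') or node.get('assigned_to'):
                continue                       # don't even try to lock these
            if (node['provider'], node['pool']) != (provider, pool):
                continue
            if node['state'] == 'ready' and label in node['types']:
                return node_id
        return None


    cache = {
        '0001': {},                            # appeared recently; no data yet
        '0002': {'provider': 'cloud1', 'pool': 'main', 'state': 'ready',
                 'types': ['ubuntu'], 'locked': True, 'assigned_to': None},
        '0003': {'provider': 'cloud1', 'pool': 'main', 'state': 'ready',
                 'types': ['ubuntu'], 'locked': False, 'assigned_to': None},
    }
    print(find_ready_node(cache, 'cloud1', 'main', 'ubuntu'))  # -> '0003'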
Tobias Henkel b1351389e9
Fix race condition with node reuse and multiple providers
While running multiple nodepool replicas on the same provider we faced
a race condition that led to inconsistent nodes and an error during
locking nodes in zuul [1].

Upon log analysis, this is the sequence that led to the issue:
- provider 1 gets 200-0030735612
- provider 1 creates node 0032198309, attaches it to the node request and unlocks it
- provider 2 re-uses the node for 100-0030736038 because it can lock the node but doesn't
  yet have the updated node data, which makes it look unallocated to a node request
- zuul tries to lock the node and faces [1].

This can be solved by checking whether the node is still unallocated after
locking it, to make sure the node didn't change under us while we waited
for the lock.

[1] Zuul exception
2023-05-16 11:36:16,292 ERROR zuul.nodepool: [e: b1dcc890-f3da-11ed-852e-3c4f498e591d] Error locking nodes:
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/nodepool.py", line 429, in acceptNodes
    nodes = self.lockNodes(request, nodeset)
  File "/opt/zuul/lib/python3.10/site-packages/zuul/nodepool.py", line 381, in lockNodes
    raise Exception("Node %s allocated to %s, not %s" %
Exception: Node 0032198309 allocated to 100-0030736038, not 200-0030735612

Change-Id: I774616a0481721e27e6b0db5168d3a248ba0b818
2023-05-16 14:14:21 +02:00
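
A simplified sketch of the lock-then-recheck pattern the fix describes, using a plain threading.Lock and toy objects instead of the ZooKeeper-backed Node class: after acquiring the lock, the allocation is checked again and the node is only reused if it is still unallocated.

    import threading


    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.allocated_to = None
            self.lock = threading.Lock()


    def try_reuse(node, request_id):
        """Reuse `node` for `request_id` only if it is still unallocated."""
        if not node.lock.acquire(blocking=False):
            return False
        try:
            # Re-check after locking: another provider may have allocated
            # the node between our (stale) read and the lock acquisition.
            if node.allocated_to is not None:
                return False
            node.allocated_to = request_id
            return True
        finally:
            node.lock.release()


    node = Node('0032198309')
    node.allocated_to = '100-0030736038'        # raced: reused elsewhere
    print(try_reuse(node, '200-0030735612'))    # False; leave the node alone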
James E. Blair be3edd3e17 Convert openstack driver to statemachine
This updates the OpenStack driver to use the statemachine framework.

The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:

1) We can dramatically reduce the number of threads in Nodepool which
is our biggest scaling bottleneck.  The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers.  The statemachine
framework is an extension of that idea implemented so that every driver
can use it.  This change further reduces the number of threads needed
even for the openstack driver.

2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.

There are a few updates to the statemachine framework to accommodate some
features that only the OpenStack driver used to date.

A number of tests need slight alteration since the openstack driver is
the basis of the "fake" driver used for tests.

Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
2023-01-10 10:30:14 -08:00
Clark Boylan 2a231a08c9 Add idle state to driver providers
This change adds an idle state to driver providers which is used to
indicate that the provider should stop performing actions that are not
safe to perform while we bootstrap a second newer version of the
provider to handle a config update.

This is particularly interesting for the static driver because it is
managing all of its state internally to nodepool and not relying on
external cloud systems to track resources. This means it is important
for the static provider to not have an old provider object update
zookeeper at the same time as a new provider object. This was previously
possible and created situations where the resources in zookeeper did
not reflect our local config.

Since all other drivers rely on external state the primary update here
is to the static driver. We simply stop performing config
synchronization if the idle flag is set on a static provider. This will
allow the new provider to take over reflecting the new config
consistently.

Note, we don't take other approaches and essentially create a system
specific to the static driver because we're trying to avoid modifying
the nodepool runtime significantly to fix a problem that is specific to
the static driver.

Change-Id: I93519d0c6f4ddf8a417d837f6ae12a30a55870bb
2022-10-24 15:30:31 -07:00
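
A condensed sketch of how the idle flag affects the static provider's config synchronization (class and method names are invented for illustration):

    class StaticProviderSketch:
        def __init__(self, config):
            self.config = config
            self._idle = False

        def idle(self):
            """Called before a replacement provider starts with new config."""
            self._idle = True

        def syncNodes(self, zk):
            # Once idled, stop reconciling ZooKeeper with our (now stale)
            # config; the replacement provider owns that from here on.
            if self._idle:
                return
            zk.registerStaticNodes(self.config)


    class FakeZK:
        def registerStaticNodes(self, config):
            print('syncing', config)


    provider = StaticProviderSketch({'nodes': ['host1.example.com']})
    provider.syncNodes(FakeZK())   # syncs
    provider.idle()
    provider.syncNodes(FakeZK())   # no-op after idling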
Zuul 04e5b2c2c6 Merge "Don't pause the static provider at quota" 2022-10-06 17:51:56 +00:00
James E. Blair f31a0dadf8 Cause providers to continue to decline requests when at quota
When a provider is at quota, we pause it, and paused means paused.
That means we don't do anything with any other requests.

Unfortunately, that includes requests that the given provider can't
even handle.  So if a provider pauses because it is at quota while
other providers continue to operate, and a request arrives for a node
type that no provider can handle, then that request will remain
outstanding until this provider becomes unpaused and can decline it.

Requests shouldn't need to wait so long to be declined by providers
which can never under any circumstances handle them.  To address this,
we will now run the assignHandlers method whether we are paused or
not.  Within assignHandlers, we will process all requests regardless
of whether we are paused (but we won't necessarily accept them yet).
We will decide whether a request will be declined or not, and if it
will be declined, we will do so regardless of whether we are paused.
Finally, only if we are unpaused and do not expect to decline the
request will we accept it.

Change-Id: Ied9e4577670ea65b1d5ecfef95a7f837a7b6ac61
2022-09-16 17:16:09 -07:00
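
The decision order can be sketched as follows (simplified and with invented names; the real assignHandlers works on handler objects, not bare labels):

    def assign_handlers(requests, paused, can_handle):
        """Process every request; decline unserviceable ones even when paused.

        `can_handle(request)` says whether this provider could ever satisfy
        the request.  Returns (declined, accepted) lists.
        """
        declined, accepted = [], []
        for request in requests:
            if not can_handle(request):
                declined.append(request)      # decline even while paused
            elif not paused:
                accepted.append(request)      # accept only when unpaused
            # else: leave the request alone until we unpause
        return declined, accepted


    declined, accepted = assign_handlers(
        ['ubuntu', 'no-such-label'], paused=True,
        can_handle=lambda label: label == 'ubuntu')
    print(declined, accepted)   # ['no-such-label'] []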
James E. Blair 1724d3a60f Don't pause the static provider at quota
The static provider avoids accepting requests for a label when
all the nodes of that label are busy.  In other words, it only
accepts requests when it expects to be able to fulfill them.

However, it also checks whether a node is reachable before fulfilling
a request.  And if the reachability of a node changes before a
request is fulfilled, then Nodepool will pause the handler until
the request is fulfilled.  If the node never comes back online,
no other nodes in the provider will be handled.

In general, we should never pause the static provider since pausing
is designed to allow for the accumulation of sufficient free resources
to fulfill the next request.  Static provider resources are not
fungible, so pausing until label A is available doesn't make label
B any less available.

To correct the problem, we no longer pause the static provider when
there is insufficient quota; instead we simply release the current
node request without declining it, and then allow the provider to
either defer it (if the node continues to be unreachable) or accept
it (if the node later becomes available again).

In other words, the handling of a request for a node with connection
problems will now be the same regardless of whether it is detected
on startup or while running.

Change-Id: I61565e9864efcfc81a205318d30972c25d4ea880
2022-09-12 13:02:40 -07:00
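
A toy decision function capturing the behavior change (invented names; the real handler deals with locks and ZooKeeper state): an unfulfillable static-node request is handed back unlocked and undeclined rather than pausing the pool.

    def handle_static_request(node_ready_and_reachable):
        """Return the handler's action for a single static-node request."""
        if node_ready_and_reachable:
            return 'fulfill'
        # Old behavior: 'pause' (blocking every other label in the pool).
        # New behavior: release the request so it can be deferred now and
        # accepted later, once the node is reachable again.
        return 'release'


    print(handle_static_request(False))   # release
    print(handle_static_request(True))    # fulfill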
James E. Blair 4aaacc7919 Fix ignore_provider_quota in statemachine drivers
The statemachine driver copied the quota handling from OpenStack,
which also included a check for the presence of the ignore_provider_quota
configuration attribute.  This attribute does not exist in any of
the statemachine drivers, and due to the way the check was constructed,
it meant that we defaulted to ignoring the provider quota rather than
actually checking it.

We should perform this check so that providers can exclude themselves
if there is no possibility of satisfying a request.  This corrects the
check.

Change-Id: I7ced3f749a1646ecbb2b80f93b26e61a4e8cd69a
2022-06-29 10:47:49 -07:00
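
One way to picture the described default inversion with getattr (this is an illustration of the bug pattern, not the actual driver code): when a config class doesn't define ignore_provider_quota at all, a permissive fallback silently disables the quota check.

    class StatemachinePoolConfig:
        """Has no ignore_provider_quota attribute at all."""


    pool = StatemachinePoolConfig()

    # Buggy pattern: a missing attribute falls back to True, so the
    # provider quota is never consulted.
    ignore_quota_buggy = getattr(pool, 'ignore_provider_quota', True)

    # Corrected pattern: default to actually checking the provider quota.
    ignore_quota_fixed = getattr(pool, 'ignore_provider_quota', False)

    print(ignore_quota_buggy, ignore_quota_fixed)   # True False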
James E. Blair 1323d0b556 Update some variable names
Now that the component we registered is a "pool", change the call
sites to use "launcher_pools" instead of "launchers".  This may
reduce some ambiguity.

(s/launcher/pool/ might still be ambiguous since it may not be clear
whether we're talking about our own pools or other pools; thus the
choice of "launcher_pool" for the variable name.)

Also, remove a redundant test assertion.

Change-Id: I865883cdb115bf72a3bd034d9290f60666d64b66
2022-05-23 13:30:50 -07:00
James E. Blair ea35fd5152 Add provider/pool priority support
This lets users configure providers which should fulfill requests
before other providers.  This facilitates using a less expensive
cloud before using a more expensive one.

The default priority is 100, to facilitate either raising above
or lowering below the default (while using only positive integers
in order to avoid confusion).

Change-Id: I969ea821e10a7773a0a8d135a4f13407319362ee
2022-05-23 13:28:21 -07:00
James E. Blair a612aa603c Add the component registry from Zuul
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).

This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.

Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
2022-05-23 07:41:27 -07:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Benjamin Schanzel 7ae6b34e71
Pass requestor data to Nodes
Propagate the NodeRequest.requestor to a new property Node.requestor
which usually holds the zuul_system_id, or "NodePool:min_ready".
This is set initially when first creating a node, and is updated when a
ready node is assigned to a request. This way we always know for which
requestor a node got allocated.

Change-Id: Ifd54a94bae39f31a70bdedce384a64c2a04495c1
2022-03-21 10:19:54 +01:00
James E. Blair e95b146d10 Switch docs theme to versioned RTD
To match change I2870450ffd02f55509fcc1297d050b09deafbfb9 in Zuul.

The default domain is changed to zuul, which uncovered a reference
error that is also fixed.

Change-Id: I71db35252d018feed41d9e87aa702c6daa61902b
2021-12-16 11:23:30 -08:00
Benjamin Schanzel ee90100852 Add Tenant-Scoped Resource Quota
This change adds the option to put quota on resources on a per-tenant
basis (i.e. Zuul tenants).

It adds a new top-level config structure ``tenant-resource-limits``
under which one can specify a number of tenants, each with
``max-servers``, ``max-cores``, and ``max-ram`` limits.  These limits
are valid globally, i.e., for all providers. This is contrary to
currently existing provider and pool quotas, which only are consindered
for nodes of the same provider.

Change-Id: I0c0154db7d5edaa91a9fe21ebf6936e14cef4db7
2021-09-01 09:07:43 +02:00
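
A small sketch of what a per-tenant, cross-provider limit check amounts to (invented function; the configured keys max-servers, max-cores and max-ram come from the commit message, the rest is illustrative):

    def tenant_has_capacity(tenant_limits, tenant_usage, needed):
        """Check a request against tenant limits that span all providers."""
        for key in ('servers', 'cores', 'ram'):
            limit = tenant_limits.get('max-' + key)
            if limit is None:
                continue                       # unlimited unless configured
            if tenant_usage.get(key, 0) + needed.get(key, 0) > limit:
                return False
        return True


    limits = {'max-servers': 10, 'max-cores': 40, 'max-ram': 81920}
    usage = {'servers': 9, 'cores': 36, 'ram': 4096}     # across providers
    print(tenant_has_capacity(limits, usage,
                              {'servers': 1, 'cores': 8, 'ram': 8192}))  # False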
Zuul d08f79d09e Merge "Azure: add quota support" 2021-06-18 21:47:24 +00:00
Zuul 27bbc91799 Merge "Azure: reconcile config objects" 2021-06-15 01:18:31 +00:00
Zuul ee22b88ab5 Merge "Support threadless deletes" 2021-06-11 14:19:00 +00:00
James E. Blair 667c239354 Azure: add quota support
This adds support for querying the provider quota as well as
getting resource usage information for hardware profiles.

This also includes a change to the general quota utils for
all drivers:

If a driver raises an exception when querying available quota
and we have a cached value, use the cached value.  This avoids
errors from transient provider API/network issues.

Change-Id: I7199ced512226aaf8bdb6106d41796a5cbca97c0
2021-05-27 16:18:33 -07:00
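
The cached-value fallback is generic enough to sketch in a few lines (illustrative class, not the actual quota utility code):

    class QuotaCache:
        def __init__(self, fetch):
            self._fetch = fetch      # queries the cloud for available quota
            self._cached = None

        def get(self):
            try:
                self._cached = self._fetch()
            except Exception:
                # Transient API/network problem: fall back to the last
                # good value if we have one, otherwise re-raise.
                if self._cached is None:
                    raise
            return self._cached


    calls = {'n': 0}

    def flaky_fetch():
        calls['n'] += 1
        if calls['n'] > 1:
            raise RuntimeError('transient API error')
        return {'cores': 100, 'instances': 20}


    cache = QuotaCache(flaky_fetch)
    print(cache.get())   # fresh value from the cloud
    print(cache.get())   # fetch fails; the cached value is returned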
James E. Blair f0eb830bb7 Log decline reason at info
Admins need to know why requests were declined, and major decisions
should be logged at info level.

Note that the complementary "Accepting node request" may be okay at
debug level as there is a more informative "Assigning node request"
message at info.

Change-Id: I67ac8c75e5581cc4da77cba42a98c5c3785640ff
2021-04-15 16:25:01 -07:00
James E. Blair 94fcc70a59 Azure: reconcile config objects
The config objects in the Azure driver have drifted a bit.  This
updates them to match the actual used configuration.  It also
reorganizes them to be a little easier to maintain by moving the
initializers into the individual objects.

Finally, the verbose __eq__ methods are removed in favor of a
simpler __eq__ method in the superclass.

Since the OpenStack, k8s, and OpenShift drivers call super() in their
__eq__ methods, they need to be updated at the same time.

This also corrects an unrelated error with a misnamed parameter
in the fake k8s used in the k8s tests.

Change-Id: Id6971ca002879d3fb056fedc7e4ca6ec35dd7434
2021-03-22 10:39:53 -07:00
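
The "simpler __eq__ method in the superclass" idea can be sketched like this (not the exact Nodepool implementation):

    class ConfigValueSketch:
        def __eq__(self, other):
            # One generic comparison of all instance attributes replaces
            # the verbose, hand-written field-by-field __eq__ methods.
            return isinstance(other, type(self)) and vars(self) == vars(other)

        def __ne__(self, other):
            return not self.__eq__(other)


    class AzurePoolSketch(ConfigValueSketch):
        def __init__(self, name, max_servers):
            self.name = name
            self.max_servers = max_servers


    print(AzurePoolSketch('main', 10) == AzurePoolSketch('main', 10))   # True
    print(AzurePoolSketch('main', 10) == AzurePoolSketch('main', 20))   # False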
James E. Blair 63f38dfd6c Support threadless deletes
The launcher implements deletes using threads, and unlike with
launches, does not give drivers an opportunity to override that
and handle them without threads (as we want to do in the state
machine driver).

To correct this, we move the NodeDeleter class from the launcher
to driver utils, and add a new driver Provider method that returns
the NodeDeleter thread.  This is added in the base Provider class
so all drivers get this behavior by default.

In the state machine driver, we override the method so that instead
of returning a thread, we start a state machine and add it to a list
of state machines that our internal state machine runner thread
should drive.

Change-Id: Iddb7ed23c741824b5727fe2d89c9ddbfc01cd7d7
2021-03-21 14:39:01 -07:00
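
A schematic of the override point described above (class and method names are invented): the base provider keeps the one-thread-per-delete behavior, while the state machine driver returns no thread and queues a state machine for its runner thread instead.

    import threading


    class BaseProviderSketch:
        def startNodeDeletion(self, node):
            # Default behavior: one thread per delete, as before.
            thread = threading.Thread(target=lambda: print('deleting', node))
            thread.start()
            return thread


    class StateMachineProviderSketch(BaseProviderSketch):
        def __init__(self):
            self.delete_state_machines = []   # driven by one runner thread

        def startNodeDeletion(self, node):
            # No thread: enqueue a delete state machine instead.
            self.delete_state_machines.append(('delete', node))
            return None


    provider = StateMachineProviderSketch()
    provider.startNodeDeletion('0000000001')
    print(provider.delete_state_machines)   # [('delete', '0000000001')]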
Tobias Henkel 64a8c138dd
Mention node id when unlock failed
Nodepool currently logs an exception trace when it cannot unlock a
node. However, this doesn't include the node id, which would be useful
for debugging, so add that information. Do the same for failed node
allocation clearing.

Change-Id: I437f4be6cdd214f598410536753cc1edbcf730f8
2021-02-25 22:01:57 +01:00
James E. Blair 9e9a5b9bfd Improve max-servers handling for GCE
The quota handling for the simple driver (used by GCE) only handles
max-servers, but even so, it still didn't take into consideration
currently building servers.  If a number of simultaneous requests
were received, it would try to build them all and eventually return
node failures for the ones that the cloud refused to build.

The OpenStack driver has a lot of nice quota handling methods which
do take currently building nodes into account.  This change moves
some of those methods into a new Provider mixin class for quota
support.  This class implements some handy methods which perform
the calculations and provides some abstract methods which providers
will need to implement in order to supply information.

The simple driver is updated to use this system, though it still
only supports max-servers for the moment.

Change-Id: I0ce742452914301552f4af5e92a3e36304a7e291
2020-06-21 06:38:50 -07:00
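
A skeleton of the mixin idea (simplified; the abstract hooks and names are illustrative, not the real quota-support API):

    from abc import ABC, abstractmethod


    class QuotaSupportSketch(ABC):
        """Shared quota arithmetic; drivers supply provider-specific data."""

        @abstractmethod
        def quotaNeededByLabel(self, label):
            """Resources one node of `label` would consume."""

        def hasRemainingQuota(self, label, limit, used):
            # `used` must include nodes that are still building, which is
            # exactly what the old max-servers-only check missed.
            needed = self.quotaNeededByLabel(label)
            return all(used.get(k, 0) + v <= limit.get(k, 0)
                       for k, v in needed.items())


    class SimpleDriverQuotaSketch(QuotaSupportSketch):
        def quotaNeededByLabel(self, label):
            return {'instances': 1}     # max-servers only, for now


    q = SimpleDriverQuotaSketch()
    print(q.hasRemainingQuota('debian-stretch',
                              limit={'instances': 8},
                              used={'instances': 8}))   # False: 8 ready+building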
David Shrewsbury aba9b4e134 Fix for clearing assigned nodes that have vanished
If a node request disappears, we are assuming the node znodes that
have been assigned to it are still around, but it turns out they
may not be, for various odd reasons. Attempting to store to them will
cause errors. We need to handle that.

Change-Id: I10dbce96e4d789a0d1d8d82d72a780fb63b66d62
2020-02-27 17:31:49 -05:00
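
The defensive handling amounts to tolerating a znode that has already vanished when storing; with the kazoo client, for example, it looks roughly like this (a sketch, assuming a connected kazoo client is passed in as `zk`):

    import json

    from kazoo.exceptions import NoNodeError


    def store_node(zk, path, data):
        """Store node data, tolerating a znode that no longer exists."""
        try:
            zk.set(path, json.dumps(data).encode('utf8'))
            return True
        except NoNodeError:
            # The node request disappeared and its znodes were cleaned up
            # underneath us; there is nothing left to update.
            return False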
Simon Westphahl b628195d04 Annotate logs in node request handler
Request-related logs are annotated with the event and node request
id to improve traceability.

Change-Id: I8fdb00f6a446db87e05d73b93afec815ead0f3ee
2020-01-27 15:27:58 +01:00
Zuul 2b4fdb03d9 Merge "Remove unused use_taskmanager flag" 2019-04-05 19:58:37 +00:00
David Shrewsbury 7a650001e9 Fix debug log message
The parameter was missing.

Change-Id: If4de197e54f152ef791748783b22e19cbbd98186
2019-04-04 14:05:19 -04:00
Monty Taylor 7618b714e2 Remove unused use_taskmanager flag
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.

Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
2019-04-02 12:11:07 +00:00
David Shrewsbury d6ef934b70 Extract common config parsing for ProviderConfig
Adds a ProviderConfig class method that can be called to get
the config schema for the common config options in a Provider.
Drivers are modified to call this method.

Change-Id: Ib67256dddc06d13eb7683226edaa8c8c10a73326
2019-01-07 12:34:05 +00:00
David Shrewsbury a19dffd916 Extract out common config parsing for ConfigPool
Our driver code is in a less-than-ideal situation where each driver
is responsible for parsing config options that are common to all
drivers. This change begins to correct that, starting with ConfigPool.
It changes the driver API in the following ways:

1) Forces objects derived from ConfigPool to implement a load() method
   that should call super's method, then handle loading driver-specific
   options from the config.

2) Adds a ConfigPool class method that can be called to get the config
   schema for the common config options, leaving drivers to define only
   the schema for their own config options.

Other base config objects will be modeled after this pattern in
later changes.

Change-Id: I41620590c355cacd2c4fbe6916acfe80f20e3216
2019-01-03 11:05:26 -05:00
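
The pattern described in (1) and (2) looks roughly like this (simplified; the real code builds voluptuous schemas, here replaced by a plain dict):

    class ConfigPoolSketch:
        @classmethod
        def getCommonSchemaDict(cls):
            # Options every driver's pool section shares.
            return {'name': str, 'max-servers': int, 'node-attributes': dict}

        def load(self, pool_config):
            # Common options are parsed here, once, for all drivers.
            self.name = pool_config['name']
            self.max_servers = pool_config.get('max-servers')


    class StaticPoolSketch(ConfigPoolSketch):
        @classmethod
        def getSchema(cls):
            schema = cls.getCommonSchemaDict()
            schema['nodes'] = list          # driver-specific option
            return schema

        def load(self, pool_config):
            super().load(pool_config)       # required: call super's load()
            self.nodes = pool_config.get('nodes', [])


    pool = StaticPoolSketch()
    pool.load({'name': 'main', 'max-servers': 2,
               'nodes': ['host1.example.com']})
    print(pool.name, pool.max_servers, pool.nodes)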
David Shrewsbury 16325d5c4c Add arbitrary node attributes config option
This config option, available under each provider pool section, can
contain static key-value pairs that will be stored in ZooKeeper on
each Node znode. This will allow us to pass along arbitrary data from
nodepool to any user of nodepool (specifically, zuul). Initially, this
will be used to pass along zone information to zuul executors.

Change-Id: I126d37a8c0a4f44dca59c11f76a583b9181ab653
2018-11-29 09:35:59 -05:00
Tobias Henkel e4f245eb13
Add extra safety belt when reusing a node
When reusing a node we lock it but don't check if its state is still
ready. There are probably few, if any, ways this can happen,
but it's better to guarantee this condition than to assume that it
should be ok.

Change-Id: I365de6accb28b538dae5ff1dfb6c4a9966e50b01
2018-11-26 19:38:10 +01:00
Tobias Henkel 25f26165eb Remove unneeded todo comment
This removes a todo about request deferring. This is not valid at this
point because we have already locked the request and, furthermore, are in a loop
that must run fast.

Change-Id: I2858a418fe0ed953e4a300c6d27fca10b65194ef
2018-10-17 14:22:51 +00:00
Tobias Henkel 552cc28f7f
Use quota handling code for min-ready declines
We currently check the current pool nodes against max_servers in order
to decline min-ready requests when we're at quota. This check predates
the real quota handling, when max-servers was the only limit. Now that
we have real quota handling, we should defer this min-ready check to
the point where we know that we're at quota (independent of self-defined
pool quota or real cloud quota).

Change-Id: Ic36ecfaf441436db6699297ef459c586f14c2dfc
2018-10-17 16:20:42 +02:00
David Shrewsbury 81436b58c6 Add logging when pausing due to node abort
We can see the unpause in the logs, but not the pause.

Change-Id: I100764bc7a8117b63e3a644c512722d5a6b255c3
2018-07-23 15:17:00 -04:00
Tobias Henkel ce9120192b
Cleanup test_over_quota
The test_over_quota has some debug lines that should be removed. Also
make the copy of the nodeset more intuitive.

Change-Id: I9a168bc1446e3bb479cace68ed932401ebcdd2bf
2018-07-06 19:35:02 +02:00
Tobias Henkel 31c276c234
Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool never can be perfectly accurate
because there still can be some races with other launchers in the
tenant or by other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request and thus the zuul job fail with
NODE_FAILURE.

Instead we need to pause the handler, just as would happen because of
the quota calculation. We mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
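
A condensed sketch of the new relaunch flow (toy state strings and an invented QuotaException; the real driver uses its own exception and node states):

    class QuotaException(Exception):
        """Stand-in for the cloud refusing a launch because of quota."""


    def launch(node, handler, cloud_create):
        try:
            cloud_create(node)
            node['state'] = 'ready'
        except QuotaException:
            # Not a failure: abort the node and pause the handler so the
            # launch is rescheduled once the quota calculation allows it,
            # instead of retrying immediately and failing the request.
            node['state'] = 'aborted'
            handler['paused'] = True


    def refuse(node):
        raise QuotaException('quota exceeded')


    node, handler = {'state': 'building'}, {'paused': False}
    launch(node, handler, refuse)
    print(node['state'], handler['paused'])   # aborted True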
Zuul 729f34fced Merge "launcher: add pool quota debug and log information" 2018-07-03 07:28:50 +00:00
Tristan Cacqueray 24bf9d1f0e launcher: add pool quota debug and log information
This change logs the current pool quota at DEBUG and adds an
INFO log when a pool doesn't have enough quota to satisfy a request.

Change-Id: I78445f21f251d206d4fb0dce817a992685c8b319
2018-07-01 01:16:32 +00:00
David Shrewsbury c08eb291f7 Fix for pools with different labels
Pools within the same provider that have different labels are
attempting to handle requests for ALL labels defined within the provider.
This is due to change I61263f51be5e957de71d1e2dabaa7391bbe7bddf, which
incorrectly changed the source of supported labels from per-pool to
per-provider. This modifies the getSupportedLabels() API to support
pools.

Change-Id: I7a0d472928c6b528f6faa6dd3b9cf1479874cb22
2018-06-28 15:04:35 -04:00
Zuul 3c92ac841e Merge "Make a proper driver notification API" 2018-06-21 19:01:29 +00:00
Zuul 4b7f348b76 Merge "Pass zk connection to ProviderManager.start()" 2018-06-21 19:00:24 +00:00
David Shrewsbury 58ccd2c718 Fix paused handler exception handling
If a request handler is paused and an exception is thrown during
normal event processing, the exception handling code was not properly unpausing
the request and removing it from the list of active handlers.

Change-Id: I0cfd8294f9ce88e5918654a4847776c981ee4fb3
2018-06-15 09:05:04 -04:00
David Shrewsbury 326de1eedb Make a proper driver notification API
We sometimes have need to send notifications to certain driver objects.
A "notification" is a call to an object to indicate some condition has
changed, and the callee may choose to act on that, or totally ignore it.
Currently, the only notification is NodeRequestHandler.nodeReused() but
this adds another notification for the Provider objects in preparation
for static node driver changes.

Since it is entirely probable we may want more in the future, this
change organizes the notifications into proper classes for easy
identification/manipulation. It will be a naming standard to end these
notification method names with "Notification".

Change-Id: I905ef42bb434435cbe3cec0f5007981a5b789769
2018-06-13 10:22:11 -04:00
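
The naming standard can be illustrated with a pair of notification classes (the class and method names below are examples of the convention, not a listing of the actual API):

    class ProviderNotificationsSketch:
        """Hooks a driver's Provider may act on or simply ignore."""

        def nodeDeletedNotification(self, node):
            pass


    class NodeRequestHandlerNotificationsSketch:
        """Hooks for request handlers, e.g. when a ready node is reused."""

        def nodeReusedNotification(self, node):
            pass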
David Shrewsbury a418aabb7a Pass zk connection to ProviderManager.start()
In order to support static node pre-registration, we need to give
the provider manager the opportunity to register/deregister any
nodes in its configuration file when it starts (on startup or when
the config changes). It will need a ZooKeeper connection to do this.
The OpenStack driver will ignore this parameter.

Change-Id: Idd00286b2577921b3fe5b55e8f13a27f2fbde5d6
2018-06-12 12:04:16 -04:00
Zuul e17e4c36b3 Merge "Fix 'satisfy' spelling errors" 2018-06-11 13:42:28 +00:00