Commit Graph

33 Commits

Author SHA1 Message Date
Zuul ac703de734 Merge "Resolve statsd client once at startup" 2023-12-09 16:18:57 +00:00
Benjamin Schanzel f0b5d1f149
Fix logging of failed node launches
A Node has no attribute `label`; instead, this information is taken from
`node.type`. Additionally, print the node id.

2023-10-06 16:26:27,834 ERROR nodepool.NodeLauncher: [e: 1bfeefa5044b481fa3e18781a9972773] [node_request: 200-0044066729] [node: 0045664756] Launch attempt 3/3 failed for node 0045664756:
[...snip...]
kubernetes.client.exceptions.ApiException: (409)
[...snip...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/nodepool/lib/python3.11/site-packages/nodepool/driver/utils.py", line 101, in run
    self.node.label)
    ^^^^^^^^^^^^^^^
AttributeError: 'Node' object has no attribute 'label'

Change-Id: I42913c26318247f6d352308fa2d7b5fdf8a5fcd0
2023-10-12 14:14:12 +02:00
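To illustrate the fix above, here is a minimal sketch of a launch-failure log call that uses node.type and node.id; the class, method name, and label value are hypothetical stand-ins, not the actual NodeLauncher code:

import logging
from types import SimpleNamespace


class NodeLauncherSketch:
    log = logging.getLogger("nodepool.NodeLauncher")

    def __init__(self, node, retries=3):
        self.node = node
        self.retries = retries

    def logLaunchFailure(self, attempt):
        # A Node carries `type` (and `id`), not `label`.
        self.log.error("Launch attempt %d/%d failed for node %s of type %s",
                       attempt, self.retries, self.node.id, self.node.type)


logging.basicConfig(level=logging.ERROR)
node = SimpleNamespace(id="0045664756", type=["example-label"])  # stand-in Node
NodeLauncherSketch(node).logLaunchFailure(3)
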
Tristan Cacqueray 8dc55c231f Add tenant and label name to Launch failed error
This change improves the launch failure logs to include the tenant
and the label name of the failed node.

Change-Id: I13ddfb1333a4865d3b117aa68fda14d3123fe451
2023-08-23 18:06:06 +00:00
James E. Blair 21b8451947 Resolve statsd client once at startup
We currently create a new statsd client each time we replace a
provider manager.  This means that if we are unable to resolve
DNS at that time, the new provider may crash due to unhandled
exceptions.

To resolve this, let's adopt the same behavior we have in Zuul
which is to set up the statsd client once at startup and continue
to use the same client object for the life of the process.

This means that operators will still see errors at startup in the case of
a misconfiguration, but external changes after that will not affect
nodepool.

Change-Id: I65967c71e859fddbea15aee89f6ddae44344c87b
2023-08-14 10:47:53 -07:00
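A minimal sketch of the pattern described above, assuming the `statsd` package and the usual STATSD_HOST/STATSD_PORT environment variables; the helper and class names are illustrative, not Nodepool's real API:

import os


def make_statsd_client():
    # Resolve the statsd target a single time at startup.
    host = os.environ.get("STATSD_HOST")
    if not host:
        return None
    import statsd
    return statsd.StatsClient(host, int(os.environ.get("STATSD_PORT", 8125)))


class LauncherSketch:
    def __init__(self):
        # Created once and reused for the life of the process.
        self.statsd = make_statsd_client()

    def replace_provider_manager(self, provider_config):
        # Hypothetical: the new manager gets the shared client instead of
        # resolving DNS again, so a transient DNS failure cannot crash it.
        return {"provider": provider_config, "statsd": self.statsd}
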
James E. Blair 99d2a361a1 Use cached ids in node iterator more often
There are several places where it is now probably safe to use
cached ids when iterating over ZK nodes.  The main reasons not to
use cached ids are in the case of race conditions or in case the
tree cache may have missed an event and is unaware of a node.  We
have increased confidence in the accuracy of our cache now, so at
least in the cases where we know that races are not an issue, we
can switch those to use cached ids and save a ZK round trip (a
possibly significant one if there is a long list of children).

This change adds the flag in the following places (with
explanations of why it's safe):

* State machine cleanup routines

    Leaked instances have to show up on two subsequent calls to
    be acted upon, so this is not sensitive to timing

* Quota calculation

    If we do get the quota wrong, drivers are expected to handle
    that gracefully anyway.

* Held node cleanup

    Worst case is we wait until next iteration to clean up.

* Stats

    They're a snapshot anyway, so a cache mismatch is really just
    a shift in the snapshot time.

Change-Id: Ie7af2f62188951bf302ffdb64827d868609a1e3c
2023-05-30 13:27:45 -07:00
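Schematically, the idea reads like the sketch below; this is not Nodepool's actual ZK API, and the path, flag, and class names are hypothetical:

class NodeStoreSketch:
    def __init__(self, zk_client, tree_cache):
        self.zk = zk_client            # e.g. a kazoo client
        self.cache = tree_cache        # node_id -> cached node data

    def node_ids(self, cached_ids=False):
        if cached_ids:
            # No ZK round trip; acceptable for cleanup, quota and stats
            # paths where a brief mismatch only delays work one iteration.
            return list(self.cache.keys())
        # Authoritative: ask ZooKeeper for the current children.
        return self.zk.get_children("/nodepool/nodes")

    def iter_cached_nodes(self, cached_ids=False):
        for node_id in self.node_ids(cached_ids=cached_ids):
            data = self.cache.get(node_id)
            if data is not None:
                yield node_id, data
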
James E. Blair eda7b0f609 Add a LazyExecutorTTLCache to the OpenStack driver
See the docstring for an explanation of what a Lazy Executor TTL
Cache is.  By switching the caching of the server list method (and
also volumes and fips) to the lazy cache, we will make all of the
methods called by the state machines asynchronous.  This means
that both the create and delete state machine threads should be
able to spin through all of their state machines as quickly as
Python and ZooKeeper overhead will allow.

Change-Id: Ibce6b4d82929e6a764fdbc025990f7e01060b509
2023-03-15 15:25:52 -07:00
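A condensed, illustrative sketch of the lazy executor TTL cache idea (not the class actually added by this change): callers always get the last cached value immediately, and an expired entry triggers a single background refresh instead of blocking the caller:

import functools
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def lazy_ttl_cache(ttl, executor):
    def decorator(func):
        lock = threading.Lock()
        state = {"value": None, "ts": None, "future": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                now = time.monotonic()
                if state["ts"] is None:
                    # Very first call has nothing cached yet, so block once.
                    state["value"] = func(*args, **kwargs)
                    state["ts"] = now
                else:
                    expired = (now - state["ts"]) >= ttl
                    refreshing = (state["future"] is not None
                                  and not state["future"].done())
                    if expired and not refreshing:
                        def refresh():
                            value = func(*args, **kwargs)
                            with lock:
                                state["value"] = value
                                state["ts"] = time.monotonic()
                        state["future"] = executor.submit(refresh)
                return state["value"]
        return wrapper
    return decorator


executor = ThreadPoolExecutor(max_workers=2)


@lazy_ttl_cache(ttl=10, executor=executor)
def list_servers():
    time.sleep(1)              # stands in for the slow cloud list call
    return ["server-1", "server-2"]


print(list_servers())          # blocks once, then later calls return instantly
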
James E. Blair de02ac5a20 Add OpenStack volume quota
This adds support for staying within OpenStack volume quota limits
on instances that utilize boot-from-volume.

Change-Id: I1b7bc177581d23cecd9443a392fb058176409c46
2023-02-13 06:56:03 -08:00
James E. Blair 207d8ac63c AWS multi quota support
This adds support for AWS quotas that are specific to instance types.

The current quota support in AWS assumes only the "standard" instance types,
but AWS has several additional types with particular specialties (high memory,
GPU, etc).  This adds automatic support for those by encoding their service
quota codes (like 'L-1216C47A') into the QuotaInformation object.

QuotaInformation accepts not only cores, ram, and instances as resource
values, but now also accepts arbitrary keys such as 'L-1216C47A'.
Extra testing of QI is added to ensure we handle the arithmetic correctly
in cases where one or the other operand does not have a resource counter.

The statemachine drivers did not encode their resource information into
the ZK Node record, so tenant quota was not operating correctly.  This is
now fixed.

The AWS driver now accepts max_cores, _instances, and _ram values similar
to the OpenStack driver.  It additionally accepts max_resources which can
be used to specify limits for arbitrary quotas like 'L-1216C47A'.

The tenant quota system now also accepts arbitrary keys such as 'L-1216C47A'
so that, for example, high memory nodes may be limited by tenant.

The mapping of instance types to quota is manually maintained, however,
AWS doesn't seem to add new instance types too often, and those it does are
highly specialized.  If a new instance type is not handled internally, the
driver will not be able to calculate expected quota usage, but will still
operate until the new type is added to the mapping.

Change-Id: Iefdc8f3fb8249c61c43fe51b592f551e273f9c36
2022-07-25 14:41:07 -07:00
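The arithmetic described above can be sketched as follows; the class and method names are illustrative, and the real QuotaInformation class may differ:

class QuotaSketch:
    def __init__(self, **resources):
        # e.g. QuotaSketch(cores=8, instances=1, **{'L-1216C47A': 4})
        self.quota = dict(resources)

    def add(self, other):
        for key, value in other.quota.items():
            # A counter present on only one operand simply carries over.
            self.quota[key] = self.quota.get(key, 0) + value

    def subtract(self, other):
        for key, value in other.quota.items():
            self.quota[key] = self.quota.get(key, 0) - value

    def non_negative(self):
        return all(v >= 0 for v in self.quota.values())


available = QuotaSketch(cores=128, instances=20, **{'L-1216C47A': 64})
used = QuotaSketch(cores=96, **{'L-1216C47A': 96})   # no 'instances' counter
available.subtract(used)
print(available.non_negative())   # False: the 'L-1216C47A' pool is exhausted
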
James E. Blair 4aaacc7919 Fix ignore_provider_quota in statemachine drivers
The statemachine driver copied the quota handling from OpenStack,
which also included a check for the presence of the ignore_provider_quota
configuration attribute.  This attribute does not exist in any of
the statemachine drivers, and due to the way the check was constructed,
we defaulted to ignoring the provider quota rather than actually
checking it.

We should perform this check so that providers can exclude themselves
if there is no possibility of satisfying a request.  This corrects the
check.

Change-Id: I7ced3f749a1646ecbb2b80f93b26e61a4e8cd69a
2022-06-29 10:47:49 -07:00
James E. Blair d5b0dee642 AWS driver create/delete improvements
The default AWS rate limit is 2 instances/sec, but in practice, we
can achieve something like 0.6 instances/sec with the current code.
That's because the create instance REST API call itself takes more
than a second to return.  To achieve even the default AWS rate
(much less a potentially faster one which may be obtainable via
support request), we need to alter the approach.  This change does
the following:

* Parallelizes create API calls.  We create a threadpool with
  (typically) 8 workers to execute create instance calls in the
  background.  2 or 3 workers should be sufficient to meet the
  2/sec rate, more allows for the occasional longer execution time
  as well as a customized higher rate.  We max out at 8 to protect
  nodepool from too many threads.
* The state machine uses the new background create calls instead
  of synchronously creating instances.  This allows other state
  machines to progress further (i.e., advance to ssh keyscan faster
  in the case of a rush of requests).
* Delete calls are batched.  They don't take as long as create calls,
  yet their existence at all uses up rate limiting slots which could
  be used for creating instances.  By batching deletes, we make
  more room for creates.
* A bug in the RateLimiter could cause it not to record the initial
  time and therefore avoid actually rate limiting.  This is fixed.
* The RateLimiter is now thread-safe.
* The default rate limit for AWS is changed to 2 requests/sec.
* Documentation for the 'rate' parameter for the AWS driver is added.
* Documentation for the 'rate' parameter for the Azure driver is
  corrected to describe the rate as requests/sec instead of delay
  between requests.

Change-Id: Ida2cbc59928e183eb7da275ff26d152eae784cfe
2022-06-22 13:28:58 -07:00
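A hedged sketch of the create/delete handling described above; the adapter structure is hypothetical, though run_instances and terminate_instances are the usual boto3 EC2 client calls:

from concurrent.futures import ThreadPoolExecutor

CREATE_WORKERS = 8   # capped so nodepool is not flooded with threads


class AwsAdapterSketch:
    def __init__(self, ec2_client):
        self.ec2 = ec2_client
        self.create_executor = ThreadPoolExecutor(max_workers=CREATE_WORKERS)

    def submit_create(self, run_args):
        # Returns a Future immediately; the create state machine polls
        # future.done() instead of blocking on the slow RunInstances call.
        return self.create_executor.submit(self.ec2.run_instances, **run_args)

    def delete_batch(self, instance_ids):
        # Batched deletes spend one rate-limit slot on many instances,
        # leaving more slots free for creates.
        self.ec2.terminate_instances(InstanceIds=list(instance_ids))
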
Zuul a4acb5644e Merge "Use Zuul-style ZooKeeper connections" 2022-05-23 22:56:54 +00:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Zuul 7129681181 Merge "Do not reset quota cache timestamp when invalid" 2022-05-19 15:46:57 +00:00
Joshua Watt 2c632af426 Do not reset quota cache timestamp when invalid
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.

This bug was preventing Quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, it prevented the QuotaException from being raised and the
node allocation failed outright.

A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.

This fixes that bug, but it does change the behavior of the OpenStack
driver when ignore-provider-quota is True and the cloud returns a quota
error.

Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
2022-05-10 15:04:25 -05:00
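A minimal sketch of the guard this change describes; the attribute and method names here are illustrative:

class QuotaCacheSketch:
    def __init__(self):
        self._cached_quota = None   # e.g. {'timestamp': ..., 'quota': ...}

    def invalidateQuotaCache(self):
        # Before the fix, this assumed a dict and raised TypeError when the
        # cache was None (ignore-provider-quota), masking the QuotaException.
        if self._cached_quota is not None:
            self._cached_quota['timestamp'] = 0
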
James E. Blair a62060a14e Measure rate limits from start times
The RateLimiter utility used by several drivers measures time from
the exit of one context manager to the beginning of the next.  This
means that if the API call itself takes substantial time, it will
be added to the expected delay.  In other words, if the expected rate
is one API call every two seconds, and the API call itself takes
one second, the actual rate will be one call every three seconds.

To more closely approximate the overall expected rate, measure from
start to start, so that the duration of the API call itself is
included.

Change-Id: Ia62a6bfa6a3e6cac65f0c20179edcfbc94a5dcc5
2022-05-02 13:29:23 -07:00
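Illustratively, a start-to-start rate limiter along the lines described above might look like this; a sketch, not the RateLimiter class itself:

import threading
import time


class StartToStartRateLimiter:
    def __init__(self, rate_per_second):
        self.delta = 1.0 / rate_per_second
        self.lock = threading.Lock()
        self.last_start = None

    def __enter__(self):
        with self.lock:
            now = time.monotonic()
            if self.last_start is not None:
                wait = self.delta - (now - self.last_start)
                if wait > 0:
                    time.sleep(wait)
            self.last_start = time.monotonic()

    def __exit__(self, exc_type, exc, tb):
        # Nothing recorded on exit: timing is start-to-start, so the API
        # call's own duration counts toward the interval.
        return False


limiter = StartToStartRateLimiter(rate_per_second=2)
with limiter:
    pass   # the API call goes here
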
Zuul 17d0112017 Merge "Add support for label quota" 2022-04-27 20:36:49 +00:00
James E. Blair 46e130fe1a Add more debug info to AWS driver
These changes are all in service of being able to better understand
AWS driver log messages:

* Use annotated loggers in the statemachine provider framework
  so that we see the request, node, and provider information
* Have the statemachine framework pass annotated loggers to the
  state machines themselves so that the above information is available
  for log messages on individual API calls
* Add optional performance information to the rate limit handler
  (delay and API call duration)
* Add some additional log entries to the AWS adapter

Also:

* Suppress boto logging by default in unit tests (it is verbose and
  usually not helpful)
* Add coverage of node deletion in the AWS driver tests

Change-Id: I0e6b4ad72d1af7f776da73c5dd2a50b40f60e4a2
2022-04-11 10:14:20 -07:00
Simon Westphahl 3874d0b5c0 Add support for label quota
This change adds support for deferring requests when a provider temporarily
cannot fulfill a request based on the requested labels. How the
label quota is used depends on the driver.

The reason for introducing the concept of label quota has to do with how
the static driver handled requests when a label wasn't available
immediately. Previously, Nodepool locked the request and created a node with
the requested label(s) in state BUILDING. When a node with matching
label(s) was returned, the static driver re-registered it using the
waiting nodes in state BUILDING.

The problem with this approach was that, as soon as the request was
locked, Zuul could no longer revise the relative priority of the
request. This could lead to the situation that later node requests were
prioritized when they had a lower initial relative priority.

To fix this the static provider will now use the label quota to defer
requests that can't be fulfilled with the existing free nodes.

Change-Id: I55f6655c0e34b7a3c136b270658009b4f4d76394
2022-03-31 15:03:41 +02:00
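A toy sketch of the label-quota check described above; the function and data shapes are hypothetical:

from collections import Counter


def has_label_quota(request_labels, free_node_labels):
    """free_node_labels: labels of currently registered, unassigned nodes."""
    available = Counter(free_node_labels)
    needed = Counter(request_labels)
    return all(available[label] >= count for label, count in needed.items())


# A request for two 'small-static' nodes would be deferred (left unlocked,
# so Zuul can still revise its relative priority) while only one matching
# node is free:
print(has_label_quota(['small-static'] * 2, ['small-static', 'large-static']))
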
Zuul f7b109abc9 Merge "QuotaInformation : abstract resource recording" 2021-06-28 21:46:20 +00:00
James E. Blair 667c239354 Azure: add quota support
This adds support for querying the provider quota as well as
getting resource usage information for hardware profiles.

This also includes a change to the general quota utils for
all drivers:

If a driver raises an exception when querying available quota
and we have a cached value, use the cached value.  This avoids
errors from transient provider API/network issues.

Change-Id: I7199ced512226aaf8bdb6106d41796a5cbca97c0
2021-05-27 16:18:33 -07:00
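A minimal sketch of the cached-value fallback described above; the adapter call and attribute names are assumptions, not the real quota utility code:

import logging
import time

log = logging.getLogger("sketch.quota")


class ProviderQuotaSketch:
    def __init__(self, adapter, ttl=300):
        self.adapter = adapter
        self.ttl = ttl
        self._cache = None
        self._cache_time = 0

    def get_provider_quota(self):
        if self._cache is not None and time.time() - self._cache_time < self.ttl:
            return self._cache
        try:
            quota = self.adapter.query_quota()   # hypothetical adapter call
        except Exception:
            if self._cache is not None:
                # Transient API/network issue: keep serving the cached value.
                log.exception("Quota query failed; using cached value")
                return self._cache
            raise
        self._cache = quota
        self._cache_time = time.time()
        return self._cache
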
Ian Wienand bf40e145e4 QuotaInformation : abstract resource recording
Change I670d4bfd7d35ce1109b3ee9b3342fb45ee283a79 added the
quota['compute'] property to ZK nodes for Zuul to keep track of.

Instead of accessing the property directly, this adds a
get_resources() function to QuotaInformation which returns it, and
makes a point of noting that the property is in use so that we don't
modify it in incompatible ways in the future.

This is a no-op refactor, but helpful in a follow-on change that also
adds this field to leaked nodes.

Change-Id: Id78b059cf2121e01e4cd444f6ad3834373cf7fb6
2021-04-20 16:09:18 +10:00
James E. Blair 63f38dfd6c Support threadless deletes
The launcher implements deletes using threads, and unlike with
launches, does not give drivers an opportunity to override that
and handle them without threads (as we want to do in the state
machine driver).

To correct this, we move the NodeDeleter class from the launcher
to driver utils, and add a new driver Provider method that returns
the NodeDeleter thread.  This is added in the base Provider class
so all drivers get this behavior by default.

In the state machine driver, we override the method so that instead
of returning a thread, we start a state machine and add it to a list
of state machines that our internal state machine runner thread
should drive.

Change-Id: Iddb7ed23c741824b5727fe2d89c9ddbfc01cd7d7
2021-03-21 14:39:01 -07:00
James E. Blair 7ce4dbe26d Add a state machine driver framework
This is intended to simplify efficient driver implementation by
isolating ZooKeeper and other internal Nodepool logic from cloud
implementation, while at the same time reducing the need for
threads.

It is based on the simple driver interface, but by using state
machines, it can accommodate clouds (like Azure and OpenStack) which
require multiple steps to create an instance.

It's currently only suitable for use with a single pool, because
the launcher's poolworkers are multiple threads.  However, we should
be able to collapse them into one thread which would make this safe
for use with multiple pools.

We may also need to adjust some of the sleep times in the launcher
to accommodate the idea that the pool worker threads will be more
active in driving the state machine polling.

Change-Id: Ia179220a71653c9e261342a262ff1e23e5408215
2021-03-19 10:47:10 -07:00
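A condensed sketch of the framework idea: each launch is a small state machine whose advance() performs one non-blocking step, and a single runner loop drives many machines. All names and the cloud interface are illustrative:

import time


class CreateStateMachineSketch:
    START, REQUESTED, COMPLETE = 'start', 'requested', 'complete'

    def __init__(self, cloud, label):
        self.cloud = cloud
        self.label = label
        self.state = self.START
        self.external_id = None

    def advance(self):
        # One small, non-blocking step per call; returns True when done.
        if self.state == self.START:
            self.external_id = self.cloud.submit_create(self.label)
            self.state = self.REQUESTED
        elif self.state == self.REQUESTED:
            if self.cloud.is_active(self.external_id):
                self.state = self.COMPLETE
        return self.state == self.COMPLETE


def run_state_machines(machines, interval=1.0):
    # A single thread can drive many pending create/delete machines.
    while machines:
        for sm in list(machines):
            if sm.advance():
                machines.remove(sm)
        time.sleep(interval)
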
James E. Blair 120f399ec6 Finish adding GCE quota support
This expands the simple driver with the calls necessary to get
the current quota limits and expected and current usage of flavors.
That's enough to handle real cloud quota.

This change also removes a leftover debug line.

Change-Id: I837f5ad8e697b48ad8138021e64e799c8be3c6fc
2020-08-12 11:00:42 -07:00
James E. Blair 9e9a5b9bfd Improve max-servers handling for GCE
The quota handling for the simple driver (used by GCE) only handles
max-servers, but even so, it still didn't take into consideration
currently building servers.  If a number of simultaneous requests
were received, it would try to build them all and eventually return
node failures for the ones that the cloud refused to build.

The OpenStack driver has a lot of nice quota handling methods which
do take currently building nodes into account.  This change moves
some of those methods into a new Provider mixin class for quota
support.  This class implements some handy methods which perform
the calculations and provides some abstract methods which providers
will need to implement in order to supply information.

The simple driver is updated to use this system, though it still
only supports max-servers for the moment.

Change-Id: I0ce742452914301552f4af5e92a3e36304a7e291
2020-06-21 06:38:50 -07:00
Simon Westphahl 52709d7070 Pass node request handler to launcher base class
All implementations of the NodeLauncher have access to the node request
handler instance, but it was not passed through to the base class.

In order to initialize the event log adapter, the NodeLauncher will now
receive the node request handler instead of the Zookeeper connection.

The Zookeeper connection is directly extracted from the request handler.

Change-Id: I8ee7335f50b64435c82a5f039989f98750874466
2020-01-22 09:49:52 +01:00
Simon Westphahl d90f0e745a Centralize logging adapters
Logging adapters were already in use for the NodeLauncher. In order to
expand logging of Zuul's event id to nodepool, we will first centralize
the existing adapters and refactor them similarly to how this is already
done in Zuul.

This will allow us to correlate node requests with other log messages
from Zuul via the event id, for faster and easier debugging.

Change-Id: Idc7468239a055c75074e876ea8c445b49c791131
2020-01-22 09:13:21 +01:00
Tobias Henkel f58a2a2b68
Fix leaked loggers
In the NodeLauncher we currently create a specific logger per node we
create. However loggers cannot be destroyed and will live for the
whole lifetime of the application. This means that we essentially leak
one Logger object per node we create.

The reason for introducing per-node loggers was that we could easily
filter the log by node id and get every message related to that node's
creation. This can also be accomplished with a LoggerAdapter, which is
already in use by Zuul for the same purpose.

Change-Id: Ieff99c242c1ec1ca89e2c16929cfc3237d6f1a78
2019-04-30 21:08:28 +02:00
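A minimal sketch of the LoggerAdapter approach described above; the adapter class and helper function are illustrative:

import logging


class NodeLoggingAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        return '[node: %s] %s' % (self.extra['node_id'], msg), kwargs


base_log = logging.getLogger("nodepool.NodeLauncher")   # one shared Logger


def log_for_node(node_id):
    # Cheap per-launch object; the underlying Logger is reused, not leaked.
    return NodeLoggingAdapter(base_log, {'node_id': node_id})


logging.basicConfig(level=logging.INFO)
log_for_node("0000001234").info("Launching node")
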
Tobias Henkel 64487baef0
Asynchronously update node statistics
We currently update the node statistics on every node launch or
delete. This cannot use caching at the moment because when the
statistics are updated we might end up pushing slightly outdated
data. If there is then no further update for a long time, we end up
with broken gauges. We already get update events from the node cache,
so we can use those to centrally trigger node statistics updates.

This is combined with leader election so there is only a single
launcher that keeps the statistics up to date. This ensures that
the statistics are not cluttered by several launchers pushing their
own, slightly different views into the stats.

As a side effect this reduces the runtime of a test that creates 200
nodes from 100s to 70s on my local machine.

Change-Id: I77c6edc1db45b5b45be1812cf19eea66fdfab014
2018-11-29 16:48:30 +01:00
Paul Belanger 87b0f66353
Use hostname for launch exception
By default, the hostname contains information about the provider and node
id; include this information to aid in debugging.

Change-Id: I4ddd8a83a9f50d856f78c562d86a83da6d78c759
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-07-25 13:15:28 -04:00
Tobias Henkel 31c276c234
Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the
tenant or with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request, and thus the zuul job, fail with
NODE_FAILURE.

Instead, we need to pause the handler, just as would happen due to the
quota calculation. We mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
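Schematically, the new handling looks like the sketch below; the exception class, state value, and pause flag are illustrative stand-ins for Nodepool's real ones:

class QuotaExceededSketch(Exception):
    pass


def launch_with_quota_handling(handler, node, launch):
    try:
        launch(node)
    except QuotaExceededSketch:
        # Not a NODE_FAILURE: the node is marked aborted and the handler is
        # paused, so the launch is rescheduled once quota frees up.
        node.state = 'aborted'
        handler.paused = True
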
Tobias Henkel e2f279275d
Move QuotaInformation to driver utils
The QuotaInformation class is a quite generic container for quota
information. This might be useful in future drivers like AWS or
Azure. Further, moving it to utils resolves a cyclic import between
the openstack handler and provider.

Change-Id: Ibbece39a0da74ae7b54e3a1088057e683f158a66
2018-06-06 20:21:28 +02:00
David Shrewsbury be0c59535a Simplify driver API
Further decouple the driver API from the provider configuration
by simplifying the driver interface (which will eventually lead to
the driver deciding whether or not to support multiple labels) and
removing config references in NodeLauncher.

This significantly simplifies the driver API by:

  * Reducing the launching of nodes to a launch() and launchesCompleted()
    interface. The launchesCompleted() method acts as a periodic check on
    whether the nodeset launches have all completed.

  * How the drivers actually launch nodes is decoupled from the driver
    API by not having the return of NodeRequestHandler.launch() be either
    a thread or None. This leaves it entirely up to the driver to decide
    the best way to manage launches... threads, subprocesses, messaging,
    etc. E.g., thread handling is moved from driver API into OpenStack driver.

  * Drivers no longer need to implement the very confusing pollLauncher()
    method. This was originally meant to poll the launch threads, but
    even drivers not using threads (e.g., the static driver) had to
    reimplement this method. That made no sense.

The NodeLauncher class (currently used only by the OpenStack driver),
is moved to a driver utility module that other drivers may choose to
use or ignore. And it no longer directly accesses the provider config.

The test_nodelaunchmanager.py tests were originally testing the
NodeLaunchManager class (removed quite some time ago), which managed
launch threads. This was eventually moved to the driver interface.
This change further moves thread management into the OpenStack driver.
As a result, these tests are redundant since the test_launcher.py
tests cover all of this. So remove test_nodelaunchmanager.py.

Change-Id: I407d999b3608e1614c0dbe49808b2a64756dde58
2018-05-18 11:24:59 -04:00
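A schematic sketch of the simplified interface, using the launch()/launchesCompleted() names from the description above; the thread-based body is just one possible driver choice, not the required implementation:

import threading


class RequestHandlerSketch:
    def __init__(self):
        self._threads = []

    def launch(self, node):
        # The driver decides how to launch: threads, subprocesses, state
        # machines, ...; nothing has to be returned to the framework.
        t = threading.Thread(target=self._launch_node, args=(node,))
        t.start()
        self._threads.append(t)

    def launchesCompleted(self):
        # Polled periodically by the framework; True ends the wait.
        return all(not t.is_alive() for t in self._threads)

    def _launch_node(self, node):
        pass   # provider-specific launch work would happen here
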