Commit Graph

11 Commits

Author SHA1 Message Date
Dong Zhang aa113be19f Introduce error.capacity states_key for InsufficientInstanceCapacity error
We want to handle the "InsufficientInstanceCapacity" error differently from
other "error.unknown" errors in our monitoring/alerting system. With this
change, it produces an "error.capacity" state key instead of "error.unknown".

Change-Id: Id3a49d4b2d4b4733f801e65df69b505e913985a7
2024-01-04 15:37:51 +01:00
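
A minimal sketch of the mapping this change describes; the helper name and
the exact error-code check are assumptions, not nodepool's actual code:

    # Hypothetical helper illustrating the intended states_key mapping.
    def states_key_for_error(error_code):
        """Translate a cloud error code into a monitoring state key."""
        if error_code == 'InsufficientInstanceCapacity':
            # Capacity shortages should alert differently than
            # genuinely unknown errors.
            return 'error.capacity'
        return 'error.unknown'
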
James E. Blair a602654482 Handle invalid AWS instance in quota lookup
Early nodepool performed all cloud operations within the context of
an accepted node request.  This means that any disagreement between
the nodepool configuration and the cloud (such as what instance types,
images, networks, or other resources are actually available) would
be detected within that context and the request would be marked as
completed and failed.

When we added tenant quota support, we also added the possibility
of needing to interact with the cloud before accepting a request.
Specifically, we may now ask the cloud what resources are needed
for a given instance type (and volume, etc.) before deciding whether
to accept a request.  If we raise an exception here, the launcher
will simply loop indefinitely.

To avoid this, we will add a new exception class to indicate a
permanent configuration error which was detected at runtime.  If
AWS says an instance type doesn't exist when we try to calculate
its quota, we mark it as permanently errored in the provider
configuration, then return empty quota information back to the
launcher.

This allows the launcher to accept the request, but then immediately
mark it as failed because the label isn't available.

The state of this error is stored on the provider manager, so the
typical corrective action of updating the configuration to correct
the label config means that a new provider will be spawned with an
empty error label list; the error state will be cleared and the
launcher will try again.

Finally, an extra exception handler is added to the main launcher
loop so that if any other unhandled errors slip through, the
request will be deferred and the launcher will continue processing
requests.

Change-Id: I9a5349203a337ab23159806762cb46c059fe4ac5
2023-07-18 13:51:13 -07:00
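
A hedged sketch of the pattern described above; the class, method, and
attribute names are illustrative, not nodepool's actual ones:

    class RuntimeConfigurationException(Exception):
        """A permanent configuration error detected at runtime."""

    class ProviderManager:
        def __init__(self):
            # Instance types the cloud has told us do not exist.  A config
            # update spawns a fresh manager, which clears this implicitly.
            self.errored_instance_types = set()

        def get_instance_quota(self, instance_type):
            if instance_type in self.errored_instance_types:
                # Empty quota lets the launcher accept the request, then
                # immediately fail it because the label is unavailable.
                return {}
            try:
                return self._query_cloud(instance_type)  # assumed helper
            except RuntimeConfigurationException:
                self.errored_instance_types.add(instance_type)
                return {}
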
Fabien Boucher f57ac1881a Remove unneeded shebang and exec bit on some files
Having python files with the exec bit and a shebang defined in
/usr/lib/python-*/site-packages/ is not fine in an RPM package.

Instead of carrying a patch in nodepool RPM packaging, it is better
to fix this directly upstream.

Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
2019-12-13 19:30:03 +01:00
Tobias Henkel 31c276c234 Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the tenant or
with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try to launch the node again. Of course that will fail again in most
cases. This makes the node request, and thus the zuul job, fail with
NODE_FAILURE.

Instead we need to behave as if the quota calculation itself had
predicted the shortfall: mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
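
Roughly, the new behavior looks like this sketch (the exception, state, and
attribute names are assumed):

    class QuotaException(Exception):
        """The cloud refused a launch due to quota (assumed name)."""

    def launch(self, node):
        try:
            self.create_server(node)
        except QuotaException:
            # Refresh our view of quota, but neither retry immediately
            # nor fail the request: abort the node (not a failure
            # indicator) and pause the handler so the launch is
            # rescheduled once quota allows.
            self.invalidate_quota_cache()
            node.state = 'aborted'
            self.handler.paused = True
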
Tobias Henkel 2da274e2ae Don't gather host keys for non-ssh connections
For an image with the connection type winrm we cannot scan the
ssh host keys, so when the connection type is not ssh we need to
skip gathering the host keys.

Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
2018-04-03 17:31:45 +02:00
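
The guard amounts to something like this (helper and attribute names are
placeholders):

    def gather_host_keys(node):
        if node.connection_type != 'ssh':
            # winrm (and other non-ssh) nodes cannot be ssh-keyscanned.
            return []
        return keyscan(node.interface_ip)  # assumed scanning helper
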
Tobias Henkel 7d79770840 Do pep8 housekeeping according to zuul rules
The pep8 rules used in nodepool are somewhat broken. In preparation for
using the pep8 ruleset from zuul, we need to fix the findings upfront.

Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
2018-01-17 02:17:45 +00:00
Tristan Cacqueray 4d201328f5 Collect request handling implementation in an OpenStack driver
This change moves OpenStack-related code to a driver. To avoid a circular
import, it also moves the StatsReporter to the stats module so that
the handlers don't have to import the launcher.

Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614
2017-07-25 14:27:17 +00:00
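
The import restructuring can be pictured like this; everything except
StatsReporter is an illustrative name:

    # stats.py: shared module; both the launcher and the handlers import
    # it, so neither needs to import the other.
    class StatsReporter:
        """Mixin that reports statsd metrics (moved here from the
        launcher by this change)."""

    # handler.py (illustrative):
    #   from stats import StatsReporter
    #   class NodeRequestHandler(StatsReporter): ...
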
David Shrewsbury 8cbe6bb4ca Add initial ZooKeeper API
This implements the API necessary to perform the ZooKeeper functionality
outlined in the "Nodepool: Use ZooKeeper for Workers" spec:

   http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-zookeeper-workers.html

This API is not used yet, but will be used and modified where necessary
in upcoming reviews based on this work.

Change-Id: I681722a1f2dc3fe13efa2baa3a1a7acd1cbe50ee
2016-07-21 14:46:04 -04:00
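
For flavor, the kind of operation such an API wraps, shown with the kazoo
client (the host, path, and payload are illustrative):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1.example.com:2181')
    zk.start()
    # Record an image build's state, creating parent znodes as needed.
    path = '/nodepool/images/fedora/builds/0000000001'
    zk.create(path, b'{"state": "building"}', makepath=True)
    data, stat = zk.get(path)
    zk.stop()
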
Monty Taylor e1f4a12949 Use shade for all OpenStack interactions
We wrote shade as an extraction of the logic we had in nodepool, and
have since expanded it to support more clouds. It's time to start
using it in nodepool, since that will allow us to add more clouds
and also to handle a wider variety of them.

Making a patch series was too tricky because of the way fakes and
threading work, so this is everything in one stab.

Depends-On: I557694b3931d81a3524c781ab5dabfb5995557f5
Change-Id: I423716d619aafb2eca5c1748bc65b38603a97b6a
Co-Authored-By: James E. Blair <jeblair@linux.vnet.ibm.com>
Co-Authored-By: David Shrewsbury <shrewsbury.dave@gmail.com>
Co-Authored-By: Yolanda Robla <yolanda.robla-mota@hpe.com>
2016-03-26 10:23:25 +01:00
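
A hedged example of the style of call shade enables in place of per-service
client code (the cloud name and server parameters are placeholders):

    import shade

    cloud = shade.openstack_cloud(cloud='my-cloud')
    # One call absorbs the cross-cloud differences (networks, floating
    # ips, and so on) that nodepool previously handled itself.
    server = cloud.create_server(name='node-0001',
                                 image='ubuntu-trusty',
                                 flavor='m1.small',
                                 wait=True)
    print(server.status)
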
Monty Taylor eed395d637 Be more specific in logging timeout exceptions
At the moment, grepping through logs to determine what's happening with
timeouts on a provider is difficult because for some errors the cause of
the timeout is on a different line than the provider in question.

Give each timeout a specific named exception, and then when we catch the
exceptions, log them specifically with node id, provider and then the
additional descriptive text from the timeout exception. This should
allow for easy grepping through logs to find specific instances of
types of timeouts - or of all timeouts. Also add a corresponding success
debug log so that comparative greps/counts are also easy.

Change-Id: I889bd9b5d92f77ce9ff86415c775fe1cd9545bbc
2016-03-04 17:42:09 -06:00
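
In sketch form (the exception name and log wording are illustrative, not
the exact ones added here):

    import logging

    log = logging.getLogger('nodepool.launch')

    class ConnectionTimeoutException(Exception):
        """Timeout waiting to connect to a new server (assumed name)."""

    def wait_for_connection(node_id, provider):
        # Placeholder wait loop that times out immediately for the demo.
        raise ConnectionTimeoutException('Timeout waiting for connection')

    try:
        wait_for_connection('0000001', 'rax-dfw')
    except ConnectionTimeoutException as e:
        # Node id, provider, and the timeout's own text on one line make
        # each timeout type easy to grep for.
        log.error('Launch failure for node %s in provider %s: %s',
                  '0000001', 'rax-dfw', e)
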
Gregory Haynes cda77d069f Builders distinguish between failure and exception
It would be great if builders distinguished between a job failure
(invalid args, config, etc.) and an exception (our code is broken). To do
this, we need to make our own exceptions and use them.

Change-Id: I31abb6fc2379ccac73b2045673eba453ac4a67a0
2016-01-12 15:33:01 -08:00
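
A minimal sketch of the split, with assumed exception names:

    class BuildFailure(Exception):
        """The job's inputs are bad (invalid args, config): a failure."""

    class BuilderError(Exception):
        """The builder's own code is broken: an exception."""

    def run_build(job):
        try:
            if 'config' not in job:
                raise BuildFailure('no config supplied')
            # ... real build work goes here and raises BuilderError on
            # internal bugs ...
            return 'SUCCESS'
        except BuildFailure:
            return 'FAILURE'    # the job is at fault
        except BuilderError:
            return 'EXCEPTION'  # our code is at fault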