Commit Graph

11 Commits

Author SHA1 Message Date
Dong Zhang aa113be19f Introduce error.capacity states_key for InsufficientInstanceCapacity error
We want to handle the "InsufficientInstanceCapacity" error differently from
other "error.unknown" errors in our monitoring/alerting system. With this
change, it produces an "error.capacity" state key instead of "error.unknown".

Change-Id: Id3a49d4b2d4b4733f801e65df69b505e913985a7
2024-01-04 15:37:51 +01:00
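
A minimal sketch of the mapping this change describes; the helper name and
the exact error-code check are assumptions, not nodepool's actual code:

    # Hypothetical helper illustrating the intended states_key mapping.
    def states_key_for_error(error_code):
        """Translate a cloud error code into a monitoring state key."""
        if error_code == 'InsufficientInstanceCapacity':
            # Capacity shortages should alert differently than
            # genuinely unknown errors.
            return 'error.capacity'
        return 'error.unknown'
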
James E. Blair a602654482 Handle invalid AWS instance in quota lookup
Early nodepool performed all cloud operations within the context of
an accepted node request.  This means that any disagreement between
the nodepool configuration and the cloud (such as what instance types,
images, networks, or other resources are actually available) would
be detected within that context and the request would be marked as
completed and failed.

When we added tenant quota support, we also added the possibility
of needing to interact with the cloud before accepting a request.
Specifically, we may now ask the cloud what resources are needed
for a given instance type (and volume, etc.) before deciding whether
to accept a request.  If we raise an exception here, the launcher
will simply loop indefinitely.

To avoid this, we will add a new exception class to indicate a
permanent configuration error which was detected at runtime.  If
AWS says an instance type doesn't exist when we try to calculate
its quota, we mark it as permanently errored in the provider
configuration, then return empty quota information back to the
launcher.

This allows the launcher to accept the request, but then immediately
mark it as failed because the label isn't available.

The state of this error is stored on the provider manager, so the
typical corrective action of updating the configuration to correct
the label config means that a new provider will be spawned with an
empty error label list; the error state will be cleared and the
launcher will try again.

Finally, an extra exception handler is added to the main launcher
loop so that if any other unhandled errors slip through, the
request will be deferred and the launcher will continue processing
requests.

Change-Id: I9a5349203a337ab23159806762cb46c059fe4ac5
2023-07-18 13:51:13 -07:00
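
A hedged sketch of the pattern described above; the class, method, and
attribute names are illustrative, not nodepool's actual ones:

    class RuntimeConfigurationException(Exception):
        """A permanent configuration error detected at runtime."""

    class ProviderManager:
        def __init__(self):
            # Instance types the cloud has told us do not exist.  A config
            # update spawns a fresh manager, which clears this implicitly.
            self.errored_instance_types = set()

        def get_instance_quota(self, instance_type):
            if instance_type in self.errored_instance_types:
                # Empty quota lets the launcher accept the request, then
                # immediately fail it because the label is unavailable.
                return {}
            try:
                return self._query_cloud(instance_type)  # assumed helper
            except RuntimeConfigurationException:
                self.errored_instance_types.add(instance_type)
                return {}
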
Fabien Boucher f57ac1881a Remove unneeded shebang and exec bit on some files
Having python files with the exec bit and a shebang defined in
/usr/lib/python-*/site-packages/ is not fine in an RPM package.

Instead of carrying a patch in nodepool RPM packaging, it is better
to fix this directly upstream.

Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
2019-12-13 19:30:03 +01:00
Tobias Henkel 31c276c234 Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the tenant or
with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try to launch the node again. Of course that will fail again in most
cases. This makes the node request, and thus the zuul job, fail with
NODE_FAILURE.

Instead we need to behave as if the quota calculation itself had
predicted the shortfall: mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
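
Roughly, the new behavior looks like this sketch (the exception, state, and
attribute names are assumed):

    class QuotaException(Exception):
        """The cloud refused a launch due to quota (assumed name)."""

    def launch(self, node):
        try:
            self.create_server(node)
        except QuotaException:
            # Refresh our view of quota, but neither retry immediately
            # nor fail the request: abort the node (not a failure
            # indicator) and pause the handler so the launch is
            # rescheduled once quota allows.
            self.invalidate_quota_cache()
            node.state = 'aborted'
            self.handler.paused = True
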
Tobias Henkel 2da274e2ae Don't gather host keys for non-ssh connections
For an image with the connection type winrm we cannot scan the
ssh host keys, so when the connection type is not ssh we need to
skip gathering the host keys.

Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
2018-04-03 17:31:45 +02:00
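
The guard amounts to something like this (helper and attribute names are
placeholders):

    def gather_host_keys(node):
        if node.connection_type != 'ssh':
            # winrm (and other non-ssh) nodes cannot be ssh-keyscanned.
            return []
        return keyscan(node.interface_ip)  # assumed scanning helper
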
Tobias Henkel 7d79770840 Do pep8 housekeeping according to zuul rules
The pep8 rules used in nodepool are somewhat broken. In preparation for
using the pep8 ruleset from zuul, we need to fix the findings upfront.

Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
2018-01-17 02:17:45 +00:00
Tristan Cacqueray 4d201328f5 Collect request handling implementation in an OpenStack driver
This change moves OpenStack-related code to a driver. To avoid a circular
import, it also moves the StatsReporter to the stats module so that
the handlers don't have to import the launcher.

Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614
2017-07-25 14:27:17 +00:00
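
The import restructuring can be pictured like this; everything except
StatsReporter is an illustrative name:

    # stats.py: shared module; both the launcher and the handlers import
    # it, so neither needs to import the other.
    class StatsReporter:
        """Mixin that reports statsd metrics (moved here from the
        launcher by this change)."""

    # handler.py (illustrative):
    #   from stats import StatsReporter
    #   class NodeRequestHandler(StatsReporter): ...
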
David Shrewsbury 8cbe6bb4ca Add initial ZooKeeper API
This implements the API necessary to perform the ZooKeeper functionality
outlined in the "Nodepool: Use ZooKeeper for Workers" spec:

   http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-zookeeper-workers.html

This API is not used yet, but will be used and modified where necessary
in upcoming reviews based on this work.

Change-Id: I681722a1f2dc3fe13efa2baa3a1a7acd1cbe50ee
2016-07-21 14:46:04 -04:00
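
For flavor, the kind of operation such an API wraps, shown with the kazoo
client (the host, path, and payload are illustrative):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1.example.com:2181')
    zk.start()
    # Record an image build's state, creating parent znodes as needed.
    path = '/nodepool/images/fedora/builds/0000000001'
    zk.create(path, b'{"state": "building"}', makepath=True)
    data, stat = zk.get(path)
    zk.stop()
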
Monty Taylor e1f4a12949 Use shade for all OpenStack interactions
We wrote shade as an extraction of the logic we had in nodepool, and
have since expanded it to support more clouds. It's time to start
using it in nodepool, since that will allow us to add more clouds
and also to handle a wider variety of them.

Making a patch series was too tricky because of the way fakes and
threading work, so this is everything in one stab.

Depends-On: I557694b3931d81a3524c781ab5dabfb5995557f5
Change-Id: I423716d619aafb2eca5c1748bc65b38603a97b6a
Co-Authored-By: James E. Blair <jeblair@linux.vnet.ibm.com>
Co-Authored-By: David Shrewsbury <shrewsbury.dave@gmail.com>
Co-Authored-By: Yolanda Robla <yolanda.robla-mota@hpe.com>
2016-03-26 10:23:25 +01:00
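
A hedged example of the style of call shade enables in place of per-service
client code (the cloud name and server parameters are placeholders):

    import shade

    cloud = shade.openstack_cloud(cloud='my-cloud')
    # One call absorbs the cross-cloud differences (networks, floating
    # ips, and so on) that nodepool previously handled itself.
    server = cloud.create_server(name='node-0001',
                                 image='ubuntu-trusty',
                                 flavor='m1.small',
                                 wait=True)
    print(server.status)
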
Monty Taylor eed395d637 Be more specific in logging timeout exceptions
At the moment, grepping through logs to determine what's happening with
timeouts on a provider is difficult because for some errors the cause of
the timeout is on a different line than the provider in question.

Give each timeout a specific named exception, and then when we catch the
exceptions, log them specifically with node id, provider and then the
additional descriptive text from the timeout exception. This should
allow for easy grepping through logs to find specific instances of
types of timeouts - or of all timeouts. Also add a corresponding success
debug log so that comparative greps/counts are also easy.

Change-Id: I889bd9b5d92f77ce9ff86415c775fe1cd9545bbc
2016-03-04 17:42:09 -06:00
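
In sketch form (the exception name and log wording are illustrative, not
the exact ones added here):

    import logging

    log = logging.getLogger('nodepool.launch')

    class ConnectionTimeoutException(Exception):
        """Timeout waiting to connect to a new server (assumed name)."""

    def wait_for_connection(node_id, provider):
        # Placeholder wait loop that times out immediately for the demo.
        raise ConnectionTimeoutException('Timeout waiting for connection')

    try:
        wait_for_connection('0000001', 'rax-dfw')
    except ConnectionTimeoutException as e:
        # Node id, provider, and the timeout's own text on one line make
        # each timeout type easy to grep for.
        log.error('Launch failure for node %s in provider %s: %s',
                  '0000001', 'rax-dfw', e)
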
Gregory Haynes cda77d069f Builders distinguish between failure and exception
It would be great if builders distinguished between a job failure
(invalid args, config, etc.) and an exception (our code is broken). To do
this, we need to make our own exceptions and use them.

Change-Id: I31abb6fc2379ccac73b2045673eba453ac4a67a0
2016-01-12 15:33:01 -08:00
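
A minimal sketch of the split, with assumed exception names:

    class BuildFailure(Exception):
        """The job's inputs are bad (invalid args, config): a failure."""

    class BuilderError(Exception):
        """The builder's own code is broken: an exception."""

    def run_build(job):
        try:
            if 'config' not in job:
                raise BuildFailure('no config supplied')
            # ... real build work goes here and raises BuilderError on
            # internal bugs ...
            return 'SUCCESS'
        except BuildFailure:
            return 'FAILURE'    # the job is at fault
        except BuilderError:
            return 'EXCEPTION'  # our code is at fault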