Commit Graph

89 Commits

Author SHA1 Message Date
David Shrewsbury 511ffd9c29
Add tox functional testing for drivers
Reorganizes the unit tests into a new subdirectory, and
adds a new functional test subdirectory (that will be further
divided into subdirs for drivers).

Change-Id: I027bba619deef9379d293fdd0457df6a1377aea8
2018-11-01 15:33:44 +01:00
Tobias Henkel b01c7821e5
Initialize label statistics to zero
The label statistics are gauges that keep their values. Currently
nodepool doesn't initialize the label-based node statistics to
zero. Because of this, any label state that shows up in the statistics
won't be reset to zero once nodepool stops sending updates to
that gauge. This leads to graphs that are mostly stuck at 1 if no
nodes are currently in that state.

This can be fixed by iterating over the supported labels and
initializing all of their states to zero.

Change-Id: I6c7f63f8f64a83b225386f6da567bfae5141be7b
2018-10-18 15:45:42 +02:00
Tobias Henkel 2da07128e6
Fix race in test_launchNode_delete_error
Deleting failed nodes is offloaded to the DeletedNodeWorker. Thus the
deletion can be a bit delayed and we need to wait for delete_success
instead of asserting it.

Change-Id: I5b425707fac5250367f6f3164e803951b10196cb
2018-09-24 07:25:32 +02:00
Zuul da8a94d341 Merge "Do not abort node launch if failed node cannot be deleted" 2018-09-17 17:58:21 +00:00
mhuin ebdc35fc5e Do not abort node launch if failed node cannot be deleted
There was a possibility of an uncaught exception when launching a
node: if the launch failed and the subsequent node cleanup failed
as well, nodepool would break out of the loop prematurely.

Also add request information when logging launch failure to make
debugging easier.

Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
2018-09-10 10:40:14 +02:00
Tobias Henkel bb3c2dbf1b
Fix label name in reported label stats
Now the node type is a list of labels. This is not considered when
updating the node stats, so the label in the reported stats is the
string representation of that list. Instead, iterate over that list and
increase the counter for every label.

Change-Id: Id64cb933e310fa056deab0b63e9e02d451de5973
2018-09-07 08:20:57 +02:00
Clark Boylan ef6be899b6 Force node hold expiration to integer type
Zuul gives us the node hold requests and it may not provide the node
hold expiration value as an integer (this issue is also being fixed at
https://review.openstack.org/597592 in zuul). To guard against hold
clients (zuul and possibly others) giving us the wrong type for the
expiration value, we coerce it to int() here if it is not None.

If the coercion fails, we assume a value of 0, which means no expiration.
This allows the node to still be held as the user expected; it just
won't expire.

Change-Id: Iaa9500490b8536470978d94fd01cbaad23d1d647
2018-08-29 10:21:56 -07:00
Zuul 168a12a0a6 Merge "zk: skip node already being deleted in cleanup leaked instance task" 2018-08-24 22:04:40 +00:00
Tristan Cacqueray 8eb099283a zk: skip node already being deleted in cleanup leaked instance task
A cloud may fail to delete a node, and in this case we need to check
whether we are already trying to delete the node. Otherwise the launcher
keeps adding artificial nodes.

Change-Id: Ic5dfc75df771a5f312099ee82f82e2561f5f4829
2018-08-24 16:56:29 -04:00
Ian Wienand cd9aa75640 Use pipelines for stats keys
Pipelines buffer stats and then send them out in more reasonably sized
chunks, helping to avoid small UDP packets going missing in a flood of
stats.  Use this in stats.py.

This needs a slight change to the assertedStats handler to extract the
combined stats.  This function is ported from Zuul, where it was updated
to handle pipeline stats (Id4f6f5a6cd66581a81299ed5c67a5c49c95c9b52), so
it is not really new code.

Change-Id: I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43
2018-07-25 16:46:13 +10:00
Zuul 1b4c92262a Merge "builder: support setting diskimage env-vars in secure configuration" 2018-07-23 16:04:00 +00:00
Tobias Henkel 934b1eed9c
Invalidate az cache on bad request
When getting error 400 we not only need to clear the image and flavor
cache but the az cache as well. Otherwise we will constantly get node
failures for any node request where nodepool chose that az
[1]. Currently the only way to recover from this situation is to
restart nodepool. Invalidating the cache doesn't fix the request that
failed due to this error but at least ensures that nodepool will
recover from this situation automatically for all further node
requests.

[1] Trace:
2018-07-05 09:09:08,477 ERROR nodepool.NodeLauncher-0000123378: Launch attempt 2/3 failed for node 0000123378:
Traceback (most recent call last):
  File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 221, in launch
    self._launchNode()
  File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 134, in _launchNode
    volume_size=self.label.volume_size)
  File "/opt/nodepool-source/nodepool/driver/openstack/provider.py", line 378, in createServer
    return self._client.create_server(wait=False, **create_args)
  File "<decorator-gen-106>", line 2, in create_server
  File "/usr/lib/python3.5/site-packages/shade/_utils.py", line 410, in func_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/shade/openstackcloud.py", line 6909, in create_server
    endpoint, json=server_json)
  File "/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 334, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 158, in request
    return self._munch_response(response, error_message=error_message)
  File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 114, in _munch_response
    exc.raise_from_response(response, error_message=error_message)
  File "/usr/lib/python3.5/site-packages/shade/exc.py", line 171, in raise_from_response
    raise OpenStackCloudBadRequest(msg, response=response)
shade.exc.OpenStackCloudBadRequest: (400) Client Error for url: (...) The requested availability zone is not available

Change-Id: I5f653f159b08cf086d20c2398a9345bd4caa4d1e
2018-07-23 14:04:08 +02:00
Zuul 4fd407b064 Merge "Add ability to ignore provider quota for a pool" 2018-07-12 22:03:39 +00:00
Jesse Pretorius 4c8b5f4f99 Add ability to ignore provider quota for a pool
In some circumstances it is useful to tell the launcher to
ignore the provider quota and just trust the max-* settings
for that pool instead.

This particular need arises when using Rackspace public cloud
for both instances and OnMetal nodes. In this situation the
quota for instances and for OnMetal nodes is different, but
nodepool only queries the quota for instances. When trying to
build OnMetal nodes, the quota check fails - but should not.

In this circumstance, instead of making shade/nodepool more
complicated by figuring out how to calculate disparate quota
types, it makes sense to simply allow nodepool to ignore
the quota for a pool and try executing the build instead.

While this is our use-case, it may also be useful to others
for other reasons.
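
For illustration, a pool opting out of quota checks might look like
this (a sketch only: the provider and label names are made up, and the
exact option spelling is defined by this change):

providers:
  - name: rax-onmetal               # hypothetical provider name
    driver: openstack
    pools:
      - name: main
        ignore-provider-quota: true  # trust the max-* settings, skip the quota query
        max-servers: 10
        labels:
          - name: onmetal-node       # hypothetical label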

Change-Id: I232a1ab365795381ab180aceb48e8c87843ac713
2018-07-11 17:37:55 +00:00
Tobias Henkel ce9120192b
Cleanup test_over_quota
The test_over_quota test has some debug lines that should be removed. Also
make the copy of the nodeset more intuitive.

Change-Id: I9a168bc1446e3bb479cace68ed932401ebcdd2bf
2018-07-06 19:35:02 +02:00
Tobias Henkel 31c276c234
Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the tenant or
with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request and thus the zuul job fail with
NODE_FAILURE.

Instead we need to pause the handler, just as would happen if the
quota calculation itself said we were at quota. We mark the node as
aborted, which is not a failure indicator, and pause the handler so it
will automatically reschedule a new node launch as soon as the quota
calculation allows it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
David Shrewsbury e66e5a56b6 Add test for referencing cloud image by ID
This adds a test for the image-id cloud image attribute, which
should cause shade to perform a direct lookup on the image by
UUID, bypassing the typical name-or-id logic.

Change-Id: I8b2b99aac012e0f1adda326dfe76d9f90f606cb1
Co-Authored-By: James E. Blair <jeblair@redhat.com>
2018-07-03 14:13:56 -04:00
David Shrewsbury 58ccd2c718 Fix paused handler exception handling
If a request handler was paused and an exception was thrown during
normal event processing, the exception handling code did not properly
unpause the request and remove it from the list of active handlers.

Change-Id: I0cfd8294f9ce88e5918654a4847776c981ee4fb3
2018-06-15 09:05:04 -04:00
James E. Blair e20858755f Have Drivers create Providers
Use the new Driver class to create instances of Providers

Change-Id: Idfbde8d773a971133b49fbc318385893be293fac
2018-06-06 14:57:40 -04:00
Artem Goncharov 674c9516dc Add support for specifying security_group in nodepool
In some installations it might not be feasible to rely on the default
security group (for security reasons). In order to also make it possible
to share one tenant between zuul and other resources, support for
specifying security_groups at the driver.openstack.pool level is added.
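
For example, a pool might specify its security groups like so (a
sketch: the cloud, pool, group and label names are made up, and the
exact option spelling is defined by this change):

providers:
  - name: example-cloud
    driver: openstack
    pools:
      - name: main
        security-groups:        # instances in this pool are attached to these groups
          - zuul-nodes
        labels:
          - name: centos-7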

Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
2018-06-05 10:00:06 +02:00
Tristan Cacqueray eca37d13ea builder: support setting diskimage env-vars in secure configuration
This change enables using diskimage-builder elements with secrets securely.
For example, a rhel diskimage needs a REG_PASSWORD that could be defined in
the secure file like so:

diskimages:
  - name: rhel-7
    env-vars:
      REG_PASSWORD: secret-password

Change-Id: I814318ae0b5c9e4665f3fa3f011d8a687b540fac
2018-05-29 23:48:13 +00:00
David Shrewsbury 2a2729b3b4 Add multilabel support to ZooKeeper Node model
Support storing multiple node types/labels for a Node() in the
ZooKeeper data.

Neither the OpenStack nor the static drivers make use of this yet;
they assume there will only be one value in the type/label list.

In the request handler code, we must now store the satisfied types
(rather than rely on the Node.type in self.nodeset) since that can
contain multiple values and we can't tell which value was used to
satisfy the request. The LabelRecorder class helps do this, and also
helps with assigning nodes to a request in the proper order, based
on the order of the labels in the request.

Change-Id: Ice46e22183c4e7354eaa66504ea147aff5ee0df3
2018-05-24 14:37:34 -04:00
David Shrewsbury be0c59535a Simplify driver API
Further decouple the driver API from the provider configuration
by simplifying the driver interface (which will eventually lead to
the driver deciding whether or not to support multiple labels) and
removing config references in NodeLauncher.

This significantly simplifies the driver API by:

  * Reducing the launching of nodes to a launch() and launchesCompleted()
    interface. The launchesCompleted() method acts as a periodic check on
    whether the nodeset launches have all completed.

  * How the drivers actually launch nodes is decoupled from the driver
    API by not having the return of NodeRequestHandler.launch() be either
    a thread or None. This leaves it entirely up to the driver to decide
    the best way to manage launches... threads, subprocesses, messaging,
    etc. E.g., thread handling is moved from driver API into OpenStack driver.

  * Drivers no longer need to implement the very confusing pollLauncher()
    method. This was originally meant to poll the launch threads, but
    even drivers not using threads (e.g., the static driver) had to
    reimplement this method. That made no sense.

The NodeLauncher class (currently used only by the OpenStack driver),
is moved to a driver utility module that other drivers may choose to
use or ignore. And it no longer directly accesses the provider config.

The test_nodelaunchmanager.py tests were originally testing the
NodeLaunchManager class (removed quite some time ago) which managed
launch threads. This was eventually moved to the driver interface.
This change further moves thread management into the OpenStack driver.
As a result, these tests are redundant since the test_launcher.py
tests cover all of this. So remove test_nodelaunchmanager.py.

Change-Id: I407d999b3608e1614c0dbe49808b2a64756dde58
2018-05-18 11:24:59 -04:00
Zuul e881199916 Merge "Refactor NodeLauncher to be generic" 2018-04-27 14:05:17 +00:00
David Shrewsbury e1bd967001 Fix race in test_hold_expiration_no_default
The test_hold_expiration_no_default test was misconfigured to use
the nodepool config for the max ready age tests, which was causing
a race in the test where the node could be removed (because it exceeded
the max ready age) before the hold could be set. This fixes it to use
a new config, and lowers the TTL on the held node to speed up the test.

Change-Id: Iaeb16e7a11f8b5c7bb2b363b7939fee7933393c4
2018-04-25 15:25:01 -04:00
Tristan Cacqueray 11f0ffd201 Refactor NodeLauncher to be generic
This change refactors the NodeLauncher object so that it is part
of the generic interface. A nodepool driver handler only needs to
return a subclass implementing the launch method.

Moreover this change adapts the StatsReporter class.

Change-Id: I6cfec650b862cb4fa0cb391bcc1248549e30c91b
2018-04-19 02:23:42 +00:00
Zuul 20c6646bf0 Merge "Refactor run_handler to be generic" 2018-04-18 15:38:11 +00:00
Zuul 7ddee72b51 Merge "Add connection-port to provider diskimage" 2018-04-13 15:20:58 +00:00
Zuul 07327ab296 Merge "Don't gather host keys for non ssh connections" 2018-04-13 15:20:56 +00:00
Zuul 38cd023ce0 Merge "launcher: handle ZK session loss during handler poll" 2018-04-12 19:07:27 +00:00
Zuul b44e8491b3 Merge "Handle ZK session loss during node launch" 2018-04-12 18:03:29 +00:00
Tristan Cacqueray f5dc749f2a Refactor run_handler to be generic
This change refactors the OpenStackNodeRequestHandler.run_handler method
so that it is part of the generic interface. A nodepool driver handler now
only needs to implement the launch() method.

Moreover, this change merges the NodeLaunchManager into the RequestHandler.

This change adds the following methods to the handler interface:

* nodeReused(node): notify handler a node is re-used
* checkReusableNode(node): verify if a node can be re-used
* setNodeMetadata(node): store extra node metadata such as az
* pollLauncher()

Change-Id: I5b18b49f50e2b416b5e5c4e28e1a4a6a1df1c654
2018-04-12 02:50:49 +00:00
Tristan Cacqueray f5eb867ad7 launcher: handle ZK session loss during handler poll
Stop polling handlers for which the NodeRequest lock is gone.

Change-Id: Iba31cb6804c94fea34a6dc911dffdd7ba2ebdc6c
2018-04-12 01:56:38 +00:00
Tobias Henkel 687f120b3c
Add connection-port to provider diskimage
The connection port should be included in the provider diskimage.
This makes it possible to define images that use other ports for
connections, e.g. winrm for Windows, which runs on a different port than 22.
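
An illustrative provider diskimage entry (a sketch: provider and image
names are made up, 5986 assumes WinRM over HTTPS, and the exact option
spelling is defined by this change):

providers:
  - name: example-cloud
    diskimages:
      - name: windows-server      # hypothetical Windows image
        connection-port: 5986     # connect on the WinRM port instead of ssh port 22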

Change-Id: Ib4b335ffbcc4dc71704c06387377675a4206c663
2018-04-03 17:48:52 +02:00
Tobias Henkel 2da274e2ae
Don't gather host keys for non ssh connections
In the case of an image with the connection type winrm we cannot scan
the ssh host keys. So if the connection type is not ssh we
need to skip gathering the host keys.
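
Such an image would be declared with a non-ssh connection type, e.g.
(a sketch; provider and image names are made up):

providers:
  - name: example-cloud
    diskimages:
      - name: windows-server
        connection-type: winrm    # host key scanning is skipped for non-ssh types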

Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
2018-04-03 17:31:45 +02:00
David Shrewsbury 763fa7ebfe Handle ZK session loss during node launch
Node launches can take a while, thus increasing the chances of
encountering a lost connection to ZooKeeper, which could lead to
a session loss. Let's try to handle that by not re-attempting any
launches (pointless because we've lost our node lock and the node
could be deleted out from under us).

Change-Id: Iaca005355e8d03e1e5ab7cdab0ea3c04117ca6fb
2018-03-28 10:16:23 -04:00
Paul Belanger 2286f2432c Add host-key-checking option to openstack providers
In some cases nodepool-launcher uses the public API to launch nodes, but
doesn't have access to the private networks of the nodes it launches.
Rather than failing, expose an option for operators to disable
ssh-keyscan and allow nodes to become ready.
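
For example (a sketch: provider and pool names are made up, and the
exact option spelling is defined by this change):

providers:
  - name: example-cloud
    pools:
      - name: main
        host-key-checking: false   # skip ssh-keyscan; nodes become ready without it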

Change-Id: I764398aa21461ef44048e9e6565d2ee3e01aaaf8
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-03-26 22:29:14 +00:00
Zuul 13e705edbc Merge "Store label info with launcher registration" 2018-03-08 03:01:31 +00:00
David Shrewsbury bcd8a2c132 Store label info with launcher registration
It can be useful to know all of the labels that nodepool can support
without having to examine nodepool configuration files. We can store
this information when we register each launcher at startup, and on
configuration changes.

This stores supported labels for each launcher in ZooKeeper. A
"supported label" here is one that is actually bootable and referenced
within the provider pool configuration sections. This is made slightly
more complex because each nodepool driver handles its own configuration,
so we have to check driver type before deciding how to get at the
labels.

It also introduces a new model class (Launcher) to abstract away the
information that we store in ZooKeeper so that we can later more easily
add other information that could be useful (such as per-tenant info).

Change-Id: Icfb73fbe3b67321235a78ea7ed9bf4319567eb1a
2018-03-07 16:42:40 -05:00
Zuul a77eba273b Merge "Add unit test for multiple launchers" 2018-03-07 16:12:24 +00:00
mhuin a59528c8e8 Clean held nodes automatically after configurable timeout
Introduce a new configuration setting, "max_hold_age", that specifies
the maximum uptime of held instances. If set to 0, held instances
are kept until manually deleted. A custom value can be provided
at the rpcclient level.
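
For illustration, in the launcher configuration this could look like
the following (a sketch only; the exact setting name and location are
defined by this change, shown here with nodepool's usual hyphenated
spelling):

max-hold-age: 86400    # held nodes are cleaned up after one day (in seconds)
                       # 0 would keep held nodes until manually deleted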

Change-Id: I9a09728e5728c537ee44721f5d5e774dc0dcefa7
2018-02-20 16:13:55 +01:00
Paul Belanger b71da985ea
Add unit test for multiple launchers
Here we are adding a base test to show the current workings of
multiple launchers, where one nodepool-launcher is responsible for
managing min-ready.

Change-Id: I4d18b1ba152fef90a58cb9e8bf9157e1fbafdb09
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-02-08 15:48:48 -05:00
David Shrewsbury e054b74c52 Do not delete unused but allocated nodes
Let's assume an allocated node is *going* to be used (maybe zuul
hasn't got around to locking it quickly enough). We don't want
to assign a node, then delete it out from underneath the requestor.

Change-Id: I092cffbfd347684f66d6e7cbd0f910d40858580b
2018-02-06 12:01:54 -05:00
David Shrewsbury 284e0b52bf Provider wedge fix
If a provider is going to pause due to insufficient quota, we attempt
to delete the oldest unused node (one that is READY and unlocked) to
guarantee that some quota will be freed.

To prevent us from issuing unnecessary delete requests, if we find a
node already being deleted within a recent timeframe (here, 5 minutes),
we do not issue another delete.

NOTE: This causes some assumptions in tests depending on the current
pause behavior (no proactive deletion) to no longer be valid, so steps
are taken in those tests to prevent the proactive delete from happening.

Change-Id: I8d94c60fe1ab184503592a02d6ca458f94a2ea3d
2018-02-02 16:42:13 -05:00
David Shrewsbury 6c6cbfc209 Partial revert for disabled provider change
Commit 1cf2cddf9d attempted to
short-circuit request handling when a provider was disabled by
setting max-servers to 0. That prevents the provider from declining
any requests, and thus prevents a request from ever reaching
FAILURE status if none of the other providers can satisfy it (the
request declined_by array is compared against the list of registered
launchers).

This adds a test for this case (which was run against the current
code and proved it was broken).

Change-Id: I8a2f9771b55b08259ecffa3475bdd2d4f2654464
2018-01-29 13:28:17 -05:00
Clark Boylan fd2caae47e Fix race in test_failed_provider
In test_failed_provider we determine which node is the second node that
fake-provider2 booted and then delete it. Unfortunately there is a race
where we could find the first node booted in this provider and re-delete
it instead of deleting the second node, which never frees up
provider quota and causes the test to time out and fail.

We fix this by explicitly checking the node state when looking for the
second node booted in the provider. The first node will be in a deleting
state or just won't exist so checking for a ready node should be
sufficient to always find the second node.

We also fix a small bug in how we raise the exception that causes
fake-provider's launch request to fail. We need a single argument to the
monkey-patched launch() function, otherwise we get a TypeError from
python instead of the KeyError that we want.

Change-Id: Ia2fad9736be76d6ca9a923c8e0643a96a9d1dbe6
2018-01-27 15:50:30 -08:00
Zuul 29bba0c1ef Merge "Fix race in test_provider_removal" 2018-01-24 20:30:59 +00:00
David Shrewsbury 2e0d144e7e Fix race in test_provider_removal
Turns out that waiting for the config to be non-None in the
Nodepool object is not sufficient to guarantee that our threads
for the providers have started. Instead, wait for our nodes to
appear, which definitely guarantees we have started the threads.

Change-Id: I8eb189c91bd315bbd52081b81556cde3e29d7e5b
2018-01-22 14:24:43 -05:00
Tristan Cacqueray d0a67878a3 Add a plugin interface for drivers
This change adds a plugin interface so that drivers can be loaded dynamically.
Instead of importing each driver in the launcher, provider_manager and config,
the Drivers class discovers and loads drivers from the driver directory.

This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.

Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
2018-01-19 00:45:56 +00:00
Clark Boylan 63dbab87df Fix node data retrieval race in test_failed_provider
Once we start deleting nodes in test_failed_provider we can't count on
getNode() giving us a node for a previous request because that node may
have been deleted. Check that node is a valid value before accessing its
provider value to avoid this race.

Change-Id: Ie6b20f1a66de7865223a7ecf5785e2446afc7804
2018-01-18 13:19:02 -08:00