Reorganizes the unit tests into a new subdirectory, and
adds a new functional test subdirectory (that will be further
divided into subdirs for drivers).
Change-Id: I027bba619deef9379d293fdd0457df6a1377aea8
The label statistics are gauges, which keep their last value. Currently
nodepool doesn't initialize the label-based node statistics to
zero. Because of this, once a label state that showed up in the
statistics is no longer present, nodepool stops sending updates to
that gauge and its value is never reset to zero. This leads to graphs
that are stuck at 1 even though no nodes are currently in that state.
This can be fixed by iterating over the supported labels and
initializing all states of them to zero.
Change-Id: I6c7f63f8f64a83b225386f6da567bfae5141be7b
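The fix can be sketched roughly as follows; a plain dict stands in for the statsd client here, and the state names and key format are illustrative rather than nodepool's actual API:

```python
# Hypothetical sketch of the fix: iterate over the supported labels and
# initialize every per-state gauge to zero so stale values cannot stick.
# The dict stands in for statsd; names are illustrative only.
STATES = ('building', 'ready', 'in-use', 'deleting', 'failed')

def init_label_stats(gauges, labels, states=STATES):
    for label in labels:
        for state in states:
            gauges['nodepool.label.%s.nodes.%s' % (label, state)] = 0

gauges = {}
init_label_stats(gauges, ['ubuntu-xenial', 'centos-7'])
```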
Deleting failed nodes is offloaded to the DeletedNodeWorker. Thus the
deletion can be a bit delayed and we need to wait for delete_success
instead of asserting it.
Change-Id: I5b425707fac5250367f6f3164e803951b10196cb
There was a possibility for an uncaught exception when launching a
node: if the launch fails and the subsequent node cleaning fails
as well, nodepool would break out of the loop prematurely.
Also add request information when logging launch failure to make
debugging easier.
Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
Now the node type is a list of labels. This is not considered when
updating the node stats, so the label in the reported stats is the
string representation of that list. Instead, iterate over that list
and increase the counter for every label.
Change-Id: Id64cb933e310fa056deab0b63e9e02d451de5973
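A minimal sketch of the per-label counting, assuming a Counter keyed by label; function and key names are illustrative:

```python
from collections import Counter

# Sketch: node.type may now be a list of labels. Instead of keying the
# stats by the list's string representation, bump the counter once per
# label. Names are illustrative, not nodepool's actual API.
def update_label_counts(counts, node_type):
    labels = [node_type] if isinstance(node_type, str) else node_type
    for label in labels:
        counts[label] += 1

counts = Counter()
update_label_counts(counts, ['small', 'linux'])
update_label_counts(counts, 'small')  # old-style single label still works
```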
Zuul gives us the node hold requests and it may not provide the node
hold expiration value as an integer (this issue is also being fixed at
https://review.openstack.org/597592 in zuul). To guard against hold
clients (zuul and possibly others) giving us the wrong type for the
expiration values we coerce it to int() here if not None.
If the coercion fails we assume a value of 0 which means no expiration.
This allows the node to still be held as the user expected, it just
won't expire.
Change-Id: Iaa9500490b8536470978d94fd01cbaad23d1d647
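The coercion described above can be sketched like this; the function name is illustrative:

```python
# Sketch of the coercion: None passes through, anything else is coerced
# to int, and a failed coercion falls back to 0 (no expiration), so the
# node is still held as the user expected.
def coerce_hold_expiration(value):
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0
```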
A cloud may fail to delete a node, and in this case we need to check
whether we are not already trying to delete the node. Otherwise the
launcher keeps adding artificial nodes.
Change-Id: Ic5dfc75df771a5f312099ee82f82e2561f5f4829
Pipelines buffer stats and then send them out in more reasonably sized
chunks, helping to avoid small UDP packets going missing in a flood of
stats. Use this in stats.py.
This needs a slight change to the assertedStats handler to extract the
combined stats. This function is ported from Zuul where we updated to
handle pipeline stats (Id4f6f5a6cd66581a81299ed5c67a5c49c95c9b52) so
it is not really new code.
Change-Id: I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43
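The buffering idea can be illustrated with a toy pipeline (not the real statsd library): stats accumulate and are flushed as one newline-joined payload instead of one UDP packet per stat:

```python
# Toy stand-in for a statsd pipeline, for illustration only: incr()
# buffers stats in the statsd wire format and send() flushes them as a
# single newline-joined payload.
class ToyPipeline:
    def __init__(self):
        self._buf = []
        self.sent = []  # stands in for the UDP socket

    def incr(self, key, count=1):
        self._buf.append('%s:%d|c' % (key, count))

    def send(self):
        if self._buf:
            self.sent.append('\n'.join(self._buf))
            self._buf = []

pipe = ToyPipeline()
pipe.incr('nodepool.launch.ready')
pipe.incr('nodepool.launch.attempts', 2)
pipe.send()
```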
When getting error 400 we not only need to clear the image and flavor
cache but the az cache as well. Otherwise we will constantly get node
failures for any node request where nodepool chose that az
[1]. Currently the only way to recover from this situation is to
restart nodepool. Invalidating the cache doesn't fix the request that
failed due to this error but at least ensures that nodepool will
recover from this situation automatically for all further node
requests.
[1] Trace:
2018-07-05 09:09:08,477 ERROR nodepool.NodeLauncher-0000123378: Launch attempt 2/3 failed for node 0000123378:
Traceback (most recent call last):
File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 221, in launch
self._launchNode()
File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 134, in _launchNode
volume_size=self.label.volume_size)
File "/opt/nodepool-source/nodepool/driver/openstack/provider.py", line 378, in createServer
return self._client.create_server(wait=False, **create_args)
File "<decorator-gen-106>", line 2, in create_server
File "/usr/lib/python3.5/site-packages/shade/_utils.py", line 410, in func_wrapper
return func(*args, **kwargs)
File "/usr/lib/python3.5/site-packages/shade/openstackcloud.py", line 6909, in create_server
endpoint, json=server_json)
File "/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 334, in post
return self.request(url, 'POST', **kwargs)
File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 158, in request
return self._munch_response(response, error_message=error_message)
File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 114, in _munch_response
exc.raise_from_response(response, error_message=error_message)
File "/usr/lib/python3.5/site-packages/shade/exc.py", line 171, in raise_from_response
raise OpenStackCloudBadRequest(msg, response=response)
shade.exc.OpenStackCloudBadRequest: (400) Client Error for url: (...) The requested availability zone is not available
Change-Id: I5f653f159b08cf086d20c2398a9345bd4caa4d1e
In some circumstances it is useful to tell the launcher to
ignore the provider quota and just trust the max-* settings
for that pool instead.
This particular need arises when using Rackspace public cloud
for both instances and OnMetal nodes. In this situation the
quota for instances and for OnMetal nodes is different, but
nodepool only queries the quota for instances. When trying to
build OnMetal nodes, the quota check fails - but should not.
In this circumstance, instead of making shade/nodepool
complicated by figuring out how to calculate disparate quota
types, it makes sense to rather just allow nodepool to ignore
the quota for a pool to try executing the build instead.
While this is our use-case, it may also be useful to others
for other reasons.
Change-Id: I232a1ab365795381ab180aceb48e8c87843ac713
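As a sketch, such a pool-level switch might look like the following in the nodepool config; the option name ignore-provider-quota is an assumption for illustration:

```yaml
providers:
  - name: rax-onmetal
    pools:
      - name: main
        # Assumed option name, for illustration: skip the cloud quota
        # query and trust only the max-* settings below.
        ignore-provider-quota: true
        max-servers: 10
```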
The test_over_quota test has some debug lines that should be
removed. Also make the copying of the nodeset more intuitive.
Change-Id: I9a168bc1446e3bb479cace68ed932401ebcdd2bf
The quota calculations in nodepool can never be perfectly accurate
because there can still be some races with other launchers in the
tenant or with other workloads sharing the tenant. So we must be able
to gracefully handle the case when the cloud refuses a launch because
of quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request, and thus the zuul job, fail with
NODE_FAILURE.
Instead we need to pause the handler like it would happen because of
the quota calculation. We mark the node as aborted, which is not a
failure indicator, and pause the handler so it will automatically
reschedule a new node launch as soon as the quota calculation allows
it.
Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
This adds a test for the image-id cloud image attribute, which
should cause shade to perform a direct lookup on the image by
UUID (bypassing the typical name-or-id logic).
Change-Id: I8b2b99aac012e0f1adda326dfe76d9f90f606cb1
Co-Authored-By: James E. Blair <jeblair@redhat.com>
If a request handler is paused and an exception is thrown during
normal events, that exception handling code was not properly unpausing
the request and removing it from the list of active handlers.
Change-Id: I0cfd8294f9ce88e5918654a4847776c981ee4fb3
In some installations it might be impractical to rely on the default
security group (for security reasons). To also make it possible to
share one tenant between zuul and other resources, support for
specifying security_groups at the driver.openstack.pool level is added.
Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
This change enables using diskimage-builder elements with secrets securely.
For example, a rhel diskimage needs a REG_PASSWORD that could be defined in
the secure file like so:
diskimages:
  - name: rhel-7
    env-vars:
      REG_PASSWORD: secret-password
Change-Id: I814318ae0b5c9e4665f3fa3f011d8a687b540fac
Support storing multiple node types/labels for a Node() in the
ZooKeeper data.
Neither the OpenStack nor the static drivers make use of this yet;
they assume there will only be one value in the type/label list.
In the request handler code, we must now store the satisfied types
(rather than rely on the Node.type in self.nodeset) since that can
contain multiple values and we can't tell which value was used to
satisfy the request. The LabelRecorder class helps do this, and also
helps with assigning nodes to a request in the proper order, based on
the order of the labels in the request.
Change-Id: Ice46e22183c4e7354eaa66504ea147aff5ee0df3
Further decouple the driver API from the provider configuration
by simplifying the driver interface (which will eventually lead to
the driver deciding whether or not to support multiple labels) and
removing config references in NodeLauncher.
This significantly simplifies the driver API by:
* Reducing the launching of nodes to a launch() and launchesCompleted()
interface. The launchesCompleted() method acts as a periodic check on
whether the nodeset launches have all completed.
* How the drivers actually launch nodes is decoupled from the driver
API by not having the return of NodeRequestHandler.launch() be either
a thread or None. This leaves it entirely up to the driver to decide
the best way to manage launches... threads, subprocesses, messaging,
etc. E.g., thread handling is moved from driver API into OpenStack driver.
* Drivers no longer need to implement the very confusing pollLauncher()
method. This was originally meant to poll the launch threads, but
even drivers not using threads (e.g., the static driver) had to
reimplement this method. That made no sense.
The NodeLauncher class (currently used only by the OpenStack driver),
is moved to a driver utility module that other drivers may choose to
use or ignore. And it no longer directly accesses the provider config.
The test_nodelaunchmanager.py tests were originally testing the
NodeLaunchManager class (removed quite some time ago), which managed
launch threads. This functionality was eventually moved to the driver
interface.
This change further moves thread management into the OpenStack driver.
As a result, these tests are redundant since the test_launcher.py
tests cover all of this. So remove test_nodelaunchmanager.py.
Change-Id: I407d999b3608e1614c0dbe49808b2a64756dde58
The test_hold_expiration_no_default test was misconfigured to use
the nodepool config for the max ready age tests, which was causing
a race in the test where the node could be removed (b/c it exceeded
the max ready age) before it could set a hold. This fixes it to use
a new config, and lowers the TTL on the held node to quicken the test.
Change-Id: Iaeb16e7a11f8b5c7bb2b363b7939fee7933393c4
This change refactors the NodeLauncher object so that it is part
of the generic interface. A nodepool driver handler only needs to
return a subclass implementing the launch method.
Moreover, this change adapts the StatsReporter class.
Change-Id: I6cfec650b862cb4fa0cb391bcc1248549e30c91b
This change refactors the OpenStackNodeRequestHandler.run_handler method
so that it is part of the generic interface. A nodepool driver handler now
only needs to implement the launch() method.
Moreover, this change merges the NodeLaunchManager into the RequestHandler.
This change adds the following methods to the handler interface:
* nodeReused(node): notify the handler that a node is re-used
* checkReusableNode(node): verify whether a node can be re-used
* setNodeMetadata(node): store extra node metadata such as az
* pollLauncher()
Change-Id: I5b18b49f50e2b416b5e5c4e28e1a4a6a1df1c654
The connection port should be included in the provider diskimage.
This makes it possible to define images using other connection ports,
e.g. winrm for Windows, which runs on a different port than 22.
Change-Id: Ib4b335ffbcc4dc71704c06387377675a4206c663
In case of an image with the winrm connection type we cannot scan the
ssh host keys. So if the connection type is not ssh we need to skip
gathering the host keys.
Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
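The check described above can be sketched like this; the function and parameter names are illustrative, with the scanner passed in as a callable:

```python
# Sketch: host keys are only gathered when the image's connection type
# is ssh; otherwise an empty key list is returned. Names are
# illustrative, not nodepool's actual API.
def gather_host_keys(connection_type, interface_ip, scan):
    if connection_type != 'ssh':
        return []
    return scan(interface_ip)

keys = gather_host_keys('winrm', '10.0.0.5', scan=lambda ip: ['ssh-rsa AAAA'])
```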
Node launches can take a while, thus increasing the chances of it
encountering a lost connection to ZooKeeper, which could lead to
a session loss. Let's try to handle that by not re-attempting any
launches (pointless because we've lost our node lock and the node
could be deleted out from under us).
Change-Id: Iaca005355e8d03e1e5ab7cdab0ea3c04117ca6fb
In some cases nodepool-launcher uses public API to launch nodes, but
doesn't have access to the private networks of nodes it launches.
Rather than failing, expose an option for operators to disable
ssh-keyscan and allow nodes to become ready.
Change-Id: I764398aa21461ef44048e9e6565d2ee3e01aaaf8
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
It can be useful to know all of the labels that nodepool can support
without having to examine nodepool configuration files. We can store
this information when we register each launcher at startup, and on
configuration changes.
This stores supported labels for each launcher in ZooKeeper. A
"supported label" here is one that is actually bootable and referenced
within the provider pool configuration sections. This is made slightly
more complex because each nodepool driver handles its own configuration,
so we have to check driver type before deciding how to get at the
labels.
It also introduces a new model class (Launcher) to extract away the
information that we store in ZooKeeper so that we can later more easily
add other information that could be useful (such as per-tenant info).
Change-Id: Icfb73fbe3b67321235a78ea7ed9bf4319567eb1a
Introduce a new configuration setting, "max_hold_age", that specifies
the maximum uptime of held instances. If set to 0, held instances
are kept until manually deleted. A custom value can be provided
at the rpcclient level.
Change-Id: I9a09728e5728c537ee44721f5d5e774dc0dcefa7
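The expiration rule can be sketched as follows; names are illustrative, and a max hold age of 0 means held nodes are kept until manually deleted:

```python
import time

# Sketch of the hold-age check: a held node expires once its hold age
# exceeds the configured maximum; 0 disables expiration entirely.
# Names are illustrative, not nodepool's actual API.
def hold_expired(held_since, max_hold_age, now=None):
    if max_hold_age <= 0:
        return False
    if now is None:
        now = time.time()
    return (now - held_since) > max_hold_age
```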
Here we are adding a base test to show the current workings of
multiple launchers, where one nodepool-launcher is responsible for
managing min-ready.
Change-Id: I4d18b1ba152fef90a58cb9e8bf9157e1fbafdb09
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Let's assume an allocated node is *going* to be used (maybe zuul
hasn't got around to locking it quickly enough). We don't want
to assign a node, then delete it out from underneath the requestor.
Change-Id: I092cffbfd347684f66d6e7cbd0f910d40858580b
If a provider is going to pause due to insufficient quota, we attempt
to delete the oldest unused node (one that is READY and unlocked) to
guarantee that some quota will be freed.
To prevent us from issuing unnecessary delete requests, if we find a
node already being deleted within a recent timeframe (here, 5 minutes),
we do not issue another delete.
NOTE: This causes some assumptions in tests depending on the current
pause behavior (no proactive deletion) to no longer be valid, so steps
are taken in those tests to prevent the proactive delete from happening.
Change-Id: I8d94c60fe1ab184503592a02d6ca458f94a2ea3d
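Choosing the delete candidate can be sketched like this; plain dicts stand in for nodepool's node objects and all field names are illustrative:

```python
# Sketch: pick the oldest READY, unlocked node as the delete candidate
# so some quota is guaranteed to be freed. Field names are illustrative.
def oldest_unused_node(nodes):
    candidates = [n for n in nodes
                  if n['state'] == 'ready' and not n['locked']]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n['state_time'])

nodes = [
    {'id': '1', 'state': 'ready', 'locked': True, 'state_time': 10},
    {'id': '2', 'state': 'ready', 'locked': False, 'state_time': 20},
    {'id': '3', 'state': 'ready', 'locked': False, 'state_time': 30},
    {'id': '4', 'state': 'in-use', 'locked': False, 'state_time': 5},
]
victim = oldest_unused_node(nodes)
```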
Commit 1cf2cddf9d attempted to
short-circuit request handling when a provider was disabled by
setting max-servers to 0. That prevents the provider from declining
any requests, and thus prevents a request from ever reaching
FAILURE status if none of the other providers can satisfy it (the
request declined_by array is compared against the list of registered
launchers).
This adds a test for this case (which was run against the current
code and proved it was broken).
Change-Id: I8a2f9771b55b08259ecffa3475bdd2d4f2654464
In test_failed_provider we determine which node fake-provider2 booted
second and then delete it. Unfortunately there is a race where we
could find the first node booted in this provider and re-delete it
instead, never deleting the second node. That never frees up provider
quota, which causes the test to time out and fail.
We fix this by explicitly checking the node state when looking for the
second node booted in the provider. The first node will be in a deleting
state or just won't exist so checking for a ready node should be
sufficient to always find the second node.
We also fix a small bug in how we raise the exception that causes
fake-provider's launch request to fail. We need a single argument to
the monkey-patched launch() function, otherwise we get a TypeError
from python instead of the KeyError that we want.
Change-Id: Ia2fad9736be76d6ca9a923c8e0643a96a9d1dbe6
Turns out that waiting for the config to be non-None in the
Nodepool object is not sufficient to guarantee that our threads
for the providers have started. Instead, wait for our nodes to
appear, which definitely guarantees we have started the threads.
Change-Id: I8eb189c91bd315bbd52081b81556cde3e29d7e5b
This change adds a plugin interface so that drivers can be loaded dynamically.
Instead of importing each driver in the launcher, provider_manager and config,
the Drivers class discovers and loads drivers from the driver directory.
This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.
Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
Once we start deleting nodes in test_failed_provider we can't count on
getNode() giving us a node for a previous request because that node may
have been deleted. Check that node is a valid value before accessing its
provider value to avoid this race.
Change-Id: Ie6b20f1a66de7865223a7ecf5785e2446afc7804