Commit Graph

89 Commits

Author SHA1 Message Date
David Shrewsbury 511ffd9c29
Add tox functional testing for drivers
Reorganizes the unit tests into a new subdirectory, and
adds a new functional test subdirectory (that will be further
divided into subdirs for drivers).

Change-Id: I027bba619deef9379d293fdd0457df6a1377aea8
2018-11-01 15:33:44 +01:00
Tobias Henkel b01c7821e5
Initialize label statistics to zero
The label statistics are gauges that keep their values. Currently
nodepool doesn't initialize the label-based node statistics to
zero. Because of this, any label state that shows up in the statistics
won't be reset to zero once nodepool stops sending updates to
that gauge. This leads to graphs that are mostly stuck at 1 if no
nodes are currently in that state.

This can be fixed by iterating over the supported labels and
initializing all of their states to zero.

Change-Id: I6c7f63f8f64a83b225386f6da567bfae5141be7b
2018-10-18 15:45:42 +02:00
Tobias Henkel 2da07128e6
Fix race in test_launchNode_delete_error
Deleting failed nodes is offloaded to the DeletedNodeWorker. Thus the
deletion can be a bit delayed and we need to wait for delete_success
instead of asserting it.

Change-Id: I5b425707fac5250367f6f3164e803951b10196cb
2018-09-24 07:25:32 +02:00
Zuul da8a94d341 Merge "Do not abort node launch if failed node cannot be deleted" 2018-09-17 17:58:21 +00:00
mhuin ebdc35fc5e Do not abort node launch if failed node cannot be deleted
There was a possibility of an uncaught exception when launching a
node: if the launch failed and the subsequent node cleanup failed
as well, nodepool would break out of the loop prematurely.

Also add request information when logging launch failure to make
debugging easier.

Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
2018-09-10 10:40:14 +02:00
Tobias Henkel bb3c2dbf1b
Fix label name in reported label stats
Now the node type is a list of labels. This is not considered when
updating the node stats, so the label in the reported stats is the
string representation of that list. Instead, iterate over that list and
increase the counter for every label.

Change-Id: Id64cb933e310fa056deab0b63e9e02d451de5973
2018-09-07 08:20:57 +02:00
Clark Boylan ef6be899b6 Force node hold expiration to integer type
Zuul gives us the node hold requests and it may not provide the node
hold expiration value as an integer (this issue is also being fixed at
https://review.openstack.org/597592 in zuul). To guard against hold
clients (zuul and possibly others) giving us the wrong type for the
expiration value, we coerce it to int() here if it is not None.

If the coercion fails, we assume a value of 0, which means no expiration.
This allows the node to still be held as the user expected; it just
won't expire.

Change-Id: Iaa9500490b8536470978d94fd01cbaad23d1d647
2018-08-29 10:21:56 -07:00
Zuul 168a12a0a6 Merge "zk: skip node already being deleted in cleanup leaked instance task" 2018-08-24 22:04:40 +00:00
Tristan Cacqueray 8eb099283a zk: skip node already being deleted in cleanup leaked instance task
A cloud may fail to delete a node, and in this case we need to check
whether we are already trying to delete the node. Otherwise the launcher
keeps adding artificial nodes.

Change-Id: Ic5dfc75df771a5f312099ee82f82e2561f5f4829
2018-08-24 16:56:29 -04:00
Ian Wienand cd9aa75640 Use pipelines for stats keys
Pipelines buffer stats and then send them out in more reasonably sized
chunks, helping to avoid small UDP packets going missing in a flood of
stats.  Use this in stats.py.

This needs a slight change to the assertedStats handler to extract the
combined stats.  This function is ported from Zuul, where it was updated
to handle pipeline stats (Id4f6f5a6cd66581a81299ed5c67a5c49c95c9b52), so
it is not really new code.

Change-Id: I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43
2018-07-25 16:46:13 +10:00
Zuul 1b4c92262a Merge "builder: support setting diskimage env-vars in secure configuration" 2018-07-23 16:04:00 +00:00
Tobias Henkel 934b1eed9c
Invalidate az cache on bad request
When getting error 400 we not only need to clear the image and flavor
cache but the az cache as well. Otherwise we will constantly get node
failures for any node request where nodepool chose that az
[1]. Currently the only way to recover from this situation is to
restart nodepool. Invalidating the cache doesn't fix the request that
failed due to this error but at least ensures that nodepool will
recover from this situation automatically for all further node
requests.

[1] Trace:
2018-07-05 09:09:08,477 ERROR nodepool.NodeLauncher-0000123378: Launch attempt 2/3 failed for node 0000123378:
Traceback (most recent call last):
  File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 221, in launch
    self._launchNode()
  File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 134, in _launchNode
    volume_size=self.label.volume_size)
  File "/opt/nodepool-source/nodepool/driver/openstack/provider.py", line 378, in createServer
    return self._client.create_server(wait=False, **create_args)
  File "<decorator-gen-106>", line 2, in create_server
  File "/usr/lib/python3.5/site-packages/shade/_utils.py", line 410, in func_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/shade/openstackcloud.py", line 6909, in create_server
    endpoint, json=server_json)
  File "/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 334, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 158, in request
    return self._munch_response(response, error_message=error_message)
  File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 114, in _munch_response
    exc.raise_from_response(response, error_message=error_message)
  File "/usr/lib/python3.5/site-packages/shade/exc.py", line 171, in raise_from_response
    raise OpenStackCloudBadRequest(msg, response=response)
shade.exc.OpenStackCloudBadRequest: (400) Client Error for url: (...) The requested availability zone is not available

Change-Id: I5f653f159b08cf086d20c2398a9345bd4caa4d1e
2018-07-23 14:04:08 +02:00
Zuul 4fd407b064 Merge "Add ability to ignore provider quota for a pool" 2018-07-12 22:03:39 +00:00
Jesse Pretorius 4c8b5f4f99 Add ability to ignore provider quota for a pool
In some circumstances it is useful to tell the launcher to
ignore the provider quota and just trust the max-* settings
for that pool instead.

This particular need arises when using Rackspace public cloud
for both instances and OnMetal nodes. In this situation the
quota for instances and for OnMetal nodes is different, but
nodepool only queries the quota for instances. When trying to
build OnMetal nodes, the quota check fails - but should not.

In this circumstance, instead of making shade/nodepool more
complicated by figuring out how to calculate disparate quota
types, it makes sense to simply allow nodepool to ignore
the quota for a pool and try executing the build instead.

While this is our use-case, it may also be useful to others
for other reasons.
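
For illustration, a pool opting out of quota checks might look like
this (a sketch only: the provider and label names are made up, and the
exact option spelling is defined by this change):

providers:
  - name: rax-onmetal               # hypothetical provider name
    driver: openstack
    pools:
      - name: main
        ignore-provider-quota: true  # trust the max-* settings, skip the quota query
        max-servers: 10
        labels:
          - name: onmetal-node       # hypothetical label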

Change-Id: I232a1ab365795381ab180aceb48e8c87843ac713
2018-07-11 17:37:55 +00:00
Tobias Henkel ce9120192b
Cleanup test_over_quota
The test_over_quota test has some debug lines that should be removed. Also
make the copy of the nodeset more intuitive.

Change-Id: I9a168bc1446e3bb479cace68ed932401ebcdd2bf
2018-07-06 19:35:02 +02:00
Tobias Henkel 31c276c234
Fix relaunch attempts when hitting quota errors
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the tenant or
with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course that will fail again in most
cases. This makes the node request and thus the zuul job fail with
NODE_FAILURE.

Instead we need to pause the handler, just as would happen if the
quota calculation itself said we were at quota. We mark the node as
aborted, which is not a failure indicator, and pause the handler so it
will automatically reschedule a new node launch as soon as the quota
calculation allows it.

Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
2018-07-06 08:41:02 +02:00
David Shrewsbury e66e5a56b6 Add test for referencing cloud image by ID
This adds a test for the image-id cloud image attribute, which
should cause shade to perform a direct lookup on the image by
UUID, bypassing the typical name-or-id logic.

Change-Id: I8b2b99aac012e0f1adda326dfe76d9f90f606cb1
Co-Authored-By: James E. Blair <jeblair@redhat.com>
2018-07-03 14:13:56 -04:00
David Shrewsbury 58ccd2c718 Fix paused handler exception handling
If a request handler was paused and an exception was thrown during
normal event processing, the exception handling code did not properly
unpause the request and remove it from the list of active handlers.

Change-Id: I0cfd8294f9ce88e5918654a4847776c981ee4fb3
2018-06-15 09:05:04 -04:00
James E. Blair e20858755f Have Drivers create Providers
Use the new Driver class to create instances of Providers

Change-Id: Idfbde8d773a971133b49fbc318385893be293fac
2018-06-06 14:57:40 -04:00
Artem Goncharov 674c9516dc Add support for specifying security_group in nodepool
In some installations it might not be feasible to rely on the default
security group (for security reasons). In order to also make it possible
to share one tenant between zuul and other resources, support for
specifying security_groups at the driver.openstack.pool level is added.
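
For example, a pool might specify its security groups like so (a
sketch: the cloud, pool, group and label names are made up, and the
exact option spelling is defined by this change):

providers:
  - name: example-cloud
    driver: openstack
    pools:
      - name: main
        security-groups:        # instances in this pool are attached to these groups
          - zuul-nodes
        labels:
          - name: centos-7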

Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
2018-06-05 10:00:06 +02:00
Tristan Cacqueray eca37d13ea builder: support setting diskimage env-vars in secure configuration
This change enables using diskimage-builder elements with secrets securely.
For example, a rhel diskimage needs a REG_PASSWORD that could be defined in
the secure file like so:

diskimages:
  - name: rhel-7
    env-vars:
      REG_PASSWORD: secret-password

Change-Id: I814318ae0b5c9e4665f3fa3f011d8a687b540fac
2018-05-29 23:48:13 +00:00
David Shrewsbury 2a2729b3b4 Add multilabel support to ZooKeeper Node model
Support storing multiple node types/labels for a Node() in the
ZooKeeper data.

Neither the OpenStack nor the static drivers make use of this yet;
they assume there will only be one value in the type/label list.

In the request handler code, we must now store the satisfied types
(rather than rely on the Node.type in self.nodeset) since that can
contain multiple values and we can't tell which value was used to
satisfy the request. The LabelRecorder class helps do this, and also
helps with assigning nodes to a request in the proper order, based
on the order of the labels in the request.

Change-Id: Ice46e22183c4e7354eaa66504ea147aff5ee0df3
2018-05-24 14:37:34 -04:00
David Shrewsbury be0c59535a Simplify driver API
Further decouple the driver API from the provider configuration
by simplifying the driver interface (which will eventually lead to
the driver deciding whether or not to support multiple labels) and
removing config references in NodeLauncher.

This significantly simplifies the driver API by:

  * Reducing the launching of nodes to a launch() and launchesCompleted()
    interface. The launchesCompleted() method acts as a periodic check on
    whether the nodeset launches have all completed.

  * How the drivers actually launch nodes is decoupled from the driver
    API by not having the return of NodeRequestHandler.launch() be either
    a thread or None. This leaves it entirely up to the driver to decide
    the best way to manage launches... threads, subprocesses, messaging,
    etc. E.g., thread handling is moved from driver API into OpenStack driver.

  * Drivers no longer need to implement the very confusing pollLauncher()
    method. This was originally meant to poll the launch threads, but
    even drivers not using threads (e.g., the static driver) had to
    reimplement this method. That made no sense.

The NodeLauncher class (currently used only by the OpenStack driver),
is moved to a driver utility module that other drivers may choose to
use or ignore. And it no longer directly accesses the provider config.

The test_nodelaunchmanager.py tests were originally testing the
NodeLaunchManager class (removed quite some time ago) which managed
launch threads. This was eventually moved to the driver interface.
This change further moves thread management into the OpenStack driver.
As a result, these tests are redundant since the test_launcher.py
tests cover all of this. So remove test_nodelaunchmanager.py.

Change-Id: I407d999b3608e1614c0dbe49808b2a64756dde58
2018-05-18 11:24:59 -04:00
Zuul e881199916 Merge "Refactor NodeLauncher to be generic" 2018-04-27 14:05:17 +00:00
David Shrewsbury e1bd967001 Fix race in test_hold_expiration_no_default
The test_hold_expiration_no_default test was misconfigured to use
the nodepool config for the max ready age tests, which was causing
a race in the test where the node could be removed (because it exceeded
the max ready age) before the hold could be set. This fixes it to use
a new config, and lowers the TTL on the held node to speed up the test.

Change-Id: Iaeb16e7a11f8b5c7bb2b363b7939fee7933393c4
2018-04-25 15:25:01 -04:00
Tristan Cacqueray 11f0ffd201 Refactor NodeLauncher to be generic
This change refactors the NodeLauncher object so that it is part
of the generic interface. A nodepool driver handler only needs to
return a subclass implementing the launch method.

Moreover this change adapts the StatsReporter class.

Change-Id: I6cfec650b862cb4fa0cb391bcc1248549e30c91b
2018-04-19 02:23:42 +00:00
Zuul 20c6646bf0 Merge "Refactor run_handler to be generic" 2018-04-18 15:38:11 +00:00
Zuul 7ddee72b51 Merge "Add connection-port to provider diskimage" 2018-04-13 15:20:58 +00:00
Zuul 07327ab296 Merge "Don't gather host keys for non ssh connections" 2018-04-13 15:20:56 +00:00
Zuul 38cd023ce0 Merge "launcher: handle ZK session loss during handler poll" 2018-04-12 19:07:27 +00:00
Zuul b44e8491b3 Merge "Handle ZK session loss during node launch" 2018-04-12 18:03:29 +00:00
Tristan Cacqueray f5dc749f2a Refactor run_handler to be generic
This change refactors the OpenStackNodeRequestHandler.run_handler method
so that it is part of the generic interface. A nodepool driver handler now
only needs to implement the launch() method.

Moreover, this change merges the NodeLaunchManager into the RequestHandler.

This change adds the following methods to the handler interface:

* nodeReused(node): notify handler a node is re-used
* checkReusableNode(node): verify if a node can be re-used
* setNodeMetadata(node): store extra node metadata such as az
* pollLauncher()

Change-Id: I5b18b49f50e2b416b5e5c4e28e1a4a6a1df1c654
2018-04-12 02:50:49 +00:00
Tristan Cacqueray f5eb867ad7 launcher: handle ZK session loss during handler poll
Stop polling handlers for which the NodeRequest lock is gone.

Change-Id: Iba31cb6804c94fea34a6dc911dffdd7ba2ebdc6c
2018-04-12 01:56:38 +00:00
Tobias Henkel 687f120b3c
Add connection-port to provider diskimage
The connection port should be included in the provider diskimage.
This makes it possible to define images that use other ports for
connections, e.g. winrm for Windows, which runs on a different port than 22.
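
An illustrative provider diskimage entry (a sketch: provider and image
names are made up, 5986 assumes WinRM over HTTPS, and the exact option
spelling is defined by this change):

providers:
  - name: example-cloud
    diskimages:
      - name: windows-server      # hypothetical Windows image
        connection-port: 5986     # connect on the WinRM port instead of ssh port 22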

Change-Id: Ib4b335ffbcc4dc71704c06387377675a4206c663
2018-04-03 17:48:52 +02:00
Tobias Henkel 2da274e2ae
Don't gather host keys for non ssh connections
In the case of an image with the connection type winrm we cannot scan
the ssh host keys. So if the connection type is not ssh we
need to skip gathering the host keys.
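
Such an image would be declared with a non-ssh connection type, e.g.
(a sketch; provider and image names are made up):

providers:
  - name: example-cloud
    diskimages:
      - name: windows-server
        connection-type: winrm    # host key scanning is skipped for non-ssh types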

Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
2018-04-03 17:31:45 +02:00
David Shrewsbury 763fa7ebfe Handle ZK session loss during node launch
Node launches can take a while, thus increasing the chances of
encountering a lost connection to ZooKeeper, which could lead to
a session loss. Let's try to handle that by not re-attempting any
launches (pointless because we've lost our node lock and the node
could be deleted out from under us).

Change-Id: Iaca005355e8d03e1e5ab7cdab0ea3c04117ca6fb
2018-03-28 10:16:23 -04:00
Paul Belanger 2286f2432c Add host-key-checking option to openstack providers
In some cases nodepool-launcher uses the public API to launch nodes, but
doesn't have access to the private networks of the nodes it launches.
Rather than failing, expose an option for operators to disable
ssh-keyscan and allow nodes to become ready.
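
For example (a sketch: provider and pool names are made up, and the
exact option spelling is defined by this change):

providers:
  - name: example-cloud
    pools:
      - name: main
        host-key-checking: false   # skip ssh-keyscan; nodes become ready without it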

Change-Id: I764398aa21461ef44048e9e6565d2ee3e01aaaf8
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-03-26 22:29:14 +00:00
Zuul 13e705edbc Merge "Store label info with launcher registration" 2018-03-08 03:01:31 +00:00
David Shrewsbury bcd8a2c132 Store label info with launcher registration
It can be useful to know all of the labels that nodepool can support
without having to examine nodepool configuration files. We can store
this information when we register each launcher at startup, and on
configuration changes.

This stores supported labels for each launcher in ZooKeeper. A
"supported label" here is one that is actually bootable and referenced
within the provider pool configuration sections. This is made slightly
more complex because each nodepool driver handles its own configuration,
so we have to check driver type before deciding how to get at the
labels.

It also introduces a new model class (Launcher) to abstract away the
information that we store in ZooKeeper so that we can later more easily
add other information that could be useful (such as per-tenant info).

Change-Id: Icfb73fbe3b67321235a78ea7ed9bf4319567eb1a
2018-03-07 16:42:40 -05:00
Zuul a77eba273b Merge "Add unit test for multiple launchers" 2018-03-07 16:12:24 +00:00
mhuin a59528c8e8 Clean held nodes automatically after configurable timeout
Introduce a new configuration setting, "max_hold_age", that specifies
the maximum uptime of held instances. If set to 0, held instances
are kept until manually deleted. A custom value can be provided
at the rpcclient level.
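
For illustration, in the launcher configuration this could look like
the following (a sketch only; the exact setting name and location are
defined by this change, shown here with nodepool's usual hyphenated
spelling):

max-hold-age: 86400    # held nodes are cleaned up after one day (in seconds)
                       # 0 would keep held nodes until manually deleted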

Change-Id: I9a09728e5728c537ee44721f5d5e774dc0dcefa7
2018-02-20 16:13:55 +01:00
Paul Belanger b71da985ea
Add unit test for multiple launchers
Here we are adding a base test to show the current workings of
multiple launchers, where one nodepool-launcher is responsible for
managing min-ready.

Change-Id: I4d18b1ba152fef90a58cb9e8bf9157e1fbafdb09
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-02-08 15:48:48 -05:00
David Shrewsbury e054b74c52 Do not delete unused but allocated nodes
Let's assume an allocated node is *going* to be used (maybe zuul
hasn't got around to locking it quickly enough). We don't want
to assign a node, then delete it out from underneath the requestor.

Change-Id: I092cffbfd347684f66d6e7cbd0f910d40858580b
2018-02-06 12:01:54 -05:00
David Shrewsbury 284e0b52bf Provider wedge fix
If a provider is going to pause due to insufficient quota, we attempt
to delete the oldest unused node (one that is READY and unlocked) to
guarantee that some quota will be freed.

To prevent us from issuing unnecessary delete requests, if we find a
node already being deleted within a recent timeframe (here, 5 minutes),
we do not issue another delete.

NOTE: This causes some assumptions in tests depending on the current
pause behavior (no proactive deletion) to no longer be valid, so steps
are taken in those tests to prevent the proactive delete from happening.

Change-Id: I8d94c60fe1ab184503592a02d6ca458f94a2ea3d
2018-02-02 16:42:13 -05:00
David Shrewsbury 6c6cbfc209 Partial revert for disabled provider change
Commit 1cf2cddf9d attempted to
short-circuit request handling when a provider was disabled by
setting max-servers to 0. That prevents the provider from declining
any requests, and thus prevents a request from ever reaching
FAILURE status if none of the other providers can satisfy it (the
request declined_by array is compared against the list of registered
launchers).

This adds a test for this case (which was run against the current
code and proved it was broken).

Change-Id: I8a2f9771b55b08259ecffa3475bdd2d4f2654464
2018-01-29 13:28:17 -05:00
Clark Boylan fd2caae47e Fix race in test_failed_provider
In test_failed_provider we determine which node is the second node that
fake-provider2 booted and then delete it. Unfortunately there is a race
where we could find the first node booted in this provider and re-delete
it instead of deleting the second node, which never frees up
provider quota and causes the test to time out and fail.

We fix this by explicitly checking the node state when looking for the
second node booted in the provider. The first node will be in a deleting
state or just won't exist so checking for a ready node should be
sufficient to always find the second node.

We also fix a small bug in how we raise the exception that causes
fake-provider's launch request to fail. We need a single argument to the
monkey-patched launch() function, otherwise we get a TypeError from
python instead of the KeyError that we want.

Change-Id: Ia2fad9736be76d6ca9a923c8e0643a96a9d1dbe6
2018-01-27 15:50:30 -08:00
Zuul 29bba0c1ef Merge "Fix race in test_provider_removal" 2018-01-24 20:30:59 +00:00
David Shrewsbury 2e0d144e7e Fix race in test_provider_removal
Turns out that waiting for the config to be non-None in the
Nodepool object is not sufficient to guarantee that our threads
for the providers have started. Instead, wait for our nodes to
appear, which definitely guarantees we have started the threads.

Change-Id: I8eb189c91bd315bbd52081b81556cde3e29d7e5b
2018-01-22 14:24:43 -05:00
Tristan Cacqueray d0a67878a3 Add a plugin interface for drivers
This change adds a plugin interface so that drivers can be loaded dynamically.
Instead of importing each driver in the launcher, provider_manager and config,
the Drivers class discovers and loads drivers from the driver directory.

This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.

Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
2018-01-19 00:45:56 +00:00
Clark Boylan 63dbab87df Fix node data retrieval race in test_failed_provider
Once we start deleting nodes in test_failed_provider we can't count on
getNode() giving us a node for a previous request because that node may
have been deleted. Check that node is a valid value before accessing its
provider value to avoid this race.

Change-Id: Ie6b20f1a66de7865223a7ecf5785e2446afc7804
2018-01-18 13:19:02 -08:00