This updates the OpenStack driver to use the statemachine framework.
The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:
1) We can dramatically reduce the number of threads in Nodepool which
is our biggest scaling bottleneck. The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers. The statemachine
framework is an extension of that idea implemented so that every driver
can use it. This change further reduces the number of threads needed
even for the OpenStack driver.
2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.
There are a few updates to the statemachine framework to accommodate some
features that, to date, only the OpenStack driver used.
A number of tests need slight alteration since the OpenStack driver is
the basis of the "fake" driver used for tests.
Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
For drivers that support tagging/metadata (openstack, aws, azure),
add or enhance support for supplying tags for uploaded diskimages.
This allows users to set metadata on the global diskimage object
which will then be used as default values for metadata on the
provider diskimage values. The resulting merged dictionary forms
the basis of metadata to be associated with the uploaded image.
The changes needed to reconcile this for the three drivers mentioned
above are:
All: the diskimages[].meta key is added to supply the default values
for provider metadata.
OpenStack: provider diskimage metadata is already supported using
providers[].diskimages[].meta, so no further changes are needed.
AWS, Azure: provider diskimage tags are added using the key
providers[].diskimages[].tags since these providers already use
the "tags" nomenclature for instances.
This results in the somewhat incongruous situation where we have
diskimage "metadata" being combined with provider "tags", but the
alternative is to have images with "metadata" while instances have
"tags", both of which are "tags" in EC2. The chosen approach is at
least consistent within each driver.
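The merge described above can be sketched as follows (the function name and example keys are illustrative, not Nodepool's actual code):

```python
def merged_image_tags(diskimage_meta, provider_tags):
    """Combine global diskimage metadata (the defaults) with
    provider-level tags; provider values win on conflict."""
    merged = dict(diskimage_meta or {})
    merged.update(provider_tags or {})
    return merged
```

For example, a diskimage with meta `{'team': 'infra'}` and a provider diskimage with tags `{'team': 'ci', 'env': 'prod'}` would yield `{'team': 'ci', 'env': 'prod'}` as the tags for the uploaded image.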
Change-Id: I30aadadf022af3aa97772011cda8dbae0113a3d8
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.
This bug was preventing quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, the QuotaException was never raised and the node allocation
failed outright.
A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.
This fixes that bug, but does change the behavior of the OpenStack
driver when ignore-provider-quota is True and the cloud returns a
quota error.
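A minimal sketch of the guard, assuming the cache is a dict with a 'timestamp' key (the class and attribute names here are illustrative, not Nodepool's actual implementation):

```python
class QuotaCacheHolder:
    def __init__(self):
        # Never populated when 'ignore-provider-quota' is in use.
        self._current_nodepool_quota = None

    def invalidateQuotaCache(self):
        # Guard against None before treating the cache as a dictionary,
        # so a QuotaException raised by the caller can propagate instead
        # of being masked by a TypeError.
        if self._current_nodepool_quota is not None:
            self._current_nodepool_quota['timestamp'] = 0
```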
Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
If a provider (or its configuration) is sufficiently broken that
the provider manager is unable to start, then the launcher will
go into a loop where it attempts to restart all providers in the
system until it succeeds. During this time, no pool managers are
running, which means all requests are ignored by this launcher.
Nodepool continuously reloads its configuration file, and in case
of an error, the expected behavior is to continue running and allow
the user to correct the configuration and retry after a short delay.
We also expect providers on a launcher to be independent of each
other, so that if one fails, the others continue working.
However since we neither exit, nor process node requests if a
provider manager fails to start, an error with one provider can
cause all providers to stop handling requests with very little
feedback to the operator.
To address this, if a provider manager fails to start, the launcher
will now behave as if the provider were absent from the config file.
It will still emit the error to the log, and it will continuously
attempt to start the provider so that if the error condition abates,
the provider will start.
If there are no providers on-line for a label, then as long as any
provider in the system is running, node requests will be handled
and declined and possibly failed while the broken provider is offline.
If the system contains only a single provider and it is broken, then
no requests will be handled (they will fail), which is the current
behavior and still likely the most desirable in that case.
Change-Id: If652e8911993946cee67c4dba5e6f88e55ac7099
Currently nodepool has one thread per server creation or
deletion. Each of those waits for the cloud by regularly getting the
server list and checking whether its instance is active or gone. On a
busy nodepool this leads to severe thread contention when the server
list gets large and/or there are many parallel creations/deletions in
progress.
This can be improved by offloading the waiting to a single thread that
regularly retrieves the server list and compares it to the list of
waiting server creates/deletes. The calling threads then wait until
the central thread wakes them up to proceed with their task. The
waiting threads block on an event outside of the GIL and thus no
longer contribute to the thread contention problem.
An alternative approach would be to redesign the code to use fewer
threads, but that would be a much more complex redesign. This change
therefore keeps the many-threads approach while making their waiting
much more lightweight, which showed a substantial improvement during
load testing in a test environment.
Change-Id: I5525f2558a4f08a455f72e6b5479f27684471dc7
The OpenStack provider doesn't filter on status, so when we upload a
new image and deactivate the old one, it throws an SDKException
because it finds multiple images with the same name.
This adds a filter so that we only look up Glance images with an
`active` status via the openstacksdk, which is the only valid state
in which we can use the image [1].
[1] https://docs.openstack.org/glance/latest/user/statuses.html
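The effect of the filter can be illustrated in isolation (a sketch over plain dict records as the cloud might return them; names are hypothetical):

```python
def pick_active_image(images, name):
    """Select the single active image with the requested name, ignoring
    deactivated images that share it."""
    candidates = [i for i in images
                  if i['name'] == name and i['status'] == 'active']
    if len(candidates) != 1:
        raise RuntimeError(
            "Expected exactly one active image named %s, found %d"
            % (name, len(candidates)))
    return candidates[0]
```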
Change-Id: I480b4e222232da1f1aa86b1a08117e605ef08eb4
We have noticed in the real world that images sometimes refuse to
delete. Boot-from-volume with leaked volumes can cause this, as can
other unknown issues. When this happens we keep old dib builds around
longer than expected which eventually leads to full builder disks and
sadness.
Once we've set all provider uploads to the delete state we know we no
longer want that image at all, either locally or in the cloud. We can
delete the local copies of the files as soon as we reach this state so
that we aren't waiting for the provider image uploads to go away first.
This should make the builders more reliable as they'll clean their disks
as soon as is possible.
Change-Id: I7549dca5172bc82f143463994ac8578cc98a4375
Nodepool currently deletes managed images by name. This leads to
problems when an image is uploaded twice (e.g. because of multiple
formats). In this case there can be more than one image with the same
name which breaks the deletion. This can be fixed by deleting the
images by id instead.
Change-Id: I74fc6219ef7f2c496f36defb0703137ec4d7d30e
The nova API can return instance records for an instance that has been
deleted. When it does, the status should be "DELETED". This means we
should check either that the instance has no record or, if a record is
present, that its status is DELETED.
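The check described above amounts to (a sketch over a plain dict record; not Nodepool's actual code):

```python
def server_is_deleted(server):
    """A server is gone if nova returns no record at all, or returns a
    record whose status is DELETED."""
    return server is None or server.get('status') == 'DELETED'
```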
Change-Id: I7ad753a3c73f3d2cd78f4a380f78279af9206ada
When debugging fake cloud interactions between changing provider
managers, it is really confusing to see logs showing that port 1a has
been deleted, but then still see 1a when ports are listed.
This was happening because we were sharing the id values for fake cloud
resources across fake provider managers. Initially this wasn't clear to
me reading logs.
By updating these resources to use real uuids we can see that $UUID1
does not equal $UUID2 when comparing these items in logs.
Change-Id: I77db55b1a7649c3af91e83c81a906f03a28bb34a
When creating instances with boot-from-volume we don't get
quota-related information in the exception raised by wait_for_server.
The fault information is also missing from the server munch that is
returned. This causes node failures when we run into the volume
quota. This can be fixed by explicitly fetching the server, if we got
one, and inspecting the fault information, which contains more detail
about the fault reason [1].
[1] Example fault reason:
Build of instance 4628f079-26a9-4a1d-aaa0-881ba4c7b9cb aborted:
VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds
allowed gigabytes quota. Requested 500G, quota is 10240G and 10050G
has been consumed.
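A hedged sketch of the fix: on a failed wait, re-fetch the server record by id and surface the fault message. The getter is passed in here because the real lookup goes through the cloud API; all names are illustrative.

```python
def launch_failure_reason(get_server_by_id, server_id):
    """Return the fault message for a failed launch, or a generic
    fallback when no record or fault is available."""
    if not server_id:
        return 'Unknown error'
    server = get_server_by_id(server_id) or {}
    fault = server.get('fault') or {}
    return fault.get('message', 'Unknown error')
```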
Change-Id: I6d832d4dbe348646cd4fb49ee7cb5f6a6ad343cf
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.
Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
This reverts commit 7e1b8a7261.
openstacksdk >=0.19.0 fixes the filtering problems leading to all
ports being deleted. However openstacksdk <0.21.0 has problems with
dogpile.cache so use 0.21.0 as a minimum.
Change-Id: Id642d074cbb645ced5342dda4a1c89987c91a8fc
When we create a server using the openstack provider, in certain
conditions it is helpful to collect the host_id value. As an example,
Zuul will pass through this information into a job inventory file which
will allow an operator to better profile where jobs run within a cloud.
This can be helpful trying to debug random jobs timing out within a
cloud.
Change-Id: If29397e67a470462561f24c746375b8291ac43ab
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
When a server creation fails but has an external id, we create a new
znode to offload the deletion of that node. This currently misses the
node type, which triggers an exception during node launch [1]. This
wedges the provider until the node deleter kicks in and deletes that
node successfully. Fix this by storing the node type in the znode.
[1] Exception
Traceback (most recent call last):
File "nodepool/driver/__init__.py", line 639, in run
self._runHandler()
File "nodepool/driver/__init__.py", line 563, in _runHandler
self._waitForNodeSet()
File "nodepool/driver/__init__.py", line 463, in _waitForNodeSet
if not self.hasRemainingQuota(ntype):
File "nodepool/driver/openstack/handler.py", line 314, in hasRemainingQuota
self.manager.estimatedNodepoolQuotaUsed())
File "nodepool/driver/openstack/provider.py", line 164, in estimatedNodepoolQuotaUsed
if node.type[0] not in provider_pool.labels:
IndexError: list index out of range
Change-Id: I67b269069dddb8349959802d7b1ee049a826d0c5
Co-authored-by: Tobias Henkel <tobias.henkel@bmw.de>
The port filter for DOWN ports seems to have no effect. It actually
deleted *all* ports in the tenant.
This reverts commit cdd60504ec.
Change-Id: I48c1430bb768903af467cace1a720e45ecc8e98f
Cleanup will be periodic (every 3 minutes by default, not yet
configurable) and will be logged and reported via statsd.
Change-Id: I81b57d6f6142e64dd0ebf31531ca6489d6c46583
There was a possibility of an uncaught exception when launching a
node: if the launch fails and the subsequent node cleanup fails as
well, nodepool would break out of the loop prematurely.
Also add request information when logging launch failure to make
debugging easier.
Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
When getting a 400 error we need to clear not only the image and
flavor caches but the AZ cache as well. Otherwise we will constantly
get node failures for any node request where nodepool chose that AZ
[1]. Currently the only way to recover from this situation is to
restart nodepool. Invalidating the cache doesn't fix the request that
failed due to this error, but at least ensures that nodepool will
recover from this situation automatically for all further node
requests.
[1] Trace:
018-07-05 09:09:08,477 ERROR nodepool.NodeLauncher-0000123378: Launch attempt 2/3 failed for node 0000123378:
Traceback (most recent call last):
File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 221, in launch
self._launchNode()
File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 134, in _launchNode
volume_size=self.label.volume_size)
File "/opt/nodepool-source/nodepool/driver/openstack/provider.py", line 378, in createServer
return self._client.create_server(wait=False, **create_args)
File "<decorator-gen-106>", line 2, in create_server
File "/usr/lib/python3.5/site-packages/shade/_utils.py", line 410, in func_wrapper
return func(*args, **kwargs)
File "/usr/lib/python3.5/site-packages/shade/openstackcloud.py", line 6909, in create_server
endpoint, json=server_json)
File "/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 334, in post
return self.request(url, 'POST', **kwargs)
File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 158, in request
return self._munch_response(response, error_message=error_message)
File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 114, in _munch_response
exc.raise_from_response(response, error_message=error_message)
File "/usr/lib/python3.5/site-packages/shade/exc.py", line 171, in raise_from_response
raise OpenStackCloudBadRequest(msg, response=response)
shade.exc.OpenStackCloudBadRequest: (400) Client Error for url: (...) The requested availability zone is not available
Change-Id: I5f653f159b08cf086d20c2398a9345bd4caa4d1e
os-client-config is now just a wrapper around openstacksdk. The shade
code has been imported into openstacksdk. To reduce complexity, just use
openstacksdk directly.
openstacksdk's TaskManager has had to grow some features to deal with
SwiftService. Making nodepool's TaskManager a subclass of openstacksdk's
TaskManager ensures that we get the thread pool set up properly.
Change-Id: I3a01eb18ae31cc3b61509984f3817378db832b47
This adds a test for the image-id cloud image attribute (which
should cause shade to perform a direct lookup on the image by
UUID, bypassing the typical name-or-id logic).
Change-Id: I8b2b99aac012e0f1adda326dfe76d9f90f606cb1
Co-Authored-By: James E. Blair <jeblair@redhat.com>
In some installations it may be impractical to rely on the default
security group (for security reasons). In order to also make it
possible to share one tenant between zuul and other resources, support
for specifying security_groups at the driver.openstack.pool level is
added.
Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
Rather than relying on inspection and loading classes from specific
filenames, ask each driver's Provider implementation to supply the
NodeRequestHandler instance for requests.
This is the first in a series of changes to remove all similar
indirection from the drivers in order to make them easier to
understand.
The test driver is moved out of the fixtures subdir to be a
"regular" driver because it now needs relative imports (provider
code needs to import the handler code).
Change-Id: Iad60d8bfe102d43b9fb1a1ff3ede13e4fa753572
This change adds a plugin interface so that drivers can be loaded
dynamically. Instead of importing each driver in the launcher,
provider_manager and config, the Drivers class discovers and loads
drivers from the driver directory.
This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.
Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
This is a partial revert of ae65d94e34
which did two things: made over-quota errors not fatal, and invalidated
the quota cache.
The quota cache invalidation had a typo which caused it to fail. This
error actually meant that we retained the old behavior of marking
a launch attempt as failed if we encountered an over-quota error.
This change adds a test to exercise this and fixes the typo.
In writing the test, it was observed that a launch which encounters
an over-quota situation will continue to retry indefinitely, every
5 seconds, and will emit an exception into the log each time. This is
probably not the best experience for an operator. Further, because
there is no discernable state change within nodepool, it is very
difficult to test this situation.
Due to these issues, we should discuss a more robust way to handle
unexpected over-quota errors. In the meantime, reverting to the old
(and indeed, still current due to the typo) behavior of failing on an
over-quota error seems the safest way to proceed.
Also, change time.monotonic to time.time. The reference point for
monotonic is undefined, and can be 0, or other values less than 300,
which causes the cache timeout check to fail.
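The failure mode can be shown with a small sketch of such a timeout check (function and parameter names are illustrative): time.monotonic() has an undefined epoch and may legitimately be a small number, so a zeroed timestamp does not reliably read as expired; with time.time() (seconds since the Unix epoch) it always does.

```python
import time

def cache_expired(timestamp, ttl=300, now=None):
    """Return True if the cached value is older than ttl seconds.
    'timestamp' defaults to 0 for a never-filled cache, which only
    behaves as 'expired' when measured against a wall-clock epoch."""
    if now is None:
        now = time.time()
    return now - timestamp > ttl
```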
Change-Id: Id488c070df991a554570c5717dc85aec351fed45
This is an approach to calculating the cloud quotas before launching
nodes. To support co-existence with other services, the limits of the
cloud are gathered and all unknown resources are subtracted from the
maximum values. The remaining quota available to nodepool can thus be
calculated and checked before launching nodes.
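The calculation above can be sketched as follows (resource names and the exact accounting are illustrative, not Nodepool's actual quota code):

```python
def remaining_quota(cloud_limits, unknown_usage):
    """Subtract usage by resources nodepool doesn't manage from the
    cloud's absolute limits, leaving the quota nodepool may consume."""
    return {resource: limit - unknown_usage.get(resource, 0)
            for resource, limit in cloud_limits.items()}
```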
Change-Id: I18594ab76949faddade3164b6027dd3b82771c95
When we deploy nodepool and zuul instances in virtual machines of a
cloud provider, the provisioned nodes may be in the same internal
network as the nodepool and zuul instances. In that case we don't
have to allocate floating IPs for the nodes; zuul can talk to the
nodes via the fixed IPs of the virtual machines. Being able to
customize this behavior, and so save floating IP quota, is therefore
useful.
Note: Although the option "floating_ip_source: None" in clouds.yaml
can decide whether to apply a floating IP for a specified cloud, that
impacts all the SDKs and tools that use the clouds.yaml; we should
control nodepool's behavior flexibly and independently.
This patch adds a boolean option "auto-floating-ip" to each pool of
the "provider" section in nodepool.conf.
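A hypothetical pool configuration using the new option (the cloud, pool, and label names here are illustrative, not taken from a real deployment):

```yaml
providers:
  - name: internal-cloud
    driver: openstack
    cloud: mycloud
    pools:
      - name: main
        # Nodes share the internal network with zuul; skip floating IPs.
        auto-floating-ip: false
        labels:
          - name: ubuntu-focal
            min-ram: 8192
```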
Change-Id: Ia9a1bed6dd4f6e39015bde660f52e4cd6addb26e
This change is a follow-up to the drivers spec and it makes the fake provider
a real driver. The fakeprovider module is merged into the fake provider and
the get_one_cloud config loader is simplified.
Change-Id: I3f8ae12ea888e7c2a13f246ea5f85d4a809e8c8d
This change moves OpenStack related code to a driver. To avoid a
circular import, this change also moves the StatsReporter to the stats
module so that the handlers don't have to import the launcher.
Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614