Commit Graph

29 Commits

James E. Blair be3edd3e17 Convert openstack driver to statemachine
This updates the OpenStack driver to use the statemachine framework.

The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:

1) We can dramatically reduce the number of threads in Nodepool which
is our biggest scaling bottleneck.  The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers.  The statemachine
framework is an extension of that idea implemented so that every driver
can use it.  This change further reduces the number of threads needed
even for the openstack driver.

2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.

There are a few updates to the statemachine framework to accommodate some
features that only the OpenStack driver used to date.

A number of tests need slight alteration since the openstack driver is
the basis of the "fake" driver used for tests.
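
As a rough illustration of the idea (the names below are illustrative,
not the real framework API), each pending cloud operation becomes a
small state machine that a shared worker advances, instead of a
dedicated thread blocking per server:

class CreateStateMachine:
    def __init__(self):
        self.state = 'start'
        self.complete = False

    def advance(self):
        # Issue the next non-blocking cloud API call for the current
        # state and move on once its result is ready.
        if self.state == 'start':
            self.state = 'creating'
        elif self.state == 'creating':
            self.complete = True

def run(statemachines):
    # A single thread advances every pending operation in turn.
    while statemachines:
        for sm in list(statemachines):
            sm.advance()
            if sm.complete:
                statemachines.remove(sm)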

Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
2023-01-10 10:30:14 -08:00
James E. Blair 916d62a374 Allow specifying diskimage metadata/tags
For drivers that support tagging/metadata (openstack, aws, azure),
add or enhance support for supplying tags for uploaded diskimages.

This allows users to set metadata on the global diskimage object
which will then be used as default values for metadata on the
provider diskimage values.  The resulting merged dictionary forms
the basis of metadata to be associated with the uploaded image.

The changes needed to reconcile this for the three drivers mentioned
above are:

All: the diskimages[].meta key is added to supply the default values
for provider metadata.

OpenStack: provider diskimage metadata is already supported using
providers[].diskimages[].meta, so no further changes are needed.

AWS, Azure: provider diskimage tags are added using the key
providers[].diskimages[].tags since these providers already use
the "tags" nomenclature for instances.

This results in the somewhat incongruous situation where we have
diskimage "metadata" being combined with provider "tags", but it's
either that or have images with "metadata" while we have instances
with "tags", both of which are "tags" in EC2.  The chosen approach
has consistency within the driver.
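
A minimal sketch of the merge described above (illustrative, not the
exact nodepool code): the diskimages[].meta values act as defaults
which the provider-level values override:

def merged_image_metadata(diskimage_meta, provider_meta):
    # diskimages[].meta supplies the defaults; the provider diskimage
    # meta/tags override them for that provider's upload.
    merged = dict(diskimage_meta)
    merged.update(provider_meta)
    return merged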

Change-Id: I30aadadf022af3aa97772011cda8dbae0113a3d8
2022-08-23 06:39:08 -07:00
Joshua Watt 2c632af426 Do not reset quota cache timestamp when invalid
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.

This bug was preventing Quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, it prevented the QuotaException from being raised and the
node allocation was outright failed.

A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.

This fixes that bug, but does change the behavior of OpenStack when
ignore-provider-quota is True and it returns a Quota error.
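
A minimal sketch of the guard described above (names illustrative, not
the exact nodepool code): only reset the cache timestamp when a cached
dictionary actually exists:

def invalidate_quota_cache(cache):
    # The cache may be None when 'ignore-provider-quota' is used;
    # treating None as a dictionary is what raised the TypeError.
    if cache is not None:
        cache['timestamp'] = 0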

Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
2022-05-10 15:04:25 -05:00
James E. Blair b8035de65f Improve handling of errors in provider manager startup
If a provider (or its configuration) is sufficiently broken that
the provider manager is unable to start, then the launcher will
go into a loop where it attempts to restart all providers in the
system until it succeeds.  During this time, no pool managers are
running which mean all requests are ignored by this launcher.

Nodepool continuously reloads its configuration file, and in case
of an error, the expected behavior is to continue running and allow
the user to correct the configuration and retry after a short delay.

We also expect providers on a launcher to be independent of each
other so that if one fails, the others continue working.

However, since we neither exit nor process node requests if a
provider manager fails to start, an error with one provider can
cause all providers to stop handling requests with very little
feedback to the operator.

To address this, if a provider manager fails to start, the launcher
will now behave as if the provider were absent from the config file.
It will still emit the error to the log, and it will continuously
attempt to start the provider so that if the error condition abates,
the provider will start.

If there are no providers on-line for a label, then as long as any
provider in the system is running, node requests will be handled
and declined (and possibly failed) while the broken provider is offline.

If the system contains only a single provider and it is broken, then
no requests will be handled (failed), which is the current behavior,
and still likely to be the most desirable in that case.
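
A hedged sketch of the behavior described (not nodepool's actual code):
providers that fail to start are treated as absent and retried, while
the rest keep serving requests:

def start_providers(configured, running, log):
    for name, provider in configured.items():
        if name in running:
            continue
        try:
            provider.start()
            running[name] = provider
        except Exception:
            # Log and move on; this provider is retried on the next
            # pass as if it were absent from the configuration.
            log.exception("Provider %s failed to start", name)
    return running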

Change-Id: If652e8911993946cee67c4dba5e6f88e55ac7099
2022-01-14 19:07:32 -08:00
Tobias Henkel 2e59f7b0b3 Offload waiting for server creation/deletion
Currently nodepool has one thread per server creation or
deletion. Each of those waits for the cloud by regularly getting the
server list and checking if their instance is active or gone. On a
busy nodepool this leads to severe thread contention when the server
list gets large and/or there are many parallel creations/deletions in
progress.

This can be improved by offloading the waiting to a single thread that
regularly retrieves the server list and compares that to the list of
waiting server creates/deletes. The calling threads then wait until
the central thread wakes them up to proceed with their task. They wait
for the event outside of the GIL and thus no longer contribute to the
thread contention problem.

An alternative approach would be to redesign the code to use fewer
threads, but that would be a much more complex change. This change
therefore keeps the many-threads approach but makes the waiting much
more lightweight, which showed a substantial improvement during load
testing in a test environment.
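
An illustrative sketch of the pattern (not the actual implementation):
a single poller scans the server list and sets per-request events, so
the creating/deleting threads block on Event.wait() rather than each
listing servers themselves:

import threading

class ServerWaiter:
    def __init__(self, list_servers):
        self._list_servers = list_servers  # callable returning server dicts
        self._waiting = {}                 # server id -> threading.Event
        self._lock = threading.Lock()

    def wait_for_active(self, server_id, timeout=None):
        event = threading.Event()
        with self._lock:
            self._waiting[server_id] = event
        return event.wait(timeout)         # sleeps outside the GIL

    def poll_once(self):
        # Called periodically by the single central thread.
        active = {s['id'] for s in self._list_servers()
                  if s['status'] == 'ACTIVE'}
        with self._lock:
            for sid in list(self._waiting):
                if sid in active:
                    self._waiting.pop(sid).set()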

Change-Id: I5525f2558a4f08a455f72e6b5479f27684471dc7
2021-02-16 15:37:57 +01:00
Tobias Urdin e4ce77466a Filter active images for OpenStack provider
The OpenStack provider doesn't filter on status,
so when we upload a new image and deactivate
the old one it throws an SDKException because it
finds multiple images with the same name.

This adds a filter so that the openstacksdk only
looks up Glance images with an `active` status,
which is the only valid state in which we can
use the image [1].

[1] https://docs.openstack.org/glance/latest/user/statuses.html
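
A minimal sketch of the lookup rule (illustrative, not the SDK call
itself): only an image in 'active' status may match by name, so a
deactivated image with the same name no longer causes a duplicate
match:

def find_active_image(images, name):
    matches = [i for i in images
               if i['name'] == name and i['status'] == 'active']
    if len(matches) != 1:
        raise RuntimeError("expected one active image named %r" % name)
    return matches[0]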

Change-Id: I480b4e222232da1f1aa86b1a08117e605ef08eb4
2020-03-17 16:26:50 +01:00
Clark Boylan 4ac8bf0cd9 Delete dib images when all uploads set to deleting
We have noticed that in the real world, images sometimes refuse to
delete. Boot from volume with leaked volumes can cause this as well as
other unknown issues. When this happens we keep old dib builds around
longer than expected which eventually leads to full builder disks and
sadness.

Once we've set all provider uploads to the delete state we know we no
longer want that image at all, either locally or in the cloud. We can
delete the local copies of the files as soon as we reach this state so
that we aren't waiting for the provider image uploads to go away first.

This should make the builders more reliable as they'll clean their disks
as soon as is possible.
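
The rule this implements, as a small sketch (names illustrative): once
every provider upload of a build is in the deleting state, its local
dib files can be removed:

def can_delete_local_files(uploads):
    # All uploads marked for deletion means the image is no longer
    # wanted locally or in any cloud, so the on-disk files can go.
    return all(u['state'] == 'deleting' for u in uploads)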

Change-Id: I7549dca5172bc82f143463994ac8578cc98a4375
2020-01-13 13:22:16 -08:00
Tobias Henkel 376adbc933 Delete images by id
Nodepool currently deletes managed images by name. This leads to
problems when an image is uploaded twice (e.g. because of multiple
formats). In this case there can be more than one image with the same
name which breaks the deletion. This can be fixed by deleting the
images by id instead.

Change-Id: I74fc6219ef7f2c496f36defb0703137ec4d7d30e
2019-11-25 12:57:51 +01:00
Clark Boylan 4b6afe403b Handle case where nova server is in DELETED state
The nova api can return instance records for an instance that has been
deleted. When it does this the status should be "DELETED". This means we
should check that the instance has no record at all or, if a record is
present, that its status is DELETED.
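
A minimal sketch of the check described (illustrative): treat the
instance as gone when nova returns no record, or returns one whose
status is DELETED:

def instance_is_gone(server):
    return server is None or server.get('status') == 'DELETED'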

Change-Id: I7ad753a3c73f3d2cd78f4a380f78279af9206ada
2019-10-11 11:01:09 -07:00
Clark Boylan 21ae5d7027 Use real uuids in fake cloud resource IDs
When debugging fake cloud interactions between changing provider managers,
it is really confusing to see logs showing that port 1a has been deleted,
but then find that 1a is still there when ports are listed.

This was happening because we were sharing the id values for fake cloud
resources across fake provider managers. Initially this wasn't clear to
me when reading the logs.

By updating these resources to use real uuids we can see that $UUID1
does not equal $UUID2 when comparing these items in logs.

Change-Id: I77db55b1a7649c3af91e83c81a906f03a28bb34a
2019-10-07 12:32:40 -07:00
Tobias Henkel 8678b34398 Fix node failures when at volume quota
When creating instances with boot from volume we don't get quota
related information in the exception raised by wait_for_server. Also,
the fault information is missing in the server munch that is
returned. This causes node failures when we run into the volume
quota. This can be fixed by explicitly fetching the server if we got
one and inspecting the fault information which contains more
information about the fault reason [1].

[1] Example fault reason:
Build of instance 4628f079-26a9-4a1d-aaa0-881ba4c7b9cb aborted:
VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds
allowed gigabytes quota. Requested 500G, quota is 10240G and 10050G
has been consumed.
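
A hedged sketch of the inspection described above (not the exact code):
re-check the fetched server's fault message for the quota error rather
than relying on the exception from wait_for_server:

def is_quota_fault(server):
    # The fault message carries the real reason, e.g. the
    # VolumeSizeExceedsAvailableQuota text quoted above.
    fault = (server or {}).get('fault') or {}
    return 'quota' in fault.get('message', '').lower()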

Change-Id: I6d832d4dbe348646cd4fb49ee7cb5f6a6ad343cf
2019-09-06 15:15:34 -04:00
Monty Taylor 7618b714e2 Remove unused use_taskmanager flag
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.

Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
2019-04-02 12:11:07 +00:00
Ian Wienand 0cf8144e8c Revert "Revert "Cleanup down ports""
This reverts commit 7e1b8a7261.

openstacksdk >=0.19.0 fixes the filtering problems leading to all
ports being deleted. However openstacksdk <0.21.0 has problems with
dogpile.cache so use 0.21.0 as a minimum.

Change-Id: Id642d074cbb645ced5342dda4a1c89987c91a8fc
2019-01-18 15:03:55 +01:00
Paul Belanger cc994d3a6a Include host_id for openstack provider
When we create a server using the openstack provider, in certain
conditions it is helpful to collect the host_id value. As an example,
Zuul will pass through this information into a job inventory file which
will allow an operator to better profile where jobs run within a cloud.

This can be helpful when trying to debug random jobs timing out within a
cloud.

Change-Id: If29397e67a470462561f24c746375b8291ac43ab
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2018-12-06 18:49:48 -05:00
Tristan Cacqueray 6fe861f42a Set type for error'ed instances
When a server creation fails but has an external id we create a new
znode to offload the deletion of that node. This currently misses the
node type, which will trigger an exception during node launch [1]. This
wedges the provider until the node deleter kicks in and deletes that
node successfully. Fix this by storing the node type in this znode.

[1] Exception
Traceback (most recent call last):
  File "nodepool/driver/__init__.py", line 639, in run
    self._runHandler()
  File "nodepool/driver/__init__.py", line 563, in _runHandler
    self._waitForNodeSet()
  File "nodepool/driver/__init__.py", line 463, in _waitForNodeSet
    if not self.hasRemainingQuota(ntype):
  File "nodepool/driver/openstack/handler.py", line 314, in hasRemainingQuota
    self.manager.estimatedNodepoolQuotaUsed())
  File "nodepool/driver/openstack/provider.py", line 164, in estimatedNodepoolQuotaUsed
    if node.type[0] not in provider_pool.labels:
IndexError: list index out of range
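
A minimal sketch of the fix (field names illustrative): the znode
created to offload the deletion carries the node type along with the
external id:

def make_delete_node(failed_node, new_node):
    new_node.external_id = failed_node.external_id
    # Storing the type prevents the IndexError above when quota
    # calculations later read node.type[0].
    new_node.type = failed_node.type
    return new_node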

Change-Id: I67b269069dddb8349959802d7b1ee049a826d0c5
Co-authored-by: Tobias Henkel <tobias.henkel@bmw.de>
2018-12-05 10:30:20 +01:00
Tobias Henkel 7e1b8a7261 Revert "Cleanup down ports"
The port filter for DOWN ports seems to have no effect. It actually
deleted *all* ports in the tenant.

This reverts commit cdd60504ec.

Change-Id: I48c1430bb768903af467cace1a720e45ecc8e98f
2018-10-30 13:13:43 +01:00
David Shrewsbury cdd60504ec Cleanup down ports
Cleanup will be periodic (every 3 minutes by default, not yet
configurable) and will be logged and reported via statsd.

Change-Id: I81b57d6f6142e64dd0ebf31531ca6489d6c46583
2018-10-29 13:36:43 -04:00
mhuin ebdc35fc5e Do not abort node launch if failed node cannot be deleted
There was a possibility for an uncaught exception when launching a
node: if the launch fails and the subsequent node cleaning fails
as well, nodepool would break out of the loop prematurely.

Also add request information when logging launch failure to make
debugging easier.
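
An illustrative sketch of the retry behavior described (not the real
handler code): a failure while cleaning up the failed node is logged
but no longer breaks out of the retry loop:

import logging

log = logging.getLogger("nodepool.example")

def launch_with_retries(launch, cleanup, node, request, retries=3):
    for attempt in range(1, retries + 1):
        try:
            launch(node)
            return True
        except Exception:
            log.exception("Launch attempt %d/%d failed for request %s",
                          attempt, retries, request)
            try:
                cleanup(node)
            except Exception:
                # Previously this second failure escaped the loop;
                # now we log it and keep retrying.
                log.exception("Cleanup of failed node also failed")
    return False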

Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
2018-09-10 10:40:14 +02:00
Tobias Henkel 934b1eed9c Invalidate az cache on bad request
When getting error 400 we not only need to clear the image and flavor
cache but the az cache as well. Otherwise we will constantly get node
failures for any node request where nodepool chose that az
[1]. Currently the only way to recover from this situation is to
restart nodepool. Invalidating the cache doesn't fix the request that
failed due to this error but at least ensures that nodepool will
recover from this situation automatically for all further node
requests.

[1] Trace:
2018-07-05 09:09:08,477 ERROR nodepool.NodeLauncher-0000123378: Launch attempt 2/3 failed for node 0000123378:
Traceback (most recent call last):
  File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 221, in launch
    self._launchNode()
  File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 134, in _launchNode
    volume_size=self.label.volume_size)
  File "/opt/nodepool-source/nodepool/driver/openstack/provider.py", line 378, in createServer
    return self._client.create_server(wait=False, **create_args)
  File "<decorator-gen-106>", line 2, in create_server
  File "/usr/lib/python3.5/site-packages/shade/_utils.py", line 410, in func_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/shade/openstackcloud.py", line 6909, in create_server
    endpoint, json=server_json)
  File "/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 334, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 158, in request
    return self._munch_response(response, error_message=error_message)
  File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 114, in _munch_response
    exc.raise_from_response(response, error_message=error_message)
  File "/usr/lib/python3.5/site-packages/shade/exc.py", line 171, in raise_from_response
    raise OpenStackCloudBadRequest(msg, response=response)
shade.exc.OpenStackCloudBadRequest: (400) Client Error for url: (...) The requested availability zone is not available
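
A hedged sketch of the invalidation described (attribute names are
illustrative, not nodepool's actual ones): a 400 response now clears
the availability-zone cache alongside the image and flavor caches:

def invalidate_caches_on_bad_request(provider):
    provider.image_cache.clear()
    provider.flavor_cache.clear()
    # Also forget the cached AZ list so an unavailable AZ is not
    # chosen again for every subsequent request.
    provider.az_cache = None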

Change-Id: I5f653f159b08cf086d20c2398a9345bd4caa4d1e
2018-07-23 14:04:08 +02:00
Artem Goncharov fc1f80b6d1 Replace shade and os-client-config with openstacksdk.
os-client-config is now just a wrapper around openstacksdk. The shade
code has been imported into openstacksdk. To reduce complexity, just use
openstacksdk directly.

openstacksdk's TaskManager has had to grow some features to deal with
SwiftService. Making nodepool's TaskManager a subclass of openstacksdk's
TaskManager ensures that we get the thread pool set up properly.

Change-Id: I3a01eb18ae31cc3b61509984f3817378db832b47
2018-07-14 08:44:03 -05:00
David Shrewsbury e66e5a56b6 Add test for referencing cloud image by ID
This adds a test for the image-id cloud image attribute, which
should cause shade to perform a direct lookup on the image by
UUID (bypassing the typical name-or-id logic).

Change-Id: I8b2b99aac012e0f1adda326dfe76d9f90f606cb1
Co-Authored-By: James E. Blair <jeblair@redhat.com>
2018-07-03 14:13:56 -04:00
Artem Goncharov 674c9516dc Add support for specifying security_group in nodepool
In some installations it might not be feasible to rely on the default
security group (for security reasons). To also make it possible to share
one tenant between zuul and other resources, support for specifying
security_groups at the driver.openstack.pool level is added.

Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
2018-06-05 10:00:06 +02:00
James E. Blair ae0f54abeb Directly link providers and request handlers
Rather than relying on inspection and loading classes from specific
filenames, ask each driver's Provider implementation to supply the
NodeRequestHandler instance for requests.

This is the first in a series of changes to remove all similar
indirection from the drivers in order to make them easier to
understand.

The test driver is moved out of the fixtures subdir to be a
"regular" driver because it now needs relative imports (provider
code needs to import the handler code).
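
A rough sketch of the direct linkage (class and method names here are
assumptions, not the exact API): the provider supplies its own handler
instance instead of the launcher locating handler classes by file name:

class ExampleNodeRequestHandler:
    def __init__(self, poolworker, request):
        self.poolworker = poolworker
        self.request = request

class ExampleProvider:
    def getRequestHandler(self, poolworker, request):
        # The driver decides which handler serves the request.
        return ExampleNodeRequestHandler(poolworker, request)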

Change-Id: Iad60d8bfe102d43b9fb1a1ff3ede13e4fa753572
2018-06-04 11:53:44 -04:00
Tristan Cacqueray d0a67878a3 Add a plugin interface for drivers
This change adds a plugin interface so that drivers can be loaded dynamically.
Instead of importing each driver in the launcher, provider_manager and config,
the Drivers class discovers and loads drivers from the driver directory.

This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.
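
An illustrative sketch of the discovery step (not the exact Drivers
implementation): scan the driver directory and import each driver
package dynamically:

import importlib
import os

def discover_drivers(drivers_dir, package_prefix='nodepool.driver'):
    drivers = {}
    for name in sorted(os.listdir(drivers_dir)):
        path = os.path.join(drivers_dir, name)
        if os.path.isdir(path) and not name.startswith('_'):
            drivers[name] = importlib.import_module(
                '%s.%s' % (package_prefix, name))
    return drivers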

Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
2018-01-19 00:45:56 +00:00
James E. Blair 5aac0a361e Fail on quota-exceeded (partial revert)
This is a partial revert of ae65d94e34
which did two things: made over-quota errors not fatal, and invalidated
the quota cache.

The quota cache invalidation had a typo which caused it to fail.  This
error actually meant that we retained the old behavior of marking
a launch attempt as failed if we encountered an over-quota error.
This change adds a test to exercise this and fixes the typo.

In writing the test, it was observed that a launch which encounters
an over-quota situation will continue to retry indefinitely, every
5 seconds, and will emit an exception into the log each time.  This is
probably not the best experience for an operator.  Further, because
there is no discernible state change within nodepool, it is very
difficult to test this situation.

Due to these issues, we should discuss a more robust way to handle
unexpected over-quota errors.  In the meantime, reverting to the old
(and indeed, still current due to the typo) behavior of failing on
over-quota error, seems the safest way to proceed.

Also, change time.monotonic to time.time.  The reference point for
monotonic is undefined, and can be 0, or other values less than 300,
which causes the cache timeout check to fail.
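
As a small illustration of the timing fix (the surrounding check is
paraphrased, not the exact code): with time.monotonic() the clock's
reference point is undefined and may be a small number, so an age
comparison against an initial timestamp of 0 can be wrong; time.time()
avoids that:

import time

CACHE_TTL = 300

def quota_cache_valid(cached_at):
    # cached_at is a time.time() value recorded when the cache was
    # filled; the cache is considered fresh for CACHE_TTL seconds.
    return time.time() - cached_at < CACHE_TTL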

Change-Id: Id488c070df991a554570c5717dc85aec351fed45
2017-12-19 17:15:56 -08:00
Tobias Henkel 1b465699ed Add cloud quota handling
This is an approach to calculating the cloud quotas before launching
nodes. To support co-existence with other services, the limits of the
cloud are gathered and all unknown resources are subtracted from the
maximum values. The remaining quota available to nodepool can thus be
calculated and checked before launching nodes.
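
A sketch of the calculation described (illustrative names): subtract
usage by resources nodepool doesn't know about from the cloud's limits,
leaving the quota nodepool may consume:

def nodepool_quota(cloud_limits, unmanaged_usage):
    # e.g. cloud_limits = {'cores': 200, 'ram': 512000, 'instances': 50}
    return {resource: limit - unmanaged_usage.get(resource, 0)
            for resource, limit in cloud_limits.items()}

def has_remaining_quota(remaining, predicted_usage):
    return all(predicted_usage.get(r, 0) <= avail
               for r, avail in remaining.items())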

Change-Id: I18594ab76949faddade3164b6027dd3b82771c95
2017-12-13 20:44:13 +01:00
Rui Chen 32e1e0b616 Apply floating ip for node according to configuration
When we deploy nodepool and zuul instances in virtual machines of a
cloud provider, the provisioned nodes may be on the same internal
network as the nodepool and zuul instances. In that case we don't
have to allocate floating ips for the nodes; zuul can talk to the nodes
via the fixed ips of the virtual machines. Being able to customize this
behavior saves floating ip quota.

Note: Although the "floating_ip_source: None" option in clouds.yaml can
control whether a floating ip is applied for a specified cloud, that
impacts all the SDKs and tools that use the clouds.yaml; we want to
control nodepool's behavior flexibly and independently.

This patch adds a bool option "auto-floating-ip" to each pool of the
"provider" section in nodepool.conf.

Change-Id: Ia9a1bed6dd4f6e39015bde660f52e4cd6addb26e
2017-11-22 08:34:57 +00:00
Tristan Cacqueray b01227c9d4 Move the fakeprovider module to the fake driver
This change is a follow-up to the drivers spec and it makes the fake provider
a real driver. The fakeprovider module is merged into the fake provider and
the get_one_cloud config loader is simplified.

Change-Id: I3f8ae12ea888e7c2a13f246ea5f85d4a809e8c8d
2017-07-28 11:35:07 +00:00
Tristan Cacqueray 4d201328f5 Collect request handling implementation in an OpenStack driver
This change moves OpenStack related code to a driver. To avoid circular
import, this change also moves the StatsReporter to the stats module so that
the handlers don't have to import the launcher.

Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614
2017-07-25 14:27:17 +00:00