This updates the OpenStack driver to use the statemachine framework.
The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:
1) We can dramatically reduce the number of threads in Nodepool which
is our biggest scaling bottleneck. The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers. The statemachine
framework is an extension of that idea implemented so that every driver
can use it. This change further reduces the number of threads needed
even for the OpenStack driver.
2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.
There are a few updates to the statemachine framework to accommodate some
features that, to date, only the OpenStack driver used.
A number of tests need slight alteration since the OpenStack driver is
the basis of the "fake" driver used for tests.
Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
For drivers that support tagging/metadata (openstack, aws, azure),
add or enhance support for supplying tags for uploaded diskimages.
This allows users to set metadata on the global diskimage object
which will then be used as default values for metadata on the
provider diskimage values. The resulting merged dictionary forms
the basis of metadata to be associated with the uploaded image.
The changes needed to reconcile this for the three drivers mentioned
above are:
All: the diskimages[].meta key is added to supply the default values
for provider metadata.
OpenStack: provider diskimage metadata is already supported using
providers[].diskimages[].meta, so no further changes are needed.
AWS, Azure: provider diskimage tags are added using the key
providers[].diskimages[].tags since these providers already use
the "tags" nomenclature for instances.
This results in the somewhat incongruous situation where we have
diskimage "metadata" being combined with provider "tags", but the
alternative is to have images with "metadata" while instances have
"tags", both of which are "tags" in EC2. The chosen approach is at
least consistent within each driver.
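The merge described above can be sketched as follows (the function name and example keys are illustrative, not Nodepool's actual code):

```python
def merged_image_tags(diskimage_meta, provider_tags):
    """Combine global diskimage metadata (the defaults) with
    provider-level tags; provider values win on conflict."""
    merged = dict(diskimage_meta or {})
    merged.update(provider_tags or {})
    return merged
```

For example, a diskimage with meta `{'team': 'infra'}` and a provider diskimage with tags `{'team': 'ci', 'env': 'prod'}` would yield `{'team': 'ci', 'env': 'prod'}` as the tags for the uploaded image.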
Change-Id: I30aadadf022af3aa97772011cda8dbae0113a3d8
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.
This bug was preventing quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, the QuotaException was never raised and the node allocation
failed outright.
A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.
This fixes that bug, but does change the behavior of the OpenStack
driver when ignore-provider-quota is True and the cloud returns a
quota error.
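A minimal sketch of the guard, assuming the cache is a dict with a 'timestamp' key (the class and attribute names here are illustrative, not Nodepool's actual implementation):

```python
class QuotaCacheHolder:
    def __init__(self):
        # Never populated when 'ignore-provider-quota' is in use.
        self._current_nodepool_quota = None

    def invalidateQuotaCache(self):
        # Guard against None before treating the cache as a dictionary,
        # so a QuotaException raised by the caller can propagate instead
        # of being masked by a TypeError.
        if self._current_nodepool_quota is not None:
            self._current_nodepool_quota['timestamp'] = 0
```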
Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
If a provider (or its configuration) is sufficiently broken that
the provider manager is unable to start, then the launcher will
go into a loop where it attempts to restart all providers in the
system until it succeeds. During this time, no pool managers are
running, which means all requests are ignored by this launcher.
Nodepool continuously reloads its configuration file, and in case
of an error, the expected behavior is to continue running and allow
the user to correct the configuration and retry after a short delay.
We also expect providers on a launcher to be independent of each
other, so that if one fails, the others continue working.
However since we neither exit, nor process node requests if a
provider manager fails to start, an error with one provider can
cause all providers to stop handling requests with very little
feedback to the operator.
To address this, if a provider manager fails to start, the launcher
will now behave as if the provider were absent from the config file.
It will still emit the error to the log, and it will continuously
attempt to start the provider so that if the error condition abates,
the provider will start.
If there are no providers on-line for a label, then as long as any
provider in the system is running, node requests will be handled
and declined and possibly failed while the broken provider is offline.
If the system contains only a single provider and it is broken, then
no requests will be handled (they will fail), which is the current
behavior and still likely the most desirable in that case.
Change-Id: If652e8911993946cee67c4dba5e6f88e55ac7099
Currently nodepool has one thread per server creation or
deletion. Each of those waits for the cloud by regularly getting the
server list and checking whether its instance is active or gone. On a
busy nodepool this leads to severe thread contention when the server
list gets large and/or there are many parallel creations/deletions in
progress.
This can be improved by offloading the waiting to a single thread that
regularly retrieves the server list and compares it to the list of
waiting server creates/deletes. The calling threads then wait until
the central thread wakes them up to proceed with their task. The
waiting threads block on an event outside of the GIL and thus no
longer contribute to the thread contention problem.
An alternative approach would be to redesign the code to use fewer
threads, but that would be a much more complex redesign. This change
therefore keeps the many-threads approach while making their waiting
much more lightweight, which showed a substantial improvement during
load testing in a test environment.
Change-Id: I5525f2558a4f08a455f72e6b5479f27684471dc7
The OpenStack provider doesn't filter on status, so when we upload a
new image and deactivate the old one, it throws an SDKException
because it finds multiple images with the same name.
This adds a filter so that we only look up Glance images with an
`active` status via the openstacksdk, which is the only valid state
in which we can use the image [1].
[1] https://docs.openstack.org/glance/latest/user/statuses.html
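The effect of the filter can be illustrated in isolation (a sketch over plain dict records as the cloud might return them; names are hypothetical):

```python
def pick_active_image(images, name):
    """Select the single active image with the requested name, ignoring
    deactivated images that share it."""
    candidates = [i for i in images
                  if i['name'] == name and i['status'] == 'active']
    if len(candidates) != 1:
        raise RuntimeError(
            "Expected exactly one active image named %s, found %d"
            % (name, len(candidates)))
    return candidates[0]
```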
Change-Id: I480b4e222232da1f1aa86b1a08117e605ef08eb4
We have noticed in the real world that images sometimes refuse to
delete. Boot-from-volume with leaked volumes can cause this, as can
other unknown issues. When this happens we keep old dib builds around
longer than expected which eventually leads to full builder disks and
sadness.
Once we've set all provider uploads to the delete state we know we no
longer want that image at all, either locally or in the cloud. We can
delete the local copies of the files as soon as we reach this state so
that we aren't waiting for the provider image uploads to go away first.
This should make the builders more reliable as they'll clean their disks
as soon as is possible.
Change-Id: I7549dca5172bc82f143463994ac8578cc98a4375
Nodepool currently deletes managed images by name. This leads to
problems when an image is uploaded twice (e.g. because of multiple
formats). In this case there can be more than one image with the same
name which breaks the deletion. This can be fixed by deleting the
images by id instead.
Change-Id: I74fc6219ef7f2c496f36defb0703137ec4d7d30e
The nova API can return instance records for an instance that has been
deleted. When it does, the status should be "DELETED". This means we
should check either that the instance has no record or, if a record is
present, that its status is DELETED.
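The check described above amounts to (a sketch over a plain dict record; not Nodepool's actual code):

```python
def server_is_deleted(server):
    """A server is gone if nova returns no record at all, or returns a
    record whose status is DELETED."""
    return server is None or server.get('status') == 'DELETED'
```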
Change-Id: I7ad753a3c73f3d2cd78f4a380f78279af9206ada
When debugging fake cloud interactions between changing provider
managers, it is really confusing to see logs showing that port 1a has
been deleted, but then still see 1a when ports are listed.
This was happening because we were sharing the id values for fake cloud
resources across fake provider managers. Initially this wasn't clear to
me reading logs.
By updating these resources to use real uuids we can see that $UUID1
does not equal $UUID2 when comparing these items in logs.
Change-Id: I77db55b1a7649c3af91e83c81a906f03a28bb34a
When creating instances with boot-from-volume we don't get
quota-related information in the exception raised by wait_for_server.
The fault information is also missing from the server munch that is
returned. This causes node failures when we run into the volume
quota. This can be fixed by explicitly fetching the server, if we got
one, and inspecting the fault information, which contains more detail
about the fault reason [1].
[1] Example fault reason:
Build of instance 4628f079-26a9-4a1d-aaa0-881ba4c7b9cb aborted:
VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds
allowed gigabytes quota. Requested 500G, quota is 10240G and 10050G
has been consumed.
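A hedged sketch of the fix: on a failed wait, re-fetch the server record by id and surface the fault message. The getter is passed in here because the real lookup goes through the cloud API; all names are illustrative.

```python
def launch_failure_reason(get_server_by_id, server_id):
    """Return the fault message for a failed launch, or a generic
    fallback when no record or fault is available."""
    if not server_id:
        return 'Unknown error'
    server = get_server_by_id(server_id) or {}
    fault = server.get('fault') or {}
    return fault.get('message', 'Unknown error')
```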
Change-Id: I6d832d4dbe348646cd4fb49ee7cb5f6a6ad343cf
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.
Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
This reverts commit 7e1b8a7261.
openstacksdk >=0.19.0 fixes the filtering problems leading to all
ports being deleted. However openstacksdk <0.21.0 has problems with
dogpile.cache so use 0.21.0 as a minimum.
Change-Id: Id642d074cbb645ced5342dda4a1c89987c91a8fc
When we create a server using the openstack provider, in certain
conditions it is helpful to collect the host_id value. As an example,
Zuul will pass through this information into a job inventory file which
will allow an operator to better profile where jobs run within a cloud.
This can be helpful trying to debug random jobs timing out within a
cloud.
Change-Id: If29397e67a470462561f24c746375b8291ac43ab
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
When a server creation fails but has an external id, we create a new
znode to offload the deletion of that node. This currently misses the
node type, which triggers an exception during node launch [1]. This
wedges the provider until the node deleter kicks in and deletes that
node successfully. Fix this by storing the node type in the znode.
[1] Exception
Traceback (most recent call last):
File "nodepool/driver/__init__.py", line 639, in run
self._runHandler()
File "nodepool/driver/__init__.py", line 563, in _runHandler
self._waitForNodeSet()
File "nodepool/driver/__init__.py", line 463, in _waitForNodeSet
if not self.hasRemainingQuota(ntype):
File "nodepool/driver/openstack/handler.py", line 314, in hasRemainingQuota
self.manager.estimatedNodepoolQuotaUsed())
File "nodepool/driver/openstack/provider.py", line 164, in estimatedNodepoolQuotaUsed
if node.type[0] not in provider_pool.labels:
IndexError: list index out of range
Change-Id: I67b269069dddb8349959802d7b1ee049a826d0c5
Co-authored-by: Tobias Henkel <tobias.henkel@bmw.de>
The port filter for DOWN ports seems to have no effect. It actually
deleted *all* ports in the tenant.
This reverts commit cdd60504ec.
Change-Id: I48c1430bb768903af467cace1a720e45ecc8e98f
Cleanup will be periodic (every 3 minutes by default, not yet
configurable) and will be logged and reported via statsd.
Change-Id: I81b57d6f6142e64dd0ebf31531ca6489d6c46583
There was a possibility of an uncaught exception when launching a
node: if the launch fails and the subsequent node cleanup fails as
well, nodepool would break out of the loop prematurely.
Also add request information when logging launch failure to make
debugging easier.
Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
When getting a 400 error we need to clear not only the image and
flavor caches but the AZ cache as well. Otherwise we will constantly
get node failures for any node request where nodepool chose that AZ
[1]. Currently the only way to recover from this situation is to
restart nodepool. Invalidating the cache doesn't fix the request that
failed due to this error, but at least ensures that nodepool will
recover from this situation automatically for all further node
requests.
[1] Trace:
018-07-05 09:09:08,477 ERROR nodepool.NodeLauncher-0000123378: Launch attempt 2/3 failed for node 0000123378:
Traceback (most recent call last):
File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 221, in launch
self._launchNode()
File "/opt/nodepool-source/nodepool/driver/openstack/handler.py", line 134, in _launchNode
volume_size=self.label.volume_size)
File "/opt/nodepool-source/nodepool/driver/openstack/provider.py", line 378, in createServer
return self._client.create_server(wait=False, **create_args)
File "<decorator-gen-106>", line 2, in create_server
File "/usr/lib/python3.5/site-packages/shade/_utils.py", line 410, in func_wrapper
return func(*args, **kwargs)
File "/usr/lib/python3.5/site-packages/shade/openstackcloud.py", line 6909, in create_server
endpoint, json=server_json)
File "/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 334, in post
return self.request(url, 'POST', **kwargs)
File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 158, in request
return self._munch_response(response, error_message=error_message)
File "/usr/lib/python3.5/site-packages/shade/_adapter.py", line 114, in _munch_response
exc.raise_from_response(response, error_message=error_message)
File "/usr/lib/python3.5/site-packages/shade/exc.py", line 171, in raise_from_response
raise OpenStackCloudBadRequest(msg, response=response)
shade.exc.OpenStackCloudBadRequest: (400) Client Error for url: (...) The requested availability zone is not available
Change-Id: I5f653f159b08cf086d20c2398a9345bd4caa4d1e
os-client-config is now just a wrapper around openstacksdk. The shade
code has been imported into openstacksdk. To reduce complexity, just use
openstacksdk directly.
openstacksdk's TaskManager has had to grow some features to deal with
SwiftService. Making nodepool's TaskManager a subclass of openstacksdk's
TaskManager ensures that we get the thread pool set up properly.
Change-Id: I3a01eb18ae31cc3b61509984f3817378db832b47
This adds a test for the image-id cloud image attribute (which
should cause shade to perform a direct lookup on the image by
UUID, bypassing the typical name-or-id logic).
Change-Id: I8b2b99aac012e0f1adda326dfe76d9f90f606cb1
Co-Authored-By: James E. Blair <jeblair@redhat.com>
In some installations it may be impractical to rely on the default
security group (for security reasons). In order to also make it
possible to share one tenant between zuul and other resources, support
for specifying security_groups at the driver.openstack.pool level is
added.
Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
Rather than relying on inspection and loading classes from specific
filenames, ask each driver's Provider implementation to supply the
NodeRequestHandler instance for requests.
This is the first in a series of changes to remove all similar
indirection from the drivers in order to make them easier to
understand.
The test driver is moved out of the fixtures subdir to be a
"regular" driver because it now needs relative imports (provider
code needs to import the handler code).
Change-Id: Iad60d8bfe102d43b9fb1a1ff3ede13e4fa753572
This change adds a plugin interface so that drivers can be loaded
dynamically. Instead of importing each driver in the launcher,
provider_manager and config, the Drivers class discovers and loads
drivers from the driver directory.
This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.
Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
This is a partial revert of ae65d94e34
which did two things: made over-quota errors not fatal, and invalidated
the quota cache.
The quota cache invalidation had a typo which caused it to fail. This
error actually meant that we retained the old behavior of marking
a launch attempt as failed if we encountered an over-quota error.
This change adds a test to exercise this and fixes the typo.
In writing the test, it was observed that a launch which encounters
an over-quota situation will continue to retry indefinitely, every
5 seconds, and will emit an exception into the log each time. This is
probably not the best experience for an operator. Further, because
there is no discernable state change within nodepool, it is very
difficult to test this situation.
Due to these issues, we should discuss a more robust way to handle
unexpected over-quota errors. In the meantime, reverting to the old
(and indeed, still current due to the typo) behavior of failing on an
over-quota error seems the safest way to proceed.
Also, change time.monotonic to time.time. The reference point for
monotonic is undefined, and can be 0, or other values less than 300,
which causes the cache timeout check to fail.
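The failure mode can be shown with a small sketch of such a timeout check (function and parameter names are illustrative): time.monotonic() has an undefined epoch and may legitimately be a small number, so a zeroed timestamp does not reliably read as expired; with time.time() (seconds since the Unix epoch) it always does.

```python
import time

def cache_expired(timestamp, ttl=300, now=None):
    """Return True if the cached value is older than ttl seconds.
    'timestamp' defaults to 0 for a never-filled cache, which only
    behaves as 'expired' when measured against a wall-clock epoch."""
    if now is None:
        now = time.time()
    return now - timestamp > ttl
```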
Change-Id: Id488c070df991a554570c5717dc85aec351fed45
This is an approach to calculating the cloud quotas before launching
nodes. To support co-existence with other services, the limits of the
cloud are gathered and all unknown resources are subtracted from the
maximum values. The remaining quota available to nodepool can thus be
calculated and checked before launching nodes.
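The calculation above can be sketched as follows (resource names and the exact accounting are illustrative, not Nodepool's actual quota code):

```python
def remaining_quota(cloud_limits, unknown_usage):
    """Subtract usage by resources nodepool doesn't manage from the
    cloud's absolute limits, leaving the quota nodepool may consume."""
    return {resource: limit - unknown_usage.get(resource, 0)
            for resource, limit in cloud_limits.items()}
```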
Change-Id: I18594ab76949faddade3164b6027dd3b82771c95
When we deploy nodepool and zuul instances in virtual machines of a
cloud provider, the provisioned nodes may be in the same internal
network as the nodepool and zuul instances. In that case we don't
have to allocate floating IPs for the nodes; zuul can talk to the
nodes via the fixed IPs of the virtual machines. Being able to
customize this behavior, and so save floating IP quota, is therefore
useful.
Note: Although the option "floating_ip_source: None" in clouds.yaml
can decide whether to apply a floating IP for a specified cloud, that
impacts all the SDKs and tools that use the clouds.yaml; we should
control nodepool's behavior flexibly and independently.
This patch adds a boolean option "auto-floating-ip" to each pool of
the "provider" section in nodepool.conf.
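A hypothetical pool configuration using the new option (the cloud, pool, and label names here are illustrative, not taken from a real deployment):

```yaml
providers:
  - name: internal-cloud
    driver: openstack
    cloud: mycloud
    pools:
      - name: main
        # Nodes share the internal network with zuul; skip floating IPs.
        auto-floating-ip: false
        labels:
          - name: ubuntu-focal
            min-ram: 8192
```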
Change-Id: Ia9a1bed6dd4f6e39015bde660f52e4cd6addb26e
This change is a follow-up to the drivers spec and it makes the fake provider
a real driver. The fakeprovider module is merged into the fake provider and
the get_one_cloud config loader is simplified.
Change-Id: I3f8ae12ea888e7c2a13f246ea5f85d4a809e8c8d
This change moves OpenStack related code to a driver. To avoid a
circular import, this change also moves the StatsReporter to the stats
module so that the handlers don't have to import the launcher.
Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614