This updates the OpenStack driver to use the statemachine framework.
The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:
1) We can dramatically reduce the number of threads in Nodepool which
is our biggest scaling bottleneck. The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers. The statemachine
framework is an extension of that idea implemented so that every driver
can use it. This change further reduces the number of threads needed
even for the OpenStack driver.
2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.
There are a few updates to the statemachine framework to accommodate some
features that, to date, only the OpenStack driver used.
A number of tests need slight alteration since the OpenStack driver is
the basis of the "fake" driver used for tests.
Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
This allows users to create tags (or properties in the case of OpenStack)
on instances using string interpolation values. The use case is to be
able to add information about the tenant* which requested the instance
to cloud-provider tags.
* Note that ultimately Nodepool may not end up using a given node for
the request which originally prompted its creation, so care should be
taken when using information like this. The documentation notes that.
This feature uses a new configuration attribute on the provider-label
rather than the existing "tags" or "instance-properties" because existing
values may not be safe for use as Python format strings (e.g., an
existing value might be a JSON blob). This could be solved with YAML
tags (like !unsafe) but the most sensible default for that would be to
assume format strings and use a YAML tag to disable formatting, which
doesn't help with our backwards-compatibility problem. Additionally,
Nodepool configuration does not use YAML anchors (yet), so this would
be a significant change that might affect people's use of external tools
on the config file.
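A minimal illustration of the format-string hazard described above, using hypothetical tag values (the actual configuration attribute name and syntax are not shown here):

```python
# Hypothetical values illustrating why existing tag values cannot be
# treated as Python format strings: braces in a JSON blob are parsed
# as replacement fields and raise an error.
json_blob = '{"team": "ci"}'
try:
    json_blob.format(request=None)
    safe = True
except (KeyError, IndexError, ValueError):
    safe = False  # the JSON braces break str.format

# A value written intentionally as a format string works as expected.
class Request:
    tenant_name = 'acme'

tag = 'requested-by-{request.tenant_name}'.format(request=Request())
```

This is why the feature is opt-in via a new attribute: values under the existing "tags"/"instance-properties" keys were never required to be format-safe.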
Testing this was beyond the ability of the AWS test framework as written,
so some redesign for how we handle patching boto-related methods is
included. The new approach is simpler, more readable, and flexible
in that it can better accommodate future changes.
Change-Id: I5f1befa6e2f2625431523d8d94685f79426b6ae5
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.
This will enable us to bring other Zuul classes over, such as the
component registry.
The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper. Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.
Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
Change I670d4bfd7d35ce1109b3ee9b3342fb45ee283a79 added the
quota['compute'] property to ZK nodes for Zuul to keep track of.
Instead of accessing the property directly, this adds a
get_resources() function to QuotaInformation which returns it, and
notes that the property is in use so that we don't modify it in
incompatible ways in the future.
This is a no-op refactor, but helpful in a follow-on change that also
adds this field to leaked nodes.
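A sketch of the accessor described above; the quota['compute'] structure follows the text, but the details of the real class are assumed:

```python
class QuotaInformation:
    """Sketch of the accessor; real attribute names may differ."""

    def __init__(self, cores=0, instances=0, ram=0):
        self.quota = {'compute': {'cores': cores,
                                  'instances': instances,
                                  'ram': ram}}

    def get_resources(self):
        # Zuul reads this structure from ZK node records, so its
        # shape must not change in incompatible ways.
        return self.quota['compute']

resources = QuotaInformation(cores=8, instances=1, ram=8192).get_resources()
```

Callers go through get_resources() instead of reaching into the dict, which documents the external dependency at a single point.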
Change-Id: Id78b059cf2121e01e4cd444f6ad3834373cf7fb6
Ansible needs to know which shell type the node uses in order to operate
correctly, especially for SSH connections to Windows nodes, because
otherwise Ansible defaults to trying bash.
Change-Id: I71abfefa57aaafd88f199be19ee7caa64efda538
When the ports quota is accidentally lower than the instance quota in
an OpenStack provider, we can hit errors about the number of ports being
exceeded during instance creation [1]. This should be handled in nodepool
like a quota-exceeded situation in order to not fail node requests.
[1] Trace:
ERROR nodepool.NodeLauncher: Launch failed for node None:
Traceback (most recent call last):
File "/opt/nodepool/lib/python3.8/site-packages/nodepool/driver/utils.py", line 73, in run
self.launch()
File "/opt/nodepool/lib/python3.8/site-packages/nodepool/driver/openstack/handler.py", line 246, in launch
self._launchNode()
File "/opt/nodepool/lib/python3.8/site-packages/nodepool/driver/openstack/handler.py", line 125, in _launchNode
server = self.handler.manager.createServer(
File "/opt/nodepool/lib/python3.8/site-packages/nodepool/driver/openstack/provider.py", line 281, in createServer
return self._client.create_server(wait=False, **create_args)
File "<decorator-gen-4>", line 2, in create_server
File "/opt/nodepool/lib/python3.8/site-packages/openstack/cloud/_utils.py", line 370, in func_wrapper
return func(*args, **kwargs)
File "/opt/nodepool/lib/python3.8/site-packages/openstack/cloud/_compute.py", line 932, in create_server
data = proxy._json_response(
File "/opt/nodepool/lib/python3.8/site-packages/openstack/proxy.py", line 646, in _json_response
exceptions.raise_from_response(response, error_message=error_message)
File "/opt/nodepool/lib/python3.8/site-packages/openstack/exceptions.py", line 233, in raise_from_response
raise cls(
openstack.exceptions.HttpException: HttpException: 403: Client Error for url: (...), Maximum number of ports exceeded
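A minimal sketch of the detection this implies; matching on the status code and message is an assumption based on the trace above, not the exact nodepool implementation:

```python
# Recognize a port-quota refusal as a quota error rather than a
# launch failure, so the handler pauses instead of failing the request.
def is_quota_exceeded(status_code, message):
    # The cloud returns HTTP 403 with a message mentioning the
    # exceeded quota (see the trace above); match conservatively.
    return status_code == 403 and 'exceeded' in message.lower()

port_error = 'Maximum number of ports exceeded'
treat_as_quota = is_quota_exceeded(403, port_error)
```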
Change-Id: I31576e53e013666a5e3de6124262beda4da8b1be
The isAlive() spelling has been deprecated since Python 3.8
and was removed in Python 3.9. The is_alive() spelling
of the method has existed since at least Python 2.7.
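The rename is mechanical; the modern spelling behaves identically:

```python
import threading

t = threading.Thread(target=lambda: None)
t.start()
t.join()
# is_alive() has been available since at least Python 2.7;
# the isAlive() alias was removed in Python 3.9.
alive = t.is_alive()
```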
Change-Id: I5d38aad85365dbc24451f032aac79e1b0de05a09
The quota handling for the simple driver (used by GCE) only handles
max-servers, but even so, it still didn't take into consideration
currently building servers. If a number of simultaneous requests
were received, it would try to build them all and eventually return
node failures for the ones that the cloud refused to build.
The OpenStack driver has a lot of nice quota handling methods which
do take currently building nodes into account. This change moves
some of those methods into a new Provider mixin class for quota
support. This class implements some handy methods which perform
the calculations and provides some abstract methods which providers
will need to implement in order to supply information.
The simple driver is updated to use this system, though it still
only supports max-servers for the moment.
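The core of the fix can be sketched as follows; the function and state names are illustrative, not the real nodepool API:

```python
# Nodes that are still building must count against max-servers;
# ignoring them is what let simultaneous requests over-commit the
# provider and produce node failures.
ACTIVE_STATES = ('building', 'ready', 'in-use')

def has_remaining_quota(nodes, max_servers):
    active = [n for n in nodes if n['state'] in ACTIVE_STATES]
    return len(active) < max_servers
```

With max_servers=2 and two nodes already building, a third request waits instead of attempting a launch the cloud would refuse.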
Change-Id: I0ce742452914301552f4af5e92a3e36304a7e291
We need to set pool info on leaked instances so that they are properly
accounted for against quota. When a znode has provider details but not
pool details, we can't count it against quota used, but we also don't
count it as unmanaged quota, so it ends up in limbo.
To fix this we set pool info metadata so that when an instance leaks we
can recover the pool info and set it on the phony instance znode records
used to delete those instances.
Change-Id: Iba51655f7bf86987f9f88bb45059464f9f211ee9
This only happens if the user has enabled console-log on the label,
so log it at info level so it appears in the standard logs.
Change-Id: I74dbc82bbbd310ba788250d864869681603babd2
A previous change added annotated logging to the base NodeLauncher
class. The node ID and request ID no longer need to be explicitly
added to log messages.
Change-Id: I276813b17167fe2dc7fff5581004939ca0e273d0
All implementations of the NodeLauncher have access to the node request
handler instance, but it was not passed through to the base class.
In order to initialize the event log adapter, the NodeLauncher will now
receive the node request handler instead of the ZooKeeper connection.
The ZooKeeper connection is extracted directly from the request handler.
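A sketch of the constructor change (attribute names are assumed):

```python
# The launcher takes the request handler and derives the ZooKeeper
# connection from it, rather than receiving the connection directly.
class NodeLauncher:
    def __init__(self, handler, node):
        self.handler = handler
        self.zk = handler.zk   # extracted from the handler
        self.node = node

handler = type('Handler', (), {'zk': 'zk-conn'})()
launcher = NodeLauncher(handler, 'node-1')
```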
Change-Id: I8ee7335f50b64435c82a5f039989f98750874466
When creating instances with boot-from-volume, we don't get quota-related
information in the exception raised by wait_for_server. The fault
information is also missing in the returned server munch. This causes
node failures when we run into the volume quota. This can be fixed by
explicitly fetching the server, if we got one, and inspecting the fault
information, which contains more information about the fault reason [1].
[1] Example fault reason:
Build of instance 4628f079-26a9-4a1d-aaa0-881ba4c7b9cb aborted:
VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds
allowed gigabytes quota. Requested 500G, quota is 10240G and 10050G
has been consumed.
Change-Id: I6d832d4dbe348646cd4fb49ee7cb5f6a6ad343cf
This adds the ability for a pool.label to override the
pool.host-key-checking value, while having the label exist in the same
pool. This is helpful because it is possible for a single pool to mix
network configurations, and some nodes may be missing a default gateway
(making them unroutable by default).
Change-Id: I934d42b8e48aedb0ebc03137b7305eb2af390fc7
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
This change adds a new python_path Node attribute so that the Zuul
executor can remove the default hard-coded ansible_python_interpreter.
Change-Id: Iddf2cc6b2df579636ec39b091edcfe85a4a4ed10
We have a use case where we have a single pool, due to quota reasons,
but need the ability to selectively choose which network a label will
use. Now a nodepool operator will be able to define which networks are
attached to labels (in our case network appliances).
Change-Id: I3bfa32473c76b9fd59deee7d05b492e7cf67f69d
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
The network_cli connection in Ansible is built atop paramiko,
which uses SSH as the transport. This means hosts using this
connection should also have their host keys collected.
Change-Id: I0383227374bdc1a9b0cecc255eabcb8a8d097b09
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
The TaskManagerStopped exception will no longer be thrown
by openstacksdk, since there is no TaskManager to be stopped.
Stop accounting for it in the code.
Change-Id: Idbdffa9ce945e41044c8eaf01cc582fca2114e6e
When we lose a task manager, we won't be able to create an instance.
Rather than continuing to loop until the retry limit is reached, we
raise an exception early.
In the case below, the retry limit is very high and results in the logs
being spammed with the following:
2019-02-12 16:41:15,628 ERROR nodepool.NodeLauncher-0001616109: Request 200-0000443406: Launch attempt 39047/999999999 failed for node 0001616109:
Traceback (most recent call last):
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/openstack/handler.py", line 241, in launch
self._launchNode()
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/openstack/handler.py", line 142, in _launchNode
instance_properties=self.label.instance_properties)
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/openstack/provider.py", line 340, in createServer
return self._client.create_server(wait=False, **create_args)
File "<decorator-gen-32>", line 2, in create_server
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/openstack/cloud/_utils.py", line 377, in func_wrapper
return func(*args, **kwargs)
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/openstack/cloud/openstackcloud.py", line 7020, in create_server
self.compute.post(endpoint, json=server_json))
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/keystoneauth1/adapter.py", line 357, in post
return self.request(url, 'POST', **kwargs)
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/openstack/_adapter.py", line 154, in request
**kwargs)
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/openstack/task_manager.py", line 219, in submit_function
return self.submit_task(task)
File "/opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/openstack/task_manager.py", line 185, in submit_task
name=self.name))
openstack.exceptions.TaskManagerStopped: TaskManager rdo-cloud-tripleo is no longer running
Change-Id: I5f907d19ec1e637defe90eb944f4e5bd759e8a74
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
When we create a server using the OpenStack provider, in certain
conditions it is helpful to collect the host_id value. As an example,
Zuul will pass this information through into a job inventory file, which
will allow an operator to better profile where jobs run within a cloud.
This can be helpful when trying to debug random jobs timing out within
a cloud.
Change-Id: If29397e67a470462561f24c746375b8291ac43ab
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
When a server creation fails but has an external id, we create a new
znode to offload the deletion of that node. This currently misses the
node type, which triggers an exception during node launch [1]. This
wedges the provider until the node deleter kicks in and deletes that
node successfully. Fix this by storing the node type in this znode.
[1] Exception
Traceback (most recent call last):
File "nodepool/driver/__init__.py", line 639, in run
self._runHandler()
File "nodepool/driver/__init__.py", line 563, in _runHandler
self._waitForNodeSet()
File "nodepool/driver/__init__.py", line 463, in _waitForNodeSet
if not self.hasRemainingQuota(ntype):
File "nodepool/driver/openstack/handler.py", line 314, in hasRemainingQuota
self.manager.estimatedNodepoolQuotaUsed())
File "nodepool/driver/openstack/provider.py", line 164, in estimatedNodepoolQuotaUsed
if node.type[0] not in provider_pool.labels:
IndexError: list index out of range
Change-Id: I67b269069dddb8349959802d7b1ee049a826d0c5
Co-authored-by: Tobias Henkel <tobias.henkel@bmw.de>
If we get an error on create server, we currently leak the instance
because we don't store the external id of the instance in ZK. It
should eventually be deleted since it's a leaked instance, but
we try to keep track of as much as possible. OpenStackSDK can
often return the external id to us in these cases, so handle that
case and store the external id on a ZK record so that the instance
is correctly accounted for.
Change-Id: I7ec448e9a7cf6cd01903bf7b5bf4b07a1c143fb8
We currently don't really have statistics about the resource
utilization of projects or tenants in zuul. This can be solved by
adding resource metadata to the node data structure that can be used
by zuul (which has the knowledge about projects and tenants) to report
resource usage statistics.
The newly introduced optional resources field is a plain dict and could
be extended with further resource information, such as ports or volume
size, that might be useful in the future.
This change is fully backwards compatible and doesn't need a total
shutdown of nodepool to upgrade.
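An illustrative shape for the new field; the keys shown are examples, not a fixed schema:

```python
# A node record carrying the optional resources dict.
node_data = {
    'id': '0000000001',
    'resources': {'cores': 8, 'ram': 8192, 'instances': 1},
}

# Zuul, which knows about projects and tenants, can aggregate these
# per tenant to report resource usage statistics.
total_cores = sum(n['resources']['cores']
                  for n in [node_data]
                  if 'resources' in n)
```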
Change-Id: I670d4bfd7d35ce1109b3ee9b3342fb45ee283a79
This allows us to set parameters for server boot on various images.
This is the equivalent of the "--property" flag when using "openstack
server create". Various tools on the booted servers can then query
the config-drive metadata to get this value.
Needed-By: https://review.openstack.org/604193/
Change-Id: I99c1980f089aa2971ba728b77adfc6f4200e0b77
We are passing the zk connection to the estimatedNodepoolQuotaUsed()
method of the OpenStack provider when we already have it passed into
the provider's start() method, but not saving it there. Let's save
that connection in start() and use it instead.
In a later change to the provider, we will make further use of it.
Change-Id: I013a28d6c46046497d8b04867c51a23f6fa49d39
There was a possibility for an uncaught exception when launching a
node: if the launch fails and the subsequent node cleaning fails
as well, nodepool would break out of the loop prematurely.
Also add request information when logging launch failure to make
debugging easier.
Change-Id: I4a08ebe7b0ce763b6ac4bcd314403b869e487ac1
In some circumstances it is useful to tell the launcher to
ignore the provider quota and just trust the max-* settings
for that pool instead.
This particular need arises when using Rackspace public cloud
for both instances and OnMetal nodes. In this situation the
quota for instances and for OnMetal nodes is different, but
nodepool only queries the quota for instances. When trying to
build OnMetal nodes, the quota check fails - but should not.
In this circumstance, instead of making shade/nodepool more
complicated by figuring out how to calculate disparate quota
types, it makes sense to simply allow nodepool to ignore the
quota for a pool and attempt the build instead.
While this is our use-case, it may also be useful to others
for other reasons.
Change-Id: I232a1ab365795381ab180aceb48e8c87843ac713
The quota calculations in nodepool can never be perfectly accurate
because there can still be races with other launchers in the tenant
or with other workloads sharing the tenant. So we must be able to
gracefully handle the case when the cloud refuses a launch because of
quota. Currently we just invalidate the quota cache and immediately
try again to launch the node. Of course, that will fail again in most
cases. This makes the node request, and thus the Zuul job, fail with
NODE_FAILURE.
Instead, we need to pause the handler just as happens when the quota
calculation indicates we are at quota. We mark the node as aborted,
which is not a failure indicator, and pause the handler so it will
automatically reschedule a new node launch as soon as the quota
calculation allows it.
Change-Id: I122268dc7723901c04a58fa6f5c1688a3fdab227
For pre-existing cloud images (not managed by nodepool), referencing
them by ID was failing since they could not be found with this data,
only by name.
Current code expects the shade get_image() call to accept a dict with
an 'id' key, which will return that same dict without any provider API
calls. This dict can then be used in createServer() to bypass looking
up the image to get the image ID. However, shade does not accept a dict
for this purpose, but an object with an 'id' attribute. This is
possibly a bug in shade to not accept a dict. But since nodepool knows
whether or not it has an ID (image-id) vs. an image name (image-name),
it can bypass shade altogether when image-id is used in the config.
Note: There is currently no image ID validation done before image
creation when an image-id value is supplied. Not even shade validated
the image ID with a passed-in object. Server creation will fail with
an easily identifiable message about this, though.
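A sketch of the bypass; the function name and call shape are illustrative, not shade's actual API:

```python
# When the config supplies image-id, skip the lookup entirely and hand
# the ID straight to server creation; only image-name needs a lookup.
def resolve_image(cloud, image_id=None, image_name=None):
    if image_id:
        return image_id          # no provider API call needed
    return cloud.get_image(image_name)['id']
```

This sidesteps the dict-vs-object mismatch in shade entirely for the image-id case.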
Change-Id: I732026d1a305c71af53917285f4ebb2beaf3341d
Story: 2002013
Task: 19653
This change logs the current pool quota at DEBUG and adds an
INFO log when a pool doesn't have enough quota to satisfy a request.
Change-Id: I78445f21f251d206d4fb0dce817a992685c8b319
We sometimes have need to send notifications to certain driver objects.
A "notification" is a call to an object to indicate some condition has
changed, and the callee may choose to act on that, or totally ignore it.
Currently, the only notification is NodeRequestHandler.nodeReused() but
this adds another notification for the Provider objects in preparation
for static node driver changes.
Since it is entirely probable we will want more in the future, this
change organizes the notifications into proper classes for easy
identification/manipulation. The naming standard will be to end these
notification method names with "Notification".
Change-Id: I905ef42bb434435cbe3cec0f5007981a5b789769
The QuotaInformation class is a fairly generic container for quota
information. This might be useful in future drivers like AWS or
Azure. Further, moving it to utils resolves a cyclic import between the
OpenStack handler and provider.
Change-Id: Ibbece39a0da74ae7b54e3a1088057e683f158a66
In some installations it might be undesirable to rely on the default
security group (due to security concerns). In order to also enable the
possibility of sharing one tenant between Zuul and other resources,
support for specifying security_groups at the driver.openstack.pool
level is added.
Change-Id: I63240049cba295e15f7cfe75b7e7a7d53aa4e37d
Rather than relying on inspection and loading classes from specific
filenames, ask each driver's Provider implementation to supply the
NodeRequestHandler instance for requests.
This is the first in a series of changes to remove all similar
indirection from the drivers in order to make them easier to
understand.
The test driver is moved out of the fixtures subdir to be a
"regular" driver because it now needs relative imports (provider
code needs to import the handler code).
Change-Id: Iad60d8bfe102d43b9fb1a1ff3ede13e4fa753572
Support storing multiple node types/labels for a Node() in the
ZooKeeper data.
Neither the OpenStack nor the static drivers make use of this yet;
they assume there will only be one value in the type/label list.
In the request handler code, we must now store the satisfied types
(rather than rely on the Node.type in self.nodeset) since that can
contain multiple values and we can't tell which value was used to
satisfy the request. The LabelRecorder class helps do this, and also
helps with assigning nodes to a request in the proper order, based
on the order of the labels in the request.
Change-Id: Ice46e22183c4e7354eaa66504ea147aff5ee0df3
Further decouple the driver API from the provider configuration
by simplifying the driver interface (which will eventually lead to
the driver deciding whether or not to support multiple labels) and
removing config references in NodeLauncher.
This significantly simplifies the driver API by:
* Reducing the launching of nodes to a launch() and launchesCompleted()
interface. The launchesCompleted() method acts as a periodic check on
whether the nodeset launches have all completed.
* How the drivers actually launch nodes is decoupled from the driver
API by not having the return of NodeRequestHandler.launch() be either
a thread or None. This leaves it entirely up to the driver to decide
the best way to manage launches... threads, subprocesses, messaging,
etc. E.g., thread handling is moved from the driver API into the OpenStack driver.
* Drivers no longer need to implement the very confusing pollLauncher()
method. This was originally meant to poll the launch threads, but
even drivers not using threads (e.g., the static driver) had to
reimplement this method. That made no sense.
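The reduced interface described above can be sketched as follows; the base class shape is from the text, while the synchronous subclass is a hypothetical illustration:

```python
# The framework only calls launch() and then periodically polls
# launchesCompleted(); how launches actually run (threads, processes,
# messaging) is entirely the driver's choice.
class NodeRequestHandler:
    def launch(self, node):
        raise NotImplementedError

    def launchesCompleted(self):
        raise NotImplementedError


class SynchronousHandler(NodeRequestHandler):
    """Hypothetical driver that launches inline, with no threads."""
    def __init__(self):
        self.launched = []

    def launch(self, node):
        self.launched.append(node)   # launch completes immediately

    def launchesCompleted(self):
        return True   # nothing runs in the background

handler = SynchronousHandler()
handler.launch('node-1')
```

A static-node-style driver can return True unconditionally, while a threaded driver would check its worker threads here.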
The NodeLauncher class (currently used only by the OpenStack driver),
is moved to a driver utility module that other drivers may choose to
use or ignore. And it no longer directly accesses the provider config.
The test_nodelaunchmanager.py tests were originally testing the
NodeLaunchManager class (removed quite some time ago), which managed
launch threads. That functionality was eventually moved to the driver
interface. This change further moves thread management into the
OpenStack driver. As a result, these tests are redundant since the
test_launcher.py tests cover all of this, so remove
test_nodelaunchmanager.py.
Change-Id: I407d999b3608e1614c0dbe49808b2a64756dde58
Force each driver to implement this method. If it does not manage
images, it can simply return True.
Change-Id: Ic20c190bd24a5e17782d8e5a42aa21768288728f
In I6cfec650b862cb4fa0cb391bcc1248549e30c91b _logConsole was renamed,
but not all of the callers were updated.
Change-Id: I5c21517a86e42b5759446d9e78fa82e10530c951
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
This change refactors the NodeLauncher object so that it is part
of the generic interface. A nodepool driver handler now only needs to
return a subclass implementing the launch method.
Moreover, this change adapts the StatsReporter class.
Change-Id: I6cfec650b862cb4fa0cb391bcc1248549e30c91b
This change refactors the OpenStackNodeRequestHandler.run_handler method
so that it is part of the generic interface. A nodepool driver handler now
only needs to implement the launch() method.
Moreover, this change merges the NodeLaunchManager into the RequestHandler.
This change adds the following methods to the handler interface:
* nodeReused(node): notify the handler that a node is re-used
* checkReusableNode(node): verify whether a node can be re-used
* setNodeMetadata(node): store extra node metadata such as az
* pollLauncher()
Change-Id: I5b18b49f50e2b416b5e5c4e28e1a4a6a1df1c654