Support for concurrency and rate limiting has been added to keystoneauth,
which is the library openstacksdk uses to talk to OpenStack. Instead
of managing concurrency in nodepool using the TaskManager and pool of
worker threads, let keystoneauth take over. This also means we no longer
have a hook into the request process, so we defer statsd reporting to
the openstacksdk layer as well.
Change-Id: If21a10c56f43a121d30aa802f2c89d31df97f121
Task manager stats sending was broken with openstacksdk updates and
fixed in I8617ab2895d1544a6902ae5a3d6a97b87bfd2ec9.
What used to happen is that openstacksdk run_task() would call
post_run_task(); however with changes to use a threadpool this became
an async call so was not reflecting the post-run state but just the
post-insert-into-threadpool-queue state.
The referenced change moved the post_run_task() call into
submit_task(), where it is called after the task has been wait()ed
for (and actually run).
Unfortunatley, since this file overrides submit_task, it means that
post_run_task is not being called any more and we are not producing
timing stats.
I believe we can use the openstacksdk task manager's submit_task
directly. There are two differences; the "raw" argument was removed
from upstream with I7b46e263a76d84573bdfbbece57b1048764ed939 and is no
longer necessary. The ManagerStoppedException is private to this file
and doesn't appear to have an external dependencies; thus the
openstacksdk's TaskManagerStopped exception will work just the same.
Change-Id: I427e6ae9e4beae6d551427fc12a3cde2c1d03aba
We currently have keys like "ComputePostOs-volumes_boot" for providers
using boot-from-volume and other various "os-" keys depending on the
provider. Normalise all these to regular CamelCase. A basic
test-case is added.
Additionally add some documentation on the API call stats, pointing
out they reflect internal details so are subject to change. A release
note is added for the updated stats.
Change-Id: If8398906a5a7ad3d96e985263b0c841e5dcaf7b5
openstacksdk produces task names that look like compute.DELETE.servers,
which are {service_type}.{METHOD}.{url}.{parts} but shade and nodepool
have been reporting to statsd with {Service_type}{Method}{Url}{Parts}.
Translate the task.name produced by sdk into the key nodepool expects to
report as.
Also, emit 'Manager ran task' log lines
The super class emits these, but on the openstacksdk logger. We don't
really want to turn that on at debug level for normal usage, so rather
than calling super(post_run_task) just straight-up override it.
Make sure we log using the transformed name in all cases.
Change-Id: I7f21aefc204366f7621643fdba76b7e70ce4caf5
os-client-config is now just a wrapper around openstacksdk. The shade
code has been imported into openstacksdk. To reduce complexity, just use
openstacksdk directly.
openstacksdk's TaskManager has had to grow some features to deal with
SwiftService. Making nodepool's TaskManager a subclass of openstacksdk's
TaskManager ensures that we get the thread pool set up properly.
Change-Id: I3a01eb18ae31cc3b61509984f3817378db832b47
This is not use in the nodepool codebase anymore. The tasks submitted to
the TaskManager all exist inside of shade/sdk now.
Change-Id: Ie914be9f8687f4634996fb880e5c383dab7de6f0
As a small stepping stone towards consolidating the
shade/sdk/nodepool TaskManager implementations, change TaskManager from
being a thread to having a thread. This should allow us to make
nodepool's TaskManager inherit from sdk's TaskManager without losing our
minds trying to understand how multi-inheritance works with Thread as
one of the base classes.
Change-Id: Ie89eb9e68e5e3f0a7a55cdd555c683631f305b8e
nodepool is python3 only now, so there is no need to continue to use the
six library. Remove its use and shift to just using python3 versions of
things.
Change-Id: Ic1d1b4b736deba1b8325adf0b446e2121eaa6b20
The pep8 rules used in nodepool are somewhat broken. In preparation to
use the pep8 ruleset from zuul we need to fix the findings upfront.
Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
This currently is invalid syntax in python3. Use six so we can support
both python2 and python3.
Change-Id: I7eb664908bdc14551ffac3bb3665cbc5cb84bd84
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
The syntax for imports has changed for python3, lets use the new
syntax.
Change-Id: Ia985424bf23b44e492f51182179d2e476cdcccbb
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
ProviderManager is a TaskManager, and TaskManagers are intended
to serialize API requests to a single cloud from multiple threads.
Currently each worker in the builder has its own set of
ProviderManagers. That means that we are performing cloud API calls
in parallel. That's probably okay since we perform very few of them,
mostly image uploads and deletes. And in fact, we probably want
to avoid blocking on image uploads.
However, there is a thread associated with each of these
ProviderManagers, and even though they are idle, in aggregate they
add up to a significant CPU cost.
This makes the use of a TaskManager by a ProviderManager optional
and sets the builder not to use it in order to avoid spawning these
useless threads.
Change-Id: Iaf6498c34a38c384b85d3ab568c43dab0bcdd3d5
Some history to motivate the change : the original statsd 2.0 era
worked by just doing "import statsd" and it setup a global "statsd"
object based on env variables for you, or set it to None if they
didn't exist.
With statsd 3.0 this changed semantics slightly to provide default
values, so to maintain the status-quo we added the "stats" wrapper
(I4791c9d26f2309f78a556de42af5b9945005aa46) which did the same thing.
However, having a global object set at import time like this isn't the
best idea. For example, when running unit tests, you may want to set
the statsd host to your fake logger, but if it is already setup at
import time your unit test can't override it when it calls the various
classes. It creates other ordering problems as we are splitting up
nodepool into more components.
Thus we move to creating a separate client in each object as it is
instantiated. To maintain the existing behaviour of returning "None"
if the env variables aren't set we keep it in stats.py behind a new
function. All stats callers are modified to get a client in their
__init__()
See also: Ib84655378bdb7c7c3c66bf6187b462b3be2f908d -- similar changes
for zuul
Change-Id: I6d339a8c631f8508a60e9ef890173780157adefd
statsd >= 3.0 changed the way it initializes itself; to start-up from
environment variables you need to import from 'statsd.defaults.env'.
It is also slightly different in that it provides default values; so
we check if the environment variable is set and avoid importing it if
statsd isn't configured.
This moves the statsd object creation into a common module so it can
be shared rather than create multiple clients.
Documentation is also updated to describe how to configure this
Change-Id: I4791c9d26f2309f78a556de42af5b9945005aa46
In situations where the connection between nodepool and the cloud
it manages may be through an evil proxy, the fact that the
requests library in python-novaclient keeps port connections open
means that the evil proxy might timeout the connection during the
period of time that the image is being built.
If the Task object encounters a ProxyError it needs to be propagated
up in the calling thread to the ProviderManager rather than caught
and handled in TaskManager in the main thread. Once caught in the
ProviderManager, create a new client object and retry once. If a second
ProxyError is raised, then it will be passed to the main thread the
same as any other Exception.
Change-Id: I3d69de48be1993c04356d209f622b0a17e6e8d03
Fix for this: Id5186a288db3d03efd01d9328d69061dd746e9cb
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/dist-packages/nodepool/task_manager.py", line 101, in run
statsd.timing(key, dt)
AttributeError: 'NoneType' object has no attribute 'timing'
Change-Id: I23970d20d91b7a95fb1f3b4a8dd8f99dc6df1897
When we stop a provider, it does not accept any new tasks. In that
case, we should raise a specific exception so that loops that retry
for a very long time can exit early since the "error" will never
recover.
Change-Id: I7387849ffdb96c487bae4c31ca7cbdd07b8afd8a
If a task manager was stopped with tasks in queue, they would
not be processed. This could result in deadlocked threads if
one of the long-running threads was waiting on a task while
the main thread shut down a manager.
Change-Id: I3f58c599d472d134984e63b41e9d493be9e9d70b
In case a provider manager is being replaced due to a configuration
change, raise exceptions for any tasks submitted after it is stopped.
This will probably cause many errors, but they should already be
handled and servers will eventually be deleted.
We can minimize the disruption by making the provider managers more
adaptable to changes, but this stopgap measure should at least fix
the current problem we are observing with threads that get stuck
and never complete.
Change-Id: I3d190881ede30480d7c4ae970a0cb2dd07c3e160
Rate limiting is necessary for some clouds, but it can be helpful to
be able to see when it is kicking in. We expect it to kick in when
making API requests from public cloud providers with ratelimits (all
our current ones).
Change-Id: Ieca7b49f1719ac4b4a326ae6d3682f1bca08ad54
The log must been print "nodepool.TaskManager" rather than
"nodepool.ProviderManager".
Change-Id: I71382f9cfad929b8a011310fe9a30b518100a5ff
Closes-bug: #1254734
Based on the one in Zuul, but output thread names. Also, update
Thread constructors to provide useful names.
Change-Id: Ic710a729a1986b7cf67cd7f7812a396a020c4936