49 Commits

Author SHA1 Message Date
James E. Blair 646b7f4927 Add some builder operational stats
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.

Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
2024-02-19 15:47:17 -08:00
James E. Blair 07c83f555d Add ZK cache stats
To observe the performance of the ZK connection and the new tree
caches, add some statsd metrics for each of these.  This will
let us monitor queue size over time.

Also, update the assertReportedStat method to output all received
stats if the expected stat was not found (like Zuul).

Change-Id: Ia7e1e0980fdc34007f80371ee0a77d4478948518
Depends-On: https://review.opendev.org/886552
2023-08-03 10:27:25 -07:00
James E. Blair d4f2c8b9e7 Report leaked resource metrics in statemachine driver
The OpenStack driver reports some leaked metrics.  Extend that in
a generic way to all statemachine drivers.  Doing so also adds
some more metrics to the OpenStack driver.

Change-Id: I97c01b54b576f922b201b28b117d34b5ee1a597d
2023-04-26 06:40:12 -07:00
Simon Westphahl 887fea5706 Correct documentation for image upload metric
The metric for the time spent uploading an image is in milliseconds, not
seconds.

Change-Id: I151bf774ca17bef34ce2d5ac794e2187da9a9b07
2022-12-05 08:35:11 +01:00
mbecker 1658aa9851 Add hold command to disable nodes
This allows nodes to be set in an idle state
so that they will not have jobs scheduled
while e.g. maintenance tasks are performed.
This is probably most useful for static nodes.

Change-Id: Iebc6b909f370fca11fab2be0b8805d4daef33afe
2022-10-13 12:43:34 +02:00
James E. Blair f615ad922f Add nodepool.image_build_requests metric
This reports the number of outstanding manual image build requests.

Change-Id: I365516bdb1fa20a3129099a81825e8506b3af4df
2022-06-21 14:52:53 -07:00
James E. Blair 138b68a5a7 Convert dib-request-list to image-status command
This augments the dib-request list (which shows what images have
manual build requests) with information about whether the image
is paused.  The resulting command is renamed to "image-status".

Change-Id: If75a8757b4ec93563e47bfdf0a239a9c21660c45
2022-06-21 14:12:22 -07:00
Simon Westphahl d6e8bd72df Expose image build requests in web UI and cli
Image build requests can now be retrieved through the /dib-request-list
endpoint or via the dib-request-list sub-command. The list will show the
age of the request and if it is still pending or if there is already a
build in progress.

Change-Id: If73d6c9fcd5bd94318f389771248604a7f51c449
2022-06-21 13:32:35 -07:00
Benjamin Schanzel 74c5c00305 Export current tenant limit stats
This adds a new statsd gauge which, in addition to the existing provider
limits, exports the currently configured tenant limits. This is in the
form ``nodepool.tenant_limits.TENANT.[cores,ram,instances]``.

Change-Id: I8e10a0974210d25d071dbbd63849a921fc8b79a2
2022-01-07 09:27:31 +01:00
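The key layout named in the commit above (``nodepool.tenant_limits.TENANT.[cores,ram,instances]``) can be illustrated with a small sketch. This is not Nodepool's code: the function name, the shape of the ``limits`` dict, and the tenant name in the usage note are assumptions made for illustration. Only the key format comes from the commit message.

```python
def tenant_limit_gauges(tenant, limits):
    """Map one tenant's configured limits to statsd gauge keys.

    `limits` is a hypothetical dict such as {"cores": 200, "ram": 800000}.
    Only the three resources named in the commit message are exported.
    """
    gauges = {}
    for resource in ("cores", "ram", "instances"):
        if resource in limits:
            key = f"nodepool.tenant_limits.{tenant}.{resource}"
            gauges[key] = limits[resource]
    return gauges
```

For a made-up tenant, ``tenant_limit_gauges("example-tenant", {"cores": 200})`` yields a single gauge under ``nodepool.tenant_limits.example-tenant.cores``.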
James E. Blair e95b146d10 Switch docs theme to versioned RTD
To match change I2870450ffd02f55509fcc1297d050b09deafbfb9 in Zuul.

The default domain is changed to zuul which uncovered a reference
error which is fixed.

Change-Id: I71db35252d018feed41d9e87aa702c6daa61902b
2021-12-16 11:23:30 -08:00
James E. Blair 0b1fa1d57d Add commands to export/import image data from ZK
Change-Id: Id1ac6403f4fe80059b90900c519e56bca7dee0a0
2021-08-24 10:28:39 -07:00
Ian Wienand ce00f347a4 Logs stats for nodepool automated cleanup
As a follow-on to I81b57d6f6142e64dd0ebf31531ca6489d6c46583, bring
consistency to the resource leakage cleanup statistics provided by
nodepool.

New stats for cleanup of leaked instances and floating ips are added
and documented.  For consistency, the downPorts stat is renamed to
leaked.ports.

The documentation is re-organised slightly to group common stats
together.  The nodepool.task.<provider>.<task> stat is removed because
it is covered by the section on API stats below.

Change-Id: I9773181a81db245c5d1819fc7621b5182fbe5f59
2020-04-15 14:48:36 +02:00
Tobias Henkel f7f0821e98 Add ready endpoint to webapp
When running nodepool launchers in kubernetes a common method to
update nodepool or its config is doing rolling restarts. The process
for this is to start a new nodepool, wait for it to be ready and then
tear down the old instance. Currently this is not possible without
risking node_failures when there is only one instance serving a
label. The reason for this is that there is no reliable way to
determine when the new instance is fully started which could lead to a
too early tear down of the old instance. This would result in
node_failures for all in-flight node requests that are only valid for
this provider.

Adding a /ready endpoint to the webapp can make this deterministic
using readiness checks of kubernetes.

Change-Id: I53e77f3d8aaa4742ce2a89c1179e8563f850270e
2019-12-21 10:06:55 +00:00
Zuul ba8dd4a354 Merge "Update dib stats" 2019-04-04 18:14:55 +00:00
Monty Taylor 34aae137fa Remove TaskManager and just use keystoneauth
Support for concurrency and rate limiting has been added to keystoneauth,
which is the library openstacksdk uses to talk to OpenStack. Instead
of managing concurrency in nodepool using the TaskManager and pool of
worker threads, let keystoneauth take over. This also means we no longer
have a hook into the request process, so we defer statsd reporting to
the openstacksdk layer as well.

Change-Id: If21a10c56f43a121d30aa802f2c89d31df97f121
2019-04-02 09:36:13 +00:00
David Shrewsbury 6c2c1d3aac Update docs for provider removal.
The docs for provider removal were a tad inaccurate and/or misleading.
This hopefully clarifies the procedure.

Change-Id: I8d4d88c45dc3cea3465e5bf508d83fd940e5fdec
2019-03-21 13:21:06 -04:00
Ian Wienand 6fa73eac26 Update dib stats
This updates dib stats after creating a dashboard to use them.

Firstly, the individual return codes and runtime for each image type
are unnecessary, because they all come from the same invocation of
dib.  While it is definitely useful to track the size of each output
image, the overall status for a build is only a single value.  This
moves these duplicated values to ".status.<rc|duration>".

Unfortunately, there's really no way to say "what was the time of the
last non-null value" in grafana+graphite [1].  This means you can't do
something useful like show a singlestat of the relative time of the
last build "X hours ago" using the timer value.  We can work around
this by putting the timestamp of the last build in a gauge value; this
monotonically increases and is easy to turn into a relative time.

[1] https://github.com/grafana/grafana/issues/10550

Change-Id: Ia9518b6faecb30d45e0509bda4a9b2ab7fdc6261
2019-02-22 13:26:05 +11:00
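The workaround in the commit above stores the time of the last build as a gauge so that a dashboard can derive a relative age from it. The arithmetic is roughly the following sketch. The function name and the assumption that the gauge holds seconds since the Unix epoch are illustrative and not taken from Nodepool.

```python
import time


def hours_since(last_build_timestamp, now=None):
    """Return how many hours ago a build finished.

    `last_build_timestamp` is assumed to be the gauge value, in seconds
    since the Unix epoch. `now` can be passed in for reproducible results.
    """
    if now is None:
        now = time.time()
    # Clamp at zero so small clock skew never reports a negative age.
    return max(0.0, now - last_build_timestamp) / 3600.0
```

Because the gauge only ever moves forward, the latest value is always the one of interest, which avoids the "last non-null value" lookup that the commit message says is awkward with a timer.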
Ian Wienand c68dbb9636 Use a pipeline for dib stats
I noticed in OpenStack production we don't seem to be getting all the
stats from dib, particularly from our very remote builder.  This is
likely because there is some packet loss quickly blasting out small
UDP packets with the stats.  A pipeline bundles the stats together
into the largest size packets it can (this has been a problem before;
see I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43).

Add some additional checks for the size stats which did not seem to be
covered by existing testing.

I also noticed that the documentation had an extra ".builder." in the
key which isn't actually there in the stats output.

Change-Id: Ib744f19385906d1e72231958d11c98f15b72d6bd
2019-02-22 11:36:27 +11:00
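The bundling idea from the commit above can be sketched without any statsd library. The statsd wire format allows several metrics in one UDP datagram, separated by newlines, so a pipeline collects lines and flushes a packet only when the next line would not fit. The function name, the 512-byte default, and the metric names in the usage note are assumptions for illustration. This is not the code Nodepool or the statsd client uses.

```python
def bundle_stats(lines, max_payload=512):
    """Group statsd lines into newline-joined payloads.

    Each payload is at most `max_payload` bytes, except that a single
    line longer than the limit is still emitted as its own payload.
    """
    packets = []
    current = b""
    for line in lines:
        encoded = line.encode("utf-8")
        candidate = encoded if not current else current + b"\n" + encoded
        if current and len(candidate) > max_payload:
            # The next line does not fit: flush what we have and start over.
            packets.append(current)
            current = encoded
        else:
            current = candidate
    if current:
        packets.append(current)
    return packets
```

With three hypothetical 5-byte lines and ``max_payload=11``, the first two share a packet and the third goes out alone, so two datagrams are sent instead of three.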
Ian Wienand 0cf8144e8c Revert "Revert "Cleanup down ports""
This reverts commit 7e1b8a7261.

openstacksdk >=0.19.0 fixes the filtering problems leading to all
ports being deleted. However openstacksdk <0.21.0 has problems with
dogpile.cache so use 0.21.0 as a minimum.

Change-Id: Id642d074cbb645ced5342dda4a1c89987c91a8fc
2019-01-18 15:03:55 +01:00
Tobias Henkel 7e1b8a7261 Revert "Cleanup down ports"
The port filter for DOWN port seems to have no effect. It actually
deleted *all* ports in the tenant.

This reverts commit cdd60504ec.

Change-Id: I48c1430bb768903af467cace1a720e45ecc8e98f
2018-10-30 13:13:43 +01:00
Zuul 5c87fdb046 Merge "Cleanup down ports" 2018-10-30 01:12:31 +00:00
David Shrewsbury cdd60504ec Cleanup down ports
Cleanup will be periodic (every 3 minutes by default, not yet
configurable) and will be logged and reported via statsd.

Change-Id: I81b57d6f6142e64dd0ebf31531ca6489d6c46583
2018-10-29 13:36:43 -04:00
Zuul 059606b9f3 Merge "Normalise more of the API stats calls" 2018-10-24 04:57:31 +00:00
Ian Wienand 8d54917488 Use zuul-sphinx for configuration layout
This moves the configuration documentation to a hierarchical layout
using the attr directives provided by zuul-sphinx.

Apart from making it look like the zuul documentation, this brings
consistency to things like required flags, default values and typing
info.

There are no content changes but things have moved around somewhat to
accommodate the layout.

Depends-On: https://review.openstack.org/604267

Change-Id: I831dfd8c9458a1f255aa05fa96cfc5c416ed3310
2018-10-17 07:24:40 +11:00
Ian Wienand f18e2e8c76 Normalise more of the API stats calls
We currently have keys like "ComputePostOs-volumes_boot" for providers
using boot-from-volume and other various "os-" keys depending on the
provider.  Normalise all these to regular CamelCase.  A basic
test-case is added.

Additionally add some documentation on the API call stats, pointing
out they reflect internal details so are subject to change.  A release
note is added for the updated stats.

Change-Id: If8398906a5a7ad3d96e985263b0c841e5dcaf7b5
2018-09-28 18:49:30 +10:00
Markus Hosch 185f59d97d Add metric for image build result
Add a metric that shows on a per-image basis whether an image build was
successful or not.

Change-Id: I8e97017dd3f91cebef3791168371b29899b83389
2018-09-05 09:39:47 +02:00
Markus Hosch b3ae6e4791 Add list of metrics provided to statsd
This change provides a list of currently available metrics that are
reported by the launcher and the builder.

Change-Id: I51bc38c746cab5374095cc80e77db4534c041119
2018-08-29 08:45:30 +00:00
Ian Wienand 4074b0f0f9 Add label-list webapp endpoint
This is useful for getting the list of all available labels.
Originally implemented in Iafff02d546abb34affa88310f6a97918166cbf47,
this is based on the new info available from
Icfb73fbe3b67321235a78ea7ed9bf4319567eb1a

Co-Authored-By: Tristan Cacqueray <tdecacqu@redhat.com>
Change-Id: I4b43ac0e2ba44516ff289e93dbf553033fc9e130
Depends-On: https://review.openstack.org/548376
2018-03-01 11:14:03 +11:00
Ian Wienand 2dcd79b987 webapp: use content detection for return
Rather than having end-points with ".json", check the accept-header
and return the correct thing based on that.

Change-Id: Ia0e4cb90cdaa113bb1bf7b4636bc10293811f0f6
2018-03-01 11:14:03 +11:00
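Content detection of the kind described in the commit above usually comes down to inspecting the request's Accept header. A deliberately simplified sketch follows. It ignores quality values and wildcards, and the function name is hypothetical rather than taken from the Nodepool webapp.

```python
def wants_json(accept_header):
    """Return True when the Accept header lists application/json.

    Simplification: media-type parameters such as ";q=0.9" are dropped
    and wildcards like "*/*" are not treated as a request for JSON.
    """
    if not accept_header:
        return False
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    return "application/json" in media_types
```

A handler could then serve the same endpoint path in either form, returning JSON when ``wants_json`` is true and the text table otherwise.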
Ian Wienand 89790013f3 Consolidate node_list, add generic filter
node_list takes an argument "detail" which adds a rather arbitrary
list of results to the output.  This comes from the command-line,
where we're trying to keep width under a certain length; but doesn't
make as much sense here (especially for json).

For dashboard type applications, replace this with a simple "fields"
parameter which, if set, will only return those fields it sees in the
common text output function.

Note, this purposely doesn't apply to the JSON output, as it is expected
client-side filtering is more appropriate there.  We could also add
generic field support to the command-line tools, if considered
worthwhile.

Add some documentation on all the end-points, and add info about these
parameters.

Change-Id: Ifbf1019b77368124961e7aa28dae403cabe50de1
2018-03-01 11:14:03 +11:00
David Shrewsbury 2e0e655cd0 Remove the hold command
This makes no sense in the zuulv3 world.

Change-Id: Id939ca174b490482007c32611ef8bbba9db4c7ca
2018-02-01 11:20:01 -05:00
David Shrewsbury aaecb282ee Split out erasing from 'info' command into 'erase'
This moves the --erase option from the 'info' command into
its own command.

Change-Id: I7f75e8efd49644f272102ff65c48d22a878334fe
2018-01-24 16:35:35 -05:00
David Shrewsbury 742b0b1d6b Add provider info command
This command will display all ZooKeeper data for a given provider,
and provide an option to remove all of the data from ZooKeeper.
This can be useful when an operator must permanently remove a
pre-existing provider from nodepool and cannot cleanly shut down
the services otherwise.

Example:

   nodepool info rax
   nodepool info --erase rax

Change-Id: I527aae5ff89aac864f984af050abb83e7bc3ac04
2018-01-19 11:22:30 -05:00
David Shrewsbury a466e560da Remove alien_list command
This command has lost its usefulness in v3. Leaked instances
are automatically cleaned up by the CleanupWorker thread.

Change-Id: I99dced6c655fe865012d0d54f39bfc16b789d1a2
2017-12-04 07:56:58 -05:00
Clark Boylan f26f502fbf Start adding operational docs to zuulv3
This tries to capture common operation tasks in the documentation. It
also clears up some related items about what is necessary to have a
functioning Nodepool installation and what the dib-image-delete command
does.

Story: 2000790
Change-Id: I397fc4879fa84ffc667ddda0aff9c107eee0d694
2017-04-05 11:07:13 -07:00
Paul Belanger fbe932e14f Rename nodepoold to nodepool-launcher
The day has come to rename nodepoold to nodepool-launcher.

Change-Id: Ic04e3cf2dbdaf914bf8f92d073acb972380708f1
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-03-29 09:28:33 -04:00
David Shrewsbury 1b52af2c24 Docs: Remove refs to removed nodepool commands
The job-* commands were removed from the nodepool client a while back.

Change-Id: I77fd6215e4fa53be5d4c0010292ea50fb6179a40
2017-03-27 16:43:50 -04:00
Monty Taylor 066942a0ac Stop json-encoding the nodepool metadata
When we first started putting nodepool metadata into the server record
in OpenStack, we json encoded the data so that we could store a dict
into a field that only takes strings. We were also going to teach the
ansible OpenStack Inventory about this so that it could read the data
out of the groups list. However, ansible was not crazy about accepting
"attempt to json decode values in the metadata" since json-encoded
values are not actually part of the interface OpenStack expects - which
means one of our goals, which is ansible inventory groups based on
nodepool information is no longer really a thing.

We could push harder on that, but we actually don't need the functionality
we're getting from the json encoding. The OpenStack Inventory has
supported comma separated lists of groups since before day one. And the
other nodepool info we're storing stores and fetches just as easily
with 4 different top level keys as it does in a json dict - and is
easier to read and deal with when just looking at server records.
Finally, nova has a 255 byte limit on size of the value that can be
stored, so we cannot grow the information in the nodepool dict
indefinitely anyway.

Migrate the data to store into nodepool_ variables and a comma separated
list for groups. Consume both forms, so that people upgrading will not
lose track of existing stock of nodes.

Finally, we don't use snapshot_id anymore - so remove it.

Change-Id: I2c06dc7c2faa19e27d1fb1d9d6df78da45ffa6dd
2017-03-10 16:24:03 -05:00
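The "consume both forms" step in the commit above can be pictured as a reader that accepts either the old JSON-encoded value or the newer flat keys. In this sketch the legacy key name ``nodepool``, the ``nodepool_`` prefix handling, and the field name in the usage note are assumptions for illustration. The commit message only states that the data moves to ``nodepool_`` variables plus a comma-separated groups list.

```python
import json


def read_nodepool_metadata(meta):
    """Return nodepool fields from server metadata in either layout.

    Legacy layout (assumed): a single "nodepool" key holding a JSON dict.
    Flat layout (assumed): one "nodepool_<field>" key per field.
    """
    if "nodepool" in meta:
        # Old records: decode the JSON-encoded dict.
        return dict(json.loads(meta["nodepool"]))
    prefix = "nodepool_"
    return {
        key[len(prefix):]: value
        for key, value in meta.items()
        if key.startswith(prefix)
    }
```

Both ``{"nodepool": '{"provider_name": "example"}'}`` and ``{"nodepool_provider_name": "example"}`` produce the same dict, which is what lets existing nodes survive an upgrade.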
James E. Blair 83b1b06a9c Fix some doc typos
One of these words was entirely unnecessary.

Change-Id: Id64fc3e5c182ad2525b7e4b3a9fb518fe2554269
2016-12-19 10:29:58 -08:00
James E. Blair bfb225c2e7 Update operation docs
Updated to reflect more current information about nodepool-builder.

Also corrected section hierarchy so everything in operation.rst
is under the Operation heading.

Change-Id: Iade9bf7a46f6bec778e69d24491dfa1652a6674c
2016-12-16 15:59:15 -08:00
David Shrewsbury 6ca60fd15d Remove image-upload command and tests
This is rendered irrelevant now.

Change-Id: Ie6a791ce327bff9a9121b01a8daeea377cb16e08
2016-11-21 11:16:52 -05:00
David Shrewsbury 86160e23fa Remove image-update based tests
The image-update command is no longer needed since the image-build
command will now schedule a DIB image build and the upload to providers
will happen automatically.

Also removed some bad descriptive docstrings from the NodePoolBuilder
class.

Change-Id: Ib8cea6681435985a42c8646558adc7fa10484e72
2016-11-21 11:16:37 -05:00
James E. Blair 6da857c0ae Add auto-hold feature
This adds a new table and series of commands to manipulate it
in which an operator may indicate that nodes which have run failed
instances of specified jobs should automatically be held.

Change-Id: I69b00fbdeed4fba086a54f051bbb51384ea26a70
2016-06-22 13:23:53 -07:00
Clark Boylan 20e9c42b69 Add documentation on removing a provider
There are a series of steps that make deleting a provider easy.
Basically set max-servers to -1, wait for nodes to go away, delete
images, remove config. Document this.

Change-Id: I7b872ef3416c02c1d75e30611c9439805bb8428b
2015-11-19 16:40:48 -08:00
Antoine Musso 0b5f839cea Document SIGINT / SIGUSR2
Similar to Zuul, Nodepool handles two signals that were left
undocumented:

SIGINT (replaced SIGUSR1) is to gracefully stop
SIGUSR2 for stack dump / profiling

Borrow documentation from Zuul:

SIGINT  https://review.openstack.org/21543 (f231fa2a7) by Clark Boylan

SIGUSR2 https://review.openstack.org/42959 (fba9b247b) by Clark Boylan
SIGUSR2 https://review.openstack.org/97708 (d0f06265e) by Antoine Musso
        for the yappi profiling

Change-Id: I2005d8ebdc6444c40dfb29f2b0f7c4655e57caa0
2015-07-28 16:44:05 +02:00
Monty Taylor 00e39427c2 Record interesting info into nova metadata
Sometimes one wants to run a quick command across classes of nodes. This
is made really easy if there are some defining characteristics recorded
in the nova metadata that things like the ansible inventory can pick up
on. Add information to the meta parameter to record that information
in nova.

Change-Id: I3e24f5aa004c5bb8de7ffb757035d64804547f1d
2015-06-10 14:54:37 -07:00
Ian Wienand e47e400a4c Ignore stderr for documentation program output
novaclient is helpfully giving a deprecation warning at the moment
which totally messes up the command output in the documentation.  Just
ignore this extraneous output by only capturing stdout.

Change-Id: Ie126deb555fff52385bfb11d82f510cc9431b0a4
2015-03-20 11:30:10 +11:00
Yolanda Robla ad97fb91a2 Update documentation for using diskimage-builder
Recent change in nodepool to build images using
diskimage-builder involved adding new diskimages section
in the nodepool.yaml file, and added new commands like
dib-image-list, dib-image-delete, image-build and image-upload.

Change-Id: If36f3f6e39e382cb4c6398d3a063c979888a8642
2014-08-27 11:51:08 +02:00
James E. Blair faef2431a7 Finish initial docs
Finish the initial sections defined in the documentation index.
Add sphinxcontrib-programoutput to document command line utils.
Add py27 to the list of default tox targets.

Change-Id: I254534032e0706e410647b023249fe3af4f3a35f
2014-03-31 09:21:56 -07:00