The node list (web and cli) displays the connection port for the
node, but the k8s drivers use that to send service account
credential info to zuul.
To avoid exposing this to users if operators have chosen to make
the nodepool-launcher webserver accessible, redact the connection
port if it is not an integer.
This also affects the command-line nodepool-list in the same way.
Change-Id: I7a309f95417d47612e40d983b3a2ec6ee4d0183a
This follows the previous change and is intended to have little
or no behavior changes (only a few unit tests are updated to use
different placeholder values). It updates all textual references
of build numbers to build ids to better reflect that they are
UUIDs instead of integers.
Change-Id: I04b5eec732918f5b9b712f8caab2ea4ec90e9a9f
For sites with large numbers of image uploads, performing a web API
request of /image-list can be time-consuming. To improve response
time, maintain a TreeCache of image uploads and use that when listing
images.
The "image-list" CLI is unable to use this cache, and is already as
fast as it can be given the number of ZK requests it needs to issue.
The same is true for "dib-image-list", and this change adds a cache
for that as well.
Finally, since we otherwise would have three or four nearly identical
cache implementations, the TreeCache system is refactored into a base
class that can be used for all the caches. It has some hook points
to deal with the very slight behavior differences.
Change-Id: Ibff0a9016936b461eccb1b48dcf42f5ad8d8434e
With the static driver it is possible to have multiple static nodes
defined that only differ by their username. In order to be able to
distinguish them, include the username in the output of the
"nodepool list --detail" output.
Signed-off-by: Dr. Jens Harbott <harbott@osism.tech>
Change-Id: I702ccbf412731ef3400eb722386629356a244334
This speeds up node listing in the CLI ("nodepool list") and the
webapp ("/node-list") significantly.
The main culprit is that filling in the "locked" field is expensive
because we attempt to lock each node. To make that faster, we
now just query the lock contenders to determine whether it is locked
(if there are contenders, it's locked).
Further, in the webapp, we can use the cache more aggressively. First,
we update the cache listener to watch lock contenders and cache those
values on our Node objects. That means the webapp doesn't even need
to use the optmization above. Further, we can have the webapp use
cached node ids, at which point it doesn't need to make any ZK queries
at all.
With a local setup of 6000 nodes and a localhost ZK connection (real
world times will be much higher due to network delays), this
takes the web server node list from 3 seconds to 0.009 seconds.
The CLI node list improves from 2.1 seconds to 0.8 seconds (excluding
startup time).
Change-Id: Id857556865b6ad75b9ec404bd7ef0c45e2a527bd
This lets operators see the user_data and driver_data in the node
list when passing --detail. That can be useful for identifying
issues with the metastatic driver, or any other debugging which
would benefit from seeing that data.
Change-Id: I2f36ce98a183b7a8e289376f2228b6370900a057
This augments the dib-request list (which shows what images have
manual build requests) with information about whether the image
is paused. The resulting command is renamed to "image-status".
Change-Id: If75a8757b4ec93563e47bfdf0a239a9c21660c45
Image build requests can now be retrieved through the /dib-request-list
endpoint or via the dib-request-list sub-command. The list will show the
age of the request and if it is still pending or if there is already a
build in progress.
Change-Id: If73d6c9fcd5bd94318f389771248604a7f51c449
Now that the component we registered is a "pool" change the call
sites to use "launcher_pools" instead of "launchers". This may
reduce some ambiguity.
(s/launcher/pool/ might still be ambiguous since it may not be clear
whethere we're talking about our own pools or other pools; thus the
choice of "launcher_pool" for the variable name.)
Also, remove a redundant test assertion.
Change-Id: I865883cdb115bf72a3bd034d9290f60666d64b66
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).
This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.
Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
As with the parent it seems that we can end up with empty image build
records as well as empty image upload records. Handle the build case as
well so that we can still list image builds properly when in this state.
This logs the problem as well so that you can debug it further.
Change-Id: Ic5521f5e2d2b65db94c2a5794f474913631a7a1b
This adds a CLI commend to set a flag in ZK for images indicating
that the image should be paused. This can be used to quickly pause
the building and uploading of one or more images globally. This
will effectively be boolean OR'd with the pause value for diskimage
builds in the config file.
In particular, this can be used to pause images for short durations,
either because a fix is imminent, or to allow the system to remain
stable while a configuration change goes through the CI/CD workflow.
Change-Id: I21a573dfc337c51f319afe3695d5446b2c91d70b
Multiple builders running cleanup of uploads in parallel can lead to a
race condition which results in empty image upload nodes in Zookeeper.
The problem is the creation of the image upload number lock inside the
image upload. The lock creation can re-created an already deleted image
upload w/o any data.
Later iterations will fail to parse the empty znode data.
It's not 100% clear how we arrive at this race condition, but we
definitely see multiple builders in the 'guarded' section trying to
delete an upload.
The following exception shows that there is some kind of race:
2020-06-23 11:00:45,225 ERROR nodepool.builder.CleanupWorker.0:
Exception cleaning up build <ImageBuild {'state': 'ready',
'state_time': 1592718469.360345, 'builder':
'nodepool-builder-2', 'builder_id':
'8e2c599b-fbfd-4d68-b8df-2fd2efdaa705', 'formats': 'qcow2,raw',
'username': 'Administrator', 'python_path': 'auto', 'id':
'0000013010', 'stat': ZnodeStat(czxid=129130944630,
mzxid=129131204208, ctime=1592712393474, mtime=1592718469363,
version=1, cversion=1, aversion=0, ephemeralOwner=0,
dataLength=214, numChildren=1, pzxid=129131204230)}> of image
example-image in provider <Provider example-provider>:
Traceback (most recent call last):
File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 499, in _cleanupImage
self._cleanupProvider(provider, image, build.id)
File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 292, in _cleanupProvider
self._deleteUpload(upload)
File "/opt/nodepool/lib/python3.6/site-packages/nodepool/builder.py", line 345, in _deleteUpload
upload.provider_name, upload.id)
File "/opt/nodepool/lib/python3.6/site-packages/nodepool/zk.py", line 1572, in deleteUpload
self.client.delete(path, recursive=True)
File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1341, in delete
return self._delete_recursive(path)
File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1374, in _delete_recursive
self._delete_recursive(child_path)
File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1376, in _delete_recursive
self.delete(path)
File "/opt/nodepool/lib/python3.6/site-packages/kazoo/client.py", line 1343, in delete
return self.delete_async(path, version).get()
File "/opt/nodepool/lib/python3.6/site-packages/kazoo/handlers/utils.py", line 75, in get
raise self._exception
kazoo.exceptions.NotEmptyError
Order of events that might have led to the above exception:
- A locks upload
- A deletes children of upload (including lock of A)
- B locks upload (possible, because lock was deleted by A)
- A tries do delete the parent node, which leads to the above exception
- ...
Related change: https://review.opendev.org/#/c/716566/
Change-Id: I5989c9f5f4c9f1615ae7372250e2e5fdad720256
The event id will now be included in the command output of
'request-list' as well as the HTTP endpoint for outstanding node
requests.
Change-Id: I2475ae879fd0321bac020c84935a5a1d52b8b916
Having python files with exec bit and shebang defined in
/usr/lib/python-*/site-package/ is not fine in a RPM package.
Instead of carrying a patch in nodepool RPM packaging better
to fix this directly upstream.
Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
This is useful for getting the list of all available labels.
Originally implemented in Iafff02d546abb34affa88310f6a97918166cbf47,
this is based on the new info available from
Icfb73fbe3b67321235a78ea7ed9bf4319567eb1a
Co-Authored-By: Tristan Cacqueray <tdecacqu@redhat.com>
Change-Id: I4b43ac0e2ba44516ff289e93dbf553033fc9e130
Depends-On: https://review.openstack.org/548376
node_list takes an argument "detail" which adds a rather arbitrary
list of results to the output. This comes from the command-line,
where we're trying to keep width under a certain length; but doesn't
make as much sense here (especially for json).
For dashboard type applications, replace this with a simple "fields"
parameter which, if set, will only return those fields it sees in the
common text output function.
Note, this purposely doesn't apply to the JSON output, as it expected
client-side filtering is more appropriate there. We could also add
generic field support to the command-line tools, if considered
worthwhile.
Add some documentation on all the end-points, and add info about these
parameters.
Change-Id: Ifbf1019b77368124961e7aa28dae403cabe50de1
All the _list functions return the same thing; save the results and
use a common output function to generate the json or text output.
Change-Id: I9cb44b09de2cb948e7381ef10302b21040433a2c
This reverts commit b737889250.
As discussed in [1] the label list is intended to provide a list of
available labels so users don't have to dig into configuration files
to determine what is available to run jobs on. As written it possibly
misses nodes without a min-ready or in-use nodes, so an alternative
implementation is probably called for. This is also currently returns
values in a plain list, rather than a list of dicts, which prevents
some de-deuplication refactoring.
Thus remove it for now, pending some more work.
[1] http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-February/000039.html
Change-Id: Ic88da24dd5d140697d4af8e6f5ee0936c44f59c5
This patch refactors status functions so that instead of having one
function per output format, the output format is simply a parameter.
New status endpoints are added:
* image-list.json
* request-list(.json)
The endpoint node-list(.json) accepts a node_id parameter for filtering.
Change-Id: I44f386754d343239fc047701cc3cab2dc8b1572a
Extract same logics into one internal function, and call it
when get one node's values or iterate some nodes' value.
Change-Id: Idd536931a40f2369d3bfc4c623e94844c95f78a0
The pep8 rules used in nodepool are somewhat broken. In preparation to
use the pep8 ruleset from zuul we need to fix the findings upfront.
Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
With the upcoming windows support we don't have ssh as the only
connection type. As a preparation for this generalize ssh_port to
connection_port.
Change-Id: Ic1939054f0604411e0122db8dbd7e9886ceaa974
"AZ" is an almost always unused piece of info so move it from non
detailed info to detailed info when listing instances. Public IP addrs
are used all the time though so move from from detailed to non detailed
listings.
This should give better balance to the default usage of the list command
for day to day ops.
Change-Id: I5bb6255ea76b78b8427d1f145d3f8e077f11013d
When running 'nodepool list --detail' the ordering of the headings is
wrong. The 'Hold Job' and 'Launcher' headings need to be swapped.
Change-Id: I195392744d5b596c2d51335af9eba68e0a1ad671
This is set by Zuul when a node is held. The comment
field is also filled in by Zuul, but we already output
that.
Change-Id: I7af5c6c12d94ba9be15daa13dca138ee3785a972
The output of the 'list' command has gotten quite large. Simplify
the default output, and output the complete info when the --detail
option is given.
Also adds/modifies the list_nodes tests to assert that the column
output changes when the new option is used.
Change-Id: I60f849225c8c7fce1c524f132e54da58b25ae752
To prove it works, this also reworks the 'delete' command to use
ZooKeeper.
Summary:
- Re-enables the 'delete' command
- Adds waitForNodeDeletion() for testing.
- Re-enables tests:
- test_node_delete_success
- test_delete
- test_delete_now
- Fixes a bug in Node.__eq__ causing it to fail.
Change-Id: I539bca7d2d3d3b90f8e04e9098065e8b6797b194
Enables the nodepool 'list' command to speak ZooKeeper.
Re-enables the test_node_list test as well. Needed to
individually skip failing tests in test_commands.py.
Adds 'comment' attribute to the Node model since this is
output by the 'list' command.
Update waitForNodes() to use zookeeper syntax.
Change-Id: I61a92470054985c974f3c20d5be358b399925795
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
I have monitoring scripts that push the build status to me. This
became a little harder with the split to multiple builders, as the
logs for each build are kept on the builder.
Rather than parsing tables, add a .json target to the dib-image-list
so my script can find the most recent build, see the builder and
retrieve the latest logs from there.
Future work is to do the image-list and make a nicer status page, if
this approach is OK.
Change-Id: I3410a4e4efd649732747b0502d66d212f50fa1bb
DIB image IDs used to be unique across all images. This is no longer
the case as each image uses sequence numbers within each image build.
DIB files on disk are named with <image-name>-<seq> format for the
base name. Change the dib-image-list output for the ID field to be
the combined image name/ID since that uniquely identfies DIB builds.
Change-Id: I78906272126ac33f504d905f056247cd17af78e6
The methods to get and store builds and uploads will now take and/or
return objects of ImageBuild or ImageUpload types.
ZooKeeper API methods that used to return tuples (to include a build
or upload ID along with the corresponding data) are now simplified to
return a list of objects which will contain the ID.
Since our object model validates state values, we no longer need to
test for empty state values (which are not allowed), so two tests
accounting for that are removed.
Story: 2000767
Change-Id: I4e97f8d0d1e3095a6c3e988a8266805a82a0b520
This updates image-list to use ZK instead of the db.
It also adds the 'external_name' attribute to image uploads in
zookeeper. This displays the image name as it would appear in
a glance image list to the user.
Change-Id: I41b2b80243bec0e5152494e2744106a874cf6804
This updates the nodepool dib-image-list command to use
Zookeeper.
It adds a couple of methods to the ZK facade to assist.
Also, log kazoo at WARNING level in the CLI (because
connect/disconnect are logged at INFO, and are unecessary for a
normal CLI invocation).
Change-Id: I0c7fd3a96e00ae853638f982df5c8e0b036cca7c