Commit Graph

1040 Commits

Author SHA1 Message Date
James E. Blair 64452f1a26 Demote launch/delete timeouts to warnings
If we hit the internal timeout while launching or deleting a server,
we raise an exception and then log the traceback.  This is a not-
unexpected occurrence, and the traceback is not useful since it's
just one stack frame within the same class, so instead, let's log
these timeouts at warning level without the traceback.
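
A minimal sketch of the pattern (the logger name and helper here are
illustrative, not the actual Nodepool code):

    import logging

    log = logging.getLogger("nodepool.launcher")

    def launch(server_id, wait_for_server):
        try:
            # wait_for_server raises TimeoutError on the internal timeout
            wait_for_server(server_id)
        except TimeoutError:
            # Expected from time to time; the traceback is just one
            # frame in the same class, so skip log.exception().
            log.warning("Timeout waiting for server %s", server_id)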

Change-Id: Id4806d8ea2d0a232504e5a75d69cec239bcac670
2024-04-17 14:46:36 -07:00
James E. Blair 6a71686b6b Fix deletion of intermediate image builds
The delete-after-upload option would sometimes delete before
upload if there are intermediate build files which match its
criteria.  To ensure that we don't delete any build files until
the build is at least complete, only run that method if the ZK
build state is marked as ready.
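
A sketch of the guard, assuming hypothetical names for the build
object and its ZK state constant:

    READY = "ready"  # stand-in for the real ZK build state constant

    def delete_after_upload(build, local_files):
        # Never touch build files until the build itself is complete;
        # intermediate files may still be needed for the upload.
        if build.state != READY:
            return
        for path in local_files(build):  # pathlib.Path objects
            path.unlink()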

Change-Id: Ia263b478b0a2b9d77833bc19d0967a65dcbf5b27
2024-04-02 15:26:41 -07:00
James E. Blair d6b13dd47f Fix inheritance of delete-after-upload option
This diskimage option was not correctly inherited when using
a parent diskimage.
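
A sketch of the intended lookup order (function and key names are
illustrative):

    def resolve_option(image_config, parent_config, key, default=False):
        # Fall back to the parent diskimage's value before the default.
        if key in image_config:
            return image_config[key]
        return parent_config.get(key, default)

    # e.g. resolve_option(child, parent, "delete-after-upload")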

Change-Id: If48c218a0b8b9432e54a55bd947b04016db78479
2024-03-28 09:19:15 -07:00
Zuul 75ebae3162 Merge "gce: clear machine type cache on bad data" 2024-03-25 19:35:17 +00:00
Zuul 28c6c1d4f7 Merge "Continuously ensure the component registry is up to date" 2024-03-25 18:17:01 +00:00
Zuul 22cd51fdf6 Merge "Fix leaked upload cleanup" 2024-03-20 13:04:01 +00:00
Zuul b1a40f1fd3 Merge "Add delete-after-upload option" 2024-03-18 21:06:46 +00:00
Benjamin Schanzel 1199ac77e5 Fix nodepool builder stats test
In the metric name, we use the builder's fqdn as a key, but in the test
we used the hostname, so this test fails on systems where the two
are not the same.
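
The distinction in question, using the standard library (the metric
key shape shown is a hypothetical illustration):

    import socket

    fqdn = socket.getfqdn()          # e.g. "builder1.example.com"
    hostname = socket.gethostname()  # e.g. "builder1"
    # The builder reports under the fqdn, so the test must too:
    key = f"nodepool.builds.{fqdn}"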

Change-Id: If286f19371d1fd70dc9bee4b7af814d13396357b
2024-03-18 15:33:03 +01:00
James E. Blair fc95724601 Fix leaked upload cleanup
The cleanup routine for leaked image uploads based its detection
on upload ids, but they are not unique except in the context of
a provider and build.  This meant that, for example, as long as
there was an upload with id 0000000001 for any image build for
the provider (very likely!) we would skip cleaning up any leaked
uploads with id 0000000001.

Correct this by using a key generated from build+upload (provider
is implied because we only consider uploads for our current
provider).

Update the tests relevant to this code to exercise this condition.
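
A sketch of the composite key (names are illustrative):

    def upload_record_key(build_id, upload_id):
        # Upload ids repeat across builds (every build has an upload
        # "0000000001"), so qualify them with the build id.  The
        # provider is implied since we only examine our own
        # provider's uploads.
        return f"{build_id}-{upload_id}"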

Change-Id: Ic68932b735d7439ca39e2fbfbe1f73c7942152d6
2024-03-09 13:46:17 -08:00
James E. Blair fd454706ca Add delete-after-upload option
This allows operators to delete large diskimage files after uploads
are complete, in order to save space.

A setting is also provided to keep certain formats, so that if
operators would like to delete large formats such as "raw" while
retaining a qcow2 copy (which, in an emergency, could be used to
inspect the image, or manually converted and uploaded for use),
that is possible.
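
A sketch of the pruning decision; the mapping shape and option names
here are assumptions, not the exact Nodepool config keys:

    import os

    def prune_after_upload(files_by_format, keep_formats):
        # files_by_format: e.g. {"raw": "/imgs/img.raw",
        #                        "qcow2": "/imgs/img.qcow2"}
        # keep_formats: e.g. {"qcow2"} to retain an inspectable copy.
        for fmt, path in files_by_format.items():
            if fmt not in keep_formats:
                os.remove(path)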

Change-Id: I97ca3422044174f956d6c5c3c35c2dbba9b4cadf
2024-03-09 06:51:56 -08:00
Zuul f4941d4f03 Merge "Add some builder operational stats" 2024-03-07 20:36:14 +00:00
James E. Blair 963cb3f5b4 gce: clear machine type cache on bad data
We have observed GCE returning bad machine type data which we
then cache.  If that happens, clear the cache to avoid getting
stuck with the bad data.
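
A sketch of the cache-invalidation idea (class and field names are
illustrative):

    class MachineTypeCache:
        def __init__(self, fetch):
            self._fetch = fetch  # callable that hits the GCE API
            self._types = {}

        def get(self, name):
            if name not in self._types:
                self._types[name] = self._fetch(name)
            mtype = self._types[name]
            if not mtype.get("guestCpus"):   # bad data from the API
                self._types.clear()          # refetch everything later
                raise RuntimeError(f"Bad machine type data for {name}")
            return mtype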

Change-Id: I32fac2a92d4f9d400fe2db41fffd8d189d097542
2024-03-05 16:05:18 -08:00
Zuul eef07d21f3 Merge "Metastatic: Copy cloud attribute from backing node" 2024-03-05 16:51:00 +00:00
Zuul dfd2877c81 Merge "Use separate log package for NodescanRequest messages" 2024-03-05 15:32:18 +00:00
James E. Blair 619dee016c Continuously ensure the component registry is up to date
On startup, the launcher waits up to 5 seconds until it has seen
its own registry entry because it uses the registry to decide if
other components are able to handle a request and, if not, to fail
the request.

In the case of a ZK disconnection, we will lose all information
about registered components as well as the tree caches.  Upon
reconnection, we will repopulate the tree caches and re-register
our component.

If the tree cache repopulation happens first, our component
registration may be in line behind several thousand ZK events.  It
may take more than 5 seconds to repopulate and it would be better
for the launcher to wait until the component registry is up to date
before it resumes processing.

To fix this, instead of only waiting on the initial registration,
we check each time through the launcher's main loop that the registry
is up-to-date before we start processing.  This also covers
disconnections, because we expect the main loop to abort with an
error and restart in those cases.

This operates only on local cached data, so it doesn't generate any
extra ZK traffic.
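
A sketch of the loop-level check, assuming a registry object with an
is_up_to_date() style method (hypothetical name):

    import time

    def main_loop(registry, process_requests, running):
        while running():
            # Purely a local check against cached registry data; no
            # ZK round-trip.  After a reconnect this pauses request
            # processing until our own registration is visible again.
            if not registry.is_up_to_date():
                time.sleep(1)
                continue
            process_requests()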

Change-Id: I1949ec56610fe810d9e088b00666053f2cc37a9a
2024-03-04 14:28:11 -08:00
Zuul 392cf017c3 Merge "Add support for AWS IMDSv2" 2024-02-28 02:46:53 +00:00
Benjamin Schanzel 4c4e8aefdb Metastatic: Copy cloud attribute from backing node
As with several other metadata attributes, copy the `cloud` attribute from
the backing node to the metastatic node.

Change-Id: Id83b3e09147baaab8a85ace4d5beba77d1eb87bd
2024-02-23 14:20:09 +01:00
James E. Blair 646b7f4927 Add some builder operational stats
This adds some stats keys that may be useful when monitoring
the operation of individual nodepool builders.

Change-Id: Iffdeccd39b3a157a997cf37062064100c17b1cb3
2024-02-19 15:47:17 -08:00
James E. Blair 21f1b88b75 Add host-key-checking to metastatic driver
If a long-running backing node used by the metastatic driver develops
problems, performing a host-key-check each time we allocate a new
metastatic node may detect these problems.  If that happens, mark
the backing node as failed so that no more nodes are allocated to
it and it is eventually removed.
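
A sketch of the allocation-time check (signatures are illustrative):

    def allocate_metastatic_node(backing_node, keyscan):
        try:
            keyscan(backing_node.interface_ip)  # raises on failure
        except Exception:
            # The backing node is unhealthy: stop allocating onto it
            # and let the normal cleanup path retire it.
            backing_node.failed = True
            raise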

Change-Id: Ib1763cf8c6e694a4957cb158b3b6afa53d20e606
2024-02-13 14:12:52 -08:00
James E. Blair f89b41f6ad Reconcile docs/validation for some options
Some drivers were missing docs and/or validation for options that
they actually support.  This change:

adds launch-timeout to:
  metastatic docs and validation
  aws validation
  gce docs and validation
adds post-upload-hook to:
  aws validation
adds boot-timeout to:
  metastatic docs and validation
adds launch-retries to:
  metastatic docs and validation
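
Nodepool validates its configuration with voluptuous; a schema
fragment of the sort these options plug into might look like this
(the surrounding schema is illustrative):

    import voluptuous as v

    pool_schema = v.Schema({
        "launch-timeout": int,
        "boot-timeout": int,
        "launch-retries": int,
        "post-upload-hook": str,
    }, extra=v.ALLOW_EXTRA)

    pool_schema({"launch-timeout": 300})  # passes validation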

Change-Id: Id3f4bb687c1b2c39a1feb926a50c46b23ae9df9a
2024-02-08 09:36:35 -08:00
James E. Blair c78fe769f2 Allow custom k8s pod specs
This change adds the ability to use the k8s (and friends) drivers
to create pods with custom specs.  This will allow nodepool admins
to define labels that create pods with options not otherwise supported
by Nodepool, as well as pods with multiple containers.

This can be used to implement the versatile sidecar pattern, which is
useful for running jobs that need a long-running auxiliary process
(such as a database server or container runtime) in a system where
backgrounding such a process is otherwise difficult.

It is still the case that a single resource is returned to Zuul, so
a single pod will be added to the inventory.  The documentation
therefore records the expectation that it should be possible to shell
into the first container in the pod.
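
Such a custom spec ultimately becomes an ordinary pod-creation call;
a rough equivalent with the kubernetes Python client (names, images,
and the namespace are assumptions):

    from kubernetes import client, config

    config.load_kube_config()
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="custom-node"),
        spec=client.V1PodSpec(containers=[
            client.V1Container(name="main", image="ubuntu:jammy",
                               command=["sleep", "infinity"]),
            # A second container: the sidecar pattern described above.
            client.V1Container(name="db", image="mariadb:11"),
        ]),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="zuul", body=pod)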

Change-Id: I4a24a953a61239a8a52c9e7a2b68a7ec779f7a3d
2024-01-30 15:59:34 -08:00
Clark Boylan e5c1790be7 Rollback to 1.28/stable microk8s in functional testing
We use latest/stable by default which very recently updated to
1.29/stable. Unfortunately it appears there are issues [0] with this
version on Debian Bookworm which also happens to be the platform we test
on. Our jobs have been consistently failing in a manner that appears
related to this issue. Update the job to collect logs so that we can
better confirm this is the case, and roll back to 1.28, which should
work.

Also update the AWS tests to handle a recent moto release which
requires us to use mock_aws rather than individual mock_* classes.

[0] https://github.com/canonical/microk8s/issues/4361
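
The moto change referred to above, in miniature (moto 5 collapses the
per-service decorators into a single one):

    import boto3
    from moto import mock_aws  # moto >= 5; was: mock_ec2, mock_s3, ...

    @mock_aws
    def test_no_instances():
        ec2 = boto3.client("ec2", region_name="us-east-1")
        assert ec2.describe_instances()["Reservations"] == []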

Change-Id: I72310521bdabfc3e34a9f2e87ff80f6d7c27c180
Co-Authored-By: James E. Blair <jim@acmegating.com>
Co-Authored-By: Jeremy Stanley <fungi@yuggoth.org>
2024-01-29 14:15:54 -08:00
James E. Blair 3f4fb008b0 Add support for AWS IMDSv2
This is an authenticated HTTP metadata service.  It is typically
available by default, but a more secure setup enforces its use.

This change adds the ability to do that for both instances and
AMIs.
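
With boto3, enforcing IMDSv2 on an instance looks roughly like this
(the image id and instance type are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-12345678",
        InstanceType="t3.micro",
        MinCount=1, MaxCount=1,
        # "required" rejects the unauthenticated (IMDSv1) code path.
        MetadataOptions={"HttpTokens": "required",
                         "HttpEndpoint": "enabled"},
    )
    # For AMIs, register_image() similarly accepts ImdsSupport="v2.0".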

Change-Id: Ia8554ff0baec260289da0574b92932b37ffe5f04
2024-01-24 15:11:35 -08:00
Zuul 42f9100d82 Merge "Fix gpu parameter type in openshiftpods" 2023-12-12 20:36:18 +00:00
James E. Blair 9c496e2939 Redact k8s connection info in node list
The node list (web and cli) displays the connection port for the
node, but the k8s drivers use that to send service account
credential info to zuul.

To avoid exposing this to users if operators have chosen to make
the nodepool-launcher webserver accessible, redact the connection
port if it is not an integer.

This also affects the command-line nodepool-list in the same way.
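
A sketch of the redaction rule:

    def display_port(connection_port):
        # The k8s drivers carry service-account credentials in this
        # field, so anything non-numeric is hidden from listings.
        try:
            return int(connection_port)
        except (TypeError, ValueError):
            return "redacted"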

Change-Id: I7a309f95417d47612e40d983b3a2ec6ee4d0183a
2023-12-01 10:30:03 -08:00
James E. Blair 5849c3b9a7 Fix gpu parameter type in openshiftpods
In config validation, the gpu parameter type was specified as str
rather than float.  This is corrected.

This was not discovered in testing because the tests which use the
gpu parameter in the other k8s drivers were not present in the
openshiftpods driver.  This change also adds the missing tests for
the default resource and resource limits feature, which exercise the
gpu limits.

Change-Id: Ife932acaeb5a90ebc94ad36c3b4615a4469f0c40
2023-12-01 08:06:26 -08:00
James E. Blair cb8366d70c Use backing node attributes as metastatic default
To support the use case where one has multiple pools providing
metastatic backing nodes, and those pools are in different regions,
and a user wishes to use Zuul executor zones to communicate with
whatever metastatic nodes are eventually produced from those regions,
this change updates the launcher and metastatic driver to use
the node attributes (where zuul executor region names are specified)
as default values for metastatic node attributes.  This lets users
configure nodepool with zuul executor zones only on the backing pools.
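
A sketch of the merge order (plain dicts standing in for the real
config objects):

    def effective_attributes(backing_attrs, metastatic_attrs):
        # Backing-pool attributes (e.g. the executor zone) act as
        # defaults; attributes set on the metastatic pool still win.
        return {**(backing_attrs or {}), **(metastatic_attrs or {})}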

Change-Id: Ie6bdad190f8f0d61dab0fec37642d7a078ab52b3
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:24 -08:00
James E. Blair 7a1c75f918 Fix metastatic missing pool config
The metastatic driver was ignoring the 3 standard pool configuration
options (max-servers, priority, and node-attributes) due to a missing
superclass method call.  Correct that and update tests to validate.

Further, the node-attributes option was undocumented for the metastatic
driver, so add it to the docs.
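
The shape of the fix, with a stand-in base class (the real Nodepool
ConfigPool differs):

    class ConfigPool:  # stand-in for nodepool's base pool config
        def load(self, pool_config):
            self.max_servers = pool_config.get("max-servers")
            self.priority = pool_config.get("priority")
            self.node_attributes = pool_config.get("node-attributes")

    class MetastaticPool(ConfigPool):
        def load(self, pool_config):
            super().load(pool_config)  # the call that was missing
            self.max_age = pool_config.get("max-age")  # illustrative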

Change-Id: I6a65ea5b8ddb319bc131f87e0793f3626379e15f
Co-Authored-By: Benedikt Loeffler <benedikt.loeffler@bmw.de>
2023-11-27 10:34:19 -08:00
Benjamin Schanzel 65a81ad7b5 Use separate log package for NodescanRequest messages
The state transition log messages for the Nodescan statemachine can be
quite excessive. While they might be useful for debugging, it's not
always needed to have all the log messages available.
To provide an easier way to filter these messages, use a dedicated log
package in the NodescanRequest class.
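
A sketch of the technique; the logger name is an assumption:

    import logging

    # Chatty state-transition messages go to their own logger...
    log = logging.getLogger("nodepool.nodescan")

    # ...so operators can silence them without losing other debug logs:
    logging.getLogger("nodepool.nodescan").setLevel(logging.WARNING)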

Change-Id: I2b1a625f5e5e375317951e410a27ff4243d4a0ef
2023-11-20 14:46:36 +01:00
Simon Westphahl 2aeaee92f1 Ignore unrelated error labels in request handler
Nodepool was declining node requests when other unrelated instance types
of a provider were unavailable:

    Declining node request <NodeRequest {... 'node_types': ['ubuntu'],
    ... }> due to ['node type(s) [ubuntu-invalid] not available']

To fix this, we now check the error labels against the requested labels
before including them in the list of invalid node types.
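
A sketch of the filter:

    def invalid_node_types(requested_labels, error_labels):
        # Only labels the request actually asks for can invalidate it;
        # unrelated unavailable labels are ignored.
        return [l for l in error_labels if l in requested_labels]

    invalid_node_types(["ubuntu"], ["ubuntu-invalid"])  # -> []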

Change-Id: I7bbb3b813ca82baf80821a9e84cc10385ea95a01
2023-11-09 13:58:54 +01:00
James E. Blair 8669acfe6b Use a state machine for ssh key scanning
We currently use a threadpool executor to scan up to 10 nodes at
a time for ssh keys.  If they are slow to respond, that can create
a bottleneck.  To alleviate this, use a state machine model for
managing the process, and drive each state machine from a single
thread.

We use select.epoll() to handle the potentially large number of
connections that could be happening simultaneously.

Note: the paramiko/ssh portion of this process spawns its own
thread in the background (and always has).  Since we are now allowing
more keyscan processes in parallel, we could end up with an
unbounded set of paramiko threads in the background.  If this is
a concern we may need to cap the number of requests handled
simultaneously.  Even if we do that, this will still result in
far fewer threads than simply increasing the cap on the threadpool
executor.
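
A sketch of the single-threaded driver loop; the state-machine class
and its advance() method are hypothetical:

    import select
    import socket

    class NodescanStateMachine:  # hypothetical
        def __init__(self, host):
            self.host = host

        def advance(self, sock, events):
            pass  # connect -> read banner -> collect host keys

    epoll = select.epoll()
    machines = {}  # fd -> (state machine, socket)

    def start_scan(host, port=22):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        sock.connect_ex((host, port))  # returns EINPROGRESS immediately
        epoll.register(sock.fileno(), select.EPOLLIN | select.EPOLLOUT)
        machines[sock.fileno()] = (NodescanStateMachine(host), sock)

    def run_once(timeout=1.0):
        # One thread drives every in-flight scan.
        for fd, events in epoll.poll(timeout):
            machine, sock = machines[fd]
            machine.advance(sock, events)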

Change-Id: I42b76f4c923fd9441fb705e7bffd6bc9ea7240b1
2023-11-04 08:54:20 -07:00
James E. Blair 7d7d81cd46 AWS: improve service quota handling
The AWS API call to get the service quota has its own rate limit
that is separate from EC2.  It is not documented, but the defaults
appear to be very small; experimentally it appears to be something
like a bucket size of 30 tokens and a refill rate somewhere
between 3 and 10 tokens per minute.

This change moves the quota lookup calls to their own rate limiter
so they are accounted for separately from other calls.

We should configure that rate limiter with the new, very low values;
however, that would significantly slow startup, since we need to issue
several calls at once when we start; after that we are not sensitive
to a delay.  The API can handle a burst at startup (with a bucket
size of 30) but our rate limiter doesn't have a burst option.  Instead
of configuring it properly, we will just configure it with the rate
limit we use for normal operations (so that we at least have some
delay), but otherwise, rely on caching so that we know that we won't
actually exceed the rate limit.

This change therefore also adds a Lazy Executor TTL cache to the
operations with a timeout of 5 minutes.  This means that we will issue
bursts of requests every 5 minutes, and as long as the number of
requests is less than the token replacement rate, we'll be fine.

Because this cache is on the adapter, multiple pool workers will use
the same cache.  This will cause a reduction in API calls since
currently there is only pool-worker level caching of nodepool quota
information objects.  When the 5 minute cache on the nodepool quota
info object expires, we will now hit the adapter cache (with its own
5 minute timeout) rather than go directly to the API repeatedly for
each pool worker.  This does mean that quota changes may take between
5 and 10 minutes to appear in nodepool.

The current code only looks up quota information for instance and
volume types actually used.  If that number is low, all is well, but
if it is high, then we could potentially approach or exceed the token
replacement rate.  To make this more predictable, we will switch the
API call to list all quotas instead of fetching only the ones we need.
Due to pagination, this results in a total of 8 API calls as of writing;
5 for ec2 quotas and 3 for ebs.  These are likely to grow over time,
but very slowly.

Taken all together, these changes mean that a single launcher should
issue at most 8 quota service api requests every 5 minutes, which is
below the lowest observed token replacement rate.
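
A sketch of the cache half of this arrangement (the real lazy
executor TTL cache differs):

    import time

    class TTLCache:
        def __init__(self, ttl=300):  # 5 minutes, as described above
            self.ttl = ttl
            self._data = {}

        def get(self, key, fetch):
            value, stamp = self._data.get(key, (None, 0.0))
            if time.monotonic() - stamp > self.ttl:
                value = fetch()  # one burst of quota calls per TTL
                self._data[key] = (value, time.monotonic())
            return value

    # Quota lookups then run through their own rate limiter, separate
    # from the EC2 limiter:
    #     with quota_rate_limiter:
    #         quotas = quota_cache.get("ec2", list_all_ec2_quotas)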

Change-Id: Idb3fb114f5b8cda8a7b6d5edc9c011cb7261be9f
2023-10-17 14:36:37 -07:00
Simon Westphahl 3c71fc9f4b Use thread pool executor for AWS API requests
So far we've cached most of the AWS API listings (instances, volumes,
AMIs, snapshots, objects) but with refreshes happening synchronously.

Since some of those methods are used as part of other methods during
request handling, we make them asynchronous.
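
A sketch of the idea: keep serving the last listing while a worker
thread refreshes it (class and method names are illustrative):

    import concurrent.futures

    class CachedListing:
        def __init__(self, fetch):
            self._fetch = fetch  # e.g. a describe-instances call
            self._value = []     # last completed listing
            self._pool = concurrent.futures.ThreadPoolExecutor(
                max_workers=1)
            self._future = None

        def get(self):
            # Harvest a finished refresh, then kick off the next one;
            # callers never block on the API.
            if self._future is not None and self._future.done():
                self._value = self._future.result()
                self._future = None
            if self._future is None:
                self._future = self._pool.submit(self._fetch)
            return self._value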

Change-Id: I22403699ebb39f3e4dcce778efaeb09328acd932
2023-10-17 14:36:37 -07:00
James E. Blair 40eb0fd8d4 Use ec2_client for instances and volumes
There are two modes of use for the boto3 interface library for AWS:
resource or client calls.  The resource approach returns magic objects
representing instances of each type of resource which have their own
methods and can lazy-load relationships to other objects.  The client
approach is more low-level and returns dictionary representations of
the HTTP results.

In general in the AWS driver, we try to use resources when we can
and fall back to client methods otherwise.

When we construct our own AwsInstance objects (which translates what we
get from AWS to what nodepool expects), we filled the "az" field by
getting the availability zone from the subnet.  The subnet is an object
relationship that is lazy-loaded, which means that each time we
retrieved an instance record, we issued another API call to look up
the subnet (and this information is not cached).

To correct this, we switch to getting the subnet from the placement
field of the instance (which is more sensible anyway).

To prevent future accidental lazy-loads of un-cached data, we switch
instance and volume operations from the resource style of usage to
the client.

The result is a massive increase in performance (as the subnet lazy-load
could take 0.1 seconds for each instance running, and we very frequently
list all instances).
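
The client-style call that avoids the lazy subnet load (the response
keys are the real EC2 shapes; error handling omitted):

    import boto3

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                # The AZ is right on the instance record; no extra
                # per-instance subnet API call is needed.
                az = instance["Placement"]["AvailabilityZone"]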

Change-Id: I529318896fc8096bbd9dbdac60d1a29c3ac641b6
2023-10-17 14:36:37 -07:00
Zuul 76fb25d529 Merge "Handle invalid AWS instance in quota lookup" 2023-09-25 21:38:34 +00:00
Zuul 2813a7df1f Merge "Kubernetes/OpenShift drivers: allow setting dynamic k8s labels" 2023-09-25 07:43:11 +00:00
Benjamin Schanzel 4660bb9aa7 Kubernetes/OpenShift drivers: allow setting dynamic k8s labels
Just as for the OpenStack/AWS/Azure drivers, allow configuring
dynamic metadata (labels) for kubernetes resources with information
about the corresponding node request.
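
A sketch of the templating step (the key names and request fields
are assumptions):

    def render_labels(label_templates, request):
        data = {"request_id": request["id"], "tenant": request["tenant"]}
        return {k: v.format(**data) for k, v in label_templates.items()}

    render_labels({"zuul-request": "{request_id}"},
                  {"id": "200-0001", "tenant": "example"})
    # -> {"zuul-request": "200-0001"}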

Change-Id: I5d174edc6b7a49c2ab579a9a0b1b560389d6de82
2023-09-11 10:49:27 +02:00
James E. Blair b3921d5c40 AWS: handle 'InvalidConversionTaskId.Malformed' error
When we call the describe_import_image_tasks paginator with a
task id specified and that task id does not exist, AWS can return
a InvalidConversionTaskId.Malformed error instead of the expected
empty list.  Handle that case and add a test for it.
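
The handling in miniature (botocore surfaces this as a ClientError
carrying that code):

    import botocore.exceptions

    def list_import_tasks(ec2, task_id):
        paginator = ec2.get_paginator("describe_import_image_tasks")
        try:
            return [task
                    for page in paginator.paginate(
                        ImportTaskIds=[task_id])
                    for task in page["ImportImageTasks"]]
        except botocore.exceptions.ClientError as e:
            if (e.response["Error"]["Code"]
                    == "InvalidConversionTaskId.Malformed"):
                return []  # treat it like the expected empty list
            raise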

Change-Id: I1a3a9297e123a50a4beeb7e18124b8cc17aa28bb
2023-09-05 16:09:26 -07:00
Zuul 785f7dcbc9 Merge "AWS: Add support for retrying image imports" 2023-08-28 18:43:56 +00:00
Joshua Watt 32b3158dd2 tests: Add test for console logging on keyscan error
Adds a test to verify that the Adapter API to fetch the console logs
when a keyscan error occurs is called correctly.

Change-Id: I599b97f6829406ddfc7f1355b85c73fbe57533e7
2023-08-25 08:08:07 -06:00
Zuul 77e2550326 Merge "Test if username is set for diskimage based nodes in AWS" 2023-08-24 08:54:51 +00:00
Zuul 7c9e1bc0d8 Merge "Add AWS volume quota support" 2023-08-23 00:22:47 +00:00
Zuul 255070c245 Merge "Don't reload config when file wasn't modified" 2023-08-17 08:28:31 +00:00
Zuul 909973ff06 Merge "Update Azure API and add volume-size" 2023-08-16 23:28:14 +00:00
Zuul 0c9099a20d Merge "Add Azure gallery image support" 2023-08-16 23:28:12 +00:00
Zuul 6dde5c55cb Merge "Add ZK cache stats" 2023-08-14 21:00:10 +00:00
James E. Blair c2d9c45655 AWS: Add support for retrying image imports
AWS has limits on the number of image import tasks that can run
simultaneously.  In a busy system with large images, it would be
better to wait until those limits clear rather than delete the
uploaded s3 object and start over, uploading it again.  To support
this, we now detect that condition and optionally retry for a
specified amount of time.

The default remains to bail on the first error.
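
A sketch of the retry loop (the exception name and sleep interval are
assumptions):

    import time

    class ImportLimitExceeded(Exception):  # stand-in for the real error
        pass

    def import_with_retry(start_import, retry_seconds=0):
        # retry_seconds=0 preserves the default: bail on first error.
        deadline = time.monotonic() + retry_seconds
        while True:
            try:
                return start_import()
            except ImportLimitExceeded:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(60)  # wait for running imports to clear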

Change-Id: I6aa7f79b2f73c4aa6743f11221907a731a82be34
2023-08-12 11:45:22 -07:00
Benedikt Loeffler 16ef0f8bad Test if username is set for diskimage based nodes in AWS
Change-Id: I4c13e95baf213831d02c31df0e91c82c2c86d3d8
2023-08-03 12:31:31 -07:00
James E. Blair 07c83f555d Add ZK cache stats
To observe the performance of the ZK connection and the new tree
caches, add some statsd metrics for each of these.  This will
let us monitor queue size over time.

Also, update the assertReportedStat method to output all received
stats if the expected stat was not found (like Zuul).

Change-Id: Ia7e1e0980fdc34007f80371ee0a77d4478948518
Depends-On: https://review.opendev.org/886552
2023-08-03 10:27:25 -07:00
James E. Blair 4ef3ebade8 Update references of build "number" to "id"
This follows the previous change and is intended to have little
or no behavior changes (only a few unit tests are updated to use
different placeholder values).  It updates all textual references
of build numbers to build ids to better reflect that they are
UUIDs instead of integers.

Change-Id: I04b5eec732918f5b9b712f8caab2ea4ec90e9a9f
2023-08-02 11:18:15 -07:00