Commit Graph

171 Commits

Author SHA1 Message Date
James E. Blair 619dee016c Continuously ensure the component registry is up to date
On startup, the launcher waits up to 5 seconds until it has seen
its own registry entry, because it uses the registry to decide whether
other components are able to handle a request and, if not, to fail
the request.

In the case of a ZK disconnection, we will lose all information
about registered components as well as the tree caches.  Upon
reconnection, we will repopulate the tree caches and re-register
our component.

If the tree cache repopulation happens first, our component
registration may be in line behind several thousand ZK events.  It
may take more than 5 seconds to repopulate and it would be better
for the launcher to wait until the component registry is up to date
before it resumes processing.

To fix this, instead of only waiting on the initial registration,
we check each time through the launcher's main loop that the registry
is up-to-date before we start processing.  This should include
disconnections because we expect the main loop to abort with an
error and restart in those cases.

This operates only on local cached data, so it doesn't generate any
extra ZK traffic.

Change-Id: I1949ec56610fe810d9e088b00666053f2cc37a9a
2024-03-04 14:28:11 -08:00
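
A rough sketch of the check described above (illustrative Python only;
registry.has, component_id and watermark_sleep are assumed names, not
Nodepool's actual API):

    import time

    def main_loop(launcher):
        while launcher.running:
            # Uses only locally cached registry data; no extra ZK traffic.
            if not launcher.registry.has(launcher.component_id):
                # Registry cache not caught up yet (e.g. after a ZK
                # reconnect); skip this iteration rather than declining
                # requests based on stale component data.
                time.sleep(launcher.watermark_sleep)
                continue
            launcher.assignHandlers()
            time.sleep(launcher.watermark_sleep)
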
Zuul 2240f001eb Merge "Refactor config loading in builder and launcher" 2024-02-15 10:23:51 +00:00
Simon Westphahl 4ae0a6f9a6
Refactor config loading in builder and launcher
In I93400cc156d09ea1add4fc753846df923242c0e6 we refactored the
launcher config loading to use the last-modified timestamps of the
config files to detect whether a reload is necessary.

In the builder the situation is even worse, as we reload and compare
the config much more often, e.g. in the build worker when checking for
manual or scheduled image updates.

With a larger config (2-3MB range) this is a significant performance
problem that can lead to builders being busy with config loading instead
of building images.

Yappi profile (performed with the optimization proposed in
I786daa20ca428039a44d14b1e389d4d3fd62a735, which doesn't fully solve the
problem):

name                                  ncall  tsub      ttot      tavg
..py:880 AwsProviderDiskImage.__eq__  812..  17346.57  27435.41  0.000034
..odepool/config.py:281 Label.__eq__  155..  1.189220  27403.11  0.176285
..643 BuildWorker._checkConfigRecent  58     0.000000  27031.40  466.0586
..depool/config.py:118 Config.__eq__  58     0.000000  26733.50  460.9225

Change-Id: I929bdb757eb9e077012b530f6f872bea96ec8bbc
2024-01-30 13:59:36 +01:00
James E. Blair cc51696a33 Fix max concurrency log line
The extra comma produces a (non-fatal) log formatting error.  Correct it.

Change-Id: I144347da2ac99cba788da6e60889d2b2bc320c6e
2023-12-21 07:45:14 -08:00
Zuul ac703de734 Merge "Resolve statsd client once at startup" 2023-12-09 16:18:57 +00:00
Zuul 76fb25d529 Merge "Handle invalid AWS instance in quota lookup" 2023-09-25 21:38:34 +00:00
James E. Blair 711104d3b4 Optimize order of operation in cleanupNodes
We have some quick local checks that let us skip expensive operations
in the cleanupNodes method, but we run them after a code block that
could be expensive.  Run them before it instead.

The block in question is responsible for finding ready nodes that were
allocated to a request which has since been deleted.  This ran before
the quick local checks because technically it could be executed despite
them failing, but in practice, it doesn't make much sense.  There are
two checks:

1) That this provider is the node's provider.

   This isn't necessary -- any provider *could* deallocate the request,
   which could speed up the recovery of ready-but-not-allocated nodes.

   In practice, the node's provider is going to be the next thing to do
   something with the node anyway, so why have every provider in the
   system trying to lock the node when we can just let its own provider
   do it?

2) That the node is (probably) not locked.

   We have a weak check for locking here, in that the cache could be
   slightly out of sync with reality.  But it's good enough to generally
   prevent us from locking nodes that are likely already locked.  By
   skipping the lock attempt, we can save some time.  Especially since
   every successful node is going to have a window where the node is ready
   and locked but the request is deleted.  This happens between the time
   an executor starts a job and it actually touches the nodes.  During
   that window, the block would execute and fail to lock the nodes.
   Let's just skip it in that case.

Change-Id: Id1814f194e987032a2e797fe25ab91cfca47693c
2023-09-05 15:08:45 -07:00
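
The reordering might look roughly like this (simplified Python sketch;
the helper and attribute names are assumptions, not the real method):

    def cleanup_nodes(pool, cached_nodes, deallocate_if_request_gone):
        for node in cached_nodes:
            # Quick local checks first: skip nodes that belong to another
            # provider and nodes the cache says are probably locked.
            if node.provider != pool.provider_name:
                continue
            if node.lock_contenders > 0:
                continue
            # Only now run the potentially expensive block that locks the
            # node and deallocates it if its request has been deleted.
            deallocate_if_request_gone(node)
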
Zuul 255070c245 Merge "Don't reload config when file wasn't modified" 2023-08-17 08:28:31 +00:00
Zuul 6d61644f07 Merge "Sleep between deferred requests on paused handler" 2023-08-16 13:36:12 +00:00
James E. Blair 21b8451947 Resolve statsd client once at startup
We currently create a new statsd client each time we replace a
provider manager.  This means that if we are unable to resolve
DNS at that time, the new provider may crash due to unhandled
exceptions.

To resolve this, let's adopt the same behavior we have in Zuul
which is to set up the statsd client once at startup and continue
to use the same client object for the life of the process.

This means that operators will still see errors at startup during
a misconfiguration, but any external changes after that will
not affect nodepool.

Change-Id: I65967c71e859fddbea15aee89f6ddae44344c87b
2023-08-14 10:47:53 -07:00
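
A sketch of the resulting pattern, using the statsd package directly
(Nodepool's real stats setup differs; treat these names as illustrative):

    import os
    import statsd

    def make_statsd_client():
        host = os.environ.get('STATSD_HOST')
        if not host:
            return None
        port = int(os.environ.get('STATSD_PORT', 8125))
        # DNS resolution (and any failure) happens here, once, at startup.
        return statsd.StatsClient(host, port)

    # Created once for the life of the process and handed to each new
    # provider manager instead of being re-created on reconfiguration.
    STATSD_CLIENT = make_statsd_client()
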
James E. Blair 07c83f555d Add ZK cache stats
To observe the performance of the ZK connection and the new tree
caches, add some statsd metrics for each of these.  This will
let us monitor queue size over time.

Also, update the assertReportedStat method to output all received
stats if the expected stat was not found (like Zuul).

Change-Id: Ia7e1e0980fdc34007f80371ee0a77d4478948518
Depends-On: https://review.opendev.org/886552
2023-08-03 10:27:25 -07:00
James E. Blair a602654482 Handle invalid AWS instance in quota lookup
Early nodepool performed all cloud operations within the context of
an accepted node request.  This means that any disagreement between
the nodepool configuration and the cloud (such as what instance types,
images, networks, or other resources are actually available) would
be detected within that context and the request would be marked as
completed and failed.

When we added tenant quota support, we also added the possibility
of needing to interact with the cloud before accepting a request.
Specifically, we may now ask the cloud what resources are needed
for a given instance type (and volume, etc) before deciding whether
to accept a request.  If we raise an exception here, the launcher
will simply loop indefinitely.

To avoid this, we will add a new exception class to indicate a
permanent configuration error which was detected at runtime.  If
AWS says an instance type doesn't exist when we try to calculate
its quota, we mark it as permanently errored in the provider
configuration, then return empty quota information back to the
launcher.

This allows the launcher to accept the request, but then immediately
mark it as failed because the label isn't available.

The state of this error is stored on the provider manager, so the
typical corrective action of updating the configuration to correct
the label config means that a new provider will be spawned with an
empty error label list; the error state will be cleared and the
launcher will try again.

Finally, an extra exception handler is added to the main launcher
loop so that if any other unhandled errors slip through, the
request will be deferred and the launcher will continue processing
requests.

Change-Id: I9a5349203a337ab23159806762cb46c059fe4ac5
2023-07-18 13:51:13 -07:00
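
The pattern could be sketched like this (illustrative Python; the
exception class and the lookup helper are assumptions, not the actual
AWS driver code):

    class ProviderConfigError(Exception):
        """Permanent configuration error detected at runtime."""

    def label_quota(manager, label):
        if label.name in manager.errored_labels:
            # Empty quota information: the launcher accepts the request
            # and then immediately fails it as label-unavailable.
            return {}
        try:
            return manager.lookup_instance_resources(label.instance_type)
        except ProviderConfigError:
            # e.g. AWS reports the instance type doesn't exist; remember
            # that on the provider manager so we stop asking the cloud.
            manager.errored_labels.add(label.name)
            return {}
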
Simon Westphahl e959b4efc1
Sleep between deferred requests on paused handler
It looks like a launcher can busy-loop and starve other threads when
a provider is at quota and a lot of requests are deferred.

Sleep 250 ms between deferred requests so as not to hog the CPU.

Change-Id: I3dbf74ee2fe50308d1b5b286f091a052ec6c5ef9
2023-07-14 12:07:35 +02:00
Simon Westphahl 1a71b46cb5
Don't reload config when file wasn't modified
The launcher thread will reload the config in the main loop every second
(watermark sleep) without checking if the config changed. This can take
quite a bit of CPU time as we've seen from different Yappi profiles.

This change uses the last modified timestamp of the config file to
detect if the Nodepool config needs to be reloaded.

The test `test_nodepool_occ_config_reload` was changed to update the
provider config so that a new OpenStack client object is created.

Change-Id: I93400cc156d09ea1add4fc753846df923242c0e6
2023-06-30 07:40:59 +02:00
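
A minimal sketch of the mtime check (illustrative only; attribute names
are assumptions):

    import os

    def maybe_reload_config(launcher, path):
        mtime = os.stat(path).st_mtime
        if mtime == launcher.config_mtime:
            # Unchanged file: skip the expensive parse and comparison.
            return launcher.config
        launcher.config = launcher.load_config(path)
        launcher.config_mtime = mtime
        return launcher.config
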
Zuul 61b5ca3c8c Merge "Defer node request when label is not available" 2023-06-29 06:03:00 +00:00
Zuul df8a96da9b Merge "Announce only pool labels in pool component" 2023-06-29 06:01:43 +00:00
James E. Blair 218da4aa81 Use cache more aggressively when searching for ready nodes
When assigning node requests, the pool main loop attempts to find
existing ready nodes for each request it accepts before creating
a NodeLauncher (the class that usually causes asynchronous
node creation to happen).

Even though we iterate over every ready node in the system looking
for a match, this should be fast because:

* We only consider nodes from our provider+pool.
* We only consider nodes which are not already assigned (and newly
  created nodes are created with assigned_to already set).
* We use the node cache.

However, there is room for improvement:

* We could use cached_ids so that we don't have to list children.
* We can avoid trying to lock nodes that the cache says are already
  locked (this should generally be covered by the "assigned_to" check
  mentioned above, but this is a good extra check and is consistent
  with other new optimizations).
* We can ignore nodes in the cache with incomplete data.

The first and last items are likely to make the most difference.

It has been observed that in a production system we can end up spending
much more time than expected looking for ready nodes.  The most likely
cause for this is cache misses, since if we are able to use cached data
we would expect all the other checks to be dealt with quickly.  That
leaves nodes that have appeared very recently (so they show up in a
live get_children call but aren't in the cache yet, or if they are,
are in the cache with incomplete (empty) data) or nodes that have
disappeared very recently (so they showed up in our initial listing
but by the time we get around to trying to lock them, they have been
removed from the cache).

This also adds a log entry to the very chatty "cache event" log handler
in the case of a cache miss in order to aid future debugging.

Change-Id: I32a0f2d0c2f1b8bbbeae1441ee48c8320e3b9129
2023-05-30 14:43:12 -07:00
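
A simplified sketch of the tightened search (field and method names are
assumptions, not the actual implementation):

    def find_ready_node(node_cache, request, provider, pool):
        # cached_ids avoids a live get_children call against ZooKeeper.
        for node in node_cache.iter_nodes(cached_ids=True):
            if node.state != 'ready':
                continue
            if node.provider != provider or node.pool != pool:
                continue
            if node.allocated_to:       # already assigned elsewhere
                continue
            if node.lock_contenders:    # cache says it's probably locked
                continue
            if not node.complete:       # incomplete cached data; skip it
                continue
            if request.label in node.types:
                return node
        return None
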
James E. Blair 99d2a361a1 Use cached ids in node iterator more often
There are several places where it is now probably safe to use
cached ids when iterating over ZK nodes.  The main reasons not to
use cached ids are in the case of race conditions or in case the
tree cache may have missed an event and is unaware of a node.  We
have increased confidence in the accuracy of our cache now, so at
least in the cases where we know that races are not an issue, we
can switch those to use cached ids and save a ZK round trip (a
possibly significant one if there is a long list of children).

This change adds the flag in the following places (with
explanations of why it's safe):

* State machine cleanup routines

    Leaked instances have to show up on two subsequent calls to
    be acted upon, so this is not sensitive to timing

* Quota calculation

    If we do get the quota wrong, drivers are expected to handle
    that gracefully anyway.

* Held node cleanup

    Worst case is we wait until next iteration to clean up.

* Stats

    They're a snapshot anyway, so a cache mismatch is really just
    a shift in the snapshot time.

Change-Id: Ie7af2f62188951bf302ffdb64827d868609a1e3c
2023-05-30 13:27:45 -07:00
Simon Westphahl 0a858c4be8
Double check node allocation in a few more places
Since we operate a lot on cached data we need to make sure that the
state of the node did not change after we've locked it.

This change mainly focuses on checking the `allocated_to` field where
necessary.

Change-Id: I7eb51464f5de88a973ee39e25b57ec9ff2d851da
2023-05-24 18:00:13 +02:00
Simon Westphahl adf44ecdf0
Double check node request during node cleanup
This change fixes a race condition between a node getting re-used and
the cleanup of reusable ready nodes.

The sequence of events seems to be the following:
* "ready" node-x allocated to request-1 with request-1 no longer
  existing
* cleanup-worker-1 starts iterating (cached data)
* another cleanup-worker-2 starts iterating (cached data)
* node-x is deallocated by cleanup-worker-1
* node-x is allocated to request-2
* cleanup-worker-2 deallocates node-x as it is operating on cached data
  thinking that the node is still allocated to request-1
* node-x is allocated to request-3
* executor for job-2 tries to lock node-x (which has meanwhile been
  allocated to request-3), leading to the following exception

2023-05-24 11:55:12,116 ERROR zuul.nodepool: [e: 0943d670-fa28-11ed-8540-a307d01e6b77] Error locking nodes:
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.10/site-packages/zuul/nodepool.py", line 381, in lockNodes
    raise Exception("Node %s allocated to %s, not %s" %
Exception: Node 0032857715 allocated to 100-0031415561, not 200-0031415229

Change-Id: If80624cad9a7beb13b6ece3dadddb5beba9243fc
2023-05-24 17:31:44 +02:00
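
A simplified sketch of the double check (the ZooKeeper helper names
here are assumptions):

    def deallocate_stale_node(zk, node, allocation_seen_in_cache):
        if not zk.lock_node(node, blocking=False):
            return
        try:
            zk.refresh_node(node)  # locking refreshes our view of the node
            if node.allocated_to != allocation_seen_in_cache:
                # The node was re-used while we were looking at cached
                # data; leave it alone.
                return
            if zk.request_exists(node.allocated_to):
                return
            node.allocated_to = None
            zk.store_node(node)
        finally:
            zk.unlock_node(node)
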
Tobias Henkel b79f33fc6b
Defer node request when label is not available
During assignHandlers we already gather the pools that can serve a
label. If the pool doesn't offer a label of the request, the
current behavior is to lock and decline the node request right
away. In a system where there are many pools that don't satisfy a
label, most requests typically get locked and declined by many pools
before they get served by a pool that supports a label.

This changes the behavior so that request handling is deferred
if the label is not supported by the pool but we know that there is
another pool that supports it. This way the pools that are in charge
of the label will be the first ones to lock and process the
request. If there is no candidate launcher pool anymore, the request
will still be processed and declined in the next assignHandlers loop.

In summary, this should reduce load on ZooKeeper and at the same time
improve the performance of the assignHandlers loop by avoiding locking
in a potentially large number of cases.

Change-Id: Ic48b87218470d3c652c1e0b3815298354789efea
2023-05-23 11:23:44 +02:00
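
The deferral decision could be sketched roughly as follows (names are
illustrative only):

    def should_defer(request, pool_labels, candidate_launcher_pools):
        missing = [t for t in request.node_types if t not in pool_labels]
        if not missing:
            return False            # we can serve it; handle it normally
        for label in missing:
            if any(label in p.supported_labels
                   for p in candidate_launcher_pools):
                # Another pool advertises this label: defer and let it
                # lock and process the request first.
                return True
        # No pool can serve it; fall through so we decline it ourselves
        # in the next assignHandlers loop.
        return False
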
Tobias Henkel 71f3ccf2bf
Announce only pool labels in pool component
Currently nodepool announces all labels from all providers in its
config on all PoolComponents. Although this doesn't directly break
anything, it results in a wrong set of candidate_launcher_pools, which
might affect performance when yielding to other providers.

Change-Id: Ie0773830806f8823043f34cf386e6b1d55e3baf0
2023-05-12 12:56:11 +02:00
James E. Blair 3e8dce8873 Add missing request unlock
Because we now iterate over cached requests, when we decide that
we should lock a request to attempt to handle or decline it, we
double check the status of the request after locking it.  The act
of locking the request refreshes the data from ZK, so at this point
we are certain that the data are up to date.

In the path where, after locking and refreshing, the double check
finds that we had already declined the request, we are missing an
instruction to unlock the request.  This can leave a request
perpetually locked.

This change adds the missing unlock.

Change-Id: Ifcf52bbca03329a8ca7015412f9aaf795c5ae7c0
2023-05-01 10:34:14 -07:00
James E. Blair 190e432da4 Reduce sleeps to improve time to ready
The statemachine driver currently sleeps 10 seconds between each
iteration of the loop that runs all the state machines.  This time
can delay the time to ready for nodes if the cloud is relatively
fast.  This change reduces the delay to a max of 1 second (if the
loop takes longer than 1 second, the delay will be 0).

Additionally, the Lazy TTL cache used in the openstack driver has
a TTL of 10 seconds.  This can contribute to long times to ready
as well.  This change reduces the TTL to 1 second (we will wait
at least 1 second after the completion of a list servers call before
we fire off the next one).  Because list servers can take quite some
time, this could cause one of the ThreadPoolExecutor threads to spend
more time waiting for a list servers call to return.  However, local
load testing with 600 outstanding requests shows no impact to
overall throughput.

Finally, in a semi-related change, the launcher watermark sleep
(the time that the launcher waits between loop iterations to check
for new handlers and completed handlers) is reduced as well.  Now
that most of the launcher work operates on local tree cache data,
we can run this more often without incurring additional ZK load.

Given a hypothetical cloud that responds instantaneously, we should
be able to satisfy a request in about 3 seconds.

The first two changes improve response time without lowering
throughput; the third change improves both.

These will output considerably more debug log messages.

Change-Id: Idaad3de2a2afb5d51ede680ecf33c1d5c62fbdbb
2023-04-17 10:45:08 -07:00
Zuul f0ba5169a3 Merge "Handle NoNodeError in _assignHandlers" 2023-04-13 22:21:08 +00:00
James E. Blair 198605b685 Log the reason we decline a request
Change-Id: I5d8f2cd2816579251ddaca4095255cabe860ec44
2023-04-10 15:57:01 -07:00
James E. Blair e5ba4f98f0 Use node cache in node deleter
The node deleter method runs every 5 seconds and iterates over
every node.  For any node that is not READY, it attempts to lock
the node to determine if it should be deleted.  In other words,
this method attempts to lock almost every node every 5 seconds.

Locking nodes is expensive (involving a number of ZK round trips).
However, our node cache keeps a count of the number of lock
contenders on a given node.  That can be used as a proxy for whether
the node is probably locked.  Zero contenders means it could be
unlocked.  One or more contenders means it's probably locked.

To greatly reduce ZK traffic, we can use the node cache and only
attempt to lock nodes if the cache indicates they are potentially
unlocked.  If we're wrong, we'll try again in 5 seconds anyway.

Change-Id: Ieb54babef92d5dbf3173316c6a1711e0e4a70403
2023-04-10 15:57:01 -07:00
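
A sketch of the cache-based pre-check (attribute and helper names are
assumptions):

    def maybe_cleanup_node(zk, cached_node, cleanup_node):
        if cached_node.state == 'ready':
            return
        # lock_contenders comes from the tree cache; zero means the node
        # is probably unlocked, so a lock attempt is worth the ZK round
        # trips.  If we're wrong, we try again in 5 seconds anyway.
        if cached_node.lock_contenders > 0:
            return
        if zk.lock_node(cached_node, blocking=False):
            try:
                cleanup_node(cached_node)
            finally:
                zk.unlock_node(cached_node)
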
James E. Blair b0a40f0b47 Use image cache when launching nodes
We consult ZooKeeper to determine the most recent image upload
when we decide whether we should accept or decline a request.  If
we accept the request, we also consult it again for the same
information when we start building the node.  In both cases, we
can use the cache to avoid what may potentially be (especially in
the case of a large number of images or uploads) quite a lot of
ZK requests.  Our cache should be almost up to date (typically
milliseconds, or at the worst, seconds behind), and the worst
case is equivalent to what would happen if an image build took
just a few seconds longer.  The tradeoff is worth it.

Similarly, when we create min-ready requests, we can also consult
the cache.

With those 3 changes, all references to getMostRecentImageUpload
in Nodepool use the cache.

The original un-cached method is kept as well, because there are
an enormous number of references to it in the unit tests and they
don't have caching enabled.

In order to reduce the chances of races in many tests, the startup
sequence is normalized to:
1) start the builder
2) wait for an image to be available
3) start the launcher
4) check that the image cache in the launcher matches what
   is actually in ZK

This sequence (apart from #4) was already used by a minority of
tests (mostly newer tests).  Older tests have been updated.
A helper method, startPool, implements #4 and additionally includes
the wait_for_config method which was used by a random assortment
of tests.

Change-Id: Iac1ff8adfbdb8eb9a286929a59cf07cd0b4ac7ad
2023-04-10 15:57:01 -07:00
Dong Zhang f84fed18dc Handle NoNodeError in _assignHandlers
It can happen that a pending request is cancelled by the scheduler when
the job is cancelled. In that case, trying to lock the request throws a
NoNodeError, which we just log and ignore.

Change-Id: Ia8c21ed0e44a6a906368e7a92ce9e49f22c30b1a
2023-03-22 12:00:08 +01:00
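
The handling could look roughly like this (kazoo's NoNodeError is real;
the lock call shown is an assumption about the surrounding code):

    from kazoo.exceptions import NoNodeError

    def try_lock_request(zk, request, log):
        try:
            zk.lockNodeRequest(request, blocking=False)
            return True
        except NoNodeError:
            # The scheduler cancelled the request (e.g. the job was
            # cancelled) between listing and locking; just move on.
            log.debug("Request %s disappeared before locking", request.id)
            return False
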
James E. Blair 70f143690d Make assignHandlers a generator
Our intent with assignHandlers is to run it almost continuously,
but in case it is slow, to occasionally remove completed handlers
and run paused handlers in order to make room for new requests
which may otherwise be starved.

The way that currently works is that we stop processing requests
after 15 seconds, allow the other methods to run, then start over.
If this happens continuously, we may never see requests near the
bottom of the list (which we might be able to satisfy).  To avoid
that, this change turns assignHandlers into a generator which
picks up where it left off each time we yield to the other
processors.

Change-Id: I32096ae7342cc8aafd2a14de79acc4267293349a
2023-03-09 13:33:51 -08:00
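
A simplified sketch of the generator pattern (not the actual method;
names are illustrative):

    import time

    def assign_handlers(launcher, timeout=15):
        start = time.monotonic()
        for request in launcher.iter_cached_requests():
            launcher.handle_or_decline(request)
            if time.monotonic() - start > timeout:
                # Yield control so completed/paused handlers can run,
                # then resume from this point instead of starting over.
                yield
                start = time.monotonic()
        yield

    # The caller keeps the generator across main-loop iterations and
    # drives it with next(gen, None).
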
James E. Blair fada5d9edf Don't update node request in assign handlers loop
Most of the evaluations that happen in the assignHandlers method
are safe to operate on slightly out-of-date node requests.  They
mostly consist of comparing what the request is asking for to what
the provider has available.

The loop iterates over a set of cached node requests which can be
slightly out of date (but not too much) because they are updated
out-of-band by a cache listener.  It's safe for us to make these
comparisons on request data that generally doesn't change over time
anyway.  We can greatly speed up the loop by avoiding the explicit
refresh of every request.  If we're a little wrong and we defer
request handling, we'll get to it the next time through the loop.

Later in the loop, we lock the request, and at that point, the lock
routine automatically refreshes the request object in place, so we
know we have current data.

There are a few things we check that can change: the request status,
and the declined worker list (though it's very unlikely to be out of
sync in such a way that it's missing our own name).  We can check
those again after the lock to be sure.

Change-Id: Ib9e16f9e16d05537171b1acb37aa110477495a6e
2023-03-09 13:30:29 -08:00
Simon Westphahl 3209226fc1 Respect timeout when not accepting requests
So far the timeout was only in effect when any of the requests was
accepted by the launcher. When there are a lot of other requests (e.g.
to be declined) we might exceed the timeout in some cases.

Change-Id: Iaed8302c94e12467834fadee25a80198db5e629d
2023-03-09 13:29:46 -08:00
Tobias Henkel 1ed2b855c8 Process paused requests first
We observed a starvation problem in the following scenario leading to
gaps in request processing of several minutes.

Scenario:
- there are many pending node requests
- a request handler gets paused by running into quota

In this case nodepool loops over all node requests, deferring all of
them because of the paused handler, until it retries the paused
handler. In our case this took 10 minutes until it unpaused and
continued working normally. As long as the request queue is long,
this starts over as soon as the provider reaches the cloud quota again.

Currently, paused handlers are processed only before looping through
the node requests. Similar to the fix for the starvation problem with
removing completed handlers, we now process them within the loop as
well.

Change-Id: Iadacd4969c883574d8947e8ab2313e42820cb298
2023-03-09 13:29:05 -08:00
James E. Blair 06e5d2f843 Add debug log messages to handler assignment/removal
In some circumstances, I'm seeing the assign handlers method
run much longer than expected.  This may help identify the
problem.

Change-Id: I9a636dc325fa83125a20e5a3d4c24c215078093e
2023-02-28 06:51:43 +00:00
Simon Westphahl 0f1680be7e
Serve all paused handlers before unpausing
The launcher implementation assumed that only one request handler would
be paused at any given point in time. However, this is not true when,
e.g., a request handler accepts multiple requests that all run into a
quota limit during launch.

The consequence of this is that the pool is unpaused too early and we
might accept other node requests until the provider is paused again.
This could lead to a starvation of earlier paused handlers as they were
fulfilled in a LIFO fashion.

To fix this edge case we will store paused request handlers in a set and
only unpause the provider when there are no paused handlers anymore.
Paused handlers are now also run in priority order.

Change-Id: Ia34e2844533ce9942d489838c4ce14a605d79287
2022-10-20 12:06:11 +02:00
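
The set-based tracking might look like this (simplified; attribute
names are assumptions):

    def run_paused_handlers(pool):
        # Run paused handlers in priority order; a handler leaves the
        # set only once it has finished its accepted requests.
        for handler in sorted(pool.paused_handlers,
                              key=lambda h: h.request.priority):
            handler.run()
            if not handler.paused:
                pool.paused_handlers.discard(handler)
        # Unpause the provider only when no handler remains paused.
        pool.paused = bool(pool.paused_handlers)
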
James E. Blair abcba60a16 Demote "Starting/Finished cleanup" log entries to debug
These two log entries cause 4 log lines to be emitted every 5 seconds
on an idle system.  They should not be at info level.

Change-Id: I074a1eca9e588074a2d9ecd636236732f5894a1c
2022-09-21 07:32:05 -07:00
James E. Blair f31a0dadf8 Cause providers to continue to decline requests when at quota
When a provider is at quota, we pause it, and paused means paused.
That means we don't do anything with any other requests.

Unfortunately, that includes requests that the given provider can't
even handle.  So if a provider pauses because it is at quota while
other providers continue to operate, and a request arrives for a node
type that no provider can handle, then that request will remain
outstanding until this provider becomes unpaused and can decline it.

Requests shouldn't need to wait so long to be declined by providers
which can never under any circumstances handle them.  To address this,
we will now run the assignHandlers method whether we are paused or
not.  Within assignHandlers, we will process all requests regardless
of whether we are paused (but we won't necessarily accept them yet).
We will decide whether a request will be declined or not, and if it
will be declined, we will do so regardless of whether we are paused.
Finally, only if we are unpaused and do not expect to decline the
request will we accept it.

Change-Id: Ied9e4577670ea65b1d5ecfef95a7f837a7b6ac61
2022-09-16 17:16:09 -07:00
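
The resulting flow could be sketched as follows (illustrative only):

    def assign_handlers(pool, cached_requests):
        for request in cached_requests:
            will_decline = not pool.can_ever_handle(request)
            if will_decline:
                # Declining is cheap and safe even while paused, so the
                # request doesn't wait for this provider to unpause.
                pool.lock_and_decline(request)
                continue
            if pool.paused:
                # At quota: leave servable requests for later.
                continue
            pool.accept(request)
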
Simon Westphahl 7b4ce1be8e Consider all node types when adjusting label quota
Since nodes can have multiple other labels apart from the requested
type, we need to adjust the available quota for all labels of nodes that
were allocated to a request.

This fixes a problem where a static driver pool could pause because the
requested node types were no longer available but the request was still
accepted due to wrong label quota.

Change-Id: Ia9626ec26a66870574019ecc3f119a18e6c5022d
2022-08-24 15:40:36 +02:00
James E. Blair 7bbdfdc9fd Update ZooKeeper class connection methods
This updates the ZooKeeper class to inherit from ZooKeeperBase
and utilize its connection methods.

It also moves the connection loss detection used by the builder
to be more localized and removes unused methods.

Change-Id: I6c9dbe17976560bc024f74cd31bdb6305d51168d
2022-06-29 07:46:34 -07:00
Zuul 6416b14838 Merge "Default limits for k8s labels and quota support" 2022-05-31 09:19:59 +00:00
Zuul a2a1a4d8cd Merge "Update some variable names" 2022-05-24 16:02:36 +00:00
Zuul ff7dd84aef Merge "Add provider/pool priority support" 2022-05-24 16:02:34 +00:00
Zuul 492f6d5216 Merge "Add the component registry from Zuul" 2022-05-24 01:02:26 +00:00
Zuul a4acb5644e Merge "Use Zuul-style ZooKeeper connections" 2022-05-23 22:56:54 +00:00
James E. Blair 1323d0b556 Update some variable names
Now that the component we registered is a "pool", change the call
sites to use "launcher_pools" instead of "launchers".  This may
reduce some ambiguity.

(s/launcher/pool/ might still be ambiguous since it may not be clear
whether we're talking about our own pools or other pools; thus the
choice of "launcher_pool" for the variable name.)

Also, remove a redundant test assertion.

Change-Id: I865883cdb115bf72a3bd034d9290f60666d64b66
2022-05-23 13:30:50 -07:00
James E. Blair ea35fd5152 Add provider/pool priority support
This lets users configure providers which should fulfill requests
before other providers.  This facilitates using a less expensive
cloud before using a more expensive one.

The default priority is 100, to facilitate either raising above
or lowering below the default (while using only positive integers
in order to avoid confusion).

Change-Id: I969ea821e10a7773a0a8d135a4f13407319362ee
2022-05-23 13:28:21 -07:00
James E. Blair a612aa603c Add the component registry from Zuul
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).

This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.

Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
2022-05-23 07:41:27 -07:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Tobias Henkel d6ed4b0d5b Add more logs to cleanup workers
Add start/end logs around the cleanup method so we can see the time it
takes to run.

Change-Id: I7504a57b5dbc9af181939622ff26ba0c26fc57b2
2022-05-19 15:20:58 +02:00
Benjamin Schanzel d60a27a787
Default limits for k8s labels and quota support
This adds config options to enforce default resource (CPU, memory)
limits on k8s pod labels. With this, we can ensure all pod nodes have
resource information set on them, which allows accounting for max-cores
and max-ram quotas for k8s pod nodes; the corresponding config options
are added as well. Tenant quotas can then also be considered for pod
nodes.

Change-Id: Ida121c20b32828bba65a319318baef25b562aef2
2022-05-02 11:35:04 +02:00