Commit Graph

28 Commits

Author SHA1 Message Date
James E. Blair 25d802df75 Fix error with static node reuse
An error in the slot logic was causing static nodes not to be
immediately re-enrolled after deletion.  This went unnoticed because
the cleanup worker acts as a fallback and re-enrolls them.

Correcting this logic avoids an approximately 60-second delay
in static node reuse.

Change-Id: Ib39e840555bc998058ead9b284908ff2569ebf51
2023-06-26 15:38:58 -07:00
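
A rough sketch of the slot bookkeeping involved; SlotTable and its
methods are illustrative stand-ins, not Nodepool's actual API. The fix
amounts to freeing a slot as soon as its node is deleted, so
re-enrollment can claim it immediately instead of waiting for the
cleanup worker:

    class SlotTable:
        def __init__(self, size):
            self.slots = [None] * size  # None means the slot is free

        def claim(self, node_id):
            """Enroll a node into the lowest free slot; return its number."""
            for i, holder in enumerate(self.slots):
                if holder is None:
                    self.slots[i] = node_id
                    return i
            return None  # no free slot

        def release(self, node_id):
            """Free the slot held by a deleted node."""
            for i, holder in enumerate(self.slots):
                if holder == node_id:
                    self.slots[i] = None
                    return


    table = SlotTable(size=2)
    slot = table.claim("node-0001")
    table.release("node-0001")               # free eagerly on deletion...
    assert table.claim("node-0002") == slot  # ...so the replacement reuses it
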
James E. Blair b0a40f0b47 Use image cache when launching nodes
We consult ZooKeeper to determine the most recent image upload
when we decide whether we should accept or decline a request.  If
we accept the request, we also consult it again for the same
information when we start building the node.  In both cases, we
can use the cache to avoid what may be quite a lot of ZK requests
(especially in the case of a large number of images or uploads).
Our cache should be almost up to date (typically
milliseconds, or at the worst, seconds behind), and the worst
case is equivalent to what would happen if an image build took
just a few seconds longer.  The tradeoff is worth it.

Similarly, when we create min-ready requests, we can also consult
the cache.

With those 3 changes, all references to getMostRecentImageUpload
in Nodepool use the cache.

The original un-cached method is kept as well, because there are
an enormous number of references to it in the unit tests and they
don't have caching enabled.

In order to reduce the chances of races in many tests, the startup
sequence is normalized to:
1) start the builder
2) wait for an image to be available
3) start the launcher
4) check that the image cache in the launcher matches what
   is actually in ZK

This sequence (apart from #4) was already used by a minority of
tests (mostly newer ones); older tests have been updated.
A helper method, startPool, implements #4 and additionally includes
the wait_for_config method that a random assortment of tests
previously used.

Change-Id: Iac1ff8adfbdb8eb9a286929a59cf07cd0b4ac7ad
2023-04-10 15:57:01 -07:00
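
A sketch of the cache-first lookup pattern described above; ImageCache
and its fields are hypothetical stand-ins for Nodepool's ZooKeeper
cache, not its real API:

    class ImageCache:
        def __init__(self, zk_client):
            self.zk = zk_client
            # (image, provider) -> upload records; kept fresh by
            # ZooKeeper watches in practice
            self._uploads = {}

        def get_most_recent_image_upload(self, image, provider, cached=True):
            if cached:
                # Millisecond-to-second staleness is acceptable: the worst
                # case looks like an image build finishing slightly later.
                uploads = self._uploads.get((image, provider), [])
            else:
                # The un-cached path may issue many ZK reads.
                uploads = self.zk.get_uploads(image, provider)
            ready = [u for u in uploads if u["state"] == "ready"]
            return max(ready, key=lambda u: u["state_time"], default=None)


    cache = ImageCache(zk_client=None)  # stub; uncached path unused here
    cache._uploads[("ubuntu", "cloud-a")] = [
        {"state": "ready", "state_time": 1000.0},
        {"state": "ready", "state_time": 2000.0},
    ]
    assert cache.get_most_recent_image_upload(
        "ubuntu", "cloud-a")["state_time"] == 2000.0
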
James E. Blair be3edd3e17 Convert openstack driver to statemachine
This updates the OpenStack driver to use the statemachine framework.

The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:

1) We can dramatically reduce the number of threads in Nodepool, which
is our biggest scaling bottleneck.  The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers.  The statemachine
framework is an extension of that idea implemented so that every driver
can use it.  This change further reduces the number of threads needed
even for the openstack driver.

2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.

There are a few updates to the statemachine framework to accommodate some
features that only the OpenStack driver used to date.

A number of tests need slight alteration since the openstack driver is
the basis of the "fake" driver used for tests.

Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
2023-01-10 10:30:14 -08:00
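
A minimal sketch of the statemachine pattern the framework is built
around (illustrative names, not the framework's exact interface): each
node holds a small state object that a shared worker thread advances
with non-blocking steps, instead of one thread per node blocking on
cloud calls.

    class CreateStateMachine:
        START, WAITING, COMPLETE = "start", "waiting", "complete"

        def __init__(self, cloud, label):
            self.cloud = cloud
            self.label = label
            self.state = self.START
            self.external_id = None

        def advance(self):
            """One non-blocking step; a shared thread calls this repeatedly."""
            if self.state == self.START:
                self.external_id = self.cloud.submit_create(self.label)
                self.state = self.WAITING
            elif self.state == self.WAITING:
                if self.cloud.is_ready(self.external_id):
                    self.state = self.COMPLETE
            return self.state == self.COMPLETE


    class FakeCloud:
        def submit_create(self, label):
            return "server-1234"   # request accepted; returns immediately

        def is_ready(self, external_id):
            return True


    sm = CreateStateMachine(FakeCloud(), "ubuntu")
    while not sm.advance():
        pass  # a real worker would service many machines in this loop
    assert sm.state == CreateStateMachine.COMPLETE
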
James E. Blair 08fdeed241 Add "slots" to static node driver
Add persistent slot numbers for static nodes.

This facilitates avoiding workspace collisions on nodes with
max-parallel-jobs > 1.

Change-Id: I30bbfc79a60b9e15f1255ad001a879521a181294
2022-10-11 07:02:53 -07:00
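
A hedged illustration of why a persistent slot helps: if each
registration of a host carries a stable slot number, concurrent jobs
can derive disjoint workspaces from it. The path scheme here is an
assumption for illustration only:

    def workspace_path(slot):
        # Hypothetical layout: with max-parallel-jobs > 1, two jobs running
        # at once land in workspace-0 and workspace-1 instead of colliding.
        return f"/home/zuul/workspace-{slot}"

    print(workspace_path(0))  # /home/zuul/workspace-0
    print(workspace_path(1))  # /home/zuul/workspace-1
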
James E. Blair 1724d3a60f Don't pause the static provider at quota
The static provider avoids accepting requests for a label when
all the nodes of that label are busy.  In other words, it only
accepts requests when it expects to be able to fulfill them.

However, it also checks whether a node is reachable before fulfilling
a request.  If the reachability of a node changes after a request is
accepted but before it is fulfilled, Nodepool pauses the handler
until the request can be fulfilled.  If the node never comes back
online, no other nodes in the provider will be handled.

In general, we should never pause the static provider since pausing
is designed to allow for the accumulation of sufficient free resources
to fulfill the next request.  Static provider resources are not
fungible, so pausing until label A is available does nothing to help
requests for label B, which may be fulfillable the whole time.

To correct the problem, we no longer pause the static provider when
there is insufficient quota; instead we simply release the current
node request without declining it, and then allow the provider to
either defer it (if the node continues to be unreachable) or accept
it (if the node later becomes available again).

In other words, the handling of a request for a node with connection
problems will now be the same regardless of whether it is detected
on startup or while running.

Change-Id: I61565e9864efcfc81a205318d30972c25d4ea880
2022-09-12 13:02:40 -07:00
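
A sketch of the behavioral difference (hypothetical classes, not the
real handler code):

    class Request:
        def __init__(self, labels):
            self.labels = labels
            self.locked = True

        def unlock(self):
            self.locked = False


    class StaticProvider:
        def __init__(self, free_labels):
            self.free_labels = set(free_labels)

        def handle(self, request):
            if not set(request.labels) <= self.free_labels:
                # Old behavior: pause the whole pool until this request
                # could be fulfilled, blocking every other label.
                # New behavior: release the request without declining it,
                # so it is deferred or accepted on a later pass.
                request.unlock()
                return False
            return True  # proceed to fulfill


    provider = StaticProvider(free_labels=["small"])
    assert provider.handle(Request(["large"])) is False  # deferred, no pause
    assert provider.handle(Request(["small"])) is True
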
Simon Westphahl 7b4ce1be8e Consider all node types when adjusting label quota
Since nodes can have multiple other labels apart from the requested
type, we need to adjust the available quota for all labels of nodes that
were allocated to a request.

This fixes a problem where a static driver pool could pause because the
requested node types were no longer available, but the request was still
accepted due to an incorrect label quota.

Change-Id: Ia9626ec26a66870574019ecc3f119a18e6c5022d
2022-08-24 15:40:36 +02:00
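
A small sketch of the accounting rule (data shapes are illustrative):
allocating a node must subtract quota for every label the node carries,
because it can no longer serve any of them:

    def adjust_label_quota(quota, node_labels):
        for label in node_labels:
            if label in quota:
                quota[label] -= 1


    quota = {"small": 1, "large": 1}
    # One node carries both labels; a request for "small" consumes it.
    adjust_label_quota(quota, node_labels=["small", "large"])
    assert quota == {"small": 0, "large": 0}  # "large" is used up as well
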
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Simon Westphahl 3874d0b5c0 Add support for label quota
This change adds support for deferring requests when a provider
temporarily cannot fulfill a request based on the requested labels.
How the label quota is used depends on the driver.

The reason for introducing the concept of label quota has to do with how
the static driver handled requests when a label wasn't available
immediately. Previously, Nodepool locked the request and created a node
with the requested label(s) in state BUILDING. When a node with
matching label(s) was returned, the static driver re-registered it by
reusing one of the waiting nodes in state BUILDING.

The problem with this approach was that as soon as the request was
locked, Zuul could no longer revise its relative priority. This could
lead to later node requests being prioritized when they had a lower
initial relative priority.

To fix this, the static provider now uses the label quota to defer
requests that can't be fulfilled with the existing free nodes.

Change-Id: I55f6655c0e34b7a3c136b270658009b4f4d76394
2022-03-31 15:03:41 +02:00
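
A sketch of the deferral logic (hypothetical shapes; in Zuul a lower
relative_priority number is served first): label quota is checked
before locking, and unfulfillable requests stay unlocked so their
priority can still be revised:

    from collections import namedtuple

    Request = namedtuple("Request", "id labels relative_priority")

    def process_requests(requests, label_quota):
        accepted, deferred = [], []
        for request in sorted(requests, key=lambda r: r.relative_priority):
            if all(label_quota.get(label, 0) > 0 for label in request.labels):
                for label in request.labels:
                    label_quota[label] -= 1
                accepted.append(request)  # would be locked and fulfilled
            else:
                deferred.append(request)  # left unlocked; priority may change
        return accepted, deferred


    accepted, deferred = process_requests(
        [Request("200-1", ["small"], 1), Request("200-2", ["small"], 0)],
        {"small": 1})
    assert [r.id for r in accepted] == ["200-2"]  # higher priority wins
    assert [r.id for r in deferred] == ["200-1"]
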
James E. Blair 9bcc046ffc Add QuotaSupport to drivers that don't have it
This adds QuotaSupport to all the drivers that don't have it, and
also updates their tests so there is at least one test which exercises
the new tenant quota feature.

Since this is expected to work across all drivers/providers/etc, we
should start including at least rudimentary quota support in every
driver.

Change-Id: I891ade226ba588ecdda835b143b7897bb4425bd8
2022-01-27 10:11:01 -08:00
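
A rudimentary sketch of the kind of check QuotaSupport enables (the
real mixin aggregates cores, RAM, and instances per tenant; the names
here are illustrative):

    def tenant_has_room(used, limits, requested):
        """True if usage plus the request stays within the tenant's limits."""
        for resource, amount in requested.items():
            limit = limits.get(resource)
            if limit is not None and used.get(resource, 0) + amount > limit:
                return False
        return True


    assert tenant_has_room(used={"instances": 9}, limits={"instances": 10},
                           requested={"instances": 1})
    assert not tenant_has_room(used={"cores": 38}, limits={"cores": 40},
                               requested={"cores": 4})
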
Zuul 49997a424d Merge "Assign waiting static nodes by request priority" 2021-06-23 18:31:39 +00:00
James E. Blair 36a7df3677 Fix static driver equality check
A recent refactor of the config objects (which simplified the
repetitive __eq__ methods) missed updating the static driver.  This
could lead to infinite recursion (as the static driver explicitly
called super().__eq__ which itself called __eq__).

This updates the driver to use the new framework, and it also adds
a check to a unit test which exercises it.

Change-Id: I0f443cde147e46d73112025cd2f819159a8b4f86
2021-06-21 15:24:07 -07:00
Simon Westphahl 56acaf51d3 Assign waiting static nodes by request priority
The static provider creates nodes in state "building" when a label
is currently not available. When a freed-up node can be assigned to
a waiting node, we must use the request priority to decide which node
to update.

Previously, Nodepool ordered the waiting nodes by creation time. This
can lead to node requests being fulfilled in the wrong order in
certain cases.

Change-Id: Iae4091b1055d6bb0933f51ce1bbf860e62843206
2021-05-11 14:04:02 +02:00
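
A sketch of the ordering change (record shape is illustrative; a lower
priority value means serve first):

    def pick_waiting_node(waiting_nodes):
        # Old: min(waiting_nodes, key=lambda n: n["created"])
        return min(waiting_nodes, key=lambda n: n["request_priority"])


    waiting = [
        {"id": "0001", "created": 100.0, "request_priority": 2},
        {"id": "0002", "created": 200.0, "request_priority": 1},
    ]
    assert pick_waiting_node(waiting)["id"] == "0002"  # newer, higher priority
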
Albin Vass 0c84b7fa4e Add shell-type config
Ansible needs to know which shell type the node uses in order to
operate correctly, especially for SSH connections to Windows nodes,
because otherwise Ansible defaults to trying bash.

Change-Id: I71abfefa57aaafd88f199be19ee7caa64efda538
2021-03-05 15:14:29 +01:00
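
A hedged example of what such a configuration can look like in
nodepool.yaml; the host name, label, and credentials are placeholders,
and exact fields vary by driver and version:

    providers:
      - name: static-provider
        driver: static
        pools:
          - name: main
            nodes:
              # A Windows host reached over SSH: declaring the shell type
              # keeps Ansible from assuming bash.
              - name: winhost.example.com
                labels: windows-node
                connection-type: ssh
                shell-type: cmd
                username: zuul
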
Tobias Henkel de6eaacc8f Parallelize initial static node synchronization
We currently register static nodes serially on startup of the static
node provider. This is not a problem as long as all static nodes are
reachable. However, if multiple static nodes are unreachable and fail
with a timeout, startup of the provider can take very long, since it
must wait for many timeouts one after another. To handle this better,
parallelize the initial sync of static nodes.

Change-Id: I4e9b12de277ef0c3140815fb61fed612be2d9396
2020-04-21 09:44:40 +02:00
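
A sketch of the parallel sync (register_node is a stand-in for the
real registration, which can block on a connection timeout): with
several unreachable hosts, the timeouts now overlap instead of adding
up:

    import concurrent.futures

    def register_node(node):
        # Stand-in: may block for a full timeout if the host is unreachable.
        return f"registered {node}"


    def sync_static_nodes(nodes, max_workers=8):
        with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
            return list(pool.map(register_node, nodes))


    print(sync_static_nodes(["static01", "static02", "static03"]))
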
David Shrewsbury e389ae2af0 Support node-attributes in static driver
Because the static driver doesn't go through the common driver code
launch process (its nodes are pre-launched), it is in charge of setting
the node attributes itself. It wasn't setting the node-attributes attribute.

Change-Id: I865c3b15711f8c5559964859db92cb4499b901ae
2020-03-24 11:07:29 -04:00
Simon Westphahl b0c0d27384 Fix missing node tuple arg in static provider
The bug was introduced in https://review.opendev.org/#/c/701969/

Change-Id: Id32361a3c45c9e2c6fe6439c62a85952fe302c73
2020-01-14 09:37:30 +01:00
Simon Westphahl 40c703b906 Always identify static nodes by node tuple
In some places in the static provider the node was identified by
hostname instead of the node tuple (hostname, username, port).

In case a node is registered with the same name multiple times using
different users and/or ports, this caused problems in the code paths
that were expecting the hostname to be unique.

This was the case in the liveness check, where the provider did not,
for example, check different ports. Also, the re-registration of
deleted nodes sometimes did not use the correct node config.

Log messages were modified to log the node tuple instead of the
hostname.

Change-Id: I7ead7e9e6903249449d350dc35deffaf98171304
2020-01-10 15:08:26 +01:00
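
A sketch of the identity change: registrations keyed on the full
(hostname, username, port) tuple stay distinct where keying on the
bare hostname would collapse them:

    def node_tuple(node):
        return (node["hostname"], node["username"], node["connection-port"])


    registered = {}
    for node in [
        {"hostname": "static01", "username": "zuul", "connection-port": 22},
        {"hostname": "static01", "username": "admin", "connection-port": 2222},
    ]:
        registered[node_tuple(node)] = node

    assert len(registered) == 2  # keyed by hostname alone, one would be lost
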
Zuul 5a69b6ad37 Merge "Validate openstack provider pool labels have top-level labels" 2019-10-15 13:16:42 +00:00
Ian Wienand ddbcf1b07d Validate openstack provider pool labels have top-level labels
We broke nodepool configuration with
I3795fee1530045363e3f629f0793cbe6e95c23ca by not having the labels
defined in the OpenStack provider in the top-level label list.

The added check here would have found such a case.

The validate() function is reworked slightly; previously it would
propagate various exceptions from the tools it was calling (YAML,
voluptuous, etc.).  Now that we have more testing (and I'd imagine we
could do even more, similar validations too), we'd have to keep adding
exception types.  Just make the function return a value; this also
makes sure the regular exit paths are taken from the caller in
nodepoolcmd.py, rather than dying with an exception at an arbitrary
point.

A unit test is added.

Co-Authored-By: Mohammed Naser <mnaser@vexxhost.com>
Change-Id: I5455f5d7eb07abea34c11a3026d630dee62f2185
2019-10-15 15:32:32 +11:00
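
A simplified sketch of the reworked validation: collect problems,
including pool labels missing from the top-level label list, and
return a status code rather than letting exceptions escape to the
caller:

    def validate(config):
        errors = []
        top_level = {label["name"] for label in config.get("labels", [])}
        for provider in config.get("providers", []):
            for pool in provider.get("pools", []):
                for label in pool.get("labels", []):
                    if label["name"] not in top_level:
                        errors.append("pool label %r not in top-level labels"
                                      % label["name"])
        for error in errors:
            print(error)
        return 1 if errors else 0


    config = {"labels": [{"name": "small"}],
              "providers": [{"pools": [{"labels": [{"name": "big"}]}]}]}
    assert validate(config) == 1  # "big" missing from the top-level labels
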
Simon Westphahl 6afe2ece8d Sort waiting static nodes by creation time
ZooKeeper doesn't guarantee any ordering, so we need to make sure the
waiting nodes are sorted by creation time. This way waiting nodes are
fulfilled in order.

Change-Id: I440acd14d656105c91567f996ca31e0247fa3752
2019-10-10 14:09:00 +02:00
Simon Westphahl f5c509ae1d Don't touch static nodes that are allocated
The method for retrieving registered ready nodes from ZooKeeper was not
checking whether a node was allocated. This could lead to reconfiguration,
deletion or reassignment of allocated READY nodes.

Change-Id: I53202a03ea7bf8449a2404b927899eb3e868a19f
2019-10-10 14:09:00 +02:00
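
A sketch of the missing filter: a READY node that is allocated to a
request must not be treated as free:

    def registered_ready_nodes(nodes):
        return [n for n in nodes
                if n["state"] == "ready" and n.get("allocated_to") is None]


    nodes = [{"id": "0001", "state": "ready", "allocated_to": None},
             {"id": "0002", "state": "ready", "allocated_to": "req-200-42"}]
    assert [n["id"] for n in registered_ready_nodes(nodes)] == ["0001"]
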
Simon Westphahl d7ac31b5cd Assign static 'building' nodes in cleanup handler
In certain cases there can be a race between a node being re-registered
and a request handler creating a new 'building' node. This leads to
starvation of the 'building' node until a node with a matching label
is re-registered again (if ever).

To avoid this, the cleanup handler will check if there are any ready
nodes that can be assigned to nodes waiting for a matching label.

Change-Id: Ifc7fe2c59d5c4393c462f508ed449736ae145371
2019-10-08 12:19:09 +02:00
Tristan Cacqueray 9694382c47 static: add host-key-checking toggle
This change adds a static node toggle to disable checkHost and
nodescan.

Change-Id: If3aefb81624b573faf1129bb82d46ee848ff27cf
2019-07-19 19:34:18 +00:00
Simon Westphahl cb03f8b280 Don't pause static pool on single label quota
Nodes in a static pool can potentially have multiple labels. The current
implementation pauses any further request processing in case a label
cannot be served immediately.

With the proposed change, we re-use the existing launch logic to
create a placeholder node in BUILDING state. As soon as a used node is
re-registered, this placeholder node will be used and the pending
request fulfilled.

This allows having a single pool for all static nodes, which prevents
orphaned ZooKeeper entries. Those can occur when, for example, a
label-specific pool is removed or renamed.

Change-Id: If69cbeb33b8abb4f8e36146b9d849403dccb7d32
2019-07-12 14:33:36 +02:00
Tristan Cacqueray 552140037f static: enable using a single host with different user or port
This change modifies the static driver's registration logic to take
the username and connection-port into account when identifying unique
nodes, so that a single hostname can be registered multiple times
with different users or ports.

Change-Id: I8f770e13c0d00d65094abccee30e54ec0e6371b6
2019-06-13 00:01:47 +00:00
Tristan Cacqueray 720dcec886 static: fix missing python-path update
This change fixes a missing python-path node attribute update for
the static node provider.

Change-Id: I4a0b8dfd7e134c635acbf52d63e565638565cdbb
2019-05-13 09:54:08 +00:00
Tobias Henkel bcaa264712 Default host_keys to empty list in static driver
Everywhere in Nodepool we default host_keys to an empty list, and
Zuul relies on this behavior. However, the static driver sets it to
None for non-ssh connections. Change this to return an empty list to
match the behavior of the other drivers.

Change-Id: I3f14803c35a2039d2c53f4939857a531ce900097
2019-01-09 14:08:07 +01:00
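
A sketch of the behavior change (the node shape is illustrative):

    def host_keys_for(node):
        if node["connection-type"] != "ssh":
            return []  # previously None, which Zuul does not expect
        return node.get("host_keys", [])


    assert host_keys_for({"connection-type": "winrm"}) == []
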
David Shrewsbury 511ffd9c29 Add tox functional testing for drivers
Reorganizes the unit tests into a new subdirectory, and
adds a new functional test subdirectory (that will be further
divided into subdirs for drivers).

Change-Id: I027bba619deef9379d293fdd0457df6a1377aea8
2018-11-01 15:33:44 +01:00