Commit Graph

50 Commits

Author SHA1 Message Date
Zuul 4b8bb4668e Merge "Fix error with static node reuse" 2023-06-28 23:28:00 +00:00
James E. Blair 25d802df75 Fix error with static node reuse
An error in the slot logic was causing static nodes not to be
immediately re-enrolled after deletion.  This went unnoticed because
the cleanup worker acts as a fallback and re-enrolls them.

Correcting this logic will avoid an approximately 60 second delay
in static node reuse.

Change-Id: Ib39e840555bc998058ead9b284908ff2569ebf51
2023-06-26 15:38:58 -07:00
James E. Blair 8c9b157dff Parallelize static startup more
We intend to parallelize the startup of the static driver because
it involves a number of network checks which may slow down startup
if there are a lot of hosts.  However, the current implementation
only does that for half of the startup checks.

When we startup we:

1) Register any nodes defined in the config that don't already
   exist in ZK.
2) Update any nodes that do exist in ZK with any changes from the
   config.

Notably, in all cases at startup we perform steps 1 and 2, so new
nodes get written to ZK twice.  Step 1 is parallelized in an executor,
step 2 is not.

This change makes an update so that both steps are parallelized,
and step 2 is run only if step 1 is not.

Step 1 corresponds to syncNodeCount() and step 2 is
updateNodeFromConfig().

Note that they can't be combined directly because syncNodeCount is
called in other code paths (the delete/reregister path) but
updateNodeFromConfig is only needed at startup (keep in mind that
an online config change is seen as a new startup -- the provider
object is idled and a new provider started).

Change-Id: I80822757a9378138919c9f4157641b4b76014db2
2023-05-23 07:07:07 -07:00
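
A minimal Python sketch of the startup flow described in this commit message, assuming a ThreadPoolExecutor and placeholder helpers standing in for syncNodeCount and updateNodeFromConfig (the surrounding structure is illustrative, not the driver's actual code):

  from concurrent.futures import ThreadPoolExecutor

  def sync_node_count(node):
      # Placeholder for step 1: register a node that is missing from ZK.
      print(f"registering {node}")

  def update_node_from_config(node):
      # Placeholder for step 2: push config changes to a node already in ZK.
      print(f"updating {node}")

  def start_provider(config_nodes, registered, max_workers=8):
      # Run both steps through one executor; run step 2 only when step 1 was
      # not needed, so a new node is written to ZK once instead of twice.
      with ThreadPoolExecutor(max_workers=max_workers) as executor:
          futures = []
          for node in config_nodes:
              if node not in registered:
                  futures.append(executor.submit(sync_node_count, node))
              else:
                  futures.append(executor.submit(update_node_from_config, node))
          for f in futures:
              f.result()  # propagate exceptions from the worker threads

  start_provider(["host-a", "host-b"], registered={"host-b"})
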
Clark Boylan bcfc44b4a4 Correct race in static node reregistration
The static node driver has two paths that reregister nodes. The first
runs to synchronize our internal node representation with the
configuration periodically via the cleanup process. The second is
triggered when a node is deleted to reregister it. Unfortunately,
this second path was not checking appropriately to see if something else
(the cleanup routine) had reregistered a node before it could register
it. The end result was duplicate registrations for a single node.

This logging illustrates what is happening:

  17:22:03,387 [StaticNodeProvider] Static driver cleanup leaked resources beginning
  17:22:03,387 [StaticNodeProvider] Static driver cleanup leaked resources lock acquired
  17:22:03,389 [StaticNodeProvider] Static node deletion handler beginning
  17:22:03,390 [StaticNodeProvider] Provider static-provider pool main syncing nodes with config
  17:22:03,392 [StaticNodeProvider] Registered static node NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,392 [StaticNodeProvider] Provider static-provider pool main syncing nodes with config complete
  17:22:03,392 [StaticNodeProvider] Static driver cleanup leaked resources complete
  17:22:03,392 [StaticNodeProvider] Static node deletion handler lock acquired
  17:22:03,396 [StaticNodeProvider] Re-registering deleted node: NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,398 [StaticNodeProvider] Registered static node NodeTuple(hostname='fake-host-1', username='zuul', port=22022)
  17:22:03,398 [StaticNodeProvider] Static node deletion handler complete

With a simplified view:

    A cleanupLeakedResources begins
  L A cleanupLeakedResources lock acquired
    B nodeDeletedNotification begins
  L A syncNodeCount begins
  L A registerNodeFromConfig
  L A syncNodeCount ends
    A cleanupLeakedResources ends
  L B nodeDeletedNotification lock acquired
  L B nodeDeletedNotification re-registers node
  L B registerNodeFromConfig
    B nodeDeletedNotification ends

To address this, in the deleted node reregistration method we check whether
the node being reregistered has already been updated in the slot info. If so,
that implies the cleanup process has already reregistered it.

Change-Id: I08a77ec34f0a0080d7a6482e9ec9d0620929105e
2022-12-08 10:58:46 -08:00
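
A toy sketch of the check described above, assuming a per-node slot map guarded by a lock (names and data shapes are illustrative):

  import threading

  class SlotTracker:
      # Stand-in for the provider's per-node slot bookkeeping.
      def __init__(self):
          self.lock = threading.Lock()
          self.slots = {}   # node tuple -> id of the currently registered node

      def cleanup_reregister(self, node_tuple, new_id):
          # Path A: the periodic cleanup re-registers the node.
          with self.lock:
              self.slots[node_tuple] = new_id

      def node_deleted(self, node_tuple, deleted_id, new_id):
          # Path B: the deletion notification. Only re-register if the slot
          # still points at the deleted node; if cleanup got there first, skip.
          with self.lock:
              if self.slots.get(node_tuple) != deleted_id:
                  return False
              self.slots[node_tuple] = new_id
              return True

  tracker = SlotTracker()
  key = ("fake-host-1", "zuul", 22022)
  tracker.slots[key] = "0000001"
  tracker.cleanup_reregister(key, "0000002")               # cleanup wins the race
  print(tracker.node_deleted(key, "0000001", "0000003"))   # -> False, no duplicate
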
Clark Boylan 2a231a08c9 Add idle state to driver providers
This change adds an idle state to driver providers which is used to
indicate that the provider should stop performing actions that are not
safe to perform while we bootstrap a second newer version of the
provider to handle a config update.

This is particularly interesting for the static driver because it is
managing all of its state internally to nodepool and not relying on
external cloud systems to track resources. This means it is important
for the static provider to not have an old provider object update
zookeeper at the same time as a new provider object. This was previously
possible and created situations where the resources in zookeeper did
not reflect our local config.

Since all other drivers rely on external state the primary update here
is to the static driver. We simply stop performing config
synchronization if the idle flag is set on a static provider. This will
allow the new provider to take over reflecting the new config
consistently.

Note that we essentially create a system specific to the static driver,
rather than taking other approaches, because we're trying to avoid
modifying the nodepool runtime significantly to fix a problem that is
specific to the static driver.

Change-Id: I93519d0c6f4ddf8a417d837f6ae12a30a55870bb
2022-10-24 15:30:31 -07:00
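
A minimal sketch of how an idle flag can gate the static driver's config synchronization; the method names and structure are assumed for illustration:

  class StaticProviderSketch:
      def __init__(self):
          self._idle = False

      def idle(self):
          # Called before a replacement provider is started for a new config;
          # after this point we must not write node records to ZooKeeper.
          self._idle = True

      def cleanupLeakedResources(self):
          if self._idle:
              # Leave ZooKeeper alone so the new provider can reflect the
              # new config consistently.
              return
          self._sync_nodes_with_config()

      def _sync_nodes_with_config(self):
          print("syncing nodes with config")
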
Clark Boylan 6cfda7de66 Add debug logs to the static driver
We've got a test race (that is probably a real bug) that is difficult to
understand because it implies that the static driver's start() method
may be called more often than expected. To understand when start() is
called and when the two methods that register nodes are called, we add
logging to capture when these steps are taken and not just when they
fail.

Change-Id: I9fea269dc08eef8cecc0a2cf803d2dbe5e05fc0f
2022-10-24 10:56:20 -07:00
James E. Blair 08fdeed241 Add "slots" to static node driver
Add persistent slot numbers for static nodes.

This facilitates avoiding workspace collisions on nodes with
max-parallel-jobs > 1.

Change-Id: I30bbfc79a60b9e15f1255ad001a879521a181294
2022-10-11 07:02:53 -07:00
James E. Blair 1724d3a60f Don't pause the static provider at quota
The static provider avoids accepting requests for a label when
all the nodes of that label are busy.  In other words, it only
accepts requests when it expects to be able to fulfill them.

However, it also checks whether a node is reachable before fulfilling
a request.  And if the reachability of a node changes before a
request is fulfilled, then Nodepool will pause the handler until
the request is fulfilled.  If the node never comes back online,
no other nodes in the provider will be handled.

In general, we should never pause the static provider since pausing
is designed to allow for the accumulation of sufficient free resources
to fulfill the next request.  Static provider resources are not
fungible, so pausing until label A is available doesn't make label
B any less available.

To correct the problem, we no longer pause the static provider when
there is insufficient quota; instead we simply release the current
node request without declining it, and then allow the provider to
either defer it (if the node continues to be unreachable) or accept
it (if the node later becomes available again).

In other words, the handling of a request for a node with connection
problems will now be the same regardless of whether it is detected
on startup or while running.

Change-Id: I61565e9864efcfc81a205318d30972c25d4ea880
2022-09-12 13:02:40 -07:00
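
A hedged sketch of the request-handling change: instead of pausing the provider, the request is unlocked without being declined so it can be deferred or accepted later (the request object here is a stand-in, not the real API):

  class FakeRequest:
      def unlock(self):
          print("request released without being declined")

      def fulfill(self):
          print("request fulfilled")

  def handle_request(request, node_is_reachable):
      # No pause: either fulfill now, or give the request back unharmed.
      if not node_is_reachable:
          request.unlock()
          return "deferred"
      request.fulfill()
      return "fulfilled"

  print(handle_request(FakeRequest(), node_is_reachable=False))
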
Zuul a4acb5644e Merge "Use Zuul-style ZooKeeper connections" 2022-05-23 22:56:54 +00:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Vitaliy Lotorev 6520ae56b6 Log node info when failing to sync static node
In case the host is offline, nodepool throws an exception without mentioning
the hostname:

  ERROR nodepool.driver.static.StaticNodeProvider: Couldn't sync node:
  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/nodepool/driver/static/provider.py", line 427, in cleanupLeakedResources
      self.syncNodeCount(registered, node, pool)
    File "/usr/local/lib/python3.7/site-packages/nodepool/driver/static/provider.py", line 320, in syncNodeCount
      register_cnt, self.provider.name, pool, node)
    File "/usr/local/lib/python3.7/site-packages/nodepool/driver/static/provider.py", line 210, in registerNodeFromConfig
      nodeutils.set_node_ip(node)
    File "/usr/local/lib/python3.7/site-packages/nodepool/nodeutils.py", line 48, in set_node_ip
      addrinfo = socket.getaddrinfo(node.hostname, node.connection_port)[0]
    File "/usr/local/lib/python3.7/socket.py", line 752, in getaddrinfo
      for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
  socket.gaierror: [Errno -2] Name or service not known

Change-Id: I8d9a5b382c4905ce14022f6d6f02f6d323799700
2022-05-03 00:27:54 +03:00
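
A small sketch of the fix, assuming the node identity is threaded into the error log (the function and parameters are illustrative):

  import logging
  import socket

  log = logging.getLogger("nodepool.driver.static")

  def sync_node(hostname, username, port):
      try:
          socket.getaddrinfo(hostname, port)
      except OSError:
          # socket.gaierror is a subclass of OSError; including the node tuple
          # makes it obvious which host is offline, instead of a bare traceback.
          log.exception("Couldn't sync node %s@%s:%s", username, hostname, port)
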
Simon Westphahl 3874d0b5c0 Add support for label quota
This change adds support for deferring requests when a provider temporarily
cannot fulfill a request based on the requested labels. How the label quota
is used depends on the driver.

The reason for introducing the concept of label quota has to do with how
the static driver handled requests when a label wasn't available
immediately. So far Nodepool locked the request and created a node with
the requested label(s) in state BUILDING. When a node with matching
label(s) was returned, the static driver re-registered it using the
waiting nodes in state BUILDING.

The problem with this approach was that as soon as the request was
locked, Zuul could no longer revise the relative priority of the
request. This could lead to later node requests being prioritized
when they had a lower initial relative priority.

To fix this the static provider will now use the label quota to defer
requests that can't be fulfilled with the existing free nodes.

Change-Id: I55f6655c0e34b7a3c136b270658009b4f4d76394
2022-03-31 15:03:41 +02:00
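
A sketch of the deferral decision under label quota, with illustrative data structures; the point is that a request which cannot be served from free nodes is left unlocked, so Zuul can keep revising its relative priority:

  from collections import Counter

  def decide(request_labels, free_nodes_by_label):
      needed = Counter(request_labels)
      for label, count in needed.items():
          if free_nodes_by_label.get(label, 0) < count:
              return "defer"    # leave the request unlocked for now
      return "accept"

  print(decide(["small-node"], {"small-node": 0}))   # -> defer
  print(decide(["small-node"], {"small-node": 2}))   # -> accept
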
James E. Blair 9bcc046ffc Add QuotaSupport to drivers that don't have it
This adds QuotaSupport to all the drivers that don't have it, and
also updates their tests so there is at least one test which exercises
the new tenant quota feature.

Since this is expected to work across all drivers/providers/etc, we
should start including at least rudimentary quota support in every
driver.

Change-Id: I891ade226ba588ecdda835b143b7897bb4425bd8
2022-01-27 10:11:01 -08:00
Zuul 49997a424d Merge "Assign waiting static nodes by request priority" 2021-06-23 18:31:39 +00:00
Simon Westphahl 56acaf51d3 Assign waiting static nodes by request priority
The static provider will create nodes in state "building" in case a
label is currently not available. When a freed-up node can be assigned
to a waiting node, we must use the request priority to decide which node
to update.

Previously, nodepool ordered the waiting nodes by creation time. This can
lead to node requests being fulfilled in the wrong order in certain
cases.

Change-Id: Iae4091b1055d6bb0933f51ce1bbf860e62843206
2021-05-11 14:04:02 +02:00
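
A sketch of choosing which waiting "building" node to update, ordering by the request's relative priority with creation time as a tie-breaker (data shapes are assumed):

  def pick_waiting_node(waiting_nodes, priority_by_request):
      # Lower relative priority wins; creation time only breaks ties.
      return min(
          waiting_nodes,
          key=lambda n: (priority_by_request[n["request_id"]], n["created"]),
      )

  waiting = [
      {"id": "n1", "request_id": "r1", "created": 100},
      {"id": "n2", "request_id": "r2", "created": 200},
  ]
  print(pick_waiting_node(waiting, {"r1": 5, "r2": 1})["id"])   # -> n2
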
James E. Blair 63f38dfd6c Support threadless deletes
The launcher implements deletes using threads, and unlike with
launches, does not give drivers an opportunity to override that
and handle them without threads (as we want to do in the state
machine driver).

To correct this, we move the NodeDeleter class from the launcher
to driver utils, and add a new driver Provider method that returns
the NodeDeleter thread.  This is added in the base Provider class
so all drivers get this behavior by default.

In the state machine driver, we override the method so that instead
of returning a thread, we start a state machine and add it to a list
of state machines that our internal state machine runner thread
should drive.

Change-Id: Iddb7ed23c741824b5727fe2d89c9ddbfc01cd7d7
2021-03-21 14:39:01 -07:00
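
A structural sketch of the change: the base Provider returns a deleter thread by default, and the state machine driver overrides that hook to enqueue a state machine instead (method and class names here are illustrative, not the exact API):

  import threading

  class NodeDeleterThread(threading.Thread):
      def __init__(self, node_id):
          super().__init__(daemon=True)
          self.node_id = node_id

      def run(self):
          print(f"deleting {self.node_id} in a dedicated thread")

  class Provider:
      def start_node_delete(self, node_id):
          # Default: every driver gets one thread per delete.
          t = NodeDeleterThread(node_id)
          t.start()
          return t

  class StateMachineProvider(Provider):
      def __init__(self):
          self.delete_state_machines = []   # driven by one internal runner thread

      def start_node_delete(self, node_id):
          # Override: no per-delete thread; enqueue a state machine instead.
          self.delete_state_machines.append(("delete", node_id))
          return None
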
Albin Vass 0c84b7fa4e Add shell-type config
Ansible needs to know which shell type the node uses in order to operate
correctly, especially for ssh connections to Windows nodes, because
otherwise Ansible defaults to trying bash.

Change-Id: I71abfefa57aaafd88f199be19ee7caa64efda538
2021-03-05 15:14:29 +01:00
Andy Ladjadj 60362d87aa [provider][static] add external_id field like other provider
Add the name of the node in the external_id field to populate the UI's
"Server" column.

Change-Id: Iad37220232d3a44ec65c9be54bfdc50707441264
2020-10-01 10:27:10 +02:00
Tobias Henkel de6eaacc8f Parallelize initial static node synchronization
We currently register static nodes serially on startup of the static
node provider. This is not a problem as long as all static nodes are
reachable. However if there are multiple static nodes unreachable and
fail with timeout the startup of the provider can take very long since
it must wait for many timeouts one after another. In order to handle
this better parallelize the initial sync of static nodes.

Change-Id: I4e9b12de277ef0c3140815fb61fed612be2d9396
2020-04-21 09:44:40 +02:00
David Shrewsbury e389ae2af0 Support node-attributes in static driver
Because the static driver doesn't go through the common driver code
launch process (its nodes are pre-launched), it is in charge of setting
the node attributes itself. It wasn't setting the node-attributes attribute.

Change-Id: I865c3b15711f8c5559964859db92cb4499b901ae
2020-03-24 11:07:29 -04:00
Simon Westphahl 568a8a02ca Cleanup exception logging in static provider
The static provider was logging a lot of exceptions with tracebacks in
the case of unreachable nodes.

Since offline nodes are part of the normal operation (e.g. for
maintenance) and we don't gain any additional insights from the
traceback, the log level was reduced to warning on StaticNodeErrors.

The warning messages will still log the exception message, so we don't
lose any important information. In the case of other exceptions we will
still print the full traceback.

Change-Id: I5a6035a090acac2edb8529e91be63995feceb23d
2020-01-20 07:14:15 +01:00
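
A sketch of the logging policy described above, with a stand-in exception type for the driver's StaticNodeError:

  import logging

  log = logging.getLogger("nodepool.driver.static")

  class StaticNodeError(Exception):
      pass   # stand-in for the driver's connection/liveness error

  def check_node(node, probe):
      try:
          probe(node)
      except StaticNodeError as e:
          # Offline nodes are part of normal operation (e.g. maintenance):
          # log the message only, no traceback.
          log.warning("Static node %s failed check: %s", node, e)
      except Exception:
          # Anything unexpected still gets the full traceback.
          log.exception("Static node %s failed check", node)
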
Simon Westphahl b0c0d27384 Fix missing node tuple arg in static provider
Bug was introduced in https://review.opendev.org/#/c/701969/

Change-Id: Id32361a3c45c9e2c6fe6439c62a85952fe302c73
2020-01-14 09:37:30 +01:00
Simon Westphahl 40c703b906 Always identify static nodes by node tuple
In some places in the static provider the node was identified by
hostname instead of the node tuple (hostname, username, port).

In case a node is registered with the same name multiple times using
different users and/or ports, this caused problems in the code paths
that were expecting the hostname to be unique.

This was the case in the liveness check where the provider e.g. did not
check different ports. Also the re-registration of deleted nodes was
sometimes not using the correct node config.

Log messages were modified to log the node tuple instead of the
hostname.

Change-Id: I7ead7e9e6903249449d350dc35deffaf98171304
2020-01-10 15:08:26 +01:00
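
A sketch of keying the registry on the full node tuple rather than the hostname; the NodeTuple shape matches the log excerpts above, while the registry itself is illustrative:

  from typing import NamedTuple

  class NodeTuple(NamedTuple):
      hostname: str
      username: str
      port: int

  registered = {}

  def register(hostname, username, port, node_id):
      key = NodeTuple(hostname, username, port)   # not just the hostname
      registered.setdefault(key, []).append(node_id)

  # The same host on two ports is tracked as two distinct static nodes.
  register("fake-host-1", "zuul", 22, "0001")
  register("fake-host-1", "zuul", 22022, "0002")
  print(len(registered))   # -> 2
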
Simon Westphahl 975f3523a8 Fix register race condition in static provider
The fix for a data race in https://review.opendev.org/#/c/687261/
introduced another race condition since nodes are now (re-)registered
from multiple threads.

We have to introduce a lock to avoid data races between the cleanup and
node deleted worker.

Change-Id: Icd2cb3da82ce05a41b63bee5e76ef6406b59f12f
2019-10-17 09:49:00 +02:00
Simon Westphahl 6afe2ece8d Sort waiting static nodes by creation time
Zookeeper doesn't guarantee any ordering, so we need to make sure the
waiting nodes are sorted by creation time. This way waiting nodes are
fulfilled in order.

Change-Id: I440acd14d656105c91567f996ca31e0247fa3752
2019-10-10 14:09:00 +02:00
Simon Westphahl f5c509ae1d Don't touch static nodes that are allocated
The method for retrieving registered ready nodes from Zookeeper was not
checking if a node was allocated. This could lead to reconfiguration,
deletion or reassignment of allocated READY nodes.

Change-Id: I53202a03ea7bf8449a2404b927899eb3e868a19f
2019-10-10 14:09:00 +02:00
Simon Westphahl d7ac31b5cd Assign static 'building' nodes in cleanup handler
In certain cases there can be a race between a node being re-registered
and a request handler creating a new 'building' node. This will lead to
starvation of the 'building' node until a node with a matching label
is re-registered again (if at all).

To avoid this, the cleanup handler will check if there are any ready
nodes that can be assigned to nodes waiting for a matching label.

Change-Id: Ifc7fe2c59d5c4393c462f508ed449736ae145371
2019-10-08 12:19:09 +02:00
Tristan Cacqueray 9694382c47 static: add host-key-checking toggle
This change adds a static node toggle to disable checkHost and
nodescan.

Change-Id: If3aefb81624b573faf1129bb82d46ee848ff27cf
2019-07-19 19:34:18 +00:00
Simon Westphahl cb03f8b280 Don't pause static pool on single label quota
Nodes in a static pool can potentially have multiple labels. The current
implementation pauses any further request processing in case a label
could not be served immediately.

With the proposed change we will re-use the existing launch logic to
create a placeholder node in BUILDING state. As soon as a used node is
re-registered, this placeholder node will be used and the pending request
fulfilled.

This allows having a single pool for all static nodes, which prevents
orphaned Zookeeper entries. That can happen when e.g. a label specific
pool is removed or renamed.

Change-Id: If69cbeb33b8abb4f8e36146b9d849403dccb7d32
2019-07-12 14:33:36 +02:00
Tristan Cacqueray 552140037f static: enable using a single host with different user or port
This change modifies the static driver registering logic to take
into account the username and connection-port when identifying
a unique node, so that a single hostname can be registered multiple
times with a different user or port.

Change-Id: I8f770e13c0d00d65094abccee30e54ec0e6371b6
2019-06-13 00:01:47 +00:00
Tristan Cacqueray 720dcec886 static: fix missing python-path update
This change fixes a missing python-path node attribute update for
the static node provider.

Change-Id: I4a0b8dfd7e134c635acbf52d63e565638565cdbb
2019-05-13 09:54:08 +00:00
Tristan Cacqueray 76aa62230c Add python-path option to node
This change adds a new python_path Node attribute so that zuul executor
can remove the default hard-coded ansible_python_interpreter.

Change-Id: Iddf2cc6b2df579636ec39b091edcfe85a4a4ed10
2019-05-07 02:22:45 +00:00
Paul Belanger 3b893e3c28 Gather host keys for connection-type network_cli
The network_cli connection in Ansible is built on top of paramiko,
which then uses SSH as the transport. This means hosts using this
connection should also have their host keys collected.

Change-Id: I0383227374bdc1a9b0cecc255eabcb8a8d097b09
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2019-04-15 16:40:39 -04:00
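
A sketch of the resulting connection-type check, combining this change with the earlier "don't gather host keys for non ssh connections" behavior further down the log (the helper name is illustrative):

  def should_scan_host_keys(connection_type):
      # winrm and other non-SSH transports are skipped; network_cli rides on
      # paramiko/SSH, so it is treated like ssh here.
      return connection_type in ("ssh", "network_cli")

  print(should_scan_host_keys("network_cli"))   # True
  print(should_scan_host_keys("winrm"))         # False
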
Tobias Henkel bcaa264712 Default host_keys to empty list in static driver
Everywhere in nodepool we default the host_keys to an empty list and
zuul relies on this behavior. However the static driver sets this to
None for non-ssh connections. Change this to return an empty list to
match the behavior of the other drivers.

Change-Id: I3f14803c35a2039d2c53f4939857a531ce900097
2019-01-09 14:08:07 +01:00
Tobias Henkel b6a3e319e7 Update node during lockNode
After locking a node we usually update it to double-check whether the
state changed on us. This is important so we don't operate on stale
data, as we only have a guarantee that the node data is current if we
update it after getting the lock (regardless of whether we operate on
cached data or not). However, it is cumbersome to do this after every
lockNode call, so doing it centrally reduces the risk of forgetting it.

This does not introduce any behavior change.

Change-Id: I06001e487041a3b67d969070b43f491c4ed3dce0
2018-11-26 19:37:41 +01:00
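
A sketch of centralizing the refresh inside lockNode so callers never operate on stale data (the class and method bodies are illustrative):

  class ZooKeeperSketch:
      def lockNode(self, node, blocking=True, timeout=None):
          self._acquire_lock(node, blocking, timeout)
          # Refresh after locking: the data is only guaranteed current if it
          # is (re-)read while holding the lock.
          self.updateNode(node)

      def _acquire_lock(self, node, blocking, timeout):
          print(f"locked {node}")

      def updateNode(self, node):
          print(f"refreshed {node} from ZooKeeper")
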
Simon Westphahl a217373948 Implement liveness check for static nodes
The node liveness will be checked during request handling to make sure
no dead nodes are assigned. Unreachable nodes will be deregistered.

Change-Id: I0c1c9adbaec07419026239746603eba9234b4c8e
2018-09-20 16:21:56 +02:00
Simon Westphahl e0919ab2cd Re-register missing nodes in static driver
The cleanup hook will take care of re-registering previously offline
nodes (e.g. because of a failed liveness check).

In addition the current state is now always checked in Zookeeper by
removing the local dict with static nodes.

Change-Id: Iae4488dcfb33d196953ae6f1b166b9dd3f81d998
2018-09-20 16:21:19 +02:00
Simon Westphahl 4c5d75c9e8 Fix missing node state refresh in static driver
Change-Id: Ie60ae37f40c149120fa5ed0c74e1722bf038c3f9
2018-09-05 16:22:38 +02:00
Simon Westphahl 55e70f7636 Improve static provider to check non-ssh hosts
This behavior is now in line with how the OpenStack driver is checking
hosts with a non-ssh connection type.

Change-Id: Ica553430d1d64fe69f5a8ccc9d4b453c7f21dac8
2018-09-05 12:38:00 +02:00
Simon Westphahl d4485fbad0 Update static nodes in Zookeeper on config change
Config changes of existing nodes were not updated in Zookeeper, leading
to inconsistent states.

Changing a label or connection related attributes is now reflected
without the workaround of first removing and re-adding the node.

Change-Id: Ia6278512c2e81a720a957fe269d1ae0522860b8f
2018-09-03 08:04:50 +02:00
David Shrewsbury ac11c4592d Static driver: fix node registration
After fulfilling a node request from zuul, it was possible for the
static driver to not have an accurate representation of the nodes
that were registered in ZK. It needs to re-read from ZK on each
node deletion notification.

Change-Id: I82e6934edabcb0c720e15593fb38e8fcfafa121c
2018-06-25 09:25:45 -04:00
David Shrewsbury 28d8dde0b1 Remove a done TODO in static driver
Change-Id: I5b1cc98d64c4233306fc0dbb7806f92a957eff6e
2018-06-21 14:01:19 -04:00
David Shrewsbury 9de68b83dd Pre-register static nodes
This changes the static driver to register nodes from its config
file with ZooKeeper at startup and when the config file changes.
A node is registered max-parallel-jobs times with ZooKeeper (default
is 1).

Once a node is deleted, it is immediately re-registered, unless it
no longer exists in our config, or it's not needed based on a change
to the max-parallel-jobs count.

Change-Id: Ie386e440c2016af390087d109f7e5211c4fd968d
2018-06-13 12:13:40 -04:00
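
A toy sketch of registering each configured node max-parallel-jobs times (default 1), topping up only the missing registrations (names and data shapes are assumed):

  def register_node_from_config(node, max_parallel_jobs, registered):
      current = registered.setdefault(node, [])
      missing = max_parallel_jobs - len(current)
      for _ in range(max(missing, 0)):
          current.append(object())   # one ZooKeeper node record per slot

  registered = {}
  register_node_from_config(("host-1", "zuul", 22), 2, registered)
  print(len(registered[("host-1", "zuul", 22)]))   # -> 2
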
David Shrewsbury a418aabb7a Pass zk connection to ProviderManager.start()
In order to support static node pre-registration, we need to give
the provider manager the opportunity to register/deregister any
nodes in its configuration file when it starts (on startup or when
the config change). It will need a ZooKeeper connection to do this.
The OpenStack driver will ignore this parameter.

Change-Id: Idd00286b2577921b3fe5b55e8f13a27f2fbde5d6
2018-06-12 12:04:16 -04:00
Zuul 7eeefebbd4 Merge "Log connection port in static driver on timeout" 2018-06-11 15:39:43 +00:00
James E. Blair ae0f54abeb Directly link providers and request handlers
Rather than relying on inspection and loading classes from specific
filenames, ask each driver's Provider implementation to supply the
NodeRequestHandler instance for requests.

This is the first in a series of changes to remove all similar
indirection from the drivers in order to make them easier to
understand.

The test driver is moved out of the fixtures subdir to be a
"regular" driver because it now needs relative imports (provider
code needs to import the handler code).

Change-Id: Iad60d8bfe102d43b9fb1a1ff3ede13e4fa753572
2018-06-04 11:53:44 -04:00
Tobias Henkel 07d1cdc2d4 Log connection port in static driver on timeout
When there is a timeout checking a static node we should also log the
connection-port.

Change-Id: I7e6291e18d6fc714d593b538a47b6849c9a99c0f
2018-05-18 07:59:12 +02:00
Monty Taylor da95a817bb Support winrm hosts in static driver
The static driver currently assumes ssh connectivity. Add a
connection-type parameter and rename the ssh-port to connection-port to
match the diskimages setting name.

Keep the old 'ssh-port' setting for backwards compat.

Change-Id: I1a96f03f9845b0d99d9ce89d2213be4d483afdd9
2018-04-13 11:36:58 -05:00
Tobias Henkel 2da274e2ae Don't gather host keys for non ssh connections
In the case of an image with the winrm connection type we cannot scan
the ssh host keys. So when the connection type is not ssh we need to
skip gathering the host keys.

Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
2018-04-03 17:31:45 +02:00
Tristan Cacqueray 6ac2f33cb3 Implement a static driver for Nodepool
This change adds a static node driver.

Change-Id: I065f2b42af49a57218f1c04a08b3ddd15ccc0832
Story: 2001044
Task: 4615
2018-01-31 03:55:56 +00:00