Commit Graph

29 Commits

Author SHA1 Message Date
James E. Blair 7262ef7f6f Include job_uuid in NodeRequests
This is part of the circular dependency refactor.  It updates the
NodeRequest object to include the job_uuid in addition to the job_name
(which is temporarily kept for backwards compatibility).  When node
requests are completed, we now look up the job by uuid if supplied.
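
A minimal sketch of the lookup order (the classes here are simplified
stand-ins for illustration, not Zuul's actual implementations):

  # Illustrative sketch: prefer the uuid lookup; fall back to the
  # (temporary) job_name for backwards compatibility.
  from dataclasses import dataclass
  from typing import List, Optional

  @dataclass
  class Job:
      uuid: str
      name: str

  @dataclass
  class NodeRequest:
      job_uuid: Optional[str]
      job_name: str  # kept temporarily for backwards compatibility

  def find_job(jobs: List[Job], request: NodeRequest) -> Optional[Job]:
      if request.job_uuid is not None:
          return next(
              (j for j in jobs if j.uuid == request.job_uuid), None)
      return next((j for j in jobs if j.name == request.job_name), None)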

Change-Id: I57d4ab6c241b03f76f80346b5567600e1692947a
2023-12-20 10:44:04 -08:00
James E. Blair 80f2d5ae7d Fix race in test_node_request_canceled
Even though our nodepool is paused, it still sees events and so adds
them to the provisioned_requests list.  Previously, this test only
passed when it ran quickly enough to beat the arrival of the event.

To make this safer, allow the event to arrive and assert that even
after the request is deleted, we still only saw one request and
it remains in the 'requested' state (i.e., it was never fulfilled after
we canceled it).
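
Roughly, the safer pattern looks like this (the helper names are
hypothetical; the real test uses Zuul's test fixtures):

  # Hypothetical sketch: instead of racing the event, let it arrive,
  # then assert the canceled request was never fulfilled.
  def test_node_request_canceled(self):
      request = self.submitNodeRequest()   # hypothetical helper
      self.cancelNodeRequest(request)      # hypothetical helper
      self.waitForNodepoolEvents()         # let the event arrive
      seen = self.fake_nodepool.provisioned_requests
      self.assertEqual(1, len(seen))
      self.assertEqual('requested', seen[0]['state'])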

Change-Id: I92996842e1205c7cde880c20d3350c05b2ca1edf
2021-10-08 09:28:50 -07:00
Felix Edel 38776452bb Don't use the AnsibleJob in the nodepool client
This change follows up on a few TODOs left by the lock/unlock nodes on
executor change.

When locking the nodes on the executor we used the AnsibleJob as a
replacement for the old build parameter that was provided to the
nodepool client methods as they were originally called by the scheduler.

However, the AnsibleJob class should only be used internally by the
executor, so we now provide all parameters directly to the nodepool
methods.

This also annotates the logger in the updated nodepool client methods
and fixes an outdated method signature in
test_scheduler.TestSemaphore.test_semaphore_zk_error.

Remove two comments about storing timestamps on the build request in
ZooKeeper as this doesn't make much sense. It sounded like a good idea
in the beginning, but with the current solution, the scheduler doesn't
need to care about the build request anymore after it was submitted
(except for canceling/cleanup purposes) and the result data is
self-contained.

Change-Id: I2d1005f69904c6ace8f79523133f382af0024c52
2021-09-10 10:55:01 -07:00
James E. Blair aee6ef6f7f Report nodepool resource stats gauges in scheduler
We currently report nodepool resource usage whenever we use or return
nodes.  This now happens on the executors, and they don't have a
global view of all nodes used.  The schedulers do, and they already
have a periodic stats reporting method.

Shift the reporting of node resource gauges to the scheduler.  To make
this efficient, use a tree cache for nodes.  Because node records
alone don't have enough information to tie them back to a tenant or
project, use the new user_data field on the Node object to store that
info when we mark a node in use.  Also, store the zuul system id on
the node, so that we can ensure we're only reporting nodes that belong
to us.

Update the node list in the REST API to use the cache as well, and
also filter its results by zuul system id and tenant.
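
A rough sketch of the reporting idea, assuming a local node cache and
a statsd-style client (all names here are illustrative):

  # Illustrative sketch: report resource gauges only for nodes that
  # belong to this Zuul system, attributing them to tenants via the
  # user_data recorded when the node was marked in use.
  def report_node_resource_gauges(statsd, cached_nodes, zuul_system_id):
      totals = {}  # tenant name -> {resource name -> amount}
      for node in cached_nodes:
          if node.get('zuul_system') != zuul_system_id:
              continue  # not one of ours
          tenant = (node.get('user_data') or {}).get('tenant_name')
          if tenant is None:
              continue
          for key, value in (node.get('resources') or {}).items():
              tenant_totals = totals.setdefault(tenant, {})
              tenant_totals[key] = tenant_totals.get(key, 0) + value
      for tenant, resources in totals.items():
          for key, value in resources.items():
              statsd.gauge(
                  f'zuul.nodepool.resources.tenant.{tenant}.{key}', value)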

Depends-On: https://review.opendev.org/807362
Change-Id: I9d0987b250b8fb54b3b937c86db327d255e54abd
2021-09-10 10:54:59 -07:00
James E. Blair b41f467340 Remove internal nodepool request cache
The internal zuul.nodepool.Nodepool.requests dictionary is used so
the scheduler can keep track of its requests.  Since we will have
multiple schedulers emitting requests, we can't use that any more.
Remove any remaining uses of it.

The NodeRequest uid was only used to index that dictionary (and
was used to persist a request across resubmission).  Since it isn't
needed any more, it is removed.

Change-Id: I7c82485d95979c6c9a246c3dc3954bae3c65ac13
2021-09-10 10:53:47 -07:00
James E. Blair 514f62ea31 Refactor the checkNodeRequest method
We perform some checks which aren't necessary any more.  This
method is better thought of as a method of getting a nodeset from
a fulfilled node request, so update it accordingly.

Change-Id: I1113820115af68b706b6fe06d6d03cd35ae6b382
2021-09-10 08:46:42 -07:00
Felix Edel 4e2985638c Add node request cache to zk nodepool interface
This adds a TreeCache to the ZK nodepool interface; it's nearly
identical to the one on the nodepool side.
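
For flavor, kazoo's TreeCache recipe keeps a local, watch-driven
mirror of a ZooKeeper subtree; a minimal standalone example (Zuul's
cache layers its own deserialization and bookkeeping on top):

  # Minimal kazoo TreeCache example; prints subtree changes as the
  # cache observes them.
  from kazoo.client import KazooClient
  from kazoo.recipe.cache import TreeCache, TreeEvent

  client = KazooClient(hosts='localhost:2181')
  client.start()

  cache = TreeCache(client, '/nodepool/requests')

  def listener(event):
      if event.event_type in (TreeEvent.NODE_ADDED,
                              TreeEvent.NODE_UPDATED,
                              TreeEvent.NODE_REMOVED):
          print(event.event_type, event.event_data.path)

  cache.listen(listener)
  cache.start()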

Co-Authored-By: James E. Blair <jim@acmegating.com>
Change-Id: Ie972c397cf235d637619d1e40c5e7ff78431ac0d
2021-09-06 15:26:39 -07:00
James E. Blair e225a28fa5 Make node requests persistent
The original Nodepool protocol specified that node requests should
be ephemeral; that way, if the requestor crashed before accepting
the nodes, the request would automatically be cleaned up and the
nodes returned.  This doesn't comport with multiple schedulers, as
we will soon expect schedulers to stop and start routinely while
we want the node requests they spawn to persist and be handled by
other schedulers.

Fortunately, Nodepool doesn't really care if the request is
ephemeral or not.  So we'll drop the "ephemeral" flag.

But in the short term, we will be stopping the scheduler and that
will leave orphan node requests.  And even in the long term, we
may have a complete Zuul system shutdown or even a bug which may
leak node requests, so we still need a way of deleting node requests
which don't belong.  To handle that, we add a cleanup routine, run
immediately on startup and hourly thereafter, which looks for node
requests created by this Zuul system that don't correspond to any
queue entries.  We create a new UUID to identify the Zuul system
and store it in ZK (so that if Nodepool has any other users we
don't delete their requests).
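
In outline, the cleanup might look like this (the ZK accessors are
hypothetical placeholders for Zuul's actual API):

  # Illustrative outline: delete any request created by this Zuul
  # system that no longer corresponds to a queue entry.
  def cleanup_node_requests(zk, zuul_system_id, outstanding_ids):
      for request_id in zk.getNodeRequestIds():     # hypothetical API
          request = zk.getNodeRequest(request_id)   # hypothetical API
          if request is None:
              continue
          if request.requestor != zuul_system_id:
              continue  # someone else's request; leave it alone
          if request_id in outstanding_ids:
              continue  # still referenced by a queue item
          zk.deleteNodeRequest(request_id)          # hypothetical API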

We no longer need to resubmit requests on connection loss, so tests
addressing that behavior are removed.

Change-Id: Ie22e99ef71cbe6b31d40c25a21498c1e867ca777
2021-09-03 16:17:15 -07:00
James E. Blair dbe13ce076 Remove nodeset from NodeRequest
To make things simpler for schedulers to handle node provisioned
events for node requests which they may not have in their local
pipeline state, we need to make the pipeline storage of node requests
simpler.  That starts by removing the nodeset object as an attribute
of the NodeRequest object.  This means that the scheduler can work
with a node request object without relying on having the associated
nodeset.  It also simplifies the ZooKeeper code that deserializes
NodeRequests (as it doesn't have to create fake NodeSet objects too).
And finally, it simplifies what must be stored in the pipeline
and queue item structures, which will also come in handy later.

Two tests designed to verify that the request->nodeset magic
deserialization worked have been removed since they are no longer
applicable.

Change-Id: I70ae083765d5cd9a4fd1afc2442bf22d6c52ba0b
2021-09-02 09:29:44 -07:00
James E. Blair d87a9a8b8f Clear nodeset when re-submitting node requests
We encountered an issue where Zuul:

* submitted a node request
* nodepool fulfilled it
* zuul received the ZK watch and refreshed the NodeRequest object
* zuul submitted the node provisioned event to the event queue
* ZK was disconnected/reconnected
* zuul processed the node provisioned event
* zuul found the node request no longer existed (because it's ephemeral)
* zuul resubmitted the node request

Because the NodeRequest object had the provisioned node information
attached to it, the re-submitted request was created with an
existing 'nodes' list.  Nodepool appended to that list and fulfilled
the new request (which requested 1 but received 2 nodes).  This caused
an exception in Zuul's nodepool request watch callback, which caused
Zuul to ignore that and all future updates to the node request.

To address this, we make a new copy of the nodeset without any allocated
node info when re-submitting a request.
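
Conceptually, the fix is something like this (getNodes() and the node
attributes are assumptions for illustration):

  # Conceptual sketch: resubmit with a copy of the nodeset containing
  # only the requested labels, discarding allocated node info.
  import copy

  def reset_nodeset_for_resubmit(nodeset):
      fresh = copy.deepcopy(nodeset)
      for node in fresh.getNodes():  # assumed accessor
          node.id = None             # drop the allocated node id
      return fresh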

This contains an unrelated change to the event id handling from an earlier
revision; it is kept because it will simplify future changes which eliminate
the node request cache altogether.

Change-Id: I72f5ed7ad53e44d77b37870546daf61b8a4e7e09
2021-08-04 12:20:11 -07:00
Felix Edel 8f8f4f1898 Switch to ZooKeeper backed NodesProvisionedEvents
This puts the NodesProvisionedEvents into ZooKeeper. With this, all
result events are now in ZooKeeper.

To make the NodesProvisionedEvent serializable, we cannot store the
whole NodeRequest object in the event anymore. Instead we are using the
request.id, job name and the buildset UUID to look up the corresponding
NodeRequest object from the buildset when the NodesProvisionedEvent is
handled.
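
The serializable event then carries only references, roughly:

  # Rough sketch: the event references the request by id instead of
  # embedding the whole NodeRequest object.
  import json

  class NodesProvisionedEvent:
      def __init__(self, request_id, job_name, buildset_uuid):
          self.request_id = request_id
          self.job_name = job_name
          self.buildset_uuid = buildset_uuid

      def toDict(self):
          return {'request_id': self.request_id,
                  'job_name': self.job_name,
                  'buildset_uuid': self.buildset_uuid}

      @classmethod
      def fromDict(cls, data):
          return cls(data['request_id'], data['job_name'],
                     data['buildset_uuid'])

      def serialize(self):
          return json.dumps(self.toDict()).encode('utf8')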

As a result, we have to find a way to return the NodeSet even in
cases where the NodeRequest or the buildset can no longer be found (e.g. due
to a gate reset). In this case we look up the NodeRequest directly from
ZooKeeper and provide a faked NodeSet which allows us to retrieve the
node information from Zookeeper (via the update mechanism of the
NodeRequest in the Nodepool client).

Finally, we can get rid of the local result event queue in the scheduler
as now all result events are in ZooKeeper. Together with that the
`zuul.scheduler.eventqueues.result` gauge was also removed.

Change-Id: Ib5e0f13d25a21ebad908d38f0201e92b704a1c85
2021-07-13 14:57:01 -07:00
Felix Edel 040c5c8032 Move parent provider determination to pipeline manager
Moving the parent provider determination into the pipeline manager
allows us to remove the buildset and job objects from the NodeRequest
constructor. This way we can fully serialize the NodeRequest to
ZooKeeper and restore it without missing important information.

This also has an impact on the NodeRequest's priority property. As this
is only needed to determine the znode path when the NodeRequest is
submitted, we can provide it directly as a parameter to the
submitNodeRequest call (and the related update callbacks).

To ensure that Nodepool doesn't strip this additional information when
it fulfills the NodeRequest, we use the new "requestor_data" field which
is implemented in [1].
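
The round trip via requestor_data might look roughly like this (the
field names are assumptions for illustration):

  # Illustrative sketch: stash identifying info in requestor_data so
  # nodepool hands it back untouched when it fulfills the request.
  def request_to_zk_dict(request):
      return {
          'node_types': [n.label for n in request.nodes],  # assumed
          'requestor': request.requestor,
          'requestor_data': {
              'buildset_uuid': request.buildset_uuid,
              'tenant_name': request.tenant_name,
              'pipeline_name': request.pipeline_name,
              'job_name': request.job_name,
          },
      }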

To make this work, we also have to look up the buildset by its UUID from
the active tenants and pipelines when the NodesProvisioned event is
handled in the scheduler. Something similar was already done for
handling the other result events as well.

[1]: https://review.opendev.org/c/zuul/nodepool/+/798746/

Depends-On: https://review.opendev.org/c/zuul/nodepool/+/798746/
Change-Id: Id794643dcf26b0565499d20adba99d3b0518fdf1
2021-07-08 13:27:08 -07:00
Felix Edel fee46c25bc Lock/unlock nodes on executor server
Currently, the nodes are locked in the scheduler/pipeline manager before
the actual build is created in the executor client. When the nodes are
locked, the corresponding NodeRequest is also deleted.

With this change, the executor will lock the nodes directly before
starting the build and unlock them when the build is completed.

To keep the order of events intact, the nodepool.acceptNodes() method is
split up into two:
    1. nodepool.acceptNodeRequest() does most of the old acceptNodes()
       method except for locking the nodes and deleting the node
       request. It is called on the scheduler side when the
       NodesProvisionedEvent is handled (which is also where
       acceptNodes() was previously called).
    2. nodepool.acceptNodes() is now called on the executor side when
       the job is started. It locks the nodes and deletes the node
       request in ZooKeeper.
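
Schematically, the split looks like this (heavily condensed; the real
methods do much more):

  # Condensed schematic of the split described above.
  class Nodepool:
      def acceptNodeRequest(self, request):
          # Scheduler side: runs when the NodesProvisionedEvent is
          # handled.  Bookkeeping only; nodes are no longer locked
          # and the request is no longer deleted here.
          request.accepted = True

      def acceptNodes(self, request):
          # Executor side: runs when the job starts.
          self.lockNodes(request)
          self.deleteNodeRequest(request)

      def lockNodes(self, request):
          pass  # placeholder for the ZooKeeper node-locking logic

      def deleteNodeRequest(self, request):
          pass  # placeholder for the ZooKeeper request deletion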

Finally, it's also necessary to move the autohold processing to the
executor, as this requires a lock on the node. To allow processing of
autoholds, the executor now also determines the build attempts and sets
the RETRY_LIMIT result if necessary.

Change-Id: I7392ce47e84dcfb8079c16e34e0ed2062ebf4136
2021-07-01 05:46:02 +00:00
Felix Edel ba7f81be2d Provide statsd client to Nodepool and make scheduler optional
To lock/unlock the nodes directly in the executor server, we have to
make the Nodepool API work without a scheduler instance.

To keep the stats emitting intact, we provide a statsd client directly
to the Nodepool instance.

This leaves only one place where the scheduler is used in the Nodepool
class, which is the onNodesProvisioned() callback.
This callback won't be necessary anymore when the nodes are locked on
the executor and thus this function call and the scheduler parameter
itself can be removed.
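
The decoupling, in sketch form (the constructor shape is illustrative):

  # Illustrative sketch: Nodepool takes a statsd client directly, and
  # the scheduler becomes optional (only the onNodesProvisioned()
  # callback still needs it, and that goes away later).
  class Nodepool:
      def __init__(self, zk_client, statsd=None, scheduler=None):
          self.zk_client = zk_client
          self.statsd = statsd
          self.scheduler = scheduler

      def emitStats(self, request):
          if self.statsd:
              self.statsd.incr('zuul.nodepool.requests.fulfilled')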

Change-Id: I3f3e4bfff08e244f68a9be7c6a4efcc194a23332
2021-04-30 12:12:28 +02:00
Felix Edel 71a990d42a Move setupZK() helper function to BaseTestCase class
As we are going to move initialization of the ZooKeeper connection to
the server classes rather than the cmd classes, this should help to
provide the necessary parameters to the various components in the test
cases.

This also removes some duplicate code as the setupZK() functionality can
now also  be used by other TestClasses which don't inherit from
ZuulTestCase.

Change-Id: Iee64b7173ceb78967e48dcd6f51bc998ce9469c5
2021-03-08 06:50:03 -08:00
James E. Blair 74a9c9de9b Use ZooKeeper TLS in tests
This mirrors the configuration in Nodepool for using TLS-enabled
ZooKeeper in tests.  We use the ensure-zookeeper role in order
to get a newer ZooKeeper than is supplied in bionic.

Change-Id: I14413fccbc9a6a7a75b6233d667e2a1d2856d894
2021-03-08 06:49:57 -08:00
Felix Edel b4d8a4e74b Simplify ZooKeeper client initialization
The ZooKeeperClient now provides a fromConfig() method that parses all
necessary configuration values to instantiate a ZooKeeperClient.
Previously, this needed to be done in every component to initialize the
connection to ZooKeeper.
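
The pattern is a classmethod factory, roughly like this (the option
names mirror zuul.conf's [zookeeper] section; details are
illustrative):

  # Rough sketch: parse the connection settings in one place instead
  # of repeating the parsing in every component.
  class ZooKeeperClient:
      def __init__(self, hosts, tls_cert=None, tls_key=None,
                   tls_ca=None):
          self.hosts = hosts
          self.tls_cert = tls_cert
          self.tls_key = tls_key
          self.tls_ca = tls_ca

      @classmethod
      def fromConfig(cls, config):
          # 'config' is a parsed zuul.conf (configparser-style object)
          return cls(
              hosts=config.get('zookeeper', 'hosts'),
              tls_cert=config.get('zookeeper', 'tls_cert', fallback=None),
              tls_key=config.get('zookeeper', 'tls_key', fallback=None),
              tls_ca=config.get('zookeeper', 'tls_ca', fallback=None),
          )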

Change-Id: I5fa4ddab5f85c658291f1262ee0392a60086846e
2021-02-21 07:41:43 -08:00
Jan Kubovy d518e56208 Prepare Zookeeper for scale-out scheduler
This change is a common root for other ZooKeeper-related changes
regarding the scale-out scheduler. With ZooKeeper becoming a central
component, "maxClientCnxns" must be increased.

Since the ZooKeeper class is expected to grow significantly (ZooKeeper
is becoming a central part of Zuul), the ZooKeeper class (zk.py) is
split into a zk module here to avoid the current god class.

Also, the zookeeper log is copied to the "zuul_output_dir".

Change-Id: I714c06052b5e17269a6964892ad53b48cf65db19
Story: 2007192
2021-02-15 14:44:18 +01:00
David Shrewsbury f6b6991af2 Add caching of autohold requests
Change-Id: I94d4a0d2e8630d360ad7c5d07690b6ed33b22f75
2019-09-16 10:46:36 -04:00
James E. Blair 0b00c4685b Set relative priority of node requests
Add a relative_priority field to node requests and continuously
adjust it for each queue item based on the contents of queues.

This allows for a more fair distribution of build resources between
different projects.  The first item in a pipeline from a given
project (or, in the case of a dependent pipeline, group of projects)
has equal priority to all other first items of other projects in
the same pipeline.  Second items have a lower priority, etc.
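
The idea in miniature (attribute names are illustrative; Zuul
recomputes this continuously as queues change):

  # Miniature illustration: an item's relative priority is the number
  # of items ahead of it in the queue from the same project (or
  # project group, in a dependent pipeline).  Lower is better.
  def relative_priority(item, queue_items):
      ahead = 0
      for other in queue_items:
          if other is item:
              break
          if other.project_key == item.project_key:  # assumed attribute
              ahead += 1
      return ahead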

Depends-On: https://review.openstack.org/620954
Change-Id: Id3799aeb2cec6d96a662bfa394a538050f7ea947
2018-11-30 12:50:34 +01:00
Paul Belanger ecb0b84f11 Add support for shared ansible_host in inventory
Today it is possible to create the following ansible inventory file:

  [foo]
  foo01 ansible_host=192.168.1.1

  [bar]
  bar01 ansible_host=192.168.1.1

Which allows a user to create multiple host aliases for a single
connection. This could be done with Ansible groups; however, there are
some functional differences in how Ansible runs in that configuration.

We could also request 2 nodes from nodepool; however, in this case it
would be a waste of CI resources because every alias would need a new
node from nodepool.

Now, a user is able to alias multiple host names to a single node from
nodepool by doing the following:

  nodeset:
    nodes:
      - name:
          - foo
          - bar
        label: ubuntu-xenial

This would result in a single node request from nodepool, but create
an inventory file with 2 aliases sharing a single ansible_host
variable.
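
The generated inventory would then contain something like:

  foo ansible_host=192.168.1.1
  bar ansible_host=192.168.1.1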

Change-Id: I674d6baac26852ee1503feb1ed16c279bf773688
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-11-18 16:22:01 -05:00
James E. Blair 4f1731ba86 Emit some nodepool stats
Change-Id: I7bc3914e8b8d64afee061c002dcc9cca5dd1ef4d
2017-10-13 15:56:59 -07:00
David Shrewsbury 94e95886e2 Handle double node locking snafu
If our queue processing is slow, and we lose the ZooKeeper session
after a node request has been fulfilled, but before we actually accept
the nodes, we need to be aware of this and not try to use the nodes
given to us.

Also pass the request ID along in the event queue since the actual
request object can have its ID changed out from underneath us on a
resubmit. Compare this ID with the request's current ID, and also
verify that the request still exists.

Furthermore, when we lock nodes, let's make sure that they are actually
allocated to the request we are processing.
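
A condensed sketch of those safeguards (the ZK method names are
assumptions):

  # Condensed sketch: verify the request still exists under its
  # original ID, and only lock nodes actually allocated to it.
  def accept_nodes(zk, request, event_request_id):
      if request.id != event_request_id:
          raise RuntimeError('Request ID changed (resubmitted?)')
      if not zk.nodeRequestExists(request):       # assumed API
          raise RuntimeError('Request no longer exists')
      for node in request.nodes:
          if node.allocated_to != request.id:
              raise RuntimeError('Node not allocated to this request')
          zk.lockNode(node)                       # assumed API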

Change-Id: Id89f6542afcf3f5d4a0b392b5cb8cf21ec3f6865
2017-10-05 17:26:59 -04:00
Clark Boylan ffe8f8bb81 Cleanup zookeeper and fake nodepool in nodepool tests
The nodepool tests set up zookeeper connections and a fake nodepool.
These test resources spawned threads that were leaking and causing
subsequent tests to run with zookeeper and nodepool threads running.
Clean that up.

Change-Id: Ib799d4a40eb51e7bbe71673a082f345235c41019
2017-04-25 10:19:31 -07:00
Clark Boylan 246201835f Use current class name on super call
In order to get the method resolution correct for setUp in the tests,
we need to call super on the current class, not the parent.
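
The difference, in brief:

  # Calling super() with the parent class skips the parent's own
  # setUp in the method resolution order; the current class is
  # correct.
  import unittest

  class ParentTestCase(unittest.TestCase):
      def setUp(self):
          super(ParentTestCase, self).setUp()
          self.resource = 'set up by parent'

  class ChildTestCase(ParentTestCase):
      def setUp(self):
          # Wrong (the old bug): super(ParentTestCase, self).setUp()
          # would skip ParentTestCase.setUp entirely.
          super(ChildTestCase, self).setUp()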

Change-Id: I1cf7f45c107a1fad2b70e3b18dc23b8bcc6c4b33
2017-04-25 10:14:41 -07:00
Clark Boylan 621ec9a22b Put test id in zk chroot path
In an effort to help debug leaked zk chroots, add the test id to the
chroot path so that we can see which tests are leaking.

Change-Id: I27047ecfa69d59964ac9ad4a03941b9826099420
2017-04-10 12:36:10 +00:00
James E. Blair 8b2a14704f Use hostname in Nodepool requests
Rather than identifying a Nodepool request as originating from
'zuul', identify it with the hostname of the zuul scheduler.  This
way the operator of a site with multiple zuuls can identify whence
requests originated.

Change-Id: I69aaca90d2bc3fac5fe55cdd9c57d724f6e103b4
2017-03-06 13:44:10 -08:00
James E. Blair 0d5a36e3ff Remove zk host list parsing
Just pass through the zookeeper host list string unparsed, since
that's what we're going to expect from our config file.

Change-Id: Ife8fd97860ad35c793ef956adbb9d626569f60bf
2017-02-21 11:08:47 -05:00
James E. Blair 2a8e0fa5f1 Move tests into test/unit
This makes room for a sibling directory for nodepool functional tests.

Change-Id: Iace94d313edb04192ac23a533ed967f076410980
2017-01-24 10:18:38 -08:00