Commit Graph

167 Commits

Author SHA1 Message Date
Simon Westphahl 3c71fc9f4b Use thread pool executor for AWS API requests
So far we've cached most of the AWS API listings (instances, volumes,
AMIs, snapshots, objects) but with refreshes happening synchronously.

Since some of those methods are used as part of other methods during
request handling, we make them asynchronous.
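
A minimal sketch of the pattern (assuming concurrent.futures; the class
and attribute names are illustrative, not the actual driver code):

    import time
    from concurrent.futures import ThreadPoolExecutor

    class CachedListing:
        # Stand-in for one of the driver's cached API listings.
        def __init__(self, fetch, ttl=10):
            self._fetch = fetch        # e.g. a boto3 describe_* call
            self._ttl = ttl
            self._value = []
            self._stamp = 0.0
            self._executor = ThreadPoolExecutor(max_workers=4)
            self._future = None

        def get(self):
            # Serve the (possibly stale) cache immediately and refresh
            # in the background so request handling never blocks.
            if (time.monotonic() - self._stamp > self._ttl
                    and (self._future is None or self._future.done())):
                self._future = self._executor.submit(self._refresh)
            return self._value

        def _refresh(self):
            self._value = self._fetch()
            self._stamp = time.monotonic()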

Change-Id: I22403699ebb39f3e4dcce778efaeb09328acd932
2023-10-17 14:36:37 -07:00
Zuul 6dde5c55cb Merge "Add ZK cache stats" 2023-08-14 21:00:10 +00:00
James E. Blair 07c83f555d Add ZK cache stats
To observe the performance of the ZK connection and the new tree
caches, add some statsd metrics for each of these.  This will
let us monitor queue size over time.

Also, update the assertReportedStat method to output all received
stats if the expected stat was not found (like Zuul).
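
A hedged sketch of the kind of emission involved (the statsd client is
real, but the metric names here are invented for illustration):

    import statsd

    client = statsd.StatsClient('localhost', 8125, prefix='nodepool')

    def report_cache_stats(event_queue_size, cache_object_count):
        # Gauges let us watch queue size and growth over time.
        client.gauge('zk.cache.event_queue', event_queue_size)
        client.gauge('zk.cache.objects', cache_object_count)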

Change-Id: Ia7e1e0980fdc34007f80371ee0a77d4478948518
Depends-On: https://review.opendev.org/886552
2023-08-03 10:27:25 -07:00
James E. Blair 4ef3ebade8 Update references of build "number" to "id"
This follows the previous change and is intended to have little
or no behavior changes (only a few unit tests are updated to use
different placeholder values).  It updates all textual references
of build numbers to build ids to better reflect that they are
UUIDs instead of integers.

Change-Id: I04b5eec732918f5b9b712f8caab2ea4ec90e9a9f
2023-08-02 11:18:15 -07:00
James E. Blair 3815cce7aa Change image ID from int sequence to UUID
When we export and import image data (for backup/restore purposes),
we need to reset the ZK sequence counter for image builds in order
to avoid collisions.  The only way we can do that is to create and
then delete a large number of znodes.  Some sites (including
OpenDev) have sequence numbers that are in the hundreds of thousands.

To avoid this time-consuming operation (which is only intended to
be run to restore from backup -- when operators are already under
additional stress!), this change switches the build IDs from integer
sequences to UUIDs.  This avoids the problem with collisions after
import (at least, to the degree that UUIDs avoid collisions).

The actual change is fairly simple, but many unit tests need to be
updated.

Since the change is user-visible in the command output (image lists,
etc), a release note is added.

A related change which updates all of the textual references of
build "number" to build "id" follows this one for clarity and ease
of review.
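
The heart of the change is small (a sketch; the real code lives in the
builder's ZooKeeper layer):

    import uuid

    # Before: build IDs came from a ZK sequence node, e.g. '0000153412',
    # whose counter had to be laboriously reset after an import.
    # After: a locally generated UUID needs no shared counter and cannot
    # collide with imported data.
    build_id = uuid.uuid4().hex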

Change-Id: Ie7c68b094bc9733914a808756eeee8b62f696713
2023-08-02 11:18:15 -07:00
James E. Blair 066699f88a Use low-level OpenStack SDK calls for server listing
The OpenStack SDK performs a lot of processing on the JSON data
returned by nova, and on large server lists, this can dwarf the
actual time needed to receive and parse the JSON.

Nodepool uses very little of this information, so let's use the
keystoneauth session to get a simple JSON list.

The Server object that SDK normally returns is a hybrid object
that provides both attributes and dictionary keys.  One method
that we call has some lingering references to accessors, so we
create a UserDict subclass to handle those. Nodepool-internal
references are updated from attributes to dictionary keys.
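
Roughly (a sketch, not the actual adapter code; the endpoint filter and
field handling are simplified):

    from collections import UserDict

    class Server(UserDict):
        # Hybrid object: dictionary keys plus the few attribute-style
        # accessors that one remaining caller still expects.
        def __getattr__(self, name):
            try:
                return self.data[name]
            except KeyError:
                raise AttributeError(name)

    def list_servers(session):
        # 'session' is a keystoneauth1 Session; ask nova for the raw
        # JSON and skip the SDK's per-field processing entirely.
        response = session.get(
            '/servers/detail',
            endpoint_filter={'service_type': 'compute'})
        return [Server(s) for s in response.json()['servers']]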

Change-Id: Iecc5976858e8d2ee6894a521f6a30f10ae9c6177
2023-07-25 11:29:25 -07:00
James E. Blair 9d07d26f51 Move statemachine node init into TPE
This moves the node initialization and lock from the assignHandlers
thread to a new threadpool executor.  There are several ZK calls
that happen in sequence as part of this, and if we move them out
of the assignHandlers thread we can increase overall throughput.

Change-Id: I67a32eed4102ab6ff56b1c21a65fe7dd071448e5
2023-05-16 20:42:49 -07:00
James E. Blair b0a40f0b47 Use image cache when launching nodes
We consult ZooKeeper to determine the most recent image upload
when we decide whether we should accept or decline a request.  If
we accept the request, we also consult it again for the same
information when we start building the node.  In both cases, we
can use the cache to avoid what may potentially be (especially in
the case of a large number of images or uploads) quite a lot of
ZK requests.  Our cache should be almost up to date (typically
milliseconds, or at the worst, seconds behind), and the worst
case is equivalent to what would happen if an image build took
just a few seconds longer.  The tradeoff is worth it.

Similarly, when we create min-ready requests, we can also consult
the cache.

With those 3 changes, all references to getMostRecentImageUpload
in Nodepool use the cache.

The original un-cached method is kept as well, because there are
an enormous number of references to it in the unit tests and they
don't have caching enabled.

In order to reduce the chances of races in many tests, the startup
sequence is normalized to:
1) start the builder
2) wait for an image to be available
3) start the launcher
4) check that the image cache in the launcher matches what
   is actually in ZK

This sequence (apart from #4) was already used by a minority of
tests (mostly newer tests).  Older tests have been updated.
A helper method, startPool, implements #4 and additionally includes
the wait_for_config method which was used by a random assortment
of tests.
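
The cached lookup is conceptually a local-first read (a sketch; the
method name follows the commit, the cache structure is illustrative):

    def get_most_recent_image_upload_cached(upload_cache, image, provider):
        # Scan the in-memory cache kept current by ZooKeeper watches
        # instead of issuing ZK reads; at worst it lags the real tree
        # by a few seconds, which this commit argues is acceptable.
        ready = [u for u in upload_cache
                 if u['image'] == image
                 and u['provider'] == provider
                 and u['state'] == 'ready']
        return max(ready, key=lambda u: u['state_time'], default=None)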

Change-Id: Iac1ff8adfbdb8eb9a286929a59cf07cd0b4ac7ad
2023-04-10 15:57:01 -07:00
James E. Blair be3edd3e17 Convert openstack driver to statemachine
This updates the OpenStack driver to use the statemachine framework.

The goal is to revise all remaining drivers to use the statemachine
framework for two reasons:

1) We can dramatically reduce the number of threads in Nodepool, which
is our biggest scaling bottleneck.  The OpenStack driver already
includes some work in that direction, but in a way that is unique
to it and not easily shared by other drivers.  The statemachine
framework is an extension of that idea implemented so that every driver
can use it.  This change further reduces the number of threads needed
even for the openstack driver.

2) By unifying all the drivers with a simple interface, we can prepare
to move them into Zuul.

There are a few updates to the statemachine framework to accommodate some
features that only the OpenStack driver used to date.

A number of tests need slight alteration since the openstack driver is
the basis of the "fake" driver used for tests.
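
The framework's core contract is deliberately small; an illustrative
reduction (not the full interface):

    class StateMachine:
        # One instance per node operation.  advance() is driven
        # repeatedly from a small shared thread pool instead of
        # dedicating one thread to each node.
        START = 'start'

        def __init__(self):
            self.state = self.START
            self.complete = False

        def advance(self):
            # Perform one short, non-blocking step, update self.state,
            # and set self.complete when the operation is finished.
            raise NotImplementedError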

Change-Id: Ie59a4e9f09990622b192ad840d9c948db717cce2
2023-01-10 10:30:14 -08:00
Zuul 9dd883107a Merge "Add hold command to disable nodes" 2022-11-30 20:05:41 +00:00
mbecker 1658aa9851 Add hold command to disable nodes
This allows nodes to be set in an idle state
so that they will not have jobs scheduled
while e.g. maintenance tasks are performed.
This is probably most useful for static nodes.

Change-Id: Iebc6b909f370fca11fab2be0b8805d4daef33afe
2022-10-13 12:43:34 +02:00
James E. Blair 08fdeed241 Add "slots" to static node driver
Add persistent slot numbers for static nodes.

This facilitates avoiding workspace collisions on nodes with
max-parallel-jobs > 1.

Change-Id: I30bbfc79a60b9e15f1255ad001a879521a181294
2022-10-11 07:02:53 -07:00
James E. Blair 6320b06950 Add support for dynamic tags
This allows users to create tags (or properties in the case of OpenStack)
on instances using string interpolation values.  The use case is to be
able to add information about the tenant* which requested the instance
to cloud-provider tags.

* Note that ultimately Nodepool may not end up using a given node for
the request which originally prompted its creation, so care should be
taken when using information like this.  The documentation notes that.

This feature uses a new configuration attribute on the provider-label
rather than the existing "tags" or "instance-properties" because existing
values may not be safe for use as Python format strings (e.g., an
existing value might be a JSON blob).  This could be solved with YAML
tags (like !unsafe) but the most sensible default for that would be to
assume format strings and use a YAML tag to disable formatting, which
doesn't help with our backwards-compatibility problem.  Additionally,
Nodepool configuration does not use YAML anchors (yet), so this would
be a significant change that might affect people's use of external tools
on the config file.

Testing this was beyond the ability of the AWS test framework as written,
so some redesign for how we handle patching boto-related methods is
included.  The new approach is simpler, more readable, and flexible
in that it can better accommodate future changes.
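
The interpolation itself is plain str.format() over request metadata; a
sketch with invented attribute names:

    # 'dynamic-tags' values are Python format strings; the existing
    # 'tags'/'instance-properties' values are left untouched because
    # they may contain literal '{' or '}' (e.g. JSON blobs).
    class FakeRequest:
        tenant_name = 'example-tenant'   # hypothetical metadata field

    dynamic_tags = {'team': '{request.tenant_name}', 'purpose': 'ci'}
    tags = {k: v.format(request=FakeRequest())
            for k, v in dynamic_tags.items()}
    # tags == {'team': 'example-tenant', 'purpose': 'ci'}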

Change-Id: I5f1befa6e2f2625431523d8d94685f79426b6ae5
2022-08-23 11:06:55 -07:00
James E. Blair 6a56940275 Fix race with two builders deleting images
In a situation with multiple builders, each configured with different
providers, it is possible for one builder to delete the ZK ImageBuild
record for a build from another builder between the time that the build
is completed but before the first upload starts.

This is because every builder looks for images to delete from ZK.  It
keeps the 2 most recent ready images (this should normally cover the
time period between a build and upload), unless the image is not
configured for any provider this builder knows about.  This is where
the disjoint providers come into play -- builder1 in our scenario
is not expected to have a configuration for provider2.

To correct this, we adjust this check so that the only time we
bypass the 2-most-recent-ready-images check is if the diskimage is
not configured at all.

That means that we still expect all builders to have a "diskimage"
entry for every image, but we don't need those to be configured
for any providers which this builder is not expected to handle.
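
The adjusted check, reduced to a sketch (names illustrative):

    def should_delete_build(build, config, recent_ready_builds):
        # Keep the two most recent ready builds for any diskimage that
        # appears in our config at all -- even if none of our providers
        # use it -- so another builder's build-to-upload window is safe.
        if build.image_name not in config.diskimages:
            return True   # image no longer configured anywhere
        return build not in recent_ready_builds[:2]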

Change-Id: Ic2fefda293fa0bcbc98ee7313198b37df0576299
2022-07-25 13:06:25 -07:00
James E. Blair 7bbdfdc9fd Update ZooKeeper class connection methods
This updates the ZooKeeper class to inherit from ZooKeeperBase
and utilize its connection methods.

It also moves the connection loss detection used by the builder
to be more localized and removes unused methods.

Change-Id: I6c9dbe17976560bc024f74cd31bdb6305d51168d
2022-06-29 07:46:34 -07:00
James E. Blair cacef76d3a Avoid collisions after ZK image data import
When image data are imported, if there are holes in the sequence
numbers, ZooKeeper may register a collision after nodepool-builder
builds or uploads a new image.  This is because ZooKeeper stores
a sequence node counter in the parent node, and we lose that
information when exporting/importing.  Newly built images can end
up with the same sequence numbers as imported images.  To avoid this,
re-create missing sequence nodes so that the import state more
closely matches the export state.
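
ZooKeeper keeps the next sequence number in the parent znode's
cversion, so a sketch of the workaround (kazoo assumed) is to create
and immediately delete children until the counter catches up:

    def advance_sequence(zk, parent, target):
        # Every SEQUENCE create bumps the parent's counter by one,
        # even though the child is deleted right away.
        while True:
            path = zk.create(parent + '/seq-', b'', sequence=True)
            zk.delete(path)
            if int(path.rsplit('-', 1)[1]) >= target - 1:
                return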

Change-Id: I0b96ebecc53dcf47324b8a009af749a3c04e574c
2022-06-20 13:00:05 -07:00
Zuul 492f6d5216 Merge "Add the component registry from Zuul" 2022-05-24 01:02:26 +00:00
Zuul a4acb5644e Merge "Use Zuul-style ZooKeeper connections" 2022-05-23 22:56:54 +00:00
James E. Blair a612aa603c Add the component registry from Zuul
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry, which doesn't have provision for live updates).

This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.

Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
2022-05-23 07:41:27 -07:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.
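
The import adjustment is mechanical:

    # Before:
    #   from nodepool import zk
    # After: the model classes now live one level down, aliased so the
    # existing references keep working unchanged.
    from nodepool.zk import zookeeper as zk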

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Joshua Watt 2c632af426 Do not reset quota cache timestamp when invalid
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary, as this raises a TypeError exception.

This bug was preventing Quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, it prevented the QuotaException from being raised, and the
node allocation failed outright.

A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.

This fixes that bug, but does change the behavior of OpenStack when
ignore-provider-quota is True and it returns a Quota error.
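
The guard itself is tiny; a sketch:

    def invalidate_quota_cache(quota_cache):
        # quota_cache may be None when 'ignore-provider-quota' is set;
        # treating None as a dict raised TypeError and masked the
        # QuotaException that should have triggered a retry.
        if quota_cache is not None:
            quota_cache['timestamp'] = 0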

Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
2022-05-10 15:04:25 -05:00
James E. Blair 46e130fe1a Add more debug info to AWS driver
These changes are all in service of being able to better understand
AWS driver log messages:

* Use annotated loggers in the statemachine provider framework
  so that we see the request, node, and provider information (see the
  sketch after these lists)
* Have the statemachine framework pass annotated loggers to the
  state machines themselves so that the above information is available
  for log messages on individual API calls
* Add optional performance information to the rate limit handler
  (delay and API call duration)
* Add some additional log entries to the AWS adapter

Also:

* Suppress boto logging by default in unit tests (it is verbose and
  usually not helpful)
* Add coverage of node deletion in the AWS driver tests
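
The annotated loggers in the first bullet are, in essence, standard
logging.LoggerAdapter usage (a sketch; identifiers are invented):

    import logging

    class NodeLoggingAdapter(logging.LoggerAdapter):
        # Prefix every message with the node and request in play.
        def process(self, msg, kwargs):
            return ('[node: %s] [req: %s] %s' % (
                self.extra['node'], self.extra['request'], msg), kwargs)

    log = NodeLoggingAdapter(
        logging.getLogger('nodepool.driver.aws'),
        {'node': '0001234', 'request': '200-0005678'})
    log.info('Launching instance')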

Change-Id: I0e6b4ad72d1af7f776da73c5dd2a50b40f60e4a2
2022-04-11 10:14:20 -07:00
James E. Blair 0b1fa1d57d Add commands to export/import image data from ZK
Change-Id: Id1ac6403f4fe80059b90900c519e56bca7dee0a0
2021-08-24 10:28:39 -07:00
James E. Blair 91804a5e16 Azure: switch to Azul
The Azure SDK for Python uses threads to manage async operations.
Every time a virtual machine is created, a new thread is spawned
to wait for it to finish (whether we actually end up polling it or
not).  This will cause the Azure driver to have significant
scalability limits compared to other drivers, possibly halving the
number of simultaneous nodes relative to the others.

To address this, switch to using a very simple requests-based
REST client I'm calling Azul.  The consistency of the Azure API
makes this simple.  As a bonus, we can use the excellent Azure
REST API documentation directly, rather than mapping attribute
names through the Python SDK (which has subtle differences).

A new fake Azure test fixture is also created in order to make
the current unit test a more thorough exercise of the code.

Finally, the "zuul-private-key" attribute is misnamed since we
have a policy of a one-way dependency from Zuul -> Nodepool.  Its
name is updated to match the GCE driver ("key") and moved to the
cloud-image section so that different images may be given different
keys.
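
In spirit, Azul wraps calls like this (an illustrative sketch; the real
client handles auth, paging, and more API versions):

    import requests

    def list_virtual_machines(subscription_id, resource_group, token):
        # One GET against the documented Azure REST endpoint -- no SDK
        # threads are spawned on our behalf.
        url = ('https://management.azure.com/subscriptions/%s'
               '/resourceGroups/%s/providers/Microsoft.Compute'
               '/virtualMachines?api-version=2020-12-01' % (
                   subscription_id, resource_group))
        r = requests.get(url, headers={'Authorization': 'Bearer ' + token})
        r.raise_for_status()
        return r.json()['value']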

Change-Id: I87bfa65733b2a71b294ebe2cf0d3404d0e4333c5
2021-03-08 14:58:31 -08:00
Zuul 74d299ec01 Merge "Offload waiting for server creation/deletion" 2021-03-06 06:09:49 +00:00
James E. Blair 4c5fa46540 Require TLS
Require TLS for ZooKeeper connections before making the 4.0 release.

Change-Id: I69acdcec0deddfdd191f094f13627ec1618142af
Depends-On: https://review.opendev.org/776696
2021-02-19 18:42:33 +00:00
Tobias Henkel 2e59f7b0b3 Offload waiting for server creation/deletion
Currently nodepool has one thread per server creation or
deletion. Each of those waits for the cloud by regularly getting the
server list and checking if their instance is active or gone. On a
busy nodepool this leads to severe thread contention when the server
list gets large and/or there are many parallel creations/deletions in
progress.

This can be improved by offloading the waiting to a single thread that
regularly retrieves the server list and compares that to the list of
waiting server creates/deletes. The calling threads then wait
until the central thread wakes them up to proceed with their task. The
waiting threads wait for the event outside of the GIL and thus
no longer contribute to the thread contention problem.

An alternative approach would be to redesign the code to use fewer
threads, but that would be a much more complex undertaking. Thus this
change keeps the many-threads approach but makes their waiting much
more lightweight, which showed a substantial improvement during load
testing in a test environment.
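
A reduced sketch of the pattern (threading primitives only; the real
change also batches list calls and handles errors and timeouts):

    import threading
    import time

    class ServerWatcher:
        # One background thread polls the server list; callers park on
        # an Event instead of each polling the API themselves.
        def __init__(self, list_servers, interval=5.0):
            self._list_servers = list_servers
            self._interval = interval
            self._waiters = {}   # server_id -> threading.Event
            self._lock = threading.Lock()
            threading.Thread(target=self._run, daemon=True).start()

        def wait_for_active(self, server_id, timeout=None):
            event = threading.Event()
            with self._lock:
                self._waiters[server_id] = event
            return event.wait(timeout)   # sleeps outside the GIL

        def _run(self):
            while True:
                active = {s['id'] for s in self._list_servers()
                          if s['status'] == 'ACTIVE'}
                with self._lock:
                    for sid in active & set(self._waiters):
                        self._waiters.pop(sid).set()
                time.sleep(self._interval)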

Change-Id: I5525f2558a4f08a455f72e6b5479f27684471dc7
2021-02-16 15:37:57 +01:00
Clark Boylan 6276562939 Use iterate_timeout in test waits
This ensures that we don't wait forever for tests to complete tasks.
This is particularly useful if you've disabled the global test timeout.
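
The helper is a guarded generator; a sketch (the signature shown is an
assumption for illustration, not the exact test-utility API):

    import time

    def iterate_timeout(max_seconds, purpose):
        # Yield until the deadline, then fail loudly rather than let
        # the test wait forever.
        start = time.monotonic()
        while time.monotonic() < start + max_seconds:
            yield
            time.sleep(0.1)
        raise Exception('Timed out waiting for %s' % purpose)

    # for _ in iterate_timeout(60, 'image to become ready'):
    #     if image_is_ready():
    #         break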

Change-Id: I0141e62826c3594ed20605cac25e39091d1514e2
2020-01-14 08:25:09 -08:00
Zuul 0a010d94a1 Merge "Fix builder shutdown race in tests" 2019-10-15 15:24:27 +00:00
Ian Wienand ddbcf1b07d Validate openstack provider pool labels have top-level labels
We broke nodepool configuration with
I3795fee1530045363e3f629f0793cbe6e95c23ca by not having the labels
defined in the OpenStack provider in the top-level label list.

The added check here would have found such a case.

The validate() function is reworked slightly; previously it would
return various exceptions from the tools it was calling (YAML,
voluptuous, etc.).  Now that we have more testing (and I'd imagine we
could do even more, similar validations too), we'd have to keep adding
exception types.  Just make the function return a value; this also
makes sure the regular exit paths are taken from the caller in
nodepoolcmd.py, rather than dying with an exception at an arbitrary
point.

A unit test is added.
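
The caller's shape after the change, roughly (the loader call is
hypothetical):

    import logging

    log = logging.getLogger('nodepool.config_validator')

    def validate(config_path):
        # Catch the tool-specific exceptions (YAML, voluptuous, ...)
        # here and return a status code instead of letting them escape.
        try:
            load_and_check(config_path)   # hypothetical loader
        except Exception:
            log.exception('Validation failed for %s', config_path)
            return 1
        return 0

    # nodepoolcmd.py can then simply sys.exit(validate(path)).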

Co-Authored-By: Mohammed Naser <mnaser@vexxhost.com>
Change-Id: I5455f5d7eb07abea34c11a3026d630dee62f2185
2019-10-15 15:32:32 +11:00
David Shrewsbury e732fec5bf Fix builder shutdown race in tests
The builder intentionally does not attempt to shutdown the uploader
threads since that could take an unreasonable amount of time. This
causes a race in our tests where we can shutdown the ZooKeeper
connection while the upload thread is still in progress, which can
cause the test to fail with a ZooKeeper error. This adds uploader
thread cleanup for the builder used in tests.

Change-Id: I25d4b52e17501e5dc6543adef585dd3b86bd70f9
2019-10-10 15:30:35 -04:00
David Shrewsbury 5c605b3240 Reduce upload threads in tests from 4 to 1
Only a single test actually depends on having more than a single
upload thread active, so this is just wasteful. Reduce the default
to 1 and add an option to useBuilder() that tests may use to alter
the value.

Change-Id: I07ec96000a81153b51b79bfb0daee1586491bcc5
2019-09-18 15:39:12 -04:00
Ian Wienand 9367cf8ed8 Add a dib-cmd option for diskimages
This change allows you to specify a dib-cmd parameter for disk images,
which overrides the default call to "disk-image-create".  This lets
you choose, per configured disk image, the binary that is called in
place of the default disk-image-create.

It is inspired by a couple of things:

The "--fake" argument to nodepool-builder has always been a bit of a
wart; a case of testing-only functionality leaking into the
production code.  It would be clearer if the tests used exposed
methods to configure themselves to use the fake builder.

Because disk-image-create is called from the $PATH, it is more
difficult to use nodepool from a virtualenv.  You cannot just run
"nodepool-builder"; you have to ". activate" the virtualenv before
running the daemon so that the path is set to find the virtualenv
disk-image-create.

In addressing activation issues by automatically choosing the
in-virtualenv binary in Ie0e24fa67b948a294aa46f8164b077c8670b4025, it
was pointed out that others are already using wrappers in various ways
where preferring the co-installed virtualenv version would break.

With this, such users can ensure they call the "disk-image-create"
binary they want.  We can then make a change to prefer the
co-installed version without fear of breaking.

In theory, there's no reason why a totally separate
"/custom/venv/bin/disk-image-create" would not be valid if you
required a customised dib for some reason for just one image.  This is
not currently possible, even modulo PATH hacks, etc.; all images will
use the same binary to build.  It is for this flexibility that I think
this is best at the diskimage level, rather than as, say, a global
setting for the whole builder instance.

Thus add a dib-cmd option for diskimages.  In the testing case, this
points to the fake-image-create script, and the --fake command-line
option and related bits are removed.

It should have no backwards compatibility effects; documentation and a
release note are added.
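
A sketch of how the builder can pick the binary (the dib-cmd attribute
is from this change; the surrounding structure is illustrative):

    def build_command(diskimage, output_path):
        # dib-cmd defaults to the historical behaviour; tests point it
        # at fake-image-create, a venv user at their own wrapper.
        dib_cmd = diskimage.get('dib-cmd', 'disk-image-create')
        return [dib_cmd, '-t', ','.join(diskimage['formats']),
                '-o', output_path] + diskimage['elements']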

Change-Id: I6677e11823df72f8c69973c83039a987b67eb2af
2019-08-22 10:09:00 +10:00
Tobias Henkel 4131d7da59 Cleanup kube_config temp files between test runs
In my local tox runs, the temp files of each test case get
deleted after the run. However, kube_config maintains a static list of
temporary files it knows about and tries to re-use them in subsequent
test runs, which causes the test to fail [1]. Fix this by telling
kube_config to clean up its temporary files in the cleanup phase.

[1] Trace
Traceback (most recent call last):
  File "/home/tobias/src/nodepool/nodepool/tests/unit/test_builder.py", line 239, in test_image_rotation_invalid_external_name
    build001, image001 = self._test_image_rebuild_age(expire=172800)
  File "/home/tobias/src/nodepool/nodepool/tests/unit/test_builder.py", line 186, in _test_image_rebuild_age
    self.useBuilder(configfile)
  File "/home/tobias/src/nodepool/nodepool/tests/__init__.py", line 539, in useBuilder
    BuilderFixture(configfile, cleanup_interval, securefile)
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/testtools/testcase.py", line 756, in useFixture
    reraise(*exc_info)
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/testtools/_compat3x.py", line 16, in reraise
    raise exc_obj.with_traceback(exc_tb)
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/testtools/testcase.py", line 731, in useFixture
    fixture.setUp()
  File "/home/tobias/src/nodepool/nodepool/tests/__init__.py", line 318, in setUp
    self.builder.start()
  File "/home/tobias/src/nodepool/nodepool/builder.py", line 1304, in start
    self._config = self._getAndValidateConfig()
  File "/home/tobias/src/nodepool/nodepool/builder.py", line 1279, in _getAndValidateConfig
    config = nodepool_config.loadConfig(self._config_path)
  File "/home/tobias/src/nodepool/nodepool/config.py", line 246, in loadConfig
    driver.reset()
  File "/home/tobias/src/nodepool/nodepool/driver/openshift/__init__.py", line 29, in reset
    config.load_kube_config(persist_config=True)
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 540, in load_kube_config
    loader.load_and_set(config)
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 422, in load_and_set
    self._load_cluster_info()
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 385, in _load_cluster_info
    file_base_path=self._config_base_path).as_file()
  File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 112, in as_file
    raise ConfigException("File does not exists: %s" % self._file)
kubernetes.config.config_exception.ConfigException: File does not exists: /tmp/tmplafutg0j/tmpmiti10bn
Ran 2 tests in 4.524s (+0.175s)
FAILED (id=20, failures=1)
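
The fix is a one-line cleanup hook; a sketch (assuming kube_config's
module-level _cleanup_temp_files helper):

    import testtools
    from kubernetes.config import kube_config

    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super().setUp()
            # kube_config keeps a static list of temp files it wrote;
            # clear it so later runs don't reference deleted paths.
            self.addCleanup(kube_config._cleanup_temp_files)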

Change-Id: Idce8ca9bed49162874af24b224e573121e250385
2019-05-04 11:08:40 +02:00
David Shrewsbury fa2d4bd17c Fix for image build leaks
If, during a long DIB image build, we lose the ZooKeeper session,
it's likely that the CleanupWorker thread could have run and removed
the ZK record for the build (its state would be BUILDING and unlocked,
indicating something went wrong). In that scenario, when the DIB
process finishes (possibly writing out DIB files), it will never get
cleaned up since the ZK record would now be gone. If we fail to update
the ZK record at the end of the build, just delete the leaked DIB files
immediately after the build.

Change-Id: I5cb58318efe51b5b0c3413b7a01f02a50215a8b6
2019-04-01 15:44:31 -04:00
Zuul 280cd5937d Merge "Revert "Revert "Add a timeout for the image build""" 2019-02-06 13:16:06 +00:00
David Shrewsbury 890ea4975e Revert "Revert "Add a timeout for the image build""
This reverts commit ccf40a462a.

The previous version would not work properly when daemonized
because there was no stdout. This version maintains stdout and
uses select/poll with non-blocking stdout to capture the output
to a log file.
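
The capture loop is roughly (a sketch of the select/poll approach; fd
handling and error paths simplified):

    import fcntl
    import os
    import select
    import subprocess
    import time

    def run_dib(cmd, log_path, timeout):
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
        # Non-blocking stdout means a quiet build cannot wedge a
        # readline() call.
        flags = fcntl.fcntl(p.stdout, fcntl.F_GETFL)
        fcntl.fcntl(p.stdout, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        poller = select.poll()
        poller.register(p.stdout, select.POLLIN)
        deadline = time.monotonic() + timeout
        with open(log_path, 'wb') as log:
            while p.poll() is None:
                if time.monotonic() > deadline:
                    p.kill()
                    raise TimeoutError('image build timed out')
                for _fd, _event in poller.poll(1000):
                    log.write(p.stdout.read() or b'')
        return p.returncode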

Depends-On: https://review.openstack.org/634266

Change-Id: I7f0617b91e071294fe6051d14475ead1d7df56b7
2019-01-31 11:36:47 -05:00
Tristan Cacqueray aa16b8b891 Amazon EC2 driver
This change adds an experimental AWS driver. It lacks some of the deeper
features of the openstack driver, such as quota management and image
building, but is highly functional for running tests on a static AMI.

Note that the test base had to be refactored to allow fixtures to be
customized in a more flexible way.

Change-Id: I313f9da435dfeb35591e37ad0bec921c8b5bc2b5
Co-Authored-By: Tristan Cacqueray <tdecacqu@redhat.com>
Co-Authored-By: David Moreau-Simard <dmsimard@redhat.com>
Co-Authored-By: Clint Byrum <clint@fewbar.com>
2019-01-28 12:08:36 -08:00
Zuul f2c155821c Merge "Revert "Add a timeout for the image build"" 2019-01-25 22:37:34 +00:00
David Shrewsbury ccf40a462a Revert "Add a timeout for the image build"
This reverts commit 7225354ec0.

The disk-image-create command does not appear to be starting.

Change-Id: I81abe25a253a385cae08a57561129a678546f18f
2019-01-25 17:36:31 +00:00
Zuul 26c57ee5a9 Merge "Add a timeout for the image build" 2019-01-24 16:15:32 +00:00
David Shrewsbury 7225354ec0 Add a timeout for the image build
A builder thread can wedge if the build process wedges. Add a timeout
to the subprocess. Since it was the call to readline() that would block,
we change the process to have DIB write directly to the log. This allows
us to set a timeout in the Popen.wait() call, and we kill the dib
subprocess as well.

The timeout value can be controlled in the diskimage configuration and
defaults to 8 hours.
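
With DIB writing straight to the log file there is no pipe to drain, so
the timeout is a plain Popen.wait() (a sketch):

    import subprocess

    def build_image(cmd, log_path, timeout=8 * 60 * 60):
        with open(log_path, 'wb') as log:
            p = subprocess.Popen(cmd, stdout=log,
                                 stderr=subprocess.STDOUT)
            try:
                return p.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                # Don't leave a wedged dib process behind.
                p.kill()
                raise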

Change-Id: I188e8a74dc39b55a4b50ade5c1a96832fea76a7d
2019-01-23 16:27:19 -05:00
Tristan Cacqueray c1378c4407 Implement an OpenShift resource provider
This change implements an OpenShift resource provider. The driver currently
supports project requests and pod requests to enable both the
containers-as-machines and native container workflows.

Depends-On: https://review.openstack.org/608610
Change-Id: Id3770f2b22b80c2e3666b9ae5e1b2fc8092ed67c
2019-01-10 05:05:46 +00:00
Tobias Henkel 64487baef0 Asynchronously update node statistics
We currently update the node statistics on every node launch or
delete. This cannot use caching at the moment because when the
statistics are updated we might end up pushing slightly outdated
data. If there is then no further update for a long time, we end up
with broken gauges. We already get update events from the node cache,
so we can use those to centrally trigger node statistics updates.

This is combined with leader election so there is only a single
launcher that keeps the statistics up to date. This will ensure that
the statistics are not cluttered by several launchers pushing
their own slightly different views into the stats.

As a side effect this reduces the runtime of a test that creates 200
nodes from 100s to 70s on my local machine.
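
Leader election over ZooKeeper is a few lines with kazoo; a sketch of
the pattern (paths and identifiers illustrative):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()

    def update_stats_forever():
        # Only the elected leader pushes node gauges, so launchers do
        # not overwrite each other with slightly different views.
        ...

    election = zk.Election('/nodepool/stats-election', 'launcher-01')
    election.run(update_stats_forever)   # blocks until elected, then runs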

Change-Id: I77c6edc1db45b5b45be1812cf19eea66fdfab014
2018-11-29 16:48:30 +01:00
Tobias Henkel 9d77f05d8e Only setup zNode caches in launcher
We currently only need to set up the zNode caches in the
launcher. Within the command-line client and the builders this is just
unnecessary work.

Change-Id: I03aa2a11b75cab3932e4b45c5e964811a7e0b3d4
2018-11-26 20:13:39 +01:00
Ian Wienand cd9aa75640 Use pipelines for stats keys
Pipelines buffer stats and then send them out in more reasonably sized
chunks, helping to avoid small UDP packets going missing in a flood of
stats.  Use this in stats.py.

This needs a slight change to the assertedStats handler to extract the
combined stats.  This function is ported from Zuul where we updated to
handle pipeline stats (Id4f6f5a6cd66581a81299ed5c67a5c49c95c9b52) so
it is not really new code.
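
With the statsd client this looks like (a sketch):

    import statsd

    client = statsd.StatsClient('localhost', 8125, prefix='nodepool')

    with client.pipeline() as pipe:
        # Buffered and flushed as a few larger UDP packets on exit,
        # instead of one small packet per stat.
        pipe.incr('launch.ready')
        pipe.timing('launch.time', 1250)
        pipe.gauge('nodes.building', 3)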

Change-Id: I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43
2018-07-25 16:46:13 +10:00
Clark Boylan f385a5821f Fix test patching of clouds.yaml file locations
OpenStack Client Config has been pulled into openstacksdk. As part of
this work OSCC internals were dropped and aliased into the sdk lib. This
move broke patching of the clouds.yaml file location for nodepool tests.

We quickly work around this by using the new location for the value to
be overridden in openstacksdk.

Change-Id: I55ad4333ffddec8eeb023e345156e96773504400
2018-05-03 12:50:33 -07:00
James E. Blair baa831192f Store build logs automatically
This updates the builder to store individual build logs in dedicated
files, one per build, named for the image and build id.  Old logs are
automatically pruned.  By default, they are stored in
/var/log/nodepool/builds, but this can be changed.

This removes the need to specially configure a logging handler for the
image build logs.

Change-Id: Ia7415d2fbbb320f8eddc4e46c3a055414df5f997
2018-02-09 07:50:20 -08:00
Zuul a5173f8f46 Merge "Do pep8 housekeeping according to zuul rules" into feature/zuulv3 2018-01-17 17:07:28 +00:00
Tobias Henkel 7d79770840 Do pep8 housekeeping according to zuul rules
The pep8 rules used in nodepool are somewhat broken. In preparation
for using the pep8 ruleset from zuul, we need to fix the findings
upfront.

Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
2018-01-17 02:17:45 +00:00