Commit Graph

124 Commits

Author SHA1 Message Date
Clark Boylan 2a231a08c9 Add idle state to driver providers
This change adds an idle state to driver providers which is used to
indicate that the provider should stop performing actions that are not
safe to perform while we bootstrap a second newer version of the
provider to handle a config update.

This is particularly interesting for the static driver because it is
managing all of its state internally to nodepool and not relying on
external cloud systems to track resources. This means it is important
for the static provider to not have an old provider object update
zookeeper at the same time as a new provider object. This was previously
possible and created situations where the resources in zookeeper did
not reflect our local config.

Since all other drivers rely on external state the primary update here
is to the static driver. We simply stop performing config
synchronization if the idle flag is set on a static provider. This will
allow the new provider to take over reflecting the new config
consistently.

Note, we don't take other approaches and essentially create a system
specific to the static driver because we're trying to avoid modifying
the nodepool runtime significantly to fix a problem that is specific to
the static driver.

Change-Id: I93519d0c6f4ddf8a417d837f6ae12a30a55870bb
2022-10-24 15:30:31 -07:00
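
A rough sketch of how such an idle flag might gate the static provider's config synchronization (the class and method names here are illustrative, not nodepool's actual API):

    import threading
    import time

    class StaticProviderSketch:
        """Illustrative provider that stops reconciling shared state once idled."""

        def __init__(self, name):
            self.name = name
            self._idle = threading.Event()

        def set_idle(self):
            # Called on the old provider object before its replacement starts,
            # so only one provider ever writes to ZooKeeper for this config.
            self._idle.set()

        def sync_loop(self, interval=0.1, iterations=3):
            for _ in range(iterations):
                if self._idle.is_set():
                    break
                self.sync_config_to_zookeeper()
                time.sleep(interval)

        def sync_config_to_zookeeper(self):
            print("%s: reconciling static nodes with config" % self.name)

    old = StaticProviderSketch("static-provider")
    old.set_idle()      # a new provider is about to take over
    old.sync_loop()     # performs no further updates
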
James E. Blair b8035de65f Improve handling of errors in provider manager startup
If a provider (or its configuration) is sufficiently broken that
the provider manager is unable to start, then the launcher will
go into a loop where it attempts to restart all providers in the
system until it succeeds.  During this time, no pool managers are
running, which means all requests are ignored by this launcher.

Nodepool continuously reloads its configuration file, and in case
of an error, the expected behavior is to continue running and allow
the user to correct the configuration and retry after a short delay.

We also expect providers on a launcher to be independent of each
other so that if one fails, the others continue working.

However since we neither exit, nor process node requests if a
provider manager fails to start, an error with one provider can
cause all providers to stop handling requests with very little
feedback to the operator.

To address this, if a provider manager fails to start, the launcher
will now behave as if the provider were absent from the config file.
It will still emit the error to the log, and it will continuously
attempt to start the provider so that if the error condition abates,
the provider will start.

If there are no providers online for a label, then as long as any
provider in the system is running, node requests will be handled
and declined (and possibly failed) while the broken provider is offline.

If the system contains only a single provider and it is broken, then
no requests will be handled (or failed), which is the current behavior
and still likely the most desirable in that case.

Change-Id: If652e8911993946cee67c4dba5e6f88e55ac7099
2022-01-14 19:07:32 -08:00
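
A hedged sketch of the behaviour described above - a provider whose manager fails to start is logged and skipped as though it were absent, then retried on the next pass (the driver factory call here is hypothetical, not nodepool's real interface):

    import logging

    logging.basicConfig()
    log = logging.getLogger("launcher")

    def start_provider_managers(provider_configs, running_managers):
        """Start a manager for each configured provider, skipping broken ones."""
        for name, config in provider_configs.items():
            if name in running_managers:
                continue
            try:
                # Hypothetical factory; the real code asks the driver for a manager.
                manager = config["driver"].get_provider(config)
                manager.start()
            except Exception:
                # Behave as if this provider were absent; retry on the next loop.
                log.exception("Error starting provider manager for %s", name)
                continue
            running_managers[name] = manager
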
Fabien Boucher f57ac1881a
Remove unneeded shebang and exec bit on some files
Having python files with the exec bit and a shebang defined in
/usr/lib/python-*/site-packages/ is not fine in an RPM package.

Instead of carrying a patch in the nodepool RPM packaging, it is better
to fix this directly upstream.

Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
2019-12-13 19:30:03 +01:00
Monty Taylor 7618b714e2 Remove unused use_taskmanager flag
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.

Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
2019-04-02 12:11:07 +00:00
Tristan Cacqueray c7f2538457 builder: do not configure provider that doesn't manage images
This change prevents the builder service from starting providers that don't
manage images.

Change-Id: Id179e2d3bedb9c9914b13241c77bddad3ec7ca57
2018-07-15 23:10:05 +00:00
David Shrewsbury a418aabb7a Pass zk connection to ProviderManager.start()
In order to support static node pre-registration, we need to give
the provider manager the opportunity to register/deregister any
nodes in its configuration file when it starts (on startup or when
the config changes). It will need a ZooKeeper connection to do this.
The OpenStack driver will ignore this parameter.

Change-Id: Idd00286b2577921b3fe5b55e8f13a27f2fbde5d6
2018-06-12 12:04:16 -04:00
James E. Blair e20858755f Have Drivers create Providers
Use the new Driver class to create instances of Providers

Change-Id: Idfbde8d773a971133b49fbc318385893be293fac
2018-06-06 14:57:40 -04:00
Tristan Cacqueray d0a67878a3 Add a plugin interface for drivers
This change adds a plugin interface so that drivers can be loaded dynamically.
Instead of importing each driver in the launcher, provider_manager and config,
the Drivers class discovers and loads drivers from the driver directory.

This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.

Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
2018-01-19 00:45:56 +00:00
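
A simplified sketch of dynamic driver discovery of this kind, assuming each driver lives in its own sub-package and exposes a Driver class (an assumption for illustration, not a statement of nodepool's exact layout):

    import importlib
    import os

    def discover_drivers(drivers_dir, package="nodepool.driver"):
        """Import every sub-package found in the drivers directory."""
        drivers = {}
        for entry in sorted(os.listdir(drivers_dir)):
            if entry.startswith("_"):
                continue
            if not os.path.isdir(os.path.join(drivers_dir, entry)):
                continue
            module = importlib.import_module("%s.%s" % (package, entry))
            drivers[entry] = module.Driver()
        return drivers
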
Tristan Cacqueray b01227c9d4 Move the fakeprovider module to the fake driver
This change is a follow-up to the drivers spec and it makes the fake provider
a real driver. The fakeprovider module is merged into the fake provider and
the get_one_cloud config loader is simplified.

Change-Id: I3f8ae12ea888e7c2a13f246ea5f85d4a809e8c8d
2017-07-28 11:35:07 +00:00
Tristan Cacqueray c0e6d5112b Extend Nodepool configuration syntax to support multiple drivers
Change-Id: I220e8e71c1205174a0a7515899c9bb6c4cc6adcb
Story: 2001044
Task: 4616
2017-07-25 14:27:17 +00:00
Tristan Cacqueray 4d201328f5 Collect request handling implementation in an OpenStack driver
This change moves OpenStack-related code to a driver. To avoid a circular
import, this change also moves the StatsReporter to the stats module so that
the handlers don't have to import the launcher.

Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614
2017-07-25 14:27:17 +00:00
Tristan Cacqueray 27b600ee2c Abstract Nodepool provider management code
This change adds a generic Provider metaclass to the common
driver module to support multiple implementations. It also renames
some methods to better match other drivers' use cases, e.g.:
* listServers into listNodes
* cleanupServer into cleanupNode

Change-Id: I6fab952db372312f12e57c6212f6ebde59a1a6b3
Story: 2001044
Task: 4612
2017-07-25 14:27:13 +00:00
Jenkins 279809ed1d Merge "Create group for label type" into feature/zuulv3 2017-06-13 17:34:36 +00:00
Ricardo Carrillo Cruz 7c3263c7df Create group for label type
Currently, we get OOTB groups per provider and per image.
It would be nice to also have groups per label type, for running
plays against a particular label.

Change-Id: Ib4173fc0c15184444a91dc402bb306d34f295106
2017-06-13 18:54:48 +02:00
Monty Taylor 8c59361032
Support booting cloud-images by name or id
The docs say we support this, but the code doesn't.

Also, self._cloud_image.name == self._label._cloud_image and is
essentially a foreign key. That's hard to read at the call site, so just
use self._cloud_image.

We have a cloud id if it's a disk image, so wrap that in a dict. Pass
the other one through unmodified so that we'll search for it.

We also don't have any codepaths using image_name, nor a reason to
distinguish.

Change-Id: I4aa9bd8e7c578ae63d05df453b9886c710a092c0
2017-06-10 10:16:51 -05:00
Paul Belanger 1d0990a1c1
Add boot-from-volume support for nodes
For example, a cloud may get better performance from a cinder volume
than from the local compute drive. As a result, give nodepool the option to
choose whether the server should boot from a volume or not.

Change-Id: I3faefe99096fef1fe28816ac0a4b28c05ff7f0ec
Depends-On: If58cd96b0b9ce4569120d60fbceb2c23b2f7641d
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-30 14:23:24 -04:00
Paul Belanger e4e98123d3
Fetch server console log if ssh connection fails
Currently, if the ssh connection fails, we are blind to what the
possible failures are.  As a result, attempt to fetch the server
console log to help debug the failure.

This is the continuation of I39ec1fe591d6602a3d494ac79ffa6d2203b5676b
but for the feature/zuulv3 branch. This was done to avoid merge
conflicts on the recent changes to nodepool.yaml layout.

Change-Id: I75ccb6d01956fb6052473f44cce8f097a56dd16a
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-23 12:53:44 -04:00
Paul Belanger 71ff1a9bc5
Sort flavors with operator.itemgetter('ram')
The current syntax is not Python 3 compatible, so we look to shade to
help accomplish our sorting.

Change-Id: Iadb39f976840fd2af6e0bd7b08bd3b01169e37a1
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-17 15:19:52 -04:00
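
For illustration, sorting flavor dicts by RAM with operator.itemgetter in a Python 3 compatible way looks roughly like this:

    import operator

    flavors = [
        {"name": "m1.large", "ram": 8192},
        {"name": "m1.small", "ram": 2048},
        {"name": "m1.medium", "ram": 4096},
    ]

    # key= sorting works on both Python 2 and 3; the removed cmp= style does not.
    flavors = sorted(flavors, key=operator.itemgetter("ram"))
    print([f["name"] for f in flavors])   # ['m1.small', 'm1.medium', 'm1.large']
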
Paul Belanger d892837cad
Fix imports for python3
The syntax for imports has changed in Python 3, so let's use the new
syntax.

Change-Id: Ia985424bf23b44e492f51182179d2e476cdcccbb
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-17 15:19:48 -04:00
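
As a hedged illustration of the import change: Python 3 drops implicit relative imports, so intra-package imports have to be spelled explicitly (the module name used here is only an example):

    # Python 2 allowed an implicit relative import inside the package:
    #   import nodeutils
    #
    # Python 3 requires one of the explicit forms instead:
    from nodepool import nodeutils        # absolute import
    # or, from within the package itself:
    # from . import nodeutils             # explicit relative import
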
Monty Taylor 642f14c076 Add ability to select flavor by name or id
It's possible that it's easier for a nodepool user to just specify a
name or id of a flavor in their config instead of the combo of min-ram
and name-filter.

In order to avoid two name-related items, and to avoid having the
pure flavor-name case use a term called "name-filter", change
name-filter to flavor-name and introduce the following semantics: if
flavor-name is given by itself, it will look for an exact match on
flavor name or id; if it's given with min-ram, it will behave as
name-filter did already.

Change-Id: I8b98314958d03818ceca5abf4e3b537c8998f248
2017-04-27 13:44:25 -07:00
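
The selection semantics described above might look roughly like this (a sketch, not nodepool's actual implementation):

    def pick_flavor(flavors, flavor_name=None, min_ram=None):
        """Pick a flavor using the flavor-name / min-ram semantics."""
        if flavor_name and not min_ram:
            # flavor-name alone: exact match on flavor name or id.
            for f in flavors:
                if flavor_name in (f["name"], f["id"]):
                    return f
            raise ValueError("no flavor named %r" % flavor_name)
        # min-ram, optionally with flavor-name acting as the old name-filter:
        # smallest flavor with enough RAM whose name contains the filter.
        candidates = [f for f in flavors if f["ram"] >= (min_ram or 0)]
        if flavor_name:
            candidates = [f for f in candidates if flavor_name in f["name"]]
        return min(candidates, key=lambda f: f["ram"])

    flavors = [
        {"id": "1", "name": "m1.small", "ram": 2048},
        {"id": "2", "name": "m1.large", "ram": 8192},
    ]
    print(pick_flavor(flavors, flavor_name="m1.large")["id"])   # exact match
    print(pick_flavor(flavors, min_ram=4096)["name"])           # smallest fit
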
David Shrewsbury 92f375c70b Remove support for nodepool_id
This was a temporary measure to keep production nodepool from
deleting nodes created by v3 nodepool. We don't need to carry
it over.

This is an alternative to: https://review.openstack.org/449375

Change-Id: Ib24395e30a118c0ea57f8958a8dca4407fe1b55b
2017-03-30 12:08:04 -04:00
Jenkins 73f3b56376 Merge "Merge branch 'master' into feature/zuulv3" into feature/zuulv3 2017-03-30 16:03:36 +00:00
Joshua Hesketh 94f33cb666 Merge branch 'master' into feature/zuulv3
The nodepool_id feature may need to be removed. I've kept it to simplify
merging both now and if we do it again later.

A couple of the tests are disabled and need reworking in a subsequent
commit.

Change-Id: I948f9f69ad911778fabb1c498aebd23acce8c89c
2017-03-30 21:46:15 +11:00
Monty Taylor 19e8f2788c
Fetch list of AZs from nova if it's not configured
Nova has an API call that can fetch the list of available AZs. Use it to
provide a default list so that we can provide sane choices to the
scheduler related to multi-node requests rather than just letting nova
pick on a per-request basis.

Change-Id: I1418ab8a513280318bc1fe6e59301fda5cf7b890
2017-03-29 13:09:50 -05:00
James E. Blair 440c427662 Remove deprecated networks syntax
And simplify.

Change-Id: I8be53c228de9be5dc3cb39ff9d90cda6bbde9124
2017-03-27 11:35:12 -07:00
James E. Blair dcc3b5e071 Update nodepool config syntax
This implements the changes described in:

http://lists.openstack.org/pipermail/openstack-infra/2017-January/005018.html

It also removes some, but not all, extraneous keys from test config files.

Change-Id: Iebc941b4505d6ad46c882799b6230eb23545e5c0
2017-03-27 09:34:02 -07:00
Paul Belanger c5c5be30f9 Remove keypair from provider section
This was an unused setting which was left over from when we supported
snapshots.

Change-Id: I940eaa57f5dad8761752d767c0dfa80f2a25c787
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-03-27 08:31:31 -07:00
Paul Belanger f7289a5aca Remove legacy openstack settings from nodepool.yaml
Before os-client-config and shade, we would include cloud credentials
in nodepool.yaml. Now the time has come to remove these
settings in favor of using a local clouds.yaml file.

Change-Id: Ie7af6dcd56dc48787f280816de939d07800e9d11
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-03-27 08:31:29 -07:00
Jenkins 30c4b46b48 Merge "Default config-drive to true" 2017-03-15 17:36:22 +00:00
Monty Taylor 066942a0ac Stop json-encoding the nodepool metadata
When we first started putting nodepool metadata into the server record
in OpenStack, we json encoded the data so that we could store a dict
into a field that only takes strings. We were also going to teach the
ansible OpenStack Inventory about this so that it could read the data
out of the groups list. However, ansible was not crazy about accepting
"attempt to json decode values in the metadata" since json-encoded
values are not actually part of the interface OpenStack expects - which
means one of our goals - ansible inventory groups based on
nodepool information - is no longer really a thing.

We could push harder on that, but we actually don't need the functionality
we're getting from the json encoding. The OpenStack Inventory has
supported comma separated lists of groups since before day one. And the
other nodepool info we're storing can be stored and fetched just as easily
with 4 different top-level keys as in a json dict - and it is
easier to read and deal with when just looking at server records.
Finally, nova has a 255 byte limit on the size of the value that can be
stored, so we cannot grow the information in the nodepool dict
indefinitely anyway.

Migrate the stored data into nodepool_ variables and a comma-separated
list for groups. Consume both forms, so that people upgrading will not
lose track of their existing stock of nodes.

Finally, we don't use snapshot_id anymore - so remove it.

Change-Id: I2c06dc7c2faa19e27d1fb1d9d6df78da45ffa6dd
2017-03-10 16:24:03 -05:00
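
A small illustration of the before/after metadata shape (the specific nodepool_ keys shown are examples of the prefix scheme, not an exhaustive list):

    import json

    # Old: one json-encoded blob under a single key (nova caps values at 255 bytes).
    old_meta = {
        "nodepool": json.dumps({"provider_name": "example-cloud",
                                "node_id": "42",
                                "groups": ["ubuntu-xenial", "bare"]}),
    }

    # New: flat nodepool_ keys plus a comma-separated group list, which the
    # OpenStack Ansible inventory already understands.
    new_meta = {
        "nodepool_provider_name": "example-cloud",
        "nodepool_node_id": "42",
        "groups": "ubuntu-xenial,bare",
    }
    print(old_meta, new_meta)
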
Paul Belanger a6f4f6be9b Add nodepool-id to provider section
Currently, while testing zuulv3, we want to share the
infracloud-chocolate provider between 2 nodepool servers.  The current
issue is that if we launch nodes from zuulv3-dev.o.o, nodepool.o.o will
detect the nodes as leaked and delete them.

A way to solve this is to create a per-provider 'nodepool-id' so that
an admin can configure 2 separate nodepool servers to share the same
tenant.  The big reason for doing this is so we don't have to stand
up a duplicate nodepool-builder and upload duplicate images.

Change-Id: I03a95ce7b8bf06199de7f46fd3d0f82407bec8f5
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-02-27 15:16:57 -05:00
David Shrewsbury 3f42a89df9 Support launch failures in FakeProviderManager
Let's not use mock for testing launch failures. Instead, add an
attribute to FakeProviderManager that tells it how many times
successive calls to createServer() should fail.

Change-Id: Iba6f8f89de84b06d2c858b0ee69bc65c37ef3cf0
2017-02-21 12:59:53 -05:00
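
The counting-failures idea could look something like this (an illustrative stand-in, not the real FakeProviderManager):

    class FakeProviderManagerSketch:
        """Fail the next N createServer() calls, then succeed."""

        def __init__(self):
            self.createServer_fails = 0

        def createServer(self, name):
            if self.createServer_fails > 0:
                self.createServer_fails -= 1
                raise Exception("fake launch failure for %s" % name)
            return {"name": name, "status": "ACTIVE"}

    fake = FakeProviderManagerSketch()
    fake.createServer_fails = 2
    for i in range(3):
        try:
            print(fake.createServer("node-%d" % i))
        except Exception as e:
            print(e)
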
James E. Blair fe153656df Don't use taskmanagers in builder
ProviderManager is a TaskManager, and TaskManagers are intended
to serialize API requests to a single cloud from multiple threads.
Currently each worker in the builder has its own set of
ProviderManagers.  That means that we are performing cloud API calls
in parallel.  That's probably okay since we perform very few of them,
mostly image uploads and deletes.  And in fact, we probably want
to avoid blocking on image uploads.

However, there is a thread associated with each of these
ProviderManagers, and even though they are idle, in aggregate they
add up to a significant CPU cost.

This makes the use of a TaskManager by a ProviderManager optional
and sets the builder not to use it in order to avoid spawning these
useless threads.

Change-Id: Iaf6498c34a38c384b85d3ab568c43dab0bcdd3d5
2016-12-07 11:58:24 -08:00
Paul Belanger baf98e052b Use diskimage-builder checksum files
We recently added the ability for diskimage-builder to generate
checksum files. This means nodepool can validate DIBs and then pass
the contents to shade, saving shade from caclucating the checksums.

Change-Id: I4cd44bb83beb4839c2c2346af081638e61899d4d
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-11-30 12:48:34 -05:00
Joshua Hesketh e14162da13 Merge branch 'master' into feature/zuulv3
Does not include changes to force image deletion or not-run webapp etc.

Change-Id: I74c6c2c575b29e61bb39dca36a71a747cd464587
2016-11-30 21:18:48 +11:00
Monty Taylor 9dbce5a757
Remove unused function make_image_dict
We don't use this any more.

Change-Id: Ib95ed58718a4bbf9ca46bfccc5f24a8211755270
2016-11-29 08:50:13 -06:00
Monty Taylor 919981b652 Unsubvert image and flavor caching
Recent shade allows users to pass in image and flavor to create_server
by name. This results in a potential extra lookup to find the image and
flavor. Since nodepool is not using shade caching, this is causing our
nodepool-level caching to be subverted. Although getting nodepool to
use shade caching is an eventual project, that would be bad scope creep for now.
Just pass in the objects themselves, which keeps shade from attempting to
look them up. In the case where we have an image_id, put it into a
dict so that shade treats it as an object passed in and not a thing that
needs to be treated like a name_or_id.

Depends-On: I4938037decf51001ab5789ee383f6c7ed34889b1
Change-Id: Ic70b19ad5baf25413e20a658163ca718dce63bee
2016-09-01 22:43:49 +00:00
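
A hedged example of the calling convention with shade - pass the cached flavor record and wrap a known image id in a dict so that no name_or_id lookup is performed (the cloud name, flavor record, and image id below are placeholders):

    import shade

    cloud = shade.openstack_cloud(cloud="mycloud")    # reads clouds.yaml

    cached_flavor = {"id": "42", "ram": 8192, "name": "m1.large"}   # from a local cache
    image_id = "11111111-2222-3333-4444-555555555555"               # placeholder id

    server = cloud.create_server(
        name="example-node",
        image={"id": image_id},   # dict form: treated as an object, not a name_or_id
        flavor=cached_flavor,     # object form: no flavor lookup by name
        config_drive=True,
        wait=True,
    )
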
Paul Belanger f1dfb117b0
Default config-drive to true
As we depend more and more on glean to help bootstrap a node, it is
possible for new clouds added to nodepool.yaml to be missing the
setting, which results in broken nodes and multiple configuration
updates.

As a result, we now default config-drive to true to make it easier to
bring nodes online.

Change-Id: I4e214ba7bc43a59ddffb4bfb50576ab3b96acf69
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-08-01 16:54:49 -04:00
Monty Taylor 60e49f110b
Cleanup leaked floating ips periodically
We should not have leaked floating ips in a neutron setup. However,
it sometimes seems to happen around startup. It's
also safe in a neutron context to just clean the unattached ones, so
assume that sometimes clouds get into weird states and just clean them.

Change-Id: I1a30efb3b7994381592c2391881711d6b1f32dff
Depends-On: I93b0c7d0b0eefdfe0fb1cd4a66cdbba9baabeb09
2016-05-09 04:28:10 -05:00
James E. Blair c64e27be15 Make stopping more reliable
* Builders were interfering with the gear shutdown procedure
  by overriding the use of the 'running' variable on gear workers.
  Instead, just rely on the built-in shutdown process in the gear
  worker class.
* Have the builder shutdown provider managers as well.
* Correctly handle signals in the builder.
* Have the nodepool daemon shut down its gearman client.
* Use a condition object so that we can interrupt the main loop
  sleep and exit faster.

Both the builder and the daemon now exit cleanly on CTRL-C when
run in the foreground.

Change-Id: Iefd5ef7df74e701725f4bafe4df51b8276088fe5
2016-04-18 08:51:17 -07:00
James E. Blair 2e05f1850f Restore ability to run nodepoold with fakes
With OSC and shade patches, we lost the ability to run nodepoold
in the foreground with fakes.  This restores that ability.

The shade integration unit tests are updated to use the string
'real' rather than 'fake' in config files, as they are trying to
avoid actually using the nodepool fakes, and the use of the string
'fake' is what triggers their use in many cases.

Change-Id: Ia5d3c3d5462bc03edafcc1567d1bab299ea5d40f
2016-04-18 08:47:46 -07:00
Monty Taylor f0b0ba8a0a
Don't get extra flavor specs
It's not a big deal because we cache this - but we don't care at all
about the extra flavor specs, so skip fetching them for each of the
flavors.

Change-Id: Iff73bdbe598fcf7556eafc484325f79452975a4f
2016-04-16 11:15:48 -05:00
Jenkins c7f8c2be9f Merge "Pass extended network information in to occ/shade" 2016-04-14 20:17:47 +00:00
Monty Taylor 2a30810b2e
Pass extended network information in to occ/shade
We need to know which networks are public/private, which we already have
in nodepool, but were not passing in to the OCC constructor. We also
need to be able to indicate which network should be the target of NAT in
the case of multiple private networks, which can be done via
nat_destination and the new networks list argument support in OCC.
Finally, 'use_neutron' is purely the purview of shade now, so remove it.

Depends-On: I0d469339ba00486683fcd3ce2995002fa0a576d1
Change-Id: I70e6191d60e322a93127abf4105ca087b785130e
2016-04-14 13:27:09 -05:00
James E. Blair cb5a6908fb Only delete keypairs if needed
This restores some logic that was inadvertently removed in the
shade transition, without which, we issue an extra delete keypair
API call for every server delete.

Change-Id: Ib1f50c23d61c1d874f2b235fd57d2a2b0defd6c5
2016-04-01 10:15:16 -07:00
Monty Taylor df45798508 Remove unused functions
We don't use these in shade-world anymore.

Change-Id: Ib4771af9f9f30cfa27020282b6fb8f3823af0db8
2016-03-30 16:23:49 -07:00
Monty Taylor e1f4a12949 Use shade for all OpenStack interactions
We wrote shade as an extraction of the logic we had in nodepool, and
have since expanded it to support more clouds. It's time to start
using it in nodepool, since that will allow us to add more clouds
and also to handle a wider variety of them.

Making a patch series was too tricky because of the way fakes and
threading work, so this is everything in one stab.

Depends-On: I557694b3931d81a3524c781ab5dabfb5995557f5
Change-Id: I423716d619aafb2eca5c1748bc65b38603a97b6a
Co-Authored-By: James E. Blair <jeblair@linux.vnet.ibm.com>
Co-Authored-By: David Shrewsbury <shrewsbury.dave@gmail.com>
Co-Authored-By: Yolanda Robla <yolanda.robla-mota@hpe.com>
2016-03-26 10:23:25 +01:00
James E. Blair afdd58c10a Log shade inner exceptions
With the dependent change, shade now stores inner
exceptions if they occur.  Wrap our use of shade
with a context manager that logs the inner exceptions
in nodepool's own logging context.

Change-Id: I6be2422aa0352ee9f0ff7429ee6e66384c2b5d57
Depends-On: I33269743a8f62b863569130aba3cc9b5a8539aa0
2016-03-23 08:24:31 +01:00
Monty Taylor eed395d637 Be more specific in logging timeout exceptions
At the moment, grepping through logs to determine what's happening with
timeouts on a provider is difficult because for some errors the cause of
the timeout is on a different line than the provider in question.

Give each timeout a specific named exception, and then when we catch the
exceptions, log them specifically with node id, provider and then the
additional descriptive text from the timeout exception. This should
allow for easy grepping through logs to find specific instances of
types of timeouts - or of all timeouts. Also add a corresponding success
debug log so that comparative greps/counts are also easy.

Change-Id: I889bd9b5d92f77ce9ff86415c775fe1cd9545bbc
2016-03-04 17:42:09 -06:00
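
A rough sketch of the named-exception pattern described above (the exception names and the waiting helper are illustrative, not nodepool's exact classes):

    import logging

    logging.basicConfig()
    log = logging.getLogger("nodepool")

    class TimeoutException(Exception):
        """Base class so every timeout can be caught and logged uniformly."""

    class ServerCreateTimeoutException(TimeoutException):
        pass

    class IPAddTimeoutException(TimeoutException):
        pass

    def wait_for_server(node_id, provider):
        # Placeholder that always times out, to exercise the logging path.
        raise ServerCreateTimeoutException("Timeout waiting for server creation")

    try:
        wait_for_server("0000001", "example-provider")
    except TimeoutException as e:
        # One greppable line with node id, provider, and the specific timeout type.
        log.error("Node 0000001 in example-provider: %s: %s",
                  type(e).__name__, e)
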
Monty Taylor 536f7feab0 Add an error log with the server fault message
In case there is useful debug information in the server fault message,
log it so that we can try to track down why servers go away.

Change-Id: I33fd51cbfc110fdb1ccfa6bc30a421d527f2e928
2016-03-03 01:36:49 +00:00