Commit Graph

40 Commits

Author SHA1 Message Date
James E. Blair efcb814005 Add API timing debug statements to openstack driver
This can help identify performance issues.

It also adds a timer to the keyscan for all statemachine drivers.

Change-Id: I389bd425458c05fc99c7b9f4640de7796cdafc06
2023-02-28 07:49:51 +01:00
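
Timing statements of the kind this commit describes could take the shape of a small context manager that logs how long a wrapped API call took. This is a minimal sketch, not the code from the change: the `Timer` class, the logger name, and the log format are assumptions made for illustration.

```python
import logging
import time


class Timer:
    """Log the elapsed wall-clock time of a block (hypothetical helper)."""

    def __init__(self, log, msg):
        self.log = log
        self.msg = msg
        self.elapsed = None

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.monotonic() - self.start
        self.log.debug("%s in %.3fs", self.msg, self.elapsed)
        return False  # never swallow exceptions from the timed block


log = logging.getLogger("example.timing")  # assumed logger name

with Timer(log, "API call list_servers") as timer:
    time.sleep(0.01)  # stand-in for a real cloud API call
```

Using a monotonic clock keeps the measurement immune to wall-clock adjustments while the call is in flight.
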
James E. Blair 6320b06950 Add support for dynamic tags
This allows users to create tags (or properties in the case of OpenStack)
on instances using string interpolation values.  The use case is to be
able to add information about the tenant* which requested the instance
to cloud-provider tags.

* Note that ultimately Nodepool may not end up using a given node for
the request which originally prompted its creation, so care should be
taken when using information like this.  The documentation notes that.

This feature uses a new configuration attribute on the provider-label
rather than the existing "tags" or "instance-properties" because existing
values may not be safe for use as Python format strings (e.g., an
existing value might be a JSON blob).  This could be solved with YAML
tags (like !unsafe) but the most sensible default for that would be to
assume format strings and use a YAML tag to disable formatting, which
doesn't help with our backwards-compatibility problem.  Additionally,
Nodepool configuration does not use YAML anchors (yet), so this would
be a significant change that might affect people's use of external tools
on the config file.

Testing this was beyond the ability of the AWS test framework as written,
so some redesign for how we handle patching boto-related methods is
included.  The new approach is simpler, more readable, and flexible
in that it can better accommodate future changes.

Change-Id: I5f1befa6e2f2625431523d8d94685f79426b6ae5
2022-08-23 11:06:55 -07:00
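
The backwards-compatibility concern in this commit message can be illustrated with plain Python string formatting. The sketch below is not Nodepool's implementation: `format_tags` and the `tenant_name` key are hypothetical names chosen for the example.

```python
def format_tags(dynamic_tags, **values):
    """Interpolate each tag value as a Python format string."""
    return {key: value.format(**values) for key, value in dynamic_tags.items()}


# A value written with interpolation in mind formats cleanly.
tags = format_tags({"requested_by": "tenant-{tenant_name}"},
                   tenant_name="example")

# An existing static value such as a JSON blob is not a valid format
# string: its braces are parsed as replacement fields.
json_blob = '{"team": "infra"}'
try:
    json_blob.format(tenant_name="example")
    blob_is_safe = True
except (KeyError, IndexError, ValueError):
    blob_is_safe = False
```

Because formatting the JSON blob fails, applying interpolation to the existing "tags" values could break configurations that worked before, which is the reason the message gives for a separate attribute.
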
Paul Belanger 16d192c60b First ensure ssh connection is valid before scanning keys
We have a network appliance we test via nested virt. While the outer
node is live and the port we nodescan is open, the nested node is still
booting up SSHd, which causes nodescan to return:

  paramiko.ssh_exception.SSHException: Error reading SSH protocol banner

until SSHd is properly running.

Previously we set our boot-timeout to 5 mins, to allow for the nested
SSHd to come online properly. This should restore that functionality.

Change-Id: I7f43530ee77a81f7c969d548190a71bfb9b03455
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2021-07-19 08:58:56 -04:00
Clark Boylan cb1860565f Have nodepool scan as many ssh host keys as possible
We are seeing users try to enable FIPS on their test nodes. This
presents a problem because the ssh host key which we've been using is an
ed25519 key, which FIPS disables. FIPS forces the use of another key
which ansible doesn't trust, and subsequent ssh connections fail.

Address this by trying to scan all available host keys on the server and
not just the first one that paramiko returns.

Change-Id: Ibb2a07a29681dcefd4017eb2fd6134ee33ab726c
2020-11-04 08:30:23 -08:00
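
The idea of collecting every host key rather than only the first can be sketched as a loop over key types. This is an illustration under stated assumptions: `scan_one` is a hypothetical callable standing in for a single SSH negotiation restricted to one key type, and the key strings are placeholders rather than real key material.

```python
def scan_all_keys(key_types, scan_one):
    """Collect one host key line per key type the server offers.

    scan_one(key_type) is a hypothetical callable that negotiates with
    the server using only key_type and returns the key line, or raises
    an exception if the server does not offer that type.
    """
    keys = []
    for key_type in key_types:
        try:
            keys.append(scan_one(key_type))
        except Exception:
            # The server does not offer this key type; try the next one.
            continue
    return keys


def fake_scan(key_type):
    # Stand-in server that, like a FIPS-enabled host, has no ed25519 key.
    offered = {
        "ecdsa-sha2-nistp256": "ecdsa-sha2-nistp256 <placeholder>",
        "ssh-rsa": "ssh-rsa <placeholder>",
    }
    return offered[key_type]


keys = scan_all_keys(
    ["ssh-ed25519", "ecdsa-sha2-nistp256", "ssh-rsa"], fake_scan)
```

With every offered key recorded, a later client that is restricted to a subset of key types can still find a key it trusts.
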
Clark Boylan 6276562939 Use iterate_timeout in test waits
This ensures that we don't wait forever for tests to complete tasks.
This is particularly useful if you've disabled the global test timeout.

Change-Id: I0141e62826c3594ed20605cac25e39091d1514e2
2020-01-14 08:25:09 -08:00
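
A bounded wait of this kind can be written as a generator that yields until a deadline passes and then raises. The sketch below only illustrates the pattern; the signature and the exception type are assumptions and may differ from Nodepool's own `iterate_timeout` helper.

```python
import time


def iterate_timeout(max_seconds, purpose, interval=0.01):
    """Yield an attempt counter until max_seconds elapse, then raise."""
    start = time.monotonic()
    count = 0
    while time.monotonic() - start < max_seconds:
        count += 1
        yield count
        time.sleep(interval)
    raise TimeoutError("Timeout waiting for %s" % purpose)


# Typical use in a test: poll until a condition holds, or fail loudly.
attempts = 0
for attempts in iterate_timeout(1, "the counter to reach 3"):
    if attempts >= 3:
        break

# When the condition never holds, the loop ends with an exception
# instead of spinning forever.
timed_out = False
try:
    for _ in iterate_timeout(0.05, "a condition that never holds"):
        pass
except TimeoutError:
    timed_out = True
```
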
Fabien Boucher f57ac1881a Remove unneeded shebang and exec bit on some files
Having python files with exec bit and shebang defined in
/usr/lib/python-*/site-packages/ is not fine in an RPM package.

Instead of carrying a patch in the nodepool RPM packaging, it is better
to fix this directly upstream.

Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
2019-12-13 19:30:03 +01:00
Tobias Henkel 6899e19dfc Improve connection timeout log message
The log message doesn't contain the target ip address which can be
crucial.

Change-Id: Iff674a56267f416114b6bfd6203f3ac76bb5d569
2019-01-23 12:59:59 +01:00
Tobias Henkel e925327309 Reduce socket connect timeout in nodescan
During nodescan we currently set a socket timeout which is equal to
the timeout we wait for the entire boot. In case we have unfortunate
timing of the network interface setup of the node (especially Windows
does this very late in the boot process) we get longer wait times than
necessary. This happens because uninitialized network interfaces on
the node lead to unanswered syn packets instead of connection refused
errors. Linux typically does around 6 syn retries with an exponential
backoff starting with 3s. This means the delay between syn retries is
3, 6, 12, ... seconds and thus in absolute time a single socket connect
can return after 0, 3, 9, 21, 45, 93 or 189 seconds.

This can be solved by setting a fixed lower timeout on the socket to
force it to return with timeout after 10s so we can avoid the
exponential syn retry backoff and thus don't waste too much time on
slower starting nodes.

Change-Id: Ibabdff1966d49752e86e15a1c2a24dd2c86d33f6
2018-10-09 11:17:51 +02:00
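
The backoff arithmetic in this commit message can be checked directly: a 3-second initial delay that doubles on each of 6 retries gives cumulative return times of 0, 3, 9, 21, 45, 93 and 189 seconds. The sketch below computes that series and shows a fixed socket timeout being set. The function names are hypothetical, and the 3s/6-retry figures are the ones quoted in the message, not values read from a running kernel.

```python
import socket


def syn_return_times(initial=3, retries=6):
    """Cumulative seconds at which a blocking connect() may return."""
    times = [0]
    delay = initial
    for _ in range(retries):
        times.append(times[-1] + delay)
        delay *= 2  # exponential backoff between syn retries
    return times


def make_scan_socket(timeout=10):
    """Create a TCP socket whose connect() gives up after timeout seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    return sock


times = syn_return_times()

sock = make_scan_socket()
configured = sock.gettimeout()
sock.close()
```

With a 10-second cap, a connect attempt that would otherwise sit until the 21- or 45-second mark returns early, and the caller can retry with a fresh syn.
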
Tobias Henkel 687f120b3c Add connection-port to provider diskimage
The connection port should be included in the provider diskimage.
This makes it possible to define images using other ports for
connections, such as winrm for Windows, which runs on a different port
than 22.

Change-Id: Ib4b335ffbcc4dc71704c06387377675a4206c663
2018-04-03 17:48:52 +02:00
Tobias Henkel 2da274e2ae Don't gather host keys for non ssh connections
In case of an image with the connection type winrm we cannot scan the
ssh host keys. So in case the connection type is not ssh we
need to skip gathering the host keys.

Change-Id: I56f308baa10d40461cf4a919bbcdc4467e85a551
2018-04-03 17:31:45 +02:00
Tristan Cacqueray 6ac2f33cb3 Implement a static driver for Nodepool
This change adds a static node driver.

Change-Id: I065f2b42af49a57218f1c04a08b3ddd15ccc0832
Story: 2001044
Task: 4615
2018-01-31 03:55:56 +00:00
Tristan Cacqueray 318e899b89 nodeutils: use socket.getaddrinfo instead of ipaddress
This change uses getaddrinfo in nodeutils.keyscan to seamlessly support
both IP addresses and hostnames.

Change-Id: If36d3180e588a6e6e6c63792d384b9a1e05f6fa0
2018-01-30 01:05:21 +00:00
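
`socket.getaddrinfo` accepts either an IP literal or a hostname and returns the address family together with a ready-to-use socket address, which is what makes it a drop-in replacement here. The sketch below is a minimal illustration; the `resolve` helper is a hypothetical name, not the code in nodeutils.

```python
import socket


def resolve(host, port=22):
    """Return (family, sockaddr) pairs for an IP literal or a hostname."""
    results = []
    for family, _type, _proto, _canonname, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        results.append((family, sockaddr))
    return results


# An IPv4 literal resolves without any DNS lookup.
addrs = resolve("127.0.0.1")
```

The returned family can be passed straight to `socket.socket()`, so the caller does not need to parse the address to decide between IPv4 and IPv6.
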
Tobias Henkel 7d79770840 Do pep8 housekeeping according to zuul rules
The pep8 rules used in nodepool are somewhat broken. In preparation to
use the pep8 ruleset from zuul we need to fix the findings upfront.

Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
2018-01-17 02:17:45 +00:00
James E. Blair 559b01cfa0 Add timeout for ssh negotiation on keyscan
We had a launch thread stuck here:

Thread: NodeLauncher-0000341123 (140201917658880)
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.5/dist-packages/nodepool/driver/openstack/handler.py", line 245, in run
    self._run()
  File "/usr/local/lib/python3.5/dist-packages/nodepool/driver/openstack/handler.py", line 216, in _run
    self._launchNode()
  File "/usr/local/lib/python3.5/dist-packages/nodepool/driver/openstack/handler.py", line 201, in _launchNode
    interface_ip, timeout=self._provider.boot_timeout)
  File "/usr/local/lib/python3.5/dist-packages/nodepool/nodeutils.py", line 74, in keyscan
    t.start_client()
  File "/usr/local/lib/python3.5/dist-packages/paramiko/transport.py", line 489, in start_client
    event.wait(0.1)
  File "/usr/lib/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)

This adds a timeout to that method so paramiko won't get stuck there.

Change-Id: I038d88cb141f57b93d8572c067e714f4a3af9c2d
2017-10-20 16:58:11 -04:00
Tristan Cacqueray a5077fc344 Add support for custom ssh port
This change adds 'ssh_port' to the Node class.

Change-Id: I5e6d3969ae04f90abd1a3fd908c160cda4791bad
2017-07-06 06:34:28 +00:00
Jenkins 399503f3ac Merge "Set socket timeout for SSH keyscan" into feature/zuulv3 2017-06-05 13:32:53 +00:00
Jenkins cd3f625bc8 Merge "Fix base64 encoding of server key" into feature/zuulv3 2017-05-26 17:27:58 +00:00
David Shrewsbury da3b769e1a Fix base64 encoding of server key
Change-Id: Ifc5d39f5a3d4f175ea149bcabbfa8c6c67b4df0b
2017-05-26 12:17:21 -04:00
Paul Belanger 93b516d978 Update keyscan for python3 compat
Use six.text_type since unicode() doesn't exist for python3.

Change-Id: I3628759c46f44429471aa394dee5056e191e4a05
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-26 13:17:41 +00:00
David Shrewsbury d1fb0d402e Fix socket.error exception usage
This exception is not subscriptable in py3, but the proper way to
get to the errno in any version is to access the 'errno' attribute.

Change-Id: I9a2e23cee358ff0f573f29962ab03525bfd40974
2017-05-26 08:43:00 -04:00
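
The difference this commit describes can be reproduced without any network access. The sketch below raises a `socket.error` by hand and reads its error number both ways; the variable names are illustrative only.

```python
import errno
import socket

try:
    raise socket.error(errno.ECONNREFUSED, "Connection refused")
except socket.error as e:
    # Works on Python 2 and Python 3.
    code = e.errno
    # The Python 2 idiom e[0] fails on Python 3.
    try:
        e[0]
        subscriptable = True
    except TypeError:
        subscriptable = False
```
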
Paul Belanger 230c7c5203 Set socket timeout for SSH keyscan
When we switched from paramiko client to paramiko transport we failed to
properly set up a timeout.

Change-Id: Ia25c7f31a55d0d6e6bd42b2b266f41a4a2daf8ba
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-18 16:03:11 +00:00
Paul Belanger d892837cad Fix imports for python3
The syntax for imports has changed for python3, let's use the new
syntax.

Change-Id: Ia985424bf23b44e492f51182179d2e476cdcccbb
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-05-17 15:19:48 -04:00
Paul Belanger d0c25fc333 Remove SSH support from nodepool
As we move forward with zuulv3, we no longer need the ability to SSH
into a node from nodepool-launcher. This means we can remove SSH
private keys from the production server. Now we only keyscan the node and
pass the info to zuul to do SSH operations.

We also create our own socket now for paramiko, so we can better
control the exception handling.

Change-Id: I123631aa41fd3db374ef78cf97a8b8afde93f699
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2017-03-24 11:44:58 -04:00
David Shrewsbury 88042886be Record SSH public keys for new nodes in ZK
Change-Id: I3ad63196d584d8dc93a8bcdd9b211f8f6a65bf2f
Story: 2000897
2017-03-13 17:16:04 -04:00
Paul Belanger aebd030a32 Retry SSHExceptions in nodepool
Today, when SSHExceptions are raised, nodepool will abort communication
with the node. Now, nodepool will properly trap them and try again
until the SSH timeout has been reached.

This helps with potential race conditions with openssh-server and
nodepool, where nodes would restart sshd after nodepool has
established a connection.

Change-Id: I40bfa1b1af6e4e75f8f14c597c28407ed08023de
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-10-05 12:11:47 -04:00
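
The retry behaviour described in this commit amounts to trapping the exception and trying again until a deadline. The sketch below shows that pattern with a stand-in exception class so that it runs without paramiko; `connect_with_retry` and `flaky_connect` are hypothetical names, not functions from nodepool.

```python
import time


class SSHException(Exception):
    """Stand-in for paramiko.ssh_exception.SSHException."""


def connect_with_retry(connect, timeout, interval=0.01):
    """Call connect() until it succeeds or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return connect()
        except SSHException:
            if time.monotonic() >= deadline:
                raise  # out of time: surface the last failure
            time.sleep(interval)


calls = []


def flaky_connect():
    # Fails twice, as if sshd were restarting, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise SSHException("Error reading SSH protocol banner")
    return "connected"


result = connect_with_retry(flaky_connect, timeout=5)
```
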
Paul Belanger c4d19c1a18 Include ip address for ssh_connect exception
Add some sort of server information about our failed ssh_connect
attempts.  Currently we don't expose any information about the host.

2016-08-23 16:26:11,894 ERROR nodepool.utils: Exception while testing ssh access:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nodepool/nodeutils.py", line 55, in ssh_connect
    client = SSHClient(ip, username, **connect_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/nodepool/sshclient.py", line 30, in __init__
    key_filename=key_filename)
  File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 305, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/local/lib/python2.7/dist-packages/paramiko/util.py", line 270, in retry_on_signal
    return function()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 305, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out

Change-Id: I5705798c91b228a7be2788c33c5a128653b24bbe
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
2016-08-23 14:08:25 -04:00
Monty Taylor eed395d637 Be more specific in logging timeout exceptions
At the moment, grepping through logs to determine what's happening with
timeouts on a provider is difficult because for some errors the cause of
the timeout is on a different line than the provider in question.

Give each timeout a specific named exception, and then when we catch the
exceptions, log them specifically with node id, provider and then the
additional descriptive text from the timeout exception. This should
allow for easy grepping through logs to find specific instances of
types of timeouts - or of all timeouts. Also add a corresponding success
debug log so that comparative greps/counts are also easy.

Change-Id: I889bd9b5d92f77ce9ff86415c775fe1cd9545bbc
2016-03-04 17:42:09 -06:00
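
The approach in this commit, one named exception per kind of timeout plus a single log line carrying node, provider, and cause, can be sketched as follows. The class names, the message format, and the sample values are assumptions for illustration and are not taken from the change itself.

```python
import logging


class TimeoutException(Exception):
    """Base class so that all timeouts can be caught (and grepped) together."""


class SSHTimeoutException(TimeoutException):
    pass


class IPAddTimeoutException(TimeoutException):
    pass


def describe(node_id, provider, exc):
    """Build one log line carrying node id, provider, and the cause."""
    return "Timeout launching node id: %s in provider: %s, error type: %s: %s" % (
        node_id, provider, type(exc).__name__, exc)


log = logging.getLogger("example.launcher")  # assumed logger name

try:
    raise SSHTimeoutException("ssh access not available in time")
except TimeoutException as e:
    line = describe("0000000042", "example-cloud", e)
    log.error(line)
```

Because the exception name and the provider share a line, a grep for the provider name or for `TimeoutException` each returns complete records.
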
Antoine Musso 90e3812c49 Enhance message for image ssh auth
nodeutils.ssh_connect() offers an info message which suggests we
attempted to connect to an instance using password authentication:

    Password auth exception. Try number 5...

Change message to be more generic
Include ip and username to better differentiate messages in the log
spam.

Example output:

   Auth exception for debian@10.0.0.42, Try number 5...

Change-Id: Iea3c1cf3ae30919cbc6d147e16d383da91df5d75
2016-02-26 19:49:26 +00:00
James E. Blair 5bcccf7605 Suppress NoValidConnectionsError from paramiko
New versions of paramiko wrap exceptions from multiple connection
attempts for multiple address families into one
NoValidConnectionsError exception.  It is a subclass of socket.error
but with an errno set to None.  Just check for that and ignore it
to suppress log entries on perfectly normal connection failures.

Change-Id: If64ab66dcc6db7c1886fb72f36078f7f819d6506
2015-11-25 17:22:43 -08:00
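
The check described in this commit can be sketched without paramiko by using a stand-in exception that, like the one the message describes, subclasses `socket.error` and carries an errno of None. The `should_log` helper and the particular set of ignored error numbers are assumptions for illustration.

```python
import errno
import socket


class NoValidConnectionsError(socket.error):
    """Stand-in for the paramiko exception: a socket.error with errno None."""

    def __init__(self, message):
        super().__init__(None, message)


def should_log(exc):
    """Return True only for socket errors that deserve a log entry."""
    if exc.errno is None:
        # Wrapped multi-address failure; routine while a node is booting.
        return False
    # Assumed set of other routine failures.
    return exc.errno not in (errno.ECONNREFUSED, errno.EHOSTUNREACH)


wrapped = NoValidConnectionsError("unable to connect on any address")
refused = socket.error(errno.ECONNREFUSED, "Connection refused")
broken = socket.error(errno.EPIPE, "Broken pipe")
```
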
Jerry Zhao 5fe5ef8311 add option to use ipv6 for image update and node launching
add option to use ipv6 as ssh connect ip for building snapshot
image and launching jenkins slaves.

Conflicts:
	doc/source/configuration.rst
	nodepool/nodepool.py

Change-Id: I7e023e7581fc0b5ec1ee34d1e5a1eeaacd7d3bfd
2015-06-18 17:51:36 -07:00
Clark Boylan 047f972866 Remove duplicate code
Make testing easier by removing a copy of a method from the
provider_manager. Instead import this method from nodeutils.

Change-Id: I68addb82826c2ce5ee89e120d5f1958fde4f7f12
2015-03-10 17:38:25 -07:00
Christian Berendt e3dd94d65c Use except x as y instead of except x, y
According to https://docs.python.org/3/howto/pyporting.html the
syntax changed in Python 3.x. The new syntax is usable with
Python >= 2.6 and should be preferred to be compatible with Python3.

Enabled hacking check H231.

Change-Id: Ide60f971493440311f1dcc594e33d536beb925e5
2014-05-29 23:57:48 +02:00
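
The two spellings this commit refers to look like this in practice. The `parse_port` function is a made-up example; the old form appears only in a comment because it is a syntax error on Python 3.

```python
def parse_port(text):
    """Convert text to a port number, or return an error string."""
    try:
        return int(text)
    # Python 2 only spelling, rejected by Python 3:
    #     except ValueError, e:
    except ValueError as e:
        return "invalid port: %s" % e


port = parse_port("22")
error = parse_port("ssh")
```
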
Dan Prince 1963731f7d Retry ssh connections on auth failure.
Some cloud instance types (Fedora for example) create
the ssh user after sshd comes online. This allows
our ssh connection retry loop to handle this scenario
gracefully.

Change-Id: Ie345dea50fc2983112cd2e72826a708583d2712a
2014-02-19 16:07:40 -05:00
James E. Blair b1b8a569ef Add image logging
Log stdout/stderr from the image build process.  Use the provider
and image name in the log selector so that admins can route
appropriately (or at least grep).

Change-Id: I7bc74ebfca3184340b51b083695b3441f0924e83
2013-08-29 16:20:40 -07:00
Monty Taylor 1e190f5d57 Change use of error numbers to errno
The errno constants are more readable in the code.

Change-Id: I6cb4b61f4cf59f50969a7fc27cad35d9c90755f8
2013-08-28 14:56:06 -07:00
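
A short illustration of the readability gain, using only the standard `errno` module. The `is_transient` helper and the choice of error codes are hypothetical.

```python
import errno


def is_transient(code):
    """Return True for connection errors that are worth retrying."""
    # Named constants say what is meant; bare integers would not, and
    # their numeric values can differ between platforms.
    return code in (errno.ECONNREFUSED, errno.EHOSTUNREACH, errno.ETIMEDOUT)


# errno.errorcode maps a number back to its symbolic name for logging.
name = errno.errorcode[errno.ETIMEDOUT]
```
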
James E. Blair 0ec2246514 Add JenkinsManager
Same idea as a ProviderManager: serialize changes to each jenkins
server (with a rate limit).

Change-Id: I631d50dcfd13c29d2802c192d6e1ac7889256a90
2013-08-22 10:43:33 -07:00
James E. Blair 8dc6c870f2 Add ProviderManager
This is used to serialize all access to an individual provider
(nova client).  One ProviderManager is created for every provider
defined in the configuration.  Any actions that require interaction
with nova submit a task to the manager which processes them serially
with an appropriate delay to ensure that rate limits are not hit.

This solves not only rate-limit problems, but also ends multi-threaded
access to a single novaclient Client object.

Change-Id: I0cdaa747dac08cdbe4719cb6c9c220678b7a0320
2013-08-20 15:34:14 -07:00
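
The design described in this commit, a single worker that runs submitted tasks one at a time with a delay between them, can be sketched with a queue and a thread. This is an illustration of the pattern, not the ProviderManager code: `SerialManager`, `Task`, and the `rate` parameter are names assumed for the example.

```python
import queue
import threading
import time


class Task:
    """A unit of work whose submitter blocks until it has run."""

    def __init__(self, func):
        self.func = func
        self.done = threading.Event()
        self.result = None
        self.exception = None

    def wait(self):
        self.done.wait()
        if self.exception is not None:
            raise self.exception
        return self.result


class SerialManager(threading.Thread):
    """Run submitted tasks serially, at most one every `rate` seconds."""

    def __init__(self, rate):
        super().__init__(daemon=True)
        self.rate = rate
        self.queue = queue.Queue()

    def submit(self, func):
        task = Task(func)
        self.queue.put(task)
        return task.wait()

    def stop(self):
        self.queue.put(None)  # sentinel that ends the worker loop

    def run(self):
        last = None
        while True:
            task = self.queue.get()
            if task is None:
                break
            if last is not None:
                # Sleep off whatever remains of the rate-limit window.
                remaining = self.rate - (time.monotonic() - last)
                if remaining > 0:
                    time.sleep(remaining)
            try:
                task.result = task.func()
            except Exception as e:
                task.exception = e
            finally:
                last = time.monotonic()
                task.done.set()


manager = SerialManager(rate=0.01)
manager.start()

order = []
threads = [
    threading.Thread(target=manager.submit,
                     args=(lambda i=i: order.append(i),))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

manager.stop()
manager.join()
```

Only the worker thread ever calls the task functions, so a client object that is not thread-safe is touched from one thread only, and the delay keeps calls under the rate limit regardless of how many threads submit work.
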
James E. Blair a5a78ef441 Use a sensible SQLAlchemy session model
The existing db session strategy was inherited from a bunch of
shell scripts that ran once in a single thread and exited.

The surprising thing is that it even worked at all.  This change
replaces that "strategy" with one where each thread clearly
begins a new session as a context manager and passes that around
to functions that need the DB.  A thread-local session is used
for convenience and extra safety.

This also adds a fake provider that will produce fake images and
servers quickly without needing a real nova or jenkins.  This was
used to develop the database change.

Also some minor logging changes and very brief developer docs.

Change-Id: I45e6564cb061f81d79c47a31e17f5d85cd1d9306
2013-08-16 20:21:33 -07:00
James E. Blair a7144ff7d1 Require a target name when instantiating a node
This is effectively a required db field; without it, the watermark
calculation can be wrong until it's filled in, so make sure it's
there to start.

Also some minor logging changes.

Change-Id: Idc5a9cd40fe330f7a1aea4a5513267ee3c254f60
2013-08-15 17:49:44 -07:00
James E. Blair 5866f10601 Initial commit
Much of this comes from devstack-gate.

Change-Id: I7af197743cdf9523318605b6e85d2cc747a356c7
2013-08-15 09:47:23 -07:00