Commit Graph

68 Commits

James E. Blair fc95724601 Fix leaked upload cleanup
The cleanup routine for leaked image uploads based its detection
on upload ids, but they are not unique except in the context of
a provider and build.  This meant that, for example, as long as
there was an upload with id 0000000001 for any image build for
the provider (very likely!) we would skip cleaning up any leaked
uploads with id 0000000001.

Correct this by using a key generated on build+upload (provider
is implied because we only consider uploads for our current
provider).

Update the tests relevant to this code to exercise this condition.
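
A rough sketch of the keying change (illustrative names only, not the
actual Nodepool code):

    # Sketch: qualify upload ids with their build id so that
    # "0000000001" from different builds cannot collide.
    def upload_key(build_id, upload_id):
        return '%s-%s' % (build_id, upload_id)

    known = {upload_key(u.build_id, u.id) for u in zk_uploads}
    for leaked in provider_uploads:
        if upload_key(leaked.build_id, leaked.id) not in known:
            cleanup(leaked)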

Change-Id: Ic68932b735d7439ca39e2fbfbe1f73c7942152d6
2024-03-09 13:46:17 -08:00
Zuul 392cf017c3 Merge "Add support for AWS IMDSv2" 2024-02-28 02:46:53 +00:00
Clark Boylan e5c1790be7 Roll back to 1.28/stable microk8s in functional testing
We use latest/stable by default which very recently updated to
1.29/stable. Unfortunately it appears there are issues [0] with this
version on Debian Bookworm which also happens to be the platform we test
on. Our jobs have been consistently failing in a manner that appears
related to this issue. Update the job to collect logs so that we can
better confirm this is the case, and roll back to 1.28, which should be
working.

Also update the AWS tests to handle a recent moto release which
requires us to use mock_aws rather than individual mock_* classes.

[0] https://github.com/canonical/microk8s/issues/4361

Change-Id: I72310521bdabfc3e34a9f2e87ff80f6d7c27c180
Co-Authored-By: James E. Blair <jim@acmegating.com>
Co-Authored-By: Jeremy Stanley <fungi@yuggoth.org>
2024-01-29 14:15:54 -08:00
James E. Blair 3f4fb008b0 Add support for AWS IMDSv2
IMDSv2 is an authenticated HTTP metadata service that is typically
available by default, but a more secure setup is to enforce its
use.

This change adds the ability to do that for both instances and
AMIs.
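
For reference, a minimal boto3 sketch of enforcing IMDSv2 at launch
(the ids here are placeholders):

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    ec2.run_instances(
        ImageId='ami-12345678',      # placeholder AMI id
        InstanceType='t3.micro',
        MinCount=1, MaxCount=1,
        # 'required' rejects IMDSv1 requests; 'optional' allows both.
        MetadataOptions={'HttpTokens': 'required',
                         'HttpEndpoint': 'enabled'},
    )
    # For AMIs, register_image accepts ImdsSupport='v2.0' to require
    # IMDSv2 on instances launched from the image.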

Change-Id: Ia8554ff0baec260289da0574b92932b37ffe5f04
2024-01-24 15:11:35 -08:00
Simon Westphahl 2aeaee92f1 Ignore unrelated error labels in request handler
Nodepool was declining node requests when other unrelated instance types
of a provider were unavailable:

    Declining node request <NodeRequest {... 'node_types': ['ubuntu'],
    ... }> due to ['node type(s) [ubuntu-invalid] not available']

To fix this, we now check error labels against the requested labels
before including them in the list of invalid node types.
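
Schematically, the fix looks something like this (names are
illustrative):

    # Only error labels that were actually requested may cause a
    # decline; unrelated broken labels are ignored.
    requested = set(request.node_types)        # e.g. {'ubuntu'}
    errored = set(provider_error_labels)       # e.g. {'ubuntu-invalid'}
    invalid = sorted(requested & errored)
    if invalid:
        decline(request, 'node type(s) %s not available' % invalid)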

Change-Id: I7bbb3b813ca82baf80821a9e84cc10385ea95a01
2023-11-09 13:58:54 +01:00
James E. Blair 8669acfe6b Use a state machine for ssh key scanning
We currently use a threadpool executor to scan up to 10 nodes at
a time for ssh keys.  If they are slow to respond, that can create
a bottleneck.  To alleviate this, use a state machine model for
managing the process, and drive each state machine from a single
thread.

We use select.epoll() to handle the potentially large number of
connections that could be happening simultaneously.

Note: the paramiko/ssh portion of this process spawns its own
thread in the background (and always has).  Since we are now allowing
more keyscan processes in parallel, we could end up with an
unbounded set of paramiko threads in the background.  If this is
a concern we may need to cap the number of requests handled
simultaneously.  Even if we do that, this will still result in
far fewer threads than simply increasing the cap on the threadpool
executor.
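
A minimal sketch of the epoll-driven connect phase
(handoff_to_paramiko is a stand-in for the real keyscan state
machine; epoll is Linux-specific):

    import select
    import socket

    poller = select.epoll()
    pending = {}  # fd -> socket awaiting connect completion

    def start_scan(host):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        try:
            s.connect((host, 22))
        except BlockingIOError:
            pass  # expected for a non-blocking connect
        poller.register(s.fileno(), select.EPOLLOUT)
        pending[s.fileno()] = s

    def run_once(timeout=1.0):
        # One thread drives every in-flight connection.
        for fd, _ in poller.poll(timeout):
            poller.unregister(fd)
            handoff_to_paramiko(pending.pop(fd))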

Change-Id: I42b76f4c923fd9441fb705e7bffd6bc9ea7240b1
2023-11-04 08:54:20 -07:00
James E. Blair 7d7d81cd46 AWS: improve service quota handling
The AWS API call to get the service quota has its own rate limit
that is separate from EC2.  It is not documented, but the defaults
appear to be very small; experimentally it appears to be something
like a bucket size of 30 tokens and a refill rate somewhere
between 3 and 10 tokens per minute.

This change moves the quota lookup calls to their own rate limiter
so they are accounted for separately from other calls.

We should configure that rate limiter with the new, very low values;
however, that would significantly slow startup, since we need to issue
several calls at once when we start (after that we are not sensitive
to a delay).  The API can handle a burst at startup (with a bucket
size of 30), but our rate limiter doesn't have a burst option.  Instead
of configuring it with the true limit, we just configure it with the
rate limit we use for normal operations (so that we at least have some
delay) and otherwise rely on caching to ensure that we won't actually
exceed the rate limit.

This change therefore also adds a Lazy Executor TTL cache to the
operations with a timeout of 5 minutes.  This means that we will issue
bursts of requests every 5 minutes, and as long as the number of
requests is less than the token replacement rate, we'll be fine.

Because this cache is on the adapter, multiple pool workers will use
the same cache.  This will cause a reduction in API calls since
currently there is only pool-worker level caching of nodepool quota
information objects.  When the 5 minute cache on the nodepool quota
info object expires, we will now hit the adapter cache (with its own
5 minute timeout) rather than go directly to the API repeatedly for
each pool worker.  This does mean that quota changes may take between
5 and 10 minutes to appear in nodepool.

The current code only looks up quota information for instance and
volume types actually used.  If that number is low, all is well, but
if it is high, then we could potentially approach or exceed the token
replacement rate.  To make this more predictable, we will switch the
API call to list all quotas instead of fetching only the ones we need.
Due to pagination, this results in a total of 8 API calls as of writing;
5 for ec2 quotas and 3 for ebs.  These are likely to grow over time,
but very slowly.

Taken all together, these changes mean that a single launcher should
issue at most 8 quota service api requests every 5 minutes, which is
below the lowest observed token replacement rate.
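
The cache behavior might be sketched like this (a hand-rolled TTL
cache standing in for the Lazy Executor TTL cache):

    import time
    import boto3

    _cache = {'expiry': 0, 'quotas': None}

    def all_quotas(client, service_code, ttl=300):
        # Refresh at most once per TTL; bursts of lookups in between
        # are served from the cache and cost no API calls.
        if time.time() > _cache['expiry']:
            quotas = []
            paginator = client.get_paginator('list_service_quotas')
            for page in paginator.paginate(ServiceCode=service_code):
                quotas.extend(page['Quotas'])
            _cache.update(expiry=time.time() + ttl, quotas=quotas)
        return _cache['quotas']

    client = boto3.client('service-quotas', region_name='us-east-1')
    ec2_quotas = all_quotas(client, 'ec2')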

Change-Id: Idb3fb114f5b8cda8a7b6d5edc9c011cb7261be9f
2023-10-17 14:36:37 -07:00
James E. Blair 40eb0fd8d4 Use ec2_client for instances and volumes
There are two modes of use for the boto3 interface library for AWS:
resource or client calls.  The resource approach returns magic objects
representing instances of each type of resource which have their own
methods and can lazy-load relationships to other objects.  The client
approach is more low-level and returns dictionary representations of
the HTTP results.

In general in the AWS driver, we try to use resources when we can
and fall back to client methods otherwise.

When we construct our own AwsInstance objects (which translate what we
get from AWS into what nodepool expects), we filled the "az" field by
getting the availability zone from the subnet.  The subnet is an object
relationship that is lazy-loaded, which means that each time we
retrieved an instance record, we issued another API call to look up
the subnet (and this information is not cached).

To correct this, we switch to getting the availability zone from the
placement field of the instance (which is more sensible anyway).

To prevent future accidental lazy-loads of un-cached data, we switch
instance and volume operations from the resource style of usage to
the client.

The result is a massive increase in performance (as the subnet lazy-load
could take 0.1 seconds for each instance running, and we very frequently
list all instances).
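
The client-style listing, with the AZ read from the placement field,
looks roughly like:

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    # Client calls return plain dicts; nothing is lazy-loaded, so no
    # hidden per-instance subnet lookups occur.
    for page in ec2.get_paginator('describe_instances').paginate():
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                az = instance['Placement']['AvailabilityZone']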

Change-Id: I529318896fc8096bbd9dbdac60d1a29c3ac641b6
2023-10-17 14:36:37 -07:00
Zuul 76fb25d529 Merge "Handle invalid AWS instance in quota lookup" 2023-09-25 21:38:34 +00:00
James E. Blair b3921d5c40 AWS: handle 'InvalidConversionTaskId.Malformed' error
When we call the describe_import_image_tasks paginator with a
task id specified and that task id does not exist, AWS can return
an InvalidConversionTaskId.Malformed error instead of the expected
empty list.  Handle that case and add a test for it.
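
The handling amounts to something like (a sketch; the surrounding
adapter code is elided):

    import botocore.exceptions

    try:
        tasks = []
        paginator = ec2.get_paginator('describe_import_image_tasks')
        for page in paginator.paginate(ImportTaskIds=[task_id]):
            tasks.extend(page['ImportImageTasks'])
    except botocore.exceptions.ClientError as e:
        code = e.response['Error']['Code']
        if code != 'InvalidConversionTaskId.Malformed':
            raise
        tasks = []  # treat the same as the expected empty list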

Change-Id: I1a3a9297e123a50a4beeb7e18124b8cc17aa28bb
2023-09-05 16:09:26 -07:00
Zuul 785f7dcbc9 Merge "AWS: Add support for retrying image imports" 2023-08-28 18:43:56 +00:00
Zuul 77e2550326 Merge "Test if username is set for diskimage based nodes in AWS" 2023-08-24 08:54:51 +00:00
James E. Blair c2d9c45655 AWS: Add support for retrying image imports
AWS has limits on the number of image import tasks that can run
simultaneously.  In a busy system with large images, it would be
better to wait until those limits clear rather than delete the
uploaded s3 object and start over, uploading it again.  To support
this, we now detect that condition and optionally retry for a
specified amount of time.

The default remains to bail on the first error.
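
A hedged sketch of the retry loop (the error code checked here is an
assumption, and the arguments are placeholders):

    import time
    import botocore.exceptions

    deadline = time.time() + image_import_timeout  # from config
    while True:
        try:
            ec2.import_image(
                Architecture='x86_64',
                DiskContainers=[{'Format': 'raw',
                                 'UserBucket': {'S3Bucket': bucket,
                                                'S3Key': key}}])
            break
        except botocore.exceptions.ClientError as e:
            # Assumed code for "too many concurrent import tasks".
            if (e.response['Error']['Code'] != 'ResourceCountLimitExceeded'
                    or time.time() > deadline):
                raise
            time.sleep(60)  # wait for a running import to finish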

Change-Id: I6aa7f79b2f73c4aa6743f11221907a731a82be34
2023-08-12 11:45:22 -07:00
Benedikt Loeffler 16ef0f8bad Test if username is set for diskimage based nodes in AWS
Change-Id: I4c13e95baf213831d02c31df0e91c82c2c86d3d8
2023-08-03 12:31:31 -07:00
James E. Blair a602654482 Handle invalid AWS instance in quota lookup
Early nodepool performed all cloud operations within the context of
an accepted node request.  This means that any disagreement between
the nodepool configuration and the cloud (such as what instance types,
images, networks, or other resources are actually available) would
be detected within that context and the request would be marked as
completed and failed.

When we added tenant quota support, we also added the possibility
of needing to interact with the cloud before accepting a request.
Specifically, we may now ask the cloud what resources are needed
for a given instance type (and volume, etc) before deciding whether
to accept a request.  If we raise an exception here, the launcher
will simply loop indefinitely.

To avoid this, we will add a new exception class to indicate a
permanent configuration error which was detected at runtime.  If
AWS says an instance type doesn't exist when we try to calculate
its quota, we mark it as permanently errored in the provider
configuration, then return empty quota information back to the
launcher.

This allows the launcher to accept the request, but then immediately
mark it as failed because the label isn't available.

The state of this error is stored on the provider manager, so the
typical corrective action of updating the configuration to correct
the label config means that a new provider will be spawned with an
empty error label list; the error state will be cleared and the
launcher will try again.

Finally, an extra exception handler is added to the main launcher
loop so that if any other unhandled errors slip through, the
request will be deferred and the launcher will continue processing
requests.
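
In outline (identifiers illustrative, not Nodepool's actual names):

    import botocore.exceptions

    def getQuotaForLabel(self, label):
        try:
            return self._quotaForInstanceType(label.instance_type)
        except botocore.exceptions.ClientError:
            # Permanent config error: remember the bad label and
            # report empty quota so the request is accepted, then
            # immediately failed rather than looping forever.
            self.error_labels.add(label.name)
            return QuotaInformation()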

Change-Id: I9a5349203a337ab23159806762cb46c059fe4ac5
2023-07-18 13:51:13 -07:00
James E. Blair acb6772c3a Add AWS volume quota support
Like the OpenStack driver, this automatically applies volume quota
limits if specified in the label configuration.

Change-Id: I71c1b95de08dc72cc777099952892de659d45d41
2023-07-17 15:17:50 -07:00
Zuul 63900e9bc4 Merge "Report leaked resource metrics in statemachine driver" 2023-05-02 23:29:19 +00:00
James E. Blair d4f2c8b9e7 Report leaked resource metrics in statemachine driver
The OpenStack driver reports some leaked metrics.  Extend that in
a generic way to all statemachine drivers.  Doing so also adds
some more metrics to the OpenStack driver.

Change-Id: I97c01b54b576f922b201b28b117d34b5ee1a597d
2023-04-26 06:40:12 -07:00
Zuul 58b24d2203 Merge "Amazon EC2 Spot support" 2023-04-26 11:40:34 +00:00
Christian Mueller 36dbff84ba Amazon EC2 Spot support
This adds support for launching Amazon EC2 Spot instances
(https://aws.amazon.com/ec2/spot/), which comes with huge cost saving
opportunities.

Amazon EC2 Spot instances are spare Amazon EC2 capacity that you can
get at a discount of up to 90% compared to on-demand pricing.
In contrast to on-demand instances, Spot instances can be reclaimed by
AWS with a two-minute interruption notice
(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html).

When :attr:`providers.[aws].pools.labels.use-spot` is set to True, the AWS
driver will launch Spot instances. If an instance gets interrupted, it will be
terminated and no replacement instance will be launched.
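
A minimal pool-label sketch using the new attribute (surrounding keys
abbreviated):

    providers:
      - name: ec2-us-east-1
        driver: aws
        pools:
          - name: main
            labels:
              - name: ubuntu-jammy-spot
                instance-type: t3.large
                use-spot: true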

Change-Id: I9868d014991d78e7b2421439403ae1371b33524c
2023-04-16 21:12:06 +02:00
James E. Blair b0a40f0b47 Use image cache when launching nodes
We consult ZooKeeper to determine the most recent image upload
when we decide whether we should accept or decline a request.  If
we accept the request, we also consult it again for the same
information when we start building the node.  In both cases, we
can use the cache to avoid what may potentially be (especially in
the case of a large number of images or uploads) quite a lot of
ZK requests.  Our cache should be almost up to date (typically
milliseconds, or at the worst, seconds behind), and the worst
case is equivalent to what would happen if an image build took
just a few seconds longer.  The tradeoff is worth it.

Similarly, when we create min-ready requests, we can also consult
the cache.

With those 3 changes, all references to getMostRecentImageUpload
in Nodepool use the cache.

The original un-cached method is kept as well, because there are
an enormous number of references to it in the unit tests and they
don't have caching enabled.

In order to reduce the chances of races in many tests, the startup
sequence is normalized to:
1) start the builder
2) wait for an image to be available
3) start the launcher
4) check that the image cache in the launcher matches what
   is actually in ZK

This sequence (apart from #4) was already used by a minority of
tests (mostly newer tests).  Older tests have been updated.
A helper method, startPool, implements #4 and additionally includes
the wait_for_config method which was used by a random assortment
of tests.

Change-Id: Iac1ff8adfbdb8eb9a286929a59cf07cd0b4ac7ad
2023-04-10 15:57:01 -07:00
James E. Blair fdc093a8de Add import_image support to AWS
In I9478c0050777bf35e1201395bd34b9d01b8d5795 we switched from using the
import_image method to import_snapshot in the AWS driver.  This method
is faster and closer to how other Nodepool drivers operate.  However, some operating
systems (such as Windows, RHEL or SLES) require licensing metadata
associated with an AMI which is not available to be set when we register
an AMI from a snapshot.  For these systems, the only viable way to upload
images is with the import_image method.

This change restores the previous method as an option, but keeps the
"snapshot" method as the default.

Change-Id: I81daabebbc9dbe968d8aaf65e6b70f5cdfdd01bf
2023-01-30 20:25:56 -08:00
Christian von Schultz a828513ae8 Fix AWS quota limits for vCPUs
In the AWS adapter, when getting the quota for an instance type, set
the quota for the AWS service quota code to be the number of vCPUs
rather than the number of cores. The number of vCPUs is typically
twice the number of cores. This fixes "VcpuLimitExceeded" errors from
AWS.
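
The distinction is visible in the instance type metadata, e.g.:

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    info = ec2.describe_instance_types(InstanceTypes=['m5.xlarge'])
    vcpu = info['InstanceTypes'][0]['VCpuInfo']
    # The service quota is charged in DefaultVCpus (typically
    # 2x DefaultCores), so that is the number to count against it.
    print(vcpu['DefaultVCpus'], vcpu['DefaultCores'])  # e.g. 4 2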

Change-Id: I880e6abb84b0527363893576057aa105a5a448a5
2022-12-14 14:13:47 +01:00
Zuul d73d4da72c Merge "Serve all paused handlers before unpausing" 2022-11-09 16:51:11 +00:00
James E. Blair 0673599abc Fix AWS instance_profile tests for moto v4
Moto v4 has added support for iam instance profiles, which means
our tests are now failing since we have two tests that request
an instance be created with a profile association, but the profile
did not exist in moto due to the lack of support.

Now that support has been added, we create a dummy instance profile
and use that for the association.  We can now also validate that
the association was made as expected.

Change-Id: I2839b7e1536a6305e44678dc9b22c71f3eb55677
2022-11-08 07:39:13 -08:00
Simon Westphahl 0f1680be7e Serve all paused handlers before unpausing
The launcher implementation assumed that only one request handler will
be paused at any given point in time. However, this is not true when
e.g. the request handler accepts multiple requests that all run into a
quota limit during launch.

The consequence of this is that the pool is unpaused too early and we
might accept other node requests until the provider is paused again.
This could lead to starvation of earlier paused handlers, as they were
fulfilled in a LIFO fashion.

To fix this edge case we will store paused request handlers in a set and
only unpause the provider when there are no paused handlers anymore.
Paused handlers are now also run in priority order.
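
Schematically (names illustrative):

    def _servePausedHandlers(self):
        # Run every paused handler, highest priority first; unpause
        # the provider only when none remain paused.
        for handler in sorted(self.paused_handlers,
                              key=lambda h: h.request.priority):
            if handler.run():  # True once the request is fulfilled
                self.paused_handlers.discard(handler)
        if not self.paused_handlers:
            self.unpause()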

Change-Id: Ia34e2844533ce9942d489838c4ce14a605d79287
2022-10-20 12:06:11 +02:00
James E. Blair 4ea824cfa9 Aws: add support for volume iops and throughput
Users can request specific IOPS and throughput allocations from EC2.
The availability and defaults vary for volume type, but IOPS are
available for all volumes, and throughput is available on gp3 volumes.

Change-Id: Icc7432d8ce1c3514bfe9d8fda20bd399b67ede7a
2022-10-14 07:08:30 -07:00
James E. Blair 6d3b5f3bab Add missing cloud/region/az/host_id info to nodes
To the greatest extent possible within the limitation of each provider,
this adds cloud, region, az, and host_id to nodes.

Each of AWS, Azure, GCE, and IBMVPC has the cloud name hard-coded to
a value that makes sense for each driver given that each of these
are singleton clouds.  Their region and az values are added as
appropriate.

The k8s, openshift, and openshiftpods all have their cloud names set
to the k8s context name, which is the closest approximation of what
the "cloud" attribute means in its existing usage in the OpenStack
driver.  If pods are launched, the host_id value is set to the k8s
host node name, which is an approximation of the existing usage in
the OpenStack driver (where it is typically an opaque uuid that
uniquely identifies the hypervisor).

Change-Id: I53765fc3914a84d2519f5d4dda4f8dc8feda72f2
2022-08-25 13:41:05 -07:00
James E. Blair 6320b06950 Add support for dynamic tags
This allows users to create tags (or properties in the case of OpenStack)
on instances using string interpolation values.  The use case is to be
able to add information about the tenant* which requested the instance
to cloud-provider tags.

* Note that ultimately Nodepool may not end up using a given node for
the request which originally prompted its creation, so care should be
taken when using information like this.  The documentation notes that.

This feature uses a new configuration attribute on the provider-label
rather than the existing "tags" or "instance-properties" because existing
values may not be safe for use as Python format strings (e.g., an
existing value might be a JSON blob).  This could be solved with YAML
tags (like !unsafe) but the most sensible default for that would be to
assume format strings and use a YAML tag to disable formatting, which
doesn't help with our backwards-compatibility problem.  Additionally,
Nodepool configuration does not use YAML anchors (yet), so this would
be a significant change that might affect people's use of external tools
on the config file.
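
As an illustration only (the attribute name and interpolation fields
shown are assumptions about the final syntax):

    labels:
      - name: ubuntu-jammy
        cloud-image: ubuntu-jammy
        dynamic-tags:
          # Values are Python format strings evaluated per-instance.
          requestor: '{request.requestor}'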

Testing this was beyond the ability of the AWS test framework as written,
so some redesign for how we handle patching boto-related methods is
included.  The new approach is simpler, more readable, and flexible
in that it can better accommodate future changes.

Change-Id: I5f1befa6e2f2625431523d8d94685f79426b6ae5
2022-08-23 11:06:55 -07:00
James E. Blair 916d62a374 Allow specifying diskimage metadata/tags
For drivers that support tagging/metadata (openstack, aws, azure),
add or enhance support for supplying tags for uploaded diskimages.

This allows users to set metadata on the global diskimage object
which will then be used as default values for metadata on the
provider diskimage values.  The resulting merged dictionary forms
the basis of metadata to be associated with the uploaded image.

The changes needed to reconcile this for the three drivers mentioned
above are:

All: the diskimages[].meta key is added to supply the default values
for provider metadata.

OpenStack: provider diskimage metadata is already supported using
providers[].diskimages[].meta, so no further changes are needed.

AWS, Azure: provider diskimage tags are added using the key
providers[].diskimages[].tags since these providers already use
the "tags" nomenclature for instances.

This results in the somewhat incongruous situation where we have
diskimage "metadata" being combined with provider "tags", but it's
either that or have images with "metadata" while we have instances
with "tags", both of which are "tags" in EC2.  The chosen approach
has consistency within the driver.

Change-Id: I30aadadf022af3aa97772011cda8dbae0113a3d8
2022-08-23 06:39:08 -07:00
James E. Blair 89cda5a1ba AWS: Use snapshot instead of image import
AWS recommends using the import_image method for importing diskimages
into the system as AMIs, but there is a significant caveat with that:
EC2 will attempt to boot an instance from the image and snapshot the
instance.  That process requires that the image match certain
characteristics of already existing images supported by AWS, so only
certain operating systems can be successfully imported.

If anything goes wrong with the import process, the errors are opaque
since the temporary instance used for the import is inaccessible to
the user.

An alternative is to use the import_snapshot method which will produce
a snapshot object in EC2, and then a new AMI can be "registered" and
pointed to that snapshot.  It's an extra step for the Nodepool
builder, but it's simple and takes less time overall than waiting for
EC2 to boot up the temporary instance.
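
The snapshot-based path, in rough boto3 terms (polling elided; names
are placeholders):

    def upload_via_snapshot(ec2, bucket, key, image_name):
        resp = ec2.import_snapshot(
            DiskContainer={'Format': 'raw',
                           'UserBucket': {'S3Bucket': bucket,
                                          'S3Key': key}})
        task_id = resp['ImportTaskId']
        # ...poll describe_import_snapshot_tasks(ImportTaskIds=[task_id])
        # until the task completes and yields a snapshot id...
        snapshot_id = wait_for_snapshot(ec2, task_id)  # placeholder
        resp = ec2.register_image(
            Name=image_name, Architecture='x86_64',
            RootDeviceName='/dev/sda1', VirtualizationType='hvm',
            BlockDeviceMappings=[{'DeviceName': '/dev/sda1',
                                  'Ebs': {'SnapshotId': snapshot_id}}])
        return resp['ImageId']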

This is a much closer approximation to the import scheme used in
other Nodepool drivers.

I have successfully tested this method with a cirros image, which is
notable for not being a supported operating system with the previous
method.

The caveats and instructions relating to setting up the Import/Export
service roles still apply, so the documentation related to them remain.

The method for reconciling missing tags for aborted image uploads
is updated to accommodate the new process.  Notably, while the import_image
method left a breadcrumb in the snapshot description, it does not appear
that we are able to set the description when we import the snapshot.
Instead we need to examine the snapshot import task list to find the
intended tags for a given snapshot or image.

Change-Id: I9478c0050777bf35e1201395bd34b9d01b8d5795
2022-08-04 14:04:13 -07:00
Zuul 123a32f922 Merge "AWS multi quota support" 2022-07-29 17:01:09 +00:00
James E. Blair 207d8ac63c AWS multi quota support
This adds support for AWS quotas that are specific to instance types.

The current quota support in AWS assumes only the "standard" instance types,
but AWS has several additional types with particular specialties (high memory,
GPU, etc).  This adds automatic support for those by encoding their service
quota codes (like 'L-1216C47A') into the QuotaInformation object.

QuotaInformation accepts not only cores, ram, and instances as resource
values, but now also accepts arbitrary keys such as 'L-1216C47A'.
Extra testing of QI is added to ensure we handle the arithmetic correctly
in cases where one or the other operand does not have a resource counter.

The statemachine drivers did not encode their resource information into
the ZK Node record, so tenant quota was not operating correctly.  This is
now fixed.

The AWS driver now accepts max-cores, max-instances, and max-ram values similar
to the OpenStack driver.  It additionally accepts max_resources which can
be used to specify limits for arbitrary quotas like 'L-1216C47A'.

The tenant quota system now also accepts arbitrary keys such as 'L-1216C47A'
so that, for example, high memory nodes may be limited by tenant.

The mapping of instance types to quota is manually maintained, however,
AWS doesn't seem to add new instance types too often, and those it does are
highly specialized.  If a new instance type is not handled internally, the
driver will not be able to calculate expected quota usage, but will still
operate until the new type is added to the mapping.
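
A sketch of the new limits in pool configuration (key spellings follow
the commit description and may differ from the final docs):

    providers:
      - name: ec2-us-east-1
        driver: aws
        pools:
          - name: main
            max-cores: 200
            max-resources:
              'L-1216C47A': 4   # high-memory instance quota code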

Change-Id: Iefdc8f3fb8249c61c43fe51b592f551e273f9c36
2022-07-25 14:41:07 -07:00
Zuul e27ce0e8d1 Merge "Fix race in test_aws_resource_cleanup" 2022-07-01 03:05:18 +00:00
James E. Blair 025318b825 Fix race in test_aws_resource_cleanup
If the test was slow enough that the image was not immediately
deleted, we would fail the assertIsNone because image.state=='available'.

Correct this by continuing to iterate in the wait loop while that's
true.

Change-Id: I6ee14c5737abe37d90d6b57442644085b08f30ab
2022-06-30 15:23:15 -07:00
James E. Blair d5b0dee642 AWS driver create/delete improvements
The default AWS rate limit is 2 instances/sec, but in practice, we
can achieve something like 0.6 instances/sec with the current code.
That's because the create instance REST API call itself takes more
than a second to return.  To achieve even the default AWS rate
(much less a potentially faster one which may be obtainable via
support request), we need to alter the approach.  This change does
the following:

* Parallelizes create API calls.  We create a threadpool with
  (typically) 8 workers to execute create instance calls in the
  background.  2 or 3 workers should be sufficient to meet the
  2/sec rate, more allows for the occasional longer execution time
  as well as a customized higher rate.  We max out at 8 to protect
  nodepool from too many threads.
* The state machine uses the new background create calls instead
  of synchronously creating instances.  This allows other state
  machines to progress further (ie, advance to ssh keyscan faster
  in the case of a rush of requests).
* Delete calls are batched.  They don't take as long as create calls,
  yet their existence at all uses up rate limiting slots which could
  be used for creating instances.  By batching deletes, we make
  more room for creates.
* A bug in the RateLimiter could cause it not to record the initial
  time and therefore never actually rate limit.  This is fixed (see
  the sketch after this list).
* The RateLimiter is now thread-safe.
* The default rate limit for AWS is changed to 2 requests/sec.
* Documentation for the 'rate' parameter for the AWS driver is added.
* Documentation for the 'rate' parameter for the Azure driver is
  corrected to describe the rate as requests/sec instead of delay
  between requests.
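
A minimal thread-safe limiter of the kind described might look like
this (a sketch, not the actual implementation):

    import threading
    import time

    class RateLimiter:
        def __init__(self, rate=2.0):        # requests per second
            self.delta = 1.0 / rate
            self.last = time.monotonic()     # record the initial time
            self.lock = threading.Lock()

        def __enter__(self):
            with self.lock:
                wait = self.last + self.delta - time.monotonic()
                if wait > 0:
                    time.sleep(wait)
                self.last = time.monotonic()

        def __exit__(self, exc_type, exc, tb):
            pass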

Change-Id: Ida2cbc59928e183eb7da275ff26d152eae784cfe
2022-06-22 13:28:58 -07:00
Zuul a4acb5644e Merge "Use Zuul-style ZooKeeper connections" 2022-05-23 22:56:54 +00:00
James E. Blair 10df93540f Use Zuul-style ZooKeeper connections
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.

This will enable us to bring other Zuul classes over, such as the
component registry.

The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper.  Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.

Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
2022-05-23 07:40:20 -07:00
Zuul c1cacfc076 Merge "AWS tests: cleanup image deletion checking" 2022-05-19 16:26:17 +00:00
Ian Wienand e1461659de AWS tests: clean up image deletion checking
This is a follow-on to I3279c3b5cb8cf26d390835fd0a7049bc43ec40b5

As discussed in the referenced issue, the blank return and
AttributeError here are the correct way to determine a recently deleted
image.  We don't need to catch the ClientError as that won't be raised
any more.  Getting a value for the state would indicate it wasn't
deleted.

This bumps the moto base requirement to ensure we have this behaviour.

Change-Id: I2d5b0ccb9802aa0d4c81555a17f40fe8b8595ebd
2022-04-28 14:16:16 +10:00
Zuul df836a6744 Merge "Delete AWS image builds" 2022-04-27 22:58:24 +00:00
James E. Blair eb7f58ce0e Delete AWS image builds
The AWS driver adapter was not actually deleting the image build
from AWS when the builder requested it (it was simply missing
the implementation of the method).

This went unnoticed because as long as a launcher is running, it
will act to clean up leaked images.  But it's still better to
have the builder do so synchronously.

Add a test which verifies the behavior.

Change-Id: I0bea930847ded45574bf53ae098b9926d4924107
2022-04-27 11:59:54 -07:00
Ian Wienand 8ffd70776a AWS tests: catch AttributeError raised by latest moto release
Per the notes inline, the recent 3.1.6 moto release has changed the
way deleted images are returned.  Catching the exception we now see
works, and we can revisit if we find better solutions.

Change-Id: I3279c3b5cb8cf26d390835fd0a7049bc43ec40b5
2022-04-27 14:18:53 +10:00
James E. Blair daa4e39a0d Fix default python paths in aws, azure, ibmvpc drivers
The python-path value should default to "auto" per documentation
and to match other drivers.  Correct that.

Change-Id: Ie8254e10d9c4d8ff8f8f298fac32140a18248293
2022-04-12 06:32:41 -07:00
James E. Blair 46e130fe1a Add more debug info to AWS driver
These changes are all in service of being able to better understand
AWS driver log messages:

* Use annotated loggers in the statemachine provider framework
  so that we see the request, node, and provider information
* Have the statemachine framework pass annotated loggers to the
  state machines themselves so that the above information is available
  for log messages on individual API calls
* Add optional performance information to the rate limit handler
  (delay and API call duration)
* Add some additional log entries to the AWS adapter

Also:

* Suppress boto logging by default in unit tests (it is verbose and
  usually not helpful)
* Add coverage of node deletion in the AWS driver tests

Change-Id: I0e6b4ad72d1af7f776da73c5dd2a50b40f60e4a2
2022-04-11 10:14:20 -07:00
James E. Blair 348fd17e73 Add additional tests to the aws driver
This adds several new tests:

This exercises the new ipv6 functionality.  The boto3 mocks do not
support assigning an ipv6 address on creation, so we adopt a
unit-test style approach and verify the calling side and our
processing of the results separately.  This uses the patching
scheme that we just removed in the previous commit, but alters it
to simply record the call arguments so we can validate them outside
the patched method.

This adds a diskimage upload test to the AWS driver.  Because moto
doesn't support the create_image method, we need to add a fake for
that.  Boto's practice of creating proxy objects makes that hard to
do, so some helper methods are added on the adapter class to make
it easier for the tests to override.

The responses in the fake are based on a recorded session.

This adds a test of leaked resource cleanup, including the automatic
tagging of untagged resources from image import tasks.

Change-Id: I0626061b246e9c52b08c49394d4d22b46beeff7a
2022-03-16 16:12:42 -07:00
James E. Blair 89b3d6a4df Refactor AWS driver tests
This makes the AWS driver tests more modular, with minimal changes.
This will allow us to add more tests without extending the
current single-function-which-performs-all-tests.

It splits the config into multiple files, one for each distinct pool
configuration.  This means that tests only run one pool thread (compared
to the current situation where every test runs every pool).

Whatever defect in moto appeared to require patching the network
methods to verify the call signature has apparently been rectified, so
that patching is removed.

The two tags tests have been combined into one.

Private IP validation is improved so that we verify the values set on
the nodepool node rather than the mock call args.

There was no test that verified the EBS-optimised setting, so that has
been added.

Change-Id: I023ec41f015c2fcb20835fc0e38714149944de84
2022-02-22 17:06:10 -08:00
James E. Blair 43678bf4c1 Update AWS driver to use statemachine framework
This updates the aws driver to use the statemachine framework which
should be able to scale to a much higher number of parallel operations
than the standard thread-per-node model.  It is also simpler and
easier to maintain.  Several new features are added to bring it to
parity with other drivers.

The unit tests are changed minimally so that they continue to serve
as regression tests for the new framework.  Following changes will
revise the tests and add new tests for the additional functionality.

Change-Id: I8968667f927c82641460debeccd04e0511eb86a9
2022-02-22 17:06:07 -08:00
James E. Blair 9bcc046ffc Add QuotaSupport to drivers that don't have it
This adds QuotaSupport to all the drivers that don't have it, and
also updates their tests so there is at least one test which exercises
the new tenant quota feature.

Since this is expected to work across all drivers/providers/etc, we
should start including at least rudimentary quota support in every
driver.

Change-Id: I891ade226ba588ecdda835b143b7897bb4425bd8
2022-01-27 10:11:01 -08:00
James E. Blair d55ea477de Fix AWS driver equality check
A recent refactor of the config objects (which simplified the
repetitive __eq__ methods) missed updating the AWS driver.  This
could lead to infinite recursion (as the AWS driver explicitly
called super().__eq__ which itself called __eq__).

This updates the driver to use the new framework, and it also adds
a unit test which exercises it.

Change-Id: I3c612e2784de1ffd1642587ba6017e36bebd8d67
2021-06-23 12:07:25 -07:00