Mixing On-Demand and Spot instances with AutoSpotting - and one Issue that I've Found

On AWS there are two different lifecycle types for servers: on-demand and spot. On-demand servers are normal servers: as long as you pay for them they will run (unless hardware breaks). Spot instances, on the other hand, are cheaper but can be taken away at any point in time. Usually you receive a notification two minutes in advance. With Amazon EC2 Auto Scaling groups (ASGs) you can create a fleet of servers consisting of on-demand and spot instances, giving you a fixed baseline size that scales up when cheaper spot instances are available. However, there is one use case ASGs currently cannot handle: you cannot run on spot instances whenever they are available and fall back to on-demand instances only when they are not.

Mixing on-demand and spot

Let’s imagine the following use case: You run a fleet of processing servers and for good performance of your service a certain number of servers needs to be running. For example you might be running an automated text translation service and want to guarantee that each translation is finished within at most 5 minutes. Each individual translation does not take long, maybe a few seconds, thus you don’t care if servers get taken away from you with a 1 or 2 minutes notice period. However, you do care about the number of servers, because you need to make sure that no translation sits there unprocessed for more than 5 minutes.

With the current capabilities of Amazon’s ASGs this forces you to use on-demand instances. Otherwise, there can be times when you cannot get your hands on spot instances and your service will be impaired. You could argue that there are tons of different server types (t2.*, t3.*, …) available, so there will always be enough spot capacity for at least one of them. For such a simple example you would be right, but believe me, there are use cases where one is tied to a small set of compatible server types and thus runs into trouble from time to time.

AutoSpotting

To the rescue comes AutoSpotting. AutoSpotting is a Go program that checks your ASGs and tries to replace on-demand instances with spot instances. I personally like their concept: You run everything with vanilla ASGs setting them to 100% on-demand. AutoSpotting then tries to replace these on-demand instances with spot instances. As it is a Go program it’s also easy to run locally for testing and debugging, on a remote server or as an AWS Lambda function (as they recommend).

At the time of writing this blog post the current state of their master branch did not work for me, so I had to build an older commit:

mkdir autospotting-build
cd autospotting-build
go mod init local/build
go get -d -v github.com/AutoSpotting/AutoSpotting@751b4cc1b0cdd5f6523f2828d92f31c3c5e223f2
mkdir bin
go build -o bin/AutoSpotting github.com/AutoSpotting/AutoSpotting

Let’s go through a short example of how AutoSpotting works: You have an ASG with a desired capacity of 5 and 5 on-demand instances are running. AutoSpotting now might be able to request 3 spot instances. It then replaces 3 on-demand instances with 3 spot instances. The ASG now consists of 2 on-demand and 3 spot instances. The auto scaling mechanism is still happy, because there are a total of 5 instances. And as soon as one or more of these spot instances get taken down, the ASG auto scaling will kick in and start new on-demand instances. When spot instances are available again, AutoSpotting can replace some of these on-demand instances with spot instances again. This all works by setting a simple tag called spot-enabled on your ASGs. AutoSpotting will check all ASGs that have this tag enabled and ignore all ASGs without the tag.
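The arithmetic of this example can be written down as a small Go sketch. It is only a toy model of the behaviour described above, not AutoSpotting's actual code: the type group and the functions replaceWithSpot and loseSpot are hypothetical names made up for illustration, and the minOnDemand parameter stands in for a configurable on-demand base capacity.

```go
package main

import "fmt"

// group is a hypothetical, simplified model of an ASG: it only tracks how
// many instances of each lifecycle type are attached.
type group struct {
	onDemand int
	spot     int
}

// replaceWithSpot swaps up to `available` on-demand instances for spot
// instances while keeping at least `minOnDemand` on-demand instances.
// The total size of the group never changes.
func replaceWithSpot(g group, available, minOnDemand int) group {
	n := g.onDemand - minOnDemand // how many on-demand instances may be replaced
	if available < n {
		n = available
	}
	if n < 0 {
		n = 0
	}
	return group{onDemand: g.onDemand - n, spot: g.spot + n}
}

// loseSpot models spot instances being reclaimed. The ASG's own scaling
// then refills the group with the same number of on-demand instances.
func loseSpot(g group, lost int) group {
	if lost > g.spot {
		lost = g.spot
	}
	return group{onDemand: g.onDemand + lost, spot: g.spot - lost}
}

func main() {
	g := group{onDemand: 5}
	g = replaceWithSpot(g, 3, 0)
	fmt.Printf("%+v\n", g) // prints {onDemand:2 spot:3}
	g = loseSpot(g, 2)
	fmt.Printf("%+v\n", g) // prints {onDemand:4 spot:1}
}
```

The point of the model is that the total number of instances stays at the desired capacity in every step, which is why the ASG itself never sees a reason to intervene while AutoSpotting swaps instances.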

I recently set this up for some ASGs and noticed one problem with all of this. I am a careful guy and fortunately AutoSpotting has an option to keep a specific base capacity of on-demand instances. I set this to one, because I was worried that during a spot shortage the group might be left without any running instances until the replacement on-demand instances had booted and were ready to process data. Still, soon enough the alerts fired and there was a brief outage. What had happened?

The Debugging

The group was running with a desired capacity of two and initially had two on-demand instances. AutoSpotting was able to request one spot instance and attached it to the ASG, removing one on-demand instance. By default, AutoSpotting is set up to handle AWS Rebalance Recommendations. These are notifications by Amazon that there might be a shortage of spot instances at some future point in time (it’s not the shutdown notification that arrives shortly before your specific server is taken away). So, AutoSpotting dutifully handled the recommendation and took down the running spot instance. The ASG recognized that only one server was running and booted a second on-demand instance, and of course AutoSpotting kicked in and booted another spot instance. Unfortunately, it replaced the one remaining on-demand instance that was healthy (the second one was still initializing) and the group was left without any healthy servers.
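To make the sequence easier to follow, here is a minimal Go sketch that replays the four steps and counts the healthy instances after each one. It is a toy model and not AutoSpotting code: the instance type, the helper functions and the instance names are all made up for illustration.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of a server in the group.
type instance struct {
	name    string
	healthy bool // true once the service on it can actually process work
}

// healthyCount returns how many instances can currently process work.
func healthyCount(group []instance) int {
	n := 0
	for _, inst := range group {
		if inst.healthy {
			n++
		}
	}
	return n
}

// remove returns the group without the instance of the given name.
func remove(group []instance, name string) []instance {
	var out []instance
	for _, inst := range group {
		if inst.name != name {
			out = append(out, inst)
		}
	}
	return out
}

// simulate replays the four steps and records the healthy count after each.
func simulate() []int {
	var counts []int

	// 1. Steady state: one on-demand and one spot instance, both healthy.
	group := []instance{{"on-demand-1", true}, {"spot-1", true}}
	counts = append(counts, healthyCount(group))

	// 2. Rebalance recommendation: the spot instance is taken down.
	group = remove(group, "spot-1")
	counts = append(counts, healthyCount(group))

	// 3. The ASG boots a second on-demand instance; it is still initializing.
	group = append(group, instance{"on-demand-2", false})
	counts = append(counts, healthyCount(group))

	// 4. A new spot instance replaces on-demand-1 as soon as EC2 reports it
	//    as running, which is before it is healthy.
	group = remove(group, "on-demand-1")
	group = append(group, instance{"spot-2", false})
	counts = append(counts, healthyCount(group))

	return counts
}

func main() {
	fmt.Println(simulate()) // prints [2 1 1 0]
}
```

The group size is two after steps 1, 3 and 4, so from the ASG's point of view everything looks fine in step 4, even though no instance can process work.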

What was the reason for this? I checked the AutoSpotting FAQ and code and found out that they do have a health check in place. When they boot an instance they wait for the auto scaling grace period before attaching the spot instance to the scaling group. I checked my ASG and verified that the grace period was not too low. I also checked the AutoSpotting logs in CloudWatch Logs and saw that this check was never executed.

AutoSpotting has two modes of execution: cron mode and event notification mode. In cron mode it runs every few minutes and checks for actions to perform. In event notification mode it reacts to ASG and instance notifications. My first guess after reading the logs (and it seems to be right) was that the grace period check is not performed in event notification mode.

Looking at the code, this seems to be the case. I do not know much about Go, but it seems easy enough to read. The entry point for a Go program is its main function. AutoSpotting's main function executes the function eventHandler, which in turn executes AutoSpotting.EventHandler.

This function does one of two things: It executes ProcessCronEvent if there was no event attached, or processEvent otherwise. ProcessCronEvent calls AutoSpotting.processRegions, which calls region.processRegion, which calls region.processEnabledAutoScalingGroups, which calls autoScalingGroup.cronEventAction, which correctly executes the check for spotInstance.isReadyToAttach(a).

If an event is processed, the following chain of actions can happen: processEvent can call AutoSpotting.processEventInstance, which on an instance state change event calls AutoSpotting.handleNewInstanceLaunch. This function calls (among others) AutoSpotting.handleNewOnDemandInstanceLaunch, which according to my understanding is supposed to replace a newly started on-demand instance with a spot instance. This seems to be a quite linear execution: It checks if there already is an unattached spot instance running, and if not it tries to start one. It then waits until this instance is in status running. Status running is a status determined by Amazon for EC2 instances; it has nothing to do with load balancer health checks or with whatever your code has to do to initialize. When the instance is running, it replaces one of the on-demand instances with this instance.
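The difference between the two paths, as described above, can be boiled down to the following Go sketch. It is a deliberately reduced model and not AutoSpotting's real code: the spotInstance struct and the functions cronWouldAttach and eventWouldAttach are hypothetical and only capture which condition each path appears to check before attaching.

```go
package main

import "fmt"

// spotInstance is a hypothetical, stripped-down model of a freshly launched
// spot instance. It is not AutoSpotting's real type.
type spotInstance struct {
	state           string // EC2 state, e.g. "pending" or "running"
	gracePeriodOver bool   // true once the ASG grace period has elapsed
}

// cronWouldAttach models the cron path as described above: the instance is
// attached only once it is running and the grace period has passed.
func cronWouldAttach(s spotInstance) bool {
	return s.state == "running" && s.gracePeriodOver
}

// eventWouldAttach models the event path as described above: being in EC2
// state "running" is enough, the grace period is not consulted.
func eventWouldAttach(s spotInstance) bool {
	return s.state == "running"
}

func main() {
	// A spot instance that EC2 reports as running but that is still
	// initializing from the service's point of view.
	fresh := spotInstance{state: "running", gracePeriodOver: false}
	fmt.Println("cron path attaches: ", cronWouldAttach(fresh))
	fmt.Println("event path attaches:", eventWouldAttach(fresh))
}
```

For an instance that has just reached the running state, the two functions disagree, and that window is where a healthy on-demand instance can be swapped for a spot instance that is not ready yet.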

And that seems to be exactly what has happened.

I currently do not know what the best fix would be. It’s probably wrong to wait in this function for the full grace period, because this can be several minutes long. Signalling an event, on the other hand, would not work either, as the event would be fired immediately and then AutoSpotting would be started again immediately. So there is probably no way around a cron job execution to wait for spot instances that are still within the grace period. But in my opinion handleNewOnDemandInstanceLaunch should also include the grace period check so that it will never attach an instance that is not yet healthy to the ASG.

For now I will enable the cron-only mode of AutoSpotting and probably file a bug report on GitHub.

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.