On AWS there are two different lifecycle types for servers: on-demand and spot. On-demand servers are normal servers: as long as you pay for them, they will run (unless hardware breaks). Spot instances, on the other hand, are cheaper but can be taken away at any point in time. Usually you receive a notification a few minutes in advance. With Amazon Auto Scaling Groups (ASG) you can create a fleet of servers consisting of on-demand and spot instances to have a fixed baseline size and scale up when cheaper spot instances are available. However, there is one use case ASGs currently cannot handle: you cannot run on spot instances while they are available and fall back to on-demand instances when they are not.
Mixing on-demand and spot
Let’s imagine the following use case: You run a fleet of processing servers, and for good performance of your service a certain number of servers needs to be running. For example, you might be running an automated text translation service and want to guarantee that each translation is finished within at most 5 minutes. Each individual translation does not take long, maybe a few seconds, so you don’t care if servers get taken away from you with a notice period of 1 or 2 minutes. However, you do care about the number of servers, because you need to make sure that no translation sits there unprocessed for more than 5 minutes.
With the current capabilities of Amazon’s ASGs this forces you to use on-demand instances.
Otherwise, there can be times when you cannot get your hands on spot instances and your service will be impaired. You could argue that there are tons of different server types (t3.*, …) available, so that there is always enough spot capacity available for at least one of them. And for such a simple example you would be right, but believe me, there are use cases where one is tied to a small set of compatible server types and thus runs into trouble from time to time.
To the rescue comes AutoSpotting. AutoSpotting is a Go program that checks your ASGs and tries to replace on-demand instances with spot instances. I personally like their concept: You run everything with vanilla ASGs set to 100% on-demand, and AutoSpotting swaps in spot instances whenever it can get them. As it is a Go program, it’s also easy to run locally for testing and debugging, on a remote server, or as an AWS Lambda function (as they recommend).
At the time of writing this blog post the current state of their master branch did not work for me, so I had to build an older commit.
    mkdir autospotting-build
    cd autospotting-build
    go mod init local/build
    go get -d -v github.com/AutoSpotting/AutoSpotting@751b4cc1b0cdd5f6523f2828d92f31c3c5e223f2
    mkdir bin
    go build -o bin/AutoSpotting github.com/AutoSpotting/AutoSpotting
Let’s go through a short example of how AutoSpotting works:
You have an ASG with a desired capacity of 5 and 5 on-demand instances
are running. AutoSpotting now might be able to request 3 spot instances. It then
replaces 3 on-demand instances with 3 spot instances. The ASG now consists of
2 on-demand and 3 spot instances. The auto scaling mechanism is still happy,
because there are a total of 5 instances. And as soon as one or more of these
spot instances get taken down, the ASG auto scaling will kick in and start new on-demand
instances. When spot instances are available again, AutoSpotting can replace
some of these on-demand instances with spot instances again.
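The example above can be replayed with a toy model. This is only a sketch to make the bookkeeping concrete: the Group type and its methods are made up for illustration and are neither AutoSpotting code nor AWS API calls.

```go
package main

import "fmt"

// Lifecycle is the purchase type of a simulated instance.
type Lifecycle string

const (
	OnDemand Lifecycle = "on-demand"
	Spot     Lifecycle = "spot"
)

// Group is a toy model of an ASG (hypothetical, for illustration only).
type Group struct {
	Desired   int
	Instances []Lifecycle
}

// NewGroup returns a group filled with on-demand instances.
func NewGroup(desired int) *Group {
	g := &Group{Desired: desired}
	g.ScaleToDesired()
	return g
}

// Count returns how many instances of the given lifecycle are in the group.
func (g *Group) Count(l Lifecycle) int {
	n := 0
	for _, inst := range g.Instances {
		if inst == l {
			n++
		}
	}
	return n
}

// ReplaceWithSpot swaps up to n on-demand instances for spot instances and
// returns how many were swapped. It stands in for what AutoSpotting does.
func (g *Group) ReplaceWithSpot(n int) int {
	replaced := 0
	for i := range g.Instances {
		if replaced == n {
			break
		}
		if g.Instances[i] == OnDemand {
			g.Instances[i] = Spot
			replaced++
		}
	}
	return replaced
}

// TerminateSpot removes up to n spot instances, as an interruption would.
func (g *Group) TerminateSpot(n int) {
	kept := g.Instances[:0]
	for _, inst := range g.Instances {
		if inst == Spot && n > 0 {
			n--
			continue
		}
		kept = append(kept, inst)
	}
	g.Instances = kept
}

// ScaleToDesired mimics the ASG launching on-demand instances until the
// desired capacity is reached again.
func (g *Group) ScaleToDesired() {
	for len(g.Instances) < g.Desired {
		g.Instances = append(g.Instances, OnDemand)
	}
}

func main() {
	g := NewGroup(5)
	g.ReplaceWithSpot(3)
	fmt.Println(g.Count(OnDemand), g.Count(Spot)) // 2 3

	g.TerminateSpot(2) // two spot instances get taken away
	g.ScaleToDesired() // the ASG refills with on-demand instances
	fmt.Println(g.Count(OnDemand), g.Count(Spot)) // 4 1
}
```

The total never drops below the desired capacity for long, which is why the auto scaling mechanism stays happy throughout.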
This all works by setting a simple tag on your ASGs. AutoSpotting will check all ASGs that have this tag enabled and ignore all ASGs without the tag.
I recently set this up for some ASGs and noticed one problem with all of this. I am a careful guy, and fortunately AutoSpotting has an option to keep a specific base capacity of on-demand instances. I set this to one, because I was worried that during a spot shortage the group might be left without any instances until the replacement on-demand instances have booted and are ready to process data. Still, soon enough the alerts fired and there was a brief outage. What had happened?
The group was running with a desired capacity of two and initially had two on-demand instances. AutoSpotting was able to request one spot instance and attached it to the ASG, removing one on-demand instance. By default, AutoSpotting is set up to handle AWS Rebalance Recommendations. These are notifications by Amazon that there might be a shortage of spot instances at some future point in time (not the shutdown notification you get a few minutes in advance for your specific server). So, AutoSpotting dutifully handled the recommendation and took down the running spot instance. The ASG noticed that only one server was running and booted a second on-demand instance, and of course AutoSpotting kicked in and booted another spot instance. Unfortunately, it replaced the one remaining on-demand instance that was healthy (the second one was still initializing), and the group was left without any healthy servers.
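To make the failure mode concrete, here is a minimal sketch of the final step of that sequence. Everything in it is hypothetical (the Instance type, the swapInSpot function and the instance IDs are invented for illustration); it only models the difference between swapping in a spot instance that is merely running and waiting until it is actually healthy.

```go
package main

import "fmt"

// Instance is a toy model of an ASG member. Healthy means "ready to serve
// traffic", which is stricter than the EC2 state "running".
type Instance struct {
	ID      string
	Spot    bool
	Healthy bool
}

// healthyCount returns the number of instances that can serve traffic.
func healthyCount(group []Instance) int {
	n := 0
	for _, inst := range group {
		if inst.Healthy {
			n++
		}
	}
	return n
}

// swapInSpot replaces the first healthy on-demand instance with the given
// spot instance and returns the resulting group. With requireHealthySpot set,
// a spot instance that is not healthy yet is not attached at all.
func swapInSpot(group []Instance, spot Instance, requireHealthySpot bool) []Instance {
	if requireHealthySpot && !spot.Healthy {
		return group
	}
	out := make([]Instance, 0, len(group))
	swapped := false
	for _, inst := range group {
		if !swapped && !inst.Spot && inst.Healthy {
			out = append(out, spot)
			swapped = true
			continue
		}
		out = append(out, inst)
	}
	return out
}

func main() {
	// State right after the ASG booted the second on-demand instance.
	group := []Instance{
		{ID: "od-a", Healthy: true},  // the one remaining healthy server
		{ID: "od-b", Healthy: false}, // still initializing
	}
	// The new spot instance is "running" but has not finished initializing.
	spot := Instance{ID: "spot-2", Spot: true, Healthy: false}

	fmt.Println(healthyCount(swapInSpot(group, spot, false))) // 0
	fmt.Println(healthyCount(swapInSpot(group, spot, true)))  // 1
}
```

Without the guard the group ends up with zero healthy servers, which matches the outage described above.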
What was the reason for this? I checked the AutoSpotting FAQ and code and found out that they do have a health check in place. When they boot an instance, they wait for the auto scaling grace period before attaching the spot instance to the scaling group. I checked my ASG and verified that the grace period was not too low. I also checked the AutoSpotting logs in CloudWatch Logs and saw that this check was never executed.
AutoSpotting has two modes of execution: Cron mode and event notification mode. In cron mode it runs every few minutes and checks for actions to perform. In event notification mode it reacts to ASG and instance notifications. A quick assumption after reading the logs was (and it seems to be right) that the grace period check is not performed in event notification mode.
Looking at the code this seems to be the case. I do not know much about Go, but it seems easy enough to read. The entry point for a Go program is its main function. AutoSpotting's main function executes the function eventHandler, which in turn leads to a function that does one of two things, depending on whether an event was attached. If there was no event attached, the chain is AutoSpotting.processRegions, which calls region.processEnabledAutoScalingGroups, which calls autoScalingGroup.cronEventAction, which correctly executes the grace period check.
If an event is processed, the following chain of actions takes place: processEvent can call AutoSpotting.processEventInstance, which on an instance state change event ends up (among other calls) in handleNewOnDemandInstanceLaunch, which according to my understanding is supposed to replace a newly started on-demand instance with a spot instance. This seems to be a quite linear execution: It checks if there already is an unattached spot instance running; if not, it tries to start one. It then waits until this instance is in status running. Status running is a status determined by Amazon for EC2 instances; it has nothing to do with the load balancing health checks or whatever your code has to do to initialize. When the instance is running, it replaces one of the on-demand instances with this instance.
And that seems to be exactly what has happened.
I currently do not know what the best fix would be. It’s probably wrong to
wait in this function for the full grace period because this can be several
minutes long. Signalling an event on the other hand would not work either, as
the event will be fired immediately and then AutoSpotting would be started
again immediately. So probably there’s no way around a cron job execution to
wait for spot instances that are still below the grace period. But in my opinion
handleNewOnDemandInstanceLaunch should also include the grace period check
so that it will never attach an instance to the ASG that is not yet healthy.
For now I will enable the cron-only mode of AutoSpotting and probably file a bug report on GitHub.