
Amazon Spot Instances – Minimise instance terminations

AWS Spot instances are a great way to run large CFD and FEA simulation clusters, with savings of up to 90% on the cost of running standard “On Demand” AWS instances. This makes high performance computing available at low cost.
The only downside is that spot instances are spare capacity on Amazon’s infrastructure, so Amazon may reclaim them at any time. In practice this is unlikely, and as long as you are using checkpointing you can resume from the last checkpoint saved before your job failed. The spot instances are set up in such a way that the hard drive remains even after the instance is terminated, so if you have checkpointed, your job is saved up to that point. It is of course inconvenient when a job fails, but the following tips should help you choose where and how to run your simulation so as to minimise the chance of failure.
AWS has data centres all around the world. In the era of the GDPR, data location is paramount, and most of our European customers want to run in the EU only. However, not all data centres are created equal. The Dublin region has one of the largest AWS centres in the world, one of only a few with 3 availability zones (see Fig 1). We have also seen that the smaller data centres don’t have the full range of instance types, or the latest ones, that the larger centres have. So you are more likely to have issues in the other European regions than in Dublin.
Previous generation instance types tend to be less in demand than the latest types. For example, when m5 instances come out they will be in high demand, whereas the m4 instances will be less so. It is similar to buying a new car: if you go for the outgoing model you often get a significant discount because there is less demand for it. So it can often make sense to choose an instance type one generation behind the newest, since, unless some miraculous breakthrough revives the now long dead Moore’s law, any performance increase will mostly be evolutionary rather than revolutionary.
Another valuable insight is that the fewer machines you use, the better. So four 36-core machines are less likely to have issues than twelve 9-core machines. This is because the probability that at least one of your machines goes down increases with the number of machines you take.
Let’s say that each machine has a 1% probability of going down over 5 days, independently of the other machines, i.e. a 0.99 probability of remaining running. For a cluster of n machines:
P(at least one machine goes down) = 1 - P(all n machines stay up) = 1 - (0.99)^n
So if you have 4 machines, this works out to be

1 - (0.99)^4 ≈ 0.04 = 4%

Now if you have 12 machines, the calculation becomes

1 - (0.99)^12 ≈ 0.11 = 11%
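
The same calculation can be sketched in a few lines of Python for any cluster size. This is only an illustration of the arithmetic above: the function name is made up for this example, and the 1% figure is the assumed per-machine probability from the text, not a measured AWS interruption rate.

```python
def p_any_machine_down(n_machines: int, p_single: float) -> float:
    """Probability that at least one of n independent machines goes down.

    p_single is the (assumed) probability that one machine goes down
    during the period of interest, e.g. 0.01 over 5 days.
    """
    if n_machines < 0 or not 0.0 <= p_single <= 1.0:
        raise ValueError("need n_machines >= 0 and 0 <= p_single <= 1")
    # P(at least one down) = 1 - P(all stay up)
    return 1.0 - (1.0 - p_single) ** n_machines


if __name__ == "__main__":
    for n in (4, 12):
        print(f"{n} machines: {p_any_machine_down(n, 0.01):.1%}")
```

With the 1% assumption this gives roughly 3.9% for 4 machines and 11.4% for 12 machines, matching the rounded figures above.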

For HPC you should in any case be using as few large machines as can do the job, since communication between cores on the same chip is always going to be faster than over a network. Because we need very fast networking, we are limited to running in a placement group in a single Availability Zone (AZ), i.e. in a single Amazon facility. A failure of any single node brings down the whole cluster, so checkpointing becomes very important.
It is worthwhile doing some sort of optimisation based on how long a checkpoint takes to save versus how much time we lose if a simulation goes down. We could work out a “sweet spot” for the frequency of checkpointing: checkpoint too often and the simulation slows down, too rarely and more work is lost when an instance is interrupted. A starting point could be to compare the maximum extra run time we are happy to accept for checkpointing with the time it would cost to redo the lost part of the simulation.
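
One possible way to put numbers on this trade-off is Young's first-order approximation, a well-known rule of thumb for checkpoint intervals. It is not something this article prescribes. It assumes the time lost to an interruption is on average half a checkpoint interval, and that interruptions are rare compared with the interval. The function names and the example values (a 60 second checkpoint, one interruption every 5 days on average) are hypothetical.

```python
import math


def overhead_fraction(interval_s: float, checkpoint_s: float, mtbf_s: float) -> float:
    """Approximate fraction of run time lost to checkpointing plus rework.

    checkpoint_s / interval_s    : time spent writing checkpoints
    interval_s / (2 * mtbf_s)    : expected rework, assuming on average half
                                   an interval is lost per interruption
    mtbf_s is the assumed mean time between interruptions of the cluster.
    """
    return checkpoint_s / interval_s + interval_s / (2.0 * mtbf_s)


def young_interval(checkpoint_s: float, mtbf_s: float) -> float:
    """Interval that minimises overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


if __name__ == "__main__":
    checkpoint_s = 60.0            # hypothetical: checkpoint takes 1 minute
    mtbf_s = 5 * 24 * 3600.0       # hypothetical: one interruption per 5 days
    best = young_interval(checkpoint_s, mtbf_s)
    print(f"checkpoint every {best / 3600:.1f} h, "
          f"overhead {overhead_fraction(best, checkpoint_s, mtbf_s):.2%}")
```

With these made-up inputs the suggested interval is 2 hours. Measured values for your own solver and cluster should replace them before drawing any conclusions.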
Finally, AWS sends a 2 minute warning before a spot instance goes down. If it is possible to complete a checkpoint within this 2 minute window, we will look at the feasibility of adding this to the self service portal.
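
As a rough sketch of how such a check might look, the warning is exposed on the instance itself through the EC2 instance metadata service at `spot/instance-action`, which returns 404 until an interruption is scheduled. The code below reads that notice and checks whether a checkpoint of a given duration would still fit. The helper names and the checkpoint duration are assumptions for illustration, and this is not how the self service portal is implemented.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def fetch_instance_action(timeout: float = 1.0):
    """Return the raw interruption notice (JSON text), or None if none is pending.

    Only works when run on an EC2 instance. Uses an IMDSv2 session token.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def seconds_remaining(notice_body: str, now=None) -> float:
    """Seconds between `now` (UTC) and the interruption time in the notice."""
    notice = json.loads(notice_body)  # e.g. {"action": "terminate", "time": "...Z"}
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (deadline - now).total_seconds()


def can_checkpoint(notice_body: str, checkpoint_s: float, now=None) -> bool:
    """True if a checkpoint taking checkpoint_s seconds should still fit."""
    return seconds_remaining(notice_body, now) > checkpoint_s
```

A job wrapper could call `fetch_instance_action()` every few seconds and, when it returns a notice, trigger the solver's checkpoint only if `can_checkpoint` says there is enough time left.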

Fig 1: List of AWS regions and their availability zones

S.No | AWS region code | AWS region name | Number Of Availability Zones | Availability Zone Names
3    | us-west-1       | N. California   | 3                            | us-west-1a
10   | sa-east-1       | Sao Paulo       | 3                            | sa-east-1a
11   | cn-north-1      | China (Beijing) | N/A                          | N/A
12   | ap-south-1      | India (Mumbai)  | 2                            | ap-south-1a

