How to save up to 80% in cloud HPC costs?

1. Executive Summary

Cloud computing offers an appealing option to the world of high performance computing. The combination of reduced fixed costs and no upfront fees means that anyone can afford to set up their own personal computing cluster using cloud resources.

However, when substantial throughput is required, the number of cloud instances must be increased. This raises cloud costs, which can make it hard to stay within budget.

Google preemptible instances are similar to AWS spot instances, except that they use flat pricing. Using preemptible instances can reduce Google Cloud instance costs by up to 80% (to approx. $0.01 / CPU Core hour), making them a very appealing way to reduce overall costs. The drawback of using preemptible instances for computational purposes is that the instances may be terminated at any time, which can lead to data loss and the termination of the entire computation.

Techila Distributed Computing Engine (TDCE) is a patented solution that enables using preemptible instances for processing computational workloads in a secure and efficient manner. TDCE adds error tolerance and automatic configuration features. These keep the amount of computing power at the desired level automatically, while still benefitting from the cheaper preemptible capacity.

2. Testing Overall Suitability

In order to illustrate how well suited preemptible instances are for computational purposes when using Techila Distributed Computing Engine, a series of tests were performed. These tests were designed to measure the Availability, Reliability, Performance and Cost of preemptible instances when using Techila Distributed Computing Engine.

2.1. Availability

To test the availability of preemptible instances, a Techila Distributed Computing Engine computing system of 1000 n1-highcpu-2 Linux Techila Worker instances was deployed and the time required to start the instances was measured and logged. Each instance had 2 CPU cores, meaning the system consisted of a total of 2000 CPU cores. The time required to start the Techila Worker instances is illustrated in the image below.

Figure 1. The deployment test shows that Google Cloud Platform is able to provision a large amount of preemptible instances.

The orange line represents the number of Techila Workers that are ready for computations. In the deployment test, 99% of the capacity was online in under 3 minutes. The remaining 1% of the instances were terminated by preemption events during start-up, which increased the total start-up time as new instances had to be started.

2.2. Reliability

To test the effect of preemptible instance terminations on the amount of computing power available, a deployment consisting of 50 n1-highcpu-2 Linux Techila Worker instances was kept running for 24 hours. Each instance had 2 CPU cores, meaning the system consisted of a total of 100 CPU cores. These tests were repeated several times on different dates. During these tests, the number of preemption events varied greatly, ranging from 1 to 401 events. More general information about preemption selection is available in Google's Compute Engine documentation.

Each time an instance is preempted, the instance is terminated. However, because TDCE uses a managed instance group to run the instances, any terminated instances are automatically replaced with new instances. The self-configuration features in TDCE also mean that the new instances are automatically configured to process the previous Techila Workers' workload. This minimizes the impact of the termination on the user.
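
The replace-and-reschedule behaviour described above can be pictured with a small simulation. This is a minimal sketch under simplifying assumptions, not TDCE's actual implementation: the function name run_with_preemptions, the round-based model and the per-job preemption probability are all hypothetical, and replacement instances are assumed to be ready by the next round.

```python
import random
from collections import deque


def run_with_preemptions(num_jobs, num_workers, preempt_prob, seed=0):
    """Process jobs in rounds; a preempted worker's job goes back to the queue.

    Hypothetical model for illustration only.
    """
    rng = random.Random(seed)
    pending = deque(range(num_jobs))
    completed = set()
    preemptions = 0
    rounds = 0

    while pending:
        rounds += 1
        # Each worker takes one job per round. Replacement instances are
        # assumed to be online by the next round, so the pool size is constant.
        batch = [pending.popleft() for _ in range(min(num_workers, len(pending)))]
        for job in batch:
            if rng.random() < preempt_prob:
                # The instance was terminated mid-job: requeue the job so a
                # replacement worker picks it up in a later round.
                preemptions += 1
                pending.append(job)
            else:
                completed.add(job)

    return completed, preemptions, rounds


completed, preemptions, rounds = run_with_preemptions(
    num_jobs=200, num_workers=10, preempt_prob=0.05
)
print(f"{len(completed)} jobs completed in {rounds} rounds, "
      f"{preemptions} interrupted jobs were rescheduled")
```

In this model every job eventually completes, and the only cost of a preemption is the extra work of rerunning the interrupted job.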

The graph below illustrates the preemption events observed during one of our tests and contains two key metrics:

Cumulative amount of preemption terminations
Total amount of Techila Workers available for computations

Figure 2. The reliability test shows that by using Techila Distributed Computing Engine, the amount of computing power remains unchanged despite large amounts of preemptible instance terminations. The test was performed in europe-west1-d and started on July 17th, 2018 at 10:42:42 (UTC +0000).

The orange line shows the cumulative amount of preemption events during the 24-hour test. There was a total of 168 events, meaning a total of 168 instance terminations took place during the test. Each time an instance was terminated, a replacement instance was automatically started and reconfigured by Techila Distributed Computing Engine.

The blue line shows the number of available Techila Workers. Each preemption event causes a short downtime period for the affected Techila Worker. Despite a total of 168 termination events during the test, the share of Techila Workers online stayed close to 100% throughout and never dropped below 90%.
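
To put these figures in perspective, the short calculation below derives a few rates from the reported numbers (50 instances, 24 hours, 168 events, a 90% floor). It is only arithmetic on the values quoted above; the variable names are illustrative.

```python
instances = 50
test_hours = 24
preemption_events = 168      # reported for the test shown in Figure 2
min_online_fraction = 0.90   # availability never dropped below 90%

# Average number of preemptions per instance slot over the test.
events_per_instance = preemption_events / instances

# Average spacing between preemption events across the whole deployment.
minutes_between_events = test_hours * 60 / preemption_events

# Largest number of workers that can have been offline simultaneously.
max_offline_at_once = round(instances * (1 - min_online_fraction))

print(f"{events_per_instance:.2f} preemptions per instance slot")
print(f"one preemption every {minutes_between_events:.1f} minutes on average")
print(f"at most {max_offline_at_once} of {instances} workers offline at once")
```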

2.3. Performance

As preemptible instances use the same underlying cloud hardware as normal instances, the main cause for performance differences will be the preemption terminations, which can happen at any time. Compute Engine always terminates preemptible instances after they run for 24 hours, assuming the instance has not been terminated in another preemption event prior to that.

To measure the effect of these preemption terminations on computational workloads, the same workload was run twice: once on normal instances and once on preemptible instances. The workload consisted of 36,000 MATLAB Jobs, which were processed using 50 x n1-highcpu-2 Techila Worker Linux instances. Each Job took 5 minutes to complete.

The duration of each run was just over 30 hours. The run executed on the preemptible instances was only 8 minutes 17 seconds (0.45%) slower than the run executed on the normal instances, which was a pleasant surprise for the test team.

The total execution times of the runs are shown below:

Capacity Type	Total Execution Time
Preemptible	1d 06h 35m 26s
Normal	1d 06h 27m 09s
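
The reported run times can be checked with a short calculation. The sketch below assumes that each of the 100 CPU cores processes one Job at a time with no scheduling overhead (an assumption, not something stated in the test description), and then compares the two measured run times.

```python
from datetime import timedelta

jobs = 36_000
minutes_per_job = 5
cores = 50 * 2  # 50 n1-highcpu-2 instances with 2 CPU cores each

# Assumption: one Job per core at a time, no scheduling overhead.
ideal = timedelta(minutes=jobs * minutes_per_job / cores)

# Measured total execution times from the test.
normal = timedelta(days=1, hours=6, minutes=27, seconds=9)
preemptible = timedelta(days=1, hours=6, minutes=35, seconds=26)

difference = preemptible - normal
slowdown_percent = 100 * difference / normal

print("ideal run time:", ideal)
print("difference:", difference)
print(f"slowdown: {slowdown_percent:.2f}%")
```

Under that assumption the ideal run time is 30 hours, which matches the "just over 30 hours" observed for both runs.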

The image below shows how the performance of preemptible instances (orange line) compares with normal instances (blue line). As can be seen, the orange line overlaps the blue line for most of the Project. This means that both preemptible and normal instances are processing workloads at the same rate. This is possible because Techila Distributed Computing Engine keeps the amount of computing power at the desired level despite preemption terminations and reconfigures any replacement instances, so they automatically rejoin the computations.

Figure 3. As illustrated by the mostly overlapping orange and blue lines, performance difference between preemptible and normal instances is minimal.

There was a total of 57 preemption termination events during the test, 37 of which occurred at the 24-hour mark. Only 37 of the 50 instances were terminated at the 24-hour mark because the other 13 had already been preempted earlier in the test, which resets the 24-hour timer. Each time a preemption termination occurred, a replacement instance was automatically provisioned and configured by Techila Distributed Computing Engine.

The most visible effect of these preemption terminations can be seen around the 24-hour mark, where 37 instances were terminated. Even though the interrupted Jobs were automatically restarted after replacement instances had been provisioned, the performance impact of these interruptions resulted in the orange line (preemptible capacity) trailing the blue line (normal capacity) in the amount of completed Jobs for the remainder of the Project.

There was also one unexpected instance termination when using normal instances. This instance was automatically replaced and reconfigured to process the MATLAB workload, but still delayed the Project completion by approximately 5 minutes.

2.4. Cost

To compensate for the unexpected preemptible terminations, preemptible instances are priced competitively, allowing you to save up to 80% on instance infrastructure cost.

The table below shows how the cost of the computations can be calculated for normal and preemptible instances. These calculations are for the computations that were discussed in the Performance section (2.3).

Capacity Type	Normal	Preemptible
Operating System	Linux	Linux
Instance Type	n1-highcpu-2	n1-highcpu-2
Cost per Instance per Hour	$0.0736 / hour	$0.0177 / hour
Instance Count	50	50
Deployment duration (in hours)	30.4525	30.5905
Total Cost	$0.0736 * 50 * 30.4525 ≈ $112.13	$0.0177 * 50 * 30.5905 ≈ $27.13

In this particular use case, we were able to save over 75% on the cost by using preemptible instances.
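
The cost comparison can be reproduced with a few lines of Python. This sketch uses the hourly prices exactly as listed; with those rounded prices the totals come out at roughly $112.07 and $27.07, slightly below the reported $112.13 and $27.13. The gap presumably comes from rounding in the listed hourly prices, which is an assumption rather than something the test data confirms. The savings figure is above 75% either way.

```python
instances = 50
price_per_hour = {"normal": 0.0736, "preemptible": 0.0177}  # USD, as listed
hours = {"normal": 30.4525, "preemptible": 30.5905}         # deployment duration

# Total cost = hourly price * instance count * deployment duration.
cost = {
    kind: price_per_hour[kind] * instances * hours[kind]
    for kind in price_per_hour
}
savings = 1 - cost["preemptible"] / cost["normal"]

# n1-highcpu-2 has 2 CPU cores, so halve the instance price per core.
per_core_hour = price_per_hour["preemptible"] / 2

for kind, value in cost.items():
    print(f"{kind}: ${value:.2f}")
print(f"savings: {savings:.1%}")
print(f"preemptible price per core hour: ${per_core_hour:.5f}")
```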

Figure 4. Costs computed by using a Techila Distributed Computing Engine Enterprise License. If using Techila Distributed Computing Engine Advanced Edition in Google Cloud Platform Marketplace, additional Techila License fees will apply.

3. Summary

By using Techila Distributed Computing Engine, large numbers of preemptible instances can be deployed easily and quickly. This was highlighted by the deployment test, where we were able to spin up an environment consisting of 2000 CPU cores and get 99% of the capacity online and ready for computations in under 3 minutes.

Despite a large number of preemptible instance terminations, the amount of available computing capacity remains stable with zero human intervention when using Techila Distributed Computing Engine. This was illustrated by the fact that the amount of computing power available never dropped below 90%, despite a total of 168 preemption terminations taking place in 24 hours.

Preemptible instances also enabled us to save over 75% on the cost of computations, while being only 0.45% slower than normal instances.

Techila Distributed Computing Engine makes using preemptible capacity simple and mitigates the effect of unexpected instance terminations by automatically starting replacement instances and reconfiguring them to process computational workloads.

4. Try it Yourself

To try out Techila Distributed Computing Engine, you can easily set up a computing environment in Google Cloud Marketplace.

Techila Distributed Computing Engine Advanced Edition in Google Cloud Platform Marketplace enables anyone to set up a secure and private high-performance computing environment using cloud resources. Cloud resources used for computational purposes can be scaled according to needs by using an easy-to-use graphical interface. This interface allows the user to specify all key aspects of the deployment, such as:

Instance count. TDCE lets you start as many instances as your Google quota allows.
Instance type. Supported instance types range from single core instances to instances with 160 CPU cores each.
Instance operating system. Windows and Linux are both supported, so workloads with OS-specific requirements can be processed.
Preemptible or normal instances. Can be easily toggled with a couple of mouse clicks.

All instances will automatically be placed in managed instance groups. This means that in situations where preemptible instances are terminated, the instance groups will automatically restart similar instances in an effort to keep the amount of available computing power as steady as possible. In situations where the terminated instance was in the middle of processing computational workloads, TDCE’s self-healing features will automatically also reconfigure this new instance to process the workload (which can be e.g. MATLAB, Python or R) and reschedule the workload so it will be processed on the replacement instance.

More information about the setup process and available free trials can be found here:

http://www.techilatechnologies.com/techila-in-gcp-marketplace/