Total Cost of Ownership of HPC Clusters On-premises versus Hosted in the Cloud


About

The primary goal is to develop and distribute a usable tool for calculating costs for HPC jobs. Understanding both on-premises costs and the costs of the various cloud platforms will help researchers determine the best place to run a given workload. The secondary goal is to run various benchmarks on several popular cloud platforms: Amazon Web Services, Microsoft Azure, Rescale, and Google Cloud Platform.

Daniel Reed reports a significant financial interest in Microsoft, a publicly traded company whose program will be evaluated in this study.


Funded in part by the National Science Foundation (NSF)

Award Number 1812786
Project Title: “CYBER-INSIGHT: Evaluating Cyberinfrastructure Total Cost of Ownership”
Principal Investigator: Daniel Reed (University of Utah)


Objectives

Designing a comprehensive total cost of ownership (TCO) analysis for arbitrary high-performance computing (HPC) centers is challenging due to the heterogeneous nature of HPC. Every HPC center has its own way of organizing resources across domains such as computation, networking, storage, datacenter infrastructure, and personnel. Creating a one-size-fits-all schema for HPC centers is not practical. Therefore, we chose and implemented a flexible JSON-based hierarchical representation of HPC resource TCO. This allows our users, HPC administrators, to structure their TCO financials within the CYBERINSIGHT tool in a manner that best fits the idiosyncrasies of their HPC center.
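A minimal sketch of the kind of hierarchical record such a representation allows; all category names and figures below are illustrative, not a prescribed schema:

```json
{
  "name": "example-hpc-center",
  "cost": 0,
  "children": [
    {
      "name": "datacenter",
      "cost": 0,
      "children": [
        { "name": "power-and-cooling", "cost": 250000, "children": [] },
        { "name": "floor-space", "cost": 100000, "children": [] }
      ]
    },
    { "name": "compute", "cost": 1200000, "children": [] },
    { "name": "personnel", "cost": 500000, "children": [] }
  ]
}
```

Because the nesting is arbitrary, each center can split or merge categories (for example, placing networking under compute or as its own branch) without changing the tool.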


Analysis Tool

The new CYBERINSIGHT tool is a single Jupyter Notebook comprising multiple sections, or modules, that together form the HPC TCO analysis pipeline, from data import and entry, through analysis and visualization, to export. We also leverage Jupyter to provide tools for experienced Python users to go beyond the GUI and interact programmatically with the TCO data within the notebook.
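As a sketch of what such programmatic interaction might look like, the snippet below rolls up total cost over a hierarchical TCO record; the field names and figures are illustrative assumptions, not the tool's actual API:

```python
# Sketch only: programmatic roll-up over a hierarchical TCO record.
# The JSON layout (name/cost/children) and the figures are illustrative.
import json

tco_json = """
{
  "name": "example-center",
  "cost": 0,
  "children": [
    {"name": "compute", "cost": 1200000, "children": []},
    {"name": "storage", "cost": 300000, "children": []},
    {"name": "personnel", "cost": 500000, "children": []}
  ]
}
"""

def total_cost(node):
    """Recursively sum a node's own cost plus all descendant costs."""
    return node["cost"] + sum(total_cost(c) for c in node["children"])

center = json.loads(tco_json)
print(total_cost(center))  # 2000000
```

A recursive traversal like this works for any nesting depth, which is why the hierarchical representation places no constraints on how a center organizes its cost categories.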

The hosted version of the tool is currently disabled due to lack of activity, but it can be run locally with Python on any system.

Results

The CYBERINSIGHT tool has completed initial development and is now in the testing phase. Specifically, the benchmark module, where users enter benchmark and costing data, is complete; the storage module, where storage costs and performance may be entered, is complete; and the graphing module, which displays comparisons among selected data, is also complete.

Benchmark data were generated on various platforms: on-premises HPC clusters; Amazon Web Services (AWS); Rescale; Google Cloud Platform (GCP); and Azure (AZ). The following benchmarks were run on these platforms:

  • High-Performance Linpack Benchmark (HPL)
  • High-Performance Conjugate Gradients Benchmark (HPCG)
  • Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)
  • Vienna Ab Initio Simulation Package (VASP)
  • Amber18
  • Machine Learning (ML) Applications/Benchmarks
    • AI-Benchmark (based on TensorFlow)
    • PyTorch-Benchmark
    • The use of Convolutional Neural Networks (CNN) to discern different lymphoma types.

All the aforementioned benchmarks were run in the on-premises environment. The HPL and LAMMPS benchmarks were run on AWS. Considerable time was spent studying the behavior of the LAMMPS code on AWS; the anomalies initially observed were traced to two different bugs in the AWS environment. The LAMMPS benchmarks were also run in the Rescale environment, which allows users to perform a wide range of (mainly HPC) calculations through a graphical user interface (GUI). However, the GUI is not the most efficient way to handle a large number of jobs, so a Python package (which is publicly available) was developed to automate this task. Some of the VASP calculations were also performed on Rescale. The HPCG benchmarks were completed (besides on-premises) on AWS and AZ. The Amber18 simulations were likewise completed on-premises and on AWS, and the ML applications were completed on-premises and on AZ. The specific benchmark data can be found in the addendum.
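The per-job comparison these benchmark runs feed into can be sketched as simple arithmetic: on-demand cloud cost versus a job's amortized share of annual on-premises TCO. Every rate, runtime, and utilization figure below is a hypothetical placeholder, not a measured result from this study:

```python
# Hedged sketch of a cloud vs. on-premises per-job cost comparison.
# All numbers are hypothetical placeholders, not project benchmark data.

def cloud_job_cost(runtime_hours, nodes, hourly_rate):
    """On-demand cloud cost: node-hours times the instance hourly rate."""
    return runtime_hours * nodes * hourly_rate

def onprem_job_cost(runtime_hours, nodes, annual_tco, node_count,
                    utilization=0.8):
    """Amortized on-premises cost: the job's share of annual TCO,
    assuming the cluster delivers `utilization` of its node-hours."""
    usable_node_hours = node_count * 8760 * utilization  # 8760 h/year
    return annual_tco * (runtime_hours * nodes) / usable_node_hours

# Hypothetical 10-hour, 4-node job.
cloud = cloud_job_cost(runtime_hours=10, nodes=4, hourly_rate=3.0)
onprem = onprem_job_cost(runtime_hours=10, nodes=4,
                         annual_tco=2_000_000, node_count=100)
print(f"cloud: ${cloud:.2f}, on-prem: ${onprem:.2f}")
```

The crossover between the two depends heavily on the utilization assumption, which is precisely the kind of center-specific input the CYBERINSIGHT tool lets administrators supply.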

We are currently finishing our benchmarks on GCP, where we are using HashiCorp’s Terraform framework (Infrastructure as Code) to spin up HPC clusters. In parallel, we are researching the use of Terraform, combined with automation tools such as Ansible, Puppet, and Spack, to set up a combined on-premises/cloud environment.
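For readers unfamiliar with the approach, a Terraform configuration of the following shape can declare a set of GCP compute nodes; the node count, machine type, and image here are illustrative assumptions, not the project's actual cluster definition:

```hcl
# Illustrative only: declaring four GCP compute nodes with Terraform.
# Names, machine type, zone, and image are assumptions for this sketch.
resource "google_compute_instance" "hpc_node" {
  count        = 4
  name         = "hpc-node-${count.index}"
  machine_type = "c2-standard-60"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
  }
}
```

Because the cluster is described declaratively, the same workflow (`terraform apply` to create, `terraform destroy` to tear down) makes short-lived benchmark clusters reproducible and keeps their cloud spend bounded.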

Papers