HPC in the Cloud

Mark Seager, Intel Fellow, CTO, the Intel Technical Computing Group Ecosystem

What is HPC?
High Performance Computing (HPC) is a demanding class of workloads that are typically very computation-, communication- and/or I/O-intensive. The intent of HPC is commonly predictive simulation and modeling, or High Performance Data Analytics (HPDA) to gain insight from vast quantities of data. HPC had its origins in government- and/or academically funded science and engineering research and development. These R&D efforts extended the scientific method beyond theory, with its attendant mathematical models, and experiment, with its attendant collection of observational data, to include scientific simulation and modeling and HPDA. With recent advances in predictive simulation and HPDA, the scientific method has fundamentally changed into a triad of theory, experiment, and predictive modeling and simulation with HPDA. For example, predictive simulations are used to model experiments before they are built in order to engineer their designs. Once the experiments are conducted, the results are compared against the predictive simulations to improve the models and/or the scientific data (e.g., equations of state) that the models depend on. In addition, biology and the medical sciences are emerging as exciting new areas where HPC and HPDA are being successfully applied.
In the 1990s, HPC workloads were parallelized and drove the development of massively parallel processing (MPP) technologies. MPPs combined multiple shared-memory multiprocessors (or nodes), each with a separate operating system, with a high-speed, low-latency interconnect, a parallel file system and a parallel gang scheduler for jobs.
Many of HPC's scientific modeling codes are based on solving a set of time-dependent partial differential equations (PDEs) on a grid and a geometric model of the object being modeled. An example is compressible air flow around a wing in aeronautics. The HPC application parallelization strategy was based on domain decomposition, with the application driving high-speed, low-latency communication across domain boundaries through the Message Passing Interface (MPI); each domain is an MPI task, and a set of MPI tasks runs on each node. Most HPC applications parallelized with MPI increase the size of the problem as the number of MPI tasks increases. This corresponds to enhanced grid resolution and/or more physics (more complicated PDEs) and produces a better predictive result with smaller error estimates. However, the communication pattern between MPI tasks under domain decomposition is very tightly coupled, with short compute and communication sections interspersed. As a result, this type of scale-up HPC application requires dedicated resources and low-latency, high-bandwidth interconnects.
Economic Benefits of HPC
As predictive modeling and simulation have improved and become an integral part of the scientific method, their economic impact has also grown dramatically. For example, today's weather simulations routinely produce seven-day forecasts with an accuracy that rivals what was achievable 40 years ago for next-day forecasts. Other examples include car crash simulations, which are now used for 90 percent or more of car structural verification. Final crash tests are used more to verify and validate the codes than as a mechanism for vehicle crash certification. In addition, new areas of personalized medicine are opening up, as we can now quickly compare the cancer cell genome of an individual patient against those of other patients with matching histories, treatments and outcomes. Direct simulation of cancer drug therapies can now be performed for individual patients within days to weeks, which increases the success rates for many forms of cancer.
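The domain decomposition strategy described earlier, with one subdomain per MPI task and communication across domain boundaries, can be illustrated with a small single-process sketch. It is not an MPI program: plain Python lists stand in for MPI tasks and list reads stand in for messages. The 1D explicit diffusion update, the constant `ALPHA` and the function names `step_global`, `split` and `step_decomposed` are all assumptions chosen for illustration, not taken from any particular HPC code.

```python
# Single-process sketch of 1D domain decomposition with halo (ghost cell) exchange.
# Each list in `domains` stands in for the cells owned by one hypothetical MPI task.

ALPHA = 0.25  # assumed diffusion number; values <= 0.5 keep this explicit scheme stable


def step_global(u):
    """One explicit diffusion step on a whole grid; the two end values are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + ALPHA * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def split(u, ntasks):
    """Split the grid into contiguous subdomains, one per hypothetical task."""
    n = len(u)
    bounds = [n * t // ntasks for t in range(ntasks + 1)]
    return [u[bounds[t]:bounds[t + 1]] for t in range(ntasks)]


def step_decomposed(domains):
    """One step in which each task exchanges halos, then updates only its own cells."""
    last = len(domains) - 1
    updated = []
    for t, own in enumerate(domains):
        # Communication phase: "receive" one boundary value from each neighbor
        # (in a real code these would be MPI messages).
        left = [domains[t - 1][-1]] if t > 0 else []
        right = [domains[t + 1][0]] if t < last else []
        # Compute phase: local work on own cells plus halos; halo cells are dropped.
        new = step_global(left + own + right)
        updated.append(new[len(left):len(left) + len(own)])
    return updated


# Demo: the decomposed update should reproduce the undecomposed one.
grid = [0.0] * 16
grid[8] = 1.0  # initial spike in the middle of the grid
reference = grid[:]
parts = split(grid, 4)
for _ in range(10):
    reference = step_global(reference)
    parts = step_decomposed(parts)
flat = [x for p in parts for x in p]
print(max(abs(a - b) for a, b in zip(flat, reference)))
```

Every time step needs a halo exchange before any task can compute, which is the short, interleaved compute and communication pattern that makes this class of application sensitive to interconnect latency.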
Drivers for HPC in the Cloud
Cloud computing has emerged as an enterprise data center model in which multiple enterprise applications, services and base infrastructure can be dynamically allocated and migrated for load balancing and cross-data-center optimization. In addition, vast quantities of data are being collected from the Internet of Things (IoT), driving new end-user services and revenue. Similar drivers are pushing HPC and HPDA workloads into the cloud, but adoption of HPC in the cloud has lagged behind enterprise adoption of cloud technologies. This is due to two issues: 1) historical HPC buying patterns tend to keep HPC purchases within the R&D organization and out of the CIO organization; 2) HPC workloads in the cloud require specialized services and infrastructure to be fully enabled. Some portion of HPC and HPDA can be run as a throughput workload, in which each individual application run is serial but the analysis requires a huge number of runs, as in gene sequencing. In this case, cloud infrastructure tuned for enterprise workloads is sufficient. However, to support scalable applications with hundreds to hundreds of thousands of MPI tasks, several additional technologies must be available within the cloud: 1) a high-performance, low-latency interconnect with a highly structured topology, such as InfiniBand in a fat-tree topology; 2) gang scheduling, so that all of the MPI tasks run simultaneously without contending for resources with other workloads on the systems they are scheduled on, which means no multi-tenancy of workloads on HPC-enabled systems; 3) a scalable parallel file system, such as Lustre, to provide scalable I/O performance as the parallelism in the HPC and HPDA jobs increases. There are several providers of HPC-in-the-cloud technologies, and the list is growing rapidly. Adoption of HPC in the cloud will continue to grow. However, the technical hurdles described above need to be addressed to truly enable both throughput HPC and scalable HPC.
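
The difference between throughput HPC and scalable, tightly coupled HPC can be made concrete with a toy timing model. This is only a sketch under simplifying assumptions: every step of the coupled job ends with a synchronizing exchange, communication cost is ignored, and the throughput runs are handed out greedily to free workers. The function names and the numbers in the usage lines are hypothetical.

```python
def coupled_runtime(step_times_per_task, nsteps):
    """Tightly coupled job: each step finishes only when the slowest task finishes."""
    return nsteps * max(step_times_per_task)


def throughput_runtime(run_times, nworkers):
    """Independent serial runs, each given to the least-loaded worker in turn.

    Returns the time at which the last worker finishes (the makespan).
    """
    workers = [0.0] * nworkers
    for t in run_times:
        i = workers.index(min(workers))  # first worker with the least work so far
        workers[i] += t
    return max(workers)


# Hypothetical scenario: contention from a co-tenant makes one unit of work take 2x longer.
dedicated = coupled_runtime([1.0] * 8, nsteps=100)
contended = coupled_runtime([1.0] * 7 + [2.0], nsteps=100)
print("coupled job slowdown:", contended / dedicated)

clean = throughput_runtime([1.0] * 80, nworkers=8)
noisy = throughput_runtime([2.0] + [1.0] * 79, nworkers=8)
print("throughput workload slowdown:", noisy / clean)
```

In this model a single slowed task delays every step of the coupled job, while the throughput workload absorbs most of the delay. That is the intuition behind requiring dedicated, non-multi-tenant resources for scalable MPI applications while letting throughput workloads run on cloud infrastructure tuned for enterprise use.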