Guide to High-Performance Computing
If you are fascinated by the words “parallel” or “concurrent”, by the programs that run on supercomputers, by jargon like “linearizable”, or by anything high-performance, then this is the guide you’re looking for!
In this guide, we lay down the prerequisites and point to foundational knowledge resources. We then explain some interesting sub-fields of High-Performance Computing, along with resources to get started in them.
Prerequisites
High-Performance Computing is a relatively tricky field to get into, mainly due to its prerequisites. While the field is not math-heavy (assuming you’re not interested in scientific simulations), it requires a decent grasp of Computer Architecture and Operating Systems concepts.
Computer Architecture is especially crucial if you’re tackling any optimization problem, since optimizing an application requires you to exploit the hardware as much as possible. So, before getting started with this field, we highly recommend reading Structured Computer Organization (6th Edition) by A. Tanenbaum and T. Austin. At a bare minimum, read Chapters 2, 3.3–3.6, 4.1, 4.4–4.6, 5.1, 5.4 and 5.5.
A course on Operating Systems introduces a student to the fundamentals of multiprocessing; after all, your PC can run Chrome and Code at the same time (but can it run Crysis?). Therefore, you must go through the basic concepts of OS. We recommend reading Chapters 1 and 2 of Modern Operating Systems (4th Edition) by A. Tanenbaum and H. Bos. Alternatively, if you want to take a more traditional and detailed approach to OS, you can read Chapters 1–5 of Operating System Concepts (10th Edition) by A. Silberschatz, P. B. Galvin and G. Gagne. We recommend trying the exercises given in the book for practice.
If you don’t feel like going through the books (perhaps you don’t find books amusing?), we suggest you study the following broad topics that are of interest with respect to HPC:
Processors:
- Micro-architecture: Data/Execution path, Pipelining, ALU design, Optimizations (Caching)
- ISA: Registers, memory model, addressing
- Architecture types: x86, ARM (AArch64)
- Miscellaneous: SIMD, MIMD, NUMA, Cache coherence
OS:
- Process, thread, fork/join, POSIX API, Race conditions, Mutual Exclusion (a short sketch follows this list)
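To make the OS-side ideas concrete, here is a minimal sketch of a race condition and its fix via mutual exclusion. It uses C++’s std::thread and std::mutex (which wrap the POSIX API on Linux) rather than raw pthreads; remove the lock and the final count becomes unpredictable.

```cpp
// Two threads increment a shared counter. The mutex provides mutual
// exclusion; without it, the increments race and the result is unpredictable.
#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;           // shared state
std::mutex counter_mutex;  // protects counter

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // acquire/release the lock
        ++counter;  // critical section: only one thread at a time
    }
}

int main() {
    std::thread t1(worker, 1000000);
    std::thread t2(worker, 1000000);
    t1.join();  // fork/join: wait for both workers to finish
    t2.join();
    std::cout << "counter = " << counter << '\n';  // always 2000000 with the lock
}
```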
Finally, you must brush up on your programming skills in C or C++. The field involves writing significant amounts of code, so you can’t get away with just a hello-world program.
Getting Started
Now that you are well acquainted with the prerequisites, it is time to get into “the thing”. Before we start with the actual coding or the sub-fields, we must go over the theory. Since the prerequisites only introduce the notions of multiprocessing and threading, we now suggest you read Chapters 6–9 of Operating System Concepts (10th Edition) by A. Silberschatz, P. B. Galvin and G. Gagne. These chapters will introduce you to the general concepts of parallel programming and synchronization problems. Make sure to try the coding problems at the end of each chapter.
Once you’ve read that, we recommend you take the course CS 0368–4061–01 from MIT. The course is inspired by the book The Art of Multiprocessor Programming by M. Herlihy and N. Shavit. The book teaches you the concepts of parallel programming along with general concepts like linearizability and wait-freedom, to name a few. It is especially important for anyone wanting to pursue any of the research problems mentioned in it. Alternatively, you can go through the SPTCC 2017 lectures, which broadly cover the fields of parallel and distributed computing. It is not necessary (but recommended) to go through all the lectures before proceeding further.
With all the theoretical work done, you can now begin with some coding.
Enter HPC
It is finally time to learn parallel algorithms, implement them in your language of choice, and then optimize your code for the underlying architecture. Start with a course on HPC. We recommend the following:
- LSU’s CSC 7600 by Prof. Thomas Sterling and Dr. Hartmut Kaiser: High Performance Computing: Models, Methods and Means (CSC 7600)
If you’re stuck anywhere within the course, we recommend reading more on the topic. For example, if you are having trouble understanding OpenMP, we suggest you go through its tutorials. If that’s not enough, you can find the documentation here. You can approach MPI the same way. You can find the tutorial and documentation here and here.
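As a quick sanity check that your toolchain works, here is a minimal hybrid MPI + OpenMP “hello world” sketch: each MPI rank forks a team of OpenMP threads and every thread prints its identity. The compiler wrapper and launch command in the comment (mpicxx, mpirun) are typical but vary by MPI implementation.

```cpp
// Hybrid MPI + OpenMP hello world.
// Build and run with, e.g.:
//   mpicxx -fopenmp hello_hybrid.cpp -o hello_hybrid && mpirun -np 2 ./hello_hybrid
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    #pragma omp parallel  // fork a team of OpenMP threads inside each rank
    {
        std::printf("Hello from thread %d of %d on rank %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads(), rank, size);
    }

    MPI_Finalize();
    return 0;
}
```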
By this point, you should be familiar with parallel and distributed programming paradigms. If you’re interested in how operating systems are designed in distributed settings and in general concepts like distributed deadlocks, distributed clock synchronization, distributed mutual exclusion, etc., then we suggest you give Advanced Concepts in Operating Systems by M. Singhal and N. G. Shivaratri a read.
Interesting sub-fields
Runtimes
If you’ve gone through the course in-depth, you must have realized the inherent shortcomings of an OpenMP + MPI application. A lot of research is therefore done on runtimes: new parallel programming paradigms are on their way, and many groups are trying to optimize the more recent ones. If you are interested in working on runtimes, you will have to understand the addressing models they employ. Some popular models currently in use are message passing (MPI), PGAS and AGAS. You have already learned MPI, so let’s understand what PGAS and AGAS are.
- AGAS Paper: Hartmut Kaiser et al., HPX — A Task Based Programming Model in a Global Address Space, PGAS 2014: The 8th International Conference on Partitioned Global Address Space Programming Models (2014)
- PGAS Paper: Saraswat, Vijay, et al. “The Asynchronous Partitioned Global Address Space Model.” The First Workshop on Advances in Message Passing. 2010.
Several runtimes/languages/frameworks based on these models exist. It is usually recommended to gain in-depth knowledge of one while being aware of the others. Feel free to pick any one of them. Some popular ones are:
- HPX
- Legion
- Habanero C
- UPC++
- Chapel
- Kokkos
- StarPU
- Charm++
- Julia
We’ve left out a lot of frameworks, so if you find an interesting one, go ahead and give it a shot!
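To give a flavour of what a task-based runtime with a global address space feels like in code, here is a minimal HPX sketch that spawns a lightweight task and waits on its future. Treat the header paths and the init entry point as assumptions: they follow older HPX releases and may differ in newer ones, so check the HPX documentation for your version.

```cpp
// Minimal HPX example: start the runtime, spawn one asynchronous task on a
// lightweight HPX thread, and wait for its result via a future.
// Note: header layout varies between HPX releases; these are the umbrella headers.
#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>
#include <iostream>

int hpx_main(int argc, char* argv[]) {
    // hpx::async returns a future; the task runs on an HPX worker thread.
    hpx::future<int> f = hpx::async([] { return 6 * 7; });
    std::cout << "task result: " << f.get() << std::endl;
    return hpx::finalize();  // tell the runtime to shut down
}

int main(int argc, char* argv[]) {
    // hpx::init boots the HPX runtime and invokes hpx_main on it.
    return hpx::init(argc, argv);
}
```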
You can also go through the course CS315B by Stanford on programming supercomputers. Furthermore, Programming Models for Parallel Computing by P. Balaji is an excellent read if you want to learn about different programming models and how they work.
Fault-Tolerance
All hardware can fail. A failure can be a faulty floating-point operation, a bit-flipped memory fetch, or a complete meltdown of a processor or another hardware component. In case of such errors, you do not want to rerun your 5-day-long simulation; that is a loss of both time and resources. Fault tolerance is about deciding what to do when such crashes occur. Listing all the relevant papers is out of scope for this blog, but the following are a good read on the topic:
- J. Hursey, “Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems,” Ph.D. dissertation, Indiana University, 2010. (for MPI)
- K. Teranishi et al., “ASC CSSE level 2 milestone #6362: Resilient asynchronous many-task programming model,” Sandia National Laboratories, Tech. Rep. SAND2018–9672, 2018. (for AMTs)
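To make the checkpoint/restart idea concrete, here is a toy, application-level sketch (not tied to MPI or any resilience library): a long-running loop periodically writes its state to a file so a crashed run can resume from the last checkpoint instead of starting over. The file name and state layout are made up for illustration.

```cpp
// Toy application-level checkpoint/restart: state is periodically written to
// a file so a crashed run can resume from the last checkpoint.
#include <fstream>
#include <iostream>

int main() {
    const char* ckpt_file = "state.ckpt";  // illustrative file name
    long step = 0;
    double value = 0.0;

    // Restore from a previous checkpoint if one exists.
    std::ifstream in(ckpt_file);
    if (in) {
        in >> step >> value;
        std::cout << "restored, resuming at step " << step << "\n";
    }

    const long total_steps = 1000000;
    for (; step < total_steps; ++step) {
        value += 1e-6;  // stand-in for one unit of real work

        if (step % 100000 == 0) {  // checkpoint every 100,000 steps
            std::ofstream out(ckpt_file, std::ios::trunc);
            out << step + 1 << ' ' << value << '\n';  // resume after this step
        }
    }
    std::cout << "final value = " << value << "\n";
}
```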
To get a detailed understanding of Fault-tolerance, you must read Fault-Tolerance Techniques for High-Performance Computing by T. Herault and Y. Robert.
Performance Modelling and Optimizations
So you finally learned some parallel programming paradigms and decided on the best one for your application. You go ahead and write your application with everything you know, but your application still performs poorly. How do you assess where you went wrong? Is it your fault? Is the hardware at fault?
If you find observing a high FLOPS count thrilling, this field will provide you with ample opportunities to turn non-performant code into exascale-ready code. It is a good time to understand the data parallelism employed by SIMD and learn some SIMD programming (also this). You may also want to look into SIMD libraries and presentations. For performance modelling, this workshop is a great place to start. You should also explore further based on the citations in the workshop presentations.
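As a small taste of explicit data parallelism, here is a sketch that adds two float arrays with AVX intrinsics, processing eight elements per instruction and falling back to a scalar loop for the tail. It assumes an AVX-capable x86 CPU and a compiler flag such as -mavx; compilers can often auto-vectorize loops like this, but writing the intrinsics by hand shows what the hardware is actually doing.

```cpp
// Element-wise addition of two float arrays using AVX (8 floats per instruction).
#include <immintrin.h>
#include <cstdio>

void add_avx(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);              // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);              // load 8 floats from b
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));  // c[i..i+7] = a + b
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];               // scalar tail
}

int main() {
    float a[11], b[11], c[11];
    for (int i = 0; i < 11; ++i) { a[i] = float(i); b[i] = 2.0f * i; }
    add_avx(a, b, c, 11);
    std::printf("c[10] = %f\n", c[10]);  // expect 30.0
}
```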
While there are multiple optimization problems to look at, one famous problem set is stencil codes. These codes are useful in solving a variety of linear differential equations. However, one flaw of stencil codes is their inherently memory-bound nature: the application will not run at the CPU’s peak performance because it is bottlenecked by memory bandwidth. Since DRAM bandwidth can’t increase significantly, this remains an open optimization problem. A good way to start cracking stencil optimizations is to read the book Space-Filling Curves by M. Bader. For example, Chapel uses its iterators to perform diamond tiling (one such optimization) over a grid without losing the visibility of the stencil operation. Its C++ (OpenMP) brethren, however, fail to show such a smooth transition between the naive stencil and the diamond-tiled solution and require special compilers to do so. Such optimizations without loss in overall code visibility are the key to this field.
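For reference, here is what a naive stencil kernel looks like: a 5-point Jacobi sweep over a 2D grid, parallelized with a plain OpenMP loop. The grid size and iteration count are arbitrary. Every grid point reads four neighbours and is rewritten each sweep, which is exactly the memory-bound access pattern that tiling schemes such as diamond tiling try to optimize.

```cpp
// Naive 5-point Jacobi stencil sweeps over a 2D grid (compile with -fopenmp).
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int n = 512, steps = 100;
    std::vector<double> cur(n * n, 0.0), next(n * n, 0.0);
    for (int j = 0; j < n; ++j) cur[j] = 1.0;  // hot top boundary row

    for (int t = 0; t < steps; ++t) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j)
                // each interior point becomes the average of its four neighbours
                next[i * n + j] = 0.25 * (cur[(i - 1) * n + j] + cur[(i + 1) * n + j] +
                                          cur[i * n + j - 1] + cur[i * n + j + 1]);
        std::swap(cur, next);  // the new grid becomes the input of the next sweep
    }
    std::printf("centre value after %d sweeps: %g\n", steps, cur[(n / 2) * n + n / 2]);
}
```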
Non-Blocking Algorithms & Data Structures
Traditionally and predominantly, locks have been used in parallel programming. Locks synchronize threads/processes and prevent race conditions. Blocking can be undesirable, as it halts a thread’s progress, which is an issue if the thread was performing a high-priority or real-time task, and it reduces the parallelism available to the system. Blocking can also lead to conditions like deadlock and priority inversion.
Non-blocking data structures and algorithms provide many benefits over their lock-based counterparts, such as guarantees on liveness, a property describing how threads make progress through a system. These properties, in increasing order of the guarantee offered, are:
- Obstruction-freedom, which states that threads are guaranteed to complete their operation so long as they are not obstructed by some other thread.
- Lock-freedom, which states that at least one thread is guaranteed to progress and succeed in a bounded number of steps.
- Wait-freedom, in which all threads are guaranteed to progress and succeed in a bounded number of steps.
Non-blocking algorithms are a big research field. Researchers are exploring new non-blocking algorithms, as well as non-blocking variants of existing blocking algorithms and data structures. Achieving true lock-freedom is extremely difficult, and may not always be possible or feasible.
Some lock-free data structures are:
- Treiber’s Lock-free Stack: Hendler, Danny, Nir Shavit, and Lena Yerushalmi. “A scalable lock-free stack algorithm.” Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2004.
- Michael & Scott’s Lock-free Queue: Michael, Maged M., and Michael L. Scott. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. No. TR-600. ROCHESTER UNIV NY DEPT OF COMPUTER SCIENCE, 1995.
Networking Communications and Models
HPC is all about performance, and so a great deal of research goes into improving both the processing hardware and the software. However, with the onset of distributed computing, most HPC applications are expected to run on clusters and supercomputers: sets of loosely or tightly connected computers that work together to achieve a common goal. The network, therefore, limits the peak performance the machine can deliver as a whole. This led to the evolution of networking communication standards designed specifically for HPC, featuring extremely high throughput and low latency. Some examples are Cray Aries, Cray Gemini, Mellanox InfiniBand and Intel Omni-Path.
Remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one’s operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. You can read more about RDMA here.
Computational Science Library Backends
Ever felt that you should’ve chosen Astrophysics as a major? Or perhaps Math or Biology or Chemistry?
These fields are of interest to those who want to work in cross-disciplinary research. The simulations running on supercomputers come under the computational sciences: codes that simulate a certain behaviour (e.g., stellar mergers). Since most scientists don’t deal with the coding portions of the simulations, you fill the gap by understanding the science and programming it for the supercomputer. Some existing applications are:
- Octo-Tiger: Octo-Tiger is an astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees. Link: https://github.com/STEllAR-GROUP/octotiger
- FleCSI: FleCSI is a compile-time configurable framework designed to support multi-physics application development. Link: https://github.com/laristra/flecsi
- OpenAtom: OpenAtom is software for studying atomic, molecular, and condensed-phase materials systems based on quantum chemical principles. Link: http://charm.cs.illinois.edu/OpenAtom/
- CHGL: The Chapel Hypergraph Library (CHGL) is a library for hypergraph computation in the Chapel language. Link: https://github.com/pnnl/chgl
We recommend reading Chapter 4 and Section II (Applications) of Introduction to High-Performance Scientific Computing to begin with computational science.
Written by Garvit Dewan and Nikunj Gupta.