Trusting the Computation
When putting together an accelerated HPC cluster it is critical that the computation is accurate. As such, there are several elements to consider:
Numerical Accuracy
- Every computer user requires their answers to be correct, as Intel discovered to their cost in 1992. In the HPC arena correctness is partially a function of the precision that calculations are made to.
- When a computer stores a real number it does it to some level of precision; more or fewer bits are allocated to store the number. For integer mathematics the choice is typically 8bits or multiples thereof, up to 64bits for the biggest number range (±2,147,483,647). For floating point mathematics, the choice is between Double Precision (DP: 64bits) and Single Precision (SP: 32bits). Although both these formats can store a vast range of numbers, DP is approximately 10^10 times more accurate than SP. While this accuracy is not necessarily significant for any individual calculation, the errors that are introduced when you do billions of calculations add up quickly.
Numerical Standards
- Floating point numbers, the computer format that is used for highly precise calculations, are governed by an international standard, IEEE 754. How this standard is implemented has a great bearing on how accurate, and how repeatable, your calculations are.
- The standard for storing floating point numbers in a computer was created in 1985 as is known as IEEE Std 754-1985. This standard defines, amongst other things, how numbers are stored (Double Precision, Single Precision, zero, infinity), how calculations are performed, and how imprecise calculations are flagged. A floating point unit in a computer can support little, some or all of the specification. Depending on what features of the standard the unit supports, different results will be returned and different conditions may or may not be flagged to the user (for example “when is a number not a number?”).
Reliability
- Many HPC calculations take hours, if not days, to complete. That means that your cluster has to stay up and running for the required hours or days, in addition to guaranteeing that all the billions of operations that are executed during that time are done correctly.
- Banks manage portfolios and risk of billions of dollars; drug research companies have to undertake incredibly thorough analysis to reduce the risk of a new compound having negative side effects; oil companies have to look deeper and harder to find the reserves that need to be tapped. All of these business goals, and the many more that are the hallmark of High Performance Computing, are becoming more complex and difficult every day, and are thereby taking longer and longer to compute. If your calculation takes 10 seconds but your competitor takes 8 then you lose the deal; if it takes 10 days to complete and it fails half way through then the opportunity loss is enormous, and no matter how long it takes, if the result is incorrect, and you’re not made aware of this, it is unacceptable.