The Solution

Better by design

The new frontier in chip design is finding clever ways to compute with less energy. Perhaps the most visible ray of hope is in the emergence of multi-core processor chips running at lower clock speeds. Every time a microprocessor vendor doubles the number of processing elements but lowers the clock speed, it drops the power consumption yet raises the effective performance.
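The reasoning behind this claim can be sketched with the standard first-order model of dynamic CMOS power, which scales roughly with capacitance times voltage squared times frequency. The Python sketch below is illustrative only: the normalized values, the assumption that supply voltage can be lowered in proportion to clock speed, and the assumption of perfectly parallel work are all hypothetical and are not taken from any vendor's data.

```python
# Illustrative model only: dynamic CMOS power scales roughly as C * V^2 * f.
# All numbers are hypothetical and normalized; they are not vendor data.


def dynamic_power(cores, voltage, frequency, capacitance=1.0):
    """Relative dynamic power of a chip: cores * C * V^2 * f."""
    return cores * capacitance * voltage ** 2 * frequency


def throughput(cores, frequency):
    """Relative peak throughput, assuming the work parallelizes perfectly."""
    return cores * frequency


# Baseline: one core at normalized voltage 1.0 and normalized clock 1.0.
base_power = dynamic_power(cores=1, voltage=1.0, frequency=1.0)
base_perf = throughput(cores=1, frequency=1.0)

# Alternative: two cores at 60% of the clock. Assumption: the lower clock
# allows the supply voltage to drop to 60% as well (an idealization).
dual_power = dynamic_power(cores=2, voltage=0.6, frequency=0.6)
dual_perf = throughput(cores=2, frequency=0.6)

print(f"relative power:       {dual_power / base_power:.3f}")
print(f"relative performance: {dual_perf / base_perf:.3f}")
```

Under these idealized assumptions the two-core design uses less than half the power while offering more peak throughput. Real chips also have leakage power and imperfectly parallel workloads, so the actual benefit is smaller.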

Despite these impressive advances in microprocessor technology, a tradeoff still persists between designing a computer for the full range of applications and designing it for technical applications that make heavy use of floating-point arithmetic. The extra hardware demanded by high performance computing (HPC) would raise the price of a general-purpose processor too much to justify building a high-performance floating-point unit into it.

The rise of graphics accelerators such as those from ATI and nVIDIA® illustrates this situation. A microprocessor can certainly perform graphics functions, albeit at a lower speed than the graphics card. Designers could allocate some of the transistors on a chip that currently go to caches and other general performance enhancements to graphics processing; however, many microprocessor applications involve little or no graphics, whereas almost all applications benefit from caches, look-ahead features, etc. Relegating both graphics and HPC functions to separate hardware moves the choice from the chip engineer to the user, avoiding the tradeoff issues.

Alternative approaches to delivering more performance

It is well understood that the HPC community covers an extremely diverse range of requirements. There are few better markets to disprove the adage that one size fits all.

Floating-point operations per second (FLOPS) and double precision accuracy are often cited as the key metrics for measuring the performance of HPC systems. In reality they are only one part of the equation. Some HPC applications are dependent primarily on integer performance. Others are constrained by memory or network latency and bandwidth, data access or other bottlenecks.
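One common way to reason about whether a workload is limited by floating-point capability or by memory bandwidth is a roofline-style bound: attainable performance is the smaller of the compute peak and the bandwidth multiplied by the kernel's arithmetic intensity (floating-point operations per byte moved). The sketch below assumes a hypothetical machine and a hypothetical intensity for the matrix kernel; none of these figures come from the text above.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical machine parameters, chosen only for illustration.
PEAK_GFLOPS = 50.0
BANDWIDTH_GB_S = 10.0

# Arithmetic intensity in floating-point operations per byte moved.
kernels = {
    # Double precision a[i] = b[i] + c[i]: one add per 24 bytes
    # (two 8-byte loads and one 8-byte store).
    "vector add": 1.0 / 24.0,
    # Assumed value for a well-blocked dense matrix multiply.
    "blocked matrix multiply": 8.0,
}

for name, intensity in kernels.items():
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    limiter = "memory bandwidth" if bound < PEAK_GFLOPS else "floating-point peak"
    print(f"{name}: at most {bound:.2f} GFLOPS, limited by {limiter}")
```

In this sketch a faster floating-point unit would help only the matrix kernel; the vector add would gain nothing until memory bandwidth improved.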

Double precision floating point performance is highlighted not because it is the most important attribute, but because it is one of the most demanding system attributes to satisfy and is a requirement for the majority of the HPC disciplines.

An understanding of the characteristics of HPC workloads and how they relate to the available alternative approaches is critical to architecting a system that delivers usable real world performance.

Deploy more standard servers in the cluster

For many users this is the best solution. Industry standard servers are cost effective and are applicable to a wide range of problems. They have a large software ecosystem available which makes them extremely versatile.

It is only when considerations such as energy costs or facilities constraints make deploying more standard systems problematic that most accelerator technologies are a better choice. There are a number of technologies available that support the data parallelism required to accelerate floating-point operations.

Programmable Logic Devices

Programmable Logic Devices (PLDs) such as Field Programmable Gate Arrays (FPGAs) are frequently used as the core component of accelerator options. They are relatively inexpensive, moderately low-power devices that are essentially blank application-specific integrated circuits. While very effective for their intended use, their programming models are focused on the needs of circuit designers rather than those of software engineers, which can increase development time and costs. Current generation FPGAs can be adapted to accelerate single precision applications effectively, but double precision performance remains limited. In practice, using FPGAs shifts the effort of implementing a floating-point accelerator from the manufacturer to the user.

Graphics Processing Units

GPUs are the most ubiquitous of the accelerator technologies with the potential to accelerate general-purpose floating-point calculations. As a result of their high-volume production they are inexpensive. Focused on the uncompromising demands of the gaming market, high-end GPUs deliver performance but frequently compromise other requirements of the HPC market. They are typically optimized for single precision, may not conform to the rounding conventions of IEEE 754 floating-point arithmetic, and frequently consume several times the energy used by a modern standard general-purpose processor. Until very recently they had to be programmed through graphics-oriented interfaces such as OpenGL, which resulted in a limited range of suitable development tools. Only with the advent of DirectX 9 have they become fully programmable. However, the programming model requires the programmer and the user to manually extract and exploit parallelism in the algorithm and the data structures.
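To see why a focus on single precision matters for HPC workloads, consider a long running sum. The Python sketch below emulates IEEE 754 single precision by rounding every intermediate result to 32 bits with the standard `struct` module, then compares the accumulated error against ordinary double precision. The step value, the iteration count, and the helper names `to_float32` and `accumulate` are hypothetical choices made for illustration. The sketch shows only the effect of reduced precision; it does not model any particular GPU or non-standard rounding behavior.

```python
import struct


def to_float32(x):
    """Round a Python float (an IEEE 754 double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times, optionally rounding to 32 bits."""
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            # Rounding after every addition emulates a 32-bit accumulator.
            total = to_float32(total)
    return total


N = 1_000_000      # hypothetical iteration count
EXACT = 100_000.0  # mathematically exact value of N * 0.1

err_double = abs(accumulate(0.1, N, single_precision=False) - EXACT)
err_single = abs(accumulate(0.1, N, single_precision=True) - EXACT)

print(f"double precision error: {err_double:.3e}")
print(f"single precision error: {err_single:.3e}")
```

The single precision error is many orders of magnitude larger than the double precision error, which is why iterative or long-running numerical codes often require 64-bit arithmetic.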

Game Processors

The Cell Broadband Engine™, jointly developed by Sony, Toshiba, and IBM (STI), is the best-known game processor targeted at HPC application acceleration. With the potential for high-volume production and correspondingly low cost, game processors have generated significant interest in the community.

They share many of the same design criteria as GPUs and consequently exhibit many of the same attributes, including a focus on single precision performance and similar power consumption characteristics. The programming model of game processors is that of a MIMD or SMP-type coprocessor array, and to fully exploit their potential, the programmer and the user have to be closely aware of everything that goes on in the hardware.

ClearSpeed Technology CSX600

The ClearSpeed CSX600 processor array is currently the most radical example of an energy-efficient, multi-core processor design, with 96 processing elements running at only 210 MHz. The ClearSpeed CSX600 is a true coprocessor and depends on a general-purpose processor as its host. The result is a chip that is simultaneously the fastest at 64-bit floating-point arithmetic and among the lowest in power consumption, averaging approximately 10 watts. Of the currently available accelerator technologies it is the only one specifically designed for the needs of the HPC community.

Accelerating into the Future

Each of the available technologies excels at its intended purpose and can be deployed to good effect in HPC applications that are well matched to its specifications. Using accelerators optimized for an application regime in combination with standard, generic processors is something we can do now to mitigate facilities costs.

While satisfying an insatiable demand for compute cycles is by definition impossible, it is also clear that shifting from traditional supercomputing architectures to new hybrid approaches offers a way for the HPC community to overcome the challenges presented by floor space, energy consumption limits and ease of use.