ClearSpeed's CSX600 is an embedded, low-power, data-parallel coprocessor. It provides 33 GFLOPS of sustained single- or double-precision floating-point performance while dissipating an average of 10 watts. Using 64-bit addressing, each CSX600 can support multi-gigabyte DDR2 SDRAM via a local ECC-protected memory interface.
The CSX600 processor is actually a system-on-a-chip (SoC), based around the combination of ClearSpeed's patented multi-threaded array processor (MTAP) and ClearConnect™ Network on Chip (NoC) technology. The MTAP architecture has been designed to provide unparalleled performance-per-watt, while the low-power ClearConnect NoC provides straightforward system-wide concurrent bandwidth.
The CSX600 comprises an MTAP processor core, an external DRAM interface, high-speed inter-processor I/O ports and embedded SRAM, all integrated onto a single chip. All subsystems on the chip are interconnected via the ClearConnect on-chip network. The MTAP contains an array of 96 Processing Elements (PEs), or cores. Each PE includes multiple processing units and has a high level of internal instruction and data parallelism. Each PE also has its own local memory, providing high-bandwidth access to frequently used data.
Interconnect on the CSX600 is provided by the ClearConnect NoC, a packet-switched on-chip network. All memory-based data transactions are converted into packets and then transmitted over the network. The network supports multiple concurrent transfers: for example, the processor can access data in the on-chip SRAM at the same time as data is transferred to the DDR2 interface from one of the bridge ports. This enables extremely high aggregate bandwidth with low power consumption. The ClearConnect NoC is also used, via bridge ports, to provide communication between CSX processors.
External memory is connected via a 64-bit DDR2 DRAM interface. When used with 72-bit wide DRAM modules, the interface can support Error Checking and Correction (ECC). Each processor supports up to 4 Gbytes of local DRAM and uses 64-bit addressing so that large data sets can be processed. The 64-bit address space is flexibly mapped into a 48-bit physical address space distributed across multiple processors. For embedded systems and backward compatibility, a simple 32-bit addressing mode is provided.
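As a rough illustration of how a 48-bit physical address space could be distributed across several processors, the sketch below splits an address into a processor index and a local offset. The field layout is an assumption made for this example only: 32 offset bits are chosen because 2^32 bytes matches the 4 Gbytes of local DRAM quoted above, and the remaining 16 bits are treated as a processor index. The actual CSX600 mapping is described as flexible and is not specified in this text; the function names are hypothetical.

```python
# Illustrative only: the real CSX600 address mapping is not given in this text.
PHYS_BITS = 48                       # physical address width stated above
LOCAL_BITS = 32                      # ASSUMED: 2**32 bytes = 4 Gbytes per processor
NODE_BITS = PHYS_BITS - LOCAL_BITS   # ASSUMED: remaining 16 bits select a processor


def split_physical(addr: int) -> tuple:
    """Split a 48-bit physical address into (processor index, local offset)."""
    if not 0 <= addr < (1 << PHYS_BITS):
        raise ValueError("address does not fit in 48 bits")
    node = addr >> LOCAL_BITS
    offset = addr & ((1 << LOCAL_BITS) - 1)
    return node, offset


def join_physical(node: int, offset: int) -> int:
    """Inverse of split_physical under the same assumed layout."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("processor index out of range")
    if not 0 <= offset < (1 << LOCAL_BITS):
        raise ValueError("offset exceeds the assumed 4 Gbyte local window")
    return (node << LOCAL_BITS) | offset


if __name__ == "__main__":
    node, offset = split_physical(0x0001_0000_1000)
    print(f"processor {node}, offset {offset:#x}")
```

Under this assumed layout, each processor owns one contiguous 4 Gbyte window, so an address can be routed to its owning processor by inspecting only the upper bits.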
The on-chip DMA controller can be programmed to transfer data between the external memory interface and any other device on the ClearConnect NoC. On-chip SRAM is included for frequently accessed code and data.
The Interrupt and Semaphore Unit (ISU) supports low-latency synchronization between threads and external events such as memory-to-memory communications. Both pin-signalled and message-signalled interrupts are supported, allowing multiple devices to be handled flexibly in various host environments.
A host interface allows the CSX600 to communicate with, and be controlled by, the system's host processor. This port can also be used as a hardware and software debug port as it provides full access to all the internal registers on the device. Finally, an IEEE 1149.1 Test Access Port (TAP) supports boundary scan for system test.
ClearSpeed's CSX600 Advance™ X620 and Advance e620 accelerators provide application acceleration without impacting power, cooling or space requirements. Built in a PCI form factor, one or more boards can be easily added to a workstation or server. Because the boards operate at the standard math library level, application users see the performance gain with none of the hassle of changing their code.
Each ClearSpeed Advance board requires no more space than a free PCI-X or PCIe slot, so adding acceleration to your workstations or servers is easy. Drawing only 25 watts on average, a board adds little to your power or cooling requirements.
Adding a single Advance board to your workstation gives you up to an additional 66 GFLOPS of sustained performance - that's like having several workstations under your desk. For many applications there's no need to port code: the board accelerates standard math libraries, including the Level 3 BLAS used by applications such as Mathematica and MATLAB.
Using multiple boards to accelerate your cluster enables even bigger breakthroughs in performance per node. Combined with IEEE 754-compliant 64-bit floating point, this frees teams to advance their science, tackling ever bigger problems with greater accuracy.
The Advance board works by offloading compute-intensive math library routines called by applications running on the host processor.
When an application calls a standard math library supported by ClearSpeed, the call is intercepted by CSXL, ClearSpeed's accelerated math library, which determines whether the function call is worth offloading. When it is, CSXL transfers the required data to the board to compute the function. The answer is calculated on the board and the results are read back into host memory before control returns to the application.
Throughout this process, the only perceivable difference between a function running on the host system and a function running on the Advance board is the speed. The acceleration is transparent to the end user and the application.
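The offload decision described above can be pictured as a simple cost comparison. The sketch below is not CSXL's actual heuristic, which is not documented here; it is a minimal model that assumes the call is a matrix multiply, that host and board throughput are known constants, and that transfer cost is determined by a single link bandwidth. The function name and all parameters other than the 66 GFLOPS board figure are hypothetical.

```python
def worth_offloading(m, n, k,
                     host_gflops=10.0,        # ASSUMED host throughput
                     board_gflops=66.0,       # board figure quoted in this document
                     link_gbytes_per_s=1.0,   # ASSUMED host-to-board bandwidth
                     bytes_per_element=8):    # 64-bit floating point
    """Estimate whether an (m x k) by (k x n) multiply finishes sooner on the board.

    Models C = alpha*A*B + beta*C: A, B and C are sent to the board and C is
    read back, so the transfer volume is m*k + k*n + 2*m*n elements.
    """
    flops = 2.0 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + 2 * m * n)

    host_time = flops / (host_gflops * 1e9)
    board_time = (flops / (board_gflops * 1e9)
                  + bytes_moved / (link_gbytes_per_s * 1e9))
    return board_time < host_time


if __name__ == "__main__":
    for size in (8, 256, 4096):
        target = "board" if worth_offloading(size, size, size) else "host"
        print(f"{size}x{size}x{size}: run on {target}")
```

The point of the model is that compute work grows with m·n·k while transfer volume grows only with the matrix areas, so small calls stay on the host and large ones are offloaded.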
The hardware consists of a single-slot PCI-X or PCIe board with two CSX600 processors that can be used to accelerate a single desktop machine or the nodes of a cluster. Multiple boards may be used in one system. The two CSX600s and an FPGA are daisy-chained together via high-speed ClearConnect™ bridges, which extend ClearSpeed's network-on-chip between devices. The ClearConnect network is also extended into the FPGA, enabling a bridge to the host system to be implemented as a hardware block. This provides an efficient universal memory architecture between the CSX600 processors and the host's memory system. Each processor has 512 Mbytes of DDR2 SDRAM local memory. This memory also forms part of the universal memory architecture, with DMA engines providing automated movement of data between local and host memory.
The ClearSpeed CSX600 is a high-performance coprocessor based around an advanced multi-threaded array processor (MTAP) with 96 Processing Elements (PEs). Each PE has a dual 64-bit FPU and 6 Kbytes of local memory. This architecture provides very high performance combined with extremely low power.
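The figures quoted in this document can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (33 GFLOPS and 10 watts per processor, 66 GFLOPS and 25 watts per board, two processors per board, 96 PEs with 6 Kbytes each, 512 Mbytes of DRAM per processor); the constant and function names are made up for the example, and no measured data is added.

```python
# All values below are taken from the text of this document.
CHIP_GFLOPS = 33.0
CHIP_WATTS = 10.0
BOARD_GFLOPS = 66.0
BOARD_WATTS = 25.0
CHIPS_PER_BOARD = 2
PES_PER_CHIP = 96
PE_LOCAL_KBYTES = 6
DRAM_MBYTES_PER_CHIP = 512


def board_summary():
    """Derive aggregate figures for one Advance board from the quoted numbers."""
    return {
        "gflops_from_chips": CHIPS_PER_BOARD * CHIP_GFLOPS,
        "chip_gflops_per_watt": CHIP_GFLOPS / CHIP_WATTS,
        "board_gflops_per_watt": BOARD_GFLOPS / BOARD_WATTS,
        "pe_memory_kbytes_per_chip": PES_PER_CHIP * PE_LOCAL_KBYTES,
        "board_dram_mbytes": CHIPS_PER_BOARD * DRAM_MBYTES_PER_CHIP,
    }


if __name__ == "__main__":
    for name, value in board_summary().items():
        print(f"{name}: {value}")
```

The board's 66 GFLOPS is exactly twice the per-processor figure. The GFLOPS-per-watt value is lower for the board than for a single processor, presumably because the 25-watt figure covers the whole board rather than the two processors alone.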
The board is provided with supporting software in the form of standard software libraries used in a variety of applications. As well as standard libraries and application software, the CSX600 board is supported by ClearSpeed's Software Development Kit (SDK). This includes a C compiler, a debugger based on gdb, and a full suite of supporting tools and libraries.
