Soft Errors
The CSX700 processors are designed to ensure system reliability. This includes features for:
- Protection from soft errors
- Ensuring a high mean time between failures (MTBF)
- Dynamic power and temperature management
One of the most important of these features is the use of Error Correcting Codes (ECC) on all internal and external memory, which ensures that soft errors are detected and corrected.
What are "soft errors" and why do they matter?
A soft error is a transient, single-bit corruption in a circuit.
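To see why a single flipped bit matters, the short Python sketch below corrupts one bit of a 64-bit floating-point value and prints the result. It is only an illustration: the helper name flip_bit and the choice of the value 1.0 are assumptions made for this example and do not come from any ClearSpeed tool.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted.

    `bit` counts from 0 (least significant mantissa bit) to 63 (sign bit).
    Hypothetical helper, written for illustration only.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


if __name__ == "__main__":
    original = 1.0
    for bit in (0, 62, 63):
        print(f"bit {bit:2d} flipped: {original!r} -> {flip_bit(original, bit)!r}")
```

Depending on which bit is hit, the same kind of event can be almost invisible (a change in the last mantissa bit), flip the sign, or turn 1.0 into infinity. Nothing in the value itself indicates that it has been corrupted.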
Soft errors happen all the time and are caused by:
- High energy cosmic particle strikes
- Power supply fluctuations
- Crosstalk between electronic components
- Other random noise
Processors are becoming more susceptible to soft errors due to:
- Die size increases
- Process geometry shrinks
- Increasing clock rates
This is a huge problem for GPGPUs, which are among the most susceptible chips ever made:
- They are extremely large chips, e.g. 22x22mm for the latest Nvidia GPU
- They are on very small process geometries such as 65nm
- They have high clock rates in the GHz range
Protecting against soft errors
All modern CPUs protect themselves from soft errors by using Error Correcting Codes (ECC).
ClearSpeed protects its processors in the same way, using ECC on all memories, both on-chip and off-chip:
- ECC detects, fixes and/or reports soft errors
- All on-chip memories and off-chip DRAMs are protected this way
- This is exactly the same method used by x86 CPUs
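The "detects, fixes and/or reports" behaviour can be sketched with a toy single-error-correcting, double-error-detecting (SECDED) code. The Python example below protects 4 data bits with an extended Hamming (8,4) code. This is an assumption chosen for brevity: the text above does not say which code the CSX700 uses, and ECC memory commonly uses much wider codewords. The function names encode and decode are hypothetical.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    word = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4) word
    # Data bits live at the positions that are not powers of two.
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each parity bit covers the positions whose index has that bit set.
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    # The overall parity bit makes the number of 1s in the codeword even.
    word[0] = sum(word[1:]) % 2
    return sum(bit << i for i, bit in enumerate(word))


def decode(codeword: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = [(codeword >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position
    parity_is_odd = sum(word) % 2 == 1

    if syndrome == 0 and not parity_is_odd:
        status = "ok"
    elif parity_is_odd:
        # Odd number of flips: treat as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped. Report it
        # rather than return wrong data.
        return None, "uncorrectable"

    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return data, status


if __name__ == "__main__":
    stored = encode(0b1011)
    print(decode(stored))                  # clean read
    print(decode(stored ^ (1 << 5)))       # one soft error
    print(decode(stored ^ 0b00100100))     # two soft errors
```

In this sketch a single flipped bit is corrected transparently and a double flip is reported as an error, so the reader of the memory never receives silently corrupted data. The cost is the extra check bits stored alongside every word.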
Nvidia and AMD/ATI GPGPUs do not have ECC. They are therefore unreliable: soft errors go undetected, so they can silently and randomly generate incorrect answers.
GPGPUs do not have ECC protection because soft errors do not matter in visual computing, their main market.
Adding ECC is an unnecessary overhead for visual computing, so GPGPU vendors have had no reason to add this feature.