D. J. Bernstein
Computer hardware

Memory errors and SECDED

According to memory manufacturer Corsair, a typical computer with 256MB of non-ECC memory has several memory errors every year. These hardware failures cause computers to crash, destroy data, etc.

ECC memory, when properly used, eliminates this problem. It stores 64 bits of data in 72 bits of physical memory using what mathematicians call a ``distance-4 code'': any two valid 72-bit words differ in at least 4 bit positions. If one of those 72 bits is flipped by a cosmic ray, the motherboard will automatically fix the error, letting the computer continue operating with the correct data. In the unlikely event that two bits are flipped, the motherboard will at least detect the error and halt the computer, rather than allowing data to be corrupted.
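To make the correct-one, detect-two behavior concrete, here is a minimal sketch in Python. It assumes a textbook extended Hamming code; real chipsets may use a different distance-4 code with a different bit layout, and the names encode and decode are made up for this illustration.

```python
# Sketch of a SECDED code: extended Hamming, 64 data bits -> 72 stored bits.
# Assumption: textbook layout. Index 0 holds an overall parity bit, indices
# that are powers of two hold check bits, all other indices hold data.

def encode(data_bits):
    """Turn a list of 0/1 data bits into a SECDED codeword (list of 0/1)."""
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:       # smallest r with 2^r >= k + r + 1
        r += 1
    n = k + r
    word = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two: data position
            word[pos] = next(bits)
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos           # XOR of the positions of all 1 bits
    for i in range(r):
        word[1 << i] = (syndrome >> i) & 1   # force that XOR to zero
    word[0] = sum(word) % 2           # make the total number of 1 bits even
    return word

def decode(word):
    """Return (status, data); status is 'ok', 'corrected', or 'detected'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos
    parity_even = sum(word) % 2 == 0
    if syndrome == 0 and parity_even:
        status = "ok"
    elif not parity_even and syndrome < len(word):
        word[syndrome] ^= 1           # single flip: syndrome names the bad bit
        status = "corrected"
    else:
        return "detected", None       # two flips (or worse): refuse to guess
    data = [word[pos] for pos in range(1, len(word)) if pos & (pos - 1)]
    return status, data

data = [(i * 7) % 3 % 2 for i in range(64)]
word = encode(data)
word[17] ^= 1                         # one flipped bit
print(decode(word)[0])                # prints "corrected"
word[40] ^= 1                         # a second flipped bit in the same word
print(decode(word)[0])                # prints "detected"
```

The ``detected'' case is where a motherboard would halt the computer instead of handing back corrupted data.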

(Of course, software failures also cause computers to crash, destroy data, etc. In a world of ludicrously fragile 20th-century software design, software failures are extremely common, and most hardware failures are incorrectly blamed on software. Making computers work means making the hardware work and making the software work.)

Error correction is not free. ECC memory is inherently 72/64 = 1.125 times as expensive as non-ECC memory; it uses 9 memory chips where non-ECC memory uses 8. However, given that one can buy 512 megabytes of Kingston ECC memory for under $100 as of October 2001, it seems rather silly to worry about this difference.
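Where do the 8 extra bits come from? A rough sketch, assuming the standard Hamming-style construction: to correct one error among k data bits and r check bits, the 2^r possible check results must distinguish ``no error'' from an error in each of the k + r positions, and one more parity bit adds double-error detection. The function name secded_check_bits is made up for this illustration.

```python
# Count the check bits a Hamming-style SECDED code needs for k data bits.
# Assumption: standard construction (smallest r with 2^r >= k + r + 1,
# plus one overall parity bit for double-error detection).

def secded_check_bits(data_bits):
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1

for k in (8, 16, 32, 64, 128):
    c = secded_check_bits(k)
    print(k, "data bits need", c, "check bits; cost ratio", (k + c) / k)
```

For 64 data bits this gives 8 check bits, hence 72 stored bits and the 1.125 ratio.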

A more serious problem is the circuitry required for the motherboard to actually correct errors. This isn't expensive to build, but it requires engineering expertise that, apparently, most hardware manufacturers don't have.

Abit, for example, produces motherboards that don't actually correct errors, although you can still plug ECC memory into them. Even worse, Abit habitually makes fraudulent claims of ECC support. I will never buy another Abit motherboard.

The Linux ECC page has a list of chipsets that can correct errors. Note, however, that many motherboards don't actually correct errors, even though they use chipsets capable of correcting errors; the Asus A7M266 motherboard, using the AMD 761 chipset, is an example.

If you're a motherboard manufacturer, and you have a motherboard that actually corrects errors, please say so! This is a major plus for your motherboard. Don't just say ``ECC'' in your data sheets. Say ``SECDED: corrects single-bit RAM errors, detects double-bit RAM errors.'' SECDED is a standard acronym for single-error correction and double-error detection.

If you're a computer purchaser, feel free to copy this link to your web pages: I buy PCs with SECDED.