Modern computer chips have an incredibly high density of memory cells.

Because of this, they are vulnerable to a few things, including alpha particles and cosmic rays.

Tip: Alpha particles are a potent form of radiation.

They are essentially the nucleus of a helium atom and are emitted from nuclei during radioactive decay.

They are relatively large and have a high electric charge, giving them a short range.

However, within that range, they can have a considerable effect.

Cosmic rays are high-energy protons and other atomic nuclei that come from astrophysical sources and bombard the planet constantly.

The actual risk generally comes from secondaries: the debris from a cosmic ray's collision with the upper atmosphere.

Because they are caused by outside factors rather than by faulty hardware, corrupted memory cells are considered soft errors, and they can be fixed.

ECC stands for error correction code.

ECC memory modules are used where potentially corrupted data could cause expensive or fatal issues.

Airplanes, nuclear reactors, spacecraft, scientific data sets, and financial models must be reliable.

ECC RAM stores data redundantly enough for errors to be corrected.

The data isn't duplicated.

However, a checksum is available that can be compared to the current data.

If the checksum does not match the data, then an error has occurred and can be corrected.

If the checksum matches, then no correctable error has occurred.

The phrasing there is essential.

"No correctable error has occurred" is not the same as "no error has occurred."
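To make that distinction concrete, here is a minimal sketch using an extended Hamming (8,4) code, a small single-error-correcting, double-error-detecting scheme. It is an illustrative stand-in, not the code any particular ECC module uses, and the function names `encode`, `check`, and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of 0/1)."""
    code = [0] * 8  # index 0: overall parity, indexes 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity over the other 7 bits
    return code


def check(code):
    """Return (status, word); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    word = list(code)
    if syndrome == 0 and not parity_broken:
        return "ok", word                  # the checksum matches
    if parity_broken:
        word[syndrome] ^= 1                # assume one flip; syndrome locates it
        return "corrected", word
    return "uncorrectable", word           # even number of flips: detect only


def decode(code):
    """Extract the 4 data bits from a codeword."""
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


stored = encode(0)
for position in (0, 1, 2, 3):              # four bit flips in the same word
    stored[position] ^= 1
status, word = check(stored)
print(status, decode(word))
```

In this sketch a single flipped bit is corrected and two flipped bits are reported as uncorrectable. The four flips at the end turn the codeword for 0 into the valid codeword for 1, so the check passes even though the data is wrong.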

This concept only works if the computer very frequently checks for issues.

That example is oversimplified.

It helps to demonstrate why it's essential that these rare errors are caught as soon as possible, though.

The two errors can be corrected as long as one is fixed before the other occurs.

Given the low incidence rate, there's relatively little urgency.

Random events, however, can happen close together.
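As a rough illustration of how the checking interval matters, the sketch below treats bit flips in a single memory word as a Poisson process and computes the chance of two or more flips landing between consecutive checks. The rate `RATE_PER_HOUR` is a made-up placeholder rather than a measured figure, and the Poisson model is itself an assumption.

```python
import math


def p_two_or_more(mean, terms=30):
    """P(at least 2 events) for a Poisson distribution with a small mean.

    Sums the series mean^k / k! for k >= 2 directly, which avoids the
    rounding problems of computing 1 - P(0) - P(1) when the mean is tiny.
    """
    term = mean * mean / 2.0               # the k = 2 term
    total = 0.0
    for k in range(2, terms + 2):
        total += term
        term *= mean / (k + 1)             # advance to the k + 1 term
    return math.exp(-mean) * total


RATE_PER_HOUR = 1e-6                       # hypothetical flips per word per hour

for hours in (1, 24, 24 * 30):
    mean = RATE_PER_HOUR * hours           # expected flips between two checks
    print(f"check every {hours:>3} h: P(2+ flips) = {p_two_or_more(mean):.3e}")
```

For small rates the result scales with roughly the square of the interval, so under these assumptions checking ten times as often makes an uncorrectable double flip about a hundred times less likely.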

Implementation in Practice

Once identified, the errors need to be fixed.

The process of finding and fixing memory errors is called memory scrubbing.

To prevent any performance impact, memory scrubbing doesn't happen when the CPU requests data from the RAM.

Instead, the memory controller runs the scrubbing process while the RAM is idle.

When the CPU requests data, the memory controller opportunistically checks for errors.

This requires ECC RAM, which most computers do not support.

CPU caches, which are SRAM-based, can also be scrubbed.

However, they are usually far smaller than the main memory and don't need to be scrubbed often.

Main memory is typically DRAM-based.

DRAM offers far more storage density and uses more physical space, which means more opportunities for soft errors.

Therefore, it also needs to be scrubbed more frequently.

Scrubbing happens in two ways.

The first is opportunistic: it happens when the CPU requests data and is typically referred to as demand scrubbing.

The other method is called patrol scrubbing.

This is where the memory controller automatically scrubs the whole RAM while it is otherwise idle.
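The two modes can be sketched as a toy model. The `ScrubbedMemory` class and its methods are hypothetical names, and a plain Hamming(7,4) code stands in for whatever code a real memory controller implements in hardware, so treat this as an illustration of the control flow only.

```python
def encode(nibble):
    """Hamming(7,4) codeword as a list; index 0 is unused, 1..7 hold the bits."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c


def syndrome(c):
    """0 for a valid codeword, otherwise the position of a single flipped bit."""
    s = 0
    for pos in range(1, 8):
        if c[pos]:
            s ^= pos
    return s


def decode(c):
    return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


class ScrubbedMemory:
    def __init__(self, size):
        self.cells = [encode(0) for _ in range(size)]
        self.next_patrol = 0               # next address the patrol will visit
        self.corrections = 0

    def write(self, addr, nibble):
        self.cells[addr] = encode(nibble)

    def flip_bit(self, addr, pos):
        """Simulate a soft error at bit position 1..7 of one word."""
        self.cells[addr][pos] ^= 1

    def _scrub(self, addr):
        s = syndrome(self.cells[addr])
        if s:                              # non-zero syndrome: fix and write back
            self.cells[addr][s] ^= 1
            self.corrections += 1

    def read(self, addr):
        """Demand scrubbing: check the word the CPU asked for."""
        self._scrub(addr)
        return decode(self.cells[addr])

    def patrol_step(self):
        """Patrol scrubbing: check one word per idle slot, round-robin."""
        self._scrub(self.next_patrol)
        self.next_patrol = (self.next_patrol + 1) % len(self.cells)


mem = ScrubbedMemory(4)
mem.write(2, 0b1010)
mem.flip_bit(2, 5)                         # soft error in a word nobody is reading
for _ in range(4):                         # one full patrol pass while idle
    mem.patrol_step()
print(mem.corrections, bin(mem.read(2)))
```

Because this stand-in code can only fix one flipped bit per word, a second flip that arrives before either scrubbing mode visits the word would be miscorrected, which is why the patrol interval matters.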

Regular and efficient scrubbing of memory contributes to its reliability.

Memory scrubbing, however, requires ECC RAM and is not possible on standard RAM.

Tip: DDR5 memory includes on-die ECC.

This allows correcting bit flips while the data is at rest.

On-die ECC cannot, however, protect data in transit between the RAM and the CPU, and it is not as effective as true ECC memory.