Re: Huge uptimes & cosmic rays

John G. Alvord
Fri, 11 Jul 1997 04:17:55 GMT

On 10 Jul 1997 00:31:47 -0700, Daniel Quinlan wrote:

>Pavel Machek writes:
>> Huge amount of memory & huge uptime is calling for trouble: remember
>> that once or twice a year, random bit in your machine is selected &
>> toggled. If it hits kernel or libc or long-lived daemon...
>Good reason to use ECC DRAM. I have seen computations go wrong because
>of one-off bit errors. In one case, a machine had ECC accidentally
>disabled. Multiple runs of an intensive computation caused one-off bit
>errors in different places, which disappeared when ECC was turned on.

In the early 1980s, I was working at IBM Yorktown Research. A scientist
there had been studying the effect of cosmic rays for quite a while.
Cosmic rays are the result of high energy particles which strike the
earth's atmosphere and cause secondary streams of particles to burst
down to the earth's surface.

With 64K DRAM, he estimated that about one bit a year would change value
at sea level, and several bits a year in Denver, Colorado, which is about a
mile above sea level. The bit change is not always relevant. If the bit is
not in allocated memory, or if it is in buffer memory which is about to be
overwritten, then the bit switch is never actually noticed. The rate of
bit switches depended on how many electrons were in the well of energy
which defines the on/off condition.

This result was theoretical until it actually showed up on a mainframe
which had a portion of memory (I/O channel buffers) which was parity
checked but did not have ECC correction. Errors were reported from
the field but the parts almost always tested out perfect when they were
replaced and returned for testing. And the rates were several times
higher in Denver! Unless the machine was installed in a building which
had a lot of concrete above it physically... like low down in a tall
building.

Eventually a test was performed by installing a mainframe in a very high
ghost town in Colorado. Recordings were also made at a radio telescope
which logged when bursts of cosmic rays arrived at the earth's
surface... there was a precise correlation.

I never heard how the story came out... presumably the design was
changed to use ECC memory in that area. Sounds like it is still
important for high availability computers.
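
For anyone unsure why parity-checked buffers could only report these
errors while ECC memory would have fixed them, here is a minimal sketch
in Python. It is only an illustration: the function names are made up,
and Hamming(7,4) is a textbook single-error-correcting code chosen for
brevity, not the code any particular mainframe or DRAM controller used.

```python
# Illustration only: parity can detect one flipped bit but not locate it;
# a Hamming(7,4) code can locate the flipped bit and put it back.

def parity_ok(stored):
    """True if data bits plus their even-parity bit still sum to even."""
    return sum(stored) % 2 == 0

def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (data_bits, position); position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], position

data = [1, 0, 1, 1]

# Parity: the flip is noticed, but nothing says which bit to repair.
stored = data + [sum(data) % 2]
stored[2] ^= 1                        # simulated cosmic ray hit
parity_detected = not parity_ok(stored)

# Hamming(7,4): the same kind of flip is located and undone.
word = hamming74_encode(data)
word[4] ^= 1                          # simulated cosmic ray hit
recovered, bad_position = hamming74_correct(word)
```

After the simulated hit, the parity check can only say that something
is wrong, which matches the field reports of errors on parts that later
tested out perfect. The Hamming decoder gets the original four data bits
back. Real ECC memory uses wider codes that also detect double-bit
errors, but the principle is the same.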

john alvord