2.0.34: Jiffies wraparound caused immediate solid hang

Ville Herva (vherva@niksula.hut.fi)
Fri, 1 Oct 1999 20:50:47 +0300


I'm not sure whether discussing this here has any value, but here it goes:

After 497 days of uptime, our ever trusted server (PII333, uniprosessor,
256MB, SCSI) crashed. Since jiffies wrap around at that uptime, we could
expect some funny things, but perhaps not as bad as this.

We were running a script that did sync once a second. I was watching the
jiffies to approach 2^32, and at the exact second that happened, the
machine went offline. The machine is located at ISP's machine room, so we
had to go boot it in the middle of the night.

The console had locked up, so we could not scroll to see the whole crash
message and we where in hurry to boot the machine, but it said something
about illegal division by zero, and the usual aiee, killing the interrupt
handler stuff.

As another data point, cron began eating all the CPU few minutes before
wraparound, and the last load I saw was >4 just before the crash, so
something else did that too.

Of course, this was not the newest 2.0, but after 492d 6h 50m 48.39s I
hope to be able to tell you whether this affects 2.0.38 as well. It may
still be the newest 2.0 then... ;).

What I'm actually asking is that is this problem familiar to anyone, and
has it been fixed in later 2.0. I know people have had much longer uptimes
(wonder what's the longest) with 2.0, so this obviously does not happen
every time. It's propably quite impossible to find the bug with this
description, but if anyone is interestedm I'll be happy to post the
.config or any further info.

Somebody suggested making jiffies to start from 2^32-10*60*HZ in one of
the pre- or ac-kernels. This way these kind of issues would propapbly get
caught in no time.

-- v --

v@iki.fi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/