Re: 1.3.95 is not stable

Jes Degn Soerensen (jds@kom.auc.dk)
27 Apr 1996 18:18:17 +0200


>>>>> "Linus" == Linus Torvalds <torvalds@cs.helsinki.fi> writes:

Linus> On 25 Apr 1996, Steven L Baur wrote:
>> Here are 3 crash traces from syslog. The symptoms were all
>> similar in that init died each time making shutdown problematic.
>> I've downgraded to .93 out of self-defense. Presumably the
>> constant network crashes are going to get fixed before 2.0, right?

Linus> The strange thing is that the 94/95 patches didn't really
Linus> change much of the kernel: they contain mostly m68k stuff and
Linus> spelling fixes.

Linus> The crash you see is due to memory corruption in the kernel
Linus> (the function "handle_signal()" to be exact): the code sequence
Linus> _should_ be

Linus>     addl $0x10,%esp
Linus>     cmpl $0x0,0x8(%ebx)
Linus>     jnl <handle_signal+106>

Linus> which is "0x83 0xc4 0x10 0x83 0x7b 0x08 0x00 0x7d 0x06". Your
Linus> panic reports "0x83 0x84 0x10 0x83 0x7b 0x08 0x00 0x7d 0x06".

Linus> Note the _one_bit_ error in the second byte... (0xc4 has become
Linus> a 0x84).
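
[The comparison above can be checked mechanically. A minimal sketch, assuming only the two byte sequences quoted in the message; the helper name bit_diffs is made up for illustration and is not part of any kernel tooling.]

```python
# Byte sequences as quoted in the message above.
expected = bytes([0x83, 0xc4, 0x10, 0x83, 0x7b, 0x08, 0x00, 0x7d, 0x06])
reported = bytes([0x83, 0x84, 0x10, 0x83, 0x7b, 0x08, 0x00, 0x7d, 0x06])


def bit_diffs(a, b):
    """Return (byte_index, xor_mask) for every byte position that differs."""
    return [(i, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = bit_diffs(expected, reported)

# Total number of flipped bits across all differing bytes.
flipped_bits = sum(bin(mask).count("1") for _, mask in diffs)

for index, mask in diffs:
    print(f"byte {index}: xor mask 0x{mask:02x}")
print(f"flipped bits: {flipped_bits}")
```

[Only the second byte differs, and the XOR mask has a single bit set (bit 6), which matches the "one bit" observation.]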

Linus> Now, one-bit errors might be due to flaky memory, and the
Linus> reason you see the problems with some kernels and not others
Linus> _may_ be because the small differences in the kernels move the
Linus> code around a bit, and then the errors show up in different
Linus> places (or fail to show up at all).

Linus> The error may well be due to a bogus kernel pointer being used
Linus> for bitmap operations too, of course. That's the more likely
Linus> explanation (bogus pointers _usually_ result in byte or word
Linus> corruption rather than just a single bit error, but this can be
Linus> unlucky).
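
[To illustrate how a bitmap operation on a bogus pointer could clear exactly one bit, here is a simplified user-space model. Everything in it is an assumption made for illustration: the memory layout, the 4-byte bitmap, and the stray bit number are invented, and clear_bit below is a plain Python stand-in that only mimics little-endian bit numbering. It does not claim to show what actually happened on the reporter's machine.]

```python
def clear_bit(nr, mem):
    """Clear bit number nr in mem, counting bits little-endian within each byte."""
    mem[nr >> 3] &= ~(1 << (nr & 7)) & 0xFF


# Hypothetical layout: a 4-byte bitmap immediately followed by code bytes.
bitmap = bytes([0xFF, 0xFF, 0xFF, 0xFF])
code = bytes([0x83, 0xC4, 0x10, 0x83, 0x7B, 0x08, 0x00, 0x7D, 0x06])
mem = bytearray(bitmap + code)

# A valid bit number for this bitmap is 0..31. A stray value of 46 lands
# in byte 5 (46 >> 3), bit 6 (46 & 7), which is the second code byte.
stray_nr = 46
clear_bit(stray_nr, mem)

corrupted_code = bytes(mem[len(bitmap):])
print(" ".join(f"0x{b:02x}" for b in corrupted_code))
```

[In this toy layout the out-of-range bit number turns 0xc4 into 0x84 and leaves every other byte alone, which is the same pattern as in the panic report.]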

I don't know whether this is related, but I have received a report
from Andreas Schwab about something similar on the m68k port.

Andreas told me that he is seeing a bug that randomly changes bits in
user pages. I haven't noticed this myself, though, and I'm afraid we
have no cure for it ;(.

Jes