Re: 1.3.95 is not stable

Steven L Baur (steve@miranova.com)
26 Apr 1996 10:56:26 -0700


>>>>> "Linus" == Linus Torvalds <torvalds@cs.Helsinki.FI> writes:

Linus> On 25 Apr 1996, Steven L Baur wrote:
>> Presumably the constant network crashes
>> are going to get fixed before 2.0, right?

Linus> Is this 100% network-related? Any correlation with anything
Linus> else?

Crashes have appeared in rpc.mountd and rpc.nfsd quite frequently.
This machine does a lot of NFS, and it often does not survive
overnight even when all it is doing is networking. On my more
stable machines the (far less frequent) crashes nearly always appear
to be network related.

Linus> The crash you see is due to memory corruption in the kernel
Linus> (the function "handle_signal()" to be exact): the code sequence
Linus> _should_ be

Linus> addl $0x10,%esp
Linus> cmpl $0x0,0x8(%ebx)
Linus> jnl <handle_signal+106>

Linus> which is "0x83 0xc4 0x10 0x83 0x7b 0x08 0x00 0x7d 0x06". Your panic
Linus> reports "0x83 0x84 0x10 0x83 0x7b 0x08 0x00 0x7d 0x06".

Linus> Note the _one_bit_ error in the second byte... (0xc4 has become
Linus> a 0x84).
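
The one-bit claim is easy to check mechanically by XORing the two byte
sequences quoted above. A minimal Python sketch; the helper name
bit_diffs is made up for illustration:

```python
# Byte sequences quoted in the message above.
expected = bytes([0x83, 0xC4, 0x10, 0x83, 0x7B, 0x08, 0x00, 0x7D, 0x06])
observed = bytes([0x83, 0x84, 0x10, 0x83, 0x7B, 0x08, 0x00, 0x7D, 0x06])

def bit_diffs(a, b):
    """Return (byte_offset, bit_number) for every bit that differs."""
    diffs = []
    for offset, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y  # set bits mark positions where x and y disagree
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs

print(bit_diffs(expected, observed))  # [(1, 6)]
```

Only bit 6 (mask 0x40) of the second byte differs.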

Linus> Now, one-bit errors might be due to flaky memory, and the
Linus> reason you see the problems with some kernels and not others
Linus> _may_ be because the small differences in the kernels move the
Linus> code around a bit, and then the errors show up in different
Linus> places (or fail to show up at all).

Linus> The error may well be due to a bogus kernel pointer being used
Linus> for bitmap operations too, of course. That's the more likely
Linus> explanation (bogus pointers _usually_ result in byte or word
Linus> corruption rather than just a single bit error, but this can be
Linus> unlucky).
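
To illustrate how a stray bitmap operation could produce exactly this
damage, here is a rough Python sketch. The clear_bit() below is a
hypothetical stand-in written for this example, not the kernel's
routine, and the bit number 14 is chosen only to reproduce the
observed byte:

```python
# Hypothetical illustration only: this is not the kernel's clear_bit().
def clear_bit(nr, mem):
    """Clear bit number nr in a bytearray treated as a little-endian bitmap."""
    mem[nr >> 3] &= ~(1 << (nr & 7)) & 0xFF

# The intended instruction bytes, standing in for kernel text in memory.
code = bytearray([0x83, 0xC4, 0x10, 0x83, 0x7B, 0x08, 0x00, 0x7D, 0x06])

# A bitmap operation that wrongly lands on this memory: bit 14 is
# bit 6 of byte 1, so exactly one bit of one byte changes.
clear_bit(14, code)

print(code.hex(" "))  # 83 84 10 83 7b 08 00 7d 06
```

A single misdirected bit clear is enough to turn 0xc4 into 0x84 and
leave every other byte untouched.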

Good explanation. Thank you.

I replaced all the main memory on this system with memory from a
system that has been happy with .95 (and restored the .95 kernel) and
got the same error. This time it was triggered by an NFS copy of the
libc 5.3.12 source code:
Apr 26 09:30:20 deanna kernel: general protection: 0000
Apr 26 09:30:20 deanna kernel: CPU: 0
Apr 26 09:30:20 deanna kernel: EIP: 0010:[handle_signal+91/120]
Apr 26 09:30:20 deanna kernel: EFLAGS: 00010202
Apr 26 09:30:20 deanna kernel: eax: 40061d8c ebx: 01f0e8e4 ecx: 00000000 edx: 00002000
Apr 26 09:30:20 deanna kernel: esi: 0000000e edi: ffffffff ebp: 00002000 esp: 01f08f68
Apr 26 09:30:20 deanna kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Apr 26 09:30:20 deanna kernel: Process update (pid: 11, process nr: 11, stackpage=01f08000)
Apr 26 09:30:20 deanna kernel: Stack: 01f0e8e4 01f08fbc 0000000e 00002000 0000000e 01f08fbc 0010a255 0000000e
Apr 26 09:30:20 deanna kernel: 01f0e8e4 00002000 01f08fbc 00002000 01f08fbc bffffe18 00000005 00109ba7
Apr 26 09:30:20 deanna kernel: 00002000 01f08fbc 01f0e414 00000000 0010a379 00000000 00000000 00000000
Apr 26 09:30:20 deanna kernel: Call Trace: [do_signal+529/604] [sys_sigsuspend+59/76] [system_call+89/160]
Apr 26 09:30:20 deanna kernel: Code: 83 84 10 83 7b 08 00 7d 06 c7 03 00 00 00 00 a1 0c 47 1f 00
Apr 26 09:30:20 deanna kernel: Unable to handle kernel paging request at virtual address c80d0ebb

[There's more, but it looks like more of the same thing]. It also
crashed immediately on reboot, but did not log anything.
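
The "Code:" line in this log can be compared against the sequence
Linus gave as correct. A small Python sketch, assuming the dump always
follows the literal text "Code:" as it does here; the function name
code_bytes is made up for this example:

```python
import re

# The "Code:" line from the log above, copied verbatim.
line = ("Apr 26 09:30:20 deanna kernel: Code: 83 84 10 83 7b 08 00 7d 06 "
        "c7 03 00 00 00 00 a1 0c 47 1f 00")

# Bytes Linus gave as the intended code sequence.
expected = bytes.fromhex("83 c4 10 83 7b 08 00 7d 06")

def code_bytes(log_line):
    """Extract the hex dump that follows 'Code:' in an oops log line."""
    match = re.search(r"Code:((?:\s+[0-9a-fA-F]{2})+)", log_line)
    if match is None:
        raise ValueError("no Code: dump found")
    return bytes.fromhex(match.group(1).strip())

dump = code_bytes(line)
# Compare only the leading bytes that Linus listed.
mismatches = [(i, hex(e), hex(d))
              for i, (e, d) in enumerate(zip(expected, dump)) if e != d]
print(mismatches)  # [(1, '0xc4', '0x84')]
```

The same byte is wrong in the same way as in the earlier panic, even
with the memory swapped.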

O.K. It's not the RAM. I've now tried turning off all caching, and it
doesn't appear to be crashing immediately. It is not usable like
that either, though. :-(

Regards,

-- 
steve@miranova.com baur
Unsolicited commercial e-mail will be proofread for $250/hour.
Andrea Seastrand: For your vote on the Telecom bill, I will vote for anyone
except you in November.