Re: Frequent oops in shrink_mmap

David desJardins (desj@google.com)
Mon, 29 Nov 1999 17:09:12 -0800


Stephen C. Tweedie <sct@redhat.com> writes:
> This has the footprint of hardware problems stamped all over it.
>
>> Nov 15 15:50:42 s17 kernel: EIP: 0010:[try_to_free_buffers+18/136]
>> Nov 15 15:50:42 s17 kernel: EFLAGS: 00010202
>> Nov 15 15:50:42 s17 kernel: eax: 40000000 ebx: c029b320 ecx: 00000006 edx: 00020000
>> Nov 15 15:50:42 s17 kernel: esi: 40000000 edi: 40000000 ebp: c029b320 esp: c000dfa8
>
> So the immediate problem is that there is a page in the page map which
> has a page_map->buffers pointer of 0x40000000. That's one bit away from
> a legal value of zero. That sort of single-bit error is usually a sign
> of hardware trouble. It's not guaranteed, but that's the best diagnosis
> just from looking at one dump.

Thanks very much; this is really helpful.

I looked at 56 of these oops messages in try_to_free_buffers, from 10
machines. 50 messages (4 machines) have %eax=80000000, and 6 messages
(6 machines) have %eax=40000000. Is this consistent with a single-bit
memory error, or not? If it's purely a hardware problem, should I be
seeing 20000000 and 10000000 and other one-bit patterns? And should I
be seeing one-bit differences from valid nonzero pointers? Or is it the
case that only memory errors in the top two bits will trigger this oops,
while other memory errors remain undetected, and that the great
majority of entries are zero, so that all of the errors are likely to
occur on pages whose entry is zero?
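
For reference, here is a minimal sketch of the "one bit away from zero"
check applied to the two %eax values above. It assumes 32-bit values and
that the legal value of the pointer was zero; flipped_bits is a
hypothetical helper written for this illustration, not a kernel function.

```python
def flipped_bits(observed, expected=0, width=32):
    """Return the bit positions where observed differs from expected."""
    diff = (observed ^ expected) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


# %eax values reported in this message: value -> (messages, machines)
reports = {0x80000000: (50, 4), 0x40000000: (6, 6)}

for value, (messages, machines) in sorted(reports.items()):
    bits = flipped_bits(value)  # compare against the assumed legal value 0
    kind = "single-bit" if len(bits) == 1 else "multi-bit"
    print(f"{value:08x}: {kind} difference from 0 at bits {bits}, "
          f"{messages} messages on {machines} machines")
```

Both values differ from zero in exactly one bit (bit 30 and bit 31), so the
sketch only confirms the arithmetic; it says nothing about why the other 30
bit positions never show up. Passing a nonzero expected value to
flipped_bits would test for a one-bit difference from a valid pointer.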

-- David desJardins
