Re: discriminate single bit error hardware failure from slab corruption.

From: Avi Kivity
Date: Thu Feb 02 2006 - 19:43:19 EST


Dave Jones wrote:

In the case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages in those
cases, in the hope that users will try memtest before they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Please run memtest86.

Signed-off-by: Dave Jones <davej@xxxxxxxxxx>

--- linux-2.6.15/mm/slab.c~ 2006-01-09 13:25:17.000000000 -0500
+++ linux-2.6.15/mm/slab.c 2006-01-09 13:26:01.000000000 -0500
@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total=0;
printk(KERN_ERR "%03x:", offset);
for (i = 0; i < limit; i++) {
+ if (data[offset+i] != POISON_FREE)
+ total += data[offset+i];


How about counting the bits that differ from the poison value instead:

total += hweight8(data[offset+i] ^ POISON_FREE);

printk(" %02x", (unsigned char)data[offset + i]);
}
printk("\n");
@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
}
}
printk("\n");
+ switch (total) {
+ case 0x36:
+ case 0x6a:
+ case 0x6f:
+ case 0x81:
+ case 0xac:
+ case 0xd3:
+ case 0xd5:
+ case 0xea:
+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
+ return;
+ }


And a

if (total == 1)
printk(...);

here? It seems more readable, and more correct as well.

}
#endif
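
To make the suggestion concrete, here is a minimal user-space sketch of the idea, not a tested kernel patch. The helper names hweight8_sketch(), count_flipped_bits() and is_single_bit_error() are made up for illustration; in the kernel, hweight8() would do the per-byte bit counting, and POISON_FREE is taken as 0x6b to match the dump above. A byte sum can wrap, so unrelated multi-byte corruption could land on one of the listed case values, whereas a count of differing bits equals 1 only when exactly one bit in the line is off.

```c
#include <stddef.h>

/* Assumed to match the slab poison byte shown in the dump above. */
#define POISON_FREE 0x6b

/* User-space stand-in for the kernel's hweight8(): set bits in one byte. */
static unsigned int hweight8_sketch(unsigned char w)
{
	unsigned int count = 0;

	while (w) {
		count += w & 1u;
		w >>= 1;
	}
	return count;
}

/* Total number of bits in the line that differ from the poison pattern. */
static unsigned int count_flipped_bits(const unsigned char *data, size_t limit)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < limit; i++)
		total += hweight8_sketch(data[i] ^ POISON_FREE);
	return total;
}

/* Nonzero when exactly one bit in the whole line is wrong. */
static int is_single_bit_error(const unsigned char *data, size_t limit)
{
	return count_flipped_bits(data, limit) == 1;
}
```

For the example line in the patch description, the only odd byte is 6a, and 0x6a ^ 0x6b is 0x01, so the count is 1 and the memtest hint would be printed.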





--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
