Re: discriminate single bit error hardware failure from slab corruption.

From: Avi Kivity
Date: Thu Feb 02 2006 - 21:03:37 EST


Dave Jones wrote:

On Fri, Feb 03, 2006 at 02:44:52AM +0200, Avi Kivity wrote:

> total += hweight8(data[offset+i] ^ POISON_FREE);
>
> > printk(" %02x", (unsigned char)data[offset + i]);
> > }
> > printk("\n");
> >@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> > }
> > }
> > printk("\n");
> >+ switch (total) {
> >+ case 0x36:
> >+ case 0x6a:
> >+ case 0x6f:
> >+ case 0x81:
> >+ case 0xac:
> >+ case 0xd3:
> >+ case 0xd5:
> >+ case 0xea:
> >+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
> >+ return;
> >+ }
>
> and a
>
> if (total == 1)
>         printk(...);
>
> here? it seems more readable and more correct as well.

More readable ? Are you kidding ?
What I wrote is smack-you-in-the-face-obvious what it's doing.
With your variant, I have to sit down and think it through.


Looks like we have mirror image brains :) - I had to scratch my scalp to figure out where all the magic numbers in the switch came from.

Perhaps well named variables will help:

unsigned char modified_bits = data[offset+i] ^ POISON_FREE;
int modified_bits_count = hweight8(modified_bits);
total += modified_bits_count;

wrt correctness, what do you see wrong with my approach?


Your code will generate a false positive 8 times in 256 runs, or 1 in 32. A 3% false positive rate seems excessive. It's also sensitive to changes to POISON_FREE.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
