Re: kernel cache mem bug(?)

From: kernel
Date: Tue Mar 21 2006 - 10:34:57 EST


> > The RAM is 100% OK, both according to the initial BIOS memory check and
> > memtest86 running for more than four (4) hours straight.
> >
> > One more thing that we just noticed: it seems the cache corruption (or
> > whatever it is) only occurs when X is running.
>
> X is special in a couple of ways - it accesses hardware directly and it
> uses AGP. See if setting the NoAccel X option makes any difference to
> the failures - it'll make your desktop slower but should help validate X
> is the problem. If that does help then try disabling DRI (3D) support if
> your box has it, and see what that does. If it works with noaccel and
> fails with DRI then finally try with no AGP support built into your
> kernel. That should identify the problem as X, DRI, AGP or other.

I gave it a try, running the same reboot & test script for every change:

X with "noaccel" - no change (i.e. corruption still occur)
X w/o kernel fb support - no change
X without "dri" - no change
w/o AGP support in kernel - no change
X without "fgl" - no change
X without "glx" - no change
X without "fglrx" - no change (ATI:s driver, used vesa instead)
X without "dbe" - no change
X with only 1 screen - no change
X w/o Dev PCI:1... - no change
X without "dri" - no change
w/o kernel DRM support - no change
w/PCI acess=direct - no change
w/o kernel SMP HT support - no change

and finally...

w/o kernel SMP support - 100% ok!

(No corruption at all after 12 x 100 x 35 x 20 megabyte writing and
reading to ram cache)

I.e, the only thing that seem to fix this is by removing SMP (and by
doing that also HT) support :(

The processor is an Intel P4 3.4GHz w/1MB L2 cache
The graphics card is an ATI X800 using the PCI Express slot
(see my first post for more detailed info).

Perhaps worth noting: when doing repeated md5sum runs on the cached files
and discrepancies are found, it is "always" different files. I.e., the
same file will never show a bad checksum twice. A subsequent run will show
a good checksum for a file which just seconds ago had a bad one.

Of course no other programs were accessing the files during these tests.

Even though disabling kernel SMP support entirely will make this problem
go away, I seriously do not like that kind of workarounds...

Please let me know if there is more system/hardware information I could
provide to help sort this out.

Regards,
//D

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/