sporadic "freezes" on amd64 (GA K8NF)

From: Jaco Kroon
Date: Fri Aug 05 2005 - 15:37:50 EST


Hello all,

I'm absolutely stumped with this one. We still can't decide whether
this is a software problem or a hardware problem. This particular box
(specs lower down) freezes up sporadically when running Linux.

Normally it just stops responding entirely: one moment it's still
producing output and the next there is nothing. Twice, however, we got
a kernel panic. I took a picture of one, which can be found at
http://www.kroon.co.za/images/kernel_panic_amd64.jpg (apologies for the
quality - phones aren't good at taking them). From this panic (and the
other, which I had no way of capturing at the time) it looks like a bug
somewhere in the hard drive access path. The one pictured was on
reiserfs, the other was on ext3.

Hardware specs:

2GB RAM
Gigabyte K8NF
AMD 3500+ processor
GeForce 6200 graphics card

We've tried at least three different distributions (Mandrake, SuSE and
Gentoo) with both ext3 and reiserfs as file systems. Mandrake and SuSE
were 32-bit versions, and we tried both 32-bit and 64-bit Gentoo.

I've tried various kernels (2.6.10, 2.6.11.8, 2.6.11.11, 2.6.12 and
2.6.12.3), all to no avail. Unfortunately I no longer have the kernel
config that was in use when we captured the trace. We are using the
sata_nv module for the SATA controller, though.

Now for the truly odd thing: when we reduce the RAM to 1GB it works
fine. So we suspected that something might be wrong with the RAM
controller, and instead of 4 x 512MB we asked for 2 x 1GB. Apparently
this crashed as well.

And for those who want to ask: yes, we've left it running memtest for a
week, and we have tried different combinations of the 4 chips when going
down to 1GB (all the combinations we tried - about 10 - worked). And
yes, all the burn-in tests (all of the ones on the Ultimate Boot CD) as
well as some burn-in tests from the suppliers (under Windows) passed.
We also ran some benchmarking tools on Windows, since the suppliers
said that if we can consistently crash Windows they'll swap it out (to
quote: "It runs Windows - it performs within spec"). Needless to say,
we're not going back to them for future purchases.

And no, we are not using the binary nvidia module :).

Thanks in advance for any and all suggestions.

Jaco

PS: A text-only version of the stack trace (minus a lot of numbers):
Call Trace:<IRQ> {as_remove_queued_request+288}{as_move_to_dispatch+342}
{as_next_request+941}{elv_next_request+277}
{scsi_request_fn+89}{blk_run_queue+40}
{scsi_end_request+252}{scsi_io_completion+484}
{sd_rw_intr+598}{scsi_softirq+53}
{__do_softirq+83}{do_softirq+53}
{irq_exit+76}{do_IRQ+71}
{ret_from_intr+0} <EOI> {system_call+126}

Code: 83 79 88 01 75 09 e9 a7 00 00 00 48 8b 4f 10 48 85 c9 66 90
RIP <ffffffff{rb_erase+384} RSP <ffffffff804379d0>
CR2: 0000.0002e8
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
