Re: sporadic "freezes" on amd64 (GA K8NF)

From: Jaco Kroon
Date: Sat Aug 06 2005 - 03:08:09 EST


Mike Galbraith wrote:
> At 10:33 PM 8/5/2005 +0200, you wrote:
>
>> Hello all,
>>
>> I'm absolutely stumped with this one. We are still having problems
>> deciding whether this is a software problem or a hardware problem.
>
> Given the number of kernels it freezes under, I'd say hardware.

That is what we thought too, initially. I'm still tending towards this
option but we just don't know any more. Considering that Windows runs
just fine. Even under heavy workload. Although - compiling tends to be
some of the most rigurous hardware chowing work there is (Used to use
compiling glibc as a test on how stable my old pentium 90 was - usually
had to cross-compile it in the end). Anyway, we're unable to get
Windows to compile for extended periods of time, mostly because we
simply do not have any projects available to us that is big enough. Can
always try installing cygwin and using one or another opensource project
(MySQL for example).

>> This
>> particular box (specs lower down) just freezes up sporadically when in
>> Linux.
>
> You failed to describe the workload.

Of course I did. Compiling the kernel. It's probably the most reliable
way of crashing it. make mrproper && make allyesconfig && make - give
it about 2 to 5 minutes and it's dead. We also got in a couple of times
to die within about 10 seconds, as in KSYM was the only thing that
remained that still had to pass, so we would type make, it would do a
couple of checks (CHK I assume is for check) and the KSYM, at which
point it would just die.

>> Normally it just stops responding entirely. As in one moment it's still
>> outputting and the next there is nothing. Then once, (twice actually),
>> we actually got a kernel panic, I've taken a picture which can be found
>> at http://www.kroon.co.za/images/kernel_panic_amd64.jpg (Apologies for
>> the quality - phones aren't good at taking them).
>
> Try a serial console for capture.

I never did manage to get that right. I've still got the cable from my
previous attempts, guess I'll be trying again.

>> From this panic (and
>> the other which I had no way of capturing at the time) it looks like a
>> bug somewhere when accessing the hard drive. The one here was on
>> reiserfs the other was on ext3.
>
>
> The first thing than comes to mind is cooling. The next thing I'd
> suspect is the power supply. After verifying that these are in fact not
> the problem, I'd try a different disk controller.

I doubt it's cooling - after resetting the BIOS claims <40 degree
celcius temperatures so I'll try another power supply. Although, I
suspect it already is a 400W in there.

Jaco

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature