Re: linux SMP stability or lack thereof

Doug Ledford (dledford@dialnet.net)
Wed, 30 Sep 1998 00:13:11 -0500


Ricardo Galli Granada wrote:

> > There is definitely at least one 2.0.x SMP lockup problem I've seen
> > several times and we have EIP traces off. It however isnt a blank screen
> > dead machine lockup, its a "Deadlock detected on ....." message in each
> > of the reports it tallies to or alternatively its a "live machine
> > nothing happening" livelock. Again predictable.
> > Finally try with 2.0.29 kernel images. The only "major" SMP change in
> > recent history is about 2.0.30 when Leonard added some IRQ
> > forwarding facilities.
>
> As I posted in previous messages, there are definitely some reproducible
> lockup problems with 2.0 and SMP on some motherboards.
>
> I was having them with an RC440LX motherboard (2xPII300MHz), the machine
> locked-up, blank screen, no noise...
>
> According to Doug, the AIC7xxx is SMP safe but I tried the driver with
> the following different combinations (SCSI and NIC) (with standard,
> clean, no modules kernel, no forwarding, just plain TCP/IP and libc5,
> gcc 2.7.2.1):

You aren't paying attention to what I said. You're writing this up like the
problem has to be a driver issue somewhere, either NIC or SCSI or whatever.
The exact EIP traces I have from this 2.0.35 SMP problem point not to any
driver but to the core kernel code. Skip testing the drivers, it's a waste
of time as long as the kernel can go into a deadlock with a fork(). There
is a race in the core kernel code, and if you are getting hit by it, then
you might as well recompile your kernel without SMP.

> 7880UW+Intel EtherExpress
> 7880UW+3Com905
> 2940UW+Intel EtherExpress
> 2940UW+3Com905
>
> on a 10 (ten) Mbits LAN.
>
> I could lockup the server *always*, just doing a ping flood from another
> Linux (clean Pentium 133) while compiling the kernel.
>
> I tried with different SCSI options (BIOS, no BIOS, reset, no reset) and
> the results are equivalent. Then I tried reducing the amount of memory
> available to the kernel (via mem=xxxMB) in lilo.conf. I reduced the memory
> to 246 MB (the machine has 256) and the machine died anyway.
>
> Finally I was tired (and scared: it was a production machine, and before
> putting it into production I had tested everything for ten days on a
> private network, where it never died) and disabled SMP in the Makefile, so
> now it's crippled but alive.
>
> gallir@star:/home/people/gallir > w
> 12:07am up 42 days, 6:32, 2 users, load average: 0.19, 0.12, 0.04
>
> I must say that 2.0.34 was more stable than 2.0.35. With 2.0.33
> and squid the machine died every 2 days.
>
> So, I would bet that it is not:
>
> - a temperature problem, nor
> - a memory problem, nor
> - a network card driver problem.
>
> Doug believes that the 5.0.19 aic7xxx driver is SMP safe and that perhaps
> it is something deeper in the kernel (I tried 5.1.0pre10 but it does not
> work: reset problems, which seem to be an automatic termination issue).
>
> I am going to try with a Buslogic 958, but they take a "few" months to be
> delivered in Europe (I requested one from the Spanish Buslogic distributor
> one month ago and am still waiting).
>
> SMP FAQ maintainers, you may want to put this motherboard on
> the bad list (another one...)

There's nothing wrong with your motherboard. You're assuming that because you
can't get your particular configuration to work, it A) must be a driver
issue and B) must be related to your particular motherboard. More likely,
it's a matter of CPU speed versus disk/network sub-system speeds. Get faster
processors and maybe the race conditions will no longer be a problem (my own
system got much worse when my disks got much faster).

My system can reproduce the problem quite easily, but I don't pretend to be
the least bit expert on the fork() portions of the kernel. All I can say is
I can reliably reproduce this problem and the EIP traces say it happens
during a fork() and that it was very likely aggravated by the changes
between 2.0.33 and 2.0.35 that enabled the swapping of shared-COW pages.
But, as Stephen has pointed out, if that made the problem worse (which it
did) then the real issue was just being hidden when we weren't swapping
those pages out. Any experts out there on the fork() code are more than
welcome to get with me and see if we can't track the problem down, I'll put
my machine into test mode to try and get rid of the problem.
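
For anyone who wants to exercise the fork() path while looking at this, a
minimal sketch of a fork stress loop follows. It is not the reproducer behind
the EIP traces mentioned above, and nothing here is known to trigger the
deadlock; it only forks repeatedly and has each child write to pages it
inherited copy-on-write. The function name run_fork_stress, its parameters,
and the 4 KiB page size are all assumptions made for illustration.

```python
import os


def run_fork_stress(rounds=100, children=8, buf_bytes=1 << 20):
    """Fork `children` processes per round and count the clean exits.

    Hypothetical helper: each child dirties memory inherited from the
    parent so the kernel has to break copy-on-write sharing after fork().
    """
    buf = bytearray(buf_bytes)  # shared copy-on-write with each child
    clean_exits = 0
    for _ in range(rounds):
        pids = []
        for _ in range(children):
            pid = os.fork()
            if pid == 0:
                # Child: touch one byte per assumed 4 KiB page, then exit
                # without running the parent's cleanup handlers.
                try:
                    for offset in range(0, len(buf), 4096):
                        buf[offset] = 1
                    os._exit(0)
                except BaseException:
                    os._exit(1)
            pids.append(pid)
        # Parent: reap every child from this round before forking again.
        for pid in pids:
            _, status = os.waitpid(pid, 0)
            if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
                clean_exits += 1
    return clean_exits


if __name__ == "__main__":
    print(run_fork_stress())
```

Running it alongside disk and network load (a kernel compile and a ping
flood, as described in the quoted report) would be the closest match to the
conditions discussed here, but that combination is a guess, not a confirmed
recipe.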

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
