Re: Linux SMP problems: wait_on_bh

Fredrik Lindgren (fli@unix.edu.sollentuna.se)
Sat, 6 Feb 1999 02:17:20 +0100 (MET)


On Fri, 5 Feb 1999, Simon Kirby wrote:

> On Fri, 5 Feb 1999, Fredrik Lindgren wrote:
>
> > This is the data from the latest crash (it was repeated atleast 5 times):
> >
> > wait_on_bh, CPU 0:
> > irq: 0 [0 0]
> > bh: 1 [0 1]
> > <[c010a385] [c016881e] [c0162e6b] [c0163108] [c016f91e] [c01533cb]
> > [c01537c5] [c012611d]>
> >
> > These are the mappings from System.map:
> >
> > c010a385: synchronize_bh
> > c016881e: tcp_v4_unhash
> > c0162e6b: tcp_close_state
> > c0163108: tcp_close
> > c016f91e: inet_release
> > c01533cb: sock_release
> > c01537c5: sock_close
> > c012611d: __fput
>
> We had our mail server freeze up twice with almost exactly the same trace
> when we were using a motherboard that was having odd problems (Tyan
> S1836 Thunder 100). Both times it said CPU #0 was waiting, and both times
> it was tcp_v4_unhash trying to get the lock while the other CPU seemed to
> be holding it forever.
>
> The other problem we had with the board were that with all 4 DIMMs being
> used (512MB, 4 x 128MB DIMMs), it would get random parity errors.
> Disabling the last 128MB with mem=384M at boot would make the problem
> stop, but the "crash" as you mentioned above was still happening, so we
> decided to switch boards...
>
> We're now using an ASUS P2B-DS board with the same RAM and same CPUs and
> we're having no problems whatsoever (up for over a week now).

Ah, ok. The annoying thing is that the same machine used to have NT4
on it and that never crashed... Well, not "hard" like this anyway. Also,
we've been seeing some reasonable uptimes, like 40+ days, but then it
happens three our four times within a week. There is no real pattern.
If we put something CPU intensive in there, like a distributed.net or a
GIMPS client it locks up within a few days.

> I'm not sure how this relates to what's happening with your server, but my
> first guess would be that it's still something to do with hardware.
> Vanilla 2.2.1 here, but 2.2.0 was on the machine when the wait_on_bh
> lockups were happening.

Yeah, I just wish we had the budget to get a new machine to try that
out... Might be in a better position now that Compaq is starting to
support Linux :) I just hope this includes these "old" machines.

Anyway, thanks for your tips!

Regards,
Fredrik Lindgren
SolNET, Sweden

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/