RE: Linux SMP problems: wait_on_bh

Terry Katz (katz@advanced.org)
Fri, 5 Feb 1999 21:23:14 -0500


We're running a Dell PowerEdge 6100 and are having the same problem... We
haven't had the problem until just recently (2.2.0), but its also that just
recently our system usage has shot up 200%... Its not convienient for us to
swap motherboards, unless its a recent defect on the board .. :/

It would be nice if we could find out 'exactly' what is causing this problem
.. defect on the m/b? incompatibility with the m/b? bug in kernel? It's
becoming really frustrating since I talked my fellow employees into
switching the system over from NT to Linux.. Sure the system runs much
faster now, but having crashes like this every other day doesn't do good for
Linux's rep in this case... They agree that linux is a much better system
than NT, but the fact is, NT didn't crash like this .. (I'm sure you can see
my frustration)

So my question is .. does anyone know 'why' this wait_on_bh occurs, and what
a possible fix is ??

-Terry
katz@advanced.org

> -----Original Message-----
> From: owner-linux-kernel@vger.rutgers.edu
> [mailto:owner-linux-kernel@vger.rutgers.edu]On Behalf Of Simon Kirby
> Sent: Friday, February 05, 1999 6:57 PM
> To: Fredrik Lindgren
> Cc: linux-kernel@vger.rutgers.edu; linux-smp@vger.rutgers.edu
> Subject: Re: Linux SMP problems: wait_on_bh
>
>
> On Fri, 5 Feb 1999, Fredrik Lindgren wrote:
>
> > This is the data from the latest crash (it was repeated atleast
> 5 times):
> >
> > wait_on_bh, CPU 0:
> > irq: 0 [0 0]
> > bh: 1 [0 1]
> > <[c010a385] [c016881e] [c0162e6b] [c0163108] [c016f91e] [c01533cb]
> > [c01537c5] [c012611d]>
> >
> > These are the mappings from System.map:
> >
> > c010a385: synchronize_bh
> > c016881e: tcp_v4_unhash
> > c0162e6b: tcp_close_state
> > c0163108: tcp_close
> > c016f91e: inet_release
> > c01533cb: sock_release
> > c01537c5: sock_close
> > c012611d: __fput
>
> We had our mail server freeze up twice with almost exactly the same trace
> when we were using a motherboard that was having odd problems (Tyan
> S1836 Thunder 100). Both times it said CPU #0 was waiting, and both times
> it was tcp_v4_unhash trying to get the lock while the other CPU seemed to
> be holding it forever.
>
> The other problem we had with the board were that with all 4 DIMMs being
> used (512MB, 4 x 128MB DIMMs), it would get random parity errors.
> Disabling the last 128MB with mem=384M at boot would make the problem
> stop, but the "crash" as you mentioned above was still happening, so we
> decided to switch boards...
>
> We're now using an ASUS P2B-DS board with the same RAM and same CPUs and
> we're having no problems whatsoever (up for over a week now).
>
> I'm not sure how this relates to what's happening with your server, but my
> first guess would be that it's still something to do with hardware.
> Vanilla 2.2.1 here, but 2.2.0 was on the machine when the wait_on_bh
> lockups were happening.
>
> Simon-
>
> | Simon Kirby | Systems Administration |
> | mailto:sim@netnation.com | NetNation Communications |
> | http://www.netnation.com/ | Tech: (604) 684-6892 |
>
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/