Re: [2.2.1] Crash (wait_on_bh)

Simon Kirby (sim@netnation.com)
Wed, 24 Feb 1999 12:24:21 -0800 (PST)


On Wed, 24 Feb 1999 hamster@lspace.org wrote:

> Got a problem with a twin PII (Klamath) it's been dying about once
> every one to two days since being brought into service (mail relay/pop
> server). Finally got something intelligent out of it (screen blanking
> is a complete arse at times).
>
> --[ error ]--
> wait_on_bh, CPU1
> irq: 0 [0 0]
> bh: 1 [1 0]
>
> followed by a seemingly random set of numbers
>
> ... repeat many times...
>
> --[ error ]--
>
> The machine is running with ide and scsi controllers in SMP mode, with
> an Intel Express Pro 10/100 network card.

We've been having _exactly_ the same problem with our mail relay/pop
server that's also got close to the same hardware as you do (Dual
Deschutes, EEPRO100, aic7890 SCSI). The only difference is that evey time
it happens with ours it seems to be CPU 0 waiting for CPU 1 instead of how
you have CPU 1 waiting for CPU 0...Probably just a fluke, though.

The numbers you see after it is a stack trace for the waiting processor --
try matching them up with the System.map for that kernel and see where
it's waiting.

There's one report of the problem happening with bus logic SCSI and
another report of the problem happening with a 3com 3c905, so it sounds
like a generic problem rather than a problem specific to a certain driver.
I've decoded the trace twice and each time it was in tcp_v4_hash(), so I
tried to replicate the problem by opening and closing a lot of TCP sockets
on my dual celeron at home, but I didn't seem to get anywhere.

The machine has been running 2.2.2pre5 for a while now and it doesn't seem
to have shown the problem, so perhaps something in 2.2.2 has fixed it.

The only thing I can find that's odd about it is that the frequency at
which it occurs seems to change slightly depending on the kernel version
or on the boot -- it ran for almost two weeks without a problem once on an
older kernel and then suddenly barfed twice in the same day. On another
kernel, it happened usually after the first day.

We have another server with identical hardware that doesn't show the
problem at all -- it does stats compilation endlessly all day, so it has a
higher CPU load, but it doesn't have as much TCP networking going on
(quite a bit of UDP (NFS), though)).

Simon-

| Simon Kirby | Systems Administration |
| mailto:sim@netnation.com | NetNation Communications |
| http://www.netnation.com/ | Tech: (604) 684-6892 |

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/