Hiya, list.
Think i've found a rather nasty bug in the kernel, and I need some clues
as to where to look for the issue.
Stats:
Quad Xeon (PIII core) 700mhz machine (1mb cache on each)
4gb RAM
5x36gb SCSI disks - on a DAC1100 RAID controller
3 EEPro 100 cards
The box functions as a database server that runs at about 40% load on
each CPU, and about 1.5gb memory usage.
Kernel: 2.2.18pre11-va2.0smp (Although completely reproducible on stock
2.2.18)
If dmesg output from kernel - or any other info is required, i'd be more
than happy to provide it.
Problem:
Box appears to stop responding to network requests for 30 seconds at a
time. it appears to be happening when we get this error:
wait_on_bh, CPU 0:
irq: 0 [0 0]
bh: 1 [0 0]
<[c010bb29]> <[c011d07b]> <[c011d1ed]> <[c0116658]> <[c01099fc]>
it LOOKS like the virtaddr's provided are a call trace, however, I can't
be sure - as SOME of the addresses don't show up as a ksym..
the problem LOOKS to be, from my perspective (And light code reading) -
that the function synchronize_bh() is called SOMEWHERE, and then,
wait_on_bh() is called from that. It also appears that wait_on_bh() loops
through MAXCOUNT times (100000000 times), and fails - therefore exiting
the function. (it also appears that the 1 global interrupt that is OPEN
is a TIMER interrupt.)
Can SOMEONE give me a clue as to where to start looking for this problem -
and/or perhaps some input from people who work on this code? (The really
confusing thing about this, is that the wait_for_bh() function should NOT
take 30 seconds to jump through the maximum loop count.)
Thanks in advance,
Chad
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sun Apr 15 2001 - 21:00:14 EST