Hi,Sure, I will continue moving forward with it.
On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx> wrote:
When the watchdog determines that the current soft lockup is due
to an interrupt storm based on CPU utilization, reporting the
top three most time-consuming hardirq can be quite useful for
further troubleshooting.
Below is an example of interrupt storm. The call tree does not
provide useful information, but we can analyze which interrupt
caused the soft lockup by using the time-consuming information
of hardirq.
[ 67.714044] watchdog: BUG: soft lockup - CPU#9 stuck for 28s! [swapper/9:0]
[ 67.714548] CPU#9 Utilization every 4s during lockup:
[ 67.714549] #1: 0.0% system, 0.0% softirq, 95.0% hardirq, 0.0% idle
[ 67.714551] #2: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
[ 67.714553] #3: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
[ 67.714555] #4: 0.0% system, 0.0% softirq, 95.0% hardirq, 0.0% idle
[ 67.714556] #5: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
[ 67.714570] CPU#9 Detect HardIRQ Time exceeds 50% since 45s. Most time-consuming HardIRQs:
[ 67.714571] #1: 99.9% irq#7(IPI)
[ 67.714575] #2: 0.0% irq#10(arch_timer)
[ 67.714578] #3: 0.0% irq#2(IPI)
...
[ 67.714654] Call trace:
[ 67.714656] __do_softirq+0xa8/0x364
Signed-off-by: Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx>
---
include/linux/irq.h | 9 ++
include/linux/irqdesc.h | 2 +
kernel/irq/irqdesc.c | 9 +-
kernel/watchdog.c | 186 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 205 insertions(+), 1 deletion(-)
Just like on patch #1, I'd overall like to say that I'm very excited
about this patch and think it'll help track down a lot of issues, so
please continue moving forward with it!
I think there are two issues with using IRQ counts in determining interrupt storm:
That being said, my gut says that this can be a lot simpler / lower
overhead if we just use the existing IRQ counts that the framework is
already keeping track of for us. We could look for the IRQs that had
the most counts on the soft locked CPU over the period of time.
It might not be quite as good/accurate compared to what you have here
but it seems like it would be "good enough". Normally when there's an
IRQ storm that's so bad that it's causing a softlockup then each IRQ
isn't actually that slow but they're going off nonstop. It would catch
that case perfectly.
I guess maybe (?) the case it wouldn't catch so well would be if you
had a loop that looked like:
* CPU from "storming" device takes ~10 ms and finishes.
* After the "storming" device finishes, we somehow manage to service a
whole pile of normal interrupts from non-storming devices.
* After the non-storming interrupts finished, then we went back to
servicing the slow storming interrupt.
I haven't dug into the bowels of the Linux IRQ handling well enough to
know if that would be possible. However, even if it was possible it
doesn't feel likely to happen. If those "normal" interrupts aren't
storming then it seems unlikely we'd see more than one of each of them
between the "storming" interrupts. Thus it feels like the "storming"
interrupt would still have the highest (or tied for the highest) count
on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
the top 3) and you'd be pretty certain to catch even this case?
In any case, I'll wait before doing a more thorough review for now and
hope the above sounds right to you.
-Doug