Hi, I now understand your suggestion. Using interrupt time as a
I guess maybe (?) the case it wouldn't catch so well would be if you
had a loop that looked like:
* CPU from "storming" device takes ~10 ms and finishes.
* After the "storming" device finishes, we somehow manage to service a
whole pile of normal interrupts from non-storming devices.
* After the non-storming interrupts finished, then we went back to
servicing the slow storming interrupt.
I haven't dug into the bowels of the Linux IRQ handling well enough to
know if that would be possible. However, even if it was possible it
doesn't feel likely to happen. If those "normal" interrupts aren't
storming then it seems unlikely we'd see more than one of each of them
between the "storming" interrupts. Thus it feels like the "storming"
interrupt would still have the highest (or tied for the highest) count
on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
the top 3) and you'd be pretty certain to catch even this case?
In any case, I'll wait before doing a more thorough review for now and
hope the above sounds right to you.
-Doug