Re: [PATCHv12 4/4] watchdog/softlockup: report the most frequent interrupts
From: Doug Anderson
Date: Mon Apr 01 2024 - 12:42:11 EST
Hi,
On Mon, Mar 25, 2024 at 2:48 AM Bitao Hu <yaoma@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi, Thomas
>
> On 2024/3/24 04:43, Thomas Gleixner wrote:
> > On Wed, Mar 06 2024 at 20:52, Bitao Hu wrote:
> >> + if (__this_cpu_read(snapshot_taken)) {
> >> + for_each_active_irq(i) {
> >> + count = kstat_get_irq_since_snapshot(i);
> >> + tabulate_irq_count(irq_counts_sorted, i, count, NUM_HARDIRQ_REPORT);
> >> + }
> >> +
> >> + /*
> >> + * We do not want the "watchdog: " prefix on every line,
> >> + * hence we use "printk" instead of "pr_crit".
> >> + */
> >
> > You are not providing any justification why the prefix is not
> > wanted. Just saying 'We do not want' does not cut it and who is 'We'. I
> > certainly not.
> >
> > I really disagree because the prefixes are very useful for searching log
> > files. So not having it makes it harder to filter out for no reason.
> >
>
>
> Regarding the use of printk() instead of pr_crit(), I have had a
> discussion with Liu Song and Douglas in PATCHv1:
> https://lore.kernel.org/all/CAD=FV=WEEQeKX=ec3Gr-8CKs2K0MaWN3V0-0yOsuret0qcB_AA@xxxxxxxxxxxxxx/
>
> Please allow me to elaborate on my reasoning. The purpose of the
> report_cpu_status() function I implemented is similar to that of
> print_modules(), show_regs(), and dump_stack(). These functions are
> designed to assist in analyzing the causes of a soft lockup, rather
> than to report that a soft lockup has occurred. Therefore, I think
> that adding the "watchdog: " prefix to every line is redundant and
> not concise. Besides, the existing pr_emerg() in the watchdog.c file
> is already sufficient for searching useful information in the logs.
> The information I added, along with the call tree and other data, is
> located near the line with the "watchdog: " prefix.
>
> Are the two reasons I've provided reasonable?
FWIW I don't feel super strongly about this, but I'm leaning towards
agreeing with Bitao. The sample output from the commit message looks
like this:
[ 638.870231] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [swapper/9:0]
[ 638.870825] CPU#9 Utilization every 4s during lockup:
[ 638.871194] #1: 0% system, 0% softirq, 100% hardirq,
0% idle
[ 638.871652] #2: 0% system, 0% softirq, 100% hardirq,
0% idle
[ 638.872107] #3: 0% system, 0% softirq, 100% hardirq,
0% idle
[ 638.872563] #4: 0% system, 0% softirq, 100% hardirq,
0% idle
[ 638.873018] #5: 0% system, 0% softirq, 100% hardirq,
0% idle
[ 638.873494] CPU#9 Detect HardIRQ Time exceeds 50%. Most frequent HardIRQs:
[ 638.873994] #1: 330945 irq#7
[ 638.874236] #2: 31 irq#82
[ 638.874493] #3: 10 irq#10
[ 638.874744] #4: 2 irq#89
[ 638.874992] #5: 1 irq#102
..and in my mind the "watchdog: BUG: soft lockup - CPU#9 stuck for
26s! [swapper/9:0]" line is enough to grep through the dmesg. Having
all the following lines start with "watchdog:" feels like overkill to
me, but if you feel strongly that they should then it wouldn't bother
me too much for them all to have the "watchdog:" prefix.
Could you clarify how strongly you feel about this and whether Bitao
should spin a v13?
I believe that this is the only point of contention on the patch
series right now and otherwise it could be ready to land. I know in
the past Petr has wanted ample time to comment though perhaps the fact
that it's been ~1 month is enough. Petr: do you have anything that
needs saying before this patch series lands?
Thanks!
-Doug