Re: NMI Reported with console_blast.sh

From: Petr Mladek
Date: Thu Feb 29 2024 - 06:19:55 EST


On Thu 2024-02-29 12:15:30, Petr Mladek wrote:
> On Thu 2024-02-22 00:21:19, John B. Wyatt IV wrote:
> > On Mon, Feb 12, 2024 at 04:23:04PM -0500, John B. Wyatt IV wrote:
> > >
> > > Red Hat's conservers are having an issue with the machine I was testing
> > > on. It may take me a while to get back to you with more test results.
> > >
> >
> > Found a work-around with conserver. I was able to follow up on the printk-caller
> > info you requested.
> >
> > I found 2 additional NMIs for a total of 3. Number 2 is very
> > large; please feel free to let me know what specific information you
> > wanted if it was unnecessary.

[...]

> > Compared to the two NMIs with throughput-performance (no preemption)
> >
> > <NMI>
> > cpus=0
> > .runnable_avg : 3072
> > kthread (kernel/kthread.c:388)
> > .util_est_enqueued : 0
> > stack:0 pid:1733 tgid:1733 ppid:2 flags:0x00004000
> > .min_vruntime : 2084315.290254
> > .removed.load_avg : 0
> > .avg_vruntime : 2084315.290254
> > console_blast.s 3497 34770.405603 N 34773.405603 3.000000 34764.898340 4 120
> > .util_avg : 1024
> > .util_avg : 1024
>
> It looks like messages from many CPUs are mixed. I guess that they
> are printed by print_cfs_rq(). But the order looks random.
>
> Also I wonder why it is printed from NMI context. Maybe it is from
> some perf event, similar to the hardlockup detector?

I have realized that we most likely see only a small part of the mixed
output. I wonder if it is because it is printed from the emergency
context. In that context, the messages are flushed only when leaving
the context, and many might be lost.
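
If it helps to untangle the mixed output, the lines could be grouped by
the printk caller id. Below is a minimal sketch, assuming the log was
captured with CONFIG_PRINTK_CALLER enabled so that each line starts
with "[timestamp][caller]", where the caller is T<pid> for task context
or C<cpu> otherwise. The function name group_by_caller() and the sample
lines are only illustrative, not taken from the real log.

```python
import re
from collections import defaultdict

# Assumed prefix format with CONFIG_PRINTK_CALLER enabled:
#   "[   34.770405][    C0] message"   (C<cpu>: not in task context)
#   "[   34.770406][ T1733] message"   (T<pid>: task context)
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\[\s*([TC]\d+)\]\s?(.*)$")


def group_by_caller(lines):
    """Group log lines by caller id, keeping the order seen per caller."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            # Lines without the expected prefix are kept separately.
            groups["unknown"].append(line.rstrip("\n"))
            continue
        _timestamp, caller, message = m.groups()
        groups[caller].append(message)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample, made up for illustration only.
    sample = [
        "[   34.770405][    C0] .runnable_avg : 3072",
        "[   34.770406][ T1733] kthread (kernel/kthread.c:388)",
        "[   34.770407][    C0] .util_avg : 1024",
    ]
    for caller, messages in group_by_caller(sample).items():
        print(caller)
        for message in messages:
            print("    " + message)
```

This only reorders what reached the console. It cannot recover messages
that were dropped before the flush.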

Best Regards,
Petr