Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup_check()

From: Petr Mladek
Date: Thu Aug 03 2023 - 04:18:06 EST


On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> Hi,
>
> On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > [...]
> > > Ah, I see what you mean. The one issue I have with your solution is
> > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > With your solution the locked up CPU will be interspersed with all the
> > > others and will be harder to find in the output (you've got to match
> > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > While that's probably not a huge deal, it's nicer to make the output
> > > easy to understand for someone trying to parse it...
> >
> > Is it worth to waste memory for this arguably nicer output? Identifying
> > the stack of the locked up CPU is trivial.
>
> I guess it's debatable, but as someone who has spent time staring at
> trawling through reports generated like this, I'd say "yes", it's
> super helpful in understanding the problem to have the hung CPU first.
> Putting the memory usage in perspective:

nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
What about changing the @exclude_self parameter to @exclude_cpu
and do:

if (exclude_cpu >= 0)
cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));


It would require changing also arch_trigger_cpumask_backtrace() to

void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
int exclude_cpu);

but it looks doable.


It might theoretically change the behavior because
nmi_trigger_cpumask_backtrace() does

this_cpu = get_cpu();

I guess that it was used to make sure that smp_processor_id()
the same. But it is too late. It should ignore the callers CPU.
So, this might actually fix a possible race.

Note that get_cpu() also disabled preemption. IMHO, it is not strictly needed.
But it might make sense to keep it because the backtraces are printed
when something goes wrong and it should print backtraces from the current state.

Best Regards,
Petr