Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup_check()

From: Michal Hocko
Date: Thu Aug 03 2023 - 04:34:07 EST

Next message: Neil Armstrong: "Re: [PATCH] drm: bridge: dw_hdmi: Add cec suspend/resume functions"
Previous message: Vlastimil Babka: "Re: [PATCH 04/24] PM: hibernate: move finding the resume device out of software_resume"
In reply to: Petr Mladek: "Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup_check()"
Next in thread: Doug Anderson: "Re: [PATCH] watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup_check()"

On Thu 03-08-23 10:12:12, Petr Mladek wrote:
> On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> > Hi,
> >
> > On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > > [...]
> > > > Ah, I see what you mean. The one issue I have with your solution is
> > > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > > With your solution the locked up CPU will be interspersed with all the
> > > > others and will be harder to find in the output (you've got to match
> > > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > > While that's probably not a huge deal, it's nicer to make the output
> > > > easy to understand for someone trying to parse it...
> > >
> > > > Is it worth wasting memory for this arguably nicer output? Identifying
> > > > the stack of the locked up CPU is trivial.
> >
> > > I guess it's debatable, but as someone who has spent time trawling
> > > through reports generated like this, I'd say "yes", it's super
> > > helpful in understanding the problem to have the hung CPU first.
> > Putting the memory usage in perspective:
>
> nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
> What about changing the @exclude_self parameter to @exclude_cpu
> and doing:
>
> 	if (exclude_cpu >= 0)
> 		cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));
>
>
> It would also require changing arch_trigger_cpumask_backtrace() to
>
> 	void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
> 					    int exclude_cpu);
>
> but it looks doable.

Yes, but sparc has its own implementation, so that would need changing
as well. Still, this looks reasonable to me.
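
To make the @exclude_cpu semantics concrete, here is a minimal userspace
sketch. It is not kernel code: a plain 64-bit word stands in for the
cpumask, and sketch_backtrace_targets() and SKETCH_NR_CPUS are names made
up for illustration. The only part taken from the proposal above is the
"exclude_cpu >= 0" check, with a negative value assumed to mean "exclude
nobody".

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: one bit per CPU in a single 64-bit word. */
#define SKETCH_NR_CPUS 64

/*
 * Userspace model only (hypothetical helper, not a kernel function).
 * Returns the set of CPUs that would be asked for a backtrace.
 * A negative exclude_cpu is assumed to mean "exclude nobody".
 */
static uint64_t sketch_backtrace_targets(uint64_t mask, int exclude_cpu)
{
	/* Work on a private copy so the caller's mask stays untouched. */
	uint64_t backtrace_mask = mask;

	assert(exclude_cpu < SKETCH_NR_CPUS);
	if (exclude_cpu >= 0)
		backtrace_mask &= ~(UINT64_C(1) << exclude_cpu);

	return backtrace_mask;
}
```

With something along these lines, the hardlockup path could presumably
print the locked up CPU's stack first and then pass that CPU as
exclude_cpu, without needing a cpumask of its own on the stack.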

--
Michal Hocko
SUSE Labs
