Re: [PATCH,RFC] smp,csd: throw an error if a CSD lock is stuck for too long

From: Rik van Riel
Date: Wed Sep 13 2023 - 16:18:09 EST


On Wed, 2023-09-13 at 18:17 +0200, Peter Zijlstra wrote:
> On Wed, Sep 13, 2023 at 10:33:51AM -0400, Rik van Riel wrote:
> > >
> > It's more fun than that. We're seeing this on bare metal.
>
> Oh, 'fun' indeed, *groan*.
>
> > Unfortunately, when a system gets wedged that way currently,
> > it ends up being power cycled automatically, and we aren't
> > getting crash dumps with clues on what causes the issue.
> >
> > Doing a BUG_ON() + panic, followed by a kexec into the kdump
> > kernel will hopefully give us some clues on what might be
> > causing the issue.
>
> I'm conflicted on the need to push such a debug patch upstream; otoh,
> given the amount of debug code already in csd, why not.
>
> But yeah, curious to hear what comes out of this.
>
Oh, there's more to it than just debugging the issue.

This will also help recover systems faster, since they
will end up panicking, kdumping, and rebooting faster
than the "hey, that system looks like it's stuck"
power cycling scripts can get to them.

--
All Rights Reversed.