Re: commit 894d1b3db41c leads to frequent hangs when booting

From: John Stultz
Date: Thu Dec 12 2024 - 02:40:44 EST


On Wed, Dec 11, 2024 at 6:41 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
>
> On Wed, Dec 11, 2024 at 4:17 PM Bert Karwatzki <spasswolf@xxxxxx> wrote:
> > Am Mittwoch, dem 11.12.2024 um 15:14 -0800 schrieb John Stultz:
> > > On Wed, Dec 11, 2024 at 2:46 PM Bert Karwatzki <spasswolf@xxxxxx> wrote:
> > > >
> > > > Am Mittwoch, dem 11.12.2024 um 22:35 +0100 schrieb Bert Karwatzki:
> > > > > I have confirmed that I that linux-next-20241210 is fixed by the same revert
> > > > > as v6.13-rc2 (ten boots without incident is the criterion for a good commit)
> > > > >
> > > > >
> > > > > Bert Karwatzki
> > > >
> > > > Also this bug only occurs with CONFIG_PREEMPT_RT=y, I've just checked v6.13-rc2
> > > > without the revert and the following preempt settings and got 10 clean boots:
> > > >
> > >
> > > Hrm. That matches the case where the fix here helped:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=82f9cc094975240885c93effbca7f4603f5de1bf
> > >
> > >
> > > I'm still working on getting a minimal delta from your config booting
> > > in my environment, but in the meantime, I'd be curious if the
> > > following reduced revert helps?
> > > https://github.com/johnstultz-work/linux-dev/commit/60c60f85670fb1f4708adbe55e15ab918d96f9f0
> > >
> > > Basically just trying to clarify if the problem is moving the wakeup
> > > to the wake_q or if some other interaction (maybe with the
> > > preempt_disables) could be causing the issue.
> > > (I'm assuming you have 82f9cc094975 applied when you test with that change)
> > >
> >
> > I tested linux-next-20241210 (which includes 82f9cc094975) with your patch
> >
> > diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> > index 37f025a096c9..724d4c871cf6 100644
> > --- a/kernel/locking/ww_mutex.h
> > +++ b/kernel/locking/ww_mutex.h
> > @@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER
> > *waiter,
> > #ifndef WW_RT
> > debug_mutex_wake_waiter(lock, waiter);
> > #endif
> > - wake_q_add(wake_q, waiter->task);
> > + wake_up_process(waiter->task);
> > }
> >
> > return true;
> > @@ -332,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> > * wakeup pending to re-read the wounded state.
> > */
> > if (owner != current)
> > - wake_q_add(wake_q, owner);
> > + wake_up_process(owner);
> >
> > return true;
> > }
> >
> > and this fixes the issue for me.
>
> Ok, thanks for validating that! Hrm. Ok, I suspect I'm missing a path
> in rtmutex.c that calls into __schedule() before we pop up to the
> point where we actually wake the wake_q.
>
> I may have found one or two spots that look likely, and I believe I've
> also managed to reproduce your problem using the test-ww_mutex driver.
>
> So I'm working up a proper patch and will share for testing here soon.

Ok, could you try the patch here:
https://github.com/johnstultz-work/linux-dev/commit/3c902b92c88122cd034937e8d40930bac254a7c5

I've repushed to the test branch I shared earlier, so its against
6.13-rc2 + fix that has landed upstream + test fix above ( + another
patch that makes stressing the ww_mutexes easier):
https://github.com/johnstultz-work/linux-dev/commits/debug/894d1b3db41c-hang-bert/

Let me know if you still have issues with this.
thanks
-john