Re: commit 894d1b3db41c leads to frequent hangs when booting

From: John Stultz
Date: Wed Dec 11 2024 - 21:41:22 EST


On Wed, Dec 11, 2024 at 4:17 PM Bert Karwatzki <spasswolf@xxxxxx> wrote:
> Am Mittwoch, dem 11.12.2024 um 15:14 -0800 schrieb John Stultz:
> > On Wed, Dec 11, 2024 at 2:46 PM Bert Karwatzki <spasswolf@xxxxxx> wrote:
> > >
> > > Am Mittwoch, dem 11.12.2024 um 22:35 +0100 schrieb Bert Karwatzki:
> > > > I have confirmed that I that linux-next-20241210 is fixed by the same revert
> > > > as v6.13-rc2 (ten boots without incident is the criterion for a good commit)
> > > >
> > > >
> > > > Bert Karwatzki
> > >
> > > Also this bug only occurs with CONFIG_PREEMPT_RT=y, I've just checked v6.13-rc2
> > > without the revert and the following preempt settings and got 10 clean boots:
> > >
> >
> > Hrm. That matches the case where the fix here helped:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=82f9cc094975240885c93effbca7f4603f5de1bf
> >
> >
> > I'm still working on getting a minimal delta from your config booting
> > in my environment, but in the meantime, I'd be curious if the
> > following reduced revert helps?
> > https://github.com/johnstultz-work/linux-dev/commit/60c60f85670fb1f4708adbe55e15ab918d96f9f0
> >
> > Basically just trying to clarify if the problem is moving the wakeup
> > to the wake_q or if some other interaction (maybe with the
> > preempt_disables) could be causing the issue.
> > (I'm assuming you have 82f9cc094975 applied when you test with that change)
> >
>
> I tested linux-next-20241210 (which includes 82f9cc094975) with your patch
>
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 37f025a096c9..724d4c871cf6 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER
> *waiter,
> #ifndef WW_RT
> debug_mutex_wake_waiter(lock, waiter);
> #endif
> - wake_q_add(wake_q, waiter->task);
> + wake_up_process(waiter->task);
> }
>
> return true;
> @@ -332,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> * wakeup pending to re-read the wounded state.
> */
> if (owner != current)
> - wake_q_add(wake_q, owner);
> + wake_up_process(owner);
>
> return true;
> }
>
> and this fixes the issue for me.

Ok, thanks for validating that! Hrm. Ok, I suspect I'm missing a path
in rtmutex.c that calls into __schedule() before we pop up to the
point where we actually wake the wake_q.

I may have found one or two spots that look likely, and I believe I've
also managed to reproduce your problem using the test-ww_mutex driver.

So I'm working up a proper patch and will share for testing here soon.

thanks
-john