Re: commit 894d1b3db41c leads to frequent hangs when booting
From: Bert Karwatzki
Date: Thu Dec 12 2024 - 05:32:13 EST
Am Mittwoch, dem 11.12.2024 um 23:40 -0800 schrieb John Stultz:
> On Wed, Dec 11, 2024 at 6:41 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
> >
> > On Wed, Dec 11, 2024 at 4:17 PM Bert Karwatzki <spasswolf@xxxxxx> wrote:
> > > Am Mittwoch, dem 11.12.2024 um 15:14 -0800 schrieb John Stultz:
> > > > On Wed, Dec 11, 2024 at 2:46 PM Bert Karwatzki <spasswolf@xxxxxx> wrote:
> > > > >
> > > > > Am Mittwoch, dem 11.12.2024 um 22:35 +0100 schrieb Bert Karwatzki:
> > > > > > I have confirmed that I that linux-next-20241210 is fixed by the same revert
> > > > > > as v6.13-rc2 (ten boots without incident is the criterion for a good commit)
> > > > > >
> > > > > >
> > > > > > Bert Karwatzki
> > > > >
> > > > > Also this bug only occurs with CONFIG_PREEMPT_RT=y, I've just checked v6.13-rc2
> > > > > without the revert and the following preempt settings and got 10 clean boots:
> > > > >
> > > >
> > > > Hrm. That matches the case where the fix here helped:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=82f9cc094975240885c93effbca7f4603f5de1bf
> > > >
> > > >
> > > > I'm still working on getting a minimal delta from your config booting
> > > > in my environment, but in the meantime, I'd be curious if the
> > > > following reduced revert helps?
> > > > https://github.com/johnstultz-work/linux-dev/commit/60c60f85670fb1f4708adbe55e15ab918d96f9f0
> > > >
> > > > Basically just trying to clarify if the problem is moving the wakeup
> > > > to the wake_q or if some other interaction (maybe with the
> > > > preempt_disables) could be causing the issue.
> > > > (I'm assuming you have 82f9cc094975 applied when you test with that change)
> > > >
> > >
> > > I tested linux-next-20241210 (which includes 82f9cc094975) with your patch
> > >
> > > diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> > > index 37f025a096c9..724d4c871cf6 100644
> > > --- a/kernel/locking/ww_mutex.h
> > > +++ b/kernel/locking/ww_mutex.h
> > > @@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER
> > > *waiter,
> > > #ifndef WW_RT
> > > debug_mutex_wake_waiter(lock, waiter);
> > > #endif
> > > - wake_q_add(wake_q, waiter->task);
> > > + wake_up_process(waiter->task);
> > > }
> > >
> > > return true;
> > > @@ -332,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> > > * wakeup pending to re-read the wounded state.
> > > */
> > > if (owner != current)
> > > - wake_q_add(wake_q, owner);
> > > + wake_up_process(owner);
> > >
> > > return true;
> > > }
> > >
> > > and this fixes the issue for me.
> >
> > Ok, thanks for validating that! Hrm. Ok, I suspect I'm missing a path
> > in rtmutex.c that calls into __schedule() before we pop up to the
> > point where we actually wake the wake_q.
> >
> > I may have found one or two spots that look likely, and I believe I've
> > also managed to reproduce your problem using the test-ww_mutex driver.
> >
> > So I'm working up a proper patch and will share for testing here soon.
>
> Ok, could you try the patch here:
> https://github.com/johnstultz-work/linux-dev/commit/3c902b92c88122cd034937e8d40930bac254a7c5
>
> I've repushed to the test branch I shared earlier, so its against
> 6.13-rc2 + fix that has landed upstream + test fix above ( + another
> patch that makes stressing the ww_mutexes easier):
> https://github.com/johnstultz-work/linux-dev/commits/debug/894d1b3db41c-hang-bert/
>
> Let me know if you still have issues with this.
> thanks
> -john
I cherry-picked your two patches on top of linux-next-20241210:
2029cb31eae9 (HEAD -> master) MAYBEFIX: Make sure we wake anything on the wake_q
when we release the lock->wait_lock
b525e4779098 test-ww_mutex: Allow test to be run (and re-run) from userland
1b2ab8149928 (tag: next-20241210) Add linux-next specific files for 20241210
This fixes the issue for me.
Bert Karwatzki