Re: Crash with PREEMPT_RT on aarch64 machine

From: Mel Gorman
Date: Wed Nov 30 2022 - 15:23:43 EST


On Mon, Nov 28, 2022 at 04:58:30PM +0100, Sebastian Andrzej Siewior wrote:
> How about this?
>
> - The fast path is easy???
>
> - The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
> I made that one _acquire so that it is visible by the unlocker forcing
> everyone into slow path.
>
> - If the lock is acquired, then the owner is written via
> rt_mutex_set_owner(). This happens under wait_lock so it is
> serialized and so a WRITE_ONCE() is used to write the final owner. I
> replaced it with a cmpxchg_acquire() to have the owner there.
> Not sure if I shouldn't make this as you put it:
> | e.g. by making use of dependency ordering where it already exists.
> The other (locking) CPU needs to see the owner not only the WAITER
> bit. I'm not sure if this could be replaced with smp_store_release().
>
> - After the whole operation completes, fixup_rt_mutex_waiters() cleans
> the WAITER bit and I kept the _acquire semantic here. Now looking at
> it again, I don't think that needs to be done since that shouldn't be
> set here.
>
> - There is rtmutex_spin_on_owner() which (as the name implies) spins on
> the owner as long as it active. It obtains it via READ_ONCE() and I'm
> not sure if any memory barrier is needed. Worst case is that it will
> spin while owner isn't set if it observers a stale value.
>
> - The unlock path first clears the waiter bit if there are no waiters
> recorded (via simple assignments under the wait_lock (every locker
> will fail with the cmpxchg_acquire() and go for the wait_lock)) and
> then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
> Should there be a wait, it will just store the WAITER bit with
> smp_store_release() (setting the owner is NULL but the WAITER bit
> forces everyone into the slow path).
>
> - Added rt_mutex_set_owner_pi() which does simple assignment. This is
> used from the futex code and here everything happens under a lock.
>
> - I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
> *think* want to observe a real waiter and not something stale.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>

include/linux/rtmutex.h needs to also include asm/barrier.h to resolve
some build problems. Once that was resolved, 10 iterations of the dbench
work completed successfully and without the patch, 1 iteration could not
complete.

Review is trickier as I'm not spent any reasonable amount of time on locking
primitives. I'd have to defer to Peter in that regard but I skimmed it at
least before wrapping up for the evening.

> ---
> include/linux/rtmutex.h | 2 +-
> kernel/locking/rtmutex.c | 26 ++++++++++++++++++--------
> kernel/locking/rtmutex_api.c | 4 ++--
> 3 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
> index 7d049883a08ac..4447e01f723d4 100644
> --- a/include/linux/rtmutex.h
> +++ b/include/linux/rtmutex.h
> @@ -41,7 +41,7 @@ struct rt_mutex_base {
> */
> static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
> {
> - return READ_ONCE(lock->owner) != NULL;
> + return smp_load_acquire(&lock->owner) != NULL;
> }
>

I don't believe this is necessary. It's only needed when checking if a
lock is acquired or not and it's inherently race-prone. It's harmless if
a stale value is observed and it does not pair with a release. Mostly it's
useful for debugging checks.

> extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 7779ee8abc2a0..e3cc673e0c988 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -97,7 +97,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
> if (rt_mutex_has_waiters(lock))
> val |= RT_MUTEX_HAS_WAITERS;
>
> - WRITE_ONCE(lock->owner, (struct task_struct *)val);
> + WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
> }
>
> static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
> @@ -106,6 +106,17 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
> ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
> }
>
> +static __always_inline void
> +rt_mutex_set_owner_pi(struct rt_mutex_base *lock, struct task_struct *owner)
> +{

What does pi mean in this context? I think the naming here might
misleading. rt_mutex_set_owner_pi is used when initialising and when
clearing the owner. rt_mutex_set_owner is set when acquiring the lock.

Consider renaming rt_mutex_set_owner_pi to rt_mutex_clear_owner. The init
could still use rt_mutex_set_owner as an extra barrier is not a big deal
during init if the straight assignment was unpopular. The init could also
do a plain assignment because it cannot have any waiters yet.

What is less obvious is if rt_mutex_clear_owner should have explicit release
semantics to pair with rt_mutex_set_owner. It looks like it might not
matter because at least some paths end up having release semantics anyway
due to a spinlock but I didn't check all cases and it's potentially fragile.

> + unsigned long val = (unsigned long)owner;
> +
> + if (rt_mutex_has_waiters(lock))
> + val |= RT_MUTEX_HAS_WAITERS;
> +
> + WRITE_ONCE(lock->owner, val);
> +}
> +
> static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
> {
> unsigned long owner, *p = (unsigned long *) &lock->owner;
> @@ -173,7 +184,7 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
> */
> owner = READ_ONCE(*p);
> if (owner & RT_MUTEX_HAS_WAITERS)
> - WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
> + cmpxchg_acquire(p, owner, owner & ~RT_MUTEX_HAS_WAITERS);
> }
>
> /*
> @@ -196,17 +207,16 @@ static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
> }
>
> /*
> - * Callers must hold the ->wait_lock -- which is the whole purpose as we force
> - * all future threads that attempt to [Rmw] the lock to the slowpath. As such
> - * relaxed semantics suffice.
> + * Callers hold the ->wait_lock. This needs to be visible to the lockowner,
> + * forcing him into the slow path for the unlock operation.
> */
> static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
> {
> unsigned long owner, *p = (unsigned long *) &lock->owner;
>
> do {
> - owner = *p;
> - } while (cmpxchg_relaxed(p, owner,
> + owner = READ_ONCE(*p);
> + } while (cmpxchg_acquire(p, owner,
> owner | RT_MUTEX_HAS_WAITERS) != owner);
> }
>

Not 100% sure although I see it's to cover an exit path from
try_to_take_rt_mutex. I'm undecided if rt_mutex_set_owner having acquire
semantics and rt_mutex_clear_owner having clear semantics would be
sufficient. try_to_take_rt_mutex can still return with release semantics
but only in the case where it fails to acquire the lock.

> @@ -1218,7 +1228,7 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
> * slow path making sure no task of lower priority than
> * the top waiter can steal this lock.
> */
> - lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
> + smp_store_release(&lock->owner, (void *) RT_MUTEX_HAS_WAITERS);
>
> /*
> * We deboosted before waking the top waiter task such that we don't

This is within a locked section and would definitely see a barrier if
rt_mutex_wake_q_add_task calls wake_q_add but it's less clear if the
optimisation in rt_mutex_wake_q_add_task could race so I'm undecided if
it's necessary or not.

> diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
> index 900220941caac..9acc176f93d38 100644
> --- a/kernel/locking/rtmutex_api.c
> +++ b/kernel/locking/rtmutex_api.c
> @@ -249,7 +249,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
> * recursion. Give the futex/rtmutex wait_lock a separate key.
> */
> lockdep_set_class(&lock->wait_lock, &pi_futex_key);
> - rt_mutex_set_owner(lock, proxy_owner);
> + rt_mutex_set_owner_pi(lock, proxy_owner);
> }
>
> /**
> @@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
> void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
> {
> debug_rt_mutex_proxy_unlock(lock);
> - rt_mutex_set_owner(lock, NULL);
> + rt_mutex_set_owner_pi(lock, NULL);
> }
>
> /**
> --
> 2.38.1
>

--
Mel Gorman
SUSE Labs