Re: Question on smp_mb__before_spinlock
From: Paul E. McKenney
Date: Mon Sep 05 2016 - 06:37:27 EST
On Mon, Sep 05, 2016 at 11:37:53AM +0200, Peter Zijlstra wrote:
> Hi all,
>
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
>
>
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
>
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
>
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?
>
> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrades the section to RCsc.
>
> That would make all schedule() calls fully transitive against one
> another.
>
>
> That is, would something like the below make sense?
Looks to me like you have reinvented smp_mb__after_unlock_lock()...
Thanx, Paul
> (does not deal with mm_types.h and overlayfs using
> smp_mb__before_spnlock).
>
> ---
> arch/arm64/include/asm/barrier.h | 2 ++
> arch/powerpc/include/asm/barrier.h | 2 +-
> include/linux/spinlock.h | 41 +++++++++++++++++++++++++++++---------
> kernel/sched/core.c | 5 +++--
> 4 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 4eea7f618dce..d5cc8b58f942 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -104,6 +104,8 @@ do { \
> VAL; \
> })
>
> +#define smp_mb__after_spinlock() smp_mb()
> +
> #include <asm-generic/barrier.h>
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index c0deafc212b8..23d64d7196b7 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -74,7 +74,7 @@ do { \
> ___p1; \
> })
>
> -#define smp_mb__before_spinlock() smp_mb()
> +#define smp_mb__after_spinlock() smp_mb()
>
> #include <asm-generic/barrier.h>
>
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 47dd0cebd204..284616dad607 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -118,16 +118,39 @@ do { \
> #endif
>
> /*
> - * Despite its name it doesn't necessarily has to be a full barrier.
> - * It should only guarantee that a STORE before the critical section
> - * can not be reordered with LOADs and STOREs inside this section.
> - * spin_lock() is the one-way barrier, this LOAD can not escape out
> - * of the region. So the default implementation simply ensures that
> - * a STORE can not move into the critical section, smp_wmb() should
> - * serialize it with another STORE done by spin_lock().
> + * This barrier must provide two things:
> + *
> + * - it must guarantee a STORE before the spin_lock() is ordered against a
> + * LOAD after it, see the comments at its two usage sites.
> + *
> + * - it must ensure the critical section is RCsc.
> + *
> + * The latter is important for cases where we observe values written by other
> + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> + *
> + * CPU0 CPU1 CPU2
> + *
> + * for (;;) {
> + * if (READ_ONCE(X))
> + * break;
> + * }
> + * X=1
> + * <sched-out>
> + * <sched-in>
> + * r = X;
> + *
> + * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
> + * we get migrated and CPU2 sees X==0.
> + *
> + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> + * architectures imlpy an smp_mb() for each atomic instruction and equally don't
> + * need more.
> + *
> + * Architectures that can implement ACQUIRE better need to take care.
> */
> -#ifndef smp_mb__before_spinlock
> -#define smp_mb__before_spinlock() smp_wmb()
> +#ifndef smp_mb__after_spinlock
> +#define smp_mb__after_spinlock() do { } while (0)
> #endif
>
> /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 556cb07ab1cf..b151a33d393b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2006,8 +2006,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> * reordered with p->state check below. This pairs with mb() in
> * set_current_state() the waiting thread does.
> */
> - smp_mb__before_spinlock();
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> + smp_mb__after_spinlock();
> if (!(p->state & state))
> goto out;
>
> @@ -3332,8 +3332,9 @@ static void __sched notrace __schedule(bool preempt)
> * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> * done by the caller to avoid the race with signal_wake_up().
> */
> - smp_mb__before_spinlock();
> raw_spin_lock(&rq->lock);
> + smp_mb__after_spinlock();
> +
> cookie = lockdep_pin_lock(&rq->lock);
>
> rq->clock_skip_update <<= 1; /* promote REQ to ACT */
>