Re: Question on smp_mb__before_spinlock

From: Nicholas Piggin
Date: Wed Sep 07 2016 - 08:17:56 EST


On Mon, 5 Sep 2016 11:37:53 +0200
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> Hi all,
>
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
>
>
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
>
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
>
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?
>
> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrades the section to RCsc.
>
> That would make all schedule() calls fully transitive against one
> another.
>
>
> That is, would something like the below make sense?
>
> (does not deal with mm_types.h and overlayfs using
> smp_mb__before_spnlock).
>
> ---
> arch/arm64/include/asm/barrier.h | 2 ++
> arch/powerpc/include/asm/barrier.h | 2 +-
> include/linux/spinlock.h | 41 +++++++++++++++++++++++++++++---------
> kernel/sched/core.c | 5 +++--
> 4 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 4eea7f618dce..d5cc8b58f942 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -104,6 +104,8 @@ do { \
> VAL; \
> })
>
> +#define smp_mb__after_spinlock() smp_mb()
> +
> #include <asm-generic/barrier.h>
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index c0deafc212b8..23d64d7196b7 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -74,7 +74,7 @@ do { \
> ___p1; \
> })
>
> -#define smp_mb__before_spinlock() smp_mb()
> +#define smp_mb__after_spinlock() smp_mb()
>
> #include <asm-generic/barrier.h>
>
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 47dd0cebd204..284616dad607 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -118,16 +118,39 @@ do { \
> #endif
>
> /*
> - * Despite its name it doesn't necessarily has to be a full barrier.
> - * It should only guarantee that a STORE before the critical section
> - * can not be reordered with LOADs and STOREs inside this section.
> - * spin_lock() is the one-way barrier, this LOAD can not escape out
> - * of the region. So the default implementation simply ensures that
> - * a STORE can not move into the critical section, smp_wmb() should
> - * serialize it with another STORE done by spin_lock().
> + * This barrier must provide two things:
> + *
> + * - it must guarantee a STORE before the spin_lock() is ordered against a
> + * LOAD after it, see the comments at its two usage sites.
> + *
> + * - it must ensure the critical section is RCsc.
> + *
> + * The latter is important for cases where we observe values written by other
> + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> + *
> + * CPU0 CPU1 CPU2
> + *
> + * for (;;) {
> + * if (READ_ONCE(X))
> + * break;
> + * }
> + * X=1
> + * <sched-out>
> + * <sched-in>
> + * r = X;
> + *
> + * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
> + * we get migrated and CPU2 sees X==0.
> + *
> + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> + * architectures imlpy an smp_mb() for each atomic instruction and equally don't
> + * need more.
> + *
> + * Architectures that can implement ACQUIRE better need to take care.
> */
> -#ifndef smp_mb__before_spinlock
> -#define smp_mb__before_spinlock() smp_wmb()
> +#ifndef smp_mb__after_spinlock
> +#define smp_mb__after_spinlock() do { } while (0)
> #endif

It seems okay, but why not make it a special sched-only function name
to prevent it being used in generic code?

I would not mind seeing responsibility for the switch barrier moved to
generic context switch code like this (alternative for powerpc reducing
number of hwsync instructions was to add documentation and warnings about
the barriers in arch dependent and independent code). And pairing it with
a spinlock is reasonable.

It may not strictly be an "smp_" style of barrier if MMIO accesses are to
be ordered here too, despite critical section may only be providing
acquire/release for cacheable memory, so maybe it's slightly more
complicated than just cacheable RCsc?

This would end up flushing the store queue while holding the spinlock on
POWER, as opposed to before acquiring the lock. I doubt that's ever going
to be noticable, but if you already add a special new primitive here, then
an arch wrapper for raw_spin_lock() can let archs do their own thing here.

Thanks,
Nick