Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()

From: Peter Hurley
Date: Mon Jul 13 2015 - 09:09:56 EST

Next message: Eric Auger: "Re: [v5 16/19] KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'"
Previous message: Johannes Weiner: "Re: [PATCH 5/5] mm, memcontrol: use vma_is_anonymous() to check for anon VMA"
In reply to: Will Deacon: "[RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()"
Next in thread: Will Deacon: "Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()"

On 07/13/2015 08:15 AM, Will Deacon wrote:
> smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> into a full memory barrier.
>
> However:
>
> - This ordering guarantee is already provided without the barrier on
> all architectures apart from PowerPC
>
> - The barrier only applies to UNLOCK + LOCK, not general
> RELEASE + ACQUIRE operations

I'm unclear on what you mean here. Do you mean:
A) a memory barrier is not required between RELEASE M + ACQUIRE N to keep
those two operations ordered on all arches (with the possible exception
of PowerPC), or
B) no one is using smp_mb__after_unlock_lock() in that way right now?
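
To make case (A) concrete, here is a minimal userspace sketch of the
RELEASE M + ACQUIRE N pattern. Everything in it is an assumption made for
illustration: toy_lock_t, toy_lock(), toy_unlock() and
toy_mb_after_unlock_lock() are hypothetical stand-ins built on C11 atomics,
not the kernel's spinlock code, and the seq_cst fence only approximates what
smp_mb() would do on an architecture that defines the barrier that way.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock; NOT the kernel's arch_spinlock_t. */
typedef struct {
	atomic_bool locked;
} toy_lock_t;

/* ACQUIRE: spin until the flag is taken, with acquire ordering. */
static void toy_lock(toy_lock_t *l)
{
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
		; /* spin */
}

/* RELEASE: clear the flag, with release ordering. */
static void toy_unlock(toy_lock_t *l)
{
	assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
	atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Stand-in for smp_mb__after_unlock_lock() where it expands to smp_mb(). */
static void toy_mb_after_unlock_lock(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Two distinct lock variables, as in the RELEASE M + ACQUIRE N case. */
static toy_lock_t M, N;
static atomic_int A, B;

/*
 * The caller holds M on entry and holds N on return.  With
 * want_full_barrier false, nothing in this sequence by itself orders
 * the store to A against the store to B for an observer that holds
 * neither lock; that gap is what the explicit barrier is meant to close.
 */
static void release_m_acquire_n(int a, int b, bool want_full_barrier)
{
	atomic_store_explicit(&A, a, memory_order_relaxed);	/* *A = a;   */
	toy_unlock(&M);						/* RELEASE M */
	toy_lock(&N);						/* ACQUIRE N */
	if (want_full_barrier)
		toy_mb_after_unlock_lock();
	atomic_store_explicit(&B, b, memory_order_relaxed);	/* *B = b;   */
}
```

The sketch only shows where the barrier sits relative to the two lock
operations; it does not demonstrate reordering, which would need a litmus
test run on weakly ordered hardware.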

Regards,
Peter Hurley

> - Locks are generally assumed to offer SC ordering semantics, so
> having this additional barrier is error-prone and complicates the
> callers of LOCK/UNLOCK primitives
>
> - The barrier is not well used outside of RCU and, because it was
> retrofitted into the kernel, it's not clear whether other areas of
> the kernel are incorrectly relying on UNLOCK + LOCK implying a full
> barrier
>
> This patch removes the barrier and instead requires architectures to
> provide full barrier semantics for an UNLOCK + LOCK sequence.
>
> Cc: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> Cc: Paul McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Will Deacon <will.deacon@xxxxxxx>
> ---
>
> This didn't go anywhere last time I posted it, but here it is again.
> I'd really appreciate some feedback from the PowerPC guys, especially as
> to whether this change requires them to add an additional barrier in
> arch_spin_unlock and what the cost of that would be.
>
> Documentation/memory-barriers.txt | 77 ++-----------------------------------
> arch/powerpc/include/asm/spinlock.h | 2 -
> include/linux/spinlock.h | 10 -----
> kernel/locking/mcs_spinlock.h | 9 -----
> kernel/rcu/tree.c | 21 +---------
> kernel/rcu/tree_plugin.h | 11 ------
> 6 files changed, 4 insertions(+), 126 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 13feb697271f..fff21b632893 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1848,74 +1848,9 @@ RELEASE are to the same lock variable, but only from the perspective of
> another CPU not holding that lock. In short, a ACQUIRE followed by an
> RELEASE may -not- be assumed to be a full memory barrier.
>
> -Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
> -imply a full memory barrier. If it is necessary for a RELEASE-ACQUIRE
> -pair to produce a full barrier, the ACQUIRE can be followed by an
> -smp_mb__after_unlock_lock() invocation. This will produce a full barrier
> -if either (a) the RELEASE and the ACQUIRE are executed by the same
> -CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable.
> -The smp_mb__after_unlock_lock() primitive is free on many architectures.
> -Without smp_mb__after_unlock_lock(), the CPU's execution of the critical
> -sections corresponding to the RELEASE and the ACQUIRE can cross, so that:
> -
> - *A = a;
> - RELEASE M
> - ACQUIRE N
> - *B = b;
> -
> -could occur as:
> -
> - ACQUIRE N, STORE *B, STORE *A, RELEASE M
> -
> -It might appear that this reordering could introduce a deadlock.
> -However, this cannot happen because if such a deadlock threatened,
> -the RELEASE would simply complete, thereby avoiding the deadlock.
> -
> - Why does this work?
> -
> - One key point is that we are only talking about the CPU doing
> - the reordering, not the compiler. If the compiler (or, for
> - that matter, the developer) switched the operations, deadlock
> - -could- occur.
> -
> - But suppose the CPU reordered the operations. In this case,
> - the unlock precedes the lock in the assembly code. The CPU
> - simply elected to try executing the later lock operation first.
> - If there is a deadlock, this lock operation will simply spin (or
> - try to sleep, but more on that later). The CPU will eventually
> - execute the unlock operation (which preceded the lock operation
> - in the assembly code), which will unravel the potential deadlock,
> - allowing the lock operation to succeed.
> -
> - But what if the lock is a sleeplock? In that case, the code will
> - try to enter the scheduler, where it will eventually encounter
> - a memory barrier, which will force the earlier unlock operation
> - to complete, again unraveling the deadlock. There might be
> - a sleep-unlock race, but the locking primitive needs to resolve
> - such races properly in any case.
> -
> -With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
> -For example, with the following code, the store to *A will always be
> -seen by other CPUs before the store to *B:
> -
> - *A = a;
> - RELEASE M
> - ACQUIRE N
> - smp_mb__after_unlock_lock();
> - *B = b;
> -
> -The operations will always occur in one of the following orders:
> -
> - STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
> - STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> - ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -
> -If the RELEASE and ACQUIRE were instead both operating on the same lock
> -variable, only the first of these alternatives can occur. In addition,
> -the more strongly ordered systems may rule out some of the above orders.
> -But in any case, as noted earlier, the smp_mb__after_unlock_lock()
> -ensures that the store to *A will always be seen as happening before
> -the store to *B.
> +However, the reverse case of a RELEASE followed by an ACQUIRE _does_
> +imply a full memory barrier when these accesses are performed as a pair
> +of UNLOCK and LOCK operations, irrespective of the lock variable.
>
> Locks and semaphores may not provide any guarantee of ordering on UP compiled
> systems, and so cannot be counted on in such a situation to actually achieve
> @@ -2158,7 +2093,6 @@ However, if the following occurs:
> RELEASE M [1]
> ACCESS_ONCE(*D) = d; ACCESS_ONCE(*E) = e;
> ACQUIRE M [2]
> - smp_mb__after_unlock_lock();
> ACCESS_ONCE(*F) = f;
> ACCESS_ONCE(*G) = g;
> RELEASE M [2]
> @@ -2176,11 +2110,6 @@ But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
> *F, *G or *H preceding ACQUIRE M [2]
> *A, *B, *C, *E, *F or *G following RELEASE M [2]
>
> -Note that the smp_mb__after_unlock_lock() is critically important
> -here: Without it CPU 3 might see some of the above orderings.
> -Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
> -to be seen in order unless CPU 3 holds lock M.
> -
>
> ACQUIRES VS I/O ACCESSES
> ------------------------
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index 4dbe072eecbe..523673d7583c 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -28,8 +28,6 @@
> #include <asm/synch.h>
> #include <asm/ppc-opcode.h>
>
> -#define smp_mb__after_unlock_lock() smp_mb() /* Full ordering for lock. */
> -
> #ifdef CONFIG_PPC64
> /* use 0x800000yy when locked, where yy == CPU number */
> #ifdef __BIG_ENDIAN__
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 0063b24b4f36..16c5ed5a627c 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -130,16 +130,6 @@ do { \
> #define smp_mb__before_spinlock() smp_wmb()
> #endif
>
> -/*
> - * Place this after a lock-acquisition primitive to guarantee that
> - * an UNLOCK+LOCK pair act as a full barrier. This guarantee applies
> - * if the UNLOCK and LOCK are executed by the same CPU or if the
> - * UNLOCK and LOCK operate on the same lock variable.
> - */
> -#ifndef smp_mb__after_unlock_lock
> -#define smp_mb__after_unlock_lock() do { } while (0)
> -#endif
> -
> /**
> * raw_spin_unlock_wait - wait until the spinlock gets unlocked
> * @lock: the spinlock in question.
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index fd91aaa4554c..2ea2fae2b477 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -43,15 +43,6 @@ do { \
> #endif
>
> /*
> - * Note: the smp_load_acquire/smp_store_release pair is not
> - * sufficient to form a full memory barrier across
> - * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
> - * For applications that need a full barrier across multiple cpus
> - * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
> - * used after mcs_lock.
> - */
> -
> -/*
> * In order to acquire the lock, the caller should declare a local node and
> * pass a reference of the node to this function in addition to the lock.
> * If the lock has already been acquired, then this will proceed to spin
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 65137bc28b2b..6689fc0808c8 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1513,10 +1513,8 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> * hold it, acquire the root rcu_node structure's lock in order to
> * start one (if needed).
> */
> - if (rnp != rnp_root) {
> + if (rnp != rnp_root)
> raw_spin_lock(&rnp_root->lock);
> - smp_mb__after_unlock_lock();
> - }
>
> /*
> * Get a new grace-period number. If there really is no grace
> @@ -1769,7 +1767,6 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
> local_irq_restore(flags);
> return;
> }
> - smp_mb__after_unlock_lock();
> needwake = __note_gp_changes(rsp, rnp, rdp);
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> if (needwake)
> @@ -1794,7 +1791,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>
> WRITE_ONCE(rsp->gp_activity, jiffies);
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock();
> if (!READ_ONCE(rsp->gp_flags)) {
> /* Spurious wakeup, tell caller to go back to sleep. */
> raw_spin_unlock_irq(&rnp->lock);
> @@ -1827,7 +1823,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
> rcu_for_each_leaf_node(rsp, rnp) {
> rcu_gp_slow(rsp, gp_preinit_delay);
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock();
> if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
> !rnp->wait_blkd_tasks) {
> /* Nothing to do on this leaf rcu_node structure. */
> @@ -1884,7 +1879,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
> rcu_for_each_node_breadth_first(rsp, rnp) {
> rcu_gp_slow(rsp, gp_init_delay);
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock();
> rdp = this_cpu_ptr(rsp->rda);
> rcu_preempt_check_blocked_tasks(rnp);
> rnp->qsmask = rnp->qsmaskinit;
> @@ -1935,7 +1929,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> /* Clear flag to prevent immediate re-entry. */
> if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock();
> WRITE_ONCE(rsp->gp_flags,
> READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
> raw_spin_unlock_irq(&rnp->lock);
> @@ -1956,7 +1949,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>
> WRITE_ONCE(rsp->gp_activity, jiffies);
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock();
> gp_duration = jiffies - rsp->gp_start;
> if (gp_duration > rsp->gp_max)
> rsp->gp_max = gp_duration;
> @@ -1982,7 +1974,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> */
> rcu_for_each_node_breadth_first(rsp, rnp) {
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock();
> WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
> WARN_ON_ONCE(rnp->qsmask);
> WRITE_ONCE(rnp->completed, rsp->gpnum);
> @@ -1998,7 +1989,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> }
> rnp = rcu_get_root(rsp);
> raw_spin_lock_irq(&rnp->lock);
> - smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
> rcu_nocb_gp_set(rnp, nocb);
>
> /* Declare grace period done. */
> @@ -2246,7 +2236,6 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
> rnp_c = rnp;
> rnp = rnp->parent;
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> oldmask = rnp_c->qsmask;
> }
>
> @@ -2294,7 +2283,6 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
> mask = rnp->grpmask;
> raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> raw_spin_lock(&rnp_p->lock); /* irqs already disabled. */
> - smp_mb__after_unlock_lock();
> rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
> }
>
> @@ -2317,7 +2305,6 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>
> rnp = rdp->mynode;
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> if ((rdp->passed_quiesce == 0 &&
> rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr)) ||
> rdp->gpnum != rnp->gpnum || rnp->completed == rnp->gpnum ||
> @@ -2544,7 +2531,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
> if (!rnp)
> break;
> raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> - smp_mb__after_unlock_lock(); /* GP memory ordering. */
> rnp->qsmaskinit &= ~mask;
> rnp->qsmask &= ~mask;
> if (rnp->qsmaskinit) {
> @@ -2573,7 +2559,6 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
> /* Remove outgoing CPU from mask in the leaf rcu_node structure. */
> mask = rdp->grpmask;
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock(); /* Enforce GP memory-order guarantee. */
> rnp->qsmaskinitnext &= ~mask;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> }
> @@ -2771,7 +2756,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
> cond_resched_rcu_qs();
> mask = 0;
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> if (rnp->qsmask == 0) {
> if (rcu_state_p == &rcu_sched_state ||
> rsp != rcu_state_p ||
> @@ -2843,7 +2827,6 @@ static void force_quiescent_state(struct rcu_state *rsp)
>
> /* Reached the root of the rcu_node tree, acquire lock. */
> raw_spin_lock_irqsave(&rnp_old->lock, flags);
> - smp_mb__after_unlock_lock();
> raw_spin_unlock(&rnp_old->fqslock);
> if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
> rsp->n_force_qs_lh++;
> @@ -2967,7 +2950,6 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
> struct rcu_node *rnp_root = rcu_get_root(rsp);
>
> raw_spin_lock(&rnp_root->lock);
> - smp_mb__after_unlock_lock();
> needwake = rcu_start_gp(rsp);
> raw_spin_unlock(&rnp_root->lock);
> if (needwake)
> @@ -3810,7 +3792,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
> rnp = rdp->mynode;
> mask = rdp->grpmask;
> raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> - smp_mb__after_unlock_lock();
> rnp->qsmaskinitnext |= mask;
> rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */
> rdp->completed = rnp->completed;
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 013485fb2b06..79793a7647cf 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -164,7 +164,6 @@ static void rcu_preempt_note_context_switch(void)
> rdp = this_cpu_ptr(rcu_state_p->rda);
> rnp = rdp->mynode;
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> t->rcu_read_unlock_special.b.blocked = true;
> t->rcu_blocked_node = rnp;
>
> @@ -324,7 +323,6 @@ void rcu_read_unlock_special(struct task_struct *t)
> for (;;) {
> rnp = t->rcu_blocked_node;
> raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> - smp_mb__after_unlock_lock();
> if (rnp == t->rcu_blocked_node)
> break;
> WARN_ON_ONCE(1);
> @@ -598,7 +596,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
> unsigned long mask;
>
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> for (;;) {
> if (!sync_rcu_preempt_exp_done(rnp)) {
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> @@ -616,7 +613,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
> raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
> rnp = rnp->parent;
> raw_spin_lock(&rnp->lock); /* irqs already disabled */
> - smp_mb__after_unlock_lock();
> rnp->expmask &= ~mask;
> }
> }
> @@ -638,7 +634,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
> struct rcu_node *rnp_up;
>
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> WARN_ON_ONCE(rnp->expmask);
> WARN_ON_ONCE(rnp->exp_tasks);
> if (!rcu_preempt_has_tasks(rnp)) {
> @@ -655,7 +650,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
> if (rnp_up->expmask & mask)
> break;
> raw_spin_lock(&rnp_up->lock); /* irqs already off */
> - smp_mb__after_unlock_lock();
> rnp_up->expmask |= mask;
> raw_spin_unlock(&rnp_up->lock); /* irqs still off */
> }
> @@ -679,7 +673,6 @@ sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp)
> unsigned long flags;
>
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> if (!rnp->expmask) {
> /* Phase 1 didn't do anything, so Phase 2 doesn't either. */
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> @@ -1007,7 +1000,6 @@ static int rcu_boost(struct rcu_node *rnp)
> return 0; /* Nothing left to boost. */
>
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
>
> /*
> * Recheck under the lock: all tasks in need of boosting
> @@ -1195,7 +1187,6 @@ static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> if (IS_ERR(t))
> return PTR_ERR(t);
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> rnp->boost_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> sp.sched_priority = kthread_prio;
> @@ -1586,7 +1577,6 @@ static void rcu_prepare_for_idle(void)
> continue;
> rnp = rdp->mynode;
> raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> - smp_mb__after_unlock_lock();
> needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> if (needwake)
> @@ -2114,7 +2104,6 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
> struct rcu_node *rnp = rdp->mynode;
>
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - smp_mb__after_unlock_lock();
> needwake = rcu_start_future_gp(rnp, rdp, &c);
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> if (needwake)
>

