Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath

From: Oleg Nesterov
Date: Wed May 14 2014 - 10:26:43 EST


On 05/13, Paul E. McKenney wrote:
>
> On Tue, May 13, 2014 at 08:57:42PM +0200, Oleg Nesterov wrote:
> > On 05/13, Paul E. McKenney wrote:
> > >
> > > On Tue, May 13, 2014 at 05:44:35PM +0200, Peter Zijlstra wrote:
> > > >
> > > > Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
> > > > name: smp_mb__before_spinlock() should of course imply a full barrier.
> > >
> > > How about if I queue a name change to smp_wmb__before_spinlock()?
> >
> > I agree, this is more accurate, simply because it describes what it
> > actually does.
> >
> > But just in case, as for try_to_wake_up() it does not actually need
> > wmb() between "CONDITION = T" and "task->state = RUNNING". It would
> > be fine if these 2 STORE's are re-ordered, we can rely on rq->lock.
> >
> > What it actually needs is a barrier between "CONDITION = T" and
> > "task->state & state" check. But since we do not have a store-load
> > barrier, wmb() was added to ensure that "CONDITION = T" can't leak
> > into the critical section.
> >
> > But it seems that set_tlb_flush_pending() already assumes that it
> > acts as wmb(), so probably smp_wmb__before_spinlock() is fine.
>
> Except that when I go to make the change, I find the following in
> the documentation:
>
> Memory operations issued before the ACQUIRE may be completed after
> the ACQUIRE operation has completed. An smp_mb__before_spinlock(),
> combined with a following ACQUIRE, orders prior loads against
> subsequent loads and stores and also orders prior stores against
> subsequent stores. Note that this is weaker than smp_mb()! The
> smp_mb__before_spinlock() primitive is free on many architectures.
>
> Which means that either the documentation is wrong or the implementation
> is. Yes, smp_wmb() has the semantics called out above on many platforms,
> but not on Alpha or ARM.

Well, I think the documentation is wrong in any case. "prior loads
against subsequent loads" is not true. And it doesn't document that
the initial goal was "prior stores against the subsequent loads".
"prior stores against the subsequent stores" is obviously true for
the default implementation, but this is the "side effect" because
it uses wmb().


The only intent of wmb() added by 04e2f174 "Add memory barrier semantics
to wake_up() & co" (afaics at least) was: make sure that ttwu() does not
read p->state before the preceding stores are completed.

e0acd0a68e "sched: fix the theoretical signal_wake_up() vs schedule()
race" added the new helper for documentation, to explain that the
default implementation abuses wmb() to achieve the serialization above.

> So, as you say, set_tlb_flush_pending() only relies on smp_wmb().

The comment says ;) and this means that even if we suddenly have a new
load_store() barrier (which could work for ttwu/schedule) we can no
longer change smp_mb__before_spinlock() to use it.

> The comment in try_to_wake_up() seems to be assuming a full memory
> barrier. The comment in __schedule() also seems to be relying on
> a full memory barrier (prior write against subsequent read). Yow!

Well yes, but see above. Again, we need load_store() before reading
p->state, which we do not have. wmb() before spin_lock() can be used
instead.

But, try_to_wake_up() and __schedule() do not need a full barrier in
a sense that if we are going to wake this task up (or just clear its
->state), then "CONDITION = T" can be delayed till spin_unlock().

We do not care if that tasks misses CONDITION in this case, it will
call schedule() which will take the same lock. But if we are not going
to wake it up, we need to ensure that the task can't miss CONDITION.

IOW, this all is simply about

CONDITION = T; current->state = TASK_XXX;
mb();

if (p->state) if (!CONDITION)
wake_it_up(); schedule();

race.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/