Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath

From: Paul E. McKenney
Date: Tue May 13 2014 - 11:27:40 EST

Next message: Peter Zijlstra: "Re: [SCHED_DEADLINE man pages 0/2] Summary"
Previous message: Rafael J. Wysocki: "Re: [RFC][PATCH 2/3] PM / sleep: Mechanism to avoid resuming runtime-suspended devices unnecessarily"
In reply to: Peter Zijlstra: "Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath"
Next in thread: Peter Zijlstra: "Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 13, 2014 at 04:17:48PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > > void unlock_page(struct page *page)
> > > {
> > > + wait_queue_head_t *wqh = clear_page_waiters(page);
> > > +
> > > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +
> > > + /*
> > > + * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > + * before waking waiters
> > > + */
> > > clear_bit_unlock(PG_locked, &page->flags);
> > > - smp_mb__after_clear_bit();
> > > - wake_up_page(page, PG_locked);
> >
> > This is wrong. The smp_mb__after_clear_bit() is still required to ensure
> > that the cleared bit is visible before the wakeup on all architectures.
>
> wakeup implies a mb, and I just noticed that our Documentation is
> 'obsolete' and only mentions it implies a wmb.
>
> Also, if you're going to use smp_mb__after_atomic() you can use
> clear_bit() and not use clear_bit_unlock().
>
>
>
> ---
> Subject: doc: Update wakeup barrier documentation
>
> As per commit e0acd0a68ec7 ("sched: fix the theoretical signal_wake_up()
> vs schedule() race") both wakeup and schedule now imply a full barrier.
>
> Furthermore, the barrier is unconditional when calling try_to_wake_up()
> and has been for a fair while.
>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: David Howells <dhowells@xxxxxxxxxx>
> Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

Some questions below.

Thanx, Paul

> ---
> Documentation/memory-barriers.txt | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 46412bded104..dae5158c2382 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
> event_indicated = 1;
> wake_up_process(event_daemon);
>
> -A write memory barrier is implied by wake_up() and co. if and only if they wake
> -something up. The barrier occurs before the task state is cleared, and so sits
> -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> +A full memory barrier is implied by wake_up() and co. The barrier occurs

Last I checked, the memory barrier was guaranteed only if a wakeup
actually occurred. If there is a sleep-wakeup race, for example,
between wait_event_interruptible() and wake_up(), then it looks to me
that the following can happen:

o Task A invokes wait_event_interruptible(), waiting for
X==1.

o Before Task A gets anywhere, Task B sets Y=1, does
smp_mb(), then sets X=1.

o Task B invokes wake_up(), which invokes __wake_up(), which
acquires the wait_queue_head_t's lock and invokes
__wake_up_common(), which sees nothing to wake up.

o Task A tests the condition, finds X==1, and returns without
locks, memory barriers, atomic instructions, or anything else
that would guarantee ordering.

o Task A then loads from Y. Because there have been no memory
barriers, it might well see Y==0.

So what am I missing here?

On the other hand, if a wake_up() really does happen, then
the fast-path out of wait_event_interruptible() is not taken,
and __wait_event_interruptible() is called instead. This calls
___wait_event(), which eventually calls prepare_to_wait_event(), which
in turn calls set_current_state(), which calls set_mb(), which does a
full memory barrier. And if that isn't good enough, there is the
call to schedule() itself. ;-)

So if a wait actually sleeps, it does imply a full memory barrier
several times over.

On the wake_up() side, wake_up() calls __wake_up(), which as mentioned
earlier calls __wake_up_common() under a lock. This invokes the
wake-up function stored by the sleeping task, for example,
autoremove_wake_function(), which calls default_wake_function(),
which invokes try_to_wake_up(), which does smp_mb__before_spinlock()
before acquiring the to-be-waked task's PI lock.

The definition of smp_mb__before_spinlock() is smp_wmb(). There is
also an smp_rmb() in try_to_wake_up(), which still does not get us
to a full memory barrier. It also calls select_task_rq(), which
does not seem to guarantee any particular memory ordering (but
I could easily have missed something). It also calls ttwu_queue(),
which invokes ttwu_do_activate() under the RQ lock. I don't see a
full memory barrier in ttwu_do_activate(), but again could easily
have missed one. Ditto for ttwu_stat().

All the locks nest, so other than the smp_wmb() and smp_rmb(), things
could bleed in.

> +before the task state is cleared, and so sits between the STORE to indicate
> +the event and the STORE to set TASK_RUNNING:

If I am in fact correct, and if we really want to advertise the read
memory barrier, I suggest the following replacement text:

A read and a write memory barrier (-not- a full memory barrier)
are implied by wake_up() and co. if and only if they wake
something up. The write barrier occurs before the task state is
cleared, and so sits between the STORE to indicate the event and
the STORE to set TASK_RUNNING, and the read barrier after that:

CPU 1 CPU 2
=============================== ===============================
set_current_state(); STORE event_indicated
set_mb(); wake_up();
STORE current->state <write barrier>
<general barrier> STORE current->state
LOAD event_indicated <read barrier>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Peter Zijlstra: "Re: [SCHED_DEADLINE man pages 0/2] Summary"
Previous message: Rafael J. Wysocki: "Re: [RFC][PATCH 2/3] PM / sleep: Mechanism to avoid resuming runtime-suspended devices unnecessarily"
In reply to: Peter Zijlstra: "Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath"
Next in thread: Peter Zijlstra: "Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]