[PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v8

From: Mel Gorman
Date: Thu May 22 2014 - 12:58:39 EST


Changelog since v7
o Further optimisation when PG_waiters is not available (peterz)
o Catch all opportunities to ClearPageWaiters (peterz)

Changelog since v6
o Optimisation when PG_waiters is not available (peterz)
o Documentation

Changelog since v5
o __always_inline where appropriate (peterz)
o Documentation (akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are *potentially* processes waiting on
PG_locked or PG_writeback. If there are no possible waiters then we avoid
barriers, a waitqueue hash lookup and a failed wake_up in the unlock_page
and end_page_writeback paths. There is no guarantee that waiters exist if
PG_waiters is set as multiple pages can hash to the same waitqueue and we
cannot accurately detect if a waking process is the last waiter without
a reference count. When this happens, the bit is left set and a future
unlock or writeback completion will look up the waitqueue and clear the
bit when there are no collisions. This adds a few branches to the fast
path but avoids bouncing a dirty cache line between CPUs. 32-bit machines
always take the slow path but the primary motivation for this patch is
large machines so I do not think that is a concern.
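
The hint-bit idea described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct model_page,
model_wait(), model_unlock(), the MODEL_* flags and the 4-entry table are
hypothetical names made up for this illustration. There is no locking and no
real sleeping, and it is not the kernel implementation from this patch.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MODEL_LOCKED	(1u << 0)
#define MODEL_WAITERS	(1u << 1)
#define MODEL_TABLE_SZ	4

struct model_page {
	unsigned int flags;
	unsigned int id;	/* stands in for the page address when hashing */
	int sleepers;		/* waiters sleeping on this particular page */
};

static int model_queue_len[MODEL_TABLE_SZ];	/* sleepers per hashed queue */
static int model_lookups;			/* hash lookups performed so far */

/* The "expensive" step that the PG_waiters-style hint tries to avoid. */
static int *model_page_waitqueue(struct model_page *page)
{
	model_lookups++;
	return &model_queue_len[page->id % MODEL_TABLE_SZ];
}

/* A waiter sets the hint bit and queues itself on the hashed waitqueue. */
static void model_wait(struct model_page *page)
{
	int *wq = model_page_waitqueue(page);

	page->flags |= MODEL_WAITERS;
	page->sleepers++;
	(*wq)++;
}

/* Unlock: only look up the waitqueue if a waiter may exist. */
static void model_unlock(struct model_page *page)
{
	int *wq;

	assert(page->flags & MODEL_LOCKED);
	page->flags &= ~MODEL_LOCKED;

	if (!(page->flags & MODEL_WAITERS))
		return;		/* fast path: no lookup, no wakeup attempt */

	wq = model_page_waitqueue(page);
	if (page->sleepers > 0) {
		/* Wake one sleeper of this page; it leaves the queue. */
		page->sleepers--;
		(*wq)--;
	}

	/*
	 * The hint is only cleared once the whole hashed queue is empty.
	 * A colliding page with sleepers keeps the bit set on this page.
	 */
	if (*wq == 0)
		page->flags &= ~MODEL_WAITERS;
}
```

In this model, unlocking a page that never had a waiter performs no lookup at
all. Two pages whose ids collide in the table show the false-positive case:
the first page keeps its hint bit until a later unlock finds the shared queue
empty.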

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. After each
dd there is a sync so the reported times do not vary much. By measuring
the time it takes to do the async dd, the impact of page_waitqueue
overhead for async IO is highlighted.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. The performance results are reported based on a run with
no profiling. Profile data is based on a separate run with oprofile running.

async dd
                3.15.0-rc5          3.15.0-rc5
                     mmotm         lockpage-v8
btrfs Max ddtime    0.5863 (  0.00%)    0.5593 (  4.61%)
ext3  Max ddtime    1.4870 (  0.00%)    1.4609 (  1.76%)
ext4  Max ddtime    1.0440 (  0.00%)    1.0376 (  0.61%)
tmpfs Max ddtime    0.3541 (  0.00%)    0.3478 (  1.76%)
xfs   Max ddtime    0.4995 (  0.00%)    0.4762 (  4.65%)
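
As a reading aid for the table above, the parenthesised figures appear to be
the relative reduction in Max ddtime against the mmotm baseline. That
interpretation is an assumption, and gain_percent() below is a hypothetical
helper written for this note, not part of the benchmark tooling. For btrfs,
(0.5863 - 0.5593) / 0.5863 is roughly 4.61%.

```c
/*
 * Hypothetical helper: relative gain of 'patched' over 'baseline' in
 * percent, assuming lower times are better.
 */
static double gain_percent(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

Small differences in the last digit are possible as the reported percentages
were presumably computed from unrounded times.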

A separate run with profiles showed this

     samples percentage
ext3  225851     2.3180 vmlinux-3.15.0-rc5-mmotm       test_clear_page_writeback
ext3  106848     1.0966 vmlinux-3.15.0-rc5-mmotm       __wake_up_bit
ext3   71849     0.7374 vmlinux-3.15.0-rc5-mmotm       page_waitqueue
ext3   40319     0.4138 vmlinux-3.15.0-rc5-mmotm       unlock_page
ext3   26243     0.2693 vmlinux-3.15.0-rc5-mmotm       end_page_writeback
ext3  203718     2.1020 vmlinux-3.15.0-rc5-lockpage-v8 test_clear_page_writeback
ext3   64004     0.6604 vmlinux-3.15.0-rc5-lockpage-v8 unlock_page
ext3   24753     0.2554 vmlinux-3.15.0-rc5-lockpage-v8 end_page_writeback
ext3    8618     0.0889 vmlinux-3.15.0-rc5-lockpage-v8 __wake_up_bit
ext3    7247     0.0748 vmlinux-3.15.0-rc5-lockpage-v8 __wake_up_page_bit
ext3    2012     0.0208 vmlinux-3.15.0-rc5-lockpage-v8 page_waitqueue

The profiles show a clear reduction in waitqueue and wakeup functions. Note
that the cost of end_page_writeback itself is roughly unchanged: the savings
come from fewer calls to __wake_up_bit and page_waitqueue, so there is no
obvious direct saving in that function. The cost of unlock_page is higher as
it now checks PageWaiters, but this is offset by the reduced number of calls
to page_waitqueue and __wake_up_bit. A similar story is told for each of the
filesystems. Note that for workloads that contend heavily on the page lock,
unlock_page may increase in cost as it has to clear PG_waiters, so while
the typical case should be much faster, the worst case costs are now higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

             3.15.0-rc5            3.15.0-rc5
                  mmotm           lockpage-v8
Procs 107M     423.0000 (  0.00%)    409.0000 (  3.31%)
Procs 214M     847.0000 (  0.00%)    821.0000 (  3.07%)
Procs 322M    1296.0000 (  0.00%)   1232.0000 (  4.94%)
Procs 429M    1692.0000 (  0.00%)   1646.0000 (  2.72%)
Procs 536M    2137.0000 (  0.00%)   2052.0000 (  3.98%)
Procs 644M    2542.0000 (  0.00%)   2472.0000 (  2.75%)
Procs 751M    2953.0000 (  0.00%)   2871.0000 (  2.78%)
Procs 859M    3360.0000 (  0.00%)   3290.0000 (  2.08%)
Procs 966M    3770.0000 (  0.00%)   3678.0000 (  2.44%)
Procs 1073M   4220.0000 (  0.00%)   4101.0000 (  2.82%)
Procs 1181M   4638.0000 (  0.00%)   4518.0000 (  2.59%)
Procs 1288M   5038.0000 (  0.00%)   4934.0000 (  2.06%)
Procs 1395M   5481.0000 (  0.00%)   5344.0000 (  2.50%)
Procs 1503M   5940.0000 (  0.00%)   5764.0000 (  2.96%)
Procs 1610M   6316.0000 (  0.00%)   6186.0000 (  2.06%)
Procs 1717M   6749.0000 (  0.00%)   6595.0000 (  2.28%)
Procs 1825M   7323.0000 (  0.00%)   7034.0000 (  3.95%)
Procs 1932M   7694.0000 (  0.00%)   7461.0000 (  3.03%)
Procs 2040M   8079.0000 (  0.00%)   7837.0000 (  3.00%)
Procs 2147M   8495.0000 (  0.00%)   8351.0000 (  1.70%)

    samples percentage
xfs   78334     1.3089 vmlinux-3.15.0-rc5-mmotm       page_waitqueue
xfs   55910     0.9342 vmlinux-3.15.0-rc5-mmotm       unlock_page
xfs   45120     0.7539 vmlinux-3.15.0-rc5-mmotm       __wake_up_bit
xfs   41414     0.6920 vmlinux-3.15.0-rc5-mmotm       test_clear_page_writeback
xfs    4823     0.0806 vmlinux-3.15.0-rc5-mmotm       end_page_writeback
xfs  120504     2.0046 vmlinux-3.15.0-rc5-lockpage-v8 unlock_page
xfs   49179     0.8181 vmlinux-3.15.0-rc5-lockpage-v8 test_clear_page_writeback
xfs    5397     0.0898 vmlinux-3.15.0-rc5-lockpage-v8 end_page_writeback
xfs    2101     0.0350 vmlinux-3.15.0-rc5-lockpage-v8 __wake_up_bit
xfs       5    8.3e-05 vmlinux-3.15.0-rc5-lockpage-v8 page_waitqueue
xfs       4    6.7e-05 vmlinux-3.15.0-rc5-lockpage-v8 __wake_up_page_bit

[jack@xxxxxxx: Fix add_page_wait_queue]
[mhocko@xxxxxxx: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@xxxxxxx: Do not update struct page unnecessarily]
[peterz@xxxxxxxxxxxxx: consolidate within wait.c, catch all ClearPageWaiters]
Signed-off-by: Nick Piggin <npiggin@xxxxxxx>
Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
---
include/linux/page-flags.h | 18 +++++
include/linux/wait.h | 8 +++
kernel/sched/wait.c | 161 ++++++++++++++++++++++++++++++++++++---------
mm/filemap.c | 25 +++----
mm/page_alloc.c | 1 +
mm/swap.c | 12 ++++
mm/vmscan.c | 7 ++
7 files changed, 189 insertions(+), 43 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ PG_waiters, /* Page may have PG_locked or PG_writeback waiters */
PG_head, /* A head page */
PG_tail, /* A tail page */
#else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)

__PAGEFLAG(SlobFree, slob_free)

+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+ TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS (1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+ return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS 0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
/*
* Private page markings that may be used by the filesystem that owns the page
* for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+ __PG_WAITERS | \
__PG_COMPOUND_LOCK)

/*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..9226724 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -141,14 +141,21 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
list_del(&old->task_list);
}

+struct page;
+
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
+ struct page *page, int (*)(void *), unsigned);
int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+ struct page *page, int (*)(void *), unsigned);
void wake_up_bit(void *, int);
void wake_up_atomic_t(atomic_t *);
int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
@@ -822,6 +829,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void finish_wait_page(wait_queue_head_t *q, wait_queue_t *wait, struct page *page);
void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..43e7df0 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -167,31 +167,47 @@ EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */
* stops them from bleeding out - it would still allow subsequent
* loads to move into the critical region).
*/
-void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+static __always_inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+ struct page *page, int state, bool exclusive)
{
unsigned long flags;

- wait->flags &= ~WQ_FLAG_EXCLUSIVE;
spin_lock_irqsave(&q->lock, flags);
- if (list_empty(&wait->task_list))
- __add_wait_queue(q, wait);
+
+ /*
+ * pages are hashed on a waitqueue that is expensive to lookup.
+ * __wait_on_page_bit and __wait_on_page_bit_lock pass in a page
+ * to set PG_waiters here. A PageWaiters() can then be used at
+ * unlock time or when writeback completes to detect if there
+ * are any potential waiters that justify a lookup.
+ */
+ if (page && !PageWaiters(page))
+ SetPageWaiters(page);
+ if (list_empty(&wait->task_list)) {
+ if (exclusive) {
+ wait->flags |= WQ_FLAG_EXCLUSIVE;
+ __add_wait_queue_tail(q, wait);
+ } else {
+ wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+ __add_wait_queue(q, wait);
+ }
+ }
set_current_state(state);
spin_unlock_irqrestore(&q->lock, flags);
}
+
+void
+prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+{
+ return __prepare_to_wait(q, wait, NULL, state, false);
+}
EXPORT_SYMBOL(prepare_to_wait);

void
prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
- unsigned long flags;
-
- wait->flags |= WQ_FLAG_EXCLUSIVE;
- spin_lock_irqsave(&q->lock, flags);
- if (list_empty(&wait->task_list))
- __add_wait_queue_tail(q, wait);
- set_current_state(state);
- spin_unlock_irqrestore(&q->lock, flags);
+ return __prepare_to_wait(q, wait, NULL, state, true);
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);

@@ -219,16 +235,8 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
}
EXPORT_SYMBOL(prepare_to_wait_event);

-/**
- * finish_wait - clean up after waiting in a queue
- * @q: waitqueue waited on
- * @wait: wait descriptor
- *
- * Sets current thread back to running state and removes
- * the wait descriptor from the given waitqueue if still
- * queued.
- */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+static __always_inline void __finish_wait(wait_queue_head_t *q,
+ wait_queue_t *wait, struct page *page)
{
unsigned long flags;

@@ -249,9 +257,33 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
if (!list_empty_careful(&wait->task_list)) {
spin_lock_irqsave(&q->lock, flags);
list_del_init(&wait->task_list);
+
+ /*
+ * Clear PG_waiters if the waitqueue is no longer active. There
+ * is no guarantee that a page with no waiters will get cleared
+ * as there may be unrelated pages hashed to sleep on the same
+ * queue. Accurate detection would require a counter but
+ * collisions are expected to be rare.
+ */
+ if (page && !waitqueue_active(q))
+ ClearPageWaiters(page);
spin_unlock_irqrestore(&q->lock, flags);
}
}
+
+/**
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
+void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+ return __finish_wait(q, wait, NULL);
+}
EXPORT_SYMBOL(finish_wait);

/**
@@ -313,24 +345,39 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
EXPORT_SYMBOL(wake_bit_function);

/*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking)
- * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
- * permitted return codes. Nonzero return codes halt waiting and return.
+ * waits on a bit to be cleared (see wait_on_bit in wait.h for details).
+ * A page is optionally provided when used to wait on the PG_locked or
+ * PG_writeback bit. By setting PG_waiters a lookup of the waitqueue
+ * can be avoided during unlock_page or end_page_writeback.
*/
int __sched
-__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+ struct page *page,
int (*action)(void *), unsigned mode)
{
int ret = 0;

do {
- prepare_to_wait(wq, &q->wait, mode);
+ __prepare_to_wait(wq, &q->wait, page, mode, false);
if (test_bit(q->key.bit_nr, q->key.flags))
ret = (*action)(q->key.flags);
} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
- finish_wait(wq, &q->wait);
+ __finish_wait(wq, &q->wait, page);
return ret;
}
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking)
+ * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
+ * permitted return codes. Nonzero return codes halt waiting and return.
+ */
+int __sched
+__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+ int (*action)(void *), unsigned mode)
+{
+ return __wait_on_page_bit(wq, q, NULL, action, mode);
+}
+
EXPORT_SYMBOL(__wait_on_bit);

int __sched out_of_line_wait_on_bit(void *word, int bit,
@@ -344,13 +391,14 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
EXPORT_SYMBOL(out_of_line_wait_on_bit);

int __sched
-__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+ struct page *page,
int (*action)(void *), unsigned mode)
{
do {
int ret;

- prepare_to_wait_exclusive(wq, &q->wait, mode);
+ __prepare_to_wait(wq, &q->wait, page, mode, true);
if (!test_bit(q->key.bit_nr, q->key.flags))
continue;
ret = action(q->key.flags);
@@ -359,9 +407,16 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
abort_exclusive_wait(wq, &q->wait, mode, &q->key);
return ret;
} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
- finish_wait(wq, &q->wait);
+ __finish_wait(wq, &q->wait, page);
return 0;
}
+
+int __sched
+__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+ int (*action)(void *), unsigned mode)
+{
+ return __wait_on_page_bit_lock(wq, q, NULL, action, mode);
+}
EXPORT_SYMBOL(__wait_on_bit_lock);

int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
@@ -380,6 +435,48 @@ void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
if (waitqueue_active(wq))
__wake_up(wq, TASK_NORMAL, 1, &key);
}
+
+void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
+{
+ struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+ unsigned long flags;
+
+ /*
+ * If there is no PG_waiters bit (32-bit), then waitqueue_active can be
+ * checked without wqh->lock as there is no PG_waiters race to protect.
+ */
+ if (!__PG_WAITERS) {
+ if (waitqueue_active(wqh))
+ __wake_up(wqh, TASK_NORMAL, 1, &key);
+ return;
+ }
+
+ /*
+ * Unlike __wake_up_bit it is necessary to check waitqueue_active
+ * under the wqh->lock to avoid races with parallel additions that
+ * could result in lost wakeups.
+ */
+ spin_lock_irqsave(&wqh->lock, flags);
+ if (waitqueue_active(wqh)) {
+ /*
+ * Try waking a task on the queue. Responsibility for clearing
+ * the PG_waiters bit is left to the last waiter on the
+ * waitqueue as PageWaiters is called outside wqh->lock and
+ * we cannot miss wakeups. Due to hashqueue collisions, there
+ * may be colliding pages that still have PG_waiters set but
+ * the impact means there will be at least one unnecessary
+ * lookup of the page waitqueue on the next unlock_page or
+ * end of writeback.
+ */
+ __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
+ } else {
+ /* No potential waiters, safe to clear PG_waiters */
+ ClearPageWaiters(page);
+ }
+ spin_unlock_irqrestore(&wqh->lock, flags);
+}
+
+
EXPORT_SYMBOL(__wake_up_bit);

/**
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..07633a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -682,9 +682,9 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

-static inline void wake_up_page(struct page *page, int bit)
+static inline void wake_up_page(struct page *page, int bit_nr)
{
- __wake_up_bit(page_waitqueue(page), &page->flags, bit);
+ __wake_up_page_bit(page_waitqueue(page), page, &page->flags, bit_nr);
}

void wait_on_page_bit(struct page *page, int bit_nr)
@@ -692,8 +692,8 @@ void wait_on_page_bit(struct page *page, int bit_nr)
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

if (test_bit(bit_nr, &page->flags))
- __wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
- TASK_UNINTERRUPTIBLE);
+ __wait_on_page_bit(page_waitqueue(page), &wait, page,
+ sleep_on_page, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(wait_on_page_bit);

@@ -704,7 +704,7 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
if (!test_bit(bit_nr, &page->flags))
return 0;

- return __wait_on_bit(page_waitqueue(page), &wait,
+ return __wait_on_page_bit(page_waitqueue(page), &wait, page,
sleep_on_page_killable, TASK_KILLABLE);
}

@@ -743,7 +743,8 @@ void unlock_page(struct page *page)
VM_BUG_ON_PAGE(!PageLocked(page), page);
clear_bit_unlock(PG_locked, &page->flags);
smp_mb__after_atomic();
- wake_up_page(page, PG_locked);
+ if (unlikely(PageWaiters(page)))
+ wake_up_page(page, PG_locked);
}
EXPORT_SYMBOL(unlock_page);

@@ -769,7 +770,8 @@ void end_page_writeback(struct page *page)
BUG();

smp_mb__after_atomic();
- wake_up_page(page, PG_writeback);
+ if (unlikely(PageWaiters(page)))
+ wake_up_page(page, PG_writeback);
}
EXPORT_SYMBOL(end_page_writeback);

@@ -806,8 +808,8 @@ void __lock_page(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

- __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
- TASK_UNINTERRUPTIBLE);
+ __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+ sleep_on_page, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_page);

@@ -815,9 +817,10 @@ int __lock_page_killable(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

- return __wait_on_bit_lock(page_waitqueue(page), &wait,
- sleep_on_page_killable, TASK_KILLABLE);
+ return __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+ sleep_on_page_killable, TASK_KILLABLE);
}
+
EXPORT_SYMBOL_GPL(__lock_page_killable);

int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1f005..ebb947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6603,6 +6603,7 @@ static const struct trace_print_flags pageflag_names[] = {
{1UL << PG_private_2, "private_2" },
{1UL << PG_writeback, "writeback" },
#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ {1UL << PG_waiters, "waiters" },
{1UL << PG_head, "head" },
{1UL << PG_tail, "tail" },
#else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..1581dbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
static void __put_single_page(struct page *page)
{
__page_cache_release(page);
+
+ /* See release_pages on why this clear may be necessary */
+ __ClearPageWaiters(page);
+
free_hot_cold_page(page, false);
}

@@ -916,6 +920,14 @@ void release_pages(struct page **pages, int nr, bool cold)
/* Clear Active bit in case of parallel mark_page_accessed */
__ClearPageActive(page);

+ /*
+ * pages are hashed on a waitqueue so there may be collisions.
+ * When waiters are woken the waitqueue is checked but
+ * unrelated pages on the queue can leave the bit set. Clear
+ * it here if that happens.
+ */
+ __ClearPageWaiters(page);
+
list_add(&page->lru, &pages_to_free);
}
if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..d7a4969 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* waiting on the page lock, because there are no references.
*/
__clear_page_locked(page);
+
+ /* See release_pages on why this clear may be necessary */
+ __ClearPageWaiters(page);
free_it:
nr_reclaimed++;

@@ -1427,6 +1430,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
if (put_page_testzero(page)) {
__ClearPageLRU(page);
__ClearPageActive(page);
+ /* See release_pages on why this clear may be necessary */
+ __ClearPageWaiters(page);
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
@@ -1650,6 +1655,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
if (put_page_testzero(page)) {
__ClearPageLRU(page);
__ClearPageActive(page);
+ /* See release_pages on why this clear may be necessary */
+ __ClearPageWaiters(page);
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
--