Re: Possible memory ordering bug in page reclaim?

From: Nick Piggin
Date: Sat Oct 15 2005 - 08:45:56 EST

Herbert Xu wrote:
Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:

Well yes, that's on the store side (1, above). However can't a CPU
still speculatively (eg. guess the branch) load the page->flags
cacheline which might be satisfied from memory before the page->count
cacheline loads? Ie. you can still have the correct write ordering
but have incorrect read ordering?

Because neither PageDirty nor page_count is a barrier, and there is
no read barrier between them.

Yes you're right. A read barrier is required here.

I think Ben was actually agreeing with you. He's just questioning
whether the corresponding write barrier existed on CPU 1 (the answer
to which is affirmative).

Ah, that clears up my misunderstanding.

Yes I agree the write side is OK.

Thanks Ben and Herbert. I guess I should do a proper patch then.

SUSE Labs, Novell Inc.

In mm/vmscan.c, the page reclaim may have the following sequence 2
running concurrently with sequence 1 on another CPU:

1 2
write to page write_lock(tree_lock);
SetPageDirty(); if (page_count != 2
put_page(); || PageDirty())
/* page dirty or busy */
/* free it */

The comment indicates that PageDirty must be checked *after* page_count
indicates there are no users of this page, which prevents the dirty bit
from being lost in the case that that sequence 2 might see the state of
PageDirty() *before* SetPageDirty() in 1, but page_count *after* put_page
in 1.

However, there is no read memory barrier there, and so nothing to stop a
CPU from loading page_count before PageDirty (ie. ->flags). Theoretically,
data corruption is possible.

Signed-off-by: Nick Piggin <npiggin@xxxxxxx>

Index: linux-2.6/mm/vmscan.c
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -511,7 +511,12 @@ static int shrink_list(struct list_head
* PageDirty _after_ making sure that the page is freeable and
* not in use by anybody. (pagecache + us == 2)
- if (page_count(page) != 2 || PageDirty(page)) {
+ if (page_count(page) != 2) {
+ write_unlock_irq(&mapping->tree_lock);
+ goto keep_locked;
+ }
+ smp_rmb();
+ if (PageDirty(page)) {
goto keep_locked;