Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

From: Hugh Dickins
Date: Fri May 18 2012 - 21:42:11 EST


On Mon, 14 May 2012, Hugh Dickins wrote:
> On Mon, 14 May 2012, Andrew Morton wrote:
> > On Sat, 12 May 2012 04:59:56 -0700 (PDT)
> > Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > >
> > > We'd like to continue to support GMA500, so now add a new
> > > shmem_should_replace_page() check on the zone when about to move
> > > a page from swapcache to filecache (in swapin and swapoff cases),
> > > with shmem_replace_page() to allocate and substitute a suitable page
> > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
> > >
> ...
> > > + gfp = mapping_gfp_mask(mapping);
> > > + if (shmem_should_replace_page(*pagep, gfp)) {
> > > + mutex_unlock(&shmem_swaplist_mutex);
> > > + error = shmem_replace_page(pagep, gfp, info, index);
> > > + mutex_lock(&shmem_swaplist_mutex);
> > > + /*
> > > + * We needed to drop mutex to make that restrictive page
> > > + * allocation; but the inode might already be freed by now,
> > > + * and we cannot refer to inode or mapping or info to check.
> > > + * However, we do hold page lock on the PageSwapCache page,
> > > + * so can check if that still has our reference remaining.
> > > + */
> > > + if (!page_swapcount(*pagep))
> > > + error = -ENOENT;
> >
> > This has my head spinning a bit. What is "our reference"? I'd expect
> > that to mean a temporary reference which was taken by this thread of
> > control.
>
> (I'm sure you'll prefer a reworking of that comment in an incremental
> fixes patch, but let me try to explain better here too.)
>
> No, I didn't mean a temporary reference taken by this (swapoff) thread,
> but the reference (swap entry) which has just been located in the inode's
> radix_tree, just before this hunk: which would be tracked by page_swapcount
> 1 (there's also a page swapcache bit in the swap_map along with the count,
> corresponding to the reference from the swapcache page itself, but that's
> not included in page_swapcount).
>
> > But such a thing has no relevance when trying to determine
> > the state of the page and/or data structures which refer to it.
>
> I don't understand you there, but maybe it won't matter.
>
> >
> > Also, what are we trying to determine here with this test? Whether the
> > page was removed from swapcache under our feet? Presumably not, as it
> > is locked.
> >
> > So perhaps you could spell out in more detail what we're trying to do
> > here, and what contributes to page_swapcount() here?
>
> The danger here is that the inode we're dealing with has gone through
> shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was
> certainly in use before, and shmem_swaplist_mutex (together with inode
> being on shmem_swaplist) holds it up from being evicted and freed; but
> once we drop the mutex, it could go away at any moment. We cannot
> determine that by looking at struct inode or struct address_space or
> struct shmem_inode_info, they're all part of what would be freed;
> but we cannot proceed to shmem_add_to_page_cache() once they're freed.
> How to tell whether it's been freed?
>
> Once upon a time I "solved" it with igrab() and iput(), but Konstantin
> demonstrated how that gives no safety against unmounting, and I remain
> reluctant to go back to relying upon filesystem semantics to solve this.
>
> It occurred to me that the inode cannot be freed until that radix_tree
> entry has been removed (by shmem_evict_inode's shmem_truncate_range),
> and the act or removing that entry (free_swap_and_cache) brings
> page_swapcount down from 1 to 0.
>
> You're thinking that the page cannot be removed from swapcache while
> we hold page lock: correct, but... free_swap_and_cache() only does a
> trylock_page(), and happily leaves the swapcache page to be garbage
> collected later if it cannot get the page lock. (And I certainly
> would not want to change it to wait for page lock.) So, the inode
> can get evicted while the page is still in swapcache: the page lock
> gives no protection against that, until the page itself gets into
> the radix_tree.
>
> I doubt that writing this essay into a comment there will be the
> right thing to do (and I may still be losing you); but I shall try
> to rewrite it, and if there's one missing fact that needs highlighting,
> it probably is that last, that free_swap_and_cache() only does a trylock,
> so our page lock does not protect the inode from eviction.
>
> (At this moment, I can't think what is the relevance of my comment
> "we do hold page lock on the PageSwapCache page": in other contexts it
> would be important, but here in swapoff we know that that swap cannot
> get reused, or not before we're done.)
>
> > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct
> > > struct list_head *this, *next;
> > > struct shmem_inode_info *info;
> > > int found = 0;
> > > - int error;
> > > + int error = 0;
> > > +
> > > + /*
> > > + * There's a faint possibility that swap page was replaced before
> > > + * caller locked it: it will come back later with the right page.
> >
> > So a caller locked the page then failed to check that it's still the
> > right sort of page? Shouldn't the caller locally clean up its own mess
> > rather than requiring a callee to know about the caller's intricate
> > shortcomings?
>
> The caller being try_to_unuse(). You're certainly not the first to argue
> that way. Perhaps I'm a bit perverse, in letting code which works even
> in the surprising cases, remain as it is without weeding out those
> surprising cases. And on this occasion didn't want to add an additional
> dependence on a slight subtle change in mm/swapfile.c functionality.
>
> Hmm, yes, I do still prefer to have the check here in shmem.c:
> particularly since it is this "shmem_replace_page" business which is
> increasing the likelihood of such a race, and making further demands
> on it (if we're going to make the copied page PageSwapCache, then we
> need to be sure that the page it's replacing was PageSwapCache - though
> that's something I need to think through again in the light of the race
> which I thought of in responding to Cong).
>
> > > + newpage = shmem_alloc_page(gfp, info, index);
> > > + if (!newpage)
> > > + return -ENOMEM;
> > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp));
> > > +
> > > + *pagep = newpage;
> > > + page_cache_get(newpage);
> > > + copy_highpage(newpage, oldpage);
> >
> > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()?
>
> Ooh, I'm pretty sure you're right that we do need flush_dcache_page()
> there: good catch, thank you. We can't use copy_user_highpage() because
> in general we don't know any address and vma; but should be following the
> shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate.
>
> >
> > shmem_replace_page() is a fairly generic and unexceptional sounding
> > thing. Methinks shmem_substitute_page() would be a better name.
>
> Okay, shmem_replace_page() seemed appropriate to me (especially thinking
> of it as "re-place"), but I don't mind changing to shmem_substitute_page().
>
> The flush_dcache_page() addition is important, but until people are
> using GMA500 on ARM or something (I doubt that combination) with more
> than 4GB, this code is not coming into play - so I'm not breaking anyone's
> system if it sneaks into linux-next before I fix that.
>
> The main thing I need to think through quietly is the slippery swap race:
> I'll send you an incremental patch to fix all these up once I'm satisfied
> on that.

I promised you an incremental, but that's not really possible because of
the name changes from "replace" to "substitute". So I'll be sending you
a v2 patch in a moment, to replace (or substitute for) the original 1/10.

It responds to feedback comment:

1. "substitute" instead of "replace" [akpm]
2. more explanation of page_swapcount test [akpm]
3. flush_dcache_page after copy_highpage [akpm]
4. removal of excessive VM_BUG_ONs [wangcong]
5. check page_private before and error path within substitute_page [hughd]

See below for a diff from v1 for review, omitting replace->substitute mods.

Please don't be disappointed if I send you a further patch to
shmem_substitute_page() in the weeks ahead: although the page_private
checks I've added in this one make it very very very unlikely, and its
consequence very probably benign, there is still a surprising (never
observed) race by which shmem_getpage_gfp() could get hold of someone
else's swap.

It's correctly resolved by shmem_add_to_page_cache(), but by that time
we have already done a mem_cgroup charge, and now also this substitution.
It would be better to rearrange a little here, to eliminate all chance of
that surprise: I hoped to complete that earlier, but now think I'd better
get the safer intermediate version to you first.

Thanks,
Hugh

---
mm/shmem.c | 57 +++++++++++++++++++++++++++++++++------------------
1 file changed, 37 insertions(+), 20 deletions(-)

--- 3045N.orig/mm/shmem.c 2012-05-17 16:28:43.278076430 -0700
+++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700
@@ -636,10 +636,21 @@ static int shmem_unuse_inode(struct shme
mutex_lock(&shmem_swaplist_mutex);
/*
* We needed to drop mutex to make that restrictive page
- * allocation; but the inode might already be freed by now,
- * and we cannot refer to inode or mapping or info to check.
- * However, we do hold page lock on the PageSwapCache page,
- * so can check if that still has our reference remaining.
+ * allocation, but the inode might have been freed while we
+ * dropped it: although a racing shmem_evict_inode() cannot
+ * complete without emptying the radix_tree, our page lock
+ * on this swapcache page is not enough to prevent that -
+ * free_swap_and_cache() of our swap entry will only
+ * trylock_page(), removing swap from radix_tree whatever.
+ *
+ * We must not proceed to shmem_add_to_page_cache() if the
+ * inode has been freed, but of course we cannot rely on
+ * inode or mapping or info to check that. However, we can
+ * safely check if our swap entry is still in use (and here
+ * it can't have got reused for another page): if it's still
+ * in use, then the inode cannot have been freed yet, and we
+ * can safely proceed (if it's no longer in use, that tells
+ * nothing about the inode, but we don't need to unuse swap).
*/
if (!page_swapcount(*pagep))
error = -ENOENT;
@@ -683,9 +694,9 @@ int shmem_unuse(swp_entry_t swap, struct

/*
* There's a faint possibility that swap page was substituted before
- * caller locked it: it will come back later with the right page.
+ * caller locked it: caller will come back later with the right page.
*/
- if (unlikely(!PageSwapCache(page)))
+ if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
goto out;

/*
@@ -916,21 +927,15 @@ static int shmem_substitute_page(struct
newpage = shmem_alloc_page(gfp, info, index);
if (!newpage)
return -ENOMEM;
- VM_BUG_ON(shmem_should_substitute_page(newpage, gfp));

- *pagep = newpage;
page_cache_get(newpage);
copy_highpage(newpage, oldpage);
+ flush_dcache_page(newpage);

- VM_BUG_ON(!PageLocked(oldpage));
__set_page_locked(newpage);
- VM_BUG_ON(!PageUptodate(oldpage));
SetPageUptodate(newpage);
- VM_BUG_ON(!PageSwapBacked(oldpage));
SetPageSwapBacked(newpage);
- VM_BUG_ON(!swap_index);
set_page_private(newpage, swap_index);
- VM_BUG_ON(!PageSwapCache(oldpage));
SetPageSwapCache(newpage);

/*
@@ -940,13 +945,24 @@ static int shmem_substitute_page(struct
spin_lock_irq(&swap_mapping->tree_lock);
error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
newpage);
- __inc_zone_page_state(newpage, NR_FILE_PAGES);
- __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ if (!error) {
+ __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ }
spin_unlock_irq(&swap_mapping->tree_lock);
- BUG_ON(error);

- mem_cgroup_replace_page_cache(oldpage, newpage);
- lru_cache_add_anon(newpage);
+ if (unlikely(error)) {
+ /*
+ * Is this possible? I think not, now that our callers check
+ * both PageSwapCache and page_private after getting page lock;
+ * but be defensive. Reverse old to newpage for clear and free.
+ */
+ oldpage = newpage;
+ } else {
+ mem_cgroup_replace_page_cache(oldpage, newpage);
+ lru_cache_add_anon(newpage);
+ *pagep = newpage;
+ }

ClearPageSwapCache(oldpage);
set_page_private(oldpage, 0);
@@ -954,7 +970,7 @@ static int shmem_substitute_page(struct
unlock_page(oldpage);
page_cache_release(oldpage);
page_cache_release(oldpage);
- return 0;
+ return error;
}

/*
@@ -1025,7 +1041,8 @@ repeat:

/* We have to do this with page locked to prevent races */
lock_page(page);
- if (!PageSwapCache(page) || page->mapping) {
+ if (!PageSwapCache(page) || page_private(page) != swap.val ||
+ page->mapping) {
error = -EEXIST; /* try again */
goto failed;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/