Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

From: Johannes Weiner
Date: Mon May 11 2020 - 14:11:19 EST


On Mon, May 11, 2020 at 09:32:16AM -0700, Hugh Dickins wrote:
> On Mon, 11 May 2020, Johannes Weiner wrote:
> > On Mon, May 11, 2020 at 12:38:04AM -0700, Hugh Dickins wrote:
> > > On Fri, 8 May 2020, Johannes Weiner wrote:
> > > >
> > > > I looked at this some more, as well as compared it to non-shmem
> > > > swapping. My conclusion is - and Hugh may correct me on this - that
> > > > the deletion looks mandatory but is actually an optimization. Page
> > > > reclaim will ultimately pick these pages up.
> > > >
> > > > When non-shmem pages are swapped in by readahead (locked until IO
> > > > completes) and their page tables are simultaneously unmapped, the
> > > > zap_pte_range() code calls free_swap_and_cache() and the locked pages
> > > > are stranded in the swap cache with no page table references. We rely
> > > > on page reclaim to pick them up later on.
> > > >
> > > > The same appears to be true for shmem. If the references to the swap
> > > > page are zapped while we're trying to swap in, we can strand the page
> > > > in the swap cache. But it's not up to swapin to detect this reliably,
> > > > it just frees the page more quickly than having to wait for reclaim.
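
(To make the non-shmem case above concrete - a minimal sketch of what
free_swap_and_cache() does once the last swap count on the entry is
gone; written from memory rather than copied from mm/swapfile.c, and
reclaim_stale_swapcache_sketch() is just an illustrative name:)

	static void reclaim_stale_swapcache_sketch(struct page *page)
	{
		/*
		 * Swapin readahead holds the page lock until the IO
		 * completes, so this trylock fails and the page is
		 * stranded in the swap cache for reclaim to find.
		 */
		if (page && trylock_page(page)) {
			try_to_free_swap(page);
			unlock_page(page);
			put_page(page);
		}
	}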
> > >
> > > I think you've got all that exactly right, thanks for working it out.
> > > It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp()
> > > VM_BUG_ON") - in which I also had to thank you.
> >
> > I should have looked where it actually came from - I had forgotten
> > about that patch!
> >
> > > I think I chose to do the delete_from_swap_cache() right there, partly
> > > because of following shmem_unuse_inode() code which already did that,
> > > partly on the basis that since we have to observe the case anyway, it's
> > > better to clean it up, and partly out of guilt that our page lock here
> > > is what had prevented shmem_undo_range() from completing its job; but
> > > I believe you're right that unused swapcache reclaim would sort it out
> > > eventually.
> >
> > That makes sense to me.
> >
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index e80167927dce..236642775f89 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
> > > >  		xas_lock_irq(&xas);
> > > >  		entry = xas_find_conflict(&xas);
> > > >  		if (entry != expected)
> > > > -			xas_set_err(&xas, -EEXIST);
> > > > +			xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
> > >
> > > Two things on this.
> > >
> > > Minor matter of taste, I'd prefer that as
> > > xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
> > > which would be more general and more understandable -
> > > but what you have written should be fine for the actual callers.
> >
> > Yes, checking `expected' was to differentiate the behavior depending
> > on the callsite. But testing `entry' is more obvious in that location.
> >
> > > Except... I think returning -ENOENT there will not work correctly,
> > > in the case of a punched hole. Because (unless you've reworked it
> > > and I just haven't looked) shmem_getpage_gfp() knows to retry in
> > > the case of -EEXIST, but -ENOENT will percolate up to shmem_fault()
> > > and result in a SIGBUS, or a read/write error, when the hole should
> > > just get refilled instead.
> >
> > Good catch, I had indeed missed that. I'm going to make it retry on
> > -ENOENT as well.
> >
> > We could have it go directly to allocating a new page, but it seems
> > unnecessarily complicated: we've already been retrying in this
> > situation until now, so I would stick to "there was a race, retry."
> >
> > > Not something that needs fixing in a hurry (it took trinity to
> > > generate this racy case in the first place), I'll take another look
> > > once I've pulled it into a tree (or collected next mmotm) - unless
> > > you've already changed it around by then.
> >
> > Attaching a delta fix based on your observations.
> >
> > Andrew, barring any objections to this, could you please fold it into
> > the version you have in your tree already?
>
> Not so strong as an objection, and I won't get to see whether your
> retry on -ENOENT is good (can -ENOENT arrive at that point from any
> other case, that might endlessly retry?) until I've got the full
> context; but I had arrived at the opposite conclusion overnight.
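
For reference, the retry I had in mind - just a sketch of the swapin
branch in shmem_getpage_gfp(), not necessarily the final code:

	if (xa_is_value(page)) {
		error = shmem_swapin_page(inode, index, &page,
					  sgp, gfp, vma, fault_type);
		/*
		 * Tree conflict: raced with a parallel swapin or with
		 * hole punching - look the entry up again.
		 */
		if (error == -EEXIST || error == -ENOENT)
			goto repeat;
		*pagep = page;
		return error;
	}

Whether -ENOENT can reach that branch from anywhere else is a fair
question - and moot if we drop the distinction as you suggest below.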
>
> Given that this case only appeared with a fuzzer, and stale swapcache
> reclaim is anyway relied upon to clean up after plenty of other such
> races, I think we should agree that I over-complicated the VM_BUG_ON
> removal originally, and it's best to kill that delete_from_swap_cache(),
> and the comment having to explain it, and your EEXIST/ENOENT distinction.
>
> (I haven't checked, but I suspect that the shmem_unuse_inode() case
> that I copied from, actually really needed to delete_from_swap_cache(),
> in order to swapoff the page without full retry of the big swapoff loop.)

Since commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"),
shmem_unuse_inode() doesn't have its own copy anymore - it uses
shmem_swapin_page().

However, that commit appears to have made shmem's private call to
delete_from_swap_cache() obsolete as well. Whereas before this change
we fully relied on shmem_unuse() to find and clear a shmem swap entry
and its swapcache page, we now only need it to clean out shmem's
private state in the inode, as it's followed by a loop over all
remaining swap slots, calling try_to_free_swap() on stragglers.
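
That tail of try_to_unuse() looks roughly like this (abridged from
memory, with the error and signal handling omitted):

	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
		entry = swp_entry(type, i);
		page = find_get_page(swap_address_space(entry), i);
		if (!page)
			continue;

		/*
		 * Mop up any swapcache page left behind by the shmem
		 * and mm walks above; try_to_free_swap() only removes
		 * stale pages.
		 */
		lock_page(page);
		wait_on_page_writeback(page);
		try_to_free_swap(page);
		unlock_page(page);
		put_page(page);
	}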

Unless I missed something, it's still merely an optimization, and we
can delete it for simplicity:

---