Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()
From: Rafael Aquini
Date: Fri May 22 2020 - 13:09:00 EST
On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
>
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness. We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
>
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
>
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
>
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
>
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes look at; but could be rediffed for
> folding into blah-instantiation.patch later.
>
> Earlier in the day I promised two patches to __read_swap_cache_async(),
> but find now that I cannot quite justify the second patch: it makes a
> slight adjustment in swapcache_prepare(), then removes the redundant
> __swp_swapcount() && swap_slot_cache_enabled business from blah_async().
>
> I'd still like to do that, but this patch here brings back the
> alloc_page_vma() in between them, and I don't have any evidence to
> reassure us that I'm not then pessimizing a readahead case by doing
> unnecessary allocation and free. Leave it for some other time perhaps.
>
> mm/swap_state.c | 52 +++++++++++++++++++++++++---------------------
> 1 file changed, 29 insertions(+), 23 deletions(-)
>
> --- 5.7-rc6-mm1/mm/swap_state.c 2020-05-20 12:21:56.149694170 -0700
> +++ linux/mm/swap_state.c 2020-05-21 20:17:50.188773901 -0700
> @@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
> return NULL;
>
> /*
> + * Get a new page to read into from swap. Allocate it now,
> + * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> + * cause any racers to loop around until we add it to cache.
> + */
> + page = alloc_page_vma(gfp_mask, vma, addr);
> + if (!page)
> + return NULL;
> +
> + /*
> * Swap entry may have been freed since our caller observed it.
> */
> err = swapcache_prepare(entry);
> if (!err)
> break;
>
> - if (err == -EEXIST) {
> - /*
> - * We might race against get_swap_page() and stumble
> - * across a SWAP_HAS_CACHE swap_map entry whose page
> - * has not been brought into the swapcache yet.
> - */
> - cond_resched();
> - continue;
> - }
> + put_page(page);
> + if (err != -EEXIST)
> + return NULL;
>
> - return NULL;
> + /*
> + * We might race against __delete_from_swap_cache(), and
> + * stumble across a swap_map entry whose SWAP_HAS_CACHE
> + * has not yet been cleared. Or race against another
> + * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> + * in swap_map, but not yet added its page to swap cache.
> + */
> + cond_resched();
> }
>
> /*
> - * The swap entry is ours to swap in. Prepare a new page.
> + * The swap entry is ours to swap in. Prepare the new page.
> */
>
> - page = alloc_page_vma(gfp_mask, vma, addr);
> - if (!page)
> - goto fail_free;
> -
> __SetPageLocked(page);
> __SetPageSwapBacked(page);
>
> /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> + put_swap_page(page, entry);
> goto fail_unlock;
> + }
>
> - if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> - goto fail_delete;
> + if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> + delete_from_swap_cache(page);
> + goto fail_unlock;
> + }
>
> - /* Initiate read into locked page */
> + /* Caller will initiate read into locked page */
> SetPageWorkingset(page);
> lru_cache_add_anon(page);
> *new_page_allocated = true;
> return page;
>
> -fail_delete:
> - delete_from_swap_cache(page);
> fail_unlock:
> unlock_page(page);
> put_page(page);
> -fail_free:
> - swap_free(entry);
> return NULL;
> }
>
>
Acked-by: Rafael Aquini <aquini@xxxxxxxxxx>