Re: [PATCH] mm: avoid unconditional one-tick sleep when swapcache_prepare fails

From: Barry Song
Date: Fri Oct 04 2024 - 12:16:50 EST


On Fri, Oct 4, 2024 at 6:22 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Thu, Sep 26, 2024 at 2:20 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > From: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > fails, which has led to reports of UI stuttering on latency-sensitive
> > Android devices. To address this, we can use a waitqueue to wake up
> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > sleeping for a full tick. While tasks may occasionally be woken by an
> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> > rapid re-entry into page faults, which can cause livelocks, and
> > multiple millisecond sleeps, which visibly degrade user experience.
> >
> > Oven's testing shows that a single waitqueue resolves the UI
> > stuttering issue. If a 'thundering herd' problem becomes apparent
> > later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]`
> > for page bit locks can be introduced.
> >
> > Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > Cc: Kairui Song <kasong@xxxxxxxxxxx>
> > Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
> > Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
> > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > Cc: Chris Li <chrisl@xxxxxxxxxx>
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > Cc: Minchan Kim <minchan@xxxxxxxxxx>
> > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Cc: SeongJae Park <sj@xxxxxxxxxx>
> > Cc: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Reported-by: Oven Liyang <liyangouwen1@xxxxxxxx>
> > Tested-by: Oven Liyang <liyangouwen1@xxxxxxxx>
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> > mm/memory.c | 13 +++++++++++--
> > 1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..6913174f7f41 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > }
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4206,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > struct folio *swapcache, *folio = NULL;
> > + DECLARE_WAITQUEUE(wait, current);
> > struct page *page;
> > struct swap_info_struct *si = NULL;
> > rmap_t rmap_flags = RMAP_NONE;
> > @@ -4302,7 +4305,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * Relax a bit to prevent rapid
> > * repeated page faults.
> > */
> > + add_wait_queue(&swapcache_wq, &wait);
> > schedule_timeout_uninterruptible(1);
> > + remove_wait_queue(&swapcache_wq, &wait);
>
> There is only one "swapcache_wq". If we don't care about the memory
> overhead, ideally there should be one wait queue per swap entry that
> fails to grab the HAS_CACHE bit. Currently, all swap entries share
> one wait queue, which will likely cause waiters on other swap entries
> (if any) to get woken up only to find that the swap entry they care
> about hasn't been served yet.
>

Even page bit locks don't have a per-page waitqueue, and I believe that
case sees much more serious contention than swap-in. Page bit locks rely
on a waitqueue hash to reduce unrelated wake-ups.

If a process is woken up by an unrelated do_swap_page() and its swapcache
has not been released, it will sleep again after re-checking
swapcache_prepare().

Too many unrelated wake-ups would just be a 'thundering herd', not
a livelock.
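
If a hash does become necessary, the bucket selection could look roughly
like the user-space sketch below. The names SWAPCACHE_WAIT_TABLE_BITS and
swapcache_wait_bucket() are hypothetical and not part of this patch; the
multiplier is the 64-bit golden-ratio constant the kernel's hash helpers
use, and 8 bits is only an assumed table size.

```c
#include <stdint.h>

/* Hypothetical table size: 2^8 = 256 buckets (an assumption, not from the patch). */
#define SWAPCACHE_WAIT_TABLE_BITS 8
#define SWAPCACHE_WAIT_TABLE_SIZE (1u << SWAPCACHE_WAIT_TABLE_BITS)

/* 64-bit golden-ratio multiplier, as used by the kernel's hash_64(). */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/*
 * Map a swap entry value to a wait-table bucket with a multiplicative
 * hash: multiply, then keep the top SWAPCACHE_WAIT_TABLE_BITS bits.
 * Waiters on different entries usually land in different buckets, so a
 * wake-up on one bucket disturbs fewer unrelated sleepers.
 */
static inline unsigned int swapcache_wait_bucket(uint64_t entry_val)
{
	return (unsigned int)((entry_val * GOLDEN_RATIO_64) >>
			      (64 - SWAPCACHE_WAIT_TABLE_BITS));
}
```

Collisions are still possible, so a woken task would still need to
re-check swapcache_prepare() exactly as with the single waitqueue.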

> Another thing to consider is that, if we are using a wait queue, the
> 1ms is not relevant any more. It can be longer than 1ms, since the task
> is getting woken up by the wait queue anyway. Here you might use an
> indefinite sleep to reduce the unnecessary wake-ups and the
> complexity of the timer.

Not quite sure what you mean by 1ms. In embedded systems we never
use 1000HZ; the typical/maximum HZ is 250. I'm also not quite sure what
you mean by "indefinitely sleep". My understanding is that we can't
poll the result of swapcache_prepare(), as the winner process
that calls swapcache_prepare() successfully will drop the
swap slots.
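
For reference, the nominal length of the one-tick timeout passed to
schedule_timeout_uninterruptible(1) depends on HZ. A trivial user-space
sketch of the arithmetic (tick_usecs() is a hypothetical helper, not a
kernel function):

```c
#include <stdint.h>

/*
 * Nominal length of one tick (jiffy) in microseconds for a given HZ.
 * This is only the arithmetic 1 second / HZ; the actual sleep can be
 * shorter because the timeout expires on a tick boundary.
 */
static inline uint32_t tick_usecs(uint32_t hz)
{
	return 1000000u / hz;
}
```

So one tick is nominally 1 ms only at HZ=1000; at HZ=250 it is 4 ms and
at HZ=100 it is 10 ms.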

>
> > goto out_page;
> > }
> > need_clear_cache = true;
> > @@ -4609,8 +4614,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > out:
> > /* Clear the swap cache pin for direct swapin after PTL unlock */
> > - if (need_clear_cache)
> > + if (need_clear_cache) {
> > swapcache_clear(si, entry, nr_pages);
> > + wake_up(&swapcache_wq);
>
> Agree with Ying that here the common path will need to take a lock to
> wake up the wait queue.

waitqueue_active() might be a good candidate.
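
The idea is that the waker checks for sleepers before taking the queue
lock, so the common no-waiter path stays lock-free. Below is a minimal
user-space model of that pattern using pthreads and C11 atomics; struct
model_wq and the model_*() helpers are hypothetical stand-ins, not kernel
APIs. In the kernel, the unlocked check needs a full memory barrier
before it (wq_has_sleeper() pairs the barrier with the check); the
sequentially consistent atomic load plays that role in this model.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a kernel wait queue head. */
struct model_wq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	atomic_int nr_waiters;	/* tasks currently queued */
	int locked_wakeups;	/* times the waker had to take the lock */
};

static void model_wq_init(struct model_wq *wq)
{
	pthread_mutex_init(&wq->lock, NULL);
	pthread_cond_init(&wq->cond, NULL);
	atomic_init(&wq->nr_waiters, 0);
	wq->locked_wakeups = 0;
}

/* Models add_wait_queue(): publish that a sleeper exists. */
static void model_add_waiter(struct model_wq *wq)
{
	atomic_fetch_add(&wq->nr_waiters, 1);
}

/* Models remove_wait_queue(). */
static void model_remove_waiter(struct model_wq *wq)
{
	atomic_fetch_sub(&wq->nr_waiters, 1);
}

/*
 * Models "if (waitqueue_active(wq)) wake_up(wq)": skip the lock
 * entirely when nobody is queued. Returns true if the lock was taken.
 */
static bool model_wake_up_if_active(struct model_wq *wq)
{
	if (atomic_load(&wq->nr_waiters) == 0)
		return false;	/* fast path: no sleepers, no lock */

	pthread_mutex_lock(&wq->lock);
	wq->locked_wakeups++;
	pthread_cond_broadcast(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
	return true;
}
```

Since failing swapcache_prepare() should be rare, the fast path would be
the one taken by nearly every do_swap_page() that clears the cache pin.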

>
> Chris
>
>
> > + }
> > if (si)
> > put_swap_device(si);
> > return ret;
> > @@ -4625,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio_unlock(swapcache);
> > folio_put(swapcache);
> > }
> > - if (need_clear_cache)
> > + if (need_clear_cache) {
> > swapcache_clear(si, entry, nr_pages);
> > + wake_up(&swapcache_wq);
> > + }
> > if (si)
> > put_swap_device(si);
> > return ret;
> > --
> > 2.34.1
> >

Thanks
Barry