Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

From: Barry Song
Date: Sun Mar 17 2024 - 22:41:49 EST


On Mon, Mar 18, 2024 at 2:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Barry Song <21cnbao@xxxxxxxxx> writes:
>
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Barry Song <21cnbao@xxxxxxxxx> writes:
> >>
> >> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >>
> >> >> Barry Song <21cnbao@xxxxxxxxx> writes:
> >> >>
> >> >> > From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >> >> >
> >> >> > On an embedded system like Android, more than half of anon memory is
> >> >> > actually in swap devices such as zRAM. For example, while an app is
> >> >> > switched to background, its most memory might be swapped-out.
> >> >> >
> >> >> > Now we have mTHP features, unfortunately, if we don't support large folios
> >> >> > swap-in, once those large folios are swapped-out, we immediately lose the
> >> >> > performance gain we can get through large folios and hardware optimization
> >> >> > such as CONT-PTE.
> >> >> >
> >> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> >> > whole.
> >> >> >
> >> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >> >> > case. It doesn't support swapin_readahead as large folios yet since this
> >> >> > kind of shared memory is much less than memory mapped by single process.
> >> >>
> >> >> In contrast, I still think that it's better to start with normal swap-in
> >> >> path, then expand to SWAP_SYCHRONOUS case.
> >> >
> >> > I'd rather try the reverse direction as non-sync anon memory is only around
> >> > 3% in a phone as my observation.
> >>
> >> Phone is not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a
> > small portion of all anonymous pages. The majority of anonymous pages are within
> > a single process.
>
> Yes. But IIUC, SWP_SYNCHRONOUS_IO is quite limited, they are set only
> for memory backed swap devices.

SWP_SYNCHRONOUS_IO is the most common case for embedded Linux. Note that
almost all Android/embedded devices use zRAM rather than a disk-based
device for swap.

And we can have an upper-limit order, or a new per-size control like
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/swapin, and set it
to 0 by default at first.
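
A rough sketch of what such a control might look like is below; it is
purely illustrative, and swapin_enabled_orders /
thp_swapin_suitable_orders() are made-up names rather than existing
kernel symbols:

/*
 * Hypothetical per-size knob, e.g.
 * /sys/kernel/mm/transparent_hugepage/hugepages-256kB/swapin (default 0),
 * folded into an order bitmap that masks the candidate orders computed
 * in do_swap_page().
 */
static unsigned long swapin_enabled_orders;	/* updated from sysfs writes */

static inline unsigned long thp_swapin_suitable_orders(unsigned long orders)
{
	/* only orders explicitly enabled for swap-in survive the mask */
	return orders & READ_ONCE(swapin_enabled_orders);
}

With everything defaulting to 0, behaviour stays identical to the current
single-page swap-in until someone opts in per size.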

>
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get
> > started on a hardware which I can easily reach and have enough hardware/test
> > resources on it. So we may take the first step which can be applied on
> > a real product
> > and improve its performance, and step by step, we broaden it and make it
> > widely useful to various areas in which I can't reach :-)
>
> We must guarantee the normal swap path runs correctly and has no
> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> So we have to put some effort on the normal path test anyway.
>
> > so probably we can have a sysfs "enable" entry with default "n" or
> > have a maximum
> > swap-in order as Ryan's suggestion [1] at the beginning,
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "
> >
> >>
> >> >>
> >> >> In normal swap-in path, we can take advantage of swap readahead
> >> >> information to determine the swapped-in large folio order. That is, if
> >> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
> >> >> and swapin a large folio.
> >> >
> >> > I am not quite sure we still need to depend on this. in do_anon_page,
> >> > we have broken the assumption and allocated a large folio directly.
> >>
> >> I don't think that we have a sophisticated policy to allocate large
> >> folio. Large folio could waste memory for some workloads, so I think
> >> that it's a good idea to allocate large folio always.
> >
> > i agree, but we still have the below check just like do_anon_page() has it,
> >
> > orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > BIT(PMD_ORDER) - 1);
> > orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >
> > in do_anon_page, we don't worry about the waste so much, the same
> > logic also applies to do_swap_page().
>
> As I said, "readahead" may help us from application/user specific
> configuration in most cases. It can be a starting point of "using mTHP
> automatically when it helps and not cause many issues".

I'd rather start from the simpler code path and deliver a real improvement
on phones and embedded Linux, which our team can actually reach :-)

>
> >>
> >> Readahead gives us an opportunity to play with the policy.
> >
> > I feel somehow the rules of the game have changed with an upper
> > limit for swap-in size. for example, if the upper limit is 4 order,
> > it limits folio size to 64KiB which is still a proper size for ARM64
> > whose base page can be 64KiB.
> >
> > on the other hand, while swapping out large folios, we will always
> > compress them as a whole(zsmalloc/zram patch will come in a
> > couple of days), if we choose to decompress a subpage instead of
> > a large folio in do_swap_page(), we might need to decompress
> > nr_pages times. for example,
> >
> > For large folios 16*4KiB, they are saved as a large object in zsmalloc(with
> > the coming patch), if we swap in a small folio, we decompress the large
> > object; next time, we will still need to decompress a large object. so
> > it is more sensible to swap in a large folio if we find those
> > swap entries are contiguous and were allocated by a large folio swap-out.
>
> I understand that there are some special requirements for ZRAM. But I
> don't think it's a good idea to force the general code to fit the
> requirements of a specific swap device too much. This is one of the
> reasons that I think that we should start with normal swap devices, then
> try to optimize for some specific devices.

I agree, but this gives us a good start. zRAM is not a niche device;
it widely represents embedded Linux.

>
> >>
> >> > On the other hand, compressing/decompressing large folios as a
> >> > whole rather than doing it one by one can save a large percent of
> >> > CPUs and provide a much lower compression ratio. With a hardware
> >> > accelerator, this is even faster.
> >>
> >> I am not against to support large folio for compressing/decompressing.
> >>
> >> I just suggest to do that later, after we play with normal swap-in.
> >> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
> >> swap. So, it seems natural to support large folio swap-in for normal
> >> swap-in firstly.
> >
> > I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal"
> > than the swapcache path since it is the majority.
>
> I don't think so. Most PC and server systems uses !SWAP_SYCHRONOUS
> swap devices.

The problem is that our team is entirely focused on phones; we won't have
any resources or bandwidth for PC and server. A more realistic goal is to
first let the solution benefit phones and similar embedded Linux, then
extend it step by step to more areas such as PC and server.

I'd be quite happy if you or other people could join the effort on PC and
server.

>
> > and on the other hand, a lot
> > of modification is required for the swapcache path. in OPPO's code[1], we did
> > bring-up both paths, but the swapcache path is much much more complicated
> > than the SYNC path and hasn't really noticeable improvement.
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12
>
> That's great. Please cleanup the code and post it to mailing list. Why
> doesn't it help? IIUC, it can optimize TLB at least.

I agree this can help, but most anon pages are mapped by a single process;
only a few pages go through the readahead code path on phones. That's why
there is no noticeable improvement in the end.

I understand all the benefits you mentioned for changing readahead, but
those kinds of pages are really rare, so improving that path doesn't do
much for Android devices.

>
> >>
> >> > So I'd rather more aggressively get large folios swap-in involved
> >> > than depending on readahead.
> >>
> >> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
> >> optimization too. The sub-pages that is not accessed by page fault can
> >> be treated as readahead. I think that is a better policy than
> >> allocating large folio always.

This is always true even in do_anonymous_page(), but we don't worry about
it too much as we have per-size control; userspace has the chance to set
its preferences.
/*
 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 * for this vma. Then filter out the orders that can't be allocated over
 * the faulting address and still be fully contained in the vma.
 */
orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
				  BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);

On the other hand, we are not always allocating large folios; we only
allocate a large folio when the swapped-out folio was large. This is quite
important for embedded Linux, where swap happens so often that more than
half of memory can be in swap. If we swap a folio out as a large folio but
swap it back in as small folios, we immediately lose all the advantages
such as fewer page faults, CONT-PTE, etc.
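
Roughly, the check could look like the sketch below (illustrative only,
not the exact patch code; is_contiguous_swap_run() is a made-up helper
name): before allocating a large folio in do_swap_page(), verify that the
nr swap PTEs covering the naturally aligned range come from the same swap
device with sequential offsets, i.e. they were most likely swapped out
together from one large folio.

/*
 * Illustrative helper, not the actual patch code: return true if the
 * nr swap PTEs starting at ptep form one contiguous run beginning at
 * "first", i.e. were likely swapped out from a single large folio.
 */
static bool is_contiguous_swap_run(pte_t *ptep, swp_entry_t first, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		pte_t pte = ptep_get(ptep + i);
		swp_entry_t entry;

		if (!is_swap_pte(pte))
			return false;
		entry = pte_to_swp_entry(pte);
		if (non_swap_entry(entry) ||
		    swp_type(entry) != swp_type(first) ||
		    swp_offset(entry) != swp_offset(first) + i)
			return false;
	}
	return true;
}

Only when such a run exists (and the order passes the per-size/VMA checks
above) would we allocate a folio of that order and bring the whole run in
at once; otherwise we fall back to the existing single-page path.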

> >
> > Considering the zsmalloc optimization, it would be a better choice to
> > always allocate
> > large folios if we find those swap entries are for a swapped-out large folio. as
> > decompressing just once, we get all subpages.
> > Some hardware accelerators are even able to decompress a large folio with
> > multi-hardware threads, for example, 16 hardware threads can compress
> > each subpage of a large folio at the same time, it is just as fast as
> > decompressing
> > one subpage.
> >
> > for platforms without the above optimizations, a proper upper limit
> > will help them
> > disable the large folios swap-in or decrease the impact. For example,
> > if the upper
> > limit is 0-order, we are just removing this patchset. if the upper
> > limit is 2 orders, we
> > are just like BASE_PAGE size is 16KiB.
> >
> >>
> >> >>
> >> >> To do that, we need to track whether the sub-pages are accessed. I
> >> >> guess we need that information for large file folio readahead too.
> >> >>
> >> >> Hi, Matthew,
> >> >>
> >> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> >> folio has been accessed?
> >> >>
> >> >> > Right now, we are re-faulting large folios which are still in swapcache as a
> >> >> > whole, this can effectively decrease extra loops and early-exitings which we
> >> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
> >> >> > rather than page. On the other hand, it can also decrease do_swap_page as
> >> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >> >> >
> >> >>
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry