Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

From: Huang, Ying
Date: Wed Mar 20 2024 - 02:22:48 EST


Barry Song <21cnbao@xxxxxxxxx> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>> >>
>> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >>>>>> day. I can only get started on a hardware which I can easily reach and
>> >>>>>> have enough hardware/test resources on it. So we may take the first step
>> >>>>>> which can be applied on a real product and improve its performance, and
>> >>>>>> step by step, we broaden it and make it widely useful to various areas
>> >>>>>> in which I can't reach :-)
>> >>>>>
>> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >>>>> So we have to put some effort on the normal path test anyway.
>> >>>>>
>> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or have
>> >>>>>> a maximum swap-in order as Ryan's suggestion [1] at the beginning,
>> >>>>>>
>> >>>>>> "
>> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >>>>>> sense to copy the whole folio up to a certain size.
>> >>>>>> "
>> >>>
>> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >>> the discussion around policy:
>> >>>
>> >>> The decision about the size of the THP is made at first fault, with some help
>> >>> from user space and in future we might make decisions to split based on
>> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >>> just being moved around in the system and the reason for our original decision
>> >>> should still hold.
>> >>>
>> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >>> swapped-out.
>> >>
>> >> Sorry, I don't agree with this. It's better to swap-in and swap-out in
>> >> the smallest size if the page is only seldom accessed, to avoid wasting
>> >> memory.
>> >
>> > If we want to optimize only for memory consumption, I'm sure there are many
>> > things we would do differently. We need to find a balance between memory and
>> > performance. The benefits of folios are well documented and the kernel is
>> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> > think it's as simple as saying we should always swap-in the smallest possible
>> > amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> the same large folio will not be accessed during this period.
>>
>> So, I suggest creating an algorithm that decides the swap-in order
>> automatically based on swap-readahead information. It can detect the
>> situation above via a reduced swap readahead window size. And, if the
>> page is accessed for quite a long time, and the adjacent pages in the
>> same large folio are accessed too, the swap-readahead window will
>> increase and a large swap-in order will be used.
>
> The original size of do_anonymous_page() should be honored, considering it
> embodies a decision influenced by not only sysfs settings and per-vma
> HUGEPAGE hints but also architectural characteristics, for example
> CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O,
> but it entirely disassociates the size of the swap in from the size prior to the
> swap out.

Readahead isn't the only factor that determines the folio order. For
example, we must respect the "never" policy and always allocate order-0
folios in that case. There's also no requirement to reuse the swap-out
order for swap-in. Memory allocation has different performance
characteristics from storage reading.
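
As a rough sketch of how these factors could combine (not code from the
patch series): the function below derives a candidate order from the
readahead window and then clamps it to the orders the policy allows. The
names pick_swapin_order, ra_window and enabled_orders are hypothetical,
and order 0 is assumed to be always allowed, which is how a "never"
policy ends up with order-0 folios.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * ra_window:      current swap readahead window, in pages
 * enabled_orders: bitmask of folio orders the policy allows (bit N set
 *                 means order N is allowed); order 0 is assumed to be
 *                 allowed regardless of the mask
 */
unsigned int pick_swapin_order(unsigned int ra_window,
                               unsigned long enabled_orders)
{
    unsigned int order = 0;

    /* Largest order whose page count (1 << order) fits in the window. */
    while (order < 31 && (2u << order) <= ra_window)
        order++;

    assert(order < 8 * sizeof(enabled_orders));

    /* Step down to the nearest allowed order; order 0 always qualifies. */
    while (order > 0 && !(enabled_orders & (1ul << order)))
        order--;

    return order;
}
```

With a window of 8 pages and only orders 2 and 4 enabled, this picks
order 2. With an empty mask it always picks order 0, whatever the window
size is.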

> Moreover, there's no guarantee that the large folio generated by
> the readahead window is contiguous in the swap and can be added to the
> swap cache, as we are currently dealing with folio->swap instead of
> subpage->swap.

Yes. We can optimize only when all conditions are satisfied, just like
any other optimization.
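
For illustration, one such condition is the one raised above: the swap
entries covered by the candidate folio must be contiguous, and they are
assumed here to be naturally aligned to the folio size as well. A minimal
sketch, with swap_range_ok and the flat offsets array as hypothetical
stand-ins for the real swap entry lookup:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check, for illustration only.
 *
 * offsets: swap offsets of the (1 << order) pages that would make up the
 *          candidate folio, in virtual address order
 * Returns true only if the range starts on an order-aligned offset and
 * every following offset is exactly one more than the previous one.
 */
bool swap_range_ok(const unsigned long *offsets, unsigned int order)
{
    size_t nr = (size_t)1 << order;

    assert(offsets != NULL);

    /* The first slot must be naturally aligned to the folio size. */
    if (offsets[0] & (nr - 1))
        return false;

    /* All remaining slots must follow contiguously. */
    for (size_t i = 1; i < nr; i++) {
        if (offsets[i] != offsets[0] + i)
            return false;
    }

    return true;
}
```

If the check fails, the caller would fall back to a smaller order,
ultimately order 0.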

> Incidentally, do_anonymous_page() serves as the initial location for allocating
> large folios. Given that memory conservation is a significant consideration in
> do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes. We should consider that too. IIUC, that is why mTHP support is
off by default for now. After we find a way to solve the memory usage
issue, we may make the default "on".

> A large folio, by its nature, represents a high-quality resource that has the
> potential to leverage hardware characteristics for the benefit of the
> entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the
> readahead window possesses the same advantageous qualities.

There's a readahead algorithm behind it, and it is not purely random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO
devices. I just suggest working on the general code before the specific
optimization.

>> > You also said we should swap *out* in smallest size possible. Have I
>> > misunderstood you? I thought the case for swapping-out a whole folio without
>> > splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>> >>
>> >>> But we only kind-of keep that information around, via the swap
>> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >>> the original folio size information with the swap slot. One way to do this would
>> >>> be to store the original order for which the cluster was allocated in the
>> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >>> approaches?
>> >>
>> >> [snip]

--
Best Regards,
Huang, Ying