Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
From: Ryan Roberts
Date: Mon Mar 18 2024 - 12:45:38 EST
>>> I agree phones are not the only platform. But Rome wasn't built in a
>>> day. I can only get
>>> started on a hardware which I can easily reach and have enough hardware/test
>>> resources on it. So we may take the first step which can be applied on
>>> a real product
>>> and improve its performance, and step by step, we broaden it and make it
>>> widely useful to various areas in which I can't reach :-)
>>
>> We must guarantee the normal swap path runs correctly and has no
>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> So we have to put some effort on the normal path test anyway.
>>
>>> so probably we can have a sysfs "enable" entry with default "n" or
>>> have a maximum
>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>
>>> "
>>> So in the common case, swap-in will pull in the same size of folio as was
>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>> sense to copy the whole folio up to a certain size.
>>> "
I thought about this a bit more. No clear conclusions, but hoped this might help
the discussion around policy:
The decision about the size of the THP is made at first fault, with some help
from user space and in future we might make decisions to split based on
munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
the THP out at some point in its lifetime should not impact on its size. It's
just being moved around in the system and the reason for our original decision
should still hold.
So from that PoV, it would be good to swap-in to the same size that was
swapped-out. But we only kind-of keep that information around, via the swap
entry contiguity and alignment. With that scheme it is possible that multiple
virtually adjacent but not physically contiguous folios get swapped-out to
adjacent swap slot ranges and then they would be swapped-in to a single, larger
folio. This is not ideal, and I think it would be valuable to try to maintain
the original folio size information with the swap slot. One way to do this would
be to store the original order for which the cluster was allocated in the
cluster. Then we at least know that a given swap slot is either for a folio of
that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
steal a bit from swap_map to determine which case it is? Or are there better
approaches?
Next we (I?) have concerns about wasting IO by swapping-in folios that are too
large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
say yes but I have no data. But on the other hand, memory is aged and
swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
original allocation size policy is good (it currently isn't) then a folio should
be sized to cover temporally close memory and if we need to access some of it,
chances are we need all of it.
If we think the IO concern is legitimate then we could define a threshold size
(sysfs?) for when we start swapping-in the folio in chunks. And how big should
those chunks be - one page, or the threshold size itself? Probably the latter?
And perhaps that threshold could also be used by zRAM to decide its upper limit
for compression chunk.
Perhaps we can learn from khugepaged here? I think it has programmable
thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
THP? I guess that exists for the same concerns about increased IO pressure?
If we think we will ever be swapping-in folios in chunks less than their
original size, then we need a separate mechanism to re-foliate them. We have
discussed a khugepaged-like approach for doing this asynchronously in the
background. I know that scares the Android folks, but David has suggested that
this could well be very cheap compared with khugepaged, because it would be
entirely limited to a single pgtable, so we only need the PTL. If we need this
mechanism anyway, perhaps we should develop it and see how it performs if
swap-in remains order-0? Although I guess that would imply not being able to
benefit from compressing THPs for the zRAM case.
I see all this as orthogonal to synchronous vs asynchronous swap devices. I
think the latter just implies that you might want to do some readahead to try to
cover up the latency? If swap is moving towards being folio-orientated, then
readahead also surely needs to be folio-orientated, but I think that should be
the only major difference.
Anyway, just some thoughts!
Thanks,
Ryan