Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
From: Usama Arif
Date: Thu Sep 05 2024 - 06:53:55 EST
On 05/09/2024 11:33, Barry Song wrote:
> On Thu, Sep 5, 2024 at 10:10 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>
>> On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>>
>>> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>>>>
>>>> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>>>>
>>>>> On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> [..]
>>>>>>>> I understand the point of doing this to unblock the synchronous large
>>>>>>>> folio swapin support work, but at some point we're gonna have to
>>>>>>>> actually handle the cases where a large folio being swapped in is
>>>>>>>> partially in the swap cache, zswap, the zeromap, etc.
>>>>>>>>
>>>>>>>> All these cases will need similar-ish handling, and I suspect we won't
>>>>>>>> just skip swapping in large folios in all these cases.
>>>>>>>
>>>>>>> I agree that this is definitely the goal. `swap_read_folio()` should be a
>>>>>>> dependable API that always returns reliable data, regardless of whether
>>>>>>> `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
>>>>>>> be held back. Significant efforts are underway to support large folios in
>>>>>>> `zswap`, and progress is being made. Not to mention we've already allowed
>>>>>>> `zeromap` to proceed, even though it doesn't support large folios.
>>>>>>>
>>>>>>> It's genuinely unfair to let the lack of mTHP support in `zeromap` and
>>>>>>> `zswap` hold swap-in hostage.
>>>>>>
>>>>>
>>>>> Hi Yosry,
>>>>>
>>>>>> Well, two points here:
>>>>>>
>>>>>> 1. I did not say that we should block the synchronous mTHP swapin work
>>>>>> for this :) I said the next item on the TODO list for mTHP swapin
>>>>>> support should be handling these cases.
>>>>>
>>>>> Thanks for your clarification!
>>>>>
>>>>>>
>>>>>> 2. I think two things are getting conflated here. Zswap needs to
>>>>>> support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
>>>>>> truly missing, and is outside the scope of zswap/zeromap, is being
>>>>>> able to support hybrid mTHP swapin.
>>>>>>
>>>>>> When swapping in an mTHP, the swapped entries can be on disk, in the
>>>>>> swapcache, in zswap, or in the zeromap. Even if all these things
>>>>>> support mTHPs individually, we essentially need support to form an
>>>>>> mTHP from swap entries in different backends. That's what I meant.
>>>>>> Actually if we have that, we may not really need mTHP swapin support
>>>>>> in zswap, because we can just form the large folio in the swap layer
>>>>>> from multiple zswap entries.
>>>>>>
>>>>>
>>>>> After further consideration, I've actually started to disagree with the idea
>>>>> of supporting hybrid swapin (forming an mTHP from swap entries in different
>>>>> backends). My reasoning is as follows:
>>>>
>>>> I do not have any data about this, so you could very well be right
>>>> here. Handling hybrid swapin could be simply falling back to the
>>>> smallest order we can swapin from a single backend. We can at least
>>>> start with this, and collect data about how many mTHP swapins fall back
>>>> due to hybrid backends. This way we only take the complexity if
>>>> needed.
>>>>
>>>> I did imagine though that it's possible for two virtually contiguous
>>>> folios to be swapped out to contiguous swap entries and end up in
>>>> different media (e.g. if only one of them is zero-filled). I am not
>>>> sure how rare it would be in practice.
>>>>
>>>>>
>>>>> 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
>>>>> would be an extremely rare case, as long as we're swapping out the mTHP as
>>>>> a whole and all the modules are handling it accordingly. It's highly
>>>>> unlikely to form this mix of zeromap, zswap, and swapcache unless the
>>>>> contiguous VMA virtual address happens to get some small folios with
>>>>> aligned and contiguous swap slots. Even then, they would need to be
>>>>> partially zeromap and partially non-zeromap, zswap, etc.
>>>>
>>>> As I mentioned, we can start simple and collect data for this. If it's
>>>> rare and we don't need to handle it, that's good.
>>>>
>>>>>
>>>>> As you mentioned, zeromap handles mTHP as a whole during swapping
>>>>> out, marking all subpages of the entire mTHP as zeromap rather than just
>>>>> a subset of them.
>>>>>
>>>>> And swap-in can also entirely map a swapcache which is a large folio based
>>>>> on our previous patchset which has been in mainline:
>>>>> "mm: swap: entirely map large folios found in swapcache"
>>>>> https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@xxxxxxxxx/
>>>>>
>>>>> It seems the only thing we're missing is zswap support for mTHP.
>>>>
>>>> It is still possible for two virtually contiguous folios to be swapped
>>>> out to contiguous swap entries. It is also possible that a large folio
>>>> is swapped out as a whole, then only a part of it is swapped in later
>>>> due to memory pressure. If that part is later reclaimed again and gets
>>>> added to the swapcache, we can run into the hybrid swapin situation.
>>>> There may be other scenarios as well, I did not think this through.
>>>>
>>>>>
>>>>> 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
>>>>> several software layers. I can share some pseudo code below:
>>>>
>>>> Yeah it definitely would be complex, so we need proper justification for it.
>>>>
>>>>>
>>>>> swap_read_folio()
>>>>> {
>>>>> 	if (zeromap_full)
>>>>> 		folio_read_from_zeromap()
>>>>> 	else if (zswap_map_full)
>>>>> 		folio_read_from_zswap()
>>>>> 	else {
>>>>> 		folio_read_from_swapfile()
>>>>> 		if (zeromap_partial)
>>>>> 			folio_read_from_zeromap_fixup() /* fill zero for partially zeromap subpages */
>>>>> 		if (zswap_partial)
>>>>> 			folio_read_from_zswap_fixup() /* zswap_load for partially zswap-mapped subpages */
>>>>>
>>>>> 		folio_mark_uptodate()
>>>>> 		folio_unlock()
>>>>> 	}
>>>>> }
>>>>>
>>>>> We'd also need to modify folio_read_from_swapfile() to skip
>>>>> folio_mark_uptodate() and folio_unlock() after completing the BIO. This
>>>>> approach seems to entirely disrupt the software layers.
>>>>>
>>>>> This could also lead to unnecessary IO operations for subpages that
>>>>> require fixup. Since such cases are quite rare, I believe the added
>>>>> complexity isn't worth it.
>>>>>
>>>>> My point is that we should simply check that all PTEs have consistent zeromap,
>>>>> zswap, and swapcache statuses before proceeding, otherwise fall back to the next
>>>>> lower order if needed. This approach improves performance and avoids complex
>>>>> corner cases.
>>>>
>>>> Agree that we should start with that, although we should probably
>>>> fall back to the largest order we can swapin from a single backend,
>>>> rather than the next lower order.
>>>>
>>>>>
>>>>> So once zswap mTHP is there, I would also expect an API similar to
>>>>> swap_zeromap_entries_check(), for example zswap_entries_check(entry, nr),
>>>>> which can return whether we have full, no, or partial zswap, to replace
>>>>> the existing zswap_never_enabled().
>>>>
>>>> I think a better API would be similar to what Usama had. Basically
>>>> take in (entry, nr) and return how much of it is in zswap starting at
>>>> entry, so that we can decide the swapin order.
>>>>
>>>> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
>>>> to do that? Basically return the number of swap entries in the zeromap
>>>> starting at 'entry'. If 'entry' itself is not in the zeromap we return
>>>> 0 naturally. That would be a small adjustment/fix over what Usama had,
>>>> but implementing it with bitmap operations like you did would be
>>>> better.
>>>
>>> I assume you mean the below
>>>
>>> /*
>>>  * Return the number of contiguous zeromap entries started from entry
>>>  */
>>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
>>> {
>>> 	struct swap_info_struct *sis = swp_swap_info(entry);
>>> 	unsigned long start = swp_offset(entry);
>>> 	unsigned long end = start + nr;
>>> 	unsigned long idx;
>>>
>>> 	idx = find_next_bit(sis->zeromap, end, start);
>>> 	if (idx != start)
>>> 		return 0;
>>>
>>> 	return find_next_zero_bit(sis->zeromap, end, start) - idx;
>>> }
>>>
>>> If yes, I really like this idea.
>>>
>>> It seems much better than using an enum, which would require adding a new
>>> data structure :-) Additionally, returning the number allows callers
>>> to fall back to the largest possible order, rather than trying next
>>> lower orders sequentially.
>>
>> No, returning 0 after only checking first entry would still reintroduce
>> the current bug, where the start entry is zeromap but other entries
>> might not be. We need another value to indicate whether the entries
>> are consistent if we want to avoid the enum:
>>
>> /*
>>  * Return the number of contiguous zeromap entries started from entry;
>>  * If all entries have consistent zeromap, *consistent will be true;
>>  * otherwise, false;
>>  */
>> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
>> 		int nr, bool *consistent)
>> {
>> 	struct swap_info_struct *sis = swp_swap_info(entry);
>> 	unsigned long start = swp_offset(entry);
>> 	unsigned long end = start + nr;
>> 	unsigned long s_idx, c_idx;
>>
>> 	s_idx = find_next_bit(sis->zeromap, end, start);
>> 	if (s_idx == end) {
>> 		*consistent = true;
>> 		return 0;
>> 	}
>>
>> 	c_idx = find_next_zero_bit(sis->zeromap, end, start);
>> 	if (c_idx == end) {
>> 		*consistent = true;
>> 		return nr;
>> 	}
>>
>> 	*consistent = false;
>> 	if (s_idx != start)
>> 		return 0;
>> 	return c_idx - s_idx;
>> }
>>
>> I can actually switch the places of the "consistent" and returned
>> number if that looks better.
>
> I'd rather make it simpler by:
>
> /*
>  * Check if all entries have consistent zeromap status, return true if
>  * all entries are zeromap or non-zeromap, else return false;
>  */
> static inline bool swap_zeromap_entries_check(swp_entry_t entry, int nr)
> {
> 	struct swap_info_struct *sis = swp_swap_info(entry);
> 	unsigned long start = swp_offset(entry);
> 	unsigned long end = start + *nr;
>
I guess you meant end = start + nr here?
> 	if (find_next_bit(sis->zeromap, end, start) == end)
> 		return true;
> 	if (find_next_zero_bit(sis->zeromap, end, start) == end)
> 		return true;
>
So if the zeromap is all false, this still returns true. We can't use this function in
swap_read_folio_zeromap() to check at swapin time whether all entries were zero-filled, right?
> 	return false;
> }
>
> mm/page_io.c can combine this with reading the zeromap of the first entry to
> decide if it will read folio from zeromap; mm/memory.c only needs the bool
> to fall back to the largest possible order.
>
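To make the "consistent" vs. "all zero-filled" distinction concrete, here is a
minimal userspace sketch, not kernel code. It assumes a plain unsigned long
array in place of sis->zeromap, and the helper names (bit_is_set,
range_consistent, range_all_zero_filled) are hypothetical. The consistency
check alone returns true for a range with no zeromap bits set; only combining
it with the first entry's bit says the whole range is zero-filled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: test bit i of a plain unsigned long bitmap. */
static bool bit_is_set(const unsigned long *map, unsigned long i)
{
	return (map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

/* True if every bit in [start, start + nr) has the same value. */
static bool range_consistent(const unsigned long *map, unsigned long start,
			     unsigned long nr)
{
	bool first;
	unsigned long i;

	assert(nr > 0);
	first = bit_is_set(map, start);
	for (i = start + 1; i < start + nr; i++)
		if (bit_is_set(map, i) != first)
			return false;
	return true;
}

/*
 * True only if the range is consistent AND its first bit is set,
 * i.e. every entry in the range is marked zero-filled.
 */
static bool range_all_zero_filled(const unsigned long *map,
				  unsigned long start, unsigned long nr)
{
	return range_consistent(map, start, nr) && bit_is_set(map, start);
}
```

With bits 0-3 set and bits 4-7 clear, range_consistent() is true for both
[0, 4) and [4, 8), but range_all_zero_filled() is true only for [0, 4).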
> static inline unsigned long thp_swap_suitable_orders(...)
> {
> 	int order, nr;
>
> 	order = highest_order(orders);
>
> 	while (orders) {
> 		nr = 1 << order;
> 		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr &&
> 		    swap_zeromap_entries_check(entry, nr))
> 			break;
> 		order = next_order(&orders, order);
> 	}
>
> 	return orders;
> }
>
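A rough userspace sketch of the order fallback discussed above, under stated
assumptions: the bitmap is a plain unsigned long array standing in for the
zeromap, the candidate orders arrive as a bitmask (bit N set means order N is
allowed), and MAX_SKETCH_ORDER, bit_at, range_uniform and
largest_uniform_order are hypothetical names. It walks from the highest
candidate order down and returns the first one whose naturally aligned range
around the offset has a uniform zeromap status. The virtual address alignment
check from the quoted loop is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAX_SKETCH_ORDER 9	/* assumption: arbitrary cap for this sketch */

/* Hypothetical helper: test bit i of a plain unsigned long bitmap. */
static bool bit_at(const unsigned long *map, unsigned long i)
{
	return (map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

/* True if every bit in [start, start + nr) has the same value. */
static bool range_uniform(const unsigned long *map, unsigned long start,
			  unsigned long nr)
{
	bool first;
	unsigned long i;

	assert(nr > 0);
	first = bit_at(map, start);
	for (i = start + 1; i < start + nr; i++)
		if (bit_at(map, i) != first)
			return false;
	return true;
}

/*
 * Return the largest order in 'orders' whose aligned range containing
 * 'offset' is uniform in the bitmap, or 0 if no candidate qualifies.
 * The caller must make sure the bitmap covers every candidate range.
 */
static unsigned int largest_uniform_order(const unsigned long *map,
					  unsigned long offset,
					  unsigned long orders)
{
	int order;

	for (order = MAX_SKETCH_ORDER; order > 0; order--) {
		unsigned long nr, start;

		if (!(orders & (1UL << order)))
			continue;
		nr = 1UL << order;
		start = offset & ~(nr - 1);	/* align down to nr entries */
		if (range_uniform(map, start, nr))
			return order;
	}
	return 0;
}
```

Returning 0 here just means "use a single page"; how the real caller treats
that case is outside what this sketch models.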
>>
>>>
>>> Hi Usama,
>>> what is your take on this?
>>>
>>>>
>>>>>
>>>>> Though I am not sure how cheap zswap can implement it,
>>>>> swap_zeromap_entries_check() could be two simple bit operations:
>>>>>
>>>>> +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
>>>>> +{
>>>>> +	struct swap_info_struct *sis = swp_swap_info(entry);
>>>>> +	unsigned long start = swp_offset(entry);
>>>>> +	unsigned long end = start + nr;
>>>>> +
>>>>> +	if (find_next_bit(sis->zeromap, end, start) == end)
>>>>> +		return SWAP_ZEROMAP_NON;
>>>>> +	if (find_next_zero_bit(sis->zeromap, end, start) == end)
>>>>> +		return SWAP_ZEROMAP_FULL;
>>>>> +
>>>>> +	return SWAP_ZEROMAP_PARTIAL;
>>>>> +}
>>>>>
>>>>> 3. swapcache is different from zeromap and zswap. Swapcache indicates
>>>>> that the memory is still available and should be re-mapped rather than
>>>>> allocating a new folio. Our previous patchset has implemented a full
>>>>> re-map of an mTHP in do_swap_page() as mentioned in 1.
>>>>>
>>>>> For the same reason as point 1, partial swapcache is a rare edge case.
>>>>> Not re-mapping it and instead allocating a new folio would add
>>>>> significant complexity.
>>>>>
>>>>>>>
>>>>>>> Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
>>>>>>> permit almost all mTHP swap-ins, except for those rare situations where
>>>>>>> small folios that were swapped out happen to have contiguous and aligned
>>>>>>> swap slots.
>>>>>>>
>>>>>>> swapcache is another quite different story, since our user scenarios begin from
>>>>>>> the simplest sync io on mobile phones, we don't quite care about swapcache.
>>>>>>
>>>>>> Right. The reason I bring this up is as I mentioned above, there is a
>>>>>> common problem of forming large folios from different sources, which
>>>>>> includes the swap cache. The fact that synchronous swapin does not use
>>>>>> the swapcache was a happy coincidence for you, as you can add support
>>>>>> for mTHP swapins without handling this case yet ;)
>>>>>
>>>>> As I mentioned above, I'd really rather filter out those corner cases
>>>>> than support them, not just for the current situation to unlock
>>>>> swap-in series :-)
>>>>
>>>> If they are indeed corner cases, then I definitely agree.
>>>
>
> Thanks
> Barry