Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted

From: David Hildenbrand
Date: Fri Jun 07 2024 - 03:11:53 EST


On 06.06.24 23:53, Barry Song wrote:
On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:

On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@xxxxxxxxx> wrote:

On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 06.06.24 22:31, Yosry Ahmed wrote:
On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 06.06.24 20:48, Yosry Ahmed wrote:
With ongoing work to support large folio swapin, it is important to make
sure we do not pass large folios to zswap_load() without implementing
proper support.

For example, if a swapin fault observes that contiguous PTEs are
pointing to contiguous swap entries and tries to swap them in as a large
folio, swap_read_folio() will pass in a large folio to zswap_load(), but
zswap_load() will only effectively load the first page in the folio. If
the first page is not in zswap, the folio will be read from disk, even
though other pages may be in zswap.

In both cases, this will lead to silent data corruption.
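A simplified sketch of why only the first page is touched (lookup
approximated, locking and error handling elided):

	bool zswap_load(struct folio *folio)
	{
		swp_entry_t swp = folio->swap;
		pgoff_t offset = swp_offset(swp);	/* first entry only */
		struct page *page = &folio->page;	/* head page only */
		struct zswap_entry *entry;

		/* lookup covers a single swap slot, not folio_nr_pages() slots */
		entry = xa_load(swap_zswap_tree(swp), offset);	/* approximated */
		if (!entry)
			return false;

		zswap_decompress(entry, page);	/* fills exactly one page */
		return true;
	}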

Proper large folio swapin support needs to go into zswap before zswap
can be enabled in a system that supports large folio swapin.

Looking at the callers of swap_read_folio(), it seems like the folios
passed in are either allocated from __read_swap_cache_async() or by
do_swap_page() in the SWP_SYNCHRONOUS_IO path. Both allocate order-0
folios, so we are fine for now.

Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
the order of those allocations without proper handling of zswap.

Alternatively, swap_read_folio() (or its callers) can be updated to have
a fallback mechanism that splits large folios or reads subpages
separately. Similar logic may be needed anyway in case part of a large
folio is already in the swapcache and the rest of it is swapped out.
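As a rough sketch of the "read subpages separately" option (all helper
names below are hypothetical, not existing kernel APIs):

	static int swap_read_folio_per_page(struct folio *folio,
					    struct swap_iocb **plug)
	{
		long i, nr = folio_nr_pages(folio);

		for (i = 0; i < nr; i++) {
			struct page *page = folio_page(folio, i);

			/* zswap_load_page()/swap_read_page() are invented names */
			if (!zswap_load_page(page) && swap_read_page(page, plug))
				return -EIO;
		}
		folio_mark_uptodate(folio);
		return 0;
	}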

Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
---

Sorry for the long CC list, I just found myself repeatedly looking at
new series that add swap support for mTHPs / large folios, making sure
they do not break with zswap or make incorrect assumptions. This debug
check should give us some peace of mind. Hopefully this patch will also
raise awareness among people who are working on this.

---
mm/zswap.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9be..6007252429bb2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
 	if (!entry)
 		return false;
 
+	/* Zswap loads do not handle large folio swapins correctly yet */
+	VM_BUG_ON(folio_test_large(folio));
+

There is no way we could have a WARN_ON_ONCE() and recover, right?

Not without making more fundamental changes to the surrounding swap
code. Currently zswap_load() returns either true (folio was loaded
from zswap) or false (folio is not in zswap).

To handle this correctly zswap_load() would need to tell
swap_read_folio() which subpages are in zswap and have been loaded,
and then swap_read_folio() would need to read the remaining subpages
from disk. This of course assumes that the caller of swap_read_folio()
made sure that the entire folio is swapped out and protected against
races with other swapins.

Also, because swap_read_folio() cannot split the folio itself, other
swap_read_folio_*() functions that are called from it should be
updated to handle swapping in tail subpages, which may be questionable
in its own right.

An alternative would be that zswap_load() (or a separate interface)
could tell swap_read_folio() that the folio is partially in zswap,
then we can just bail and tell the caller that it cannot read the
large folio and that it should be split.
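For illustration, that alternative could look roughly like this
(zswap_present() is an invented name, and swap_read_folio()'s signature
is changed here purely for the sketch):

	int swap_read_folio(struct folio *folio, struct swap_iocb **plug)
	{
		/* bail out; the caller splits the folio and retries per page */
		if (folio_test_large(folio) && zswap_present(folio->swap))
			return -EOPNOTSUPP;

		/* ... existing read paths (zswap, bdev, swapfile) as today ... */
		return 0;
	}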

There may be other options as well, but the bottom line is that it is
possible, but probably not something that we want to do right now.

A stronger protection method would be to introduce a config option or
boot parameter for large folio swapin, and then make CONFIG_ZSWAP
depend on it being disabled, or have zswap check it at boot and refuse
to be enabled if it is on.
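A sketch of the boot-time variant (CONFIG_LARGE_FOLIO_SWAPIN is a
hypothetical option and the init function name is approximate;
zswap_enabled is the existing module parameter):

	static int zswap_setup(void)
	{
		if (IS_ENABLED(CONFIG_LARGE_FOLIO_SWAPIN)) {
			pr_err("zswap: large folio swapin unsupported, not enabling zswap\n");
			zswap_enabled = false;
			return -EINVAL;
		}
		/* ... normal pool/compressor setup ... */
		return 0;
	}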

Right, sounds like the VM_BUG_ON() really is not that easily avoidable.

I was wondering if we could WARN_ON_ONCE() and make the swap code treat
this like a read error from disk.

I think do_swap_page() detects that by checking if the folio is not
uptodate:

	if (unlikely(!folio_test_uptodate(folio))) {
		ret = VM_FAULT_SIGBUS;
		goto out_nomap;
	}

So maybe WARN_ON_ONCE() + triggering that path might be a bit nicer to
the system (though the app would crash either way; there is no way
around that).
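In swap_read_folio(), the idea might look roughly like this (a sketch
assuming zswap_load() is changed from bool to int; the -EINVAL
convention is invented):

	int ret = zswap_load(folio);

	if (ret == -EINVAL) {
		/* large folio reached zswap: warn and fake a read error */
		WARN_ON_ONCE(1);
		folio_unlock(folio);	/* left !uptodate => do_swap_page() SIGBUSes */
	} else if (ret == 1) {
		folio_mark_uptodate(folio);
		folio_unlock(folio);
	} else {
		/* not in zswap: fall through to disk as before */
	}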


I'd rather fall back to small folio swapin instead of crashing apps until
we fix large folio swapin in zswap :-)

I think David is referring to catching the buggy cases that do not
properly fall back to small folios with zswap, not as an alternative to
the fallback. That is at least what I had in mind with the patch.

Cool, thanks for the clarification. I'm fine with keeping the fallback,
whether the check is the current VM_BUG_ON() or David's suggested
SIGBUS.

Currently, mainline doesn't support large folio swap-in, so I see your
patch as a helpful warning for those attempting large folio swap-in,
guarding against mistakes like loading large folios from zswap.

In fact, I spent a week trying to figure out why my app was crashing
before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
had come earlier, it would have saved me that week of work :-)
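For context, that workaround in a (then out-of-tree) mTHP swapin path
looks roughly like this (alloc_swap_folio() is illustrative, not
mainline code; zswap_is_enabled() is the existing helper):

	static struct folio *alloc_swap_folio(struct vm_fault *vmf)
	{
		/* zswap cannot load large folios yet, so don't try mTHP swapin */
		if (zswap_is_enabled())
			goto fallback;

		/* ... attempt a large (mTHP) folio allocation here ... */

	fallback:
		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma,
				       vmf->address, false);
	}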

To me, a direct crash seems like a more straightforward way to prompt
people to fall back when attempting large folio swap-in, so I am
slightly in favor of your current patch.

BUG_ON() and friends are frowned upon, in particular in scenarios where we can recover, or where the condition is so unexpected that it would be caught during early testing anyway.

I have no strong opinion on this one, but likely a VM_WARN_ON would also be sufficient to find such issues early during testing. No need to crash the machine.

--
Cheers,

David / dhildenb