Re: [PATCH] mm: zswap: add VM_BUG_ON() if large folio swapin is attempted
From: Barry Song
Date: Fri Jun 07 2024 - 17:21:02 EST
On Sat, Jun 8, 2024 at 5:43 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 06.06.24 23:53, Barry Song wrote:
> > > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >>
> > >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >>>
> > >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> On 06.06.24 22:31, Yosry Ahmed wrote:
> > >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote:
> > >>>>>>> With ongoing work to support large folio swapin, it is important to make
> > >>>>>>> sure we do not pass large folios to zswap_load() without implementing
> > >>>>>>> proper support.
> > >>>>>>>
> > >>>>>>> For example, if a swapin fault observes that contiguous PTEs are
> > >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large
> > >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > >>>>>>> zswap_load() will only effectively load the first page in the folio. If
> > >>>>>>> the first page is not in zswap, the folio will be read from disk, even
> > >>>>>>> though other pages may be in zswap.
> > >>>>>>>
> > >>>>>>> In both cases, this will lead to silent data corruption.
> > >>>>>>>
> > >>>>>>> Proper large folio swapin support needs to go into zswap before zswap
> > >>>>>>> can be enabled in a system that supports large folio swapin.
> > >>>>>>>
> > >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either
> > >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the
> > >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > >>>>>>> are fine for now.
> > >>>>>>>
> > >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > >>>>>>> the order of those allocations without proper handling of zswap.
> > >>>>>>>
> > >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> > >>>>>>> a fallback mechanism that splits large folios or reads subpages
> > >>>>>>> separately. Similar logic may be needed anyway in case part of a large
> > >>>>>>> folio is already in the swapcache and the rest of it is swapped out.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > >>>>>>> ---
> > >>>>>>>
> > >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at
> > >>>>>>> new series that add swap support for mTHPs / large folios, making sure
> > >>>>>>> they do not break with zswap or make incorrect assumptions. This debug
> > >>>>>>> check should give us some peace of mind. Hopefully this patch will also
> > >>>>>>> raise awareness among people who are working on this.
> > >>>>>>>
> > >>>>>>> ---
> > >>>>>>> mm/zswap.c | 3 +++
> > >>>>>>> 1 file changed, 3 insertions(+)
> > >>>>>>>
> > >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> > >>>>>>> index b9b35ef86d9be..6007252429bb2 100644
> > >>>>>>> --- a/mm/zswap.c
> > >>>>>>> +++ b/mm/zswap.c
> > >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > >>>>>>> if (!entry)
> > >>>>>>> return false;
> > >>>>>>>
> > >>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */
> > >>>>>>> + VM_BUG_ON(folio_test_large(folio));
> > >>>>>>> +
> > >>>>>>
> > >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > >>>>>
> > >>>>> Not without making more fundamental changes to the surrounding swap
> > >>>>> code. Currently zswap_load() returns either true (folio was loaded
> > >>>>> from zswap) or false (folio is not in zswap).
> > >>>>>
> > >>>>> To handle this correctly zswap_load() would need to tell
> > >>>>> swap_read_folio() which subpages are in zswap and have been loaded,
> > >>>>> and then swap_read_folio() would need to read the remaining subpages
> > >>>>> from disk. This of course assumes that the caller of swap_read_folio()
> > >>>>> made sure that the entire folio is swapped out and protected against
> > >>>>> races with other swapins.
> > >>>>>
> > >>>>> Also, because swap_read_folio() cannot split the folio itself, other
> > >>>>> swap_read_folio_*() functions that are called from it should be
> > >>>>> updated to handle swapping in tail subpages, which may be questionable
> > >>>>> in its own right.
> > >>>>>
> > >>>>> An alternative would be that zswap_load() (or a separate interface)
> > >>>>> could tell swap_read_folio() that the folio is partially in zswap,
> > >>>>> then we can just bail and tell the caller that it cannot read the
> > >>>>> large folio and that it should be split.
> > >>>>>
> > >>>>> There may be other options as well, but the bottom line is that it is
> > >>>>> possible, but probably not something that we want to do right now.
> > >>>>>
> > >>>>> A stronger protection method would be to introduce a config option or
> > >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > >>>>> depend on it being disabled, or have zswap check it at boot and refuse
> > >>>>> to be enabled if it is on.
> > >>>>
> > >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> > >>>>
> > >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> > >>>> this like a read-error from disk.
> > >>>>
> > >>>> I think do_swap_page() detects that by checking if the folio is not
> > >>>> uptodate:
> > >>>>
> > >>>> if (unlikely(!folio_test_uptodate(folio))) {
> > >>>> ret = VM_FAULT_SIGBUS;
> > >>>> goto out_nomap;
> > >>>> }
> > >>>>
> > >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > >>>> system (but the app would crash either way, there is no way around it).
> > >>>>
> > >>>
> > >>> I'd rather fallback to small folios swapin instead crashing apps till we fix
> > >>> the large folio swapin in zswap :-)
> > >>
> > >> I think David is referring to catching the buggy cases that do not
> > >> properly fallback to small folios with zswap, not as an alternative to
> > >> the fallback. This is at least what I had in mind with the patch.
> > >
> > > Cool. Thanks for the clarification. I'm fine with keeping the fallback,
> > > whether it's the current VM_BUG_ON or David's recommended
> > > SIGBUS.
> > >
> > > Currently, mainline doesn't support large folios swap-in, so I see
> > > your patch as a helpful warning for those attempting large folio
> > > swap-in to avoid making mistakes like loading large folios from
> > > zswap.
> > >
> > > In fact, I spent a week trying to figure out why my app was crashing
> > > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
> > > had come earlier, it would have saved me that week of work :-)
> > >
> > > To me, a direct crash seems like a more straightforward way to
> > > prompt people to fallback when attempting large folio swap-ins.
> > > So, I am slightly in favor of your current patch.
> >
> > BUG_ON and friends are frowned-upon, in particular in scenarios where we
> > can recover or that are so unexpected that they can be found during
> > early testing.
> >
> > I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > be sufficient to find such issues early during testing. No need to crash
> > the machine.
>
> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> some digging I found your patches to checkpatch and Linus clearly
> stating that it isn't.
>
> How about something like the following (untested), it is the minimal
> recovery we can do but should work for a lot of cases, and does
> nothing beyond a warning if we can swapin the large folio from disk:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f1a9cfab6e748..8f441dd8e109f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> swap_iocb **plug)
> delayacct_swapin_start();
>
> if (zswap_load(folio)) {
> - folio_mark_uptodate(folio);
> folio_unlock(folio);
> } else if (data_race(sis->flags & SWP_FS_OPS)) {
> swap_read_folio_fs(folio, plug);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6007252429bb2..cc04db6bb217e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> + /*
> + * Large folios should not be swapped in while zswap is being used, as
> + * they are not properly handled.
> + *
> + * If any of the subpages are in zswap, reading from disk would result
> + * in data corruption, so return true without marking the folio uptodate
> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> + *
> + * Otherwise, return false and read the folio from disk.
> + */
> + if (WARN_ON_ONCE(folio_test_large(folio))) {
> + if (xa_find(tree, &offset, offset +
> folio_nr_pages(folio) - 1, 0))
> + return true;
> + return false;
> + }
> +
> /*
> * When reading into the swapcache, invalidate our entry. The
> * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> zswap_entry_free(entry);
> folio_mark_dirty(folio);
> }
> -
> + folio_mark_uptodate(folio);
> return true;
> }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> static key if zswap was "ever" enabled.
>
> Barry, I suspect your is_zswap_enabled() check is deficient for
> similar reasons, zswap could have been enabled before then became
> disabled.
I don't understand this. if zswap was enabled before but is disabled when
I am loading data, will I get corrupted data before zswap was once enabled?
If not, it seems nothing important.
Thanks
Barry