Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting

From: Barry Song
Date: Tue Feb 27 2024 - 21:48:05 EST


On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> + /*
> >> + * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> + * so indicate that we are scanning to synchronise with swapoff.
> >> + */
> >> + si->flags += SWP_SCANNING;
> >> + ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> + si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >> if (!can_split_folio(folio, NULL))
> >> goto activate_locked;
> >> /*
> >> - * Split folios without a PMD map right
> >> - * away. Chances are some or all of the
> >> - * tail pages can be freed without IO.
> >> + * Split PMD-mappable folios without a
> >> + * PMD map right away. Chances are some
> >> + * or all of the tail pages can be freed
> >> + * without IO.
> >> */
> >> - if (!folio_entire_mapcount(folio) &&
> >> + if (folio_test_pmd_mappable(folio) &&
> >> + !folio_entire_mapcount(folio) &&
> >> split_folio_to_list(folio,
> >> folio_list))
> >> goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. this might cause applications get
> > wrong data 0 while swapping-in.
> >
> > in case one thread(T1) is reclaiming a large folio by some means, still
> > another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> > we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> > and T2 does split as below,
>

Hi Ryan,

> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is this a
> separate problem to the race you solved with TTU_SYNC or is this solving the
> same problem?

They are the same.

After sending you the report about the races, I spent some time and
finally figured
out what was happening, why corrupted data came while swapping in. it
is absolutely
not your fault, but TTU_SYNC does somehow resolve my problem though it is not
the root cause. this corrupted data only can reproduce after applying
patch 4[1] of
swap-in series,
[1] [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@xxxxxxxxx/

In case we have a large folio with 16 PTEs as below, and after
add_to_swap(), they get
swapoffset 0x10000, their PTEs are all present as they are still mapped.
PTE pte_stat
PTE0 present
PTE1 present
PTE2 present
PTE3 present
..
PTE15 present

then we get to try_to_unmap_one, as try_to_unmap_one doesn't hold PTL
from PTE0, while it scans PTEs, we might have
PTE pte_stat
PTE0 none (someone is writing PTE0 for various reasons)
PTE1 present
PTE2 present
PTE3 present
..
PTE15 present

We hold PTL from PTE1.

after try_to_unmap_one, PTEs become

PTE pte_stat
PTE0 present (someone finished the write of PTE0)
PTE1 swap 0x10001
PTE2 swap 0x10002
PTE3 swap 0x10003
..
..
PTE15 swap 0x1000F

Thus, after try_to_unmap_one, the large folio is still mapped. so its swapcache
will still be there.

Now a parallel thread runs MADV_PAGEOUT, and it finds this large folio
is not completely mapped, so it splits the folio into 16 small folios but their
swap offsets are kept.

Now in swapcache, we have 16 folios with contiguous swap offsets.
MADV_PAGEOUT will reclaim these 16 folios, after new 16 try_to_unmap_one,

PTE pte_stat
PTE0 swap 0x10000 SWAP_HAS_CACHE
PTE1 swap 0x10001 SWAP_HAS_CACHE
PTE2 swap 0x10002 SWAP_HAS_CACHE
PTE3 swap 0x10003 SWAP_HAS_CACHE
..
PTE15 swap 0x1000F SWAP_HAS_CACHE

>From this time, we can have various different cases for these 16 PTEs.
for example,

PTE pte_stat
PTE0 swap 0x10000 SWAP_HAS_CACHE = 0 -> become false due to
finished pageout and remove_mapping
PTE1 swap 0x10001 SWAP_HAS_CACHE = 0 -> become false due to
finished pageout and remove_mapping
PTE2 swap 0x10002 SWAP_HAS_CACHE = 0 -> become false due to
concurrent swapin and swapout
PTE3 swap 0x10003 SWAP_HAS_CACHE = 1
..
PTE13 swap 0x1000D SWAP_HAS_CACHE = 1
PTE14 swap 0x1000E SWAP_HAS_CACHE = 1
PTE15 swap 0x1000F SWAP_HAS_CACHE = 1

but all of them have swp_count = 1 and different SWAP_HAS_CACHE. some of these
small folios might be in swapcache, some others might not be in.

then we do_swap_page at one PTE whose SWAP_HAS_CACHE=0 and
swap_count=1 (the folio is not in swapcache, thus has been written to swap),
we do this check:

static bool pte_range_swap(pte_t *pte, int nr_pages)
{
int i;
swp_entry_t entry;
unsigned type;
pgoff_t start_offset;

entry = pte_to_swp_entry(ptep_get_lockless(pte));
if (non_swap_entry(entry))
return false;
start_offset = swp_offset(entry);
if (start_offset % nr_pages)
return false;

type = swp_type(entry);
for (i = 1; i < nr_pages; i++) {
entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
if (non_swap_entry(entry))
return false;
if (swp_offset(entry) != start_offset + i)
return false;
if (swp_type(entry) != type)
return false;
}

return true;
}

as those swp entries are contiguous, we will call swap_read_folio().
For those folios which are still in swapcache and haven't been written,
we get zero-filled data from zRAM.

So the root cause is that pte_range_swap should check
all 16 swap_map have the same SWAP_HAS_CACHE as
false.

static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
...
count = si->swap_map[start_offset];
for (i = 1; i < nr_pages; i++) {
entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
if (non_swap_entry(entry))
return false;
if (swp_offset(entry) != start_offset + i)
return false;
if (swp_type(entry) != type)
return false;
/* fallback to small folios if SWAP_HAS_CACHE isn't same */
if (si->swap_map[start_offset + i] != count)
return false;
}

return true;
}

but somehow TTU_SYNC "resolves" it by giving no chance to
MADV_PAGEOUT to split this folio as the large folio are either
entirely written by swap entries, or entirely keep present PTEs.

Though the bug is within the swap-in series, I am still a big fan of
TTU_SYNC for large folio reclamation for at least three reasons,

1. We remove some possibility that large folios fail to be reclaimed, improving
reclamation efficiency.

2. We avoid many strange cases and potential folio_split during reclamation.
without TTU_SYNC, folios can be splitted later, or partially being set to swap
entries while partially being still present

3. we don't increase PTL contention. My test shows try_to_unmap_one
will always get PTL after it sometimes skips one or two PTEs because
intermediate break-before-makes are short. Of course, most time try_to_unmap_one
will get PTL from PTE0.

>
> Thanks,
> Ryan
>

Thanks
Barry