Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM

From: Baolin Wang
Date: Tue Sep 24 2024 - 21:54:05 EST




On 2024/9/24 23:48, Nhat Pham wrote:
On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:

On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:

On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:


On 2024/9/24 10:15, Yosry Ahmed wrote:
On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:



On 2024/9/24 07:11, Nhat Pham wrote:
The SWAP_MAP_SHMEM state was originally introduced in the commit
aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
swap entry belongs to shmem during swapoff.

However, swapoff has since been rewritten drastically in the commit
b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
having swap count == SWAP_MAP_SHMEM value is basically the same as having
swap count == 1, and swap_shmem_alloc() behaves analogously to
swap_duplicate()

This RFC proposes the removal of this state and the associated helper to
simplify the state machine (both mentally and code-wise). We will also
have an extra state/special value that can be repurposed (for swap entries
that never gets re-duplicated).

Another motivation (albeit a bit premature at the moment) is the new swap
abstraction I am currently working on, that would allow for swap/zswap
decoupling, swapoff optimization, etc. The fewer states and swap API
functions there are, the simpler the conversion will be.

I am sending this series first as an RFC, just in case I missed something
or misunderstood this state, or if someone has a swap optimization in mind
for shmem that would require this special state.

The idea makes sense to me. I did a quick test with shmem mTHP, and
encountered the following warning which is triggered by
'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

Apparently __swap_duplicate() does not currently handle increasing the
swap count for multiple swap entries by 1 (i.e. usage == 1) because it
does not handle rolling back count increases when
swap_count_continued() fails.

I guess this voids my Reviewed-by until we sort this out. Technically
swap_count_continued() won't ever be called for shmem because we only
ever increment the count by 1, but there is no way to know this in
__swap_duplicate() without SWAP_HAS_SHMEM.

Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
remove the swapfile check (that's another can of worms, but I need
data before submitting the patch to remove it...)

One thing we can do is instead of warning here, we can handle it in
the for loop check, where we have access to count - that's the point
of having that for-loop check anyway? :)

There's a couple of ways to go about it:

1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );

Hmm that should work, although it's a bit complicated tbh.

(or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))

I think this will make the warning very hard to hit if there's a
misuse of __swap_duplicate(). It will only be hit when an entry needs
count continuation, which I am not sure is very common. If there's a
bug, the warning will potentially catch it too late, if ever.

The side effect here is failing to decrement the swap count of some
swap entries which will lead to them never being freed, essentially
leaking swap capacity slowly over time. I am not sure if there are
more detrimental effects.


2. Alternatively, instead of warning here, we can simply return
-ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
this MUST succeed.

We still fail to rollback incremented counts though when we return
-ENOMEM, right? Maybe I didn't get what you mean.

My understanding now is that there are two for loops. One for loop
that checks the entry's states, and one for loop that does the actual
incrementing work (or state modification).

We can check in the first for loop, if it is safe to proceed:

if (!count && !has_cache) {
err = -ENOENT;
} else if (usage == SWAP_HAS_CACHE) {
if (has_cache)
err = -EEXIST;
} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
err = -EINVAL;
} else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
SWAP_MAP_MAX)) {
/* the batched variants currently do not support rollback */
err = -ENOMEM;
}

At this point, IIUC, we have not done any incrementing, so no rollback
needed? :)

Right, looks good (although need some cleanup pointed by Yosry).


Either solutions should follow with careful documentation to make it
clear the expectation/guarantee of the new API.

Yosry, Baolin, how do you two feel about this? Would something like
this work? I need to test it first, but let me know if I'm missing
something.

If this does not work, we can do what Baolin is suggesting, and
perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
but at least we still lose a state...

Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
just a cleanup with small wins, so if it's too complicated to remove
it it may not be worth it. I am assuming with your ongoing work, it
becomes much more valuable, so maybe if it's too complicated we can
defer it until the benefits are realizable?

I agree :)

One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to batch free shmem swap entries in __swap_entries_free(), similar to the commit bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") did, which can improve the performance of shmem mTHP munmap() function based on my testing.

Without this patch set, I need do following changes to batch free shmem swap entries:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..94e28cd60c52 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -163,7 +163,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
unsigned char *map_end = map + nr_pages;
unsigned char count = *map;

- if (swap_count(count) != 1)
+ if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
return false;

while (++map < map_end) {
@@ -1503,10 +1503,10 @@ static bool __swap_entries_free(struct swap_info_struct *si,
unsigned int type = swp_type(entry);
struct swap_cluster_info *ci;
bool has_cache = false;
- unsigned char count;
+ unsigned char count = swap_count(data_race(si->swap_map[offset]));
int i;

- if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1)
+ if (nr <= 1 || (count != 1 && count != SWAP_MAP_SHMEM))
goto fallback;
/* cross into another cluster */
if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)