Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap

From: Usama Arif
Date: Thu Jun 13 2024 - 07:37:57 EST



On 12/06/2024 21:13, Yosry Ahmed wrote:
On Wed, Jun 12, 2024 at 01:43:35PM +0100, Usama Arif wrote:
[..]

Hi Usama,

A few more comments/questions, sorry for not looking closely earlier.
No worries, Thanks for the reviews!
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f1e559e216bd..48d8dca0b94b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
unsigned int idx)
{
+ unsigned int i;
+
/*
* If scan_swap_map_slots() can't find a free cluster, it will check
* si->swap_map directly. To make sure the discarding cluster isn't
@@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
*/
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
+ /*
+ * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
+ * call on other slots, hence use atomic clear_bit for zeromap instead of the
+ * non-atomic bitmap_clear.
+ */
I don't think this is accurate. swap_read_folio() does not update the
zeromap. I think the need for an atomic operation here is because we may
be updating adjacent bits simulatenously, so we may cause lost updates
otherwise (i.e. corrupting adjacent bits).


Thanks, will change to "Use atomic clear_bit instead of non-atomic bitmap_clear to prevent adjacent bits corruption due to simultaneous writes." in the next revision


+ for (i = 0; i < SWAPFILE_CLUSTER; i++)
+ clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
Could you explain why we need to clear the zeromap here?

swap_cluster_schedule_discard() is called from:
- swap_free_cluster() -> free_cluster()

This is already covered below.

- swap_entry_free() -> dec_cluster_info_page() -> free_cluster()

Each entry in the cluster should have its zeromap bit cleared in
swap_entry_free() before the entire cluster is free and we call
free_cluster().

Am I missing something?

Yes, it looks like this one is not needed as swap_entry_free and swap_free_cluster would already have cleared the bit. Will remove it.

I had initially started checking what codepaths zeromap would need to be cleared. But then thought I could do it wherever si->swap_map is cleared or set to SWAP_MAP_BAD, which is why I added it here.

cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
@@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
struct swap_cluster_info *info, *ci;
- unsigned int idx;
+ unsigned int idx, i;
info = si->cluster_info;
@@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
__free_cluster(si, idx);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
+ for (i = 0; i < SWAPFILE_CLUSTER; i++)
+ clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
Same here. I didn't look into the specific code paths, but shouldn't the
cluster be unused (and hence its zeromap bits already cleared?).

I think this one is needed (or atleast very good to have). There are 2 paths:

1) swap_cluster_schedule_discard (clears zeromap) -> swap_discard_work -> swap_do_scheduled_discard (clears zeromap)

Path 1 doesnt need it as swap_cluster_schedule_discard already clears it.

2) scan_swap_map_slots -> scan_swap_map_try_ssd_cluster -> swap_do_scheduled_discard (clears zeromap)

Path 2 might need it as zeromap isnt cleared earlier I believe (eventhough I think it might already be 0).

Even if its cleared in path 2, I think its good to keep this one, as the function is swap_do_scheduled_discard, i.e. incase it gets directly called or si->discard_work gets scheduled anywhere else in the future, it should do as the function name suggests, i.e. swap discard(clear zeromap).

unlock_cluster(ci);
}
}
@@ -1059,9 +1070,12 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
{
unsigned long offset = idx * SWAPFILE_CLUSTER;
struct swap_cluster_info *ci;
+ unsigned int i;
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
+ for (i = 0; i < SWAPFILE_CLUSTER; i++)
+ clear_bit(offset + i, si->zeromap);
cluster_set_count_flag(ci, 0, 0);
free_cluster(si, idx);
unlock_cluster(ci);
@@ -1336,6 +1350,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
+ clear_bit(offset, p->zeromap);
I think instead of clearing the zeromap in swap_free_cluster() and here
separately, we can just do it in swap_range_free(). I suspect this may
be the only place we really need to clear the zero in the swapfile code.

Sure, we could move it to swap_range_free, but then also move the clearing of swap_map.

When it comes to clearing zeromap, I think its just generally a good idea to clear it wherever swap_map is cleared.

So the diff over v4 looks like below (should address all comments but not remove it from swap_do_scheduled_discard, and move si->swap_map/zeromap clearing from swap_free_cluster/swap_entry_free to swap_range_free):



diff --git a/mm/swapfile.c b/mm/swapfile.c
index 48d8dca0b94b..39cad0d09525 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -463,13 +463,6 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
         */
        memset(si->swap_map + idx * SWAPFILE_CLUSTER,
                        SWAP_MAP_BAD, SWAPFILE_CLUSTER);
-       /*
-        * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
-        * call on other slots, hence use atomic clear_bit for zeromap instead of the
-        * non-atomic bitmap_clear.
-        */
-       for (i = 0; i < SWAPFILE_CLUSTER; i++)
-               clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);

        cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);

@@ -758,6 +751,15 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
        unsigned long begin = offset;
        unsigned long end = offset + nr_entries - 1;
        void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+       unsigned int i;
+
+       memset(si->swap_map + offset, 0, nr_entries);
+       /*
+        * Use atomic clear_bit operations only on zeromap instead of non-atomic
+        * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
+        */
+       for (i = 0; i < nr_entries; i++)
+               clear_bit(offset + i, si->zeromap);

        if (offset < si->lowest_bit)
                si->lowest_bit = offset;
@@ -1070,12 +1072,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 {
        unsigned long offset = idx * SWAPFILE_CLUSTER;
        struct swap_cluster_info *ci;
-       unsigned int i;

        ci = lock_cluster(si, offset);
-       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
-       for (i = 0; i < SWAPFILE_CLUSTER; i++)
-               clear_bit(offset + i, si->zeromap);
        cluster_set_count_flag(ci, 0, 0);
        free_cluster(si, idx);
        unlock_cluster(ci);
@@ -1349,8 +1347,6 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
        ci = lock_cluster(p, offset);
        count = p->swap_map[offset];
        VM_BUG_ON(count != SWAP_HAS_CACHE);
-       p->swap_map[offset] = 0;
-       clear_bit(offset, p->zeromap);
        dec_cluster_info_page(p, p->cluster_info, offset);
        unlock_cluster(ci);