Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

From: David Hildenbrand
Date: Fri Jun 07 2024 - 07:16:18 EST


On 07.06.24 12:24, Usama Arif wrote:

On 04/06/2024 13:43, David Hildenbrand wrote:
On 04.06.24 14:30, David Hildenbrand wrote:
On 04.06.24 12:58, Usama Arif wrote:
Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, the pte can be cleared for those
addresses at unmap time while shrinking folio list. When this
causes a page fault, do_pte_missing will take care of this page.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).
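To make "zero pages" concrete, here is a minimal userspace sketch of a
zero-filled check. It is not the code from this patch:
SKETCH_PAGE_SIZE and sketch_page_is_zero_filled() are hypothetical
names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for this sketch */

/*
 * Hypothetical helper: returns true if every byte of the page is zero.
 * Scans one machine word at a time; memcpy avoids alignment assumptions.
 */
static bool sketch_page_is_zero_filled(const void *page)
{
	const unsigned char *p = page;
	unsigned long word;
	size_t i;

	assert(page != NULL);
	for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(word)) {
		memcpy(&word, p + i, sizeof(word));
		if (word != 0)
			return false; /* first non-zero word ends the scan */
	}
	return true;
}
```

The scan stops at the first non-zero word, so pages with real data
are usually rejected early; only pages that really are zero pay for a
full scan.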

[1]
https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
---
   include/linux/rmap.h |   1 +
   mm/rmap.c            | 163 ++++++++++++++++++++++---------------------
   mm/vmscan.c          |  89 ++++++++++++++++-------
   3 files changed, 150 insertions(+), 103 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bb53e5920b88..b36db1e886e4 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -100,6 +100,7 @@ enum ttu_flags {
                        * do a final flush if necessary */
       TTU_RMAP_LOCKED        = 0x80,    /* do not grab rmap lock:
                        * caller holds it */
+    TTU_ZERO_FOLIO        = 0x100,/* zero folio */
   };
      #ifdef CONFIG_MMU
diff --git a/mm/rmap.c b/mm/rmap.c
index 52357d79917c..d98f70876327 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                */
               dec_mm_counter(mm, mm_counter(folio));
           } else if (folio_test_anon(folio)) {
-            swp_entry_t entry = page_swap_entry(subpage);
-            pte_t swp_pte;
-            /*
-             * Store the swap location in the pte.
-             * See handle_pte_fault() ...
-             */
-            if (unlikely(folio_test_swapbacked(folio) !=
-                    folio_test_swapcache(folio))) {
+            if (flags & TTU_ZERO_FOLIO) {
+                pte_clear(mm, address, pvmw.pte);
+                dec_mm_counter(mm, MM_ANONPAGES);

Is there an easy way to reduce the code churn and highlight the added
code?

Like

} else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {

} else if (folio_test_anon(folio)) {



Also two concerns that I want to spell out:

(a) What stops the page from getting modified in the meantime? The CPU
      can write to it until the TLB is flushed.
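The window in (a) can be illustrated with a deliberately simplified,
single-threaded toy model. This is a sketch under stated assumptions,
not kernel code: struct sketch_mapping and all sketch_* functions are
hypothetical, and the "race" is simulated by ordering the steps by
hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for this sketch */

/* Toy model of one mapped anonymous page; not a kernel structure. */
struct sketch_mapping {
	unsigned char data[SKETCH_PAGE_SIZE];
	bool pte_present;    /* page table entry still maps the page */
	bool tlb_entry_live; /* a CPU may still hold a cached translation */
};

static bool sketch_is_zero(const struct sketch_mapping *m)
{
	size_t i;

	for (i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (m->data[i])
			return false;
	return true;
}

/* A store succeeds while the pte is present or a TLB entry is cached. */
static bool sketch_cpu_write(struct sketch_mapping *m, size_t off,
			     unsigned char val)
{
	assert(off < SKETCH_PAGE_SIZE);
	if (!m->pte_present && !m->tlb_entry_live)
		return false;
	m->data[off] = val;
	return true;
}

/*
 * Check-then-clear sequence: returns true if the page was judged zero
 * but held data by the time the mapping was torn down, i.e. the data
 * would be lost when the next fault maps a fresh zero page.
 */
static bool sketch_reclaim_loses_data(bool write_races)
{
	struct sketch_mapping m;
	bool looked_zero;

	memset(&m, 0, sizeof(m));
	m.pte_present = true;
	m.tlb_entry_live = true;

	looked_zero = sketch_is_zero(&m);       /* step 1: zero check */
	if (write_races)
		sketch_cpu_write(&m, 0, 0xAB);  /* store before the flush */
	m.pte_present = false;                  /* step 2: clear the pte */
	m.tlb_entry_live = false;               /* step 3: TLB flush */

	return looked_zero && !sketch_is_zero(&m);
}
```

In this model, sketch_reclaim_loses_data(false) returns false and
sketch_reclaim_loses_data(true) returns true: the zero check is only
trustworthy once no CPU can write to the page any more.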

Thanks for pointing this out, David and Shakeel. This is a big issue in
this v2, and as Shakeel pointed out in [1], we would need to do a second
rmap walk. Looking at how ksm deals with this in try_to_merge_one_page,
which calls write_protect_page for each vma (i.e. basically an rmap
walk), this would be much more CPU-expensive and complicated than v1
[2], where the swap subsystem can handle all the complexities. I will go
back to my v1 solution for the next revision, as it is much simpler.
Its memory usage is very low (0.003%), as pointed out by Johannes [3],
and would likely be offset by the memory savings of not having a
zswap_entry for zero-filled pages. It is also a lot simpler than what a
valid v2 approach would look like.
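As a rough check on the 0.003% figure: it matches what one bit of
metadata per 4 KiB page would cost, since 1 / (4096 * 8) is about
0.00305%. The sketch below assumes exactly that layout. The
bit-per-page scheme, the 4 KiB page size and all sketch_* names are
assumptions made for illustration, not taken from this mail.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Metadata overhead in percent, assuming one bit per page. */
static double sketch_bitmap_overhead_percent(void)
{
	return 100.0 / (double)(SKETCH_PAGE_SIZE * CHAR_BIT);
}

/* Hypothetical: record that the page in swap slot 'slot' is zero-filled. */
static void sketch_mark_zero(unsigned long *map, size_t nslots, size_t slot)
{
	assert(slot < nslots);
	map[slot / SKETCH_BITS_PER_LONG] |= 1UL << (slot % SKETCH_BITS_PER_LONG);
}

/* Hypothetical: query whether the page in slot 'slot' was zero-filled. */
static bool sketch_test_zero(const unsigned long *map, size_t nslots,
			     size_t slot)
{
	assert(slot < nslots);
	return (map[slot / SKETCH_BITS_PER_LONG] >>
		(slot % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

With such a map, a zero-filled page needs no I/O on swap-out (set the
bit) or on swap-in (test the bit and hand back a zeroed page).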

Agreed.

--
Cheers,

David / dhildenb