Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios

From: David Hildenbrand
Date: Tue Apr 01 2025 - 09:05:03 EST

Next message: Mateusz Guzik: "Re: [PATCH v3] exit: combine work under lock in synchronize_group_exit() and coredump_task_exit()"
Previous message: Andy Shevchenko: "Re: [PATCH 2/2] media: atomisp: Fix Wformat-truncation warning"
In reply to: Ryan Roberts: "Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios"
Next in thread: Baolin Wang: "Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 01.04.25 12:45, Ryan Roberts wrote:

On 30/03/2025 15:57, Baolin Wang wrote:

On 2025/3/27 22:08, Ryan Roberts wrote:

On 25/03/2025 23:38, Baolin Wang wrote:

When I tested the mincore() syscall, I observed that it takes longer with
64K mTHP enabled on my Arm64 server. The reason is the mincore_pte_range()
still checks each PTE individually, even when the PTEs are contiguous,
which is not efficient.

Thus we can use folio_pte_batch() to get the batch number of the present
contiguous PTEs, which can improve the performance. I tested the mincore()
syscall with 1G anonymous memory populated with 64K mTHP, and observed an
obvious performance improvement:

w/o patch        w/ patch        changes
6022us            1115us            +81%

Moreover, I also tested mincore() with disabling mTHP/THP, and did not
see any obvious regression.

Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
mm/mincore.c | 27 ++++++++++++++++++++++-----
1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 832f29f46767..88be180b5550 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,6 +21,7 @@
#include <linux/uaccess.h>
#include "swap.h"
+#include "internal.h"
static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long
addr,
              unsigned long end, struct mm_walk *walk)
@@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
addr, unsigned long end,
      pte_t *ptep;
      unsigned char *vec = walk->private;
      int nr = (end - addr) >> PAGE_SHIFT;
+    int step, i;
      ptl = pmd_trans_huge_lock(pmd, vma);
      if (ptl) {
@@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long
addr, unsigned long end,
          walk->action = ACTION_AGAIN;
          return 0;
      }
-    for (; addr != end; ptep++, addr += PAGE_SIZE) {
+    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
          pte_t pte = ptep_get(ptep);
+        step = 1;
          /* We need to do cache lookup too for pte markers */
          if (pte_none_mostly(pte))
              __mincore_unmapped_range(addr, addr + PAGE_SIZE,
                           vma, vec);
-        else if (pte_present(pte))
-            *vec = 1;
-        else { /* pte is a swap entry */
+        else if (pte_present(pte)) {
+            if (pte_batch_hint(ptep, pte) > 1) {
+                struct folio *folio = vm_normal_folio(vma, addr, pte);
+
+                if (folio && folio_test_large(folio)) {
+                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+                                FPB_IGNORE_SOFT_DIRTY;
+                    int max_nr = (end - addr) / PAGE_SIZE;
+
+                    step = folio_pte_batch(folio, addr, ptep, pte,
+                            max_nr, fpb_flags, NULL, NULL, NULL);
+                }
+            }

You could simplify to the following, I think, to avoid needing to grab the folio
and call folio_pte_batch():

            else if (pte_present(pte)) {
                int max_nr = (end - addr) / PAGE_SIZE;
                step = min(pte_batch_hint(ptep, pte), max_nr);
            } ...

I expect the regression you are seeing here is all due to calling ptep_get() for
every pte in the contpte batch, which will cause 16 memory reads per pte (to
gather the access/dirty bits). For small folios its just 1 read per pte.

Right.

pte_batch_hint() will skip forward in blocks of 16 so you now end up with the
same number as for the small folio case. You don't need all the fancy extras
that folio_pte_batch() gives you here.

Sounds reasonable. Your suggestion looks simple, but my method can batch the
whole large folio (such as large folios containing more than 16 contiguous PTEs)
at once.

Sure but folio_pte_batch() just implements that with another loop that calls
pte_batch_hint(), so it all amounts to the same thing. In fact there are some
extra checks in folio_pte_batch() that you don't need so it might be a bit slower.

I don't enjoy open-coding the batching, especially if only cont-pte users will benefit from it. But I also don't enjoy the open-coded pte_batch_hint() :)

But we really don't need the folio here, so I assume the short variant you (Ryan) suggest is alright to just avoid the ptep_get().

As Oscar says, these details might soon be hidden inside a new page table walker API (even though it will likely end up using folio_pte_batch() internally, TBD).

--
Cheers,

David / dhildenb

Next message: Mateusz Guzik: "Re: [PATCH v3] exit: combine work under lock in synchronize_group_exit() and coredump_task_exit()"
Previous message: Andy Shevchenko: "Re: [PATCH 2/2] media: atomisp: Fix Wformat-truncation warning"
In reply to: Ryan Roberts: "Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios"
Next in thread: Baolin Wang: "Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]