Re: [PATCH] mm/huge_memory: Avoid PMD-size page cache if needed

From: David Hildenbrand
Date: Sat Jul 13 2024 - 00:17:53 EST


On 13.07.24 06:01, Baolin Wang wrote:


On 2024/7/13 09:03, David Hildenbrand wrote:
On 12.07.24 07:39, Gavin Shan wrote:
On 7/12/24 7:03 AM, David Hildenbrand wrote:
On 11.07.24 22:46, Matthew Wilcox wrote:
On Thu, Jul 11, 2024 at 08:48:40PM +1000, Gavin Shan wrote:
+++ b/mm/huge_memory.c
@@ -136,7 +136,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
           while (orders) {
               addr = vma->vm_end - (PAGE_SIZE << order);
-            if (thp_vma_suitable_order(vma, addr, order))
+            if (!(vma->vm_file && order > MAX_PAGECACHE_ORDER) &&
+                thp_vma_suitable_order(vma, addr, order))
                   break;

Why does 'orders' even contain potential orders that are larger than
MAX_PAGECACHE_ORDER?

We do this at the top:

          orders &= vma_is_anonymous(vma) ?
                          THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;

include/linux/huge_mm.h:#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER))

... and that seems very wrong.  We support all kinds of orders for
files, not just PMD order.  We don't support PUD order at all.

What the hell is going on here?

yes, that's just absolutely confusing. I mentioned it to Ryan lately
that we should clean that up (I wanted to look into that, but am
happy if someone else can help).

There should likely be different defines for

DAX (PMD|PUD)

SHMEM (PMD) -- but soon more. Not sure if we want separate ANON_SHMEM
for the time being. Hm. But shmem is already handled separately, so
maybe we can just ignore shmem here.

PAGECACHE (1 .. MAX_PAGECACHE_ORDER)

? But it's still unclear to me.

At least DAX must stay special I think, and PAGECACHE should be
capped at MAX_PAGECACHE_ORDER.


David, I can help to clean it up.

Thanks!

Could you please confirm that the following changes are exactly what
you're suggesting? Hopefully there is nothing I've missed.

The original issue is fixed by these changes. With them applied,
madvise(MADV_COLLAPSE) fails with errno -22 (-EINVAL) in the test program.

The Fixes tag needs to be adjusted as well.

Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface")

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2aa986a5cd1b..45909efb0ef0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -74,7 +74,12 @@ extern struct kobj_attribute shmem_enabled_attr;
   /*
    * Mask of all large folio orders supported for file THP.
    */
-#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))

DAX doesn't have any MAX_PAGECACHE_ORDER restrictions (like hugetlb). So
this should be

/*
 * FSDAX never splits folios, so the MAX_PAGECACHE_ORDER limit does not
 * apply here.
 */
#define THP_ORDERS_ALL_FILE_DAX (BIT(PMD_ORDER) | BIT(PUD_ORDER))

Something like that

+#define THP_ORDERS_ALL_FILE_DAX                \
+       ((BIT(PMD_ORDER) | BIT(PUD_ORDER)) & (BIT(MAX_PAGECACHE_ORDER + 1) - 1))
+#define THP_ORDERS_ALL_FILE_DEFAULT    \
+       ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
+#define THP_ORDERS_ALL_FILE            \
+       (THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)

Maybe we can get rid of THP_ORDERS_ALL_FILE (to prevent abuse) and fixup
THP_ORDERS_ALL instead.
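
To make the effect of the cap concrete, here is a minimal userspace sketch of the mask arithmetic. The values of PMD_ORDER, PUD_ORDER and MAX_PAGECACHE_ORDER are assumptions picked purely for illustration (a configuration where PMD_ORDER exceeds MAX_PAGECACHE_ORDER); the macro names DAX_ORDERS_CAPPED, DAX_ORDERS_UNCAPPED and FILE_ORDERS_DEFAULT and the helper order_in_mask() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <limits.h>

#define BIT(n) (1UL << (n))

/* Assumed, illustrative values only; real values depend on arch and config. */
#define PMD_ORDER           13
#define PUD_ORDER           26
#define MAX_PAGECACHE_ORDER 11

/* All orders 0..MAX_PAGECACHE_ORDER, as a bitmask. */
#define PAGECACHE_CAP (BIT(MAX_PAGECACHE_ORDER + 1) - 1)

/* DAX mask as written in the proposed diff: capped by the page cache limit. */
#define DAX_ORDERS_CAPPED   ((BIT(PMD_ORDER) | BIT(PUD_ORDER)) & PAGECACHE_CAP)
/* DAX mask as suggested in the review: no page cache cap. */
#define DAX_ORDERS_UNCAPPED (BIT(PMD_ORDER) | BIT(PUD_ORDER))
/* Regular page cache: orders 1..MAX_PAGECACHE_ORDER (order 0 is excluded). */
#define FILE_ORDERS_DEFAULT (PAGECACHE_CAP & ~BIT(0))

/* Hypothetical helper: is 'order' present in the bitmask 'mask'? */
static inline int order_in_mask(unsigned long mask, unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return (mask & BIT(order)) != 0;
}
```

Under these assumed values the capped DAX mask ends up empty, since both the PMD and PUD bits lie above the cap, while the uncapped variant keeps both orders. The default file mask keeps orders 1 through MAX_PAGECACHE_ORDER and therefore excludes the PMD order.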

   /*
    * Mask of all large folio orders supported for THP.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2120f7478e55..4690f33afaa6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -88,9 +88,17 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
          bool smaps = tva_flags & TVA_SMAPS;
          bool in_pf = tva_flags & TVA_IN_PF;
          bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
+       unsigned long supported_orders;
+
          /* Check the intersection of requested and supported orders. */
-       orders &= vma_is_anonymous(vma) ?
-                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
+       if (vma_is_anonymous(vma))
+               supported_orders = THP_ORDERS_ALL_ANON;
+       else if (vma_is_dax(vma))
+               supported_orders = THP_ORDERS_ALL_FILE_DAX;
+       else
+               supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;

This is what I had in mind.

But, do we have to special-case shmem as well or will that be handled
correctly?

For anonymous shmem, it is now the same as anonymous THP, which can use
THP_ORDERS_ALL_ANON. For tmpfs, we currently only support PMD-sized THP
(larger orders will be supported in the future). Therefore, I think we
can reuse THP_ORDERS_ALL_ANON for shmem now:

if (vma_is_anonymous(vma) || shmem_file(vma->vm_file))
        supported_orders = THP_ORDERS_ALL_ANON;
......



It should be THP_ORDERS_ALL_FILE_DEFAULT (MAX_PAGECACHE_ORDER limitation applies).
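
As a rough userspace sketch of the selection logic this would give, with shmem falling into the same bucket as the regular page cache: the enum vma_kind, the ORDERS_* macros and both helpers below are hypothetical stand-ins rather than kernel code, the order values are assumed for illustration, and the ORDERS_ANON definition (orders 2..PMD_ORDER) is likewise an assumption.

```c
#include <assert.h>

#define BIT(n) (1UL << (n))

/* Assumed, illustrative values only; real values depend on arch and config. */
#define PMD_ORDER           13
#define PUD_ORDER           26
#define MAX_PAGECACHE_ORDER 11

/* Assumed stand-in for THP_ORDERS_ALL_ANON: orders 2..PMD_ORDER. */
#define ORDERS_ANON         ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
/* DAX: PMD and PUD orders, not capped by MAX_PAGECACHE_ORDER. */
#define ORDERS_FILE_DAX     (BIT(PMD_ORDER) | BIT(PUD_ORDER))
/* Page cache (and shmem): orders 1..MAX_PAGECACHE_ORDER. */
#define ORDERS_FILE_DEFAULT ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))

/* Hypothetical classification standing in for the vma_is_*() checks. */
enum vma_kind { VMA_ANON, VMA_DAX, VMA_SHMEM, VMA_FILE };

/* Pick the supported-order mask for a given kind of mapping. */
static inline unsigned long supported_orders(enum vma_kind kind)
{
	switch (kind) {
	case VMA_ANON:
		return ORDERS_ANON;
	case VMA_DAX:
		return ORDERS_FILE_DAX;
	case VMA_SHMEM: /* same limit as the page cache */
	case VMA_FILE:
		return ORDERS_FILE_DEFAULT;
	}
	assert(!"unknown vma_kind");
	return 0;
}

/* Intersection of requested and supported orders. */
static inline unsigned long allowable_orders(enum vma_kind kind,
					     unsigned long requested)
{
	return requested & supported_orders(kind);
}
```

With these assumed values, a PMD-order request on a shmem or regular file mapping yields an empty mask, while the same request on an anonymous or DAX mapping is preserved.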

--
Cheers,

David / dhildenb