Re: [PATCH] mm/huge_memory: Avoid PMD-size page cache if needed

From: Gavin Shan
Date: Sat Jul 13 2024 - 05:25:56 EST


On 7/13/24 11:03 AM, David Hildenbrand wrote:
On 12.07.24 07:39, Gavin Shan wrote:

David, I can help to clean it up. Could you please help to confirm the following

Thanks!

changes are exactly what you're suggesting? Hopefully there is nothing I've missed.
The changes fix the original issue: with them applied, madvise(MADV_COLLAPSE)
fails with -EINVAL (-22) in the test program.

The Fixes tag needs to be adjusted as well.

Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface")

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2aa986a5cd1b..45909efb0ef0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -74,7 +74,12 @@ extern struct kobj_attribute shmem_enabled_attr;
   /*
    * Mask of all large folio orders supported for file THP.
    */
-#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))

DAX doesn't have any MAX_PAGECACHE_ORDER restrictions (like hugetlb). So this should be

/*
 * FSDAX never splits folios, so the MAX_PAGECACHE_ORDER limit does not
 * apply here.
 */
#define THP_ORDERS_ALL_FILE_DAX    (BIT(PMD_ORDER) | BIT(PUD_ORDER))

Something like that


Ok. It will be corrected in v2.

+#define THP_ORDERS_ALL_FILE_DAX                \
+       ((BIT(PMD_ORDER) | BIT(PUD_ORDER)) & (BIT(MAX_PAGECACHE_ORDER + 1) - 1))
+#define THP_ORDERS_ALL_FILE_DEFAULT    \
+       ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
+#define THP_ORDERS_ALL_FILE            \
+       (THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)

Maybe we can get rid of THP_ORDERS_ALL_FILE (to prevent abuse) and fixup
THP_ORDERS_ALL instead.


Sure, it will be removed in v2.

   /*
    * Mask of all large folio orders supported for THP.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2120f7478e55..4690f33afaa6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -88,9 +88,17 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
          bool smaps = tva_flags & TVA_SMAPS;
          bool in_pf = tva_flags & TVA_IN_PF;
          bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
+       unsigned long supported_orders;
+
          /* Check the intersection of requested and supported orders. */
-       orders &= vma_is_anonymous(vma) ?
-                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
+       if (vma_is_anonymous(vma))
+               supported_orders = THP_ORDERS_ALL_ANON;
+       else if (vma_is_dax(vma))
+               supported_orders = THP_ORDERS_ALL_FILE_DAX;
+       else
+               supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;

This is what I had in mind.

But, do we have to special-case shmem as well or will that be handled correctly?


With the previous fixes and this one, I don't see any remaining case where
shmem could get a 512MB page cache folio, exceeding MAX_PAGECACHE_ORDER.
Hopefully I haven't missed anything in the code inspection.

- regular read/write paths: covered by the previous fixes
- synchronous readahead: covered by the previous fixes
- asynchronous readahead: works at page-size granularity, so no huge pages
- page fault handling: covered by the previous fixes
- collapsing PTEs to PMD: to be covered by this patch
- swapin: can't produce a 512MB huge page since no such huge pages exist at swapout time
- other cases I missed (?)

Thanks,
Gavin