Re: [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse

From: Dev Jain
Date: Mon Dec 30 2024 - 02:09:45 EST



On 20/12/24 4:35 pm, Ryan Roberts wrote:
On 18/12/2024 09:50, Dev Jain wrote:
On 18/12/24 2:33 pm, Ryan Roberts wrote:
On 16/12/2024 16:51, Dev Jain wrote:
One of the testcases triggers a CoW on the 255th page (0-indexing) with
max_ptes_shared = 256. This leads to 0-254 pages (255 in number) being unshared,
and 257 pages shared, exceeding the constraint. Suppose we run the test as
./khugepaged -s 2. Therefore, khugepaged starts collapsing the range to order-2
folios, since PMD-collapse will fail due to the constraint.
When the scan reaches 254-257 PTE range, because at least one PTE in this range
is writable, with other 3 being read-only, khugepaged collapses this into an
order-2 mTHP, resulting in 3 extra PTEs getting unshared. After this, we
encounter
a 4-sized chunk of read-only PTEs, and mTHP collapse stops according to the
scaled
constraint, but the number of shared PTEs have now come under the constraint for
PMD-sized THPs. Therefore, the next scan of khugepaged will be able to collapse
this range into a PMD-mapped hugepage, leading to failure of this subtest. Fix
this by reducing the CoW range.
Is this description essentially saying that it's now possible to creep towards
collapsing to a full PMD-size block over successive scans due to rounding errors
in the scaling? Or is this just trying an edge case and the problem doesn't
generalize?
For this case, max_ptes_shared for order-2 is 256 >> (9 - 2) = 2, without
rounding errors. We cannot
really get a rounding problem because we are rounding down, essentially either
keep the restriction
same, or making it stricter, as we go down the orders.

But thinking again, this behaviour may generalize: essentially, let us say that
the distribution
of none ptes vs filled ptes is very skewed for the PMD case. In a local region,
this distribution
may not be skewed, and then an mTHP collapse will occur, making this entire
region uniform. Over time this
may keep happening and then the region will become globally uniform to come
under the PMD-constraint
on max_ptes_none, and eventually PMD-collapse will occur. Which may beg the
question whether we want
to detach khugepaged orders from mTHP sysfs settings.
We want to avoid new user controls at all costs, I think.

I think an example of the problem you are describing is: Let's say we start off
with all mTHP orders enabled and max_ptes_none is 50% (so 256). We have a 2M VMA
aligned over a single PMD. The first 2 4K pages of the VMA are allocated.

khugepaged will scan this VMA and decide to collapse the first 4 PTEs to a
single order-2 (16K) folio; that's allowed because 50% of the PTEs were none.
But now on the next scan, 50% of the first 8 PTEs are none so it will collapse
to 32K. Then on the next scan it will collapse to 64K, and so on all the way to
2M. So by faulting in 2 pages originally we have now collapsed to 2M dispite the
control trying to prevent it, and we have done it in a very inefficient way.

If max_ptes_none was 75% you would only need every other order enabled (I think?).

In practice perhaps it's not a problem because you are only likely to have 1 or
2 mTHP sizes enabled. But I wonder if we need to think about how to protect from
this "creep"?

Perhaps only consider a large folio for collapse into a larger folio if it
wasn't originally collapsed by khugepaged in the first place? That would need a
folio flag... and I suspect that will cause other edge case issues if we think
about it for 5 mins...

Another way of thinking about it is; if all the same mTHP orders were enabled at
fault time (and the allocation suceeded) we would have allocated the largest
order anyway, so the end states are the same. But the large number of
incremental collapses that khugepaged will perform feel like a problem.

I'm not quite sure what the answer is.

Can't really think of anything else apart from decoupling khugepaged sysfs from mTHP sysfs...


Thanks,
Ryan


Note: The only objective of this patch is to make the test work for the PMD-
case;
no extension has been made for testing for mTHPs.

Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
---
  tools/testing/selftests/mm/khugepaged.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/
selftests/mm/khugepaged.c
index 8a4d34cce36b..143c4ad9f6a1 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -981,6 +981,7 @@ static void collapse_fork_compound(struct
collapse_context *c, struct mem_ops *o
  static void collapse_max_ptes_shared(struct collapse_context *c, struct
mem_ops *ops)
  {
      int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
+    int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
      int wstatus;
      void *p;
  @@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct
collapse_context *c, struct mem_ops
              fail("Fail");
            printf("Trigger CoW on page %d of %d...",
-                hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
-        ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
+                hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
+        ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) *
page_size);
          if (ops->check_huge(p, 0))
              success("OK");
          else