Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

From: David Hildenbrand
Date: Tue Dec 05 2023 - 04:58:28 EST


On 05.12.23 10:50, Ryan Roberts wrote:
On 05/12/2023 04:21, Barry Song wrote:
On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:

In preparation for adding support for anonymous multi-size THP,
introduce a new sysfs structure that will be used to control the new
behaviours. A new directory is added under transparent_hugepage for each
supported THP size, and contains an `enabled` file, which can be set to
"inherit" (to inherit the global setting), "always", "madvise" or
"never". For now, the kernel still only supports PMD-sized anonymous
THP, so only 1 directory is populated.

The first half of the change converts transhuge_vma_suitable() and
hugepage_vma_check() so that they take a bitfield of orders for which
the user wants to determine support, and the functions filter out all
the orders that can't be supported, given the current sysfs
configuration and the VMA dimensions. If there is only 1 order set in
the input then the output can continue to be treated like a boolean;
this is the case for most call sites. The resulting functions are
renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
respectively.
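
The order-filtering idea described above can be sketched in userspace as follows. This is a toy model, not the kernel's thp_vma_suitable_orders(): the function name, the 4 KiB base page size and the order cap are assumptions made for illustration, and the real function performs further checks that are omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u   /* assumption: 4 KiB base pages */
#define MODEL_MAX_ORDER  20u   /* assumption: cap used by this model only */

/*
 * Hypothetical userspace model (not the kernel function): keep only the
 * orders whose naturally aligned block around addr lies entirely inside
 * the VMA [vma_start, vma_end). Orders above MODEL_MAX_ORDER are dropped.
 */
static unsigned long model_suitable_orders(unsigned long orders,
                                           uint64_t vma_start,
                                           uint64_t vma_end,
                                           uint64_t addr)
{
    unsigned long out = 0;

    assert(vma_start <= addr && addr < vma_end);

    for (unsigned int order = 0; order <= MODEL_MAX_ORDER; order++) {
        uint64_t size, start;

        if (!(orders & (1UL << order)))
            continue;

        size = (uint64_t)1 << (order + MODEL_PAGE_SHIFT);
        start = addr & ~(size - 1);   /* round addr down to the block size */

        /* start <= addr < vma_end, so the subtraction cannot underflow. */
        if (start >= vma_start && vma_end - start >= size)
            out |= 1UL << order;
    }
    return out;
}
```

When a caller passes a single order, the result is either that same bit or zero, so it can be tested like a boolean, which matches the behaviour the commit message describes for most call sites.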

The second half of the change implements the new sysfs interface. It has
been done so that each supported THP size has a `struct thpsize`, which
describes the relevant metadata and is itself a kobject. This is pretty
minimal for now, but should make it easy to add new per-thpsize files to
the interface if needed in future (e.g. per-size defrag). Rather than
keep the `enabled` state directly in the struct thpsize, I've elected to
directly encode it into huge_anon_orders_[always|madvise|inherit]
bitfields since this reduces the amount of work required in
thp_vma_allowable_orders() which is called for every page fault.
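
To illustrate why the three bitfields keep the fault path cheap, here is a minimal userspace sketch of how they could be combined with the top-level setting. The struct, enum and function names are hypothetical; only the always/madvise/inherit split comes from the description above, and the exact logic in the patch may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the top-level "enabled" setting. */
enum model_global { MODEL_NEVER, MODEL_MADVISE, MODEL_ALWAYS };

/* Hypothetical stand-in for huge_anon_orders_[always|madvise|inherit]. */
struct model_anon_orders {
    unsigned long always;
    unsigned long madvise;
    unsigned long inherit;
};

/* Return the subset of `orders` that the modelled configuration enables. */
static unsigned long model_enabled_orders(const struct model_anon_orders *cfg,
                                          enum model_global global,
                                          bool vma_madvised,
                                          unsigned long orders)
{
    unsigned long mask = cfg->always;

    /* Each size is in at most one state; "never" means no bit is set. */
    assert(!(cfg->always & cfg->madvise));
    assert(!(cfg->always & cfg->inherit));
    assert(!(cfg->madvise & cfg->inherit));

    if (vma_madvised)
        mask |= cfg->madvise;

    /* "inherit" sizes follow the top-level setting. */
    if (global == MODEL_ALWAYS || (global == MODEL_MADVISE && vma_madvised))
        mask |= cfg->inherit;

    return orders & mask;
}
```

In this model the whole per-size policy check reduces to a few ORs and one AND, with no per-size loop or string comparison.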

See Documentation/admin-guide/mm/transhuge.rst, as modified by this
commit, for details of how the new sysfs interface works.

Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>

Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>

Thanks!


-khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+khugepaged will be automatically started when one or more hugepage
+sizes are enabled (either by directly setting "always" or "madvise",
+or by setting "inherit" while the top-level enabled is set to "always"
+or "madvise"), and it'll be automatically shutdown when the last
+hugepage size is disabled (either by directly setting "never", or by
+setting "inherit" while the top-level enabled is set to "never").

Khugepaged controls
-------------------

+.. note::
+ khugepaged currently only searches for opportunities to collapse to
+ PMD-sized THP and no attempt is made to collapse to other THP
+ sizes.

For small-size THP, collapse is probably a bad idea. In Android we prefer a
one-shot attempt, especially as we use large folios of 64KB or smaller. If
the page fault succeeds in getting a large folio, we map it; otherwise we
give up, because that memory is quite unstable: it is frequently swapped
out, swapped in and madvised with MADV_DONTNEED.

Too many compactions will increase power consumption and hurt UI
responsiveness.

Understood; that's very useful information for the Android context. Multiple
people have commented, though, that the server context will eventually need
khugepaged (or something similar) to asynchronously collapse to contpte size.
One suggestion was actually a user space daemon that scans and collapses with
MADV_COLLAPSE. I suspect the key will be to ensure that whatever solution we go
for is flexible and can be enabled/disabled/configured for the different
environments.
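
As a rough sketch of the building block such a user space daemon might use, the helper below asks the kernel to collapse a range of the calling process with MADV_COLLAPSE (available since Linux 6.1). The helper name is hypothetical, and the scanning policy is left out entirely. MADV_COLLAPSE is best-effort, and collapsing to sizes other than PMD would need kernel support that this thread does not describe.

```c
#define _DEFAULT_SOURCE        /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25       /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: request a synchronous, best-effort collapse of
 * [addr, addr + len) in the calling process.
 * Returns 0 on success or a negative errno value on failure.
 */
static int collapse_range(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return -errno;
}
```

A failure here is not fatal for the caller: the range simply stays mapped with its existing page sizes, so a daemon could retry later or skip it.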

There certainly is interest in 2 MiB THP on arm64 with 64k base pages, where the PMD-sized THP would normally be 512 MiB. In that scenario, khugepaged makes perfect sense.
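
The 512 MiB figure follows from the page-table geometry. Assuming 8-byte page-table entries, one table page holds base_page_size / 8 entries, so a PMD entry covers that many base pages. A small sketch of the arithmetic, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PTE_BYTES 8u     /* assumption: 8-byte page-table entries */

/* Bytes mapped by one PMD entry for a given base page size. */
static uint64_t model_pmd_size(uint64_t base_page_size)
{
    /* The base page size must be a power of two and hold at least one entry. */
    assert(base_page_size >= MODEL_PTE_BYTES &&
           (base_page_size & (base_page_size - 1)) == 0);

    return base_page_size * (base_page_size / MODEL_PTE_BYTES);
}

/* Smallest order such that (base_page_size << order) >= thp_size. */
static unsigned int model_order_for(uint64_t thp_size, uint64_t base_page_size)
{
    unsigned int order = 0;

    assert(base_page_size > 0 && thp_size >= base_page_size);

    while ((base_page_size << order) < thp_size)
        order++;
    return order;
}
```

With 4 KiB base pages this gives a 2 MiB PMD size (order 9). With 64 KiB base pages it gives 512 MiB (order 13), while a 2 MiB THP is only order 5, i.e. 32 base pages.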

--
Cheers,

David / dhildenb