Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

From: David Hildenbrand
Date: Tue Dec 05 2023 - 11:58:09 EST

Next message: Gergo Koteles: "Re: [PATCH 2/2] ALSA: hda/tas2563: Add tas2563 HDA driver"
Previous message: Sean Christopherson: "Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX #MC due to erratum"
In reply to: Ryan Roberts: "Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface"
Next in thread: Ryan Roberts: "Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 04.12.23 11:20, Ryan Roberts wrote:

In preparation for adding support for anonymous multi-size THP,
introduce new sysfs structure that will be used to control the new
behaviours. A new directory is added under transparent_hugepage for each
supported THP size, and contains an `enabled` file, which can be set to
"inherit" (to inherit the global setting), "always", "madvise" or
"never". For now, the kernel still only supports PMD-sized anonymous
THP, so only 1 directory is populated.

The first half of the change converts transhuge_vma_suitable() and
hugepage_vma_check() so that they take a bitfield of orders for which
the user wants to determine support, and the functions filter out all
the orders that can't be supported, given the current sysfs
configuration and the VMA dimensions. If there is only 1 order set in
the input then the output can continue to be treated like a boolean;
this is the case for most call sites. The resulting functions are
renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
respectively.

The second half of the change implements the new sysfs interface. It has
been done so that each supported THP size has a `struct thpsize`, which
describes the relevant metadata and is itself a kobject. This is pretty
minimal for now, but should make it easy to add new per-thpsize files to
the interface if needed in future (e.g. per-size defrag). Rather than
keep the `enabled` state directly in the struct thpsize, I've elected to
directly encode it into huge_anon_orders_[always|madvise|inherit]
bitfields since this reduces the amount of work required in
thp_vma_allowable_orders() which is called for every page fault.

See Documentation/admin-guide/mm/transhuge.rst, as modified by this
commit, for details of how the new sysfs interface works.

Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
---

Some comments mostly regarding thp_vma_allowable_orders and friends. In general, LGTM. I'll have to go over the order logic once again, I got a bit lost once we started mixing anon and file orders.

[...]

Doc updates all looked good to me, skimming over them.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..bd0eadd3befb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h

[...]

+static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long orders)
+{
+ int order;
+
+ /*
+ * Iterate over orders, highest to lowest, removing orders that don't
+ * meet alignment requirements from the set. Exit loop at first order
+ * that meets requirements, since all lower orders must also meet
+ * requirements.
+ */
+
+ order = first_order(orders);

nit: "highest_order" or "largest_order" would be more expressive regarding the actual semantics.

+
+ while (orders) {
+ unsigned long hpage_size = PAGE_SIZE << order;
+ unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
+
+ if (haddr >= vma->vm_start &&
+ haddr + hpage_size <= vma->vm_end) {
+ if (!vma_is_anonymous(vma)) {
+ if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+ vma->vm_pgoff,
+ hpage_size >> PAGE_SHIFT))
+ break;
+ } else
+ break;

Comment: Codying style wants you to use if () {} else {}

But I'd recommend for the conditions:

if (haddr < vma->vm_start ||
haddr + hpage_size > vma->vm_end)
continue;
/* Don't have to check pgoff for anonymous vma */
if (!vma_is_anonymous(vma))
break;
if (IS_ALIGNED((...
break;

[...]

+/**
+ * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
+ * @vma: the vm area to check
+ * @vm_flags: use these vm_flags instead of vma->vm_flags
+ * @smaps: whether answer will be used for smaps file
+ * @in_pf: whether answer will be used by page fault handler
+ * @enforce_sysfs: whether sysfs config should be taken into account
+ * @orders: bitfield of all orders to consider
+ *
+ * Calculates the intersection of the requested hugepage orders and the allowed
+ * hugepage orders for the provided vma. Permitted orders are encoded as a set
+ * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
+ * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
+ *
+ * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
+ * orders are allowed.
+ */
+unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags, bool smaps,
+ bool in_pf, bool enforce_sysfs,
+ unsigned long orders)
+{
+ /* Check the intersection of requested and supported orders. */
+ orders &= vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
+ if (!orders)
+ return 0;

Comment: if this is called from some hot path, we might want to move as much as possible into a header, so we can avoid this function call here when e.g., THP are completely disabled etc.

+
if (!vma->vm_mm) /* vdso */
- return false;
+ return 0;
/*
* Explicitly disabled through madvise or prctl, or some
@@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
* */
if ((vm_flags & VM_NOHUGEPAGE) ||
test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
- return false;
+ return 0;
/*
* If the hardware/firmware marked hugepage support disabled.
*/
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
- return false;
+ return 0;
/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
if (vma_is_dax(vma))
- return in_pf;
+ return in_pf ? orders : 0;
/*
* khugepaged special VMA and hugetlb VMA.
@@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
* VM_MIXEDMAP set.
*/
if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
- return false;
+ return 0;
/*
- * Check alignment for file vma and size for both file and anon vma.
+ * Check alignment for file vma and size for both file and anon vma by
+ * filtering out the unsuitable orders.
*
* Skip the check for page fault. Huge fault does the check in fault
- * handlers. And this check is not suitable for huge PUD fault.
+ * handlers.
*/
- if (!in_pf &&
- !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
- return false;
+ if (!in_pf) {
+ int order = first_order(orders);
+ unsigned long addr;
+
+ while (orders) {
+ addr = vma->vm_end - (PAGE_SIZE << order);
+ if (thp_vma_suitable_orders(vma, addr, BIT(order)))
+ break;

Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the compiler is smart enough to optimize the loop and everyything else out.

[...]

+
+static ssize_t thpsize_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ ssize_t ret = count;
+
+ if (sysfs_streq(buf, "always")) {
+ set_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ } else if (sysfs_streq(buf, "inherit")) {
+ set_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_madvise);
+ } else if (sysfs_streq(buf, "madvise")) {
+ set_bit(order, &huge_anon_orders_madvise);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ } else if (sysfs_streq(buf, "never")) {
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);

Note: I was wondering for a second if some concurrent cames could lead to an inconsistent state. I think in the worst case we'll simply end up with "never" on races.

[...]

static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
{
int err;
+ struct thpsize *thpsize;
+ unsigned long orders;
+ int order;
+
+ /*
+ * Default to setting PMD-sized THP to inherit the global setting and
+ * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
+ * constant so we have to do this here.
+ */
+ huge_anon_orders_inherit = BIT(PMD_ORDER);
*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
if (unlikely(!*hugepage_kobj)) {
@@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
goto remove_hp_group;
}
+ orders = THP_ORDERS_ALL_ANON;
+ order = first_order(orders);
+ while (orders) {
+ thpsize = thpsize_create(order, *hugepage_kobj);
+ if (IS_ERR(thpsize)) {
+ pr_err("failed to create thpsize for order %d\n", order);
+ err = PTR_ERR(thpsize);
+ goto remove_all;
+ }
+ list_add(&thpsize->node, &thpsize_list);
+ order = next_order(&orders, order);
+ }
+
return 0;

[...]

page = compound_head(page);
@@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return VM_FAULT_OOM;
retry_pud:
if (pud_none(*vmf.pud) &&
- hugepage_vma_check(vma, vm_flags, false, true, true)) {
+ thp_vma_allowable_orders(vma, vm_flags, false, true, true,
+ BIT(PUD_ORDER))) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
goto retry_pud;
if (pmd_none(*vmf.pmd) &&
- hugepage_vma_check(vma, vm_flags, false, true, true)) {
+ thp_vma_allowable_orders(vma, vm_flags, false, true, true,
+ BIT(PMD_ORDER))) {

Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might make this easier to read -- and the implemenmtation will be faster.

ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e0b368e545ed..64da127cc267 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* cleared *pmd but not decremented compound_mapcount().
*/
if ((pvmw->flags & PVMW_SYNC) &&
- transhuge_vma_suitable(vma, pvmw->address) &&
+ thp_vma_suitable_orders(vma, pvmw->address,
+ BIT(PMD_ORDER)) &&

Comment: Similarly, a helper like "thp_vma_suitable_order(vma, PMD_ORDER)" might make this easier to read.

(pvmw->nr_pages >= HPAGE_PMD_NR)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);

--
Cheers,

David / dhildenb

Next message: Gergo Koteles: "Re: [PATCH 2/2] ALSA: hda/tas2563: Add tas2563 HDA driver"
Previous message: Sean Christopherson: "Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX #MC due to erratum"
In reply to: Ryan Roberts: "Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface"
Next in thread: Ryan Roberts: "Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]