Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages

From: David Hildenbrand
Date: Mon Mar 01 2021 - 09:13:48 EST


On 22.02.21 14:51, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contig_range,
> e.g.: CMA and virtio-mem, which would fail the call even when those
> pages lie in ZONE_MOVABLE and are free.
>
> We can do better by trying to replace such a page.
>
> Free hugepages are tricky to handle: so that no userspace application
> notices the disruption, we need to replace the current free hugepage
> with a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
>
> All operations are handled under hugetlb_lock, so no races are
> possible. The only exception is when a page's refcount is 0 but it
> has not yet been flagged as PageHugeFreed.
> In this case we retry, as the race window is quite small and we have a
> high chance of succeeding next time.
>
> With regard to the allocation, we restrict it to the node the page
> belongs to with __GFP_THISNODE, meaning we do not fall back to other
> nodes' zones.
>
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---
>  include/linux/hugetlb.h |   6 +++
>  mm/compaction.c         |  12 ++++++
>  mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 127 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b5807f23caf8..72352d718829 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
> +bool isolate_or_dissolve_huge_page(struct page *page);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -775,6 +776,11 @@ void set_page_huge_active(struct page *page);
>  #else /* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
> +static inline bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					unsigned long addr,
>  					int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccdaa6c19..d52506ed9db7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			valid_page = page;
>  		}
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			if (!isolate_or_dissolve_huge_page(page))
> +				goto isolate_fail;


So, the callchain is:

alloc_contig_range()->__alloc_contig_migrate_range()->isolate_migratepages_range()->isolate_migratepages_block()

The case I am thinking about is when we run out of memory and alloc_and_dissolve_huge_page() returns -ENOMEM. We then silently drop the real error (-ENOMEM vs. -EBUSY vs. -EAGAIN) that we had in isolate_or_dissolve_huge_page().


I think we should not swallow such return values in isolate_or_dissolve_huge_page(), and instead report errors (especially -ENOMEM) properly up this callchain. Otherwise we'll end up retrying / reporting -EBUSY, which is misleading.

From isolate_migratepages_range()/isolate_migratepages_block() we'll keep reporting "pfn > 0".

a) In isolate_migratepages_range() we'll keep iterating over pageblocks although we should just fail with -ENOMEM right away.

b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times although we should just fail with -ENOMEM. We end up returning "-EBUSY" after retrying.

c) In alloc_contig_range() we'll continue trying to isolate although we should just return -ENOMEM.


I think we should start returning proper errors from isolate_migratepages_range()/isolate_migratepages_block() on critical issues (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and retrying on "pfn".

So we should then fail with -ENOMEM during isolate_migratepages_range() cleanly, just as we would do when we get -ENOMEM during migrate_pages().



--
Thanks,

David / dhildenb