Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

From: Zi Yan
Date: Fri Oct 13 2023 - 10:51:36 EST


On 12 Oct 2023, at 20:06, Zi Yan wrote:

> On 10 Oct 2023, at 17:12, Johannes Weiner wrote:
>
>> Hello!
>>
>> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>>> On 27 Sep 2023, at 22:51, Zi Yan wrote:
>>> I attached my revised patch 2 and 3 (with all the suggestions above).
>>
>> Thanks! It took me a bit to read through them. It's a tricky codebase!
>>
>> Some comments below.
>>
>>> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@xxxxxxxxxx>
>>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>>> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>>> free lists.
>>>
>>> This avoids changing the migratetype after move_freepages() or
>>> move_freepages_block(), which is error-prone. It also prepares for upcoming
>>> changes to fix move_freepages() failing to move free pages that are only
>>> partially within the given range.
>>>
>>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>>
>> This is great and indeed makes the callsites much simpler. Thanks,
>> I'll fold this into the series.
>>
>>> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>> unsigned long end_pfn, int old_mt, int new_mt)
>>> {
>>> struct page *page;
>>> - unsigned long pfn;
>>> + unsigned long pfn, pfn2;
>>> unsigned int order;
>>> int pages_moved = 0;
>>> + unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
>>> + unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>>> +
>>> + /* split at start_pfn if it is in the middle of a free page */
>>> + if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>>> + struct page *new_page = pfn_to_page(new_start_pfn);
>>> + int new_page_order = buddy_order(new_page);
>>
>> get_freepage_start_pfn() returns start_pfn if it didn't find a large
>> buddy, so the buddy check shouldn't be necessary, right?
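>>
>> I.e., something like this (a sketch only, reusing the names from the
>> quoted patch):
>>
>> 	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>>
>> 	/* new_start_pfn != start_pfn already implies a straddling buddy */
>> 	if (new_start_pfn != start_pfn) {
>> 		struct page *new_page = pfn_to_page(new_start_pfn);
>> 		int new_page_order = buddy_order(new_page);
>> 		...
>> 	}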
>>
>>> + if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>>
>> This *should* be implied according to the comments on
>> get_freepage_start_pfn(), but it currently isn't. Doing so would help
>> here, and seemingly also in alloc_contig_range().
>>
>> How about this version of get_freepage_start_pfn()?
>>
>> /*
>> * Scan the range before this pfn for a buddy that straddles it
>> */
>> static unsigned long find_straddling_buddy(unsigned long start_pfn)
>> {
>> 	int order = 0;
>> 	struct page *page;
>> 	unsigned long pfn = start_pfn;
>>
>> 	while (!PageBuddy(page = pfn_to_page(pfn))) {
>> 		/* Nothing found */
>> 		if (++order > MAX_ORDER)
>> 			return start_pfn;
>> 		pfn &= ~0UL << order;
>> 	}
>>
>> 	/*
>> 	 * Found a preceding buddy, but does it straddle?
>> 	 */
>> 	if (pfn + (1 << buddy_order(page)) > start_pfn)
>> 		return pfn;
>>
>> 	/* Nothing found */
>> 	return start_pfn;
>> }
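>>
>> (Aside, in case the masking walk looks opaque: each iteration aligns
>> pfn down to the next-higher order boundary, i.e. the only positions
>> where a buddy large enough to cover start_pfn could begin. A
>> standalone user-space sketch of just the arithmetic, assuming
>> MAX_ORDER == 10:
>>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>> 	unsigned long pfn = 1027;	/* arbitrary example start_pfn */
>> 	int order;
>>
>> 	for (order = 1; order <= 10; order++) {
>> 		pfn &= ~0UL << order;	/* align down to 1 << order */
>> 		printf("order %2d: candidate pfn %lu\n", order, pfn);
>> 	}
>> 	return 0;
>> }
>>
>> which steps 1027 -> 1026 -> 1024 and then stays at 1024, the
>> order-10 aligned start that could hold a buddy covering pfn 1027.)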
>>
>>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>
>>> order = buddy_order(page);
>>> move_to_free_list(page, zone, order, old_mt, new_mt);
>>> + /*
>>> + * Set the page migratetype 1) only after we have moved all free pages
>>> + * in one pageblock and 2) for all pageblocks within the page.
>>> + *
>>> + * For 1): move_to_free_list() checks the page migratetype against
>>> + * old_mt, and changing one page's migratetype affects all pages
>>> + * within the same pageblock. So if we move more than one free
>>> + * page in the same pageblock, setting the migratetype right after
>>> + * the first move_to_free_list() triggers the warning in the
>>> + * following move_to_free_list().
>>> + *
>>> + * For 2): when a free page's order is greater than pageblock_order,
>>> + * all pageblocks within the free page need to be changed after
>>> + * move_to_free_list().
>>
>> I think this can be somewhat simplified.
>>
>> There are two assumptions we can make. Buddies always consist of 2^n
>> pages. And buddies and pageblocks are naturally aligned. This means
>> that if this pageblock has the start of a buddy that straddles into
>> the next pageblock(s), it must be the first page in the block. That in
>> turn means we can move the handling before the loop.
>>
>> If we split first, it also makes the loop a little simpler because we
>> know that any buddies that start inside this block cannot extend
>> beyond it (due to the alignment). The loop as it was originally
>> written can remain untouched.
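>>
>> To make the alignment point concrete: buddies are naturally aligned
>> to their own size, so the only buddy start within a block whose
>> buddy can extend past the block end is the block's first pfn. A
>> standalone user-space sketch (illustration only, assuming
>> pageblock_order == 9 and MAX_ORDER == 10) that checks this
>> exhaustively:
>>
>> #include <assert.h>
>>
>> int main(void)
>> {
>> 	const unsigned long block = 1UL << 9;	/* pageblock_nr_pages */
>> 	int order;
>>
>> 	for (order = 0; order <= 10; order++) {
>> 		unsigned long size = 1UL << order, start;
>>
>> 		/* natural alignment: buddy starts are multiples of size */
>> 		for (start = 0; start < block; start += size)
>> 			if (start + size > block)	/* straddles out... */
>> 				assert(start == 0);	/* ...only from the head */
>> 	}
>> 	return 0;
>> }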
>>
>>> + */
>>> + if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
>>> + for (pfn2 = pfn;
>>> + pfn2 < min_t(unsigned long,
>>> + pfn + (1 << order),
>>> + end_pfn + 1);
>>> + pfn2 += pageblock_nr_pages) {
>>> + set_pageblock_migratetype(pfn_to_page(pfn2),
>>> + new_mt);
>>> + mt_changed_pfn = pfn2;
>>
>> Hm, this seems to assume that start_pfn to end_pfn can be more than
>> one block. Why is that? This function is only used on single blocks.
>
> You are right. I made unnecessary assumptions when I wrote the code.
>
>>
>>> + }
>>> + /* split the free page if it goes beyond the specified range */
>>> + if (pfn + (1 << order) > (end_pfn + 1))
>>> + split_free_page(page, order, end_pfn + 1 - pfn);
>>> + }
>>> pfn += 1 << order;
>>> pages_moved += 1 << order;
>>> }
>>> - set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> + /* set migratetype for the remaining pageblocks */
>>> + for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
>>> + pfn2 <= end_pfn;
>>> + pfn2 += pageblock_nr_pages)
>>> + set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>>
>> If I rework the code on the above, I'm arriving at the following:
>>
>> static int move_freepages(struct zone *zone, unsigned long start_pfn,
>> 			  unsigned long end_pfn, int old_mt, int new_mt)
>> {
>> 	struct page *start_page = pfn_to_page(start_pfn);
>> 	int pages_moved = 0;
>> 	unsigned long pfn;
>>
>> 	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
>> 	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>>
>> 	/*
>> 	 * A free page may be comprised of 2^n blocks, which means our
>> 	 * block of interest could be head or tail in such a page.
>> 	 *
>> 	 * If we're a tail, update the type of our block, then split
>> 	 * the page into pageblocks. The splitting will do the leg
>> 	 * work of sorting the blocks into the right freelists.
>> 	 *
>> 	 * If we're a head, split the page into pageblocks first. This
>> 	 * ensures the migratetypes still match up during the freelist
>> 	 * removal. Then do the regular scan for buddies in the block
>> 	 * of interest, which will handle the rest.
>> 	 *
>> 	 * In theory, we could try to preserve 2^1 and larger blocks
>> 	 * that lie outside our range. In practice, MAX_ORDER is
>> 	 * usually one or two pageblocks anyway, so don't bother.
>> 	 *
>> 	 * Note that this only applies to page isolation, which calls
>> 	 * this on random blocks in the pfn range! When we move stuff
>> 	 * from inside the page allocator, the pages are coming off
>> 	 * the freelist (can't be tail) and multi-block pages are
>> 	 * handled directly in the stealing code (can't be a head).
>> 	 */
>>
>> 	/* We're a tail */
>> 	pfn = find_straddling_buddy(start_pfn);
>> 	if (pfn != start_pfn) {
>> 		struct page *free_page = pfn_to_page(pfn);
>>
>> 		set_pageblock_migratetype(start_page, new_mt);
>> 		split_free_page(free_page, buddy_order(free_page),
>> 				pageblock_nr_pages);
>> 		return pageblock_nr_pages;
>> 	}
>>
>> 	/* We're a head */
>> 	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
>> 		split_free_page(start_page, buddy_order(start_page),
>> 				pageblock_nr_pages);
>
> This actually can be:
>
> /* We're a head */
> if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
> 	set_pageblock_migratetype(start_page, new_mt);
> 	split_free_page(start_page, buddy_order(start_page),
> 			pageblock_nr_pages);
> 	return pageblock_nr_pages;
> }
>
>
>>
>> 	/* Move buddies within the block */
>> 	while (pfn <= end_pfn) {
>> 		struct page *page = pfn_to_page(pfn);
>> 		int order, nr_pages;
>>
>> 		if (!PageBuddy(page)) {
>> 			pfn++;
>> 			continue;
>> 		}
>>
>> 		/* Make sure we are not inadvertently changing nodes */
>> 		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
>> 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>>
>> 		order = buddy_order(page);
>> 		nr_pages = 1 << order;
>>
>> 		move_to_free_list(page, zone, order, old_mt, new_mt);
>>
>> 		pfn += nr_pages;
>> 		pages_moved += nr_pages;
>> 	}
>>
>> 	set_pageblock_migratetype(start_page, new_mt);
>>
>> 	return pages_moved;
>> }
>>
>> Does this look reasonable to you?
>
> Looks good to me. Thanks.
>
>>
>> Note that the page isolation specific stuff comes first. If this code
>> holds up, we should be able to move it to page-isolation.c and keep it
>> out of the regular allocator path.
>
> You mean move the tail and head parts to set_migratetype_isolate()?
> And split move_freepages_block() into separate prep_move_freepages_block(),
> the tail and head code, and move_freepages()? It should work, and it
> follows a code pattern similar to steal_suitable_fallback().

The attached patch has all the suggested changes; let me know how it
looks to you. Thanks.

--
Best Regards,
Yan, Zi
From 32e7aefe352785b29b31b72ce0bb8b4e608860ca Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@xxxxxxxxxx>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH] mm/page_isolation: split cross-pageblock free pages during
isolation

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages that are only partially covered by the specified range,
leaving a race window open[1]. Fix it by splitting such pages before calling
move_freepages().

The common code for finding the start pfn of a free page straddling a given
pfn is factored out into find_straddling_buddy().

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@xxxxxxxxxxx/

Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
---
include/linux/page-isolation.h | 7 +++
mm/page_alloc.c | 94 ++++++++++++++++++++--------------
mm/page_isolation.c | 90 ++++++++++++++++++++------------
3 files changed, 121 insertions(+), 70 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 901915747960..4873f1a41792 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,8 +34,15 @@ static inline bool is_migrate_isolate(int migratetype)
#define REPORT_FAILURE 0x2

void set_pageblock_migratetype(struct page *page, int migratetype);
+unsigned long find_straddling_buddy(unsigned long start_pfn);
int move_freepages_block(struct zone *zone, struct page *page,
int old_mt, int new_mt);
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
+ unsigned long *start_pfn,
+ unsigned long *end_pfn,
+ int *num_free, int *num_movable);
+int move_freepages(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn, int old_mt, int new_mt);

int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 928bb595d7cc..74831a86f41d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,15 +865,15 @@ int split_free_page(struct page *free_page,
struct zone *zone = page_zone(free_page);
unsigned long free_page_pfn = page_to_pfn(free_page);
unsigned long pfn;
- unsigned long flags;
int free_page_order;
int mt;
int ret = 0;

- if (split_pfn_offset == 0)
- return ret;
+ /* zone lock should be held when this function is called */
+ lockdep_assert_held(&zone->lock);

- spin_lock_irqsave(&zone->lock, flags);
+ if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+ return ret;

if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
ret = -ENOENT;
@@ -899,7 +899,6 @@ int split_free_page(struct page *free_page,
split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
}
out:
- spin_unlock_irqrestore(&zone->lock, flags);
return ret;
}
/*
@@ -1588,21 +1587,52 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order) { return NULL; }
#endif

+/*
+ * Scan the range before this pfn for a buddy that straddles it
+ */
+unsigned long find_straddling_buddy(unsigned long start_pfn)
+{
+ int order = 0;
+ struct page *page;
+ unsigned long pfn = start_pfn;
+
+ while (!PageBuddy(page = pfn_to_page(pfn))) {
+ /* Nothing found */
+ if (++order > MAX_ORDER)
+ return start_pfn;
+ pfn &= ~0UL << order;
+ }
+
+ /*
+ * Found a preceding buddy, but does it straddle?
+ */
+ if (pfn + (1 << buddy_order(page)) > start_pfn)
+ return pfn;
+
+ /* Nothing found */
+ return start_pfn;
+}
+
/*
* Move the free pages in a range to the freelist tail of the requested type.
* Note that start_page and end_pages are not aligned on a pageblock
* boundary. If alignment is required, use move_freepages_block()
*/
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
+int move_freepages(struct zone *zone, unsigned long start_pfn,
unsigned long end_pfn, int old_mt, int new_mt)
{
- struct page *page;
- unsigned long pfn;
- unsigned int order;
+ struct page *start_page = pfn_to_page(start_pfn);
int pages_moved = 0;
+ unsigned long pfn = start_pfn;
+
+ VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+ VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+ /* Move buddies within the block */
+ while (pfn <= end_pfn) {
+ struct page *page = pfn_to_page(pfn);
+ int order, nr_pages;

- for (pfn = start_pfn; pfn <= end_pfn;) {
- page = pfn_to_page(pfn);
if (!PageBuddy(page)) {
pfn++;
continue;
@@ -1613,16 +1643,20 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
VM_BUG_ON_PAGE(page_zone(page) != zone, page);

order = buddy_order(page);
+ nr_pages = 1 << order;
+
move_to_free_list(page, zone, order, old_mt, new_mt);
- pfn += 1 << order;
- pages_moved += 1 << order;
+
+ pfn += nr_pages;
+ pages_moved += nr_pages;
}
- set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+
+ set_pageblock_migratetype(start_page, new_mt);

return pages_moved;
}

-static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
unsigned long *start_pfn,
unsigned long *end_pfn,
int *num_free, int *num_movable)
@@ -6138,7 +6172,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
- int order;
int ret = 0;

struct compact_control cc = {
@@ -6212,28 +6245,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
* isolated thus they won't get removed from buddy.
*/

- order = 0;
- outer_start = start;
- while (!PageBuddy(pfn_to_page(outer_start))) {
- if (++order > MAX_ORDER) {
- outer_start = start;
- break;
- }
- outer_start &= ~0UL << order;
- }
-
- if (outer_start != start) {
- order = buddy_order(pfn_to_page(outer_start));
-
- /*
- * outer_start page could be small order buddy page and
- * it doesn't include start page. Adjust outer_start
- * in this case to report failed page properly
- * on tracepoint in test_pages_isolated()
- */
- if (outer_start + (1UL << order) <= start)
- outer_start = start;
- }
+ /*
+ * the outer_start page could be a small-order buddy page that doesn't
+ * include the start page; in that case, find_straddling_buddy() returns
+ * start, so that the failed page is reported properly by the tracepoint
+ * in test_pages_isolated()
+ */
+ outer_start = find_straddling_buddy(start);

/* Make sure the range is really isolated. */
if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5f8c658c0853..c6a4e02ed588 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,15 +178,61 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
migratetype, isol_flags);
if (!unmovable) {
- int nr_pages;
int mt = get_pageblock_migratetype(page);
+ unsigned long start_pfn, end_pfn, free_page_pfn;
+ struct page *start_page;

- nr_pages = move_freepages_block(zone, page, mt, MIGRATE_ISOLATE);
/* Block spans zone boundaries? */
- if (nr_pages == -1) {
+ if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) {
spin_unlock_irqrestore(&zone->lock, flags);
return -EBUSY;
}
+
+ /*
+ * A free page may be comprised of 2^n blocks, which means our
+ * block of interest could be head or tail in such a page.
+ *
+ * If we're a tail, update the type of our block, then split
+ * the page into pageblocks. The splitting will do the leg
+ * work of sorting the blocks into the right freelists.
+ *
+ * If we're a head, split the page into pageblocks first. This
+ * ensures the migratetypes still match up during the freelist
+ * removal. Then do the regular scan for buddies in the block
+ * of interest, which will handle the rest.
+ *
+ * In theory, we could try to preserve 2^1 and larger blocks
+ * that lie outside our range. In practice, MAX_ORDER is
+ * usually one or two pageblocks anyway, so don't bother.
+ *
+ * Note that this only applies to page isolation, which calls
+ * this on random blocks in the pfn range! When we move stuff
+ * from inside the page allocator, the pages are coming off
+ * the freelist (can't be tail) and multi-block pages are
+ * handled directly in the stealing code (can't be a head).
+ */
+ start_page = pfn_to_page(start_pfn);
+
+ free_page_pfn = find_straddling_buddy(start_pfn);
+ /*
+ * 1) We're a tail: free_page_pfn != start_pfn
+ * 2) We're a head: free_page_pfn == start_pfn &&
+ * PageBuddy(start_page) &&
+ * buddy_order(start_page) > pageblock_order
+ *
+ * In both cases, the free page needs to be split.
+ */
+ if (free_page_pfn != start_pfn ||
+ (PageBuddy(start_page) &&
+ buddy_order(start_page) > pageblock_order)) {
+ struct page *free_page = pfn_to_page(free_page_pfn);
+
+ set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
+ split_free_page(free_page, buddy_order(free_page),
+ pageblock_nr_pages);
+ } else
+ move_freepages(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE);
+
zone->nr_isolate_pageblock++;
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
@@ -380,11 +426,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
if (PageBuddy(page)) {
int order = buddy_order(page);

- if (pfn + (1UL << order) > boundary_pfn) {
- /* free page changed before split, check it again */
- if (split_free_page(page, order, boundary_pfn - pfn))
- continue;
- }
+ VM_WARN_ONCE(pfn + (1UL << order) > boundary_pfn,
+ "a free page sits across isolation boundary");

pfn += 1UL << order;
continue;
@@ -408,8 +451,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
* can be migrated. Otherwise, fail the isolation.
*/
if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
- int order;
- unsigned long outer_pfn;
int page_mt = get_pageblock_migratetype(page);
bool isolate_page = !is_migrate_isolate_page(page);
struct compact_control cc = {
@@ -427,9 +468,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
/*
* XXX: mark the page as MIGRATE_ISOLATE so that
* no one else can grab the freed page after migration.
- * Ideally, the page should be freed as two separate
- * pages to be added into separate migratetype free
- * lists.
+ * The page should be freed into separate migratetype
+ * free lists, unless the free page order is greater
+ * than pageblock order. That is not the case here,
+ * since gigantic hugetlb pages are freed as order-0
+ * pages and LRU pages do not cross pageblocks.
*/
if (isolate_page) {
ret = set_migratetype_isolate(page, page_mt,
@@ -451,25 +494,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,

if (ret)
goto failed;
- /*
- * reset pfn to the head of the free page, so
- * that the free page handling code above can split
- * the free page to the right migratetype list.
- *
- * head_pfn is not used here as a hugetlb page order
- * can be bigger than MAX_ORDER, but after it is
- * freed, the free page order is not. Use pfn within
- * the range to find the head of the free page.
- */
- order = 0;
- outer_pfn = pfn;
- while (!PageBuddy(pfn_to_page(outer_pfn))) {
- /* stop if we cannot find the free page */
- if (++order > MAX_ORDER)
- goto failed;
- outer_pfn &= ~0UL << order;
- }
- pfn = outer_pfn;
+
+ pfn = head_pfn + nr_pages;
continue;
} else
#endif
--
2.42.0
