Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

From: Zi Yan
Date: Wed Sep 27 2023 - 23:22:46 EST


On 26 Sep 2023, at 14:19, David Hildenbrand wrote:

> On 21.09.23 16:47, Zi Yan wrote:
>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>
>>> On 21.09.23 04:31, Zi Yan wrote:
>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>
>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>
>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>
>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>> end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>
>>>>>>>>>>>> /* Do not cross zone boundaries */
>>>>>>>>>>>> +#if 0
>>>>>>>>>>>> if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> start = zone->zone_start_pfn;
>>>>>>>>>>>> +#else
>>>>>>>>>>>> + if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> + start = pfn;
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>> return false;
>>>>>>>>>>>> I can still trigger warnings.
>>>>>>>>>>>
>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Just to be really clear,
>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>> path WITHOUT your change.
>>>>>>>>>>
>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>
>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>
>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>
>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G. However, I tried
>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>> pages requested and that did not help. I also noticed that I only seem
>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>> script.
>>>>>>>>>>>>
>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>
>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>
>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>
>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>
>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>
>>>>>>>>> while true; do
>>>>>>>>> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>> echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>> wait
>>>>>>>>> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>> echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> I will look into the issue.
>>>>>>>
>>>>>>> Nice!
>>>>>>>
>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>> several reboots and letting it run for minutes.
>>>>>>
>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>> parameters respectively in half.
>>>>>>
>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>
>>>>>> #0 #1
>>>>>> start_isolate_page_range()
>>>>>> isolate_single_pageblock()
>>>>>> set_migratetype_isolate(tail)
>>>>>> lock zone->lock
>>>>>> move_freepages_block(tail) // nop
>>>>>> set_pageblock_migratetype(tail)
>>>>>> unlock zone->lock
>>>>>> del_page_from_freelist(head)
>>>>>> expand(head, head_mt)
>>>>>> WARN(head_mt != tail_mt)
>>>>>> start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>> for (pfn = start_pfn, pfn < end_pfn)
>>>>>> if (PageBuddy())
>>>>>> split_free_page(head)
>>>>>>
>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>
>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>> granularity. With your changes below, it no longer works, because if there
>>>>> is an unmovable page in
>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>> the allocation fails but it would succeed in current implementation.
>>>>>
>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>
>>>>> I am working on that.
>>>>
>>>> After spending half a day on this, I think it is much harder than I thought
>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>
>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>> free list yet.
>>>>
>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>> the free pages.
>>>>
>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>> will need to be split to get their subpages into the right free lists.
>>>>
>>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>>> to correct the free list later. But to strictly enforce freelist migratetype
>>>> hygiene, extra work is needed at free page path to split the free page into
>>>> the right freelists.
>>>>
>>>> I need more time to think about how to get alloc_contig_range() properly.
>>>> Help is needed for the bullet point 4.
>>>
>>>
>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>
>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>
> It's complicated and I wish I would have had more time to review it
> back then ... or now to clean it up later.
>
> Unfortunately, nobody else did have the time to review it back then ... maybe we can
> do better next time. David doesn't scale.
>
> Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock()
> really is sub-optimal (and mostly code duplication from alloc_contig_range).

I felt the same when I wrote the code. But I thought it was the only way out.

>
>> I can do it after I fix this. That change might or might not help only if we make
>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>> overwrite existing migratetype, the code might not need to split a page and move
>> it to MIGRATE_ISOLATE freelist?
>
> Did someone test how memory offlining plays along with that? (I can try myself
> within the next 1-2 weeks)
>
> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
> though.
>
> ret = start_isolate_page_range(start_pfn, end_pfn,
> MIGRATE_MOVABLE,
> MEMORY_OFFLINE | REPORT_FAILURE,
> GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);

Since a full MAX_ORDER range is passed, no free page split will happen.

>
>>
>> The fundamental issue in alloc_contig_range() is that to work at
>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>> now checks first pageblock migratetype, so such a page needs to be removed
>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>> finally put back to multiple free lists. This needs to be done at isolation stage
>> before free pages are removed from their free lists (the stage after isolation).
>
> One idea was to always isolate larger chunks, and handle movability checks/split/etc
> at a later stage. Once isolation would be decoupled from the actual/original migratetype,
> the could have been easier to handle (especially some corner cases I had in mind back then).

I think it is a good idea. When I coded alloc_contig_range() up, I tried to
accommodate existing set_migratetype_isolate(), which calls has_unmovable_pages().
If these two are decoupled, set_migrateype_isolate() can work on MAX_ORDER-aligned
ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
Let me give this a try.

>
>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>> in their original migratetype and check migratetype before allocating a page,
>> that might help. But that might add extra work (e.g., splitting a partially
>> isolated free page before allocation) in the really hot code path, which is not
>> desirable.
>
> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
> separate isolate list, but one per "proper migratetype". But again, just some random
> thoughts I had back then, I never had sufficient time to think it all through.

Got it. I will think about it.

One question on separate MIGRATE_ISOLATE:

the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
bit instead of being one of migratetype. But now there are 5 migratetypes +
MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype_bit is needed.
But current migratetype implementation is a word-based operation, requiring
NR_PAGEBLOCK_BITS to be divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.
An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
additional changes. Let me know if you have a better idea. Thanks.



--
Best Regards,
Yan, Zi

Attachment: signature.asc
Description: OpenPGP digital signature