Re: [RFC PATCH 00/16] 1GB THP support on x86_64

From: Zi Yan
Date: Thu Sep 10 2020 - 17:26:38 EST


On 10 Sep 2020, at 4:27, David Hildenbrand wrote:

> On 10.09.20 09:32, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@xxxxxxxx
>> but this particular subthread has diverged a bit and you might find it
>> interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>> On 09.09.20 15:19, Rik van Riel wrote:
>>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>>
>>>>>>> A global knob is insufficient. 1G pages will become a very precious
>>>>>>> resource as it requires a pre-allocation (reservation). So it really
>>>>>>> has to be an opt-in and the question is whether there is also some
>>>>>>> sort of access control needed.
>>>>>>
>>>>>> The 1GB pages do not require that much in the way of
>>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>>> which means it can be used for movable 4kB and 2MB
>>>>>> allocations when not being used for 1GB pages.
>>>>>
>>>>> That CMA has to be pre-reserved, right? That requires a
>>>>> configuration.
>>>>
>>>> To some extent, yes.
>>>>
>>>> However, because that pool can be used for movable 4kB and 2MB
>>>> pages as well as for 1GB pages, it would be easy to just set
>>>> the size of that pool to e.g. 1/3 or even 1/2 of memory for every
>>>> system.
>>>>
>>>> It isn't like the pool needs to be the exact right size. We
>>>> just need to avoid the "highmem problem" of having too little
>>>> memory for kernel allocations.
>>>>
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work longterm? Or put in
>> other words, why would this work any better than the existing fragmentation
>> avoidance techniques that the page allocator implements already - movability
>> grouping etc.? Please note that I am not deeply familiar with those but
>> my high level understanding is that we already try hard to not mix
>> movable and unmovable objects in the same page blocks as much as we can.
>
> Note that we group in pageblock granularity, which avoids fragmentation
> on a pageblock level, not on anything bigger than that. That especially
> affects MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.
>
> So once you run for some time on a system (especially thinking about
> page shuffling *within* a zone), trying to allocate a gigantic page will
> simply always fail - even if you always had plenty of free memory in
> your single zone.
>
>>
>> My suspicion is that a separate zone would work in a similar fashion. As
>> long as there is a lot of free memory then zone will be effectively
>> MOVABLE. Similar applies to normal zone when unmovable allocations are
>
> Note the difference to MOVABLE: if you really want, you *can* put
> unmovable allocations into that zone. So you can happily allocate gigantic
> pages from it. Or anything else you like. As the name suggests: "prefer
> movable allocations".
>
>> in minority. As soon as the Normal zone gets full of unmovable objects
>> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
>> block stealing when unmovable objects start spreading over movable page
>> blocks.
>
> Right, the long-term goal would be
> 1. To limit the chance of that happening. (e.g., size it in a way that's
> safe for 99.9% of all setups, resize dynamically on demand)
> 2. To limit the physical area where that is happening (e.g., find lowest
> possible pageblock etc.). That's more tricky but I consider this a pure
> optimization on top.
>
> As long as we stay in safe zone boundaries you get a benefit in most
> scenarios. As soon as we would have a (temporary) workload that would
> require more unmovable allocations we would fall back to polluting some
> pageblocks only.

The idea would work well until unmovable pages begin to overflow into
ZONE_PREFER_MOVABLE, or until we move the boundary of ZONE_PREFER_MOVABLE
to avoid that overflow. The issue comes from the lifetime of the unmovable
pages. Since some long-lived ones can sit around the boundary, there is no
guarantee that ZONE_PREFER_MOVABLE can grow back even if other unmovable
pages are deallocated. Ultimately, ZONE_PREFER_MOVABLE would shrink to a
small size and the situation would be back to what we have now.
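
To make the boundary argument concrete, here is a toy model (not kernel
code). It assumes the zone boundary is tracked in whole pageblocks, that
the normal zone sits below ZONE_PREFER_MOVABLE, and that a pageblock is
"pinned" while it holds at least one unmovable page. The function name and
the pinned[] array are made up for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: pinned[i] is true when pageblock i still holds
 * at least one unmovable page. Pageblocks [0, boundary) belong to the normal
 * zone and [boundary, n) to the hypothetical ZONE_PREFER_MOVABLE.
 *
 * Returns the lowest boundary that keeps every pinned pageblock out of
 * ZONE_PREFER_MOVABLE, never going below the originally configured boundary.
 */
static size_t lowest_boundary(const bool *pinned, size_t n, size_t orig_boundary)
{
	size_t boundary = orig_boundary;

	for (size_t i = orig_boundary; i < n; i++)
		if (pinned[i])
			boundary = i + 1;
	return boundary;
}
```

With an original boundary at pageblock 2 and a single long-lived unmovable
page left in pageblock 5, the boundary cannot drop below 6, even though
pageblocks 2 through 4 are completely free again.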

OK. I have a stupid question here. Why not just grow the pageblock to a
larger size, like 1GB? Fragmentation by unmovable pages would then happen
at a larger granularity, but unmovable pages would be less likely to be
allocated in a movable pageblock, since after one pageblock steal the kernel
has a whole 1GB pageblock for them. If other kinds of pageblocks run out,
movable and reclaimable pages can fall back to unmovable pageblocks.
What am I missing here?
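
For a rough sense of scale, the arithmetic behind that question can be
sketched as below. It assumes 4 KiB base pages; the helper names are
made up for this example and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SHIFT 12 /* assumption: 4 KiB base pages, as on x86-64 */

/* Order (log2 of the number of base pages) of a power-of-two block size. */
static unsigned int block_order(uint64_t block_bytes)
{
	unsigned int order = 0;

	/* Precondition: a power of two no smaller than one base page. */
	assert(block_bytes >= (1ULL << BASE_PAGE_SHIFT));
	assert((block_bytes & (block_bytes - 1)) == 0);

	while ((1ULL << (BASE_PAGE_SHIFT + order)) < block_bytes)
		order++;
	return order;
}

/* Number of whole pageblocks of the given size that fit in mem_bytes. */
static uint64_t nr_pageblocks(uint64_t mem_bytes, uint64_t block_bytes)
{
	return mem_bytes / block_bytes;
}
```

A 2MB pageblock is order 9 and a 1GB pageblock is order 18, so a 64GB
machine would go from 32768 pageblocks to just 64. Each steal would then
convert a full 1GB at once, which is the trade-off I am asking about.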

Thanks.



Best Regards,
Yan Zi
