Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

From: Roman Gushchin
Date: Mon Oct 05 2020 - 14:26:00 EST


On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> On 05.10.20 19:16, Roman Gushchin wrote:
> > On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> >> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> >>
> >>> On 02.10.20 10:10, Michal Hocko wrote:
> >>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>>>> - huge page sizes controllable by the userspace?
> >>>>>>>
> >>>>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>>>> have better control of their applications.
> >>>>>>
> >>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>>>> They get a very good control over page size and pool preallocation etc.
> >>>>>> So they can get what they need - assuming there is enough memory.
> >>>>>>
> >>>>>
> >>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>>>> to support. I can understand that there are some use cases that might
> >>>>> benefit from it, especially:
> >>>>
> >>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >>>> that can transparently split under memory pressure is a useful
> >>>> functionality. I cannot really judge how complex that would be
> >>>
> >>> Right, but that's then something different than serving (scarce,
> >>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> >>> wrong about *real* THP support, meaning, e.g., grouping consecutive
> >>> pages and converting them back and forth on demand. (E.g., 1GB ->
> >>> multiple 2MB -> multiple single pages), for example, when having to
> >>> migrate such a gigantic page. But that's very different from our
> >>> existing gigantic page code as far as I can tell.
> >>
> >> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> >> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >> which needs section size increase. In addition, unmoveable pages cannot
> >> be allocated in CMA, so allocating 1GB pages has much higher chance from
> >> it than from ZONE_NORMAL.
> >
> > s/higher chances/non-zero chances
>
> Well, the longer the system runs (and consumes a significant amount of
> available main memory), the less likely it is.
>
> >
> > Currently we have nothing that prevents the fragmentation of the memory
> > with unmovable pages on the 1GB scale. It means that in a common case
> > it's highly unlikely to find a contiguous GB without any unmovable page.
> > As of now, CMA seems to be the only working option.
> >
>
> And I completely dislike the use of CMA in this context (for example,
> allocating via CMA and freeing via the buddy by patching CMA when
> splitting up PUDs ...).
>
> > However, it seems there are other use cases for the allocation of contiguous
> > 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> > 1GB pages can reduce the fragmentation of the direct mapping.
>
> Yes, see RFC v1 where I already cced Mike.
>
> >
> > So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> > E.g. something like a second level of pageblocks. That would allow us to group
> > all unmovable memory in a few 1GB blocks and have more 1GB regions available for
> > gigantic THPs and other use cases. I'm looking now into how it can be done.
>
> Anything bigger than sections is somewhat problematic: you have to track
> that data somewhere. It cannot be the section (in contrast to pageblocks)

Well, it's not a large amount of data: the number of 1GB regions is not that
high even on very large machines.
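
To put rough numbers on it, here is a back-of-the-envelope sketch. The
8 bytes of state per 1GB region and the flat per-region array are assumptions
made purely for illustration; nothing like that exists in the patchset.

```python
GIB = 1 << 30
TIB = 1 << 40


def region_count(ram_bytes):
    """Number of 1GB regions needed to cover ram_bytes (rounded up)."""
    return (ram_bytes + GIB - 1) // GIB


def tracking_bytes(ram_bytes, bytes_per_region=8):
    """Size of a flat per-region array.

    bytes_per_region is a made-up figure, used only to get an order of
    magnitude for the tracking overhead.
    """
    return region_count(ram_bytes) * bytes_per_region


for ram in (64 * GIB, 1 * TIB, 16 * TIB):
    print(ram // GIB, "GiB of RAM:",
          region_count(ram), "regions,",
          tracking_bytes(ram), "bytes of per-region state")
```

Even at 16 TiB this is 16384 regions, so under these assumptions the state
fits in a few hundred KB at most.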

>
> > If anybody has any ideas here, I'd appreciate it a lot.
>
> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> somewhat mimics what CMA does (when sized reasonably), works well with
> memory hot(un)plug, and is immune to misconfiguration. Within such a
> zone, we can try to optimize the placement of larger blocks.

Thank you for pointing at it!

The main problem with it is the same as with ZONE_MOVABLE: it requires an
educated guess at boot time about a good size. I admit that CMA does too.

But I really hope that a long-term solution will not require pre-configuration.
I do not see why, fundamentally, we can't group unmovable allocations in a (few)
1GB regions. Basically, all we need to do is to choose a nearby 2MB block when
we don't have enough free pages in the unmovable free list and are going to
steal a new 2MB block. I know it doesn't work this way; this is just an
illustration. In reality, when stealing a block, under some conditions we might
want to steal the whole 1GB region. In that case the following unmovable
allocations will not lead to stealing new blocks from (potentially) different
1GB regions. I have no working code yet, I'm just thinking in this direction.
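
To make the "choose a nearby 2MB block" idea a bit more concrete, here is a toy
userspace model, not kernel code. All names in it (pick_block_to_steal,
simulate, CANDIDATES, ...) are hypothetical, and the policy is only one
possible reading of the idea: when an unmovable allocation has to steal a
movable 2MB block, prefer a block from a 1GB region that already contains
unmovable blocks.

```python
from collections import Counter

BLOCKS_PER_REGION = 512   # 2MB pageblocks per 1GB region (1GB / 2MB)
NR_REGIONS = 4            # toy machine with 4GB of memory
MOVABLE, UNMOVABLE = "movable", "unmovable"


def region_of(block):
    """Index of the 1GB region that contains a given 2MB block."""
    return block // BLOCKS_PER_REGION


def pick_block_to_steal(types, free_blocks):
    """Pick a free movable block to convert to unmovable.

    Prefer the block whose 1GB region already holds the most unmovable
    blocks; break ties by choosing the lowest block index.
    """
    per_region = Counter(region_of(b) for b, t in enumerate(types)
                         if t == UNMOVABLE)
    return max(free_blocks, key=lambda b: (per_region[region_of(b)], -b))


def simulate(candidates, steals, region_aware):
    """Return how many 1GB regions hold unmovable blocks after the steals."""
    types = [MOVABLE] * (NR_REGIONS * BLOCKS_PER_REGION)
    free = list(candidates)
    for _ in range(steals):
        # Without region awareness, take whatever block comes first.
        block = pick_block_to_steal(types, free) if region_aware else free[0]
        types[block] = UNMOVABLE
        free.remove(block)
    return len({region_of(b) for b, t in enumerate(types) if t == UNMOVABLE})


# Free movable blocks scattered over regions 3, 1, 2, 1, 1, 3.
CANDIDATES = [3 * 512 + 7, 1 * 512 + 5, 2 * 512 + 9,
              1 * 512 + 6, 1 * 512 + 100, 3 * 512 + 8]

print(simulate(CANDIDATES, 3, region_aware=False))  # prints 3
print(simulate(CANDIDATES, 3, region_aware=True))   # prints 1
```

In this toy run, three steals pollute three different 1GB regions without the
preference and only one with it. Stealing a whole 1GB region at once is not
modeled here.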

Thanks!