Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages
From: Frank van der Linden
Date: Tue Dec 03 2024 - 15:22:23 EST
On Tue, Dec 3, 2024 at 10:43 AM Frank van der Linden <fvdl@xxxxxxxxxx> wrote:
>
> On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
> >
> > On 03/12/2024 12:06, Michal Hocko wrote:
> > > On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> > >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > >>> Any games with "background zeroing" are notoriously crappy and I would
> > >>> argue one should exhaust other avenues before going there -- at the end
> > >>> of the day the cost of zeroing will have to get paid.
> > >>
> > >> I understand that the concept of background prezeroing has been, and
> > >> will be, met with some resistance. But, do you have any specific
> > >> concerns with the patch I posted? It's pretty well isolated from the
> > >> rest of the code, and optional.
> > >
> > > The biggest concern I have is that the overhead is paid by everybody on
> > > the system - it is treated as system overhead even though only part
> > > of the workload benefits from hugetlb pages. In other words, the
> > > workload using those pages is not fully accounted for its use of them.
> > >
> > > If the startup latency is a real problem, is there a way to work around
> > > that in the userspace by preallocating hugetlb pages ahead of time
> > > before those VMs are launched and hand over already pre-allocated pages?
> >
> > It should be relatively simple to actually do this. Mike and I experimented
> > with this a couple of years back but we never had the chance to send it over. IIRC
> > if we:
> >
> > - add the PageZeroed tracking bit when a page is zeroed
> > - clear it in the write (fixup/non-fixup) fault-path
> >
> > [somewhat similar to this series I suspect]
> >
> > Then what's left is to change the lookup of free hugetlb pages
> > (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> > pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> > behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> > free them back 'as zeroed' to implement a userspace scrubber. And in principle
> > existing apps should see no difference. The amount of changes is consequently
> > significantly smaller (or it looked as such in a quick PoC years back).
>
> This would work, and is easy to do, but:
> * You now have a userspace daemon that depends on kernel-internal behavior.
> * It has no way to track how much work is left to do or what needs
> to be done (unless it is part of an application that is the only user
> of hugetlbfs on the system).
>
> >
> > Something extra on the top would perhaps be the ability to select a lookup
> > heuristic such that we can pick the search method of
> > non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> > UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> > without too much of a dance.
>
> Again, that would probably work, but if you take a step back: you now
> have a kernel behavior that can be guided in certain directions, but
> no guarantees and no stats to see if things are working out. And an
> explicit allocation method option (basically: take from the head or
> the tail of the freelist). The picture is getting murkier. At least
> with the patch I sent you have a clearly defined, optional, behavior
> that can be switched on or off, and stats to see if it's working.
>
> I do understand the argument against pre-zeroing that is not
> accounted to the current thread. I would counter that benefiting from
> work by kernel threads is not unheard of in the kernel today already.
> Also, the other proposals so far also have another thread doing the
> zeroing - it just is explicitly started by userspace. So, the cost is
> still not paid by the user of the pages. You just end up with
> explicitly controlling who does pay the cost. Which I suppose is
> better, but it's still not trivial to get it completely right (you
> perhaps could do it at the container level with some trickery).
>
> What we have done so far is to bind the khzerod threads introduced in
> this patch to CPUs in such a way that it doesn't interfere with the
> rest of the system. Which you would also have to do with any userspace
> solution.
>
> Again, this is optional - if you are a system manager who prefers to
> have the resources used by zeroing hugetlb pages explicitly
> accounted to the actual user, you can simply leave this behavior
> disabled (it's off by default).
>
> I guess I can summarize my thoughts like this: while I understand the
> argument against doing this outside of the context of the actual user
> of the pages, this is 1) optional, and 2) so far the other solutions
> introduce interfaces that I don't think are that great, or would
> require maintaining a hugetlb 'shadow pool' in userspace through
> hugetlbfs files.

One more thing: any userspace solution has one basic problem: the
hugetlb pages will be unavailable while they are being zeroed out, as
the userspace process or thread will have to map+touch them, taking
them off the freelist. So now another process that needs the hugetlb
pages, expecting them to be there based on initial configuration and
what it's done so far, may end up getting an unexpected -ENOMEM because
one or more pages have been temporarily allocated by userspace prezero
threads.

My patch doesn't have that issue - the pages stay on the freelist, the
total number of available pages does not change. In the rare case that
a freshly allocated page is being prezeroed, you'll just have to wait
until it's done (taking up no more time than doing it yourself).

Now, you can implement something like this in userspace (if I get
-ENOMEM, check with the prezero thread or process), but it's not
great.
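
To make the userspace workaround concrete, here is a minimal sketch of
what a hugetlb consumer would have to do if pages can be temporarily
held by a userspace prezero thread. It is not part of the patch. The
helper names (map_hugetlb, map_hugetlb_retry) are made up for
illustration, a plain sleep stands in for "check with the prezero
thread", and it assumes the usual case where an anonymous MAP_HUGETLB
mapping reserves its pages at mmap() time, so an exhausted pool shows
up as ENOMEM from mmap() itself.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Try to map `len` bytes of anonymous hugetlb memory. `len` must be a
 * multiple of the default huge page size. Returns the mapping, or NULL
 * with errno set by mmap().
 */
static void *map_hugetlb(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical workaround: on ENOMEM, assume a userspace prezero thread
 * may be holding pages and retry up to `max_tries` times, sleeping
 * `delay_ms` between attempts. Any other error is returned immediately.
 */
static void *map_hugetlb_retry(size_t len, int max_tries, long delay_ms)
{
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};

	for (int i = 0; i < max_tries; i++) {
		void *p = map_hugetlb(len);

		if (p)
			return p;
		if (errno != ENOMEM)
			return NULL;	/* e.g. EINVAL: retrying won't help */
		if (i + 1 < max_tries)
			nanosleep(&delay, NULL);
	}

	/* Out of attempts: report the pool as exhausted. */
	errno = ENOMEM;
	return NULL;
}
```

A caller would use something like map_hugetlb_retry(2UL << 20, 5, 10)
for a single 2M page. Note that the caller cannot tell "the pool is
really exhausted" apart from "a scrubber is holding a page right now",
which is exactly why this approach is not great.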
- Frank