Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages

From: Mateusz Guzik
Date: Tue Dec 03 2024 - 11:02:51 EST


On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > If the startup latency is a real problem is there a way to workaround
> > that in the userspace by preallocating hugetlb pages ahead of time
> > before those VMs are launched and hand over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Mike and I experimented
> with this a couple of years back, but we never had the chance to send it
> over. IIRC if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault-path
>
> [somewhat similar to this series I suspect]
>
> Then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> free them back 'as zeroed' to implement a userspace scrubber. And in principle
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked as such in a quick PoC years back).
>
> Something extra on top would perhaps be the ability to select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> without too much of a dance.
>

Yeah, after the qemu prefaulting got pointed out I started thinking about
a userlevel daemon which would do the work proposed here.

Except I got stuck on finding a good way to do it. The mmap + load from
the area + munmap triple does work, but it entails more overhead than
necessary, and so far I only have some handwaving on how to avoid it. :)

Suppose a daemon of the sort exists and there is a machine with 4 or
more NUMA domains to deal with. Further suppose it spawns at least one
thread per such domain and tasksets them accordingly.

Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
indicating how many pages to zero out (or even just accept one page at a
time). This would avoid the crap on munmap.

This would still need the majority of the patch, but all the zeroing
policy would be taken out. The key point being that whatever specific
behavior one sees fit, they can implement it in userspace, preventing
future kernel patches from adding more tweaks.
--
Mateusz Guzik <mjguzik gmail.com>