Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages

From: Joao Martins
Date: Tue Dec 03 2024 - 11:24:02 EST


On 03/12/2024 15:57, Mateusz Guzik wrote:
> On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>
>> On 03/12/2024 12:06, Michal Hocko wrote:
>>> If the startup latency is a real problem is there a way to workaround
>>> that in the userspace by preallocating hugetlb pages ahead of time
>>> before those VMs are launched and hand over already pre-allocated pages?
>>
>> It should be relatively simple to actually do this. Me and Mike had experimented
>> ourselves a couple years back but we never had the chance to send it over. IIRC
>> if we:
>>
>> - add the PageZeroed tracking bit when a page is zeroed
>> - clear it in the write (fixup/non-fixup) fault-path
>>
>> [somewhat similar to this series I suspect]
>>
>> Then what's left is to change the lookup of free hugetlb pages
>> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
>> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
>> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
>> free them back 'as zeroed' to implement a userspace scrubber. And in principle
>> existing apps should see no difference. The amount of changes is consequently
>> significantly smaller (or it looked as such in a quick PoC years back).
>>
>> Something extra on the top would perhaps be the ability to select a lookup
>> heuristic such that we can pick the search method of
>> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
>> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
>> without too much of a dance.
>>
>
> Ye after the qemu prefaulting got pointed out I started thinking about
> a userlevel daemon which would do the work proposed here.
>
> Except I got stuck at a good way to do it. The mmap + load from the
> area + munmap triple does work but also entails more overhead than
> necessary, but I only have some handwaving how to not do it. :)
>
What I was trying to suggest above is that it would be no different than how you
use hugetlb. I am not entirely sure I follow the triple work part on unmap.
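
For reference, the mmap + load + munmap sequence being discussed could look
roughly like the sketch below. It is only an illustration: the function name
and the 2 MiB page size are assumptions, and on current kernels nothing is
remembered about the page being zero once it is freed (that is the part the
tracking bit would add).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages (a common x86-64 default); adjust as needed. */
#define HPAGE_SIZE (2UL << 20)

/*
 * Hypothetical helper: map npages anonymous hugetlb pages read-only, load one
 * byte from each so every page is faulted in through a read fault, then unmap.
 * Returns 0 on success, -1 if the mapping could not be created or removed
 * (e.g. no free huge pages, or a different huge page size).
 */
int touch_hugepages_readonly(size_t npages)
{
    size_t len = npages * HPAGE_SIZE;
    unsigned char sink = 0;
    void *addr;

    if (npages == 0)
        return -1;

    addr = mmap(NULL, len, PROT_READ,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    /* One load per huge page is enough to trigger the fault. */
    const volatile unsigned char *p = addr;
    for (size_t i = 0; i < npages; i++)
        sink ^= p[i * HPAGE_SIZE];
    (void)sink;

    return munmap(addr, len) == 0 ? 0 : -1;
}
```

Since the mapping is never written to, a zeroed-tracking bit set at fault time
would still be set when the pages go back to the free pool.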

> Suppose a daemon of the sort exists and there is a machine with 4 or
> more NUMA domains to deal with. Further suppose it spawns at least one
> thread per such domain and tasksets them accordingly.
>
> Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
> indicating how many pages to zero out (or even just accept one page).
> This would avoid crap on munmap.
>
> This would still need majority of the patch, but all the zeroing
> policy would be taken out. Key point being that whatever specific
> behavior one sees fit, they can implement it in userspace, preventing
> future kernel patches to add more tweaks.

The kernel should still track whether a page is cleared or not -- so what I said
above was just letting the allocation zero out the page or not (if it's not
zeroed already) and just tweak the dirtiness of pages it picks before installing
PTEs. A scrubber would pick only dirty pages (and maybe fail if there aren't
any), and a VMM would pick clean pages (taking advantage of the scrubber work).
An explicit zero sounds somewhat limiting ... but hmm
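
To make the scrubber/VMM split concrete, here is a small userspace model of the
selection logic. It is a sketch only and not kernel code: the policy names, the
struct and all function names are made up for illustration, and "dirty" simply
means "not known to be zeroed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lookup policies; none of these names exist in the kernel. */
enum pick_policy {
    PICK_NONZEROED_FIRST, /* default: keep pre-zeroed pages for those who ask */
    PICK_NONZEROED_ONLY,  /* scrubber: fail if there is nothing left to zero */
    PICK_ZEROED_FIRST,    /* VMM: take advantage of the scrubber's work */
};

struct model_page {
    bool free;
    bool zeroed; /* stands in for the per-page zeroed tracking bit */
};

/* Take a free page according to policy; return its index or -1. */
static int pick_free_page(struct model_page *pool, size_t n,
                          enum pick_policy policy)
{
    bool want_zeroed = (policy == PICK_ZEROED_FIRST);
    int fallback = -1;

    assert(pool != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].free)
            continue;
        if (pool[i].zeroed == want_zeroed) {
            pool[i].free = false;
            return (int)i;
        }
        if (fallback < 0)
            fallback = (int)i; /* free, but not the preferred kind */
    }
    if (policy == PICK_NONZEROED_ONLY || fallback < 0)
        return -1;
    pool[fallback].free = false;
    return fallback;
}

/* Read fault: the page is zeroed if needed, and the bit is set. */
static void model_read_fault(struct model_page *p) { p->zeroed = true; }

/* Write fault: contents may change, so the bit is cleared. */
static void model_write_fault(struct model_page *p) { p->zeroed = false; }

/* Freeing keeps whatever zeroed state the page had. */
static void model_free(struct model_page *p) { p->free = true; }
```

A scrubber would loop on PICK_NONZEROED_ONLY, read-fault the page and free it
until the lookup fails; a VMM would ask with PICK_ZEROED_FIRST.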

What throws all this away (in primary MM) is the prefaulting with write as we
would clear the PageCleared bit all the time (I think that's what you mean by
'crap on munmap'?).

But there could be hope for systems with secondary page tables (with paging),
where the secondary faulting is the one in control of the cleared status. That
is because reads inside the VM ultimately trigger secondary-VM read-faults and
get fixed up later with write on writes.

Well, at least it would work given we don't prefault secondary page tables yet...