Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages

From: Frank van der Linden
Date: Wed Dec 04 2024 - 12:18:01 EST


On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>
> Mateusz Guzik <mjguzik@xxxxxxxxx> writes:
>
> > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
> >> Fresh hugetlb pages are zeroed out when they are faulted in,
> >> just like with all other page types. This can take up a good
> >> amount of time for larger page sizes (e.g. around 40
> >> milliseconds for a 1G page on a recent AMD-based system).
> >>
> >> This normally isn't a problem, since hugetlb pages are typically
> >> mapped by the application for a long time, and the initial
> >> delay when touching them isn't much of an issue.
> >>
> >> However, there are some use cases where a large number of hugetlb
> >> pages are touched when an application (such as a VM backed by these
> >> pages) starts. For 256 1G pages and 40ms per page, this would take
> >> 10 seconds, a noticeable delay.
> >
> > The current huge page zeroing code is not that great to begin with.
>
> Yeah definitely suboptimal. The current huge page zeroing code is
> both slow and it trashes the cache while zeroing.
>
> > There was a patchset posted some time ago to remedy at least some of it:
> > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@xxxxxxxxxx/
> >
> > but it apparently fell through the cracks.
>
> As Joao mentioned that got side tracked due to the preempt-lazy stuff.
> Now that lazy is in, I plan to follow up on the zeroing work.
>
> > Any games with "background zeroing" are notoriously crappy and I would
> > argue one should exhaust other avenues before going there -- at the end
> > of the day the cost of zeroing will have to get paid.
>
> Yeah and the background zeroing has dual cost: the cost in CPU time plus
> the indirect cost to other processes due to the trashing of L3 etc.

I'm not sure what you mean here - any caching side effects of zeroing
happen regardless of who does it, right? It doesn't matter if it's a
kthread or the calling thread.

If you're concerned about the caching side effects in general, using
non-temporal instructions helps (e.g. movnti on x86). See the link I
mentioned for a patch that was sent years ago (
https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@xxxxxxxxxx/
). Using movnti on x86 definitely helps performance (up to 50% in my
experiments). Which is great, but it still leaves considerable delay
for the use case I mentioned.

- Frank