Re: What size anonymous folios should we allocate?

From: Ryan Roberts
Date: Wed Mar 01 2023 - 05:25:12 EST


I'd like to throw in my 2p here. Although quick disclaimer first; I'm new to mm
so I'm sure I'll say a bunch of dumb stuff - please go easy ;-)

In the few workloads that I'm focused on, I can see a disparity in performance
between a kernel configured for 4K vs 16K pages. My goal is to release the extra
performance we see in the 16K variant into the 4K variant.

My results show that a ~half of the uplift is down to SW efficiency savings in
the kernel; 4x fewer data aborts (vast majority for anon memory) and less effort
spent in mm-heavy syscalls (as expected). And the other ~half is down to HW; the
TLB has less pressure, which causes everything to speed up a bit. But this "bit"
is important, given most of the time is spent in user space, which only benefits
from the HW part. See [1] for more details.

Newer Arm CPUs have a uarch feature called Hardware Page Aggregation (HPA). This
allows the TLB to aggregate 2-8 physically- and virtually- contiguous pages into
a single TLB entry to reduce pressure. (Note this is separate from the contig
bit and is invisible from a SW programming perspective).

So my hope is that I can get the equivalent SW efficiencies with a 4K base page
size and large anonymous folios, and also benefit from a lot of the HW
performance due to it all naturally fitting HPA's requirements.


On 21/02/2023 21:49, Matthew Wilcox wrote:
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable. It is foolish to expect the distro
> packager or the sysadmin to be able to make such a decision. The
> correct decision will depend upon the instantaneous workload of the
> entire machine and we'll want different answers for different VMAs.
>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.

Firmly agree.

>
> For pagecache memory, we use the per-fd readahead code; if readahead has
> been successful in the past we bump up the folio size until it reaches
> its maximum. There is no equivalent for anonymous memory.
>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it scans.
> At the moment, it always attempts to promote straight from order 0
> to PMD size. Instead, if it finds four adjacent order-0 folios,
> it will allocate an order-2 folio to replace them. Next time it
> scans, it finds four order-2 folios and replaces them with a single
> order-4 folio. And so on, up to PMD order.


>From the SW efficiencies perspective, what is the point of doing a replacement
after you have allocated all the order-0 folios? Surely that just adds more
overhead? I think the aim has to be to try to allocate the correct order up
front to cut down the allocation cost; one order-2 allocation is ~4x less
expensive than 4x order-0, right?

I wonder if it is preferable to optimistically allocate a mid-order folio to
begin with, then later choose to split or merge from there? Perhaps these folios
could initially go on a separate list to make them faster to split and reclaim
the unused portions when under memory pressure? (My data/workloads suggest 16K
allocations are knee, and making them bigger than that doesn't proportionally
improve performance).

>
> B. A further modification is that it will require three of the four
> folios being combined to be on the active list. If two (or more)
> of the four folios are inactive, we should leave them alone; either
> they will remain inactive and eventually be evicted, or they will be
> activated and eligible for merging in a future pass of khugepaged.
>
> C. We add a new wrinkle to the LRU handling code. When our scan of the
> active list examines a folio, we look to see how many of the PTEs
> mapping the folio have been accessed. If it is fewer than half, and
> those half are all in either the first or last half of the folio, we
> split it. The active half stays on the active list and the inactive
> half is moved to the inactive list.
>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.
>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second is
> COW on a file-backed MAP_PRIVATE and the third is COW of a post-fork
> anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear idea
> about what the scale factor should be.

Ahh - perhaps I misunderstood what you were saying above. My experience has been
that order-2 seems to be the knee in terms of performance gain, so perhaps one
approach would be to start with order-2 allocations, then adjust based on the
observed page fault pattern within the VMA? i.e. if you're getting mostly
in-order faults, increase the VMA's scaling factor, and if its mostly random,
decrease. (just a suggestion based on intuition, so feel free to shoot it down).

>
> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is largely
> irrelevant to the application's usage pattern what size the page cache
> has chosen to cache the file. I might start out very conservatively
> here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should probably
> inherit that choice and we should default to the same size folio that
> was already found.
>
>
> I don't stay current with the research literature, so if someone wants
> to point me to a well-studied algorithm and let me know that I can stop
> thinking about this, that'd be great. And if anyone wants to start
> working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which will
> be no different from today. Except that maybe we'll be able to iterate
> towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.
>
>

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@xxxxxxx/