Re: THP backed thread stacks

From: William Kucharski
Date: Thu Mar 09 2023 - 20:42:39 EST




> On Mar 9, 2023, at 17:05, Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
>> I think the hugepage alignment in their environment was somewhat luck.
>> One suggestion made was to change stack size to avoid alignment and
>> hugepage usage. That 'works' but seems kind of hackish.
>
> That was my first thought, if the alignment was purely due to luck,
> and not somebody manually specifying it. Agreed it's kind of hackish
> if anyone can get bit by this by sheer luck.

I don't agree it's "hackish" at all, but I go more into that below.

>
>> Also, David H pointed out the somewhat recent commit to align sufficiently
>> large mappings to THP boundaries. This is going to make all stacks huge
>> page aligned.
>
> I think that change was reverted by Linus in commit 0ba09b173387
> ("Revert "mm: align larger anonymous mappings on THP boundaries""),
> until its perf regressions were better understood -- and I haven't
> seen a revamp of it.

It's too bad it was reverted, though I understand the concerns regarding it.

From my point of view, if an address is properly aligned and a caller is
asking for 2M+ to be mapped, it's going to be advantageous from a purely
system-focused point of view to do that mapping with a THP. It's less work
for the kernel, generates fewer future page faults, involves less page
table manipulation and in general means less hassle all around in the
generic case. Of course there are all sorts of cases where it may not be
the best solution from a performance point of view, but in general I've
always preferred the approach of "do it if you CAN" rather than "do it
only if asked" for such mappings.
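
To make the "properly aligned and 2M+" condition concrete, here is a
minimal user-space sketch that counts how many naturally aligned 2M ranges
fit inside a mapping. It is not how the kernel decides anything; the
2M huge page size (x86-64 PMD size) and the names THP_SIZE_ASSUMED and
thp_candidate_ranges() are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized huge pages, as on x86-64. */
#define THP_SIZE_ASSUMED ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: number of whole, naturally aligned 2 MiB ranges
 * contained in [addr, addr + len). Zero means no part of the mapping
 * could be backed by a huge page of the assumed size.
 */
static size_t thp_candidate_ranges(uintptr_t addr, size_t len)
{
    /* Round the start up and the end down to a 2 MiB boundary. */
    uintptr_t start = (addr + THP_SIZE_ASSUMED - 1) & ~(THP_SIZE_ASSUMED - 1);
    uintptr_t end = (addr + len) & ~(THP_SIZE_ASSUMED - 1);

    return end > start ? (size_t)((end - start) / THP_SIZE_ASSUMED) : 0;
}
```

With 4K base pages, each such range would otherwise take up to 512
separate page faults to populate, which is the "fewer future page faults"
part of the argument. A 2M mapping that starts one base page past a 2M
boundary contains no candidate range at all.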

The same bloat argument as the original concern applies to text mappings:
you may map a large text region with a THP, and locality of reference may
be such that the application actually touches little of the mapped space.
Even so, it seems that on average you're better off mapping via a THP
when possible.

Unless the system receives specific madvise() hints from the caller, it's
difficult to determine heuristically whether a caller is "really" going to
use the 2M+ space it asks for or is just being "greedy" and/or trying to
reserve space for "growth later". So I would prefer an approach where
callers madvise() to shut the behavior off rather than to enable it.
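
For what it's worth, the opt-out side can already be expressed today with
MADV_NOHUGEPAGE. The following is a minimal sketch, assuming a caller that
allocates its own thread stacks with mmap(); the helper name
map_stack_nohuge() is hypothetical and the error handling is deliberately
thin.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous region intended for use as a
 * thread stack and ask the kernel not to back it with huge pages.
 * Returns NULL if the mapping itself fails.
 */
static void *map_stack_nohuge(size_t size)
{
    void *stack = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return NULL;

    /*
     * Opt out of THP for this range. The call can fail (for example
     * with EINVAL on a kernel built without THP support); that is
     * treated as non-fatal here because the mapping is still usable.
     */
    (void)madvise(stack, size, MADV_NOHUGEPAGE);

    return stack;
}
```

The returned region could then be handed to pthread_attr_setstack() before
pthread_create(), with the caller responsible for munmap() once the thread
has exited.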

But that's just my $.02 in a discussion where lots of pennies are already
being scattered about. :-)