Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance

From: David Hildenbrand
Date: Fri Jul 07 2023 - 12:07:14 EST

In reply to: Ryan Roberts: "Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance"
Next in thread: Ryan Roberts: "Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance"

On 07.07.23 17:13, Ryan Roberts wrote:

On 07/07/2023 15:07, David Hildenbrand wrote:

On 07.07.23 15:57, Matthew Wilcox wrote:

On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:

On 07.07.23 11:52, Ryan Roberts wrote:

On 07/07/2023 09:01, Huang, Ying wrote:

Although we can use smaller page order for FLEXIBLE_THP, it's hard to
avoid internal fragmentation completely. So, I think that finally we
will need to provide a mechanism for the users to opt out, e.g.,
something like "always madvise never" via
/sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's
a good idea to reuse the existing interface of THP.
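For readers unfamiliar with the interface mentioned above: the sysfs file lists all modes on one line and marks the active one with square brackets, e.g. "always [madvise] never". A minimal user-space sketch for extracting the active mode from such a line follows. The helper name active_mode() is hypothetical and not part of any kernel or libc API; the caller is assumed to have already read the line from the file.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a selector line such as
 * "always [madvise] never" into out.
 * Returns 0 on success, -1 if no bracketed token fits into out.
 * Hypothetical helper for illustration only.
 */
static int active_mode(const char *line, char *out, size_t out_len)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close)
		return -1;

	len = (size_t)(close - open - 1);
	if (len == 0 || len >= out_len)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```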

I wouldn't want to tie this to the existing interface, simply because that
implies that we would want to follow the "always" and "madvise" advice too. That
means that on a thp=madvise system (which is certainly the case for Android and
other client systems) we would have to disable large anon folios for VMAs that
haven't explicitly opted in. That breaks the intention that this should be an
invisible performance boost. I think it's important to set the policy for use of

It will never ever be a completely invisible performance boost, just like
ordinary THP.

Using the exact same existing toggle is the right thing to do. If someone
specifies "never" or "madvise", then do exactly that.

It might make sense to have more modes or additional toggles, but
"madvise=never" means no memory waste.

I hate the existing mechanisms. They are an abdication of our
responsibility, and an attempt to blame the user (be it the sysadmin
or the programmer) of our code for using it wrongly. We should not
replicate this mistake.

I don't agree regarding the programmer responsibility. In some cases the
programmer really doesn't want to get more memory populated than requested --
and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.

Regarding the madvise=never/madvise/always (sys admin decision), memory waste
(and nailing down bugs or working around them in customer setups) have been very
good reasons to let the admin have a word.

Our code should be auto-tuning. I posted a long, detailed outline here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@xxxxxxxxxxxxxxxxxxxx/

Well, "auto-tuning" also should be perfect for everybody, but once reality
strikes you know it isn't.

If people don't feel like using THP, let them have a word. The "madvise" config
option is probably more controversial. But the "always vs. never" absolutely
makes sense to me.

I remember I raised it already in the past, but you *absolutely* have to
respect the MADV_NOHUGEPAGE flag. There is user space out there (for
example, userfaultfd) that doesn't want the kernel to populate any
additional page tables. So if you have to respect that already, then also
respect MADV_HUGEPAGE, simple.
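As an illustration of the user-space side of this argument, here is a minimal sketch of how an application might opt a sparse anonymous mapping out of THP. mmap(), madvise() and MADV_NOHUGEPAGE are the real Linux interfaces; the wrapper name map_without_thp() is hypothetical. Note that madvise() can fail (for example with EINVAL on kernels built without THP support), so the sketch reports that result separately instead of treating it as fatal.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of private anonymous memory and ask the kernel not to
 * back it with huge pages. Returns NULL if the mapping itself fails.
 * The madvise() return value is stored in *madvise_ret when non-NULL.
 * Hypothetical wrapper for illustration only.
 */
static void *map_without_thp(size_t len, int *madvise_ret)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret;

	if (p == MAP_FAILED)
		return NULL;

	ret = madvise(p, len, MADV_NOHUGEPAGE);
	if (madvise_ret)
		*madvise_ret = ret;
	return p;
}
```

An application that does this expects a touch of one base page to populate exactly one base page, which is the expectation being defended here.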

Possibly having uffd enabled on a VMA should disable using large folios,

There are cases where we enable uffd *after* already touching memory (postcopy
live migration in QEMU being the famous example). That doesn't fly.

I can get behind that. But the notion that userspace knows what it's
doing ... hahaha. Just ignore the madvise flags. Userspace doesn't
know what it's doing.

If user space sets MADV_NOHUGEPAGE, it knows exactly what it is doing ... in
some cases. And these include cases I care about: messing with sparse VM memory :)

I have strong opinions against populating more than required when user space set
MADV_NOHUGEPAGE.

I can see your point about honouring MADV_NOHUGEPAGE, so I think that it is
reasonable to fall back to allocating an order-0 page in a VMA that has it set.
The app has gone out of its way to explicitly set it, after all.

I think the correct behaviour for the global thp controls (cmdline and sysfs)
is less obvious though. I could get on board with disabling large anon folios
globally when thp="never". But for other situations, I would prefer to keep
large anon folios enabled (treat "madvise" as "always"), with the argument that
their order is much smaller than traditional THP and therefore the internal
fragmentation is significantly reduced. I really don't want to end up with user
space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
anon folios.
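The policy described in the two paragraphs above can be summarised as a small decision function. This is a user-space model under stated assumptions, not kernel code: the enum, the function name anon_folio_order_sketch() and the preferred_order parameter are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the global THP setting. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/*
 * Return the folio order to attempt for an anonymous fault:
 *  - a VMA marked MADV_NOHUGEPAGE always gets order-0;
 *  - thp="never" disables large anon folios globally;
 *  - otherwise "madvise" is treated like "always", so no opt-in via
 *    MADV_HUGEPAGE is needed.
 */
static unsigned int anon_folio_order_sketch(enum thp_mode mode,
					    bool vma_nohugepage,
					    unsigned int preferred_order)
{
	if (vma_nohugepage)
		return 0;
	if (mode == THP_NEVER)
		return 0;
	return preferred_order;
}
```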

I was briefly playing with a nasty idea of an additional "madvise-pmd" option (that could be the new default), that would use PMD THP only in madvise'd regions, and ordinary everywhere else. But let's disregard that for now. I think there is a bigger issue (below).

I still feel that it would be better for the thp and large anon folio controls
to be independent though - what's the argument for tying them together?

Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs. 2 MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the way they are mapped ...

It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, but how is "2 MiB vs. 2 MiB" different?
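To put numbers on that comparison, here is a small arithmetic sketch. It assumes 8-byte page-table entries that fill exactly one base page per table (as on aarch64), so a PMD entry covers page_size * (page_size / 8) bytes. The function names are hypothetical.

```c
#include <stdint.h>

/*
 * Folio order needed to cover size bytes with base pages of page_size
 * bytes. Both values are assumed to be powers of two.
 */
static unsigned int order_for_size(uint64_t size, uint64_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}

/*
 * Bytes covered by one PMD entry, assuming 8-byte entries and a
 * last-level table that occupies one base page.
 */
static uint64_t pmd_size(uint64_t page_size)
{
	return page_size * (page_size / 8);
}
```

Under these assumptions, 2 MiB is order-9 and exactly one PMD entry on a 4k kernel, but only order-5 on a 64k kernel, where a PMD entry covers 512 MiB. The amount of memory populated is the same in both cases; only the mapping level differs.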

Having said that, I think we have to make up our mind how much control we want to give user space. Again, the "2 MiB vs. 2 MiB" case nicely shows that it's not trivial: memory waste is a real issue on some systems where we limit THP to madvise().

Just throwing it out for discussion:

What about keeping the "always / madvise / never" semantics (and MADV_NOHUGEPAGE ...) but having an additional config knob that specifies in which cases we *still* allow flexible THP even though the system was configured for "madvise"?

I can't come up with a good name for that, but something like "max_auto_size=64k" could be something reasonable to set. We could have an arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons to have such bare-minimum ones)
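A rough model of how such a knob could interact with the existing modes, purely to make the proposal concrete. Everything here is hypothetical: the types, the function name folio_size_allowed_sketch() and the exact precedence rules are one possible reading of the idea, not an agreed design and not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the global THP setting. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Hypothetical per-VMA hints set via madvise(). */
struct vma_hints {
	bool hugepage;   /* MADV_HUGEPAGE was set */
	bool nohugepage; /* MADV_NOHUGEPAGE was set */
};

/*
 * Decide whether a large folio of folio_size bytes may be used:
 *  - MADV_NOHUGEPAGE and "never" always win;
 *  - "always" or an explicit MADV_HUGEPAGE allows any size;
 *  - under "madvise" without a hint, sizes up to max_auto_size
 *    (e.g. 64k) are still allowed.
 */
static bool folio_size_allowed_sketch(enum thp_mode mode,
				      struct vma_hints hints,
				      size_t folio_size,
				      size_t max_auto_size)
{
	if (hints.nohugepage || mode == THP_NEVER)
		return false;
	if (mode == THP_ALWAYS || hints.hugepage)
		return true;
	return folio_size <= max_auto_size;
}
```

With max_auto_size=64k on a "madvise" system, a 64k folio would be allowed in an unhinted VMA while a 2 MiB one would still require MADV_HUGEPAGE.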

--
Cheers,

David / dhildenb
