Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance

From: David Hildenbrand
Date: Thu Aug 31 2023 - 04:10:16 EST


On 31.08.23 10:02, Yin, Fengwei wrote:


On 8/31/2023 3:57 PM, David Hildenbrand wrote:
On 31.08.23 03:40, Huang, Ying wrote:
Ryan Roberts <ryan.roberts@xxxxxxx> writes:

On 15/08/2023 22:32, Huang, Ying wrote:
Hi, Ryan,

Ryan Roberts <ryan.roberts@xxxxxxx> writes:

Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced since those ops now become per-folio.
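
As a rough illustration of the fault-count claim above, the sketch below counts the faults needed to populate a region when each fault maps one whole folio. The 4 KiB base page size and the order-4 folio are assumptions picked for the example (the description only says "a determined order"), and it assumes an aligned region where every allocation succeeds at the requested order.

```python
PAGE_SIZE = 4096  # assumed base page size in bytes


def faults_to_touch(region_bytes: int, folio_order: int) -> int:
    """Faults needed to populate a region if each fault maps one folio
    of 2**folio_order base pages (no fallback to smaller orders)."""
    folio_bytes = PAGE_SIZE << folio_order
    return -(-region_bytes // folio_bytes)  # ceiling division


region = 2 * 1024 * 1024  # a 2 MiB anonymous region
print(faults_to_touch(region, 0))  # single pages: 512
print(faults_to_touch(region, 4))  # order-4 (64 KiB) folios: 32
```

The per-folio operations (ref counting, rmap, lru) would shrink by the same factor under these assumptions; real savings depend on how often allocation falls back.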

The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
which defaults to disabled for now; the long term aim is for this to
default to enabled, but there are some risks around internal
fragmentation that need to be better understood first.

Large anonymous folio (LAF) allocation is integrated with the existing
(PMD-order) THP and single (S) page allocation according to this policy,
where fallback (>) is performed for various reasons, such as the
proposed folio order not fitting within the bounds of the VMA, etc:

                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S
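
The table above can be read as a plain lookup from (prctl, sysfs, VMA hint) to a fallback chain. The following is only a model of that table for experimenting with the policy, not the patch's implementation; the names `POLICY` and `fallback_chain` and the string keys are made up for this sketch.

```python
# Fallback chains, most preferred first:
# THP = PMD-order THP, LAF = large anonymous folio, S = single page.
S = ("S",)
LAF_S = ("LAF", "S")
THP_LAF_S = ("THP", "LAF", "S")

# Keyed by the sysfs "enabled" value, then by the VMA hint.
# Only consulted when prctl has not disabled THP for the process.
POLICY = {
    "never":   {"none": LAF_S,     "MADV_HUGEPAGE": LAF_S,     "MADV_NOHUGEPAGE": S},
    "madvise": {"none": LAF_S,     "MADV_HUGEPAGE": THP_LAF_S, "MADV_NOHUGEPAGE": S},
    "always":  {"none": THP_LAF_S, "MADV_HUGEPAGE": THP_LAF_S, "MADV_NOHUGEPAGE": S},
}


def fallback_chain(prctl_disabled: bool, sysfs: str, hint: str = "none"):
    """Return the allocation attempts for one fault, in order."""
    if prctl_disabled:
        return S  # prctl=dis column: single pages whatever sysfs says
    return POLICY[sysfs][hint]


print(fallback_chain(False, "never"))                     # ('LAF', 'S')
print(fallback_chain(False, "madvise", "MADV_HUGEPAGE"))  # ('THP', 'LAF', 'S')
```

Note that with this policy sysfs=never still yields LAF for unhinted VMAs, which is the point the replies below take issue with.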

IMHO, we should use the following semantics as you have suggested
before.

                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | S           | LAF>S         | THP>LAF>S
MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S

Or even,

                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | S           | S             | THP>LAF>S
MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S

From the implementation point of view, a PTE-mapped PMD-sized THP is
almost no different from a LAF (which is just a small-sized THP).  It
will be confusing to distinguish them from the interface point of view.

So, IMHO, the real difference is the policy.  For example, prefer
PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
interface is used to specify system global policy.  In the long term, it
can be something like below,

never:      S               # disable all THP
madvise:                    # never by default, control via madvise()
always:     THP>LAF>S       # prefer PMD-sized THP in fact
small:      LAF>S           # prefer small sized THP
auto:                       # use in-kernel heuristics for THP size
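
A minimal sketch of how the long-term modes listed above could map to fallback chains. Only the never, always and small chains come from the list; the chain used for a hinted VMA under madvise is an assumption borrowed from the tables earlier in this thread, and auto is left unimplemented because its heuristics are not specified. MADV_NOHUGEPAGE is not modelled. `global_chain` is a hypothetical name.

```python
# Fallback chains, most preferred first:
# THP = PMD-sized THP, LAF = small-sized THP, S = single page.
CHAINS = {
    "never":  ("S",),
    "always": ("THP", "LAF", "S"),
    "small":  ("LAF", "S"),
}


def global_chain(mode: str, madv_hugepage: bool = False):
    """Map a system-global policy mode to an allocation fallback chain."""
    if mode == "madvise":
        # "never by default, control via madvise()".
        # Assumption: a hinted VMA then behaves like "always".
        return CHAINS["always"] if madv_hugepage else CHAINS["never"]
    if mode == "auto":
        # The in-kernel heuristics are not described in the mail.
        raise NotImplementedError("auto policy is unspecified")
    return CHAINS[mode]


print(global_chain("small"))          # ('LAF', 'S')
print(global_chain("madvise"))        # ('S',)
print(global_chain("madvise", True))  # ('THP', 'LAF', 'S')
```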

But we may not be ready to add new policies now.  So, before the new
policies are ready, we can add a debugfs interface to override the
original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
we have tuned enough workloads and collected enough data, we can add new
policies to the sysfs interface.
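
For anyone scripting tests against the existing knob: the sysfs file lists every mode and marks the active one in square brackets (for example "always [madvise] never"). A small sketch for reading it follows; the helper names `active_mode` and `read_active_mode` are made up here, and the proposed debugfs override does not exist, so it is not modelled.

```python
import re

THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_mode(contents: str) -> str:
    """Return the bracketed (currently selected) mode from the file contents."""
    match = re.search(r"\[(\w+)\]", contents)
    if match is None:
        raise ValueError(f"no active mode found in {contents!r}")
    return match.group(1)


def read_active_mode(path: str = THP_ENABLED) -> str:
    """Read the sysfs file and return the active mode (needs a THP-enabled kernel)."""
    with open(path) as f:
        return active_mode(f.read())


print(active_mode("always [madvise] never"))  # madvise
```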

I think we can all imagine many policy options. But we don't really have much
evidence yet for what is best. The policy I'm currently using is intended to
give some flexibility for testing (use LAF without THP by setting sysfs=never,
use THP without LAF by compiling without LAF) without adding any new knobs at
all. Given that, surely we can defer these decisions until we have more data?

In the absence of data, your proposed solution sounds very sensible to me. But
for the purposes of scaling up perf testing, I don't think it's essential given
the current policy will also produce the same options.

If we were going to add a debugfs knob, I think the higher priority would be a
knob to specify the folio order. (But again, I would rather avoid it if possible.)

I totally understand we need some way to control PMD-sized THP and LAF
to tune the workload, and nobody likes debugfs knobs.

My concern about the interface is that we have no way to disable LAF
system-wide without rebuilding the kernel.  In the future, should we add
a new policy to /sys/kernel/mm/transparent_hugepage/enabled that is
stricter than "never"?  "really_never"?

Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).

The time slot of the meeting is not friendly to our timezone; it's
around 1 or 2 AM. Yes, I know it's very hard to find a good time slot
for US, EU and Asia. :(

:/

Yeah, even for me in Germany it's usually already around 6-7pm.


So maybe we still need to discuss it through mail?

I don't think we'll be done discussing that in one session. One of the main goals is to get some input from the wider MM community.

--
Cheers,

David / dhildenb