Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

From: David Hildenbrand
Date: Tue Jul 02 2024 - 08:58:36 EST

Next message: Stanimir Varbanov: "Re: [PATCH v1 3/8] PCI: brcmstb: Use bridge reset if available"
Previous message: Audra Mitchell: "[PATCH] Disable Clk-gate_tests for s390"
In reply to: Ryan Roberts: "Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions"
Next in thread: Ryan Roberts: "Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 02.07.24 14:36, Ryan Roberts wrote:

On 02/07/2024 12:22, David Hildenbrand wrote:

On 02.07.24 12:26, Ryan Roberts wrote:

On 01/07/2024 20:43, Catalin Marinas wrote:

On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:

On 6/28/24 10:24 AM, Catalin Marinas wrote:

This patch does feel a bit like working around a non-optimal user choice
in kernel space. Who knows, madvise() may even be quicker if you do a
single call for a larger VA vs touching each page.

IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
inefficiency in kernel. Two faults are not necessary if we know we are
definitely going to write the memory very soon, right?

I agree the Arm architecture behaviour is not ideal here and any
timelines for fixing it in hardware, if they do happen, are far into the
future. Purely from a kernel perspective, what I want though is make
sure that longer term (a) we don't create additional maintenance burden
and (b) we don't keep dead code around.

Point (a) could be mitigated if the architecture is changed so that any
new atomic instructions added to this range would also come with
additional syndrome information so that we don't have to update the
decoding patterns.

Point (b), however, depends on the OpenJDK and the kernel versions in
distros. Nick Gasson kindly provided some information on the OpenJDK
changes. The atomic_add(0) change happened in early 2022, about 5-6
months after MADV_POPULATE_WRITE support was added to the kernel. What's
interesting is Ampere already contributed MADV_POPULATE_WRITE support to
OpenJDK a few months ago:

https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a

The OpenJDK commit lacks explanation but what I gathered from the diff
is that this option is the preferred one in the presence of THP (which
most/all distros enable by default). If we merge your proposed kernel
patch, it will take time before it makes its way into distros. I'm
hoping that by that time, distros would have picked a new OpenJDK
version already that doesn't need the atomic_add(0) pattern. If that's
the case, we end up with some dead code in the kernel that's almost
never exercised.

I don't follow OpenJDK development but I heard that updates are dragging
quite a lot. I can't tell whether people have picked up the
atomic_add(0) feature and whether, by the time a kernel patch would make
it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.

There's a point (c) as well on the overhead of reading the faulting
instruction. I hope that's negligible but I haven't measured it.

Just to add to this, I note the existing kernel behaviour is that if a write
fault happens in a region that has a (RO) huge zero page mapped at PMD level,
then the PMD is shattered, the PTE of the fault address is populated with a
writable page and the remaining PTEs are populated with order-0 zero pages
(read-only).

That also recently popped up in [1]. CCing Jinjiang. Ever since I
replied there, I also thought some more about that handling in regard to the
huge zeropage.

This seems like odd behaviour to me. Surely it would be less effort and more
aligned with the app's expectations to notice the huge zero page in the PMD,
remove it, and install a THP, as would have been done if pmd_none() was true? I
don't think there is a memory bloat argument here because, IIUC, with the
current behaviour, khugepaged would eventually upgrade it to a THP anyway?

One detail: depending on the setting of khugepaged_max_ptes_none. zeropages
are treated like pte_none. But in the common case, that setting is left alone.

Ahh, got it. So in the common case, khugepaged won't actually collapse
unless/until a bunch more write faults occur in the 2M region, and in that case
there is a risk that changing this behaviour could lead to a memory bloat
regression.

Changing to this new behaviour would only be a partial solution for your use
case, since you would still have 2 faults. But it would remove the cost of the
shattering and ensure you have a THP immediately after the write fault. But I
can't think of a reason why this wouldn't be a generally useful change
regardless? Any thoughts?

The "let's read before we write" as used by QEMU migration code is the desire
to not waste memory by populating the zeropages. Deferring consuming memory
until really required.

    /*
     * We read one byte of each page; this will preallocate page tables if
     * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
     * where no page was populated yet. This might require adaption when
     * supporting other mappings, like shmem.
     */

So QEMU is concerned with preallocatiing page tables? I would have thought you
could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
through the allocator, fewer pud/pmd lock/unlocks, etc).

I think we are only concerned about the "shared zeropage" part. Everything else is just unnecessary detail that adds confusion here :) One requires the other.

Note that this is from migration code where we're supposed to write a single page we received from the migration source right now (not more). And we want to avoid allcoating memory if it can be avoided (usually for overcommit).

TBH I always assumed in the past the that huge zero page is only useful because
its a placeholder for a real THP that would be populated on write. But that's
obviously not the case at the moment. So other than a hack to preallocate the
pgtables with only 1 fault per 2M, what other benefits does it have?

I don't quite udnerstand that question. [2] has some details why the huge zeropage was added -- because we would have never otherwise received huge zeropages with THP enabled but always anon THP directly on read.

Without THP this works as expected. With THP this currently also works as
expected, but of course with the price [1] of not getting anon THP
immediately, which usually we don't care about. As you note, khugepaged might
fix this up later.

If we disable the huge zeropage, we would get anon THPs when reading instead of
small zeropages.

I wasn't aware of that behaviour either. Although that sounds like another
reason why allocating a THP over the huge zero page on write fault should be the
"more consistent" behaviour.

Reading [2] I think the huge zeropage was added to avoid the allocation of THP on read. Maybe for really only large readable regions, not sure why exactly.

As reply to [1], I suggested using preallcoation (using MADV_POPULATE_WRITE)
when we really care about that performance difference, which would also
avoid the huge zeropage completely, but it's also not quite optimal in some cases.

I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT,
which would just mark the VMA so that any read fault is upgraded to write.

I don't really know what to do here: changing the handling for the huge zeropage
only unconditionally does not sound too wrong, but the change in behavior
might (or might not) be desired for some use cases.

Reading from unpopulated memory can be a clear sign that really the shared zeropage
is desired (as for QEMU), and concurrent memory preallcoation/population should
ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2]
regarding
the common use cases for the huge zeropage back than.

The current huge zero page behavior on write fault sounds wonky to me. But I
agree there are better and more complete solutions to the identified use cases.
So unless something pops up where the change is a clear benefit, I guess better
to be safe and leave as is.

We've had that behavior for a quite a while ... so it's rather surprising to see multiple people reporting this right now.

I guess most use cases don't read from uninitialized memory barely write to it and care about getting THPs immediately.

For preallocation, MADVISE_POPULATE_WRITE is better. For QEMU migration? not sure what's really better. Maybe replacing the huge zeropage by a THP would be faster in some cases, but result in more memory consumption (and more page zeroing?) during migration in other cases.

--
Cheers,

David / dhildenb

Next message: Stanimir Varbanov: "Re: [PATCH v1 3/8] PCI: brcmstb: Use bridge reset if available"
Previous message: Audra Mitchell: "[PATCH] Disable Clk-gate_tests for s390"
In reply to: Ryan Roberts: "Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions"
Next in thread: Ryan Roberts: "Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]