Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

From: Yang Shi
Date: Tue Jul 09 2024 - 13:57:29 EST

On 7/4/24 3:03 AM, Catalin Marinas wrote:
> On Tue, Jul 02, 2024 at 03:21:41PM -0700, Yang Shi wrote:
> > On 7/1/24 12:43 PM, Catalin Marinas wrote:
> > > I don't follow OpenJDK development but I heard that updates are
> > > dragging quite a lot. I can't tell whether people have picked up the
> > > atomic_add(0) feature and whether, by the time a kernel patch would
> > > make it into distros, they'd also move to the MADV_POPULATE_WRITE
> > > pattern.
> > As Christopher said, there may be similar uses of atomics in other
> > applications, so I don't worry too much about the dead code problem
> > IMHO. OpenJDK is just the use case that we know about. There may be
> > unknown unknowns. And the distros typically backport patches from the
> > mainline kernel to their kernels, so there are likely to be combos
> > like old kernel + backported patch + old OpenJDK.
> That's a somewhat valid argument I heard internally as well. People tend
> to change or patch kernel versions more often than OpenJDK versions
> because of the risk of breaking their Java stack. But, arguably, one can
> backport the madvise() OpenJDK change since it seems to have other
> benefits on x86 as well.

> > AFAICT, users do expect behavior similar to x86 (one fault instead of
> > two). Actually, we noticed this problem due to a customer report.
> It's not a correctness problem, only a performance one. A big part of
> that could be mitigated by some adjustment to how THP pages are
> allocated on a write fault (though we'd still have an unnecessary read
> fault and some TLBI). See Ryan's sub-thread.

Yes, the TLBI is still sub-optimal and it is quite noticeable.
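
For reference, a minimal userspace sketch of the access pattern we are
discussing (my own illustration, not OpenJDK's actual pretouch code): one
atomic add of 0 per page of a fresh anonymous mapping. Without the patch,
each page on arm64 takes a read fault followed by a write fault (with the
extra TLBI in between), while x86 takes a single write fault for the same
access.

/*
 * Illustrative pretouch sketch: an atomic RMW that adds 0 to every page
 * of freshly mmap'ed anonymous memory. The atomic is what makes the
 * first touch report as a read fault on arm64 today.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB of untouched anonymous memory */
	long page = sysconf(_SC_PAGESIZE);

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Pretouch: one atomic add of 0 per page forces each page in. */
	for (size_t off = 0; off < len; off += page)
		atomic_fetch_add_explicit((_Atomic uint64_t *)(buf + off), 0,
					  memory_order_relaxed);

	return 0;
}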


> > > There's a point (c) as well on the overhead of reading the faulting
> > > instruction. I hope that's negligible but I haven't measured it.
> > I think I showed the benchmark data requested by Anshuman in the
> > earlier email discussion.
> Do you mean this:
>
> https://lore.kernel.org/r/328c4c86-96c8-4896-8b6d-94f2facdac9a@xxxxxxxxxxxxxxxxxxxxxx
>
> I haven't figured out what the +24% case is in there; it seems pretty
> large.

I think I ran the test for many more iterations and didn't see such an
outlier anymore.

>> To rule out the worst case, I also ran the test for 100 iterations
>> with 160 threads, then compared the worst case:
>>
>>      N           Min           Max        Median           Avg        Stddev
>>    100         34770         84979         65536       63537.7     10358.873
>>    100         38077         87652         65536      63119.02     8792.7399


> What you haven't benchmarked (I think) is the case where the instruction
> is in an exec-only mapping. The subsequent instruction read will fault
> and that adds to the overhead. Currently exec-only mappings are not
> widespread, but I have heard of people planning to move in this
> direction as a default build configuration.

I tested exec-only on QEMU TCG, but I don't have hardware that supports
EPAN. I don't think a performance benchmark on QEMU TCG makes sense since
it is quite slow; such a small overhead is unlikely to be measurable on
it.
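
For reference, by an exec-only mapping I mean text mapped with PROT_EXEC
and no PROT_READ, roughly like the sketch below (illustrative only, not my
actual test case; the instruction encoding and flow are just for
demonstration):

/*
 * Execute-only mapping sketch: code mapped PROT_EXEC with PROT_READ
 * dropped. On hardware/kernels with true exec-only support (e.g. EPAN),
 * a data read of such a page faults, which is why fetching the faulting
 * instruction from user memory in the fault handler would take an extra
 * fault in this case.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* AArch64 "ret", so the page holds one valid instruction. */
static const unsigned int code[] = { 0xd65f03c0 };

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memcpy(p, code, sizeof(code));
	__builtin___clear_cache((char *)p, (char *)p + sizeof(code));

	/* Drop read permission: the mapping becomes execute-only. */
	if (mprotect(p, page, PROT_EXEC)) {
		perror("mprotect");
		return 1;
	}

	((void (*)(void))p)();	/* executing the page still works */
	/* *(volatile int *)p would fault here on true exec-only hardware */
	return 0;
}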


> It could be worked around with a new flavour of get_user() that uses the
> non-T LDR instruction, since the user mapping is readable by the kernel
> (that's the case with EPAN, prior to PIE, and I think we can change this
> for PIE configurations as well). But it adds to the complexity of this
> patch when the kernel already offers a MADV_POPULATE_WRITE solution.
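
For anyone following along, the userspace-side MADV_POPULATE_WRITE
approach looks roughly like this (a sketch, assuming Linux 5.14+ where the
flag exists; the mapping size and flow are just illustrative):

/*
 * Prefault the heap with write semantics via MADV_POPULATE_WRITE so the
 * later atomic accesses never take the read-then-write double fault.
 */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23	/* may be missing from older libc headers */
#endif

int main(void)
{
	size_t len = 64UL << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Populate all pages as if written: one syscall, write faults only. */
	if (madvise(buf, len, MADV_POPULATE_WRITE)) {
		perror("madvise");
		return 1;
	}

	/* ... pretouch / heap use proceeds without the extra read faults ... */
	return 0;
}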