Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

From: Catalin Marinas
Date: Wed Aug 21 2024 - 07:28:17 EST


On Thu, Jul 11, 2024 at 11:17:51AM -0700, Yang Shi wrote:
> On 7/11/24 10:43 AM, Catalin Marinas wrote:
> > On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
> > > On 7/10/24 2:22 AM, Catalin Marinas wrote:
> > > > > diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
> > > > > index 642bdf908b22..d30265d424e4 100644
> > > > > --- a/arch/arm64/mm/mmap.c
> > > > > +++ b/arch/arm64/mm/mmap.c
> > > > > @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
> > > > >         [VM_WRITE]                                      = PAGE_READONLY,
> > > > >         [VM_WRITE | VM_READ]                            = PAGE_READONLY,
> > > > >         /* PAGE_EXECONLY if Enhanced PAN */
> > > > > -        [VM_EXEC]                                       = PAGE_READONLY_EXEC,
> > > > > +        [VM_EXEC]                                       = PAGE_EXECONLY,
> > > > >         [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
> > > > >         [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
> > > > >         [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
> > > > In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
> > > > Otherwise it looks fine.
> > > Thanks. I just ran the same benchmark. Ran the modified page_fault1_thread
> > > (trigger read fault) in 100 iterations with 160 threads on 160 cores. This
> > > should be the worst contention case and collected the max data (worst
> > > latency). It shows the patch may incur ~30% overhead for exec-only case. The
> > > overhead should just come from the permission fault.
> > >
> > >     N           Min           Max        Median           Avg Stddev
> > > x 100        163840        219083        184471        183262 12593.229
> > > + 100        211198        285947        233608     238819.98 15253.967
> > > Difference at 95.0% confidence
> > >     55558 +/- 3877
> > >     30.3161% +/- 2.11555%
> > >
> > > This is a very extreme benchmark, I don't think any real life workload will
> > > spend that much time (sys vs user) in page fault, particularly read fault.
> > >
> > > With my atomic fault benchmark (populate 1G memory with atomic instruction
> > > then manipulate the value stored in the memory in 100 iterations so the user
> > > time is much longer than sys time), I saw around 13% overhead on sys time
> > > due to the permission fault, but no noticeable change for user and real
> > > time.
> > Thanks for running these tests.
> >
> > > So the permission fault does incur noticeable overhead for read fault on
> > > exec-only, but it may be not that bad for real life workloads.
> > So you are saying 3 faults is not that bad for real life workloads but
> > the 2 faults behaviour we have currently is problematic for OpenJDK. For
> > the OpenJDK case, I don't think the faulting is the main issue affecting
> > run-time performance but, over a longer run (or with more iterations in
> > this benchmark after the faulting in), you'd notice the breaking up of
> > the initial THP pages.
>
> I meant the extra permission fault for exec-only should be ok since the
> current implementation can't force write fault for exec-only anyway. It does
> incur noticeable overhead for read fault, but I'm not aware of any real life
> workloads are sensitive to read fault. The benchmark is for a very extreme
> worst case.

I agree microbenchmarks like this are not realistic but I wouldn't be
surprised if we hear of some large file mmap() read or very sparse
arrays read being affected.

Sorry, I forgot the details from the previous discussions. Are there any
benchmarks that show OpenJDK performing badly _and_ what the improvement
is with this patch? We should do similar comparison with the THP kernel
fix. If the THP fix gets us close to the same level of improvement, I
don't think we should bother with this patch.

> > Do you have any OpenJDK benchmarks that show the problem? It might be
> > worth running them with this patch:
> >
> > https://lore.kernel.org/r/rjudrmg7nkkwfgviqeqluuww6wu6fdrgdsfimtmpjee7lkz2ej@iosd2f6pk4f7
> >
> > Or, if not, do you see any difference in the user time in your benchmark
> > with the above mm patch? In subsequent iterations, linear accesses are
> > not ideal for testing. Better to have some random accesses this 1GB of
> > memory (after the initial touching). That would mimic heap accesses a
> > bit better.
>
> I didn't try that patch. I think we discussed this before. This patch can
> remove the THP shattering, we should be able to see some improvement, but
> TBLI is still needed and its cost should be still noticeable because the
> write protection fault still happens.

Yeah, with two faults we do need the subsequent TLBI. But if it's only
one per huge page, it might not be that bad.

> > Anyway, as it stands, I don't think we should merge this patch since it
> > does incur an additional penalty with exec-only mappings and it would
> > make things even worse for OpenJDK if distros change their default
> > toolchain flags at some point to generate exec-only ELF text sections.
> > While we could work around this by allowing the kernel to read the
> > exec-only user mapping with a new flavour of get_user() (toggling PAN as
> > well), I don't think it's worth it. Especially with the above mm change,
> > I think the benefits of this patch aren't justified. Longer term, I hope
> > that customers upgrade to OpenJDK v22 or, for proprietary tools, they
> > either follow the madvise() approach or wait for the Arm architects to
> > change the hardware behaviour.
>
> If the overhead for exec-only is a concern, I think we can just skip
> exec-only segment for now, right? The exec-only is not that popular yet. And
> if the users prefer security, typically they may be not that sensitive to
> performance.

I don't think we should penalise something that currently works when we
already have solutions for the problematic case (madvise()) and a kernel
improvement on the THP handling. At least not without real-world numbers
to back it up. I don't want to advise people not to compile OpenJDK (or
other software) with exec-only permissions just because the in-kernel
RMW handling will get worse.

Currently exec-only binaries are not wide spread but as FEAT_EPAN (a
precondition) starts to appear in hardware, software may move in this
direction, at least for the pre-built executables (JIT'ed code may incur
additional size penalty to keep the literal pool separated).

My view (as kernel maintainer) is that there aren't strong enough
arguments for merging this patch. It does show an improvement in
specific cases but the downside is the risk of affecting other cases
(plus additional maintenance cost, though I'd expect the opcode checking
to remain the same). I do think the architecture should be changed so
that the kernel gets the right information in the fault handler but it
would be a while before such feature appears in hardware.

--
Catalin