Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

From: Yang Shi
Date: Wed Jul 10 2024 - 14:46:47 EST

On 7/10/24 2:22 AM, Catalin Marinas wrote:
> On Tue, Jul 09, 2024 at 03:29:58PM -0700, Yang Shi wrote:
> > On 7/9/24 11:35 AM, Catalin Marinas wrote:
> > > On Tue, Jul 09, 2024 at 10:56:55AM -0700, Yang Shi wrote:
> > > > On 7/4/24 3:03 AM, Catalin Marinas wrote:
> > > > I tested exec-only on QEMU TCG, but I don't have hardware that
> > > > supports EPAN. I don't think a performance benchmark on QEMU TCG
> > > > makes sense since it is quite slow; such a small overhead is
> > > > unlikely to be measurable on it.
> > > Yeah, benchmarking under qemu is pointless. I think you can remove
> > > some of the ARM64_HAS_EPAN checks (or replace them with
> > > ARM64_HAS_PAN) just for testing. For security reasons we removed
> > > this behaviour in commit 24cecc377463 ("arm64: Revert support for
> > > execute-only user mappings"), but it's good enough for testing.
> > > This should give you PROT_EXEC-only mappings on your hardware.
> > Thanks for the suggestion. IIUC, I can still emulate exec-only even
> > though the hardware doesn't support EPAN? So reading an exec-only
> > area from the kernel can still trigger a fault, right?
> Yes, it's been supported since ARMv8.0. We limited it to EPAN only
> since setting a PROT_EXEC mapping still allowed the kernel to access
> the memory even if PSTATE.PAN was set.
>
> > And 24cecc377463 ("arm64: Revert support for execute-only user
> > mappings") can't be reverted cleanly by git revert, so I did it
> > manually as below.
> Yeah, I wasn't expecting that to work.
>
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 6a8b71917e3b..0bdedd415e56 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -573,8 +573,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> >                 /* Write implies read */
> >                 vm_flags |= VM_WRITE;
> >                 /* If EPAN is absent then exec implies read */
> > -               if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
> > -                       vm_flags |= VM_EXEC;
> > +               //if (!alternative_has_cap_unlikely(ARM64_HAS_EPAN))
> > +               //      vm_flags |= VM_EXEC;
> >         }
> > 
> >         if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) {
> > diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
> > index 642bdf908b22..d30265d424e4 100644
> > --- a/arch/arm64/mm/mmap.c
> > +++ b/arch/arm64/mm/mmap.c
> > @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
> >         [VM_WRITE]                                      = PAGE_READONLY,
> >         [VM_WRITE | VM_READ]                            = PAGE_READONLY,
> >         /* PAGE_EXECONLY if Enhanced PAN */
> > -       [VM_EXEC]                                       = PAGE_READONLY_EXEC,
> > +       [VM_EXEC]                                       = PAGE_EXECONLY,
> >         [VM_EXEC | VM_READ]                             = PAGE_READONLY_EXEC,
> >         [VM_EXEC | VM_WRITE]                            = PAGE_READONLY_EXEC,
> >         [VM_EXEC | VM_WRITE | VM_READ]                  = PAGE_READONLY_EXEC,
> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
> Otherwise it looks fine.
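
Thanks. To double check the emulation I used a small test along these
lines (just a sketch; it uses write(2) to /dev/null so that the kernel
does a read of the exec-only buffer):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/null", O_WRONLY);
	/* exec-only mapping: PROT_EXEC without PROT_READ/PROT_WRITE */
	char *p = mmap(NULL, psz, PROT_EXEC,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (fd < 0 || p == MAP_FAILED)
		return 1;

	/*
	 * Kernel-side read: write(2) makes the kernel read from the
	 * exec-only buffer. With exec-only enforced this fails with
	 * EFAULT instead of quietly bypassing PAN.
	 */
	ssize_t n = write(fd, p, psz);
	printf("write() = %zd, errno = %d\n", n, n < 0 ? errno : 0);

	/* A user-space load, e.g. "char c = *p;", would SIGSEGV. */
	return 0;
}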

I just ran the same benchmark: the modified page_fault1_thread
(triggering read faults) for 100 iterations with 160 threads on 160
cores. This should be the worst contention case, and I collected the
max data (worst latency). It shows the patch may incur ~30% overhead
for the exec-only case. The overhead should come solely from the
permission fault.

    N           Min           Max        Median           Avg        Stddev
x 100        163840        219083        184471        183262     12593.229
+ 100        211198        285947        233608     238819.98     15253.967
Difference at 95.0% confidence
    55558 +/- 3877
    30.3161% +/- 2.11555%

This is a very extreme benchmark; I don't think any real-life workload
will spend that much time (sys vs user) in page faults, particularly
read faults.
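
For reference, the read-fault loop in the modified page_fault1_thread
is roughly the following (a sketch, not the exact will-it-scale code;
MEMSIZE is arbitrary):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE	(128UL * 1024 * 1024)

/*
 * Touch every page with a load instead of a store so that each fault
 * is a read fault (the stock page_fault1 uses stores, i.e. write
 * faults), then unmap and repeat.
 */
static void read_fault(void)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	volatile char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long i;
	char sink = 0;

	if (c == MAP_FAILED)
		exit(1);

	for (i = 0; i < MEMSIZE; i += pgsz)
		sink += c[i];

	(void)sink;
	munmap((void *)c, MEMSIZE);
}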

With my atomic fault benchmark (populate 1G of memory with atomic
instructions, then manipulate the values stored in that memory for 100
iterations so the user time is much longer than the sys time), I saw
around 13% overhead on sys time due to the permission fault, but no
noticeable change in user or real time.
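
The populate phase of that benchmark is essentially the below (again a
sketch; the fetch-add is just a convenient atomic RMW):

#include <stdatomic.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE	(1UL << 30)	/* 1G, as described above */

/*
 * The first touch of each page is an atomic RMW, so without the patch
 * it takes a read fault followed by a write fault on the same address;
 * with the patch the first fault is treated as a write fault directly.
 */
static void atomic_populate(void)
{
	_Atomic long *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long step = sysconf(_SC_PAGESIZE) / sizeof(*p);
	unsigned long i;

	if ((void *)p == MAP_FAILED)
		exit(1);

	for (i = 0; i < MEMSIZE / sizeof(*p); i += step)
		atomic_fetch_add(&p[i], 1);
}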

So the permission fault does incur noticeable overhead for read faults
on exec-only mappings, but it may not be that bad for real-life
workloads.