The atomic RMW instructions, for example, ldadd, actually does load +
add + store in one instruction, it will trigger two page faults per the
ARM64 architecture spec, the first fault is a read fault, the second
fault is a write fault.
Some applications use atomic RMW instructions to populate memory, for
example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
at launch time) between v18 and v22 in order to permit use of memory
concurrently with pretouch.
But the double page fault has some problems:
1. Noticeable TLB overhead. The kernel actually installs zero page with
readonly PTE for the read fault. The write fault will trigger a
write-protection fault (CoW). The CoW will allocate a new page and
make the PTE point to the new page, this needs TLB invalidations. The
tlb invalidation and the mandatory memory barriers may incur
significant overhead, particularly on the machines with many cores.
2. Break up huge pages. If THP is on the read fault will install huge
zero pages. The later CoW will break up the huge page and allocate
base pages instead of huge page. The applications have to rely on
khugepaged (kernel thread) to collapse huge pages asynchronously.
This also incurs noticeable performance penalty.
3. 512x page faults with huge page. Due to #2, the applications have to
have page faults for every 4K area for the write, this makes the speed
up by using huge page actually gone.