[RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes

From: Nadav Amit
Date: Fri Mar 11 2022 - 14:07:07 EST


From: Nadav Amit <namit@xxxxxxxxxx>

This patch-set is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set makes it through, similar
and further optimizations for MADV_COLD and userfaultfd would be
possible.

Sorry for the time it took me to get to v3.

Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and
   do better/fewer flushes. This would also be handy for later
   userfaultfd enhancements.
2. Avoid TLB flushes on permission promotion. This optimization is
   the one that provides most of the performance benefits. Note that
   the previous batching infrastructure changes are needed for that to
   happen.
3. Avoid TLB flushes on change_huge_pmd() that are only needed to
   prevent the A/D bits from changing.
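
As a rough illustration of the second optimization: a change that only
adds permissions (for example read-only to read-write) can leave stale
TLB entries in place, because a stale entry is merely more restrictive
than the page table and at worst causes a spurious fault. The sketch
below expresses that decision on abstract permission bits. The PERM_*
flags and needs_tlb_flush() are hypothetical names used only here; the
actual patches operate on architecture PTE flags and have to consider
more than this (accessed/dirty bits, for instance).

```c
#include <stdbool.h>

/* Hypothetical, simplified permission bits; not the kernel's PTE flags. */
enum {
	PERM_READ  = 1u << 0,
	PERM_WRITE = 1u << 1,
	PERM_EXEC  = 1u << 2,
};

/*
 * Illustration only: a flush is required if some permission that the
 * old mapping granted is no longer granted by the new one. A change
 * that only adds permissions does not require one under this model.
 */
static bool needs_tlb_flush(unsigned int old_perms, unsigned int new_perms)
{
	return (old_perms & ~new_perms) != 0;
}
```

Under this model, going from PERM_READ to PERM_READ | PERM_WRITE needs
no flush, while the reverse direction does.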

Andrew asked for some benchmark numbers. I do not have an easy
deterministic macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:

mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
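
A minimal user-space sketch of such a loop might look as follows. It is
not the program used for the measurements: it times with
clock_gettime() in nanoseconds rather than counting cycles, and the
names bench_result, now_ns() and run_mprotect_bench() are made up for
this illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Accumulated wall-clock time per operation, in nanoseconds. */
struct bench_result {
	uint64_t ro_ns;	/* total for mprotect(PROT_READ) */
	uint64_t rw_ns;	/* total for mprotect(PROT_READ|PROT_WRITE) */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: returns 0 on success, -1 on failure. */
static int run_mprotect_bench(unsigned long iters, struct bench_result *res)
{
	long page_size = sysconf(_SC_PAGESIZE);
	volatile char *p;
	unsigned long i;
	void *map;
	int ret = 0;

	if (page_size <= 0)
		return -1;

	map = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	p = map;
	*p = 0;			/* fault the page in before measuring */
	res->ro_ns = 0;
	res->rw_ns = 0;

	for (i = 0; i < iters; i++) {
		uint64_t t0, t1, t2;

		t0 = now_ns();
		if (mprotect(map, page_size, PROT_READ)) {
			ret = -1;
			break;
		}
		t1 = now_ns();
		if (mprotect(map, page_size, PROT_READ | PROT_WRITE)) {
			ret = -1;
			break;
		}
		t2 = now_ns();
		*p = 0;		/* write access after regaining PROT_WRITE */

		res->ro_ns += t1 - t0;
		res->rw_ns += t2 - t1;
	}

	munmap(map, page_size);
	return ret;
}
```

Dividing each total by the iteration count gives a per-call average.
Since the unit here is nanoseconds and timer overhead is included, the
absolute values are not comparable with cycle counts.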

The test was run in a KVM guest with 1 or 2 threads (the second thread
was busy-looping). I measured the time (cycles) of each operation:

                    1 thread                 2 threads
                    mmots   +patch           mmots   +patch
PROT_READ           3494    2725 (-22%)      8630    7788 (-10%)
PROT_READ|WRITE     3952    2724 (-31%)      9075    2865 (-68%)

[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear.
There are 2 interesting results though.

(1) PROT_READ is cheaper, although one would expect it not to be
affected. This is presumably because a TLB miss is saved.

(2) Without the memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly 1500
cycles (with either 1 or 2 threads), whereas on mmots their cost is as
high as presented in the table.

--

v2 -> v3:
* Fix order of patches (the previous order could lead to breakage)
* Better comments
* Clearer KNL detection [Dave]
* Assertion on PF error-code [Dave]
* Comments, code, function names improvements [PeterZ]
* Flush on access-bit clearing on PMD changes to follow the way
  flushing on x86 is done today in the kernel.

v1 -> v2:
* Wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()



Nadav Amit (5):
  x86: Detection of Knights Landing A/D leak
  x86/mm: check exec permissions on fault
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm: avoid unnecessary flush on change_huge_pmd()

 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/pgtable.h       |  5 ++
 arch/x86/include/asm/pgtable_types.h |  2 +
 arch/x86/include/asm/tlbflush.h      | 82 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel.c          |  5 ++
 arch/x86/mm/fault.c                  | 22 ++++++-
 arch/x86/mm/pgtable.c                | 10 +++
 fs/exec.c                            |  6 +-
 include/asm-generic/tlb.h            | 14 +++++
 include/linux/huge_mm.h              |  5 +-
 include/linux/mm.h                   |  5 +-
 include/linux/pgtable.h              | 20 ++++++
 mm/huge_memory.c                     | 19 ++++--
 mm/mprotect.c                        | 94 +++++++++++++++-------------
 mm/pgtable-generic.c                 |  8 +++
 mm/userfaultfd.c                     |  6 +-
 16 files changed, 248 insertions(+), 56 deletions(-)

--
2.25.1