Re: [PATCH 2/2] mm/mprotect: do not flush on permission promotion

From: David Hildenbrand
Date: Thu Oct 07 2021 - 13:07:35 EST


On 07.10.21 18:16, Nadav Amit wrote:

On Oct 7, 2021, at 5:13 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:

On 25.09.21 22:54, Nadav Amit wrote:
From: Nadav Amit <namit@xxxxxxxxxx>
Currently, using mprotect() to unprotect a memory region or uffd to
unprotect a memory region causes a TLB flush. At least on x86, as
protection is promoted, no TLB flush is needed.
Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().
For x86, PTE protection promotion or changes of software bits does
require a flush, also add logic that considers the dirty-bit. Changes to
the access-bit do not trigger a TLB flush, although architecturally they
should, as Linux considers the access-bit as a hint.

Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?

So you ask whether the added ~10 LOC (net) worth the benefit?

I read "3 files changed, 46 insertions(+), 1 deletion(-)" to optimize something without proof, so I naturally have to ask. So this is just a "usually we optimize and show numbers to proof" comment.


Let’s start with the cost of this patch.

If you ask about complexity, I think that it is a rather simple
patch and documented as needed. Please be more concrete if you
think otherwise.

It is most certainly added complexity, although documented cleanly.


If you ask about the runtime overhead, my experience is that
such code, which mostly does bit operations, has negligible cost.
The execution time of mprotect code, and other similar pieces of
code, is mostly dominated by walking the page-tables & getting
the pages (which might require cold or random memory accesses),
acquiring the locks, and of course the TLB flushes that this
patch tries to eliminate.

I'm absolutely not concerned about runtime overhead :)


As for the benefit: TLB flush on x86 of a single PTE has an
overhead of ~200 cycles. If a TLB shootdown is needed, for instance
on multithreaded applications, this overhead can grow to few
microseconds or even more, depending on the number of sockets,
whether the workload runs in a VM (and worse if CPUs are
overcommitted) and so on.

This overhead is completely unnecessary on many occasions. If
you run mprotect() to add permissions, or as I noted in my case,
to do something similar using userfaultfd. Note that the
potentially unnecessary TLB flush/shootdown takes place while
you hold the mmap-lock for write in the case of mprotect(),
thereby potentially preventing other threads from making
progress during that time.

On my in-development workload it was a considerable overhead
(I didn’t collect numbers though). Basically, I track dirty
pages using uffd, and every page-fault that can be easily
resolved by unprotecting cause a TLB flush/shootdown.

Any numbers would be helpful.


If you want, I will write a microbenchmarks and give you numbers.
If you look for further optimizations (although you did not indicate
so), such as doing the TLB batching from do_mprotect_key(),
(i.e. batching across VMAs), we can discuss it and apply it on
top of these patches.

I think this patch itself is sufficient if we can show a benefit; I do wonder if existing benchmarks could already show a benefit, I feel like they should if this makes a difference. Excessive mprotect() usage (protect<>unprotect) isn't something unusual.

--
Thanks,

David / dhildenb