Re: [PATCH 2/2] mm/mprotect: do not flush on permission promotion

From: Nadav Amit
Date: Thu Oct 07 2021 - 12:16:07 EST



> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 25.09.21 22:54, Nadav Amit wrote:
>> From: Nadav Amit <namit@xxxxxxxxxx>
>> Currently, using mprotect() to unprotect a memory region or uffd to
>> unprotect a memory region causes a TLB flush. At least on x86, as
>> protection is promoted, no TLB flush is needed.
>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>> flush is needed based on the old PTE and the new one. Implement an x86
>> pte_may_need_flush().
>> For x86, PTE protection promotion or changes of software bits does
>> require a flush, also add logic that considers the dirty-bit. Changes to
>> the access-bit do not trigger a TLB flush, although architecturally they
>> should, as Linux considers the access-bit as a hint.
>
> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?

So you ask whether the added ~10 LOC (net) worth the benefit?

Let’s start with the cost of this patch.

If you ask about complexity, I think that it is a rather simple
patch and documented as needed. Please be more concrete if you
think otherwise.

If you ask about the runtime overhead, my experience is that
such code, which mostly does bit operations, has negligible cost.
The execution time of mprotect code, and other similar pieces of
code, is mostly dominated by walking the page-tables & getting
the pages (which might require cold or random memory accesses),
acquiring the locks, and of course the TLB flushes that this
patch tries to eliminate.

As for the benefit: TLB flush on x86 of a single PTE has an
overhead of ~200 cycles. If a TLB shootdown is needed, for instance
on multithreaded applications, this overhead can grow to few
microseconds or even more, depending on the number of sockets,
whether the workload runs in a VM (and worse if CPUs are
overcommitted) and so on.

This overhead is completely unnecessary on many occasions. If
you run mprotect() to add permissions, or as I noted in my case,
to do something similar using userfaultfd. Note that the
potentially unnecessary TLB flush/shootdown takes place while
you hold the mmap-lock for write in the case of mprotect(),
thereby potentially preventing other threads from making
progress during that time.

On my in-development workload it was a considerable overhead
(I didn’t collect numbers though). Basically, I track dirty
pages using uffd, and every page-fault that can be easily
resolved by unprotecting cause a TLB flush/shootdown.

If you want, I will write a microbenchmarks and give you numbers.
If you look for further optimizations (although you did not indicate
so), such as doing the TLB batching from do_mprotect_key(),
(i.e. batching across VMAs), we can discuss it and apply it on
top of these patches.