Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing

From: Yibin Liu

Date: Wed Apr 22 2026 - 09:12:44 EST

> On Tue, Apr 21, 2026 at 4:11 AM Yibin Liu <liuyibin@xxxxxxxx> wrote:
> >
> > UnixBench execl/shellscript (dynamically linked binaries) at 64+ cores are
> > bottlenecked on the i_mmap_rwsem semaphore due to heavy vma insert/remove
> > operations on the i_mmap tree, where libc.so.6 is the most frequent,
> > followed by ld-linux-x86-64.so.2 and the test executable itself.
> >
> > This patch marks such files to skip rmap operations, avoiding frequent
> > interval tree insert/remove that cause i_mmap_rwsem lock contention.
> > The downside is these files can no longer be reclaimed (along with compact
> > and ksm), but since they are small and resident anyway, it's acceptable.
> > When all mapping processes exit, files can still be reclaimed normally.
> >
> > Performance testing shows ~80% improvement in UnixBench execl/shellscript
> > scores on Hygon 7490, AMD zen4 9754 and Intel emerald rapids platform.
> >
>
> The other responders have been a little harsh and despite raising
> valid points I don't think they gave a proper review.
>
> The bigger picture is that the problematic rwsem is taken several
> times during fork + exec + exit cycle. Normally you end up with 5
> distinct mappings per binary/so, each created with a separate lock
> acquire.
>
> Some time ago I patched exit to batch processing, leaving 1 acquire in
> that codepath. fork can and should be patched in a similar vein, but I
> don't know if unixbench runs it in this benchmark (i.e., real
> workloads certainly suffer from it, I don't know if this particular
> bench includes that aspect). This is on top of forking itself being
> avoidable should the kernel grow a better interface for executing
> binaries.
>
Thank you for your opinions and advice. I'll try this approach.
> This leaves us with mapping creation on exec. This problem is
> unfixable without introduction of better APIs for userspace, which
> constitutes quite a challenge.
>
> The end result is the absolutely horrible case of multiple acquires of
> the same lock per iteration.
>
> One common idea how to reduce contention boils down to shortening lock
> hold time. This has very limited effect in face of the aforementioned
> multiple acquires and is at best a stop gap -- no matter what, the
> ceiling is dictated by the extra acquires and it is incredibly low.
>
> Your patch keeps the problematic acquire pattern intact and while the
> 80% win might sound encouraging, the end result is still severely
> underperforming even a state where the lock is taken once in total
> during exec.
>
> Besides that, the internally-visible side effect of non-functional
> rmap (and thus e.g., truncate) is pretty bad in its own
> right, but let's ignore it. The primary problem here is that the patch
> exposes a mechanism for userspace to dictate this in the first place.
> Even ignoring the question of who should be using it and when, the
> real solution to the problem would be confined to the kernel. Suppose
> this patch lands and such a solution is implemented later -- now the
> kernel is stuck having to support a now-useless (if not outright
> harmful) feature.
OK. I understand it now.
>
> What will fix the problem is sharding the state in some capacity,
> provided no unfixable stopgap shows up.
>
> Any other approach is putting small bandaids on it and can be a
> consideration only if decentralizing the locking proves too
> problematic.
>
> Pedro apparently volunteered to do the work, so I think we can wait to
> see what he is going to end up cooking.
>
> I hope this helps.
>