Re: Re: Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing

From: Mateusz Guzik

Date: Fri Apr 24 2026 - 02:20:22 EST

On Fri, Apr 24, 2026 at 5:20 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 24, 2026 at 01:08:35AM +0000, Yibin Liu wrote:
> > On an Intel Emerald Rapids server (112 cores), run the execl benchmark from
> > UnixBench with the command: ./Run -c 220 execl
> > Then perf top shows:
> >
> > 91.53% [kernel] [k] osq_lock
> > 0.50% [kernel] [k] rwsem_spin_on_owner
>
> OK, but does this represent a realistic workload? It's pretty easy to
> construct workloads that hammer on particular locks; the question is
> whether it's a relevant performance bottleneck that customers care about.

This is a genuine problem when doing large-scale package building.
I'll say upfront that my extensive experience with this crap is on
FreeBSD and I did not run it on Linux myself, but bear with me here:
while FreeBSD is no doubt a less scalable kernel, Linux has been shown
to suffer from the same problems.

Say you have a box with a core count of 100 and get it to build up to
100 packages at a time. Further, even if you use some form of
file-system separation at the userspace level, you still want to share
the common binaries to reduce memory + cache footprint, so you at least
bind-mount (--bind) them. Then you are susceptible to contention
issues, at least on paper.

Granted, building a pig like chromium scales great because it is
written in c++ and almost all of the time is spent in userspace, with
forks and execs of the compiler highly spread out in time, in turn
putting very little pressure on the locks.

However, the vast majority of packages are tiny in comparison
(literally a few .c files) and this is where things go south as they
engage in an exec frenzy, looking like a borderline microbenchmark. The
primary culprit is configure scripts, issuing an idiotic number of
back-to-back execs of short-lived processes (notably sed, but also
grep, rm and others). There is a lot of evil in makefiles as well.

I don't have numbers handy, but in the case of the FreeBSD ports tree
we are talking about over 10 000 ports which on their own take a few
seconds each to build. Since these are largely single-threaded, if you
have package-building machinery which can saturate the box, you easily
end up with as many parallel builds as you have cores. And when they
engage in an exec frenzy for the duration, you may as well be
microbenchmarking it.
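
To make the pattern concrete, here is a rough sketch of what the box
ends up doing: one worker per core, each issuing back-to-back execs of
the same shared short-lived binary. This is purely illustrative and not
something numbers were collected with. The function names are made up,
"true" stands in for sed/grep/rm, and the Python overhead will dominate
the timing, so a real measurement would want something like the
UnixBench execl test instead.

```python
#!/usr/bin/env python3
"""Illustrative sketch: parallel workers doing back-to-back execs of one binary."""
import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor


def exec_frenzy(binary, count):
    """Run `binary` `count` times in a row; return how many runs exited with 0."""
    ok = 0
    for _ in range(count):
        # Each iteration spawns and execs a short-lived process, much like a
        # configure script calling sed over and over.
        result = subprocess.run(
            [binary], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode == 0:
            ok += 1
    return ok


def run_parallel(binary, workers, count):
    """Start `workers` concurrent exec loops against the same binary."""
    # Threads are enough here: the work happens in the child processes,
    # and all of them map the same executable file.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(exec_frenzy, binary, count) for _ in range(workers)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    binary = shutil.which("true") or "/bin/true"  # stand-in for sed, grep, rm
    workers = os.cpu_count() or 1                 # one "package build" per core
    results = run_parallel(binary, workers, 1000)
    print(sum(results), "execs across", workers, "workers")
```

Since every worker execs the same file, this is the shared-binary case
from the bind-mount setup described earlier, just without any actual
compiling in between.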

A sufficiently pessimized workload is indistinguishable from a
microbenchmark and this here is an example of one.

iow this is a real problem, but I don't have specific numbers for Linux.