Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()

From: Lorenzo Stoakes (Oracle)

Date: Fri Mar 20 2026 - 07:30:22 EST

On Fri, Mar 20, 2026 at 10:59:55AM +0000, Pedro Falcato wrote:
> On Fri, Mar 20, 2026 at 10:36:59AM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> > >
> > > >
> > > > This is all seems VERY delicate, and subject to somebody else coming along and
> > > > breaking it/causing some of these noinline/__always_inline invocations to make
> > > > things far worse.
> > > >
> > > > I also reserve the right to seriously rework this pile of crap software.
> > > >
> > > > I'd rather we try to find less fragile ways to optimise!
> > > >
> > > > Maybe there's some steps that are bigger wins than others?
> > >
> > > What we can do is, collect similar folio_pte_batch_*() variants and
> > > centralize them in mm/utils.c.
> >
> > I'm not sure that addresses any of the comments above?
> >
> > Putting logic specific to components of mm away from where those components
> > are and into mm/util.c seems like a complete regression in terms of
> > fragility and code separation.
> >
> > And for what reason would you want to do that? To force a noinline of an
> > inline and people 'just have to know' that's why you randomly separated the
> > two?
> >
> > Doesn't sound appealing overall.
> >
> > I'd rather we find a way to implement the batching such that it doesn't
> > exhibit bad inlining decisions in the first place.
>
> Yes, you make a good point. At the end of the day (taking change_protection()
> as the example at hand here), after these changes:
> change_protection()
> loop over p4ds
> loop over puds
> loop over pmds
> loop over ptes
> nr_ptes = loop over ptes and find out how many we have
> if (making write) {
> loop over nr_ptes
> loop over ptes and find out how many are anonexclusive or not, in a row
> loop over ptes and set them
> } else {
> loop over ptes and set them
> }
>
> which the compiler FWIW tries to inline it all into one function, but then
> does a poor job at figuring things out. And the CPU gets confused too. It
> was frankly shocking how much performance I could squeeze out of a
>
> if (nr_ptes == 1) {
> /* don't bother with the loops and the funny logic */
> }

Let's maybe focus on generalising this then to start?

I think David + I are in agreement that these are obvious (TM) improvements. But
let's see if we can generalise them?

>
> I would not be surprised if the other syscalls have similar problems.
>
> >
> > I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> > only there is already silly, we should have a _general_ function that does
> > optimisations like that.
> >
> > Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
> > function in internal.h but rather a function in mm/util.c?
> >
> > >
> > > For
> > >
> > > nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > > max_nr_ptes, /* flags = */ 0)
> > >
> > > We might just be able to use folio_pte_batch()?
> >
> > Yup.
> >
> > >
> > > For the other variant (soft-dirt+write) we'd have to create a helper like
> > >
> > > folio_pte_batch_sd_w() [better name suggestion welcome]
> > >
> > > That will reduce the code footprint overall I guess.
> >
> > I mean yeah that's a terrible name so obviously it'd have to be something
> > better.
> >
> > But again, this seems pretty stupid, now we're writing a bunch of duplicate
> > per-case code to force noinline because we made the central function inline
> > no?
>
> Yeah.

I guess we can discuss on other part of thread :)

>
> >
> > That's also super fragile, because an engineer might later decide that
> > pattern is horrible and fix it, and we regress this.
> >
> > But I mean overall, is the perf here really all that important? Are people
> > really that dependent on mprotect() et al. performing brilliantly fast?
>
> Obviously no one truly depends on mprotect, but I believe in fast primitives :)
>
> > Couldn't we do this with any mm interface and end up making efforts that
> > degrade code quality, increase fragility for dubious benefit?
>
> Yes, which is why I don't want to degrade code quality :) It would be ideal to
> find something that works both ways. Per my description of the existing code
> above, you can tell that it's neither fast, nor beautiful :p

I mean the code is terrible in most of these places (mremap() is much better
because I put a lot of time into de-insane-it-ising it), but it's less degrading
quality but rather:

static noinline void blahdy_blah(a trillion parameters)
{
... lots of code ...
}

Developer comes along, modifies code/params/etc. or uses function in another
place and degrades perf somehow and suddenly what made sense at X point of time
no longer makes sense at Y point of time, but develoeprs don't understand it so
are scared to remove it and etc.

Which is why it's better to try to abstract as much as possible and have some
way of then putting the sensitive stuff in a specific place like:

/*
* Lorenzo-style overly long comment with lots of exposition, a beginning,
* middle and end, ASCII diagrams and all sorts...
*
* ...
*/
static noinline void __super_special_perf_bit_of_whatever_dont_touch_please(...)
{
}

That's wrapped up in some saner interface that people can use at will with the
perf-y component safely locked away in an insane asylum^W^W another compilation
unit or header.

>
> --
> Pedro

Cheers, Lorenzo