Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()

From: Lorenzo Stoakes (Oracle)

Date: Fri Mar 20 2026 - 06:51:53 EST

On Fri, Mar 20, 2026 at 10:34:29AM +0000, Pedro Falcato wrote:
> On Thu, Mar 19, 2026 at 07:14:38PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> > > Hoist some previous, important to be inlined checks to the main loop
> > > and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> > > quite large and whether it is inlined or not should not matter, as we
> > > are ideally dealing with larger orders.
> > >
> > > Signed-off-by: Pedro Falcato <pfalcato@xxxxxxx>
> > > ---
> > > mm/mprotect.c | 16 +++++++---------
> > > 1 file changed, 7 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 8d4fa38a8a26..aa845f5bf14d 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > > return can_change_shared_pte_writable(vma, pte);
> > > }
> > >
> > > -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > > +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >
> > Hmm I'm iffy about noinline-ing a 1 line function now?
>
> Well, it's a 1 line function that itself (probably) has an inlined
> folio_pte_batch_flags() with a bunch of inlined functions, calls, etc. But yes, I
> see your point.

I mean again the weird thing is folio_pte_batch_flags() being inline in a header
file in the first place. It even says:

* This function will be inlined to optimize based on the input parameters;
* consider using folio_pte_batch() instead if applicable.

In the comment.

noinline'ing to not inline an inline function that explicitly says it's inline
for performance seems... silly.

>
> >
> > And now yu're noinlining something that contains an inline (but not guaranteed
> > to be function, it's a bit strange overall?
> >
> > > pte_t pte, int max_nr_ptes, fpb_t flags)
> > > {
> > > - /* No underlying folio, so cannot batch */
> > > - if (!folio)
> > > - return 1;
> > > -
> > > - if (!folio_test_large(folio))
> > > - return 1;
> > > -
> > > return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> > > }
> > >
> > > @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> > > continue;
> > > }
> > >
> >
> > ^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
> > which now won't be doing:
>
> Yep, this is a bug (that sashiko also caught!)
>
> >
> > if (!folio)
> > return 1;
> >
> > if (!folio_test_large(folio))
> > return 1;
> >
> > At all, so isn't that potentially a pessimisation in itself?
> >
> >
> > > - nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > > + /* No underlying folio (or not large), so cannot batch */
> > > + if (likely(!folio || !folio_test_large(folio)))
> >
> > We actually sure of this likely()? Is there data to support it? Am not a fan of
> > using likely()/unlikey() without something to back it.
>
> I did a small little test yesterday with bpftrace + a kernel build +
> filemap_get_folio. I got a staggering majority of order-0 folios, with some
> smaller folios spread out across a bunch of orders (up to 8, iirc). Of course
> I'm on x86 so I don't have mTHP, etc enabled, so those will all be order-0 or
> PMD_ORDER folios (which we won't see here).
>
> >
> > > + nr_ptes = 1;
> > > + else
> > > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > > + max_nr_ptes, flags);
> >
> > It's also pretty gross to throw this out into a massive function.
> >
> > >
> > > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> > > ptent = pte_modify(oldpte, newprot);
> > > --
> > > 2.53.0
> > >
> >
> > This is all seems VERY delicate, and subject to somebody else coming along and
> > breaking it/causing some of these noinline/__always_inline invocations to make
> > things far worse.
>
> Who told you optimization isn't delicate?!

I don't want to accept changes that will randomly break in future, it's fairly
pointless, because I guarantee somebody will break things at some point if
they're fragile.

And we _really_ do not want or need to have functions where a bot will shout at
you and you spend X hours figuring out why you reduced the time on a
microbenchmark that does something nobody cares about.

BUT, if we can _abstract_ the bigger wins by rethinking batching somewhat then
we could achieve something here, and that'd definitely be worthwhile.

I think that really the batching implementation is a bit of a mess because it
was done without fixing any existing technical debt first.

We need to stop accepting series that pile more code on top of existing code
that is already a mess like that.

>
> >
> > I also reserve the right to seriously rework this pile of crap software.
> >
> > I'd rather we try to find less fragile ways to optimise!
>
> But yes, I understand your point. Hm. I'll need to think about this some more.
> There's nothing I would love more than to simply slap a
>
> if (pte_batch_hint(ptep, pte) == 1)
> nr_ptes = 1;
>
> on !contpte archs. I don't see where most of the wins for those architectures
> would exist, but apparently they do. Confusing :/
>
> >
> > Maybe there's some steps that are bigger wins than others?
>
> Definitely :) So yeah I'll probably be dropping this patch, or at least
> reworking this one a good bit.

Thanks, sorry to be negative - I appreciate the detailed perf analysis, but I
want to make sure we do this in such a way that future Lorenzo and future Pedro
and future David and future whoever can remember WHY we did X and NOT to do Y
that breaks it :)

So that means abstracting it in a way where that's made easy for us (abstraction
stays stable and handles the perf details for us which are documented in e.g. a
comment, callers call, everybody's happy-ish).

>
> --
> Pedro

Thanks, Lorenzo