Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios

From: Barry Song

Date: Thu Feb 26 2026 - 05:24:48 EST


On Thu, Feb 26, 2026 at 6:09 PM David Hildenbrand (Arm)
<david@xxxxxxxxxx> wrote:
>
> On 2/26/26 08:01, Barry Song wrote:
> > On Wed, Feb 25, 2026 at 12:01 AM David Hildenbrand (Arm)
> > <david@xxxxxxxxxx> wrote:
> >>
> >> On 2/24/26 12:43, Lorenzo Stoakes wrote:
> >>>
> >>> Sorry, I misread the original mail while rushing through. This is old... so this is less
> >>> pressing than I thought (for some reason I thought it was merged last cycle...!)
> >>> but it's a good example of how stuff can go unnoticed for a while.
> >>>
> >>> In that case maybe a revert is a bit much and we just want the simplest possible
> >>> fix for backporting.
> >
> > Apologies for the mess I caused, and thanks to Dev for catching this bug.
> >
> >>
> >> Dev volunteered to un-messify some of the stuff here. In particular, to
> >> extend batching to all cases, not just some hand-selected ones.
> >>
> >> Support for file folios is on the way.
> >>
> >>>
> >>> But is the proposed 'just assume wrprotect' sensible? David?
> >>
> >> In general, I think so. If PTEs were writable, they certainly have
> >> PAE set. The write-fault handler can fully recover from that (as PAE is
> >> set). If it's ever a performance problem (doubt), we can revisit.
> >>
> >> I'm wondering whether we should just perform the wrprotect earlier:
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 0f00570d1b9e..19b875ee3fad 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>
> >>                 /* Nuke the page table entry. */
> >>                 pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
> >> +
> >> +               /*
> >> +                * Our batch might include writable and read-only
> >> +                * PTEs. When we have to restore the mapping, just
> >> +                * assume read-only to not accidentally upgrade
> >> +                * write permissions for PTEs that must not be
> >> +                * writable.
> >> +                */
> >> +               pteval = pte_wrprotect(pteval);
> >> +
> >>                 /*
> >>                  * We clear the PTE but do not flush so potentially
> >>                  * a remote CPU could still be writing to the folio
> >>
> >> Given that nobody asks for writability (pte_write()) later.
> >>
> >> Or does someone care?
> >>
> >> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
> >> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
> >> architecture (write-only)? I don't think so.
> >>
> >>
> >> We have the following options:
> >>
> >> 1) pte_wrprotect(): fake that all was read-only.
> >>
> >> Either we do it like Dev suggests, or we do it as above early.
> >>
> >> The downside is that any code that might later want to know "was
> >> this possibly writable" would no longer get that information. Well, it wouldn't
> >> get that information reliably *today* already (and that sounds a bit shaky).
> >>
> >> 2) Tell batching logic to honor pte_write()
> >>
> >> Sounds suboptimal for some cases that really don't care in the future.
> >
> > I'm still curious what the downside would be to applying the
> > simple fix instead of introducing more "hacks". As I assume,
> > cases where a folio has both writable and non-writable PTEs
> > are not common?
>
> With "in the future" I thought about file folios, where I'd assume it
> could happen more often.
>
> For lazyfree, I agree.
>
> In the end, batching as much as possible is nice, but obviously, once it
> gets too shaky in corner cases we might not care that much.

Assuming 90% of folios have consistent PTEs, perhaps we don’t
need to worry too much about the remaining 10% of inconsistent
folios. We’ve already gained performance benefits for the
consistent 90%, and while the remaining 10% may not receive the
full batch, they are still getting some batching.

I don’t have the exact number, but it’s likely 90% or higher :-)

>
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index bff8f222004e..48ad3435593a 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1955,7 +1955,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >         if (userfaultfd_wp(vma))
> >                 return 1;
> >
> > -       return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> > +       return folio_pte_batch_flags(folio, NULL, pvmw->pte, &pte, max_nr, FPB_RESPECT_WRITE);
> > }
>
> If we already go for this approach, I assume we should then just set
> FPB_RESPECT_SOFT_DIRTY as well and have it all handled properly.

I would vote for this, as supporting those inconsistent PTE
cases could become an endless and painful task :-)

Thanks
Barry