Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios

From: David Hildenbrand (Arm)

Date: Thu Feb 26 2026 - 05:07:41 EST


On 2/26/26 08:09, Lance Yang wrote:
>
> On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>
>>> Sorry I misread the original mail rushing through this is old... so this is less
>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>> but it's a good example of how stuff can go unnoticed for a while.
>>>
>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>> fix for backporting.
>>
>> Dev volunteered to un-messify some of the stuff here. In particular, to
>> extend batching to all cases, not just some hand-selected ones.
>>
>> Support for file folios is on the way.
>>
>>>
>>> But is the proposed 'just assume wrprotect' sensible? David?
>>
>> In general, I think so. If PTEs were writable, they certainly have
>> PAE set. The write-fault handler can fully recover from that (as PAE is
>> set). If it's ever a performance problem (doubt), we can revisit.
>>
>> I'm wondering whether we should just perform the wrprotect earlier:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0f00570d1b9e..19b875ee3fad 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>> /* Nuke the page table entry. */
>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>> +
>> + /*
>> + * Our batch might include writable and read-only
>> + * PTEs. When we have to restore the mapping, just
>> + * assume read-only to not accidentally upgrade
>> + * write permissions for PTEs that must not be
>> + * writable.
>> + */
>> + pteval = pte_wrprotect(pteval);
>> +
>> /*
>> * We clear the PTE but do not flush so potentially
>> * a remote CPU could still be writing to the folio
>>
>>
>> Given that nobody asks for writability (pte_write()) later.
>>
>> Or does someone care?
>>
>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>> architecture (write-only)? I don't think so.
>>
>>
>> We have the following options:
>>
>> 1) pte_wrprotect(): fake that all was read-only.
>>
>> Either we do it like Dev suggests, or we do it as above early.
>>
>> The downside is that any code that might later want to know "was
>> this possibly writable" would get that information. Well, it wouldn't
>> get that information reliably *today* already (and that sounds a bit shaky).
>
> Makes sense to me :)
>
>> 2) Tell batching logic to honor pte_write()
>>
>> Sounds suboptimal for some cases that really don't care in the future.
>>
>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>
>> ... then we know for sure whether any PTE was writable and we could
>>
>> (a) Pass it as we did before around to all checks, like pte_accessible().
>>
>> (b) Have an explicit restore PTE where we play save.
>>
>>
>> I raised to Dev in private that softdirty handling is also shaky, as we
>> batch over that. Meaning that we could lose or gain softdirty PTE bits in
>> a batch.
>
> I guess we won't lose soft_dirty bits - only gain them (false positive):
>
> 1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
> 2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
> all architectures that support soft_dirty (x86, s390, powerpc, riscv)
> 3) set_ptes() uses pte_advance_pfn() which keeps all flags intact
>
> So if any PTE in the batch was dirty, all PTEs become soft_dirty after
> restore.

PTEs can be softdirty without being dirty. That over-complicates the
situation.

--
Cheers,

David