Re: [PATCH 7.2 v2 02/12] mm/khugepaged: add folio dirty check after try_to_unmap_flush()
From: Zi Yan
Date: Thu Apr 16 2026 - 22:09:30 EST
On 14 Apr 2026, at 11:55, Zi Yan wrote:
> On 14 Apr 2026, at 6:38, David Hildenbrand (Arm) wrote:
>
>> On 4/13/26 21:20, Zi Yan wrote:
>>> This check ensures the correctness of collapse read-only THPs for FSes
>>> after READ_ONLY_THP_FOR_FS is enabled by default for all FSes supporting
>>> PMD THP pagecache.
>>>
>>> READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
>>> and inode->i_writecount to prevent any write to read-only to-be-collapsed
>>> folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
>>> aforementioned mechanism will go away too. To ensure khugepaged functions
>>> as expected after the changes, roll back if any folio is dirty after
>>> try_to_unmap_flush(), since a dirty folio means this read-only folio
>>> received writes via mmap between try_to_unmap() and try_to_unmap_flush()
>>> through cached TLB entries, and khugepaged does not support collapsing
>>> writable pagecache folios.
>>>
>>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>>> ---
>>> mm/khugepaged.c | 18 ++++++++++++++++++
>>> 1 file changed, 18 insertions(+)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d2f0acd2dac2..ec609e53082e 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2121,6 +2121,24 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> */
>>> try_to_unmap_flush();
>>>
>>> + /*
>>> + * At this point, all folios are locked, unmapped, and all cached
>>> + * mappings in TLBs are flushed. No one else is able to write to these
>>> + * folios, since
>>> + * 1. writes via FS ops require folio locks (see write_begin_get_folio());
>>> + * 2. writes via mmap require taking a fault and locking folio locks.
>>> + *
>>
>> maybe simplify to "folios, since that would require taking the folio lock first."
>
> Sure.
>
>>
>>> + * khugepaged only works for read-only fd, make sure all folios are
>>> + * clean, since writes via mmap can happen between try_to_unmap() and
>>> + * try_to_unmap_flush() via cached TLB entries.
>>
>>
>> IIRC, after successful try_to_unmap() the PTE dirty bit would be synced to the
>> folio. That's what you care about, not about any stale TLB entries.
>
> Right. I missed the PTE dirty bit to folio dirtiness part.
>
>>
>> The important part is that the
>>
>> So can't we simply test for dirty folios after the refcount check (where
>> we made sure the folio is no longer mapped)?
>>
>
> I think so.
After more thought, this still works, but the reasoning is more complicated.
After try_to_unmap(), the PTE dirty bit, if set, is transferred to the folio
dirty flag. The code bails out if the folio is dirty, and a clean folio means
the removed PTEs were clean, so the TLB entries caching those PTEs do not
have the dirty bit set. Any write to the folio after try_to_unmap() triggers
a page table walk and a page fault due to the invalid PTEs, and the faulting
task cannot proceed since collapse_file() is holding the folio lock. As a
result, we can skip marking the folio dirty after the collapse.

If we allowed a dirty folio after try_to_unmap(), we would need to mark the
folio dirty after the collapse, like the shmem path does. A potential
optimization would be to mark the after-collapse folio dirty only if any old
folio is dirty after try_to_unmap().
>
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b2ac28ddd480..920e16067134 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2089,6 +2089,14 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>> goto out_unlock;
>> }
>>
>> + /* ... */
>> + if (!is_shmem && folio_test_dirty(folio)) {
>> + result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>> + xas_unlock_irq(&xas);
>> + folio_putback_lru(folio);
>> + goto out_unlock;
>> + }
>> +
>> /*
>> * Accumulate the folios that are being collapsed.
>>
>>
>> I guess we don't have to recheck folio_test_writeback() ?
>
> Right, writeback needs to take the folio lock and we are holding it, so
> others can only dirty the folio but cannot initiate writeback.
>
> Best Regards,
> Yan, Zi
--
Best Regards,
Yan, Zi