Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
From: Zi Yan
Date: Thu Apr 02 2026 - 10:43:13 EST
On 2 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
> On 4/1/26 22:33, Zi Yan wrote:
>> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
>>
>>> On 4/1/26 17:32, Zi Yan wrote:
>>>>
>>>>
>>>> Let me think.
>>>>
>>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>>>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>>>> to true. Then, do_dentry_open() also truncates all pages
>>>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>>>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>>>> a fd with write when there is a read-only THP.
>>>>
>>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>>>> on FSes with large folio support (to be precise THP support). If a fd
>>>> is opened for write before inode_is_open_for_write() check, khugepaged
>>>> will stop. It is fine. But if a fd is opened for write after
>>>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>>>> THP and the fd can be written at the same time.
>>>
>>> Exactly, that's the race I mean.
>>>
>>>>
>>>> I notice that fd write requires locking the to-be-written folio first
>>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>>>> f_ops->write() has the same locking requirement) and khugepaged has already
>>>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>>>> fd is opened for write after inode_is_open_for_write() check, its write
>>>> will wait for khugepaged collapse and see a new THP. Since the FS
>>>> supports THP, writing to the new THP should be fine.
>>>>
>>>> Let me know if my analysis above makes sense. If yes, I will add it
>>>> to the commit message and add a succinct comment about it before
>>>> inode_is_open_for_write().
>>>
>>> khugepaged code is the only code that replaces folios in the pagecache
>>> by other folios. So my main concern is if that is problematic on
>>> concurrent write access.
>>
>> folio_split() does it too, although it replaces a large folio with
>> a bunch of after-split folios. It is a kinda reverse process of
>> collapse_file().
>
> Right. You won't start looking at a small folio and suddenly there is
> something larger.
>
>>
>>
>>>
>>> You argue that the folio lock is sufficient. That's certainly true for
>>> individual folios, but I am more concerned about the replacement part.
>>
>> For the replacement part, both old and new folios are locked during
>> the process. A parallel writer uses filemap_get_entry() to get the folio
>> from mapping, but all of them check folio->mapping after acquiring the
>> folio lock, except mincore_page() which is a reader. A writer can see
>> either old folio or new folio during the process, but
>>
>> 1. if it sees the old one, it waits on the old folio lock. After
>> it acquires the lock, it sees old_folio->mapping is NULL, no longer
>> matches the original mapping. The writer will try again.
>>
>> 2. if it sees the new one, it waits on the new folio lock. After
>> it acquires the lock, it sees new_folio->mapping matches the
>> original mapping and proceeds to its writes.
>>
>> 3. if khugepaged needs to do a rollback, the old folio will stay
>> the same and the writer will see the old one after it gets the old
>> folio lock.
>
> I am primarily wondering about what would happen if someone traverses
> the pageache, and found+processed 3 small folios. Suddenly there is a
> large folio that covers the 3 small folios processes before.
>
> I suspect that is fine, because the code likely had to deal with
> concurrent truncation+population if relevant locks are dropped already.
>
> Just raising it.
>
>>
>>>
>>> I don't have anything concrete, primarily just pointing out that this is
>>> a change that might unlock some code paths that could not have been
>>> triggered before.
>>
>> Yes, the concern makes sense.
>>
>> BTW, Claude is trying to convince me that even inode_is_open_for_write()
>> is unecessary since 1) folio_test_dirty() before it has
>> made sure the folio is clean, 2) try_to_unmap() and the locked folio prevents
>> further writes.
>>
>> But then we find a hole between folio_test_dirty() and
>> try_to_unmap() where a write via a writable mmap PTE can dirty the folio
>> after folio_test_dirty() and try_to_unmap(). To remove that hole,
>> the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))”
>> needs to be moved after try_to_unmap(). With that, all to-be-collapsed
>> folios will be clean, unmapped, and locked, where unmapped means
>> writes via mmap need to fault and take the folio lock, locked means
>> writes via mmap and write() need to wait until the folio is unlocked.
>>
>> Let me know if my reasoning makes sense. It is definitely worth the time
>> and effort to ensure this patchset does not introduce any unexpected race
>> condition or issue.
>
> Makes sense.
>
> Please clearly spell out that there is a slight change now, where we
> might be collapsing after the file has been opened for write. Then you
> can document that the folio locks should be protecting us from that.
>
> Implying that collapsing in writable files could likely "easily" done in
> the future.
Definitely. Thank you for all the inputs. :)
Best Regards,
Yan, Zi