Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages
From: Lance Yang
Date: Fri Nov 07 2025 - 05:09:41 EST
On 2025/11/7 17:12, David Hildenbrand (Red Hat) wrote:
5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
EINVAL return for dirty pages. I'm happy to work on a patch.
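
For readers following along, the call pattern described above can be sketched roughly as follows. This is only an illustration, not the reporter's actual code: the helper names are made up, the 2 MiB huge page size is an assumption (PMD-sized THPs on x86-64 with 4 KiB pages), treating the first executable mapping in /proc/self/maps as the program's text is an assumption, and narrowing the range to huge-page-aligned boundaries is a choice made here for clarity.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers (Linux 6.1+) */
#endif

/* Assumption: 2 MiB PMD-sized THPs (x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE (2UL << 20)

/* Parse one /proc/<pid>/maps line. Returns 1 and fills start/end if the
 * line describes a readable, executable mapping, 0 otherwise. */
static int parse_exec_mapping(const char *line,
                              unsigned long *start, unsigned long *end)
{
    char perms[5];

    if (sscanf(line, "%lx-%lx %4s", start, end, perms) != 3)
        return 0;
    return perms[0] == 'r' && perms[2] == 'x';
}

/* Shrink [*start, *end) to its largest huge-page-aligned subrange.
 * Returns 0 if the range does not contain a whole huge page. */
static int align_to_hpage(unsigned long *start, unsigned long *end)
{
    unsigned long s = (*start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    unsigned long e = *end & ~(HPAGE_SIZE - 1);

    if (s >= e)
        return 0;
    *start = s;
    *end = e;
    return 1;
}

/* Hypothetical helper: ask the kernel to collapse the first executable
 * mapping listed in /proc/self/maps. Returns 0 on success, -1 with errno
 * set otherwise (EINVAL is what the thread reports for dirty pages). */
static int collapse_first_text_mapping(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    unsigned long start, end;
    int ret = -1, err = ENOENT; /* ENOENT: no executable mapping found */

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (!parse_exec_mapping(line, &start, &end))
            continue;
        if (!align_to_hpage(&start, &end)) {
            err = ERANGE; /* mapping too small to hold a huge page */
            break;
        }
        ret = madvise((void *)start, end - start, MADV_COLLAPSE);
        err = errno;
        break;
    }
    fclose(f);
    if (ret != 0)
        errno = err;
    return ret;
}
```

Calling collapse_first_text_mapping() early in main() and printing strerror(errno) on failure is enough to see which error the running kernel returns.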
Of course, we could detect that we are in MADV_COLLAPSE and simply write back ourselves. After all,
user space asked for a collapse, and it's not khugepaged that will simply revisit it later.
I did something similar in
commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
Author: David Hildenbrand <david@xxxxxxxxxx>
Date: Fri May 16 14:39:46 2025 +0200
s390/uv: Improve splitting of large folios that cannot be split while dirty
Currently, starting a PV VM on an iomap-based filesystem with large
folio support, such as XFS, will not work. We'll be stuck in
unpack_one()->gmap_make_secure(), because we can't seem to make progress
splitting the large folio.
where I effectively use filemap_write_and_wait_range().
It could possibly be used early to write back the whole range to collapse once.
Exactly!
Since MADV_COLLAPSE is a best-effort thing, having the kernel use
something like filemap_write_and_wait_range() to write back the pages
before collapsing is likely what users would expect.
Anyway, they just want to get a THP, whether the pages are dirty or
clean :)
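
Until the kernel does the writeback itself, a user-space approximation of the same idea might look like the sketch below: on failure, flush the backing file with fsync() and retry once. This is an assumption-laden illustration and not a tested fix. The helper names are hypothetical, and it assumes that fsync() on the backing file leaves the relevant page cache pages clean enough for a second MADV_COLLAPSE attempt to succeed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers (Linux 6.1+) */
#endif

/* Errors after which a flush-and-retry seems worth attempting.
 * Assumption: EINVAL is what the thread reports for dirty pages today,
 * and EAGAIN is the replacement value proposed in the thread. */
static int worth_retrying_after_flush(int err)
{
    return err == EINVAL || err == EAGAIN;
}

/* Hypothetical helper: try MADV_COLLAPSE on [addr, addr + len); if it
 * fails with a retryable error, fsync() the backing file at 'path' and
 * try once more. Returns 0 on success, -1 with errno set otherwise. */
static int collapse_with_flush(const char *path, void *addr, size_t len)
{
    int fd, saved;

    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    if (!worth_retrying_after_flush(errno))
        return -1;

    /* Linux accepts fsync() on a descriptor opened read-only. */
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    close(fd);

    /* Second and final attempt, now that writeback has completed. */
    return madvise(addr, len, MADV_COLLAPSE);
}
```

The retry is deliberately limited to one attempt, since a second failure most likely has a cause other than dirty pages.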