Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages

From: Lance Yang

Date: Fri Nov 07 2025 - 05:09:41 EST




On 2025/11/7 17:12, David Hildenbrand (Red Hat) wrote:


5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
    range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
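For readers following along, the call pattern described above might look roughly like the sketch below. It is not the reporter's actual code. The helper names find_text_range() and collapse_text() are made up for illustration, the 2 MiB huge page size is an assumption (the x86-64 default), and taking the first r-xp line of /proc/self/maps as the executable's text is a simplification.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers (v6.1+) */
#endif

/* Assumption: PMD-sized THPs are 2 MiB, as on x86-64 with 4 KiB pages. */
#define HPAGE_SIZE (2UL << 20)

/*
 * Hypothetical helper: find the first executable private mapping (r-xp)
 * listed in /proc/self/maps. For a typical binary this is its own text
 * segment, but that is an assumption, not a guarantee.
 */
int find_text_range(unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512], perms[5];
    int found = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%lx-%lx %4s", start, end, perms) == 3 &&
            strcmp(perms, "r-xp") == 0) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/*
 * Hypothetical helper: ask for a collapse of the huge-page-aligned part
 * of [start, end). Returns 0 on success or a negative errno value.
 */
int collapse_text(unsigned long start, unsigned long end)
{
    unsigned long s = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    unsigned long e = end & ~(HPAGE_SIZE - 1);

    if (s >= e)
        return -ERANGE; /* no fully aligned huge page fits in the range */
    if (madvise((void *)s, e - s, MADV_COLLAPSE) != 0)
        return -errno;  /* e.g. -EINVAL in the dirty-page case discussed here */
    return 0;
}
```

A caller would run find_text_range() once at startup and pass the result to collapse_text(), treating any failure as non-fatal since the collapse is best-effort.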

I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
EINVAL return for dirty pages. I'm happy to work on a patch.

Of course, we could detect that we are in MADV_COLLAPSE and simply write back the pages ourselves. After all,
user space asked for a collapse, and it's not khugepaged that will simply revisit it later.

I did something similar in

commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
Author: David Hildenbrand <david@xxxxxxxxxx>
Date:   Fri May 16 14:39:46 2025 +0200

    s390/uv: Improve splitting of large folios that cannot be split while dirty

    Currently, starting a PV VM on an iomap-based filesystem with large
    folio support, such as XFS, will not work. We'll be stuck in
    unpack_one()->gmap_make_secure(), because we can't seem to make progress
    splitting the large folio.

Where I effectively use filemap_write_and_wait_range().

It could possibly be used early, to write back the whole range to be collapsed in one go.

Exactly!

Since MADV_COLLAPSE is a best-effort thing, having the kernel use
something like filemap_write_and_wait_range() to write back the pages
before collapsing is likely what users would expect.

Anyway, they just want to get a THP, whether the pages are dirty or
clean :)
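Until the kernel does the writeback itself, user space could approximate the same idea by flushing the backing file and retrying when the collapse fails with EINVAL. The following is only a sketch under assumptions: the helper names flush_backing_file() and collapse_with_writeback() are hypothetical, and whether fsync() on the backing file is enough to make the retry succeed on a given kernel and filesystem is untested here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers (v6.1+) */
#endif

/*
 * Hypothetical helper: write back dirty page-cache pages of the file at
 * `path` using fsync(). Returns 0 on success or a negative errno value.
 */
int flush_backing_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    int ret = 0;

    if (fd < 0)
        return -errno;
    if (fsync(fd) != 0)
        ret = -errno;
    close(fd);
    return ret;
}

/*
 * Hypothetical helper: try MADV_COLLAPSE; if it fails with EINVAL (the
 * error reported in this thread for dirty pages), flush the backing file
 * once and retry. Returns 0 on success or a negative errno value.
 */
int collapse_with_writeback(void *addr, size_t len, const char *path)
{
    int ret;

    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    if (errno != EINVAL)
        return -errno;

    ret = flush_backing_file(path);
    if (ret)
        return ret;

    /* Single retry; EINVAL can have other causes that a flush won't fix. */
    return madvise(addr, len, MADV_COLLAPSE) == 0 ? 0 : -errno;
}
```

For the executable's own text, `path` could be "/proc/self/exe". The retry is deliberately limited to one attempt so that an EINVAL with an unrelated cause does not loop.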