Assume you have a THP (or any mTHP today). With active device-exclusive
entries you can easily trigger the scenario where folio_mapcount() != 0,
then start doing rmap walks, stumble over these device-exclusive entries
and *not* handle them properly. Note that more and more systems are
configured to just give you THPs unless you explicitly opted out using
MADV_NOHUGEPAGE early.
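To illustrate, a hypothetical driver-side sketch; the names follow the
interface added by b756a3b5e7ea, the exact API may have changed since, and
mmu-notifier setup plus error handling are omitted:

	struct page *page = NULL;

	/*
	 * Turn a single 4k PTE of a (possibly THP-backed) VMA into a
	 * device-exclusive entry so the device can run its atomic.
	 */
	make_device_exclusive_range(mm, addr, addr + PAGE_SIZE, &page, owner);

	/*
	 * The folio is still mapped as far as rmap is concerned:
	 * folio_mapcount() != 0, but this PTE is now a non-present
	 * device-exclusive swap entry that any rmap walk can run into.
	 */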
Note that commit b756a3b5e7ea added the hunk below that still walks these
device-exclusive entries in rmap code, but didn't actually update the rmap
walkers:
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 		pfn = swp_offset(entry);
That was the right thing to do, because they resemble PROT_NONE entries and
not migration entries or anything else that doesn't hold a folio reference.
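To spell out the consequence for the walkers, roughly the pattern a
page_vma_mapped_walk() caller has to follow nowadays (a simplified sketch,
not the actual code of any particular walker; real ones like
try_to_unmap_one() do a lot more):

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		pte_t pteval;

		/* pvmw.pte is NULL for a PMD-mapped THP; not shown here. */
		pteval = ptep_get(pvmw.pte);

		if (!pte_present(pteval) &&
		    is_device_exclusive_entry(pte_to_swp_entry(pteval))) {
			/*
			 * Non-present, but like PROT_NONE it still holds a
			 * folio reference and a mapcount: don't apply
			 * present-PTE logic (pte_dirty(), pte_write(), ...)
			 * and don't silently skip it either; handle it
			 * similarly to a PROT_NONE mapping here.
			 */
			continue;
		}

		/* Present PTE (or other non-present entry): usual handling. */
	}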
Yeah, I got that part. What I meant is that doubling down on this needs a
full audit and cannot rely on "we've already had device-private entries
going through these paths for much longer", which was the impression I
got. I guess it worked, thanks for doing that below :-)
And from my very rough understanding of mm, at least around all this GPU
stuff, tracking device-exclusive mappings like real CPU mappings makes
sense: they do indeed act like PROT_NONE with some magic to restore access
on fault.
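For context, that restore-on-fault magic is the device-exclusive branch in
do_swap_page(); roughly (paraphrased, not a verbatim quote from
mm/memory.c):

	} else if (is_device_exclusive_entry(entry)) {
		/*
		 * The entry still points at the page; taking the fault tears
		 * the device-exclusive entry down again and re-establishes a
		 * regular present PTE, with the driver being notified via
		 * mmu notifiers.
		 */
		vmf->page = pfn_swap_entry_to_page(entry);
		ret = remove_device_exclusive_entry(vmf);
	}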
I do wonder a bit, though, what else isn't properly tracked because it
should behave like PROT_NONE but doesn't. I guess we'll find those as we
hit them :-/
If THP constantly reassembles a PMD entry because, hey, all the memory is
contiguous and userspace allocated a chunk of memory to place atomics that
alternate between CPU and GPU, nicely separated by 4k pages, then we'll
thrash around invalidating PTEs to no end. So there might be more fallout
here.
khugepaged will back off once it sees an exclusive entry, so collapsing
could only happen once everything is non-exclusive. See
__collapse_huge_page_isolate() as an example.
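For reference, the check in question looks roughly like this (paraphrased
from __collapse_huge_page_isolate(), not a verbatim quote):

	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, address += PAGE_SIZE) {
		pte_t pteval = ptep_get(_pte);
		...
		/*
		 * A device-exclusive entry is a non-present swap PTE, so
		 * khugepaged gives up on the whole PMD range here instead
		 * of collapsing it.
		 */
		if (!pte_present(pteval)) {
			result = SCAN_PTE_NON_PRESENT;
			goto out;
		}
		...
	}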
Ah ok. I think it might be good to add that to the commit message, so that
people who don't understand mm deeply (like me) aren't worried when they
stumble over this change again in the future while digging around.
It's really only page_vma_mapped_walk() callers that are affected by this
change, not any other page table walkers.
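Rough illustration of what I mean (not actual code from any specific
walker): ordinary page table walkers have been coping with device-exclusive
entries as just another non-present swap entry all along; it's
page_vma_mapped_walk() that now hands them to its callers as a mapping of
the folio:

	/*
	 * Typical generic walker: anything non-present (swap, migration,
	 * device-private, device-exclusive, ...) goes through its own
	 * non-present path or is skipped; nothing changes for it.
	 */
	pte_t pteval = ptep_get(pte);

	if (!pte_present(pteval))
		return 0;

	/*
	 * page_vma_mapped_walk(), by contrast, reports the device-exclusive
	 * entry via pvmw.pte, so what to do with it is up to its callers:
	 * folio_referenced(), try_to_unmap()/try_to_migrate(),
	 * folio_mkclean(), and friends.
	 */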
I guess my mm understanding is just not up to that, but I couldn't figure
out why looking only at page_vma_mapped_walk() is good enough?