Assume you have a THP (or any mTHP today). With active device-exclusive
entries you can easily trigger the scenario where folio_mapcount() != 0,
then start doing rmap walks, stumble over these device-exclusive entries
and *not* handle them properly. Note that more and more systems are
configured to just give you THPs unless you explicitly opted out using
MADV_NOHUGEPAGE early.
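To illustrate, a hypothetical driver-side sketch; the names follow the
interface added by b756a3b5e7ea, the exact API may have changed since, and
mmu-notifier setup plus error handling are omitted:

	struct page *page = NULL;

	/*
	 * Turn a single 4k PTE of a (possibly THP-backed) VMA into a
	 * device-exclusive entry so the device can run its atomic.
	 */
	make_device_exclusive_range(mm, addr, addr + PAGE_SIZE, &page, owner);

	/*
	 * The folio is still mapped as far as rmap is concerned:
	 * folio_mapcount() != 0, but this PTE is now a non-present
	 * device-exclusive swap entry that any rmap walk can run into.
	 */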
Note that commit b756a3b5e7ea added the hunk below that still walks these
device-exclusive entries in rmap code, but didn't actually update the rmap
walkers:
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 		pfn = swp_offset(entry);
That was the right thing to do, because they resemble PROT_NONE entries and
not migration entries or anything else that doesn't hold a folio reference.
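To spell out the consequence for the walkers, roughly the pattern a
page_vma_mapped_walk() caller has to follow nowadays (a simplified sketch,
not the actual code of any particular walker; real ones like
try_to_unmap_one() do a lot more):

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		pte_t pteval;

		/* pvmw.pte is NULL for a PMD-mapped THP; not shown here. */
		pteval = ptep_get(pvmw.pte);

		if (!pte_present(pteval) &&
		    is_device_exclusive_entry(pte_to_swp_entry(pteval))) {
			/*
			 * Non-present, but like PROT_NONE it still holds a
			 * folio reference and a mapcount: don't apply
			 * present-PTE logic (pte_dirty(), pte_write(), ...)
			 * and don't silently skip it either; handle it
			 * similarly to a PROT_NONE mapping here.
			 */
			continue;
		}

		/* Present PTE (or other non-present entry): usual handling. */
	}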
Yeah, I got that part. What I meant is that doubling down on this needs a
full audit and cannot rely on "we've already had device-private entries
going through these paths for much longer", which was the impression I
got. I guess it worked, thanks for doing that below :-)
And from my very rough understanding of mm, at least around all this GPU
stuff, tracking device-exclusive mappings like real CPU mappings makes
sense: they do indeed act like PROT_NONE with some magic to restore access
on fault.
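For context, that restore-on-fault magic is the device-exclusive branch in
do_swap_page(); roughly (paraphrased, not a verbatim quote from
mm/memory.c):

	} else if (is_device_exclusive_entry(entry)) {
		/*
		 * The entry still points at the page; taking the fault tears
		 * the device-exclusive entry down again and re-establishes a
		 * regular present PTE, with the driver being notified via
		 * mmu notifiers.
		 */
		vmf->page = pfn_swap_entry_to_page(entry);
		ret = remove_device_exclusive_entry(vmf);
	}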
I do wonder a bit, though, what else isn't properly tracked because it
should behave like PROT_NONE but doesn't. I guess we'll find those as we
hit them :-/
If THP constantly reassembles a PMD entry because, hey, all the memory is
contiguous and userspace allocated a chunk of memory to place atomics that
alternate between CPU and GPU, nicely separated by 4k pages, then we'll
thrash around invalidating PTEs to no end. So there might be more fallout
here.
khugepaged will back off once it sees an exclusive entry, so collapsing
could only happen once everything is non-exclusive. See
__collapse_huge_page_isolate() as an example.
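For reference, the check in question looks roughly like this (paraphrased
from __collapse_huge_page_isolate(), not a verbatim quote):

	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, address += PAGE_SIZE) {
		pte_t pteval = ptep_get(_pte);
		...
		/*
		 * A device-exclusive entry is a non-present swap PTE, so
		 * khugepaged gives up on the whole PMD range here instead
		 * of collapsing it.
		 */
		if (!pte_present(pteval)) {
			result = SCAN_PTE_NON_PRESENT;
			goto out;
		}
		...
	}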
Ah ok. I think it might be good to add that to the commit message, so that
people who don't understand mm deeply (like me) aren't worried when they
stumble over this change again in the future while digging around.
It's really only page_vma_mapped_walk() callers that are affected by this
change, not any other page table walkers.
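Rough illustration of what I mean (not actual code from any specific
walker): ordinary page table walkers have been coping with device-exclusive
entries as just another non-present swap entry all along; it's
page_vma_mapped_walk() that now hands them to its callers as a mapping of
the folio:

	/*
	 * Typical generic walker: anything non-present (swap, migration,
	 * device-private, device-exclusive, ...) goes through its own
	 * non-present path or is skipped; nothing changes for it.
	 */
	pte_t pteval = ptep_get(pte);

	if (!pte_present(pteval))
		return 0;

	/*
	 * page_vma_mapped_walk(), by contrast, reports the device-exclusive
	 * entry via pvmw.pte, so what to do with it is up to its callers:
	 * folio_referenced(), try_to_unmap()/try_to_migrate(),
	 * folio_mkclean(), and friends.
	 */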
I guess my mm understanding is just not up to that, but I couldn't figure
out why looking only at page_vma_mapped_walk() is good enough?