Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation
From: Alistair Popple
Date: Thu Dec 01 2022 - 17:55:45 EST
David Hildenbrand <david@xxxxxxxxxx> writes:
>>>>> I think you're right. Without a page reference I don't think it is even
>>>>> safe to look at struct page, at least not without synchronisation
>>>>> against memory hot unplug which could remove the struct page. From a
>>>>> quick glance I didn't see anything here that obviously did that though.
>>>> Memory hotplug is the offending party here. It has to make sure
>>>> that
>>>> everything else is definitely quiescent before removing the struct pages.
>>>> Otherwise you can't even try_get a refcount.
>> Sure, I might be missing something but how can memory hotplug
>> possibly
>> synchronise against some process (eg. try_to_compact_pages) doing
>> try_get(pfn_to_page(random_pfn)) without that process either informing
>> memory hotplug that it needs pages to remain valid long enough to get a
>> reference or detecting that memory hotplug is in the process of
>> offlining pages?
>
> It currently doesn't, and it's been mostly a known theoretical problem
> so far. We've been ignoring these kind of problems because they are
> not easy to sort out and so far never popped up in practice.
>
> First, the correct approach is using pfn_to_online_page() instead of
> pfn_to_page() when in doubt whether the page might already be
> offline. While we're using pfn_to_online_page() already in quite some
> PFN walkers, changes are good that we're still missing some.
>
> Second, essentially anybody (PFN walker) doing a pfn_to_online_page()
> is prone to races with memory offlining. I've discussed this in the
> past with Michal and Dan and one idea was to use RCU to synchronize
> PFN walkers and pfn_to_online_page(), essentially synchronizing
> clearing of the SECTION_IS_ONLINE flag.
>
> Nobody worked on that so far because we've never had actual BUG
> reports. These kind of races are rare, but they are theoretically
> possible.
Thanks for providing some background context here. I have been digging
into these details a bit because GPU drivers using device-private memory
tend to excersise memory hot unplug paths more frequently than what I
suspect is normally the case. I haven't actually seen any real BUG
reports caused by this though, and it's likely the places where they're
likely would involve device-private pages at the moment anyway. So agree
such races are rare.
[...]
>> try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
>> isolate_migratepages() -> isolate_migratepages_block() ->
>> PageHuge(pfn_to_page(pfn))
>> Again I couldn't see anything in that path that would hold the page
>> stable long enough to safely perform the pfn_to_page() and get a
>> reference. And after a bit of fiddling I was able to trigger the below
>> oops running the compaction_test selftest whilst offlining memory so I
>> don't think it is safe:
>
> Thanks for finding a reproducer. This is exactly the kind of BUG that
> we speculated about in the past but that we haven't seen in practice
> yet.
No worries. The reproducer consisted of running the compaction_test
selftest in a loop on a single NUMA node whilst cycling every memory
block in that node online/offline in parallel (ie. not a particularly
realistic real world workload). Eventually that led to a couple of
different kernel OOPS/BUG backtraces.
> Having that said, I'd be very happy if someone would have time to work
> on proper synchronization and I'd be happy to help
> brainstorming/reviewing :)
I would like to see the synchronisation cleaned up here and am happy to
work on it. I'm unlikely to have much bandwidth to do so before the new
year but any brainstorming/review would certainly be appreciated when I
do find some time though!