Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
From: David Hildenbrand (Arm)
Date: Tue Apr 21 2026 - 09:43:14 EST

On 4/21/26 14:55, Muchun Song wrote:
>
>
>> On Apr 21, 2026, at 19:21, David Hildenbrand (Arm) <david@xxxxxxxxxx> wrote:
>>
>> On 4/15/26 11:20, Muchun Song wrote:
>>>
>>>
>>>
>>> Agree. When I first saw the commit message for 5ec8e8ea8b77, I was curious
>>> because the goal of this commit was to fix an access issue with ms->usage.
>>> Looking at the race diagram, I realized that while this only addresses the
>>> ->usage access, subsequent accesses to struct page will still be problematic.
>>> It's just that the former issue happened to be triggered first in this specific
>>> commit.
>>>
>>>
>>> Glad to know my analysis wasn't off! It seems I've just stumbled upon a
>>> 'well-known secret' within the community. :)
>>
>> Heh, yes.
>>
>>>
>>>
>>> I am not sure if it is worth fixing, especially since I just realized the
>>> community has been aware of this issue for many years. If we do decide to
>>> fix it, I think the most straightforward approach would be to protect it
>>> using RCU, something like:
>>>
>>> # the user side of pfn_to_online_page():
>>> rcu_read_lock();
>>> page = pfn_to_online_page(pfn);
>>> if (!page || !get_page_unless_zero(page))
>>>         goto out_unlock;
>>> rcu_read_unlock();
>>
>>
>> Right, but we'd also have to protect against the sections being marked as
>> offline here. So against a pure concurrent offline_pages().
>
> Right.
>
>>
>> If you're looking for a project, this is really one worth doing! :)
>>
>
> Initially, I wasn't sure if this issue was worth fixing, but it seems
> we are moving in the right direction. I'll give it some more thought
> in my spare time.

The kernel would definitely be a better place if pfn_to_online_page()
could no longer race with concurrent memory offlining+hotunplug. :)
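
To make the pattern discussed above concrete, here is a minimal userspace
sketch (not kernel code) of the lookup side. Everything named *_model is
hypothetical, the RCU calls are single-threaded stand-ins that cannot show a
real grace period, and the ordering in offline_section_model() is only one
assumed way of handling a pure concurrent offline_pages(); it is not what the
kernel does today.

```c
/* Userspace model only: every *_model name is hypothetical, not kernel API. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_SECTION_MODEL 4

/*
 * Stand-ins for rcu_read_lock()/rcu_read_unlock()/synchronize_rcu().
 * They only track nesting depth; a real implementation would provide the
 * grace-period guarantee that this single-threaded model cannot demonstrate.
 */
static int rcu_depth_model;
static void rcu_read_lock_model(void)   { rcu_depth_model++; }
static void rcu_read_unlock_model(void) { assert(rcu_depth_model > 0); rcu_depth_model--; }
static void synchronize_rcu_model(void) { assert(rcu_depth_model == 0); }

struct page_model {
	atomic_int refcount;
};

struct section_model {
	atomic_bool online;
	struct page_model pages[PAGES_PER_SECTION_MODEL];
};

/* Return the page only while the section is still marked online. */
static struct page_model *pfn_to_online_page_model(struct section_model *ms, size_t pfn)
{
	if (pfn >= PAGES_PER_SECTION_MODEL || !atomic_load(&ms->online))
		return NULL;
	return &ms->pages[pfn];
}

/* Take a reference unless the count is already zero. */
static bool get_page_unless_zero_model(struct page_model *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_page_model(struct page_model *page)
{
	int old = atomic_fetch_sub(&page->refcount, 1);

	assert(old > 0);
}

/* The "user side": lookup and reference grab inside one read-side section. */
static struct page_model *try_get_online_page_model(struct section_model *ms, size_t pfn)
{
	struct page_model *page;

	rcu_read_lock_model();
	page = pfn_to_online_page_model(ms, pfn);
	if (page && !get_page_unless_zero_model(page))
		page = NULL;
	rcu_read_unlock_model();
	return page;
}

/*
 * Assumed offline side: clear the online flag first, then wait for readers
 * that may have seen the old flag before anything else is torn down.
 */
static void offline_section_model(struct section_model *ms)
{
	atomic_store(&ms->online, false);
	synchronize_rcu_model();
}
```

The point of the sketch is only the ordering: the walker does the lookup and
the reference grab under one read-side section, and the offline side flips the
flag before waiting for those readers.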
--
Cheers,
David