Re: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers

From: Muchun Song

Date: Wed Apr 15 2026 - 05:46:57 EST




> On Apr 15, 2026, at 16:37, Oscar Salvador <osalvador@xxxxxxx> wrote:
>
> On Wed, Apr 15, 2026 at 10:23:26AM +0800, Muchun Song wrote:
>> When memory is hot-removed, section_deactivate() can tear down
>> mem_section->usage while concurrent pfn walkers still inspect the
>> subsection map via pfn_section_valid() or pfn_section_first_valid().
>>
>> After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
>> memory_section->usage") converted the teardown to an RCU-based
>> scheme, the code still relies on SECTION_HAS_MEM_MAP becoming visible
>> to readers before ms->usage is cleared and queued for freeing.
>>
>> That ordering is not guaranteed. section_deactivate() can clear
>> ms->usage and queue kfree_rcu() before another CPU observes the
>> SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
>> valid_section() return true, enter its sched-RCU read-side critical
>> section after kfree_rcu() has already been queued, and then dereference
>> a stale ms->usage pointer.
>>
>> In addition, pfn_to_online_page() can call pfn_section_valid() without
>> entering its own sched-RCU read-side critical section, which leads to a
>> similar problem.
>>
>> The race looks like this:
>>
>>   compact_zone()                      memunmap_pages()
>>   ==============                      ================
>>                                       __remove_pages()->
>>                                        sparse_remove_section()->
>>                                         section_deactivate():
>>                                          a) [ Clear SECTION_HAS_MEM_MAP
>>                                               is reordered to b) ]
>>                                          kfree_rcu(ms->usage)
>>   __pageblock_pfn_to_page()
>>     ......
>>     pfn_valid():
>>       rcu_read_lock_sched()
>>       valid_section()   // returns true
>>       pfn_section_valid()
>>         [ Access ms->usage, which is a UAF ]
>>                                          WRITE_ONCE(ms->usage, NULL)
>>       rcu_read_unlock_sched()          b) Clear SECTION_HAS_MEM_MAP
>>
>> Fix this by using rcu_replace_pointer() when clearing ms->usage in
>> section_deactivate(), so that correctness no longer relies on the order
>> in which SECTION_HAS_MEM_MAP is cleared.
>
> The fix itself does not look too intrusive and I guess it kind of makes
> sense when you think about the ordering issue, so if we want to be
> rock solid, why not.

You are right. Generally speaking, when a pointer is protected by RCU, the
pointer should be cleared first, and only then should the memory it points
to be released via kfree_rcu(). The correct sequence here is therefore:

    old_usage = ms->usage;
    WRITE_ONCE(ms->usage, NULL);
    kfree_rcu(old_usage, rcu);

The actual code has the order reversed. To prevent such errors, the RCU
mechanism provides the rcu_replace_pointer() interface. If you look at its
implementation, this interface is essentially equivalent to the code sequence
mentioned above.

> Does it slow down operations a lot?

Regarding section_deactivate(), I believe there is no slowdown:
rcu_replace_pointer() performs the same accesses, just in the correct order.

>
> I would also point out that you rcu-protect pfn_section_valid().

rcu_dereference_sched() is equivalent to the previous READ_ONCE(), but for
RCU-protected data it is recommended to use the RCU-specific accessors so
that memory-ordering requirements do not have to be handled manually.

>
> Regarding the pfn_to_online_page() race, that is something that pops up
> every now and then, but as David said, we have never seen it happen in the
> wild, so I guess no one really made the time to look into it.

After taking a closer look at commit 5ec8e8ea8b77, it's clear that it was
intended to resolve a race between __pageblock_pfn_to_page() and
section_deactivate(). While the issue surfaced this time via ms->usage, I'm
concerned that even with ms->usage fixed, the underlying race in this
execution path still exists, so subsequent accesses to struct page might run
into similar trouble down the road.

That said, I’d love to hear what everyone thinks about whether this warrants a
full fix.

Thanks,
Muchun.

>
> --
> Oscar Salvador
> SUSE Labs