Re: [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering

From: Vlastimil Babka (SUSE)

Date: Tue Jun 23 2026 - 02:53:03 EST

On 6/18/26 10:15, zhen.ni wrote:
>
>
> 在 2026/6/18 15:27, Vlastimil Babka (SUSE) 写道:
>> On 6/17/26 10:52, zhen.ni wrote:
>>> 在 2026/5/26 03:58, Andrew Morton 写道:
>>>> On Mon, 25 May 2026 16:16:48 +0800 Zhen Ni <zhen.ni@xxxxxxxxxxxx> wrote:
>>>>
>>>>> This patch series introduces per-file-descriptor filtering capabilities to the
>>>>> page_owner feature.
>>>>
>>>> Thanks again. AI review has found a bunch of new things to get worried
>>>> about:
>>>> https://sashiko.dev/#/patchset/20260525081652.2210206-1-zhen.ni@xxxxxxxxxxxx
>>>>
>>>>
>>> Hi,
>>>
>>> Can this lead to an out-of-bounds memory read?
>>>
>>> The NUMA filter in page_owner (mm/page_owner.c:790-798) bypasses
>>> PF_POISONED_CHECK() to avoid triggering VM_BUG_ON during concurrent page
>>> allocation/free:
>>>
>>> int page_nid = memdesc_nid(page->flags);
>>>
>>> When NODE_NOT_IN_PAGE_FLAGS is defined, memdesc_nid() performs unchecked
>>> array access:
>>>
>>> int memdesc_nid(memdesc_flags_t mdf)
>>> {
>>> return section_to_node_table[memdesc_section(mdf)];
>>> }
>>>
>>> If page->flags is poisoned, memdesc_section() can return a garbage
>>> section_nr that causes out-of-bounds access.
>>>
>>> ## Lockless Access Safety Principle
>>>
>>> The page_owner iterator runs without locks, meaning pages can be
>>> allocated or freed concurrently. The fundamental design principle should be:
>>>
>>> "It's acceptable to skip a small number of abnormal pages, but panics
>>> must be prevented."
>>>
>>> In lockless iteration, TOCTOU is unavoidable - even with reference
>>> counting or RCU, page->flags can still be modified concurrently during
>>> access. Zone locks prevent this but are prohibitively expensive.
>>>
>>> ## Proposed Solution: Add nid to struct page_owner
>>>
>>> Record nid at allocation time when page state is stable, eliminating the
>>> need to extract it from page->flags during iteration:
>>>
>>> ### 1. Modify struct page_owner
>>>
>>> struct page_owner {
>>> unsigned short order;
>>> short last_migrate_reason;
>>> ...
>>> pid_t tgid;
>>> pid_t free_pid;
>>> pid_t free_tgid;
>>> int nid; // NEW
>>> };
>>>
>>> ### 2. Record nid during allocation
>>>
>>> static inline void __update_page_owner_handle(struct page *page, ...)
>>> {
>>> int nid = page_to_nid(page); // Safe in allocation context
>>>
>>> for_each_page_ext(page, 1 << order, page_ext, iter) {
>>> page_owner = get_page_owner(page_ext);
>>> page_owner->nid = nid;
>>> // ... other fields ...
>>> }
>>> }
>>>
>>> ### 3. Use saved nid in NUMA filter
>>>
>>> if (state->nid_filter_enabled) {
>>> int page_nid = page_owner->nid; // Direct read, safe
>>>
>>> if (!node_isset(page_nid, state->nid_filter)) {
>>> spin_unlock_irqrestore(&state->lock, flags);
>>> goto ext_put_continue;
>>> }
>>> }
>>>
>>> ### 4. Update nid on page migration
>>>
>>> // In split_page_owner() when page migrates
>>> page_owner->nid = page_to_nid(&newfolio->page);
>>>
>>
>> This (presumably LLM) suggestion is a, let's say "lazy" solution to the
>> problem, leading to more memory usage. I'd be surprised if it's not possible
>> to read the nid in a way that avoids the hazards. If page_to_nid() can
>> trigger a VM_BUG_ON(), then I'd add a version without that VM_BUG_ON(),
>> handling the poisoned state gracefully - if it's poisoned, return e.g.
>> NUMA_NO_NODE and skip the page, or something.
>>
>>> The remaining two issues can also be improved. If there are no
>>> additional comments, I will proceed with sending v10.
>>>
>>>
>>> Thanks,
>>> Zhen
>>
>>
>>
>
> Thank you for the review.
>
> I'd like to clarify that the "nid" member approach was my own design
> after careful consideration of alternatives, not a suggestion from
> automated tools.:)

Alright, good :)

> In fact, LLMs suggested approaches like "check then access" which I had
> already implemented and rejected in earlier versions due to TOCTOU
> issues. The key insight is that page_owner serves as a buffering layer
> for struct page, eliminating lockless access inconsistency entirely.

Sure, but is the elimination necessary if it has a memory cost?

> Think about it this way:
> Even with extensive checks, page_owner->handle cannot guarantee the page
> won't be freed when printed. The time window between checking
> page->flags and accessing page->flags is unavoidable in lockless
> iteration.

Of course, but that's an argument saying that page owner as a whole can't be
a perfect snapshot anyway. Then it follows that "nid" snapshot doesn't need
to be perfect either.

> By recording nid at allocation time (when page->flags is stable),
> page_owner becomes a "consistent snapshot" that can be safely accessed
> without locks. This is the same principle that makes page_owner work as
> a debug feature in the first place - it accepts a small inconsistency
> window in exchange for lockless access.

And that means we can accept some more small inconsistency for nid, without
occupying more memory.

> Alternative approaches (adding poison checks, bounds checking) cannot
> fully eliminate TOCTOU in lockless code - they just reduce the
> probability. The buffering approach is the way to provide both
> safety and lockless performance.

Please explain how the following "page_to_nid_robust()" (or similar name)
code cannot eliminate the out of bounds accesses completely. The section/nid
in page flags is possibly the most stable part of the whole struct page, it
doesn't change as page is allocated, freed, or how it's used. The only
problem is the poison.

memdesc_flags_t flags = READ_ONCE(page->flags);

// our flags variable is no longer subject to TOCTOU

if (flags.f == PAGE_POISON_PATTERN)
return NUMA_NO_NODE;

return memdesc_nid(flags);

-
> Thanks,
> Zhen