Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user

From: Russell King (Oracle)

Date: Fri Apr 10 2026 - 07:24:36 EST


On Fri, Apr 10, 2026 at 02:01:41PM +0300, Brian Ruley wrote:
> Thank you for the clarification, this is very educational for me.
> I understand your scepticism, and I can't explain what's going on based
> on your reply. However, I do honestly believe there is a problem
> here. I'll share the exact testing details and the instrumentation
> we added that convinced us to reach out at the end. One other idea we
> had was that cache aliasing could be at play here.
>
> To clarify any potential misunderstanding, we've observed the
> following:
>
> - Sporadic SIGILL and SIGSEGV under memory pressure
> - Scales with core count, i.e., quad core more likely to reproduce
> than dual core. We haven't observed an issue on single core.
> - Coredumps show valid instructions at the faulting PC, i.e. the CPU
> executed something different from what is now in memory. This pointed
> us to a stale I-cache.
> - Instrumentation indicates a correlation: a per-CPU ring buffer
> tracking exec page migrations was dumped on SIGILL, and the faulting
> PC matched a recently migrated page.
> - We started seeing this after upgrading 6.1 -> 6.12 -> 6.18. We bisected
> two commits which had an impact, but we weren't convinced that
> either was the root cause: 5dfab109d5193e6c224d96cabf90e9cc2c039884
> and 6faea3422e3b4e8de44a55aa3e6e843320da66d2.
> - Failed processes include systemd, tar, bash, ...
> - Debug options, e.g., page poisoning, seem to hide the bug
>
>
> > So you're saying that stress-ng doesn't reproduce this bug but
> > triggers the OOM-killer... confused.
>
> Apologies for the confusion. I meant that we created the memory
> pressure with `stress-ng', and OOM might have played a role in
> exposing the "bug", as we believed at the time that anything
> triggering memory reclaim and page migration was the key. One note
> I'll add is that in our tests we invoked stress-ng for 2 minutes
> (--timeout 2m) and rebooted the device after each run. We had
> observed in earlier testing that reboots seemed to have a discernible
> effect on the occurrence, so we kept that in. I'm now beginning to
> doubt whether it had an effect, and unfortunately it's all anecdotal.
>
> One more thing: even if you don't accept the patch, is it harmful in
> any way, or just sub-optimal?
>
> I'll send the instrumentation patch as a follow-up; might be there's
> a flaw in it.

I'll try it - I have Cortex-A9 systems (some of which I rely on...)

Please can you also try to track the history of what happens for
the PTEs corresponding to the old and new PFN?

Thanks.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!