Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: David Stevens

Date: Sat Jun 20 2026 - 01:03:04 EST

On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>
> Zach!
>
> On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
> > While it seems common opinion that the IST-based solution is fragile,
> > what of FRED? It seems like this is exactly the kind of support needed
> > to avoid some of the aforementioned sw "mess" in various x86 exception
> > handling paths. I agree that it's less-than-ideal that we are forced
> > to downgrade exception levels in the common #PF case, but is that an
> > unsurmountable problem? Pardon my ignorance.
>
> The #PF path is considered performance critical. But how much the
> downgrade matters needs actual numbers to analyze under various workload
> scenarios.
>
> I've not seen numbers to that effect anywhere. The only numbers provided
> are marketing material about the memory savings on a freshly booted idle
> machine. There are _zero_ numbers about the actual real world savings,
> but claims about the PETABYTE savings possible.
>
> Seriously?
>
> > Lastly, I just want to clarify what folks have meant by "extraordinary
> > claims" or "evidence". Aside from the above discussion on FRED
> > exception handling, the "only" other part of this is the allocation.
>
> Clearly anything which is explained with "shouldn't happen" and
> "unlikely". At cloud scale nothing is unlikely anymore. That's simply the
> reality of statistical math.
>
> As I pointed out before the same applies to the unexplained
> upgrade/downgrade game with external interrupts. Such issues cannot be
> papered over without understanding the root cause as from decades long
> experience they come inevitably back some time down the road. Cloud
> scale even guarantees that.
>
> > Are people concerned about memory unavailability, deadlocking-type
> > issues, or something else? We have considerable design freedom here to
> > avoid certain classes of unreliability, but—barring any clever
> > tricks—I don't know if the allocation can be guaranteed to succeed in
> > all conceivable circumstances. I want to ensure that reality does not
> > present a hard blocker.
>
> First of all the failure scenario has to be clearly defined.
>
> Right now, if I'm reading the patches correctly this simply can end up
> killing the wrong tasks/processes just because an OOM situation results
> in a depletion of the per CPU cache and the very wrong task which runs
> into the deep call stack situation ends up in the creek without a paddle.
>
> Given that you even fail to abort a CPU bringup when the allocation of
> the per CPU stack page cache fails, makes it pretty clear that there has
> been spent exactly zero thoughts about this problem.
>
> Why the heck does this cache refill call have to be unconditionally in
> __schedule() where preemption is disabled and therefore GFP_ATOMIC
> is mandatory? I know "Works for me" (most of the time).
> And just because I was looking at the patch in question I found this
> other insanity:
>
> > + /*
> > + * Most likely we faulted in the page right next to the last mapped
> > + * page in the stack, however, it is possible (but very unlikely) that
> > + * the faulted page is actually skips some pages in the stack. Make sure
> > + * we do not create more than one holes in the stack, and map every
> > + * page between the current fault address and the last page that is
> > + * mapped in the stack.
> > + */
>
> Can anyone with a sane mind and the most minimal understanding of the
> kernel's inner working explain to me how the kernel can skip "some
> pages" on the stack?
>
> If the kernel skips a whole page or more then there is a serious bug
> somewhere. I might be missing something, but again the "very unlikely"
> wording which handwaves about it is just disgustingly useless.

FRAME_WARN accepts values up to 8192 bytes, and the warning can always
be ignored or simply disabled. If a stack frame is larger than 4k, then
it's entirely possible for the code and compiler to align in a way
where the first access in the frame skips a page in the stack. I think
we agree that such code would be highly suspect and (hopefully) would
only exist in out-of-tree drivers. But it's something the kernel build
system accepts today. Dynamic kernel stacks suddenly turning that into
a runtime kernel panic seems like exactly the sort of edge case that
we would get yelled at for not addressing.
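
To make the arithmetic concrete, here is a rough userspace sketch. It is
not code from the series: the helper names, the 4 KiB page size, and the
assumption that the page holding the stack pointer at function entry is
the lowest mapped one are all mine, for illustration only. It just counts
how many whole pages sit between the last mapped page and the page of the
first access in a new frame on a downward-growing stack.

```c
#include <stdint.h>

#define STACK_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Index of the page that contains addr. */
static unsigned long page_index(uintptr_t addr)
{
	return addr / STACK_PAGE_SIZE;
}

/*
 * Hypothetical helper, not from the series. Given the stack pointer at
 * function entry, the frame size, and the offset (from the new stack
 * pointer) of the first byte the function touches, return how many
 * whole pages lie strictly between the page holding the old stack
 * pointer and the page of that first access.
 *
 * Assumes a downward-growing stack and that the page holding
 * sp_at_entry is the lowest page currently mapped.
 */
static unsigned long pages_skipped(uintptr_t sp_at_entry,
				   unsigned long frame_size,
				   unsigned long first_access_off)
{
	uintptr_t first_access = sp_at_entry - frame_size + first_access_off;
	unsigned long mapped = page_index(sp_at_entry);
	unsigned long touched = page_index(first_access);

	if (touched + 1 >= mapped)
		return 0;	/* same page, or the directly adjacent one */

	return mapped - touched - 1;
}
```

With an entry stack pointer of 0x10100, a 6000-byte frame whose first
access is at offset 0 lands two pages below the mapped one, so
pages_skipped(0x10100, 6000, 0) reports one skipped page. The same call
with a 512-byte frame reports zero.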

-David

> I disagree with Dave on the RFC status of this series. It's not even
> close to RFC, it's at PoC status.
>
> Thanks,
>
> tglx