Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: Thomas Gleixner

Date: Fri Jun 19 2026 - 17:59:34 EST

Zach!

On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
> While it seems common opinion that the IST-based solution is fragile,
> what of FRED? It seems like this is exactly the kind of support needed
> to avoid some of the aforementioned sw "mess" in various x86 exception
> handling paths. I agree that it's less-than-ideal that we are forced
> to downgrade exception levels in the common #PF case, but is that an
> unsurmountable problem? Pardon my ignorance.

The #PF path is considered performance critical. But how much the
downgrade matters can only be judged from actual numbers, measured
under various workload scenarios.

I've not seen numbers to that effect anywhere. The only numbers provided
are marketing material about the memory savings on a freshly booted idle
machine. There are _zero_ numbers about the actual real world savings,
but claims about the PETABYTE savings possible.

Seriously?

> Lastly, I just want to clarify what folks have meant by "extraordinary
> claims" or "evidence". Aside from the above discussion on FRED
> exception handling, the "only" other part of this is the allocation.

Clearly anything which is explained with "shouldn't happen" and
"unlikely". At cloud scale nothing is unlikely anymore. That's simply the
reality of statistical math.

As I pointed out before, the same applies to the unexplained
upgrade/downgrade game with external interrupts. Such issues cannot be
papered over without understanding the root cause. Decades of
experience show that they inevitably come back some time down the
road. Cloud scale even guarantees that.

> Are people concerned about memory unavailability, deadlocking-type
> issues, or something else? We have considerable design freedom here to
> avoid certain classes of unreliability, but—barring any clever
> tricks—I don't know if the allocation can be guaranteed to succeed in
> all conceivable circumstances. I want to ensure that reality does not
> present a hard blocker.

First of all the failure scenario has to be clearly defined.

Right now, if I'm reading the patches correctly, this can simply end up
killing the wrong tasks/processes, just because an OOM situation
depletes the per CPU cache and exactly the wrong task, the one which
happens to run into the deep call stack situation, is left up the
creek without a paddle.

The fact that you even fail to abort a CPU bringup when the allocation
of the per CPU stack page cache fails makes it pretty clear that
exactly zero thought has been spent on this problem.
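
For illustration only, a userspace model of what aborting the bringup
on allocation failure would look like. This is not code from the
series and not kernel code. Every name in it (cpu_stack_cache,
stack_cache_prepare(), cpu_bringup(), budget_alloc(), the cache depth
of 4) is hypothetical. The only point is that the error is propagated
and the partial allocation is rolled back instead of being ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define STACK_CACHE_PAGES 4	/* hypothetical cache depth */
#define PAGE_BYTES 4096

struct cpu_stack_cache {
	void *pages[STACK_CACHE_PAGES];
	int nr;
};

/* Allocator hook, so that a failure can be injected for testing. */
typedef void *(*page_alloc_fn)(size_t);

/* Test allocator: succeeds alloc_budget times, then returns NULL. */
int alloc_budget;

void *budget_alloc(size_t size)
{
	if (alloc_budget <= 0)
		return NULL;
	alloc_budget--;
	return malloc(size);
}

/* Free everything that is currently in the cache. */
void stack_cache_release(struct cpu_stack_cache *c)
{
	while (c->nr > 0)
		free(c->pages[--c->nr]);
}

/* Fill the cache completely, or roll back and report -ENOMEM. */
int stack_cache_prepare(struct cpu_stack_cache *c, page_alloc_fn alloc)
{
	assert(c && alloc);
	c->nr = 0;
	for (int i = 0; i < STACK_CACHE_PAGES; i++) {
		void *p = alloc(PAGE_BYTES);

		if (!p) {
			stack_cache_release(c);
			return -ENOMEM;
		}
		c->pages[c->nr++] = p;
	}
	return 0;
}

/* Model of a bringup step: a prepare failure keeps the CPU offline. */
int cpu_bringup(struct cpu_stack_cache *c, page_alloc_fn alloc)
{
	int ret = stack_cache_prepare(c, alloc);

	if (ret)
		return ret;	/* abort, do not bring the CPU online */
	return 0;
}
```

With alloc_budget set to 2, cpu_bringup() returns -ENOMEM and leaves
the cache empty. With malloc() as the allocator it returns 0 with all
four pages in place.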

Why the heck does this cache refill call have to be unconditionally in
__schedule() where preemption is disabled and therefore GFP_ATOMIC
is mandatory? I know "Works for me" (most of the time).

And just because I was looking at the patch in question I found this
other insanity:

> + /*
> + * Most likely we faulted in the page right next to the last mapped
> + * page in the stack, however, it is possible (but very unlikely) that
> + * the faulted page is actually skips some pages in the stack. Make sure
> + * we do not create more than one holes in the stack, and map every
> + * page between the current fault address and the last page that is
> + * mapped in the stack.
> + */

Can anyone with a sane mind and the most minimal understanding of the
kernel's inner workings explain to me how the kernel can skip "some
pages" on the stack?

If the kernel skips a whole page or more then there is a serious bug
somewhere. I might be missing something, but again the "very unlikely"
wording which handwaves about it is just disgustingly useless.

I disagree with Dave on the RFC status of this series. It's not even
close to RFC, it's at PoC status.

Thanks,

tglx