Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: H. Peter Anvin

Date: Sat Jun 20 2026 - 16:02:53 EST


On June 20, 2026 12:33:35 PM PDT, Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>
>Thomas, thanks again for taking the time to look into this and help out.
>
>> Zach!
>>
>> On Fri, Jun 19 2026 at 12:20, Zach O'Keefe wrote:
>> > While it seems common opinion that the IST-based solution is fragile,
>> > what of FRED? It seems like this is exactly the kind of support needed
>> > to avoid some of the aforementioned sw "mess" in various x86 exception
>> > handling paths. I agree that it's less-than-ideal that we are forced
>> > to downgrade exception levels in the common #PF case, but is that an
>> > unsurmountable problem? Pardon my ignorance.
>>
>> The #PF path is considered performance critical. But how much the
>> downgrade matters needs actual numbers to analyze under various workload
>> scenarios.
>
>Ya, that's my concern as well, as I don't have a good intuition for
>how perf critical kernel #PF is for real workloads. If this is your
>primary concern, I'll take that as a _good_ thing; i.e. there's
>nothing architecturally stopping us from doing this downgrade safely.
>We'll still need the analysis, but that can be a later stage -- we're
>more than happy to get this data for all.
>
>> I've not seen numbers to that effect anywhere. The only numbers provided
>> are marketing material about the memory savings on a freshly booted idle
>> machine. There are _zero_ numbers about the actual real world savings,
>> but claims about the PETABYTE savings possible.
>>
>> Seriously?
>
>This is actually the most understood aspect. With O(100B) active tasks
>fleetwide at any point, it only takes an average savings of O(10KiB)
>per task to get to 1PiB. At least for our fleet, we know the % of
>tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
>math confirms that we expect O(PiB) aggregate savings. The % of stacks
>requiring the full 16KiB is minuscule, but it still occurs at a rate
>higher than what we can tolerate for SO panics. Given the vast
>majority of stacks never exceed the first 4KiB, this enables the
>significant opportunity.
>
>> > Lastly, I just want to clarify what folks have meant by "extraordinary
>> > claims" or "evidence". Aside from the above discussion on FRED
>> > exception handling, the "only" other part of this is the allocation.
>>
>> Clearly anything which is explained with "shouldn't happen" and
>> "unlikely". At cloud scale nothing is unlikely anymore. That's simply the
>> reality of statistical math.
>>
>> As I pointed out before the same applies to the unexplained
>> upgrade/downgrade game with external interrupts. Such issues cannot be
>> papered over without understanding the root cause: decades of
>> experience show that they inevitably come back some time down the
>> road. Cloud scale even guarantees that.
>>
>> > Are people concerned about memory unavailability, deadlocking-type
>> > issues, or something else? We have considerable design freedom here to
>> > avoid certain classes of unreliability, but—barring any clever
>> > tricks—I don't know if the allocation can be guaranteed to succeed in
>> > all conceivable circumstances. I want to ensure that reality does not
>> > present a hard blocker.
>>
>> First of all the failure scenario has to be clearly defined.
>>
>> Right now, if I'm reading the patches correctly this simply can end up
>> killing the wrong tasks/processes just because an OOM situation results
>> in a depletion of the per CPU cache and the very wrong task which runs
>> into the deep call stack situation ends up in the creek without a paddle.
>>
>> The fact that you even fail to abort a CPU bringup when the allocation
>> of the per CPU stack page cache fails makes it pretty clear that
>> exactly zero thought has been spent on this problem.
>>
>> Why the heck does this cache refill call have to be unconditionally in
>> __schedule() where preemption is disabled and therefore GFP_ATOMIC
>> is mandatory? I know "Works for me" (most of the time).
>>
>> And just because I was looking at the patch in question I found this
>> other insanity:
>>
>> > + /*
>> > + * Most likely we faulted in the page right next to the last mapped
>> > + * page in the stack, however, it is possible (but very unlikely) that
>> > + * the faulted page actually skips some pages in the stack. Make sure
>> > + * we do not create more than one hole in the stack, and map every
>> > + * page between the current fault address and the last page that is
>> > + * mapped in the stack.
>> > + */
>>
>> Can anyone with a sane mind and the most minimal understanding of the
>> kernel's inner workings explain to me how the kernel can skip "some
>> pages" on the stack?
>>
>> If the kernel skips a whole page or more then there is a serious bug
>> somewhere. I might be missing something, but again the "very unlikely"
>> wording which handwaves about it is just disgustingly useless.
>>
>> I disagree with Dave on the RFC status of this series. It's not even
>> close to RFC, it's at PoC status.
>
>Absolutely understood. I'm more interested in constructively working
>together (as we can see, we'll need your help) to figure out how the
>x86 experts want to approach this vs discussing _this_ series. Perhaps
>it was my mistake to necro this thread instead of starting a new,
>general discussion. Apologies.
>
>To that end, how would you like to proceed? You may understand the x86
>complexities better than anyone, so hopefully you can guide this in
>the right direction. How would you like us to approach this?
>
>Thanks again for your time, help, and support,
>Zach
>
>
>> Thanks,
>>
>> tglx
>>
>

1 PiB for a fleet only makes sense in the context of the size of that fleet.
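
A rough sketch of that arithmetic, using only the figures quoted above (O(100B) tasks, O(10KiB) saved per task). The total fleet memory is a hypothetical parameter here, not a number anyone has given in this thread, and the function names are made up for illustration:

```python
KIB = 1024          # bytes per KiB
PIB = 1024 ** 5     # bytes per PiB


def aggregate_savings_pib(tasks, saved_kib_per_task):
    """Total savings in PiB for a given task count and per-task savings."""
    return tasks * saved_kib_per_task * KIB / PIB


def fraction_of_fleet(savings_pib, fleet_memory_pib):
    """Savings relative to total fleet memory (fleet size is an assumption)."""
    return savings_pib / fleet_memory_pib


# Figures quoted in the thread: ~100B active tasks, ~10 KiB saved per task.
savings = aggregate_savings_pib(100_000_000_000, 10)
print(f"aggregate savings: {savings:.2f} PiB")

# Hypothetical fleet sizes, only to show how the relative gain scales.
for fleet_pib in (100.0, 1_000.0, 10_000.0):
    share = fraction_of_fleet(savings, fleet_pib)
    print(f"fleet of {fleet_pib:>8.0f} PiB -> {share:.4%} of memory")
```

The absolute number comes out at a bit under 1 PiB, but the relative share depends entirely on the fleet size that is plugged in, which is the missing piece.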

But it's more than that.

You WILL slow down the general case with this stuff, so how much actual gain does this imply? What is the mark needed to even get to a break-even point?

To be honest, this and the multikernel proposal are the worst-motivated massive changes, with no demonstrated value, that I have seen in a very, very long time.