Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: Thomas Gleixner

Date: Fri Jun 19 2026 - 08:45:57 EST

On Thu, Jun 18 2026 at 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular pain points you'd like to see fleshed
>> out, first?
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.

There is none, because it's simply impossible to guarantee. Reading
through the series, even a CPU hotplug operation happily continues with
success when the stack page cache of the upcoming CPU can't be
filled....

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?

It's already a disaster. See kvm_handle_async_pf() and the cute issues
vs. taking a #PF in NMI or some other IST handler.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Correct.

Aside from that, the part which worries me most is the IDT hackery.
That's fragile as hell and full of unvalidated assumptions. Reading
"should not happen" several times in a changelog doesn't make me more
confident.

"It is possible for #MCE to occur on the #PF IST stack, but the #MCE
handler shouldn't generate new #PFs. The reentrancy check on the #PF
stack will trigger if any recoverable #MCEs do generate #PFs - if there
are actually reports of it happening, we can address it then."

Seriously?

We don't wait until the report comes in because the report won't even
happen in the worst case:

  #PF on IST
    ...
    cmp 0, reentrance
    jne abort

      #MC
        ...
        #PF rewinds #PF IST
          cmp 0, reentrance
          jne abort   <- Not taken because #MC happened before
                         it could be set.

IST is fundamentally not suitable for this and I'm sure there are more
holes in this.
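
For illustration only, the window in the sketch above can be modeled in
user space. This is a hypothetical toy, not kernel code and not code
from the series: struct ist_model, ist_entry(), guard_passes(),
nested_pf(), outer_pf() and ist_race_demo() are made-up names, and the
"frame" is just a string standing in for whatever sits at the top of the
#PF IST stack. It assumes only what the sketch states: every IST entry
rewinds to the same stack top, and the guard is a plain check followed
by a separate store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a single #PF IST stack with a reentrancy guard. */
struct ist_model {
	int  reentrance;   /* guard flag from the sketch above */
	char frame[16];    /* stands in for the frame at the IST stack top */
	bool aborted;      /* true once the guard has caught a nested entry */
};

/* An IST entry always starts at the same stack top and overwrites it. */
static void ist_entry(struct ist_model *m, const char *who)
{
	snprintf(m->frame, sizeof(m->frame), "%s", who);
}

/* Models "cmp 0, reentrance; jne abort". */
static bool guard_passes(struct ist_model *m)
{
	if (m->reentrance) {
		m->aborted = true;
		return false;
	}
	return true;
}

/* Models a #PF raised from within the #MC handler. */
static void nested_pf(struct ist_model *m)
{
	ist_entry(m, "nested");
	if (!guard_passes(m))
		return;
	m->reentrance = 1;
	m->reentrance = 0;	/* handled and returned */
}

/* Returns true if the outer #PF frame survived until the handler's end. */
static bool outer_pf(struct ist_model *m, bool mc_before_set)
{
	bool intact;

	memset(m, 0, sizeof(*m));
	ist_entry(m, "outer");
	if (!guard_passes(m))
		return false;
	if (mc_before_set)
		nested_pf(m);	/* #MC lands between the check and the store */
	m->reentrance = 1;
	if (!mc_before_set)
		nested_pf(m);	/* #MC lands after the store */
	intact = strcmp(m->frame, "outer") == 0;
	m->reentrance = 0;
	return intact;
}

/* Walks both orderings and checks what the guard gets to see. */
static void ist_race_demo(void)
{
	struct ist_model m;

	/* #MC after the store: frame clobbered, but the guard fires. */
	assert(!outer_pf(&m, false) && m.aborted);
	/* #MC before the store: frame clobbered and the guard stays silent. */
	assert(!outer_pf(&m, true) && !m.aborted);
}
```

In this model the outer frame is overwritten in both orderings. The only
difference is whether the guard notices, and in the ordering from the
sketch it does not, so nothing is ever reported.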

I haven't looked at the FRED side of affairs yet in detail, but the
handwavy explanation about external interrupts having to be moved to
stack level 1 and unconditionally bounced back does not really make it
appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
helpful, but papering over the problem without understanding the root
cause is not cutting it. If it's a genuine FRED hardware issue, then
this needs to be understood and documented.

The x86 folks have spent a lot of time making the horrific x86
interrupt and exception handling solid and therefore have zero interest
in dealing with the fallout of something based on "shouldn't happen"
assumptions. Either the series can prove correctness under all
circumstances or not.

I understand the "save tons of memory across a fleet" argument, but a
large fleet is also a guarantee to trigger all the "should not happen
and improbable" issues which are gracefully handwaved away. That's a
truly bad tradeoff as it ends up in non-decodable bug reports. What's
worse, they have to be handled by the maintainers and not necessarily by
those who implemented it.

Thanks,

tglx