Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: H. Peter Anvin

Date: Thu Jun 18 2026 - 19:01:06 EST

On 2026-06-18 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular pain points you'd like to see fleshed
>> out, first?
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.
>

That is most definitely the zeroth-order thing. Extraordinary claims require
extraordinary evidence, and this is certainly an extraordinary claim. In
addition to the *massive* maintainability issue, you also have to consider the
additional overheads you will now have to deal with in order to avoid deadlocks.

Almost every OS that has attempted to swap out kernel stacks has been known
to suffer from deadlocks under very high memory load.

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?
> [...]
> * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
> you that. But it's still pretty darn bad.

In some ways, they are actually *worse*.

#PFs need to be able to sleep, because the common case for a #PF in the kernel
is that it touched user space. This means #PF needs to be using IST/SL 0.
However, this is obviously incompatible with handling #PFs on the kernel stack
itself, so now it needs a stack switch. In the common case, it will then need
to demote the #PF back onto the normal execution stack, which is complex in
its own right.

Now, if you are on a pre-FRED system, the IST entries don't nest, so you
absolutely have to make sure you can't get there again through any means
whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
of fun if that interrupt is one which would like to be demoted off the IRQ stack.
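
To make the non-nesting point concrete, here is a toy user-space model, not
kernel code. Every name in it (toy_stack, ist_enter, nesting_enter, and the
demo functions) is made up for illustration. It assumes only that an IST-style
entry reloads the stack pointer from a fixed value on every delivery, whereas
a nesting-aware entry (loosely FRED-like in spirit) keeps pushing below the
current frame when it is already on that stack.

```c
#include <stdio.h>

#define STACK_WORDS 16

/* Toy stand-in for one exception stack; index STACK_WORDS is the "top". */
struct toy_stack {
	unsigned long slot[STACK_WORDS];
	int sp;	/* index of the lowest used slot; the stack grows downward */
};

/*
 * IST-style entry (illustrative): the stack pointer is reset to the same
 * fixed top on every delivery, whether or not the stack is already in use.
 */
static int ist_enter(struct toy_stack *s, unsigned long return_ip)
{
	s->sp = STACK_WORDS;		/* unconditional reset to the top */
	s->slot[--s->sp] = return_ip;	/* push the "exception frame" */
	return s->sp;
}

/*
 * Nesting-aware entry (illustrative): only reset to the top when we are
 * not already running on this stack; otherwise push below the live frame.
 */
static int nesting_enter(struct toy_stack *s, int already_on_stack,
			 unsigned long return_ip)
{
	if (!already_on_stack)
		s->sp = STACK_WORDS;
	s->slot[--s->sp] = return_ip;
	return s->sp;
}

/* Returns 1 if a nested IST-style delivery destroyed the first frame. */
int demo_ist_clobbers(void)
{
	struct toy_stack s = { {0}, 0 };
	int first = ist_enter(&s, 0x1111);

	ist_enter(&s, 0x2222);		/* second delivery, same stack */
	return s.slot[first] != 0x1111;
}

/* Returns 1 if the first frame survives a nested nesting-aware delivery. */
int demo_nesting_preserves(void)
{
	struct toy_stack s = { {0}, 0 };
	int first = nesting_enter(&s, 0, 0x1111);

	nesting_enter(&s, 1, 0x2222);	/* second delivery, same stack */
	return s.slot[first] == 0x1111;
}
```

In this model demo_ist_clobbers() reports that the first frame's return
address has been overwritten, which is why re-entry has to be excluded
entirely on the IST path. The model ignores everything that makes the real
problem hard (register state, the demotion back to the task stack, sleeping).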

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Indeed. Paravirtualization is a great example of how this works. The PV hooks
in the kernel are still a maintenance nightmare 20 years after they were
introduced, and mostly that cost is not borne by the people who introduced and
benefited from them.

-hpa