Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: Zach O'Keefe

Date: Fri Jun 19 2026 - 15:57:10 EST

On Thu, Jun 18, 2026 at 5:29 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:

Thanks for the thoughts, Dave.

> On 4/24/26 15:26, David Laight wrote:
> >> This is true until, in a fleet of millions of machines, you encounter a
> >> one-in-a-billion chance of a stack overflow. You are then forced to
> >> double the statically allocated kernel stacks on every machine, paying a
> >> memory tax even though 99.999..% of threads never exceed 4K. This
> >> overhead accumulates to petabytes of wasted capacity.
> > And then you hit a stack fault in some path where you can't sleep and
> > there isn't any available kernel memory.
> >
> > An alternative idea is to arrange for some system calls to sleep in
> > userspace, so when the thread is woken it re-executes the system call.
> > It then makes sense to assign the kernel stack to the process when
> > it enters the kernel.
>
> There are probably other ways to do this without handling exceptions.
>
> For instance, let's say you always *map* 16k of stack for each process.
> But, after context switching out, you take a look at 4x8b pte_t's that
> were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
> can just clear _PAGE_PRESENT and reclaim the page.
>
> If you don't want the overhead in the normal context switch path, you
> reclaim in a shrinker, at the cost of needing locking to coordinate with
> the scheduler.
>
> A simple rule would be: a thread that ever accesses a page gets to keep
> it forever. They're never reclaimed after being accessed, only before.

That's an interesting take; but it's a one-way latch, right? How do we
know that task won't dive deeper, later?
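
To make the question concrete, here is a rough userspace model of the
rule as I read it. This is only a sketch: every name is made up for
illustration, nothing here is kernel code, and the only real-world
detail is that the two bit positions match x86's _PAGE_PRESENT and
_PAGE_ACCESSED.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; every name below is made up for illustration. */
#define STACK_PAGES   4                   /* 16k stack in 4k pages */
#define PTE_PRESENT   (UINT64_C(1) << 0)  /* bit position of x86 _PAGE_PRESENT */
#define PTE_ACCESSED  (UINT64_C(1) << 5)  /* bit position of x86 _PAGE_ACCESSED */

struct model_stack {
	uint64_t pte[STACK_PAGES];        /* pte[0] maps the page used first */
};

/* Map all four pages with the accessed bit clear. */
static void model_map_all(struct model_stack *s)
{
	for (int i = 0; i < STACK_PAGES; i++)
		s->pte[i] = PTE_PRESENT;
}

/* Model a stack access: set ACCESSED if mapped, otherwise report a "fault". */
static bool model_touch(struct model_stack *s, int idx)
{
	if (!(s->pte[idx] & PTE_PRESENT))
		return false;             /* not present: this would be the #PF */
	s->pte[idx] |= PTE_ACCESSED;
	return true;
}

/* After switch-out: unmap pages never accessed; return how many were reclaimed. */
static int model_reclaim_untouched(struct model_stack *s)
{
	int reclaimed = 0;

	for (int i = 0; i < STACK_PAGES; i++) {
		if ((s->pte[i] & PTE_PRESENT) && !(s->pte[i] & PTE_ACCESSED)) {
			s->pte[i] &= ~PTE_PRESENT;
			reclaimed++;
		}
	}
	return reclaimed;
}
```

In this model, a task that touches only pte[0] before switching out
loses the other three pages. A later model_touch() on pte[1] then
returns false unless something has mapped the page again first.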

> For that, the worst case is that you go to schedule a new thread and
> can't allocate memory to fill in the 4 pte_t's. You can't run it until you
> or some other CPU goes and does some reclaim.
>
> Needing memory in the middle of schedule() is generally a no-go. But it's
> a lot better than not being able to continue _execution_ of a kernel
> thread at *ALL*, possibly in a non-preemptible context, like when you do
> it in a #PF.
>
> Basically, I think there's a way to do this that limits the kernel blast
> radius to _mostly_ being a core mm problem.
>
> What else has been considered before the #PF-based mechanism?

The only other way to know on-demand when to increase the stack size
is through stack probing, which I've ruled out without further
consideration due to performance.

Then there is a class of solutions that explicitly grow the stack for,
or run on a larger stack, certain code paths. Though instrumentation
may help, others have described it as playing whack-a-mole.

Then there are solutions that use a shared pool of kernel stacks,
blocking userspace until one becomes available. Very disruptive.

I personally haven't explored any of these in great depth.

To me, handling this on-demand in #PF, though technically challenging,
offered (1) the most memory savings, (2) the least disruption to
userspace, and (3) (ironically, expected to be) the most maintainable,
general solution with the least perf impact.

Happy to consider other ideas, and again, I appreciate your time and thoughts.

Best,
Zach