Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: David Stevens

Date: Sat Jun 20 2026 - 01:26:00 EST


On Thu, Jun 18, 2026 at 5:29 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 4/24/26 15:26, David Laight wrote:
> >> This is true until, in a fleet of millions of machines, you encounter a
> >> one-in-a-billion chance of a stack overflow. You are then forced to
> >> double the statically allocated kernel stacks on every machine, paying a
> >> memory tax even though 99.999..% of threads never exceed 4K. This
> >> overhead accumulates to petabytes of wasted capacity.
> > And then you hit a stack fault in some path where you can't sleep and
> > there isn't any available kernel memory.
> >
> > An alternative idea is to arrange for some system calls to sleep in
> > userspace, so when the thread is woken it re-executes the system call.
> > It then makes sense to assign the kernel stack to the process when
> > it enters the kernel.
>
> There are probably other ways to do this without handling exceptions.
>
> For instance, let's say you always *map* 16k of stack for each process.
> But, after context switching out, you take a look at 4x8b pte_t's that
> were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
> can just clear _PAGE_PRESENT and reclaim the page.
>
> If you don't want the overhead in the normal context switch path, you
> reclaim in a shrinker, at the cost of needing locking to coordinate with
> the scheduler.

My understanding is that speculative execution can fill the TLB, but
won't set the accessed bits. Speculative execution of a function call could
definitely put an apparently unused stack page into the TLB. In
theory, I don't see anything preventing one CPU from speculatively
accessing memory from another CPU's current stack. You definitely
wouldn't want to do TLB shootdowns in the context switch path, so this
would require a shrinker. I guess if you're batching shootdowns in a
shrinker, it's probably not more expensive than swap on a
per-page-freed basis.
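
To make the shrinker variant concrete, here is a rough userspace model
of it. This is a sketch under assumptions, not kernel code: the struct
and function names are hypothetical, the bit positions only mirror the
x86 _PAGE_PRESENT and _PAGE_ACCESSED layout, and the "flush" is just a
counter standing in for a batched TLB shootdown.

```c
/* Userspace model only. All names below are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_PAGES   4          /* 16k of stack mapped as 4k pages */
#define PTE_PRESENT   (1u << 0)  /* models _PAGE_PRESENT */
#define PTE_ACCESSED  (1u << 5)  /* models _PAGE_ACCESSED */

struct stack_model {
	uint64_t pte[STACK_PAGES];   /* the 4 x 8-byte PTEs mapping one stack */
};

/*
 * Walk the PTEs of a thread that is not running. Unmap every page that
 * is present but was never accessed. Sets *need_flush when at least one
 * mapping was removed, because a stale (possibly speculative) TLB entry
 * could still refer to it.
 */
static size_t reclaim_untouched(struct stack_model *s, bool *need_flush)
{
	size_t freed = 0;

	for (size_t i = 0; i < STACK_PAGES; i++) {
		uint64_t pte = s->pte[i];

		/* Skip pages that are already gone or that the thread used. */
		if (!(pte & PTE_PRESENT) || (pte & PTE_ACCESSED))
			continue;
		s->pte[i] = 0;
		freed++;
	}
	*need_flush = freed > 0;
	return freed;
}

/* Shrinker-style pass: scan many stacks, then count one batched flush. */
static size_t shrink_stacks(struct stack_model *stacks, size_t n,
			    unsigned int *flushes)
{
	size_t total = 0;
	bool any = false;

	for (size_t i = 0; i < n; i++) {
		bool need_flush;

		total += reclaim_untouched(&stacks[i], &need_flush);
		any |= need_flush;
	}
	if (any)
		(*flushes)++;   /* stand-in for a single batched shootdown */
	return total;
}
```

The point of the model is only that the flush cost is paid once per
pass rather than once per page, which is what makes the per-page-freed
cost comparable to swap.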

> A simple rule would be: a thread that ever accesses a page gets to keep
> it forever. They're never reclaimed after being accessed, only before.
>
> For that, the worst case is that you go to schedule a new thread and
> can't allocate memory to fill in the 4 pte_t's. You can't run it until you
> or some other CPU goes and does some reclaim.
>
> Needing memory in the middle of schedule() is generally a no-go. But it's
> a lot better than not being able to continue _execution_ of a kernel
> thread at *ALL*, possibly in a non-preemptible context, like when you do
> it in a #PF.

I don't think this is different from the current proposal from a
memory allocation standpoint. Both proposals effectively maintain a
pool of preallocated pages used to fill the current thread's stack.
They vary substantially in when the pages are put into the page
tables, but both need to allocate during schedule().
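
As a rough illustration of that claim, here is a userspace model of
such a pool. It is a sketch under assumptions, not code from either
proposal: POOL_TARGET, the struct, and the function names are all
hypothetical, and alloc_page_model() is only a stand-in for the page
allocator.

```c
/* Userspace model only. All names below are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

#define POOL_TARGET 3   /* assumed: pages needed to grow a 4k stack to 16k */

struct page_pool {
	size_t avail;       /* preallocated pages ready to back a stack */
};

/* Stand-in for the page allocator; *free_pages models free memory. */
static bool alloc_page_model(size_t *free_pages)
{
	if (*free_pages == 0)
		return false;
	(*free_pages)--;
	return true;
}

/*
 * Called from the modelled schedule(): top the pool up before the next
 * thread runs. Returns false when memory ran out, meaning the thread
 * cannot safely run until some reclaim happens.
 */
static bool refill_pool(struct page_pool *p, size_t *free_pages)
{
	while (p->avail < POOL_TARGET) {
		if (!alloc_page_model(free_pages))
			return false;
		p->avail++;
	}
	return true;
}

/*
 * Consume pages from the pool to back the stack. In this model the two
 * schemes differ only in when this is called: up front when filling in
 * the PTEs, or later from the fault path.
 */
static bool take_pages(struct page_pool *p, size_t n)
{
	if (p->avail < n)
		return false;
	p->avail -= n;
	return true;
}
```

In both cases the only step that can fail for lack of memory is
refill_pool(), and it sits in the schedule() path.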

-David

> Basically, I think there's a way to do this that limits the kernel blast
> radius to _mostly_ being a core mm problem.
>
> What else has been considered before the #PF-based mechanism?