Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: Zach O'Keefe

Date: Mon Jun 22 2026 - 19:01:07 EST


On Sat, Jun 20, 2026 at 4:34 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:

Thomas, thanks for taking the time, as always, for such a thoughtful response.

> On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
> > On Fri, Jun 19, 2026 at 2:59 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
> >> The #PF path is considered performance critical. But how much the
> >> downgrade matters needs actual numbers to analyze under various workload
> >> scenarios.
> >
> > Ya, that's my concern as well, as I don't have a good intuition for
> > how perf critical kernel #PF is for real workloads. If this is your
> > primary concern, I'll take that as a _good_ thing; i.e. there's
> > nothing architecturally stopping us from doing this downgrade safely.
> > We'll still need the analysis, but that can be a later stage -- we're
> > more than happy to get this data for all.
>
> No. That's not a later stage optional requirement.
>
> You have a PoC which works for you otherwise you wouldn't have posted
> it. So you can trivially microbenchmark the costs of the
> up/downgrade. And that's critical information for us but also for
> you. If the costs are significant then you really have to think about
> the tradeoffs.
>
> Care to read Documentation/process/* carefully? It applies to you as it
> applies to anyone else.
>
> >> I've not seen numbers to that effect anywhere. The only numbers provided
> >> are marketing material about the memory savings on a freshly booted idle
> >> machine. There are _zero_ numbers about the actual real world savings,
> >> but claims about the PETABYTE savings possible.
> >>
> >> Seriously?
> >
> > This is actually the most understood aspect. With O(100B) active tasks
> > fleetwide at any point, it only takes an average savings of O(10KiB)
> > per task to get to 1PiB. At least for our fleet, we know the % of
> > tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
> > math confirms that we expect O(PiB) aggregate savings. The % of stacks
> > requiring the full 16KiB is minuscule, but it still occurs at a rate
> > higher than what we can tolerate for SO panics. Given the vast
> > majority of stacks never exceed the first 4KiB, this enables the
> > significant opportunity.
>
> I know that the potential savings are well understood and my
> understanding of math is sufficient to calculate how many tasks and what
> average saving it takes to save 1PiB on a fleet.
>
> That's a no-brainer, but this is an aggregate saving, which sounds WOW
> but does not tell much about anything else.
>
> 1) What's the actual percentage of savings in relation to the overall
> memory?
>
> 2) Does the saving allow you to get more stuff done on a machine, pack
> more threads on it?
>
> 3) Can you actually downsize the memory on the machines?
>
> 4) What is the performance tradeoff for that?
>
> IOW, you fail to tell what the actual benefit of such an intrusive
> change is. Just boasting an aggregate Petabyte number does not tell
> anything at all.
>
> Let me give you a trivial example with a scenario which I have access
> to:
>
> 256 CPUs
> 256 GiB Memory
> 64k Threads
>
> Let's assume the full saving of 12k per thread. That sums up to
>
> 64k * 12k = 768MB of memory
>
> which is 0.29% of the total 256 GiB of memory. Not so impressive as the
> petabyte aggregate number, right?
>
> The workload consumes about 80% of the overall memory and is already
> constrained at close to 100% CPU utilization.
>
> Now let's assume that the runtime overhead of this amounts to 1% then
> this is a net loss.
>
> Let me turn that around and use a made up example assuming the 1Mio
> threads per compute unit taken from some reply in this thread.
>
> Now the full saving of 12k per thread amounts to:
>
> 1M * 12k = 12G
>
> which is 4.7% of the overall available memory. Agreed that's a
> substantial number.
>
> That 12G saving does not do anything in terms of hardware downsizing.
>
> The only way that has a benefit is when the system is constrained by
> overall memory consumption, but has quite some compute capacity left.
>
> IOW, if 1M threads hit the memory limit that means that the savings in
> kernel stack consumed memory allows you to add about 4% (~40k) more
> threads. If that ups the CPU utilization accordingly then yes, I can see
> the benefit. But TBH, if that's the case then you are trying to fix a
> user space implementation problem in the kernel.
>
> That said you really have to describe the scenarios where there is a
> benefit and I do not buy this "fleet level" argument at all because
> there is no single fleet which has a uniform workload distribution.
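
The two per-machine scenarios quoted above can be reproduced with a short
sketch. The assumptions here: "k" and "M" are read as binary multiples
(64k = 65536, 1M = 1048576), the 1M-thread machine is taken to have the same
256 GiB of memory (implied by the 4.7% figure), and every thread gets the
full 12 KiB saving. The function name is hypothetical.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def stack_saving(threads, saving_per_thread_kib, total_mem_gib):
    """Return (bytes saved, fraction of total memory) for one machine."""
    saved = threads * saving_per_thread_kib * KIB
    return saved, saved / (total_mem_gib * GIB)


if __name__ == "__main__":
    scenarios = (
        ("64k threads", 64 * 1024),
        ("1M threads", 1024 * 1024),
    )
    for label, threads in scenarios:
        # Full 12 KiB saving per thread on a 256 GiB machine (assumed).
        saved, fraction = stack_saving(threads, 12, 256)
        print(f"{label}: {saved / MIB:.0f} MiB saved, {fraction:.2%} of memory")
```

Under those assumptions this gives 768 MiB (about 0.29%) for the 64k-thread
case and 12 GiB (about 4.7%) for the 1M-thread case, matching the numbers in
the quoted text. It says nothing about the runtime overhead side of the
tradeoff, which still has to be measured.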

These are good thoughts, thank you. Perhaps I've been too biased by
our particular environment -- apologies for that.

We (mostly) punt this problem to cluster-level scheduling, which
ironically exploits this non-uniformity of workload dynamics to
appropriately bin-pack machines and materialize these small savings.

In the general case, I guess a lot hinges on that overhead cost -- even
in the best (memory-constrained) case.

> Aside of that. If your argument holds that there are only a few
> scenarios which require a deep stack, then we are better off to identify
> them and fix them up rather than trying to hack around the occasional
> insanity of deep stack usage by adding complexity for complexity's sake.
>
> As you say that you have numbers of your fleet which confirm that the
> vast majority of the stack depth is below 4k, you can surely figure out
> the information which call chains are actually exceeding the limit.
>
> I prefer to fix such shitty code and downgrade the stacksize in general
> instead of papering over the underlying issues which probably have been
> ignored for years if not decades.
>
> Have you ever thought about that instead of adding complexity with a
> dubious value?

Yeah, and that is certainly an alternative path we can explore. I was
_hoping_ to be able to maximize the savings here, via the >90% case
where 4KiB is sufficient. If we instead play whack-a-mole with the
handful of cases requiring > 8KiB, the best case is we can move back
to 8KiB stacks. Ironically, I was thinking going this route would be
_more_ of a maintenance burden vs having a generalized solution ;)

To Yourself, Dave, Peter --

I do appreciate all your thoughts and assistance. I'll likely take a
few days to collect my thoughts, and take a harder look at some
alternative paths.

Thanks again, and have a great day,
Zach

> Thanks,
>
> tglx
>
>