Re: [PATCH v2 00/13] Dynamic Kernel Stacks

From: David Hildenbrand (Arm)

Date: Tue Jun 23 2026 - 03:50:56 EST

On 6/23/26 01:00, Zach O'Keefe wrote:
> On Sat, Jun 20, 2026 at 4:34 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>
> Thomas, thanks for taking the time, as always, for such a thoughtful response.
>
>> On Sat, Jun 20 2026 at 12:33, Zach O'Keefe wrote:
>>>
>>> Ya, that's my concern as well, as I don't have a good intuition for
>>> how perf critical kernel #PF is for real workloads. If this is your
>>> primary concern, I'll take that as a _good_ thing ; i.e. there's
>>> nothing architecturally stopping us from doing this downgrade safely.
>>> We'll still need the analysis, but that can be a later stage -- we're
>>> more than happy to get this data for all.
>>
>> No. That's not a later stage optional requirement.
>>
>> You have a PoC which works for you otherwise you wouldn't have posted
>> it. So you can trivially microbenchmark the costs of the
>> up/downgrade. And that's critical information for us but also for
>> you. If the costs are significant then you really have to think about
>> the tradeoffs.
>>
>> Care to read Documentation/process/* carefully? It applies to you as it
>> applies to anyone else.
>>
>>>
>>> This is actually the most understood aspect. With O(100B) active tasks
>>> fleetwide at any point, it only takes an average savings of O(10KiB)
>>> per task to get to 1PiB. At least for our fleet, we know the % of
>>> tasks that use only 4KiB, 8KiB, or require the full 16KiB, and the
>>> math confirms that we expect O(PiB) aggregate savings. The % of stacks
>>> requiring the full 16KiB is minuscule, but it still occurs at a rate
>>> higher than what we can tolerate for SO panics. Given the vast
>>> majority of stacks never exceed the first 4KiB, this enables the
>>> significant opportunity.
>>
>> I know that the potential savings are well understood and my
>> understanding of math is sufficient to calculate how much tasks and
>> average saving it takes to save 1PiB on a fleet.
>>
>> That's a no-brainer, but this is an aggregate saving, which sounds WOW
>> but does not tell much about anything else.
>>
>> 1) What's the actual percentage of savings in relation to the overall
>> memory?
>>
>> 2) Does the saving allow you to get more stuff done on a machine, pack
>> more threads on it?
>>
>> 3) Can you actually downsize the memory on the machines?
>>
>> 4) What is the performance tradeoff for that?
>>
>> IOW, you fail to tell what the actual benefit of such an intrusive
>> change is. Just boasting an aggregate Petabyte number does not tell
>> anything at all.
>>
>> Let me give you a trivial example with a scenario which I have access
>> to:
>>
>> 256 CPUs
>> 256 GiB Memory
>> 64k Threads
>>
>> Let's assume the full saving of 12k per thread. That sums up to
>>
>> 64k * 12k = 768MB of memory
>>
>> which is 0.29% of the total 256 GiB of memory. Not so impressive as the
>> petabyte aggregate number, right?
>>
>> The workload consumes about 80% of the overall memory and is already
>> constraint on close to 100% CPU utilization.
>>
>> Now let's assume that the runtime overhead of this amounts to 1% then
>> this is a net loss.
>>
>> Let me turn that around and use a made up example assuming the 1Mio
>> threads per compute unit taken from some reply in this thread.
>>
>> Now the full saving of 12k per thread amounts to:
>>
>> 1M * 12k = 12G
>>
>> which is 4.7% of the overall available memory. Agreed that's a
>> substantial number.
>>
>> That 12G saving does not do anything in terms of hardware downsizing.
>>
>> The only way that has a benefit is when the system is constraint by
>> overall memory consumption, but has quite some compute capacity left.
>>
>> IOW, if 1M threads hit the memory limit that means that the savings in
>> kernel stack consumed memory allows you to add about 4% (~40k) more
>> threads. If that ups the CPU utilization accordingly then yes, I can see
>> the benefit. But TBH, if that's the case then you are trying to fix a
>> user space implementation problem in the kernel.
>>
>> That said you really have to describe the scenarios where there is a
>> benefit and I do not buy this "fleet level" argument at all because
>> there is no single fleet which has a uniform workload distribution.
>
> These are good thoughts, thank you. Perhaps I've been too biased by
> our particular environment—apologies for that.
>
> We (mostly) punt this problem to cluster-level scheduling, which
> ironically exploits this non-uniformity of workload dynamics to
> appropriately bin-pack machines and materialize these small savings.
>
> In the general case, I guess a lot hinges on that overhead cost -- in
> the best (memory-constrained) case.
>
>> Aside of that. If your argument holds that there are only a few
>> scenarios which require a deep stack, then we are better off to identify
>> them and fix them up rather than trying to hack around the occacional
>> insanity of deep stack usage by adding complexity for complexity sake.
>>
>> As you say that you have numbers of your fleet which confirm that the
>> vast majority of the stack depth is below 4k, you can surely figure out
>> the information which call chains are actually exceeding the limit.
>>
>> I prefer to fix such shitty code and downgrade the stacksize in general
>> instead of papering over the underlying issues which probably have been
>> ignored for years if not decades.
>>
>> Have you ever thought about that instead of adding complexity with a
>> dubious value?

There was some (hallway?) talk at LSF/MM about possibly removing direct reclaim,
similar to how other operating systems handle it. Now, I don't know how feasible
it is (I guess devil is in the detail ;) ), or any details how that would work,
but direct reclaim was repeatedly called out as one of the main reasons we can
get huge stacks.

So I guess direct reclaim (incl. compaction) is one of the main problematic
pieces. Are we aware of other scenarios where we (easily) trigger consumption of
larger stacks?

Wild idea: as a first step to test the waters, use smaller stacks on selected
kernel threads and disallow direct reclaim/compaction if the stack for the
thread is small?

--
Cheers,

David