Re: [RFC 00/14] Dynamic Kernel Stacks

From: Pasha Tatashin
Date: Mon Mar 18 2024 - 11:14:19 EST


On Mon, Mar 18, 2024 at 11:09 AM Pasha Tatashin
<pasha.tatashin@xxxxxxxxxx> wrote:
>
> > On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Pasha Tatashin
> > > Sent: 16 March 2024 19:18
> > ...
> > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack. (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> >
> > Isn't that entirely horrid for TLB use and so will require a lot of IPI?
>
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

The TLB miss rate may increase, but only very slightly: the stacks are
small, four pages with only three dynamic pages, so that is at most 2-3
new misses per syscall, and only for the complicated, deep syscalls. I
therefore suspect it won't affect real-world performance.

> > Remember, if a thread sleeps in 'extra stack' and is then rescheduled
> > on a different cpu the extra pages get 'pumped' from one cpu to
> > another.
>
> Yes, the per-CPU cache can become unbalanced this way; we can remember
> the original CPU the pages were acquired from and return them to the
> same place.
>
> > I also suspect a stack_probe() is likely to end up being a cache miss
> > and also slow???
>
> Could you please elaborate on this point? I am not familiar with
> stack_probe() or how it is used.
>
> > So you wouldn't want one on all calls.
> > I'm not sure you'd want a conditional branch either.
> >
> > The explicit request for 'more stack' can be required to be allowed
> > to sleep - removing a lot of issues.
> > It would also be portable to all architectures.
> > I'd also suspect that any thread that needs extra stack is likely
> > to need to again.
> > So while the memory could be recovered, I'd bet it isn't worth
> > doing except under memory pressure.
> > The call could also return 'no' - perhaps useful for (broken) code
> > that insists on being recursive.
>
> The current approach under discussion is somewhat different from an
> explicit more-stack request API. I am investigating how feasible it is
> to multiplex kernel stack pages, so the same pages can be reused by
> many threads as they are actually needed. If the multiplexing approach
> doesn't work out, I will come back to the explicit more-stack API.
>