Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)

From: Andy Lutomirski
Date: Tue Jun 21 2016 - 12:46:48 EST

On Mon, Jun 20, 2016 at 9:01 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
> I really think that problem needs to be fixed before this should be merged.
> The easy fix may be to just have a very limited re-use of these stacks
> in generic code, rather than try to do anything fancy with multi-page
> allocations. Just a few of these allocations held in reserve (perhaps
> make the allocations percpu to avoid new locks).
> It won't help for a thundering herd problem where you start tons of
> new threads, but those don't tend to be short-lived ones anyway. In
> contrast, I think one common case is the "run shell scripts" that runs
> tons and tons of short-lived processes, and having a small "stack of
> stacks" would probably catch that case very nicely. Even a
> single-entry cache might be ok, but I see no reason to not make it be
> perhaps three or four stacks per CPU.
> Make the "thread create/exit" sequence go really fast by avoiding the
> allocation/deallocation, and hopefully catching a hot cache and TLB
> line too.

To put the numbers in perspective: we'll pay the 1.5µs every time we
do any kind of clone(), but I think that many of the interesting cases
may be so far dominated by other costs that this is lost in the noise.
For scripts, execve() and all the dynamic linking overhead is so much
larger that no one will ever notice this:

time for i in `seq 1000`; do /bin/true; done

real 0m2.641s
user 0m0.058s
sys 0m0.107s

That's over 2ms per /bin/true invocation, so we're talking about less
than a 0.1% slowdown. For fork() (i.e. !CLONE_VM), we'll have the
full cost of copying the mm. And for anything with a thundering herd,
there will be lots of context switches, and just the context switches
are likely to swamp the task creation time.
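
For anyone who wants to check the arithmetic behind the "less than 0.1%"
figure, here is a small sketch. The helper names are made up for
illustration; the inputs are the 2.641s wall-clock time for 1000 runs
and the 1.5µs overhead quoted above.

```c
#include <assert.h>

/* Average wall-clock time per process, in microseconds. */
static double per_invocation_us(double total_seconds, int runs)
{
    assert(runs > 0);
    return total_seconds * 1e6 / runs;
}

/* Added cost expressed as a percentage of the per-invocation time. */
static double slowdown_percent(double overhead_us, double invocation_us)
{
    assert(invocation_us > 0.0);
    return 100.0 * overhead_us / invocation_us;
}
```

With total_seconds = 2.641 and runs = 1000, the per-invocation time is
about 2641µs, and 1.5µs on top of that is roughly 0.057%.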

On the flip side, on workloads where higher-order page allocation
requires any sort of compaction, using vmalloc should be much faster.

So I'm leaning toward fewer cache entries per cpu, maybe just one.
I'm all for making it a bit faster, but I think we should weigh that
against increasing memory usage too much and thus scaring away the
embedded folks.
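
To make the "small cache of stacks" idea concrete, here is a rough
userspace sketch, not code from the patch series. Everything in it is
an assumption for illustration: the names (stack_cache, stack_alloc,
stack_free), the 16k STACK_SIZE, and malloc()/free() standing in for
the slow vmalloc path. A real version would keep one such cache per CPU
and would need to deal with preemption; this only shows the reuse logic
with a single slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_SIZE  (16 * 1024) /* assumed stack size, for illustration */
#define CACHE_SLOTS 1           /* "maybe just one" entry per cache */

/* Hypothetical cache; a kernel version would have one of these per CPU. */
struct stack_cache {
    void *slot[CACHE_SLOTS];
};

/* Reuse a cached stack if one is available, else take the slow path. */
static void *stack_alloc(struct stack_cache *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i]) {
            void *stack = c->slot[i];
            c->slot[i] = NULL;
            return stack;
        }
    }
    return malloc(STACK_SIZE); /* stand-in for the vmalloc allocation */
}

/* Park the stack in an empty slot; free it only if the cache is full. */
static void stack_free(struct stack_cache *c, void *stack)
{
    assert(c != NULL);
    if (!stack)
        return;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (!c->slot[i]) {
            c->slot[i] = stack;
            return;
        }
    }
    free(stack);
}
```

With CACHE_SLOTS set to 1, a create/exit/create sequence gets the same
stack back without touching the allocator, and the memory held in
reserve is bounded at one stack per cache.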