Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)

From: Andy Lutomirski
Date: Wed Jun 22 2016 - 21:22:41 EST


On Mon, Jun 20, 2016 at 9:01 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
>
> I really think that problem needs to be fixed before this should be merged.
>
> The easy fix may be to just have a very limited re-use of these stacks
> in generic code, rather than try to do anything fancy with multi-page
> allocations. Just a few of these allocations held in reserve (perhaps
> make the allocations percpu to avoid new locks).

I implemented a percpu cache, and it's useless.

When a task goes away, one reference is held until the next RCU grace
period so that task_struct can be used under RCU (look for
delayed_put_task_struct). This means that free_task gets called in
giant batches under heavy clone() load, which is the only time that
any of this matters. So we only get to refill the cache once per RCU
batch, and there's very little benefit.
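
To illustrate the batching effect, here is a toy userspace model. It is
not kernel code and not what the patch does: CACHE_SLOTS, simulate() and
the free_batch parameter are all made up for the example, and it assumes
one stack per task and a single CPU. free_batch == 1 models stacks being
freed as soon as the task exits; a large free_batch models frees that
only happen once per grace period.

```c
#include <stddef.h>

/* Hypothetical per-CPU cache size; the real number is an assumption. */
#define CACHE_SLOTS 2

/*
 * Toy model: run nr_tasks create/exit cycles on one CPU.
 *
 * Stacks of exited tasks are only released once free_batch exits have
 * piled up. At that point at most CACHE_SLOTS stacks fit in the cache
 * and the rest go back to the allocator.
 *
 * Returns how many task creations were served from the cache.
 */
static unsigned long simulate(unsigned long nr_tasks, unsigned long free_batch)
{
	unsigned long cached = 0;	/* stacks currently in the cache */
	unsigned long pending = 0;	/* exited tasks awaiting their free */
	unsigned long hits = 0;
	unsigned long i;

	for (i = 0; i < nr_tasks; i++) {
		/* Task creation: take a cached stack if there is one. */
		if (cached > 0) {
			cached--;
			hits++;
		}

		/* Task exit: the stack is not freed yet, only queued. */
		pending++;

		/* The whole batch is freed at once. */
		if (pending == free_batch) {
			cached += pending;
			if (cached > CACHE_SLOTS)
				cached = CACHE_SLOTS;	/* overflow is dropped */
			pending = 0;
		}
	}

	return hits;
}
```

In this model, simulate(1000, 1) gets 999 cache hits, while
simulate(1000, 100) gets 18: each batch refills the two slots once and
the other 98 stacks in the batch bypass the cache entirely.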

Once thread_info stops living in the stack, we could, in principle,
exempt the stack itself from RCU protection, thus saving a bit of
memory under load and making the cache work. I've started working on
(optionally, per-arch) getting rid of on-stack thread_info, but that's
not ready yet.

FWIW, the same issue quite possibly hurts non-vmap-stack performance
as well, as it makes it much less likely that a cache-hot stack gets
immediately reused under heavy fork load.

So may I skip this for now? I think that the performance hit is
unlikely to matter on most workloads, and I also expect the speedup
from not using higher-order allocations to be a decent win on some
workloads.

--Andy