Re: [RFC 2/2] x86_64: expand kernel stack to 16K

From: Jens Axboe
Date: Fri May 30 2014 - 22:07:12 EST

Next message: Sune Mølgaard: "[RFC] Summarizing deprecations"
Previous message: Steven Rostedt: "Re: [patch 4/6] rtmutex: Confine deadlock logic to futex"
In reply to: Linus Torvalds: "Re: [RFC 2/2] x86_64: expand kernel stack to 16K"
Next in thread: Minchan Kim: "Re: [RFC 2/2] x86_64: expand kernel stack to 16K"

On 2014-05-28 20:42, Linus Torvalds wrote:

>> Regardless of whether it is swap or something external queues the
>> bio on the plug, perhaps we should look at why it's done inline
>> rather than by kblockd, where it was moved because it was blowing
>> the stack from schedule():
>
> So it sounds like we need to do this for io_schedule() too.
>
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.

We definitely need to auto-unplug on the schedule path, otherwise we run into all sorts of trouble. But making it async off the IO schedule path is fine. By definition, it's not latency sensitive if we are hitting unplug on schedule. I'm pretty sure it was run inline because of CPU concerns, since running inline is certainly cheaper than punting to kblockd.
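
To make the inline vs. async distinction concrete, here is a rough user-space sketch, not the actual block layer code. Every name in it (toy_plug, toy_worker, toy_flush_plug and so on) is made up for illustration, and toy_worker only stands in for kblockd. The point is just that a flush coming from the schedule path hands the requests off instead of issuing them on the caller's stack:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

/* Hypothetical stand-in for a per-task plug holding queued requests. */
struct toy_plug {
	int pending[TOY_MAX_PENDING];
	size_t count;
};

/* Hypothetical stand-in for a kblockd-style worker queue. */
struct toy_worker {
	int deferred[TOY_MAX_PENDING];
	size_t count;
};

/* Counts where each request ended up being issued. */
struct toy_stats {
	int issued_inline;   /* issued in the caller's context */
	int issued_deferred; /* issued later by the worker */
};

/* Queue a request on the plug; returns false if the plug is full. */
static bool toy_plug_add(struct toy_plug *plug, int req)
{
	if (plug->count == TOY_MAX_PENDING)
		return false;
	plug->pending[plug->count++] = req;
	return true;
}

/*
 * Empty the plug. When from_schedule is true, requests are handed to
 * the worker so the caller's stack does not grow further. Otherwise
 * (or if the worker queue is full) they are issued inline.
 */
static void toy_flush_plug(struct toy_plug *plug, struct toy_worker *worker,
			   struct toy_stats *stats, bool from_schedule)
{
	for (size_t i = 0; i < plug->count; i++) {
		if (from_schedule && worker->count < TOY_MAX_PENDING)
			worker->deferred[worker->count++] = plug->pending[i];
		else
			stats->issued_inline++;
	}
	plug->count = 0;
}

/* Run everything that was deferred, from the worker's own context. */
static void toy_worker_run(struct toy_worker *worker, struct toy_stats *stats)
{
	stats->issued_deferred += (int)worker->count;
	worker->count = 0;
}
```

The tradeoff is the one mentioned above: the deferred path costs an extra handoff, but nothing more is stacked on top of whoever called schedule.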

> Looking at that callchain, I have to say that ext4 doesn't look
> horrible compared to the whole block layer and virtio.. Yes,
> "ext4_writepages()" is using almost 400 bytes of stack, and most of
> that seems to be due to:
>
>         struct mpage_da_data mpd;
>         struct blk_plug plug;

Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic part of it, since that's never caught any bugs. Only saves 8 bytes, but may as well take that. Especially if we end up with nested plugs.
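
For a feel of where the 8 bytes come from, here is a small sketch using stand-in types. The layouts are assumptions for illustration only, not the real struct blk_plug definition, and all the toy_ names are hypothetical. On a typical 64-bit build, dropping one leading unsigned long saves one word per plug, and the saving adds up for each nested plug on the stack:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a kernel-style doubly linked list head. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

/* Assumed layout: a magic word in front of the list heads. */
struct toy_plug_with_magic {
	unsigned long magic;
	struct toy_list_head list;
	struct toy_list_head cb_list;
};

/* Same layout with the magic word removed. */
struct toy_plug_no_magic {
	struct toy_list_head list;
	struct toy_list_head cb_list;
};

/* Bytes of stack saved per plug by dropping the magic word. */
static size_t toy_plug_savings(void)
{
	return sizeof(struct toy_plug_with_magic) -
	       sizeof(struct toy_plug_no_magic);
}

/* Total bytes saved when 'depth' plugs are live on one call chain. */
static size_t toy_nested_savings(size_t depth)
{
	return depth * toy_plug_savings();
}
```

On an LP64 target toy_plug_savings() should come out as sizeof(unsigned long). Other ABIs may pad differently, so treat the exact number as platform dependent.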

> Well, we've definitely had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is likely worse in this respect too.

I don't think blk-mq is worse than the older stack; in fact it should be better. The call chains are shorter, and there's a lot less cruft on the stack. Historically the stack issues have been with nested devices, however. And for sync IO, we do run it inline, so if the driver chews up a lot of stack, well...

Looks like I'm late here and the decision has been made to go with 16K stacks, which I think is a good one. We've been living on the edge (and sometimes over) for heavy dm/md setups for a while, and have been patching around that fact in the IO stack for years.

--
Jens Axboe

